TR2016-080

A Multi-Stream Bi-Directional Recurrent Neural Network for Fine-Grained Action Detection


Abstract:

We present a multi-stream bi-directional recurrent neural network for fine-grained action detection. Recently, two-stream convolutional neural networks (CNNs) trained on stacked optical flow and image frames have been successful for action recognition in videos. Our system uses a tracking algorithm to locate a bounding box around the person, which provides a frame of reference for appearance and motion and also suppresses background noise that is not within the bounding box. We train two additional streams on motion and appearance cropped to the tracked bounding box, along with full-frame streams. Our motion streams use pixel trajectories of a frame as raw features, in which the displacement values corresponding to a moving scene point are at the same spatial position across several frames. To model long-term temporal dynamics within and between actions, the multi-stream CNN is followed by a bi-directional Long Short-Term Memory (LSTM) layer. We show that our bi-directional LSTM network utilizes about 8 seconds of the video sequence to predict an action label. We test on two action detection datasets: the MPII Cooking 2 Dataset, and a new Shopping Dataset that we introduce and make available to the community with this paper. The results demonstrate that our method significantly outperforms state-of-the-art action detection methods on both datasets.
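The core idea of the architecture above, per-frame features from multiple CNN streams concatenated and fed to a bi-directional recurrent layer so each frame's action score uses both past and future context, can be sketched in a few lines. The sketch below is a minimal illustration with a vanilla bi-directional RNN and random stand-in features and weights; the stream count, dimensions, and weight shapes are assumptions for illustration, not the paper's actual configuration (which uses LSTM cells and trained CNN features).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 streams of 64-d CNN features, 20 frames, 5 action classes.
T, n_streams, feat_dim, hid, n_classes = 20, 4, 64, 32, 5

# Per-frame features from the four streams (random stand-ins for CNN outputs).
stream_feats = rng.standard_normal((T, n_streams, feat_dim))
x = stream_feats.reshape(T, n_streams * feat_dim)  # concatenate streams per frame

# Randomly initialized weights for forward/backward recurrences and the classifier.
W_in_f = rng.standard_normal((hid, x.shape[1])) * 0.01
W_hh_f = rng.standard_normal((hid, hid)) * 0.01
W_in_b = rng.standard_normal((hid, x.shape[1])) * 0.01
W_hh_b = rng.standard_normal((hid, hid)) * 0.01
W_out = rng.standard_normal((n_classes, 2 * hid)) * 0.01

h_f = np.zeros((T, hid))  # hidden states from the forward pass over time
h_b = np.zeros((T, hid))  # hidden states from the backward pass over time
for t in range(T):
    prev = h_f[t - 1] if t > 0 else np.zeros(hid)
    h_f[t] = np.tanh(W_in_f @ x[t] + W_hh_f @ prev)
for t in reversed(range(T)):
    nxt = h_b[t + 1] if t < T - 1 else np.zeros(hid)
    h_b[t] = np.tanh(W_in_b @ x[t] + W_hh_b @ nxt)

# Per-frame action scores combine context from both temporal directions.
logits = np.concatenate([h_f, h_b], axis=1) @ W_out.T
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.shape)  # one action distribution per frame
```

Because the backward pass conditions each frame on future frames as well, the model can exploit a long temporal window (about 8 seconds in the paper's analysis) when labeling a single frame.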

 

  • Related News & Events

    •  NEWS   MERL researcher Tim Marks presents invited talk at University of Utah
      Date: April 10, 2017
      Where: University of Utah School of Computing
      MERL Contact: Tim Marks
      Research Area: Machine Learning
      Brief
      • MERL researcher Tim K. Marks presented an invited talk at the University of Utah School of Computing, entitled "Action Detection from Video and Robust Real-Time 2D Face Alignment."

Abstract: The first part of the talk describes our multi-stream bi-directional recurrent neural network for action detection from video. In addition to a two-stream convolutional neural network (CNN) on full-frame appearance (images) and motion (optical flow), our system trains two additional streams on appearance and motion that have been cropped to a bounding box from a person tracker. To model long-term temporal dynamics within and between actions, the multi-stream CNN is followed by a bi-directional Long Short-Term Memory (LSTM) layer. Our method outperforms the previous state of the art on two action detection datasets: the MPII Cooking 2 Dataset, and a new MERL Shopping Dataset that we have made available to the community. The second part of the talk describes our method for face alignment, which is the localization of a set of facial landmark points in a 2D image or video of a face. Face alignment is particularly challenging when there are large variations in pose (in-plane and out-of-plane rotations) and facial expression. To address this issue, we propose a cascade in which each stage consists of a Mixture of Invariant eXperts (MIX), where each expert learns a regression model that is specialized to a different subset of the joint space of pose and expression. We also present a method to include deformation constraints within the discriminative alignment framework, which makes the algorithm more robust. Our face alignment system outperforms previous results on standard datasets. The talk ends with a live demo of our face alignment system.
    •  NEWS   MERL presents three papers at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
      Date: June 27, 2016 - June 30, 2016
      Where: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV
      MERL Contacts: Michael Jones; Tim Marks
      Research Area: Machine Learning
      Brief
      • MERL researchers in the Computer Vision group presented three papers at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), which had a paper acceptance rate of 29.9%.