Juan Carlos Niebles - Research


The goal of my research is to enable computers and robots to perceive the visual world by developing novel computer vision algorithms for automatic analysis of images and videos. From the scientific point of view, we tackle fundamental open problems in computer vision research related to the visual recognition and understanding of human actions and activities, objects, scenes and events. From the application perspective, we develop systems that solve practical real-world problems by introducing cutting-edge computer vision technologies into new application domains.

List of publications

A complete list of publications to date can be found here.



ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding

In spite of many dataset efforts for human action recognition, current computer vision algorithms are still severely limited in the variability and complexity of the actions they can recognize. This is due in part to the simplicity of current benchmarks, which mostly focus on simple actions and movements occurring in manually trimmed videos. We introduce ActivityNet, a new large-scale video benchmark for human activity understanding. Our benchmark aims to cover a wide range of complex human activities that are of interest to people in their daily living. In its current version, ActivityNet provides samples from 203 activity classes with an average of 137 untrimmed videos per class and 1.41 activity instances per video, for a total of 849 video hours. We illustrate three scenarios in which ActivityNet can be used to compare algorithms for human activity understanding: global video classification, trimmed activity classification and activity detection. Visit activity-net.org for more information.
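In the activity detection scenario, predicted temporal segments are scored against ground-truth annotations, typically using temporal intersection-over-union as the overlap criterion. A minimal sketch of that criterion (the function name and the [start, end] segment convention are illustrative, not the benchmark's official evaluation code):

```python
def temporal_iou(pred, gt):
    """Temporal intersection-over-union between two [start, end] segments (in seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0
```

A detection is typically counted as correct when its IoU with a ground-truth instance exceeds a threshold such as 0.5.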

Related publications:

  • Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem and Juan Carlos Niebles. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA. 2015. Project Page PDF
  • Fabian Caba Heilbron and Juan Carlos Niebles. Collecting and Annotating Human Activities in Web Videos. ACM International Conference on Multimedia Retrieval (ICMR). Glasgow, Scotland. 2014. PDF

Recognition of Spatio-Temporally Composable Activities

We propose a framework to describe human activities in a hierarchical discriminative model that operates at three semantic levels, encoding poses, simple actions and their compositions into complex activities. Our human activity classifier simultaneously models which body parts are relevant to the action of interest as well as their appearance and composition using a discriminative approach. 
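The three-level idea can be sketched in code. This is a simplified illustration and not the paper's actual discriminative model: per-part pose scores feed per-frame action scores, and a fixed binary mask stands in for the learned selection of which body parts are relevant to the activity.

```python
import numpy as np

def activity_score(part_feats, part_relevance, w_pose, w_activity):
    """Three-level score sketch: the pose level scores each body part's
    features, a binary relevance mask selects which parts matter for this
    activity, and the activity level aggregates the selected scores over time.

    part_feats: (T, P, D) array of per-frame, per-body-part features.
    part_relevance: (P,) binary mask of relevant body parts.
    """
    pose_scores = part_feats @ w_pose                            # (T, P)
    action_scores = (pose_scores * part_relevance).sum(axis=1)   # (T,)
    return w_activity * action_scores.mean()                     # scalar
```

In the full model, the decomposition and the relevance of each body part are learned discriminatively rather than fixed in advance.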

Related publications:

  • Ivan Lillo, Juan Carlos Niebles, Alvaro Soto. Discriminative Hierarchical Modeling of Spatio-Temporally Composable Human Activities. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Columbus, OH. 2014. Project Page PDF Dataset

Spatio-Temporal Human-Object Interactions for Action Recognition in Videos

We propose a method for representing the dynamics of human-object interactions in videos. Previous algorithms tend to focus on modeling the spatial relationships between objects and actors, but ignore the evolving nature of this relationship through time. Our algorithm captures the dynamic nature of human-object interactions by modeling how these spatial patterns evolve over time. Our experiments show that encoding such temporal evolution is crucial for correctly discriminating human actions that involve similar objects and spatial human-object relationships, but differ only in the temporal aspect of the interaction, e.g. answering a phone versus dialing a phone.
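One simple way to encode such temporal evolution, sketched below under assumptions of our own (the function, box format, and chunking scheme are illustrative, not the paper's exact descriptor), is to compute the object's position relative to the actor in each frame and aggregate it within temporal chunks, so the final descriptor preserves how the interaction changes over time:

```python
import numpy as np

def interaction_descriptor(human_boxes, object_boxes, n_chunks=3):
    """Per-frame offset between object and human box centers, averaged within
    temporal chunks and concatenated. Boxes are [x1, y1, x2, y2] per frame."""
    def centers(boxes):
        b = np.asarray(boxes, dtype=float)
        return np.stack([(b[:, 0] + b[:, 2]) / 2, (b[:, 1] + b[:, 3]) / 2], axis=1)
    rel = centers(object_boxes) - centers(human_boxes)       # (T, 2)
    chunks = np.array_split(rel, n_chunks)                   # temporal decomposition
    return np.concatenate([c.mean(axis=0) for c in chunks])  # (2 * n_chunks,)
```

Because each chunk contributes its own slot in the descriptor, two interactions with identical overall spatial layout but opposite temporal orderings map to different vectors.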

Related publications:

  • Victor Escorcia and Juan Carlos Niebles. Spatio-Temporal Human-Object Interactions for Action Recognition in Videos. 1st Workshop on Understanding Human Activities: Interactions and Context, at ICCV. Sydney, Australia. 2013. Project Page PDF Video

Automatic Analysis of Activities in Construction Operations

We apply computer vision techniques to the problem of automatic analysis of construction performance. We are interested in detecting and localizing all construction operation entities such as workers and machinery, as well as estimating their activities from video. This technology will enable continuous improvement of construction operations by optimizing operation times and reducing their environmental impact.

Related publications:

  • Ardalan Khosrowpour, Juan Carlos Niebles and Mani Golparvar-Fard. Vision-based workface assessment using depth images for activity analysis of interior construction operations. Automation in Construction. 2014.
  • Ardalan Khosrowpour, Igor Fedorov, Aleksander Holynski, Juan Carlos Niebles and Mani Golparvar-Fard. Automated Worker Activity Analysis in Indoor Environments for Direct-Work Rate Improvement from Long Sequences of RGB-D Images. Construction Research Congress. 2014.
  • Mani Golparvar-Fard, Arsalan Heydarian and Juan Carlos Niebles. Vision-based action recognition of earthmoving equipment using spatio-temporal features and support vector machine classifiers. Advanced Engineering Informatics. 2013.
  • Milad Memarzadeh, Mani Golparvar-Fard and Juan Carlos Niebles. Automated 2D detection of construction equipment and workers from site video streams using histograms of oriented gradients and colors. Automation in Construction. 2013.
  • Victor Escorcia, Maria A. Dávila, Mani Golparvar-Fard, Juan Carlos Niebles. Automated Vision-based Recognition of Construction Worker Actions for Building Interior Construction Operations Using RGBD Cameras. Construction Research Congress. 2012.
  • Arsalan Heydarian, Mani Golparvar-Fard, Juan Carlos Niebles. Automated visual recognition of construction equipment actions using spatio-temporal features and multiple binary support vector machines. Construction Research Congress. 2012.
  • Milad Memarzadeh, Arsalan Heydarian, Mani Golparvar-Fard, Juan Carlos Niebles. Real-time and automated recognition and 2D tracking of construction workers and equipment from site video streams. Int. Workshop on Computing in Civil Engineering. 2012.

Modeling Temporal Structure of Simple Motion Segments for Activity Classification

We present a discriminative framework for modeling actions by exploiting the temporal structure of human activities. In our framework, we represent activities as temporal compositions of motion segments. We train a discriminative model that encodes a temporal decomposition of video sequences, together with appearance models for each motion segment. At recognition time, a query video is matched to the model using the learned appearances and motion segment decomposition. Classification is then based on how well the motion segment classifiers match the temporal segments of the query sequence.
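The matching step can be sketched as follows. This is a deliberately simplified illustration (fixed equal-length segments and plain linear scorers, rather than the learned decomposition of the actual model): each activity model holds one classifier per motion segment, a query video is split into contiguous temporal segments, and the activity whose segment classifiers best match wins.

```python
import numpy as np

def match_score(frame_feats, segment_classifiers):
    """Score a video against one activity model: split the frames into as many
    contiguous segments as the model has motion-segment classifiers, score each
    segment with its classifier, and sum the scores."""
    segments = np.array_split(np.asarray(frame_feats, dtype=float),
                              len(segment_classifiers))
    return sum(w @ seg.mean(axis=0) for w, seg in zip(segment_classifiers, segments))

def classify(frame_feats, models):
    """Pick the activity whose motion-segment decomposition matches best."""
    return max(models, key=lambda name: match_score(frame_feats, models[name]))
```

In the full framework, the segment boundaries are inferred per video rather than fixed, so the model can align its motion segments to the actual temporal structure of the query.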

Related publications:

  • Juan Carlos Niebles, Chih-Wei Chen and Li Fei-Fei. Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification. European Conference on Computer Vision (ECCV). Crete, Greece. 2010.

Extracting Human Motion Volumes from Videos

We propose a fully automatic framework to detect and extract arbitrary human motion volumes from real-world videos collected from the Internet. We demonstrate the success of this framework both quantitatively and qualitatively on a number of videos downloaded from YouTube.

Related publications:

  • Juan Carlos Niebles, Bohyung Han and Li Fei-Fei. Efficient Extraction of Human Motion Volumes by Tracking. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). San Francisco, CA. 2010.
  • Juan Carlos Niebles, Bohyung Han, Andras Ferencz and Li Fei-Fei. Extracting Moving People from Internet Videos. European Conference on Computer Vision (ECCV). Marseille, France. 2008. PDF Project Page

A Hierarchical Model of Shape and Appearance for Human Action Classification

We present a novel model for human action categorization. A video sequence is represented as a collection of spatial and spatial-temporal features by extracting static and dynamic interest points. We propose a hierarchical model that can be characterized as a constellation of bags-of-features and that is able to combine both spatial and spatial-temporal features.

Related publications:

  • Juan Carlos Niebles and Li Fei-Fei. A Hierarchical Model of Shape and Appearance for Human Action Classification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Minneapolis, MN. 2007.

Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words

Can computers automatically identify different human activities in a video? We address this problem by representing a video sequence as a "bag of video words" and applying a generative probabilistic model to learn and recognize different human actions in video. In the figure, the approach is used to classify three different motions from figure skating.
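The "bag of video words" representation can be sketched as follows, under assumptions of our own (the function name and nearest-centroid quantization are illustrative; the codebook itself would be learned, e.g. by clustering training descriptors): spatio-temporal interest-point descriptors are quantized against a codebook, and the video is summarized by its word-count histogram, which the generative model then learns over.

```python
import numpy as np

def bag_of_video_words(descriptors, codebook):
    """Quantize spatio-temporal interest-point descriptors to their nearest
    codebook entry and return the video's word-count histogram.

    descriptors: (N, D) array of local descriptors extracted from one video.
    codebook: (K, D) array of learned visual-word centers.
    """
    d = np.asarray(descriptors, dtype=float)
    dist2 = ((d[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    words = dist2.argmin(axis=1)                                        # word index per descriptor
    return np.bincount(words, minlength=len(codebook))                  # histogram over K words
```

The histogram discards the spatial and temporal layout of the interest points, which is what makes the representation simple enough for unsupervised learning of action categories.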

There is a description and some resources available at our Project Page.

Related publications:

  • Juan Carlos Niebles, Hongcheng Wang and Li Fei-Fei. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. International Journal of Computer Vision (IJCV). 2008.
  • Juan Carlos Niebles, Hongcheng Wang and Li Fei-Fei. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. British Machine Vision Conference (BMVC). Edinburgh, Scotland. 2006. (Oral presentation)
  • Juan Carlos Niebles, Hongcheng Wang and Li Fei-Fei. Unsupervised Learning of Human Action Categories. Video Proceedings of CVPR. New York, NY. 2006.


Collaborators

Li Fei-Fei, Hongcheng Wang, Silvio Savarese, Andrey Del Pozo, Jia Li, Bohyung Han, Andras Ferencz, Bangpeng Yao, Chih-Wei (Louis) Chen, Hrishikesh Aradhye, Luciano Sbaiz, Mani Golparvar-Fard, Alvaro Soto and Bernard Ghanem.