Human pose estimation for recognizing manipulation actions

by Fang Wang, Francisco Barranco, and Yi Li


Human pose is an important cue for recognizing manipulation actions. Localizing body joints continuously over time makes it easier to track the hands, detect surrounding objects, and classify possible actions.


Fig. 1. Two examples of human pose estimation and manipulation action recognition.


In this project, we employ our human pose estimation method [1], which uses a mixed representation of single and combined parts in a latent tree model. Both color information [2] and Histograms of Oriented Gradients [3] are used as features, and the visual categories of the parts are learned automatically by a latent Support Vector Machine on the LSP dataset [4]. We further employ the method in [5] to handle pose inconsistency over time in videos. Instead of generating only the “best” pose according to the model, we generate the top 20 hypotheses per frame and then select the most consistent sequence of poses over time, using an objective function that measures the correlation between the poses in adjacent frames.
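The temporal selection step can be illustrated with a small dynamic-programming sketch. This is not the implementation from [5]: the pairwise term below (negative mean joint displacement) is an assumed stand-in for the correlation-based objective, and the names `pose_distance`, `select_consistent_poses`, and `unary` are hypothetical. With 20 hypotheses per frame, `candidates[t]` would hold 20 poses.

```python
import math

def pose_distance(pose_a, pose_b):
    """Mean Euclidean distance between corresponding joints of two poses."""
    return sum(math.dist(a, b) for a, b in zip(pose_a, pose_b)) / len(pose_a)

def select_consistent_poses(candidates, unary=None):
    """Pick one hypothesis per frame so that the whole sequence is most consistent.

    candidates[t][k] is the k-th pose hypothesis in frame t, a list of (x, y) joints.
    unary[t][k] is an optional per-hypothesis detector score (assumed, defaults to 0).
    Returns the index of the chosen hypothesis for every frame.
    """
    if unary is None:
        unary = [[0.0] * len(frame) for frame in candidates]

    best = list(unary[0])  # best score of any path ending at each hypothesis
    back = []              # back[t-1][j]: best predecessor of hypothesis j in frame t
    for t in range(1, len(candidates)):
        new_best, pointers = [], []
        for j, pose in enumerate(candidates[t]):
            # Assumed pairwise term: penalize large joint motion between frames.
            scores = [best[i] - pose_distance(prev, pose)
                      for i, prev in enumerate(candidates[t - 1])]
            i_best = max(range(len(scores)), key=scores.__getitem__)
            pointers.append(i_best)
            new_best.append(scores[i_best] + unary[t][j])
        best = new_best
        back.append(pointers)

    # Trace the best path backwards from the last frame.
    path = [max(range(len(best)), key=best.__getitem__)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Because each frame only interacts with its neighbor, the search over all hypothesis sequences costs on the order of T × K² pairwise evaluations for T frames and K hypotheses, rather than K^T.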

Treating human poses as a time series allows us to classify actions. We separate the videos into a training set and a test set, and perform human pose estimation on every frame in both sets. We then adopt a simple sliding-window technique and manually label the video sequences in the training set. Each sliding window in the training set is associated with a label ('default', 'transfer', 'cross-move', 'draw-line', 'sawing', 'hammering'), and a 1-nearest-neighbor (1-NN) classifier assigns an action label to the human pose in each frame of the test set.
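A minimal sketch of the sliding-window 1-NN step is given below. The window size, the use of raw concatenated joint coordinates as the feature vector, the Euclidean distance, and the function names are all assumptions made for illustration; the text does not specify them.

```python
import math

def sliding_windows(pose_sequence, size, step=1):
    """Yield (start_frame, feature_vector) for every window in a pose sequence.

    pose_sequence is a list of frames, each a list of (x, y) joints. The feature
    vector is the concatenation of all joint coordinates in the window.
    """
    for start in range(0, len(pose_sequence) - size + 1, step):
        window = pose_sequence[start:start + size]
        yield start, [c for pose in window for joint in pose for c in joint]

def classify_1nn(query, train_features, train_labels):
    """Return the label of the training window closest to the query window."""
    distances = [math.dist(query, feature) for feature in train_features]
    return train_labels[distances.index(min(distances))]

def label_windows(test_sequence, train_features, train_labels, size):
    """Assign an action label to every window of a test sequence."""
    return [(start, classify_1nn(feature, train_features, train_labels))
            for start, feature in sliding_windows(test_sequence, size)]
```

To obtain one label per frame, each window's label can be attached to a fixed frame of the window, for example its first or center frame; which frame is used here is likewise an assumption.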

Labeling Tool

We developed a labeling tool during the Neuromorphic Engineering Workshop to improve the results on the manipulation actions. The poses labeled in the training set are treated as positive samples and added to the LSP dataset.

The following clip shows how the labeling tool works. The user only needs to click the body joints in the image in a fixed order, and the whole skeleton is generated once all 12 body parts are labeled.
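The idea of turning ordered clicks into a skeleton can be sketched as follows. The joint names, the click order, and the limb connectivity are assumptions chosen for illustration; the order used by the actual tool is not stated here.

```python
# Assumed click order for the 12 joints (hypothetical, not the tool's documented order).
JOINT_ORDER = [
    "right_ankle", "right_knee", "right_hip",
    "left_hip", "left_knee", "left_ankle",
    "right_wrist", "right_elbow", "right_shoulder",
    "left_shoulder", "left_elbow", "left_wrist",
]

# Assumed limb connectivity used to draw the skeleton.
LIMBS = [
    ("right_ankle", "right_knee"), ("right_knee", "right_hip"),
    ("left_ankle", "left_knee"), ("left_knee", "left_hip"),
    ("right_hip", "left_hip"),
    ("right_wrist", "right_elbow"), ("right_elbow", "right_shoulder"),
    ("left_wrist", "left_elbow"), ("left_elbow", "left_shoulder"),
    ("right_shoulder", "left_shoulder"),
    ("right_shoulder", "right_hip"), ("left_shoulder", "left_hip"),
]

def build_skeleton(clicks):
    """Map ordered (x, y) clicks to named joints and line segments to draw."""
    if len(clicks) != len(JOINT_ORDER):
        raise ValueError(f"expected {len(JOINT_ORDER)} clicks, got {len(clicks)}")
    joints = dict(zip(JOINT_ORDER, clicks))
    segments = [(joints[a], joints[b]) for a, b in LIMBS]
    return joints, segments
```

Fixing the click order means the user never has to name a joint explicitly: the position of a click in the sequence determines which joint it is.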


On our newly collected dataset, we achieved 82% accuracy.

The following video shows the pose estimation and action recognition results for one of our manipulation actions ("Sawing"). The skeleton and the action labels are overlaid on the original color images.


[1] Fang Wang and Yi Li, Beyond Physical Connections: Tree Models in Human Pose Estimation, CVPR 2013

[2] J.-C. Terrillon, M. N. Shirazi, H. Fukamachi, and S. Akamatsu, Comparative Performance of Different Skin Chrominance Models and Chrominance Spaces for the Automatic Detection of Human Faces in Color Images, IEEE International Conference on Automatic Face and Gesture Recognition, 2000

[3] Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR 2005


[5] Dennis Park and Deva Ramanan, N-Best Maximal Decoders for Part Models, ICCV 2011