Action features

Participants: Yezhou Yang, Hui Ji, Ching Teo, Cornelia Fermuller, Yiannis Aloimonos

We are aware that we need to develop generative models of actions, and we are working towards that goal. For the demo, however, we decided to develop discriminative models that differentiate among the actions in the repertoire. Our data was obtained by applying PrimeSense's OpenNI real-time skeleton tracking software to the Kinect data. This software provides the x-, y-, and z-coordinates of the joints of a human skeleton model over time. From this data we computed the following attributes.
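
The sketches below illustrate these computations in Python with NumPy. They assume each hand trajectory has been collected into a (T, 3) array of x-, y-, z-coordinates, one row per frame; this layout and the array names are our illustration, not the OpenNI API:

    import numpy as np

    # Assumed layout for the sketches below: each tracked joint is a
    # (T, 3) array of x, y, z coordinates over the T frames of one
    # segmented action sequence (one row per frame from the tracker).
    T = 120
    left_hand = np.random.rand(T, 3)   # placeholder for real tracker output
    right_hand = np.random.rand(T, 3)  # placeholder for real tracker output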

Average distance between the two hands: a discriminative feature that separates three classes of actions: actions with a large manipulation range such as 'transfer', actions with an intermediate hand distance such as 'cut', and actions with a small hand distance such as 'chop'. This feature is computed as the mean of the Euclidean distances between the actor's two hand positions over all frames in the sequence. The sequences were segmented automatically: every sequence starts with the actor touching the tool and ends with the actor releasing it.
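
A minimal sketch of this feature under the array layout above (the function name is hypothetical):

    import numpy as np

    def average_hand_distance(left_hand, right_hand):
        """Mean Euclidean distance between the two hand joints over all
        frames of one segmented sequence; inputs have shape (T, 3)."""
        return float(np.linalg.norm(left_hand - right_hand, axis=1).mean())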

Average speed of the hands: The speed is computed from the difference between the hand locations in consecutive frames, and the feature is the mean speed over all frames. This feature allows us to separate actions into two groups: actions with high hand speed such as 'chop' and actions with relatively low speed such as 'slice'.
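
This could be computed as follows; dividing by a frame interval dt is our addition (dt = 1 reproduces the plain per-frame difference):

    import numpy as np

    def average_hand_speed(hand, dt=1.0):
        """Mean speed of one hand: magnitude of the displacement between
        consecutive frames, divided by the frame interval dt."""
        velocity = np.diff(hand, axis=0) / dt   # shape (T-1, 3)
        return float(np.linalg.norm(velocity, axis=1).mean())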

Average speed of the hands along the x-, y-, and z-components: We compute the differences in hand location between consecutive frames along the x-, y-, and z-directions separately, and again take the mean over all frames. This feature helps discriminate actions of similar overall speed that take place along different directions. For example, in the action 'stir' the hand moves faster along the x- and z-directions, while in the action 'peel' it moves faster along the y-direction.
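
A per-axis variant of the speed sketch above; averaging the absolute per-frame differences is our reading of this feature (a signed mean would collapse to the net displacement):

    import numpy as np

    def average_axis_speeds(hand, dt=1.0):
        """Mean absolute speed along the x-, y-, and z-axes separately,
        returned as a length-3 array in axis order (x, y, z)."""
        velocity = np.diff(hand, axis=0) / dt   # shape (T-1, 3)
        return np.abs(velocity).mean(axis=0)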

Average vertical position of the two hands: the mean of the y-locations of both hands over all frames. This allows us to differentiate between actions in which the actor works close to the table, such as 'chop', and actions in which the hands are usually higher, such as 'peel', where the actor typically raises the peeler.
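
Assuming y is the vertical axis (column 1 in the layout above), this feature is a one-liner:

    import numpy as np

    def average_vertical_position(left_hand, right_hand):
        """Mean y-coordinate of both hands over all frames; assumes the
        y-axis (column 1) is vertical."""
        return float(np.concatenate([left_hand[:, 1], right_hand[:, 1]]).mean())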

Frequency components of the Fourier transform: We computed the fast Fourier transform (FFT) of the hand trajectories and used the dominant coefficients as a descriptor. Intuitively, this allows us to distinguish repetitive from non-repetitive actions. For example, the action 'chop' (see Figure 1) has a relatively high dominant frequency, while the action 'pour' has a low one.

Figure 1: Trajectory and FFT of the action 'chop'.
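
One way to extract such a descriptor is sketched below; the choices of trajectory component (the vertical y-coordinate) and frame rate are our assumptions for illustration:

    import numpy as np

    def dominant_frequency(hand, fps=30.0):
        """Frequency (Hz) of the strongest non-DC coefficient in the FFT of
        the vertical hand trajectory; a repetitive action such as 'chop'
        yields a higher dominant frequency than a smooth action like 'pour'.
        The frame rate fps is an assumed parameter."""
        y = hand[:, 1] - hand[:, 1].mean()             # remove the DC offset
        spectrum = np.abs(np.fft.rfft(y))              # one-sided magnitude spectrum
        freqs = np.fft.rfftfreq(len(y), d=1.0 / fps)   # frequency of each bin
        return float(freqs[np.argmax(spectrum[1:]) + 1])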