Recognizing and monitoring manipulation actions

by Tom Murray, Michael Pfeiffer, Suchi Saria, and Andreas Andreou

In this part of the project we discussed and implemented methods to recognize and monitor manipulation actions from vision, using the low-level detectors developed by the other members of the project. The goal was to translate the verbal script for the activity into a probabilistic graphical model that could infer the most likely observed actions, handle uncertainty in the feature detectors, and output the estimated progress through the sequence of manipulation actions, so that the next instruction could be displayed online.

We discussed various options and decided that the model best suited to this task would be the Hierarchical Hidden Markov Model (HHMM) (Fine et al. 1998). An HHMM extends the classical Hidden Markov Model for inference on time-series data. An HMM is a graphical model characterized by a temporal sequence of hidden states, in which the next state depends only on the current state, not on the entire previous history. Every hidden state has an observation model, which gives a probability distribution over observable inputs. The main difference in an HHMM is its nested structure, in which every state is itself an HHMM: a state at a lower level of the hierarchy produces a sequence of observations, whereas higher-level states produce sequences of lower-level sequences, and so on.
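As a minimal, self-contained illustration of these ingredients (all state counts, symbols, and probabilities below are invented for the example, not taken from the project), a flat HMM can be written as an initial distribution pi, a transition matrix A, and an observation matrix B, with the forward algorithm giving the filtered posterior over hidden states:

```python
import numpy as np

# Minimal flat HMM: initial distribution pi, transition matrix A,
# observation matrix B (hypothetical sizes and values, illustration only).
pi = np.array([0.9, 0.1])            # P(first hidden state)
A  = np.array([[0.8, 0.2],           # A[i, j] = P(next state j | state i)
               [0.0, 1.0]])
B  = np.array([[0.7, 0.2, 0.1],      # B[i, k] = P(observation k | state i)
               [0.1, 0.3, 0.6]])

def forward_filter(obs, pi, A, B):
    """Posterior P(state_t | obs_1..t) for each t (forward algorithm)."""
    alpha = pi * B[:, obs[0]]
    alpha /= alpha.sum()
    posteriors = [alpha]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        alpha /= alpha.sum()          # normalize to avoid underflow
        posteriors.append(alpha)
    return np.array(posteriors)

print(forward_filter([0, 1, 2, 2], pi, A, B))
```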

This structure is ideally suited to the task because we assume that longer activities can be split into a sequence of steps, and those steps can be split into a series of sub-steps corresponding to the low-level activities that the vision algorithms can detect. Steps of activities correspond to high-level states, whereas low-level states correspond to the execution of sub-steps. If input arrives frame-by-frame from the vision systems, we can introduce a further level of hierarchy that takes into account that every recognized action has a characteristic duration. An HHMM can therefore infer which step of the activity sequence a person is in, and whether the performed sequence of low-level actions follows the desired script.

We defined a model, as shown in the following figure:

[Figure: HHMM illustration]

This model has steps at the highest level, sub-steps one level below, and a third level of hierarchy in which every sub-step is characterized by a combination of Tool, Action, and Object. Finally, these states give rise to a number of observations at the lowest level of the hierarchy; this is where the input from the visual feature detectors feeds in. There are also termination states F_SS and F_X, which indicate to the level above that a sequence has finished and return control to the next higher layer.
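To make the nesting concrete, one way to write down such a script is as a nested structure; the step names and Tool/Action/Object triplets below are invented placeholders, not the actual script used in the project:

```python
# Hypothetical script: steps -> sub-steps -> (Tool, Action, Object) triplets.
# The F_SS / F_X termination states are implicit here: a sub-step sequence
# ends after its last triplet, returning control to the level above.
script = {
    "step_1_prepare": [
        ("hand", "reach", "cup"),
        ("hand", "grasp", "cup"),
    ],
    "step_2_pour": [
        ("cup", "tilt",  "bowl"),
        ("cup", "place", "table"),
    ],
}
```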

The advantage of such a model is that it is general enough to model a wide variety of manipulation actions; in particular, it can be learned with an extension of the Baum-Welch algorithm, given enough training data. It would also allow alternative plans to be modeled or learned from observations, e.g., if a person decides not to stick to the script but nevertheless manages to complete the task. The Tools/Actions/Objects abstraction level makes this easier, since only this triplet has to be recognized from the visual observations, which reduces the state space of possible alternative plans and allows the re-use of observation models learned for a combination of the three variables. A further advantage is that this abstraction level allows an almost seamless integration of additional sensors, e.g., if ultrasound or neuromorphic vision sensors, or classifiers for hand trajectories, are included to allow a more reliable recognition of the action sequence.
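Given enough recorded sequences, the parameters of a flat HMM can be fit with Baum-Welch (EM); the hierarchical extension of Fine et al. would require additional machinery, but the following sketch, using the third-party hmmlearn library on toy categorical data, shows the flat case (in older hmmlearn versions CategoricalHMM is called MultinomialHMM):

```python
import numpy as np
from hmmlearn import hmm   # third-party; pip install hmmlearn

# Two toy training sequences of categorical observation symbols.
seq1 = np.array([[0], [0], [1], [2], [2]])
seq2 = np.array([[0], [1], [1], [2]])
X = np.concatenate([seq1, seq2])
lengths = [len(seq1), len(seq2)]

# Baum-Welch (EM) fit of a 3-state HMM over the observed symbols.
model = hmm.CategoricalHMM(n_components=3, n_iter=100, random_state=0)
model.fit(X, lengths)

print(model.transmat_)        # learned transition matrix
print(model.decode(seq1)[1])  # Viterbi state path for seq1
```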

Since we did not have enough training data with the simultaneous outputs of all feature detectors available during this workshop, we decided to use a simpler model to obtain first results. In particular, we used a collection of HMMs, one for each step of the activity sequence, instead of a single hierarchical HMM.

[Figure: Illustration of an HMM]

The HMMs were designed to have one state per sub-step of the original script, each of which could be associated with a verbal description of the current or next manipulation action. We designed the state space to be feed-forward, with a constant transition and self-transition probability for every state. Due to the lack of training data, we had to define the observation models qualitatively by hand. However, we found that after moderate tuning of the observation models, the HMMs were able to track the state sequence reasonably well. The probabilities for grasps, actions, tools, and objects in the observation model were based on the verbally defined script: we assigned the highest probabilities to the main grasps, actions, tools, and objects in the script, intermediate probabilities to similar objects or actions, and low but non-zero probabilities to all other entries.

We then used the classic Viterbi algorithm to infer the most likely state sequence in the HMM over the entire sequence of observations for one step (between 200 and 390 frames). Confusions occurred occasionally, either due to unexpected outputs of the feature detectors, or due to transitions between states that were hard to recognize because of the similarity of the observations. Also, because the probabilities were entered manually, we expect a significant improvement in performance if the HMM parameters can be learned from training data instead of being hand-defined. For the demonstration, we inferred the most likely state at every frame of the video synchronized across all feature detectors, and provided a verbal description of the model's most likely state.
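The following sketch reproduces this recipe on toy data: a feed-forward transition matrix with constant self-transition probability, a tiered hand-set observation model, and log-space Viterbi decoding. All probabilities and the single categorical feature stream are invented for illustration; the real observation models combined the grasp, action, tool, and object detector outputs.

```python
import numpy as np

def feedforward_transitions(n_states, p_stay=0.8):
    """Feed-forward chain: constant self-transition, remainder to the next state."""
    A = np.eye(n_states) * p_stay
    for i in range(n_states - 1):
        A[i, i + 1] = 1.0 - p_stay
    A[-1, -1] = 1.0                          # last sub-step absorbs
    return A

def viterbi(obs, pi, A, B):
    """Most likely hidden state sequence (log-space to avoid underflow)."""
    logA, logB = np.log(A + 1e-12), np.log(B + 1e-12)
    T, N = len(obs), len(pi)
    delta = np.log(pi + 1e-12) + logB[:, obs[0]]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA       # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Tiered, hand-set observation model: highest probability for the scripted
# symbol, intermediate for similar ones, low but non-zero for the rest.
B = np.array([[0.70, 0.20, 0.05, 0.05],
              [0.05, 0.70, 0.20, 0.05],
              [0.05, 0.05, 0.25, 0.65]])
pi = np.array([0.9, 0.05, 0.05])
A = feedforward_transitions(3)

obs = [0, 0, 1, 1, 2, 3, 3]
print(viterbi(obs, pi, A, B))                # [0, 0, 1, 1, 2, 2, 2]
```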

In conclusion, the results were satisfactory, although not always 100% correct. We see room for improvement both by moving to the more general HHMM architecture described above and by learning the HMM models from real training data, as soon as enough data has been collected. We also think that the HHMM has the potential to become a general mechanism for learning hierarchical, flexible, and reusable models for recognizing and monitoring manipulation actions.

References:
- Fine, S., Singer, Y., & Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32(1), 41-62.