Recognizing Manipulation Actions in Cluttered Environments from Vision and Sound

Members: Alejandro Pasciaroni, Antonis Argyros, Ching Teo, Daniel Neil, Dimitra Emmanouilidou, Bert Shi, Francisco Barranco, Cornelia Fermuller, Greg Cohen, Amir Khosrowshahi, Laxmi Iyer, Michael Pfeiffer, Ryad Benjamin Benosman, Tomas Figliolia, Timmer Horiuchi, Thomas Murray, Will Constable, Yezhou Yang

Organizers: Cornelia Fermuller (Univ. of Maryland), Andreas Andreou (Johns Hopkins University)

This project aims at implementing a system that interprets manipulation actions from visual and auditory input. Understanding and recognizing such actions is very challenging. It involves many component problems: the recognition of humans, human body parts, actions, and objects, and each of these problems is currently of great interest in vision and sound research. The challenge in recognizing manipulations is even greater, because there is large variation in the way humans perform such actions, and because such scenes contain many occlusions: hands occlude objects, and objects occlude each other. When we humans recognize such scenes, our high-level processes, which carry knowledge, continuously interact with low-level image and sound processes. This interaction happens at multiple levels of complexity. For one, the processes interact at a very high level: we know that certain quantities are likely to co-occur. For example, if we recognize a kitchen knife, we can expect that it will likely be used for ‘cutting’ or ‘slicing’ a food item. In this project, however, our focus will be at a lower level. We are interested in the interaction of high-level processes with signal processing. High-level knowledge can guide the attention of vision and sound processes; it can guide segmentation, the selection of features to be processed, and the merging of visual and auditory processes.

In this project we plan to put together a system that monitors actions for their correctness. It could be thought of as an assistant that monitors (and eventually helps with) manipulation actions.

The set-up will be as follows: A human will be in front of a table with many objects. He/she will be asked to perform a task specified in a script consisting of a number of steps. That is, the script is a plan describing the temporal sequence of elementary actions on objects. We plan to create scripts for different tasks, for example ‘assembling a toy car’ (consisting of the steps: add the chassis on top of the bottom frame, screw on the four wheels, add the two lights). Different cameras (video cameras, an RGB-depth camera, one or multiple DVS cameras) and acoustic sensors, mounted around the table, will monitor the set-up. As the human performs the task, possibly making mistakes, the system has to recognize wrong actions and give a warning, ideally in real time.
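The script-as-plan idea can be sketched as a small monitor: a script is an ordered list of elementary (action, object) steps, and the system warns as soon as a recognized step deviates from the plan. This is only a minimal sketch; all names and the step vocabulary are hypothetical, and in the actual system the recognized steps would come from the vision and sound modules.

```python
# Minimal sketch of script-based action monitoring (names hypothetical).
# A script is an ordered list of (action, object) steps; the monitor
# compares each recognized step against the plan and warns on a mismatch.

from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    action: str   # e.g. "screw"
    obj: str      # e.g. "wheels"

class ScriptMonitor:
    def __init__(self, script):
        self.script = script
        self.pos = 0  # index of the next expected step

    def observe(self, action, obj):
        """Check one recognized step; return (ok, message)."""
        if self.pos >= len(self.script):
            return False, "task already complete; unexpected step"
        expected = self.script[self.pos]
        if (action, obj) == (expected.action, expected.obj):
            self.pos += 1
            return True, f"step {self.pos}/{len(self.script)} ok"
        return False, (f"expected '{expected.action} {expected.obj}', "
                       f"saw '{action} {obj}'")

# Toy-car assembly script from the text above
toy_car = [Step("add", "chassis"), Step("screw", "wheels"), Step("add", "lights")]
monitor = ScriptMonitor(toy_car)
print(monitor.observe("add", "chassis"))  # (True, 'step 1/3 ok')
print(monitor.observe("add", "lights"))   # mismatch -> warning
```

A real monitor would also need to tolerate recognition noise (e.g. require several consistent observations before warning), but the plan-following logic stays the same.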

List of specific projects

Segmentation of static objects

Segmentation of objects in motion (as they are assembled, disassembled, and deformed)

Recognition of objects based on contour and shape

Hand tracking

Recognition of hand motion

Recognition of full-body motions

Developing attention mechanisms including possibly high level knowledge

Developing formalisms to decide where to look next

Designing hierarchical descriptions of objects and hand motions

Recognition of the sounds of actions

Combining sound and vision signals for better recognition
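As an illustration of the last item, a common baseline for combining modalities is late fusion of per-action scores. The sketch below (action labels, scores, and weights are all hypothetical) combines normalized visual and auditory score distributions log-linearly, so that sound can disambiguate actions that look alike.

```python
# Minimal sketch of audio-visual late fusion (labels/weights hypothetical).
# Each modality yields per-action scores; a weighted product of the two
# normalized distributions (log-linear fusion) gives the fused decision.

def normalize(scores):
    """Scale scores so they sum to 1."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

def fuse(visual, audio, w_visual=0.6, w_audio=0.4):
    """Log-linear fusion: p(a) is proportional to p_v(a)^w_v * p_a(a)^w_a."""
    fused = {a: (visual[a] ** w_visual) * (audio[a] ** w_audio) for a in visual}
    return normalize(fused)

# Vision alone is ambiguous between 'cut' and 'stir'; the sound of a knife
# on a cutting board disambiguates.
visual = normalize({"cut": 0.45, "stir": 0.45, "pour": 0.10})
audio  = normalize({"cut": 0.80, "stir": 0.10, "pour": 0.10})
fused = fuse(visual, audio)
best = max(fused, key=fused.get)  # -> "cut"
```

The weights would in practice be tuned per task; more elaborate schemes (feature-level fusion, learned fusion networks) are possible, but late fusion is a simple, robust starting point when the modalities are trained separately.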

Invited Presentations

* Bert Shi: A Unified Framework for the Joint Development of Eye Movements and Visual Perception

* Yi Li: On Action Recognition

Results