A cognitive robot detecting objects using sound, language, and vision
Members: Aleksandrs Ecins, Adam McLeod, Andreas Andreou, Austin Myers, Ching Teo, Daniel B. Fasnacht, Dimitris Pinotsis, Eirini Balta, Merve Kaya, Fabio Stefanini, Francisco Barranco, Cornelia Fermuller, John Harris, Je Hi An, Roi Kliper, Katerina Pastra, Kailash Patil, Malcolm Slaney, Mounya Elhilali, Ozlem Kalinli, Michael Pfeiffer, Ryad Benjamin Benosman, Samantha Adams, Shih-Chii Liu, Siddharth Joshi, Tomas Figliolia, Timmer Horiuchi, Tobi Delbruck, Troy Lau, Yan Wu, Yezhou Yang
- Organized by Cornelia Fermuller, Yiannis Aloimonos, & Andreas Andreou
| Cornelia Fermuller | cornelia.fermuller@… | 26-Jun | 16-Jul |
| Yiannis Aloimonos | yiannis@… | 26-Jun | 16-Jul |
| Andreas Andreou | andreou@… | 29-Jun | 18-Jul |
| Ryad Benjamin Benosman | benjry.benos@… | 25-Jun | 16-Jul |
| Katerina Pastra | kpastra@… | 26-Jun | 16-Jul |
| Eirini Balta | ebalta@… | 26-Jun | 16-Jul |
| Hui Ji | matjh@… | 5-Jul | 16-Jul |
| Ajay Mishra | mishraka@… | 1-Jul | 8-Jul |
| Douglas Summers-Stay | dsummerstay@… | 30-Jun | 9-Jul |
| Austin Myers | amyers@… | 26-Jun | 3-Jul |
Related tutorial: please go to 2011/ros11 to download ROS-related software.
Problem description:
We propose to study the interaction between sound, high-level knowledge (in the form of language), and visual processes for solving the problem of object recognition for an embodied system. We envision a system that has the same major cognitive components that humans use to solve this problem. These include: 1) speech understanding; 2) a high-level cognitive system (in the form of language) that can reason about object properties; 3) a vision system that segments the image regions corresponding to objects and extracts visual properties of these regions based on 2D visual appearance and shape attributes; 4) an attention mechanism that, using information from language and vision, decides where in the image or video to focus next and what information to extract; and 5) a memory structure organizing object knowledge.
To demonstrate the ideas, we would like to combine the different components into one system and solve the following problem: a robot is given the names of objects in spoken language and finds these objects in a room using its vision system. We plan to bring our robot, equipped with a laser range sensor, a sonar ring, and a pan-tilt unit carrying a stereo rig of four colour cameras. The robot has software for basic navigation capabilities (obstacle avoidance, path planning) and for building a map of the place.
Relationship to previous work:
While visual object recognition is a heavily studied problem in computer vision, the current framework for this research is not anthropomorphic but database-driven. Current object recognition approaches do not segment the scene into regions corresponding to objects; instead, they passively search the image with templates of appearance-based feature descriptors. Their success is largely due to advances in learning techniques. A few recent studies considered additional information for recognition from labeled images, transcripts, and language resources, but they treated language simply as a contextual system. In contrast, we would like to study the interaction between signal processing (vision and sound) and higher cognitive processes (language processing), and implement them in a system with an attention mechanism, following an active approach.
List of specific topic area projects:
Speech processing: Understanding spoken instructions about objects.
Natural Language Processing: Developing the tools to extract properties of objects to aid the visual processes. Such properties are visual attributes (color, texture, shape), object part descriptions, and information about the spatial relationships of objects in the scene.
Visual processing: Segmenting the scene into visual regions corresponding to objects, and computing 2D properties such as texture and contours as well as 3D shape primitives.
Attention system: Developing a framework (possibly using information theory) for deciding where to look next on the basis of higher-level knowledge together with visual information.
Memory: Studying which primitives of shape and 2D visual appearance characterize specific objects, and how we can organize this knowledge in a principled fashion.
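As an illustration of the information-theoretic option mentioned for the attention system, the sketch below picks the next fixation by expected information gain. It is only one possible reading, not the project's method: the function names, the discrete observation model, and the toy numbers are all assumptions made for the example. The prior over object hypotheses stands in for knowledge coming from language, and each candidate fixation is described by a table of P(observation | object).

```python
import math
from typing import Dict

Belief = Dict[str, float]                  # object hypothesis -> probability
Likelihood = Dict[str, Dict[str, float]]   # object -> {observation: P(obs | object)}


def entropy(p: Belief) -> float:
    """Shannon entropy in bits."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def expected_information_gain(prior: Belief, likelihood: Likelihood) -> float:
    """Expected reduction in entropy over object hypotheses for one fixation."""
    observations = {obs for dist in likelihood.values() for obs in dist}
    gain = entropy(prior)
    for obs in observations:
        joint = {obj: prior[obj] * likelihood[obj].get(obs, 0.0) for obj in prior}
        p_obs = sum(joint.values())
        if p_obs > 0:
            posterior = {obj: v / p_obs for obj, v in joint.items()}
            gain -= p_obs * entropy(posterior)
    return gain


def next_fixation(prior: Belief, candidates: Dict[str, Likelihood]) -> str:
    """Return the name of the candidate fixation with the highest expected gain."""
    return max(candidates, key=lambda name: expected_information_gain(prior, candidates[name]))


# Toy example (hypothetical numbers): looking at the handle region separates
# "cup" from "bowl", looking at the rim region does not.
prior = {"cup": 0.5, "bowl": 0.5}
candidates = {
    "handle_region": {
        "cup": {"handle": 0.9, "no_handle": 0.1},
        "bowl": {"handle": 0.1, "no_handle": 0.9},
    },
    "rim_region": {
        "cup": {"round": 0.5, "not_round": 0.5},
        "bowl": {"round": 0.5, "not_round": 0.5},
    },
}
print(next_fixation(prior, candidates))
```

In this toy setting the rim region carries no information about the object identity, so the handle region is selected.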
Talks
Projects
In this project we plan to put together a complete system consisting of a robot with vision, sound, and language (for reasoning) that recognizes human manipulation activities. A manipulation activity in this description consists of three quantities: the tool, the action, and the object, for example “knife, cut, tomato”. The robot, with software developed under ROS, looks at a scene in which a person performs a manipulation action and outputs a verbal description of the activity, such as “A person cuts a tomato with a knife.”
The different components that will be developed:
1. Enriching the Praxicon:
The Praxicon is a lexical resource that encodes information about the relationship of actions and objects, descriptions of actions, and descriptions of objects. In this project the Praxicon will be enhanced with information relating to the manipulation actions analyzed in this project.
2. The reasoner:
The reasoner takes as input the quantities recognized by the visual modules (tool, object, action) and verifies whether the combination is possible. If it is not, it indicates which of the quantities was likely recognized erroneously and suggests alternatives. It also generates the sentence describing the scene.
Output is in the form:
visual: ok
alternative type: objA probability, objB probability, objC probability etc.
verbalisation: 'sentence describing the scene'
Examples:
visual: ok
alternative type: none
verbalisation: 'cut the tomato with the knife'
or
visual: wrong
alternative tool: butter knife 1, slicer 0.5, etc.
verbalisation: none
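The output format above could be produced by a small formatting helper along the following lines. This is a minimal sketch only: `format_reasoner_output` and its parameters are hypothetical names chosen for illustration and are not part of the Praxicon or the reasoner's actual interface.

```python
from typing import Dict, Optional


def format_reasoner_output(
    visual_ok: bool,
    alternative_kind: Optional[str] = None,          # e.g. "tool", "object", "action"
    alternatives: Optional[Dict[str, float]] = None,  # candidate name -> score
    verbalisation: Optional[str] = None,
) -> str:
    """Render a reasoner result in the three-line format described above."""
    lines = ["visual: ok" if visual_ok else "visual: wrong"]
    if alternatives:
        # List the suggested replacements, highest score first.
        ranked = sorted(alternatives.items(), key=lambda item: -item[1])
        listing = ", ".join(f"{name} {score:g}" for name, score in ranked)
        lines.append(f"alternative {alternative_kind}: {listing}")
    else:
        lines.append("alternative type: none")
    if verbalisation is None:
        lines.append("verbalisation: none")
    else:
        lines.append(f"verbalisation: '{verbalisation}'")
    return "\n".join(lines)


# The two examples from the text.
print(format_reasoner_output(True, verbalisation="cut the tomato with the knife"))
print(format_reasoner_output(False, "tool", {"butter knife": 1.0, "slicer": 0.5}))
```

Keeping the alternatives as a name-to-score mapping lets the caller decide how many suggestions to show.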
3. Communication between the robot and the Praxicon through a web-service
4. Object and Tool segmentation
To segment objects and tools, we use the fixation-based algorithm of Mishra et al. (2009), taking Kinect data (RGB-D) as input and adapting it to run under ROS.
5. Hand detection and finding when the hand comes in contact with the tool
Hands will be located and tracked using an existing ROS package. Code needs to be developed to determine the points in time at which the hand comes in contact with the tool and at which the hand releases the tool.
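One simple way to locate contact and release in time, assuming a per-frame hand-to-tool distance is already available from the tracking and segmentation modules, is thresholding with hysteresis. The sketch below is an assumption-laden illustration, not the code developed in the project: the function name, the distance input, and the threshold values (in metres) are all hypothetical.

```python
from typing import List, Optional, Tuple


def contact_intervals(
    distances: List[float],
    touch: float = 0.02,    # contact starts when distance drops to this or below
    release: float = 0.05,  # contact ends when distance rises to this or above
) -> List[Tuple[int, Optional[int]]]:
    """Return (contact_frame, release_frame) pairs from per-frame distances.

    Two thresholds (hysteresis) keep small jitter around a single threshold
    from producing spurious contact/release events. A release_frame of None
    means the hand is still holding the tool at the end of the sequence.
    """
    intervals: List[Tuple[int, Optional[int]]] = []
    start: Optional[int] = None
    for frame, d in enumerate(distances):
        if start is None and d <= touch:
            start = frame
        elif start is not None and d >= release:
            intervals.append((start, frame))
            start = None
    if start is not None:
        intervals.append((start, None))
    return intervals


# Hypothetical distance trace: approach, grasp, release, then a second grasp.
trace = [0.30, 0.10, 0.015, 0.01, 0.03, 0.04, 0.06, 0.20, 0.01]
print(contact_intervals(trace))
```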
6. Classifying grasp positions
Using as input point cloud data of the hand, extracted using a ROS package, the grasp pose will be classified into a small set of categories. The hand descriptor will be used as part of the action description.
7. Action description
Using as input Kinect data, an OpenNI ROS package extracts a skeleton model of the moving human. Action descriptions will be developed using as input the motion trajectory of the parts of the human skeleton along with hand pose descriptions.
8. The action parser
A parser operating on the visual input will break the video into visual primitives.
9. Object shape descriptions
A number of shape descriptors of objects and tools will be computed from the point clouds.
10. Learning the mapping between language attributes and visual attributes
A mapping between visual attributes and language attributes will be learned. The language attributes are linked to objects and to the actions that use these objects, providing additional information that is used as feedback during learning.
11. Visual Navigation
Using as input a map created with the laser sensor, odometry from the shaft encoders, and laser range data, a hippocampus model for localization will be developed.
12. Acquiring distance functions in image and sound space
Using visual and sound recordings of actions, together with a classification in image space, a distance function in sound space will be learned and used for clustering in sound space.
Attachments
- A cognitive robot detecting objects and recognizing actions.pdf (0.8 MB), added by fermuller 11 months ago
- long.pdf (4.6 MB), added by fermuller 11 months ago
- Thursday-discussion-summary.docx (14.0 KB), added by fermuller 11 months ago
- minimalist_grammar_abstract.docx (11.9 KB), added by fermuller 11 months ago
