Description of the system

Participants: Cornelia Fermuller, Yiannis Aloimonos, Andreas Andreou, Katerina Pastra, Eirini Balta, Ryad Benjamin Benosman, Michael Pfeiffer, Aleksandrs Ecins, Austin Myers, Ching Teo, Douglas Summer-stay, Ajay Mishra, Hui Ji, Yezhou Yang, Tomas Figliolia, Je Hi An, Katie Zhuang

Hardware description

Our robot consists of the “Erratic” wheeled base platform from Videre Design, equipped with a laser range sensor (the Hokuyo ERC), a pan-tilt unit carrying a Kinect RGB-D camera, and a sensorium of acoustic sensors: a microphone array, a micro-Doppler ultrasound system, and a vibration sensor array. The sensorium consists of acoustic sensors specially designed for acoustic scene analysis (Julian 2004; Zhang 2007, 2008; Georgiou 2011). With fine time-domain synchronization, these sensors provide a rich set of signals for parsing the acoustic scene. Figure 1 shows the sensor setup (including active and passive vision sensors). As can be seen, the Kinect camera, the two active micro-Doppler sensors, and four color cameras are mounted on the platform. The quad microphone array is suspended from the ceiling, directly above the sensor platform. The system is controlled by a laptop running ROS (the Robot Operating System) under Linux.

Figure 1: Hardware setup

System Overview

There are two parallel streams: the visual stream and the auditory stream.

The visual pathway

The visual stream from the Kinect sensor, as shown in Figure 2, feeds two processes: one that recovers the skeleton of any human present, and one that begins segmenting the objects on the table by applying the torque mechanism and the visual filters. The torque is an attention mechanism that produces a retinotopic map marking the likely locations of objects on the table (proto-objects). The visual filters are trained on the visual appearance of objects. Together, the torque and the filters select a set of image locations where objects exist. These locations are then passed as candidate fixation points to the fixation-based segmentation process, which segments each object by placing a contour along its occluding boundary. From the segments we estimate a number of visual attributes, which together with the visual filters provide an initial recognition of tools and objects.
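The attention-then-segmentation flow above can be sketched as a small pipeline. This is a minimal illustration, not the project's actual code: the data structures, thresholds, and stub classifiers below are all assumptions, standing in for the torque map, the fixation-based segmenter, and the attribute-based recognizer.

```python
from dataclasses import dataclass

# All names and thresholds below are illustrative, not the project's real API.

@dataclass
class ProtoObject:
    fixation: tuple   # candidate fixation point (x, y) taken from the torque map
    score: float      # attention score at that location

def torque_attention(saliency_map):
    """Select peaks of the (stub) torque map as candidate object locations."""
    return [ProtoObject(fixation=(x, y), score=s)
            for (x, y), s in saliency_map.items() if s > 0.5]

def fixation_segmentation(proto):
    """Stub: trace a closed contour along the occluding boundary of the
    object containing the fixation point."""
    return {"fixation": proto.fixation, "contour": [proto.fixation]}

def recognize(segment):
    """Stub: map visual attributes of a segment to an object label."""
    return "tomato" if segment["fixation"][0] < 100 else "knife"

# Toy saliency map: two strong peaks (objects) and one weak response.
saliency = {(40, 60): 0.9, (200, 80): 0.7, (10, 10): 0.1}
objects = [recognize(fixation_segmentation(p)) for p in torque_attention(saliency)]
```

The point of the sketch is the control flow: attention proposes a sparse set of fixation points, segmentation turns each into a bounded region, and recognition runs only on those regions.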

The skeleton is then passed to the action interpretation process, which recognizes the human action taking place by considering both the movements and the identity of the objects that the hands touch. This produces a triplet (tool, action, object) that is given to the Praxicon (the Reasoner). The Reasoner either accepts the triplet and produces a verbalization of what is taking place, or replies to the system with alternative triplets. In the latter case, the robot has to acquire new data (active perception) by moving appropriately to investigate the alternatives provided by the Praxicon. This last step was not implemented in the final demo due to hardware problems with the robot’s motors.
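The accept-or-propose exchange with the Reasoner can be written down as a simple contract. The Praxicon's real interface is not specified in this report, so the function below is a hypothetical sketch: the set of valid triplets and the verbalization template are made up for illustration.

```python
# Hypothetical sketch of the triplet/Reasoner exchange; the knowledge base
# and interface names are assumptions, not the Praxicon's actual API.

VALID = {("knife", "slice", "tomato"), ("masher", "mash", "tomato")}

def reasoner(triplet):
    """Accept a plausible (tool, action, object) triplet and verbalize it,
    or propose alternative triplets for the robot to investigate."""
    if triplet in VALID:
        return {"verbalization": "The %s is used to %s the %s." % triplet,
                "alternatives": []}
    tool, _, obj = triplet
    # Alternatives share the observed tool or object with the rejected triplet.
    return {"verbalization": None,
            "alternatives": [t for t in VALID if t[0] == tool or t[2] == obj]}

ok = reasoner(("knife", "slice", "tomato"))    # accepted -> verbalization
bad = reasoner(("knife", "mash", "tomato"))    # rejected -> alternatives
```

A non-empty `alternatives` list is what would trigger the active-perception step: the robot moves to gather new views and re-scores each proposed triplet.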

Figure 2: The visual system.

The auditory pathway

Auditory cognition involves the perception-reasoning-action loop depicted in Figure 3 and follows an architecture analogous to its visual counterpart. Perception begins at the sensorium, with sensors from multiple active and passive acoustic modalities: the micro-Doppler array, the microphone array, and the vibration sensor arrays. The sensorium feeds multiple auditory processing streams in which features are extracted in the spectral and time domains. After feature extraction, the auditory “scene” is segmented and classified to yield a quadruple data structure: <place, object, action, tool>. This broadly defined symbolic representation of the scene can be thought of as a parsing of the acoustic environment into <where, what, how, who>, and it feeds the reasoning system as manifested by the Praxicon. At the level of the Praxicon the auditory and visual streams come together, and the reasoning system produces two outputs. The first is a verbal description of the scene and the actions, which drives the dialogue between the human and the cognitive artifact. As a result of this dialogue, the Praxicon produces the second output: it directs the sensorium, through its motor control, to re-align so as to improve the system’s performance. This is how the perception-cognition-action loop is closed.
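The feature-extraction and quadruple stages above can be sketched as follows. This is a toy illustration under stated assumptions: the two features (mean energy and zero-crossing count) merely stand in for the spectral and time-domain features of the real streams, and the classifier is a stub, not the system's trained model.

```python
from collections import namedtuple

# The <place, object, action, tool> parse described in the text.
SceneParse = namedtuple("SceneParse", ["place", "object", "action", "tool"])

def extract_features(samples):
    """Toy stand-ins for spectral/time-domain features:
    mean energy and zero-crossing count of the waveform."""
    energy = sum(s * s for s in samples) / len(samples)
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    return energy, crossings

def classify(features):
    """Stub classifier mapping features to the symbolic quadruple."""
    energy, crossings = features
    action = "chop" if crossings > 2 else "stir"  # invented decision rule
    return SceneParse(place="kitchen", object="cucumber",
                      action=action, tool="knife")

parse = classify(extract_features([0.2, -0.3, 0.4, -0.1, 0.5]))
```

The quadruple is what crosses the boundary into the Praxicon, so keeping it a flat symbolic record (rather than raw features) is what lets the auditory and visual streams merge at the reasoning level.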

Figure 3: Auditory perception-cognition-action architecture.

In the following sections we describe in more detail the information processing in the visual pathway, the auditory pathway and the Reasoner.

The data set

The following list of ‘tool - action - affected object’ triplets was used in the demonstration:

Tool           Action     Object
knife          slice      tomato
masher         mash       tomato
hand           put        bowl
peeler         peel       cucumber
knife          chop       cucumber
hand           put        bowl
salt shaker    sprinkle   bowl
salad spoon    toss       bowl
ladle          pour       bowl
spoon          stir       bowl
pitcher        pour       mug
butter knife   spread     bread
bottle         pour       wine glass
bottle         pour       champagne glass
small bowl     pour       large bowl
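For evaluation it is convenient to hold this demonstration set in the same (tool, action, object) form that the Reasoner consumes. The encoding below is a straightforward transcription of the table; the index-by-tool structure is one possible way to use it, not something the report prescribes.

```python
# The demonstration set above, encoded as (tool, action, object) triplets.
DEMO_TRIPLETS = [
    ("knife", "slice", "tomato"),
    ("masher", "mash", "tomato"),
    ("hand", "put", "bowl"),
    ("peeler", "peel", "cucumber"),
    ("knife", "chop", "cucumber"),
    ("hand", "put", "bowl"),
    ("salt shaker", "sprinkle", "bowl"),
    ("salad spoon", "toss", "bowl"),
    ("ladle", "pour", "bowl"),
    ("spoon", "stir", "bowl"),
    ("pitcher", "pour", "mug"),
    ("butter knife", "spread", "bread"),
    ("bottle", "pour", "wine glass"),
    ("bottle", "pour", "champagne glass"),
    ("small bowl", "pour", "large bowl"),
]

# Index by tool, e.g. to look up which actions/objects each tool supports.
by_tool = {}
for tool, action, obj in DEMO_TRIPLETS:
    by_tool.setdefault(tool, []).append((action, obj))
```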


(Julian 2004) P. Julian, A. G. Andreou, L. Riddle, S. Shamma, D. Goldberg, and G. Cauwenberghs, “A comparative study of sound localization algorithms for energy aware sensor network nodes,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, no. 4, pp. 640–648, Jul. 2004.

(Georgiou 2011) J. Georgiou et al., “A multimodal-corpus data collection system for cognitive acoustic scene analysis,” 45th Annual Conference on Information Sciences and Systems (CISS 2011), Mar. 2011.

(Zhang 2007) Zhaonian Zhang, P. Pouliquen, A. Waxman, and A. G. Andreou, “Acoustic micro-Doppler gait signatures of humans and animals,” 41st Annual Conference on Information Sciences and Systems (CISS 2007), pp. 627–630, 2007.

(Zhang 2008) Zhaonian Zhang and A. G. Andreou, “Close range bearing estimation and tracking of slow-moving vehicles using the microphone arrays in the Hopkins Acoustic Surveillance Unit,” 3rd Argentine School of Micro-Nanoelectronics, Technology and Applications (EAMTA 2008), pp. 140–143, Aug. 2008.