Attention Driven Scene Analysis: Cocktail Party Simulation Results


  1. Francisco Barranco - Visual lead
  2. 'trevor' - Auditory lead
  3. Kailash Patil - Auditory co-lead
  4. 'merve' - Auditory co-lead
  5. Shih-Chii Liu - AV Mentor
  6. Barbara Shinn-Cunningham (Boston) - Auditory Mentor
  7. 'Vasconcelos' (UCSD) - Visual Mentor
  8. 'Jude Mitchel' (Salk) - Visual Mentor
  9. Mounya Elhilali - Auditory Mentor
  10. Malcolm Slaney - Auditory and Visual Mentor
  11. Tobi Delbruck - Software support with auditory localization


This project began as an investigation into combining auditory and visual saliency to capture better information about a scene, and within the first two weeks it evolved into the cocktail party simulation. In the end we did not use the visual information, but instead attempted to demonstrate the interplay between top-down and bottom-up attention using a "complex" auditory scene and a task.

Initial Version

Our first thought was to implement a system with simple objects moving around a scene and emitting sound. The objects can occlude each other, but we want to follow one object by paying constant attention to it. The idea was that when objects overlap and the visual information is of no use, the auditory information would guide the tracking. An example of this scene can be seen in "initial stimuli.png" in the attachments. However, after much discussion we abandoned this idea, because we could not find a way to make attention essential to the setup rather than reducing it to an ordinary tracking problem. The reason we use attention in computation is to make more efficient use of computational resources, but we could not frame this problem so that attention would help us rather than add computation.

Design of the Task

We decided to change the task so that it demonstrates attention. With a great deal of help from Barbara Shinn-Cunningham, we settled on the following task: a person hears a clip of numbers spoken by two speakers and has to report the highest number they heard. The challenge is that the numbers overlap in time, so keeping track of all of them is hard, and the system has to decide which number to attend to at any given point. In this setup each number consists of two digits, so when a person hears "nine one" they must interpret it as "ninety one". Throughout the process of designing the task, we all saw how difficult it is to design tasks for attention and saliency, especially when auditory and visual information are to be combined.

Programming the System

Our setup initially had two or more speakers and two microphones. We wanted to segregate the input streams so that when we choose to pay attention to input coming from one direction, we can understand what is being said there. This is not a very difficult task for humans: we are very good at blocking out sound from directions we are not interested in. Computationally, we wanted to do this with beamforming, a standard method. The key idea behind beamforming is that a signal from a given source reaches the two microphones with a slight time difference. We can therefore delay the channels and sum them so that the signal from one source aligns exactly and is boosted relative to the others. This procedure works very well for simple inputs such as impulses, but we could not get the output we wanted: the boosting was only minimal, and we could not work with it. "originalbinaural.wav" contains the stimulus, coming from two directions, and "reconstructed.wav" contains the result after beamforming. There is not much difference between them; we can hear that the beamforming is not doing a good separation. So we decided to use a simple template matching algorithm to extract the numbers directly from the mixed stream.
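
To make the delay-and-sum idea concrete, here is a minimal Python sketch (NumPy/SciPy). It is not the code we ran during the workshop; the 1 ms search range, the cross-correlation delay estimate, and the output filename are all illustrative assumptions.

    import numpy as np
    from scipy.io import wavfile

    def delay_and_sum(left, right, delay_samples):
        # Delay the right channel by delay_samples (may be negative) and average
        # the two channels, so a source arriving with that inter-microphone delay
        # adds coherently while sources from other directions do not.
        delayed = np.zeros_like(right)
        if delay_samples >= 0:
            delayed[delay_samples:] = right[:len(right) - delay_samples]
        else:
            delayed[:delay_samples] = right[-delay_samples:]
        return 0.5 * (left + delayed)

    def estimate_delay(left, right, max_delay):
        # Pick the integer delay (in samples) that maximizes the cross-correlation.
        lags = np.arange(-max_delay, max_delay + 1)
        trimmed = slice(max_delay, -max_delay)
        scores = [np.dot(left[trimmed], np.roll(right, lag)[trimmed]) for lag in lags]
        return lags[int(np.argmax(scores))]

    fs, stereo = wavfile.read("originalbinaural.wav")            # stimulus from the attachments
    left = stereo[:, 0].astype(float)
    right = stereo[:, 1].astype(float)
    lag = estimate_delay(left, right, max_delay=int(0.001 * fs))  # search +/- 1 ms
    enhanced = delay_and_sum(left, right, lag)
    wavfile.write("beamformed_sketch.wav", fs, enhanced.astype(np.int16))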


The demo of the final system has the building blocks shown here:

The picture of the setup is as shown:

It uses the following hardware module:

1. A binaural silicon cochlea, which outputs spikes

and the following software modules:

2. Binaural receiver: This module implements a spike-based algorithm in jAER that extracts interaural time differences (ITDs) in real time from the spikes of the hardware cochlea; a conceptual sketch of the ITD computation appears after this list. The results are shown here. The first panel shows the unnormalized histogram of possible ITD values in the presence of one speaker. The second panel shows the same plot, normalized. The third panel shows the saliency output, as explained in the saliency section.

3. Automatic Speech Recognizer: This module extracts the digits spoken by the two speakers. The recognition is performed by computing the spectrogram of the audio stream, extracting MFCC feature vectors, and comparing them with stored templates of the digits; a sketch of this template matching appears after this list.

4. Saliency Module: This module uses the ITD maps extracted by the binaural receiver to detect the novelty of a sound source based on its spatial location; one possible form of this computation is sketched after this list.

5. Cognition: This module determines which source/speaker to attend to based on the outputs of the ASR and the novelty detector. Attention directs the module to focus on the digits coming from one of the two speakers. It also reports the correct answer (the largest two-digit number); a toy version of this logic is included in the sketches after this list.
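
To make the binaural receiver more concrete: the actual module runs as a spike-based filter inside jAER (Java). The following Python sketch builds an ITD histogram from two lists of spike times; the 800-microsecond maximum ITD and the 50-microsecond bins are our assumptions, not values taken from the jAER module.

    import numpy as np

    def itd_histogram(left_spikes_us, right_spikes_us, max_itd_us=800, bin_us=50):
        # For every left-ear spike, every right-ear spike within +/- max_itd_us
        # contributes one count at the corresponding time-difference bin.
        itds = []
        j_start = 0
        for t_left in left_spikes_us:
            # advance the window of candidate right-ear spikes (both lists sorted)
            while (j_start < len(right_spikes_us)
                   and right_spikes_us[j_start] < t_left - max_itd_us):
                j_start += 1
            j = j_start
            while j < len(right_spikes_us) and right_spikes_us[j] <= t_left + max_itd_us:
                itds.append(t_left - right_spikes_us[j])
                j += 1
        edges = np.arange(-max_itd_us, max_itd_us + bin_us, bin_us)
        hist, _ = np.histogram(itds, bins=edges)
        return edges, hist

    def normalize(hist):
        # Corresponds to the second panel of the figure: counts turned into a
        # probability mass over ITD bins.
        total = hist.sum()
        return hist / total if total > 0 else hist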
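
For the speech recognizer, a minimal template-matching sketch is shown below. It uses librosa for the MFCC extraction and dynamic time warping for the comparison; both of those choices, and the digit_0.wav ... digit_9.wav template filenames, are our assumptions rather than a description of the code used in the demo.

    import numpy as np
    import librosa

    def mfcc_features(path, sr=16000, n_mfcc=13):
        # Load a clip and return its MFCC matrix (frames x coefficients).
        audio, _ = librosa.load(path, sr=sr)
        return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).T

    def dtw_distance(a, b):
        # Dynamic-time-warping distance between two MFCC sequences, so templates
        # and queries spoken at slightly different speeds still match.
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[n, m]

    # Hypothetical pre-recorded templates, one clip per digit.
    templates = {d: mfcc_features("digit_%d.wav" % d) for d in range(10)}

    def recognize(path):
        # Return the digit whose stored template is closest to the input clip.
        query = mfcc_features(path)
        return min(templates, key=lambda d: dtw_distance(query, templates[d]))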
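
For the saliency module, one plausible reading of "novelty based on spatial information" is to compare the current normalized ITD histogram against a slowly adapting background model; the learning rate and threshold below are illustrative, not the values used in the demo.

    import numpy as np

    class ItdNoveltyDetector:
        # Flags a new sound direction when the current ITD histogram departs
        # from a slowly updated running average.
        def __init__(self, n_bins, learning_rate=0.05, threshold=0.3):
            self.background = np.full(n_bins, 1.0 / n_bins)  # start from a flat map
            self.learning_rate = learning_rate
            self.threshold = threshold

        def update(self, hist):
            total = hist.sum()
            current = hist / total if total > 0 else self.background
            novelty = 0.5 * np.abs(current - self.background).sum()  # total-variation distance
            # slowly absorb the current frame into the background model
            self.background = ((1 - self.learning_rate) * self.background
                               + self.learning_rate * current)
            return novelty, novelty > self.threshold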
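
Finally, a toy version of the cognition logic: bottom-up novelty can capture attention, and only the attended stream's digits are combined into a number, the largest of which is reported. The data structures, field names, and threshold are illustrative.

    def cognition_step(state, digits_by_speaker, novelty_by_speaker, novelty_threshold=0.3):
        # state: {"attended": speaker id, "numbers": numbers heard so far, "answer": ...}
        # digits_by_speaker: e.g. {0: [9, 1], 1: [4, 2]} from the ASR for this step
        # novelty_by_speaker: bottom-up saliency from the ITD novelty detector

        # Bottom-up: a sufficiently novel source captures attention.
        most_novel = max(novelty_by_speaker, key=novelty_by_speaker.get)
        if novelty_by_speaker[most_novel] > novelty_threshold:
            state["attended"] = most_novel

        # Top-down: only the attended stream's digits are read as a number.
        digits = digits_by_speaker.get(state["attended"], [])
        if len(digits) == 2:
            state["numbers"].append(10 * digits[0] + digits[1])  # "nine one" -> 91

        state["answer"] = max(state["numbers"]) if state["numbers"] else None
        return state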

The different blocks run on different laptops, and each module communicates with the cognition module through UDP packets.
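
The inter-laptop communication follows the usual UDP send/receive pattern; the address, port, and JSON payload format in the sketch below are placeholders, not the actual packet format used in the demo.

    import json
    import socket

    COGNITION_ADDR = ("192.168.1.10", 9999)   # placeholder address of the cognition laptop

    def send_to_cognition(payload):
        # Any module (ASR, saliency, ...) pushes its latest output as one UDP packet.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(json.dumps(payload).encode("utf-8"), COGNITION_ADDR)
        sock.close()

    def cognition_receive_loop(port=9999):
        # Cognition side: block on incoming packets from the other laptops.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", port))
        while True:
            data, addr = sock.recvfrom(4096)
            print("from %s: %s" % (addr, json.loads(data.decode("utf-8"))))

    # Example payload from the ASR module:
    # send_to_cognition({"module": "asr", "speaker": 0, "digits": [9, 1]})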