Real-time multi-talker speech recognition using automated attention from the ITD information of a binaural silicon cochlea


  1. Shih-Chii Liu
  2. 'Malcolm'
  3. Manu Rastogi
  4. Vikram Ramanarayanan
  5. Guillaume Garreau
  6. Victor Benichoux
  7. Tobi Delbruck

Project idea

This project is an extension of a multi-talker speech recognition system project carried out at the 2011 Telluride workshop. In the 2011 workshop, we aimed to implement human-like attention in computers by focusing on modelling attention in a relatively simple task, knowing that the same principles could be applied to more complex tasks.

Experimental setup: The listeners will be presented with two overlapping live (non-recorded) human talkers, who are spatially separated. Each talker will say several pairs of digits (e.g., “one three, five four, nine nine”). The listeners’ task is to report the largest two-digit number (i.e., “nine nine” = ninety-nine, in this example). For the project, we also built a model system to perform the same task as the human listeners.
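As a concrete illustration, the final comparison step of the task, picking the largest spoken two-digit number from the recognized digit pairs, can be sketched as follows (a minimal sketch; the function name and word list are ours and not part of the actual system):

```python
# Map spoken digit words to values (illustrative helper, not the system's code).
DIGIT_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}

def largest_two_digit(digit_pairs):
    """Given recognized pairs like [("one", "three"), ...], return the
    largest two-digit number formed by reading each pair left to right."""
    return max(10 * DIGIT_WORDS[a] + DIGIT_WORDS[b] for a, b in digit_pairs)
```

For the example above, `largest_two_digit([("one", "three"), ("five", "four"), ("nine", "nine")])` returns 99.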

System description

Our model system has the building blocks shown here:

It uses the following hardware module:

1. A binaural silicon cochlea, which outputs spikes and sampled audio data

and the following software modules:

2. Binaural receiver: This module implements a spike-based algorithm in jAER that extracts interaural time differences (ITDs) in real time based on the spikes from the hardware cochlea. The results are shown here. The first panel shows the unnormalized histogram of possible ITD values in the presence of one speaker. The second panel shows the same plot but normalized. The third panel shows the saliency output as explained in the saliency section.

3. Automatic Speech Recognizer: This module extracts the digits spoken by the two speakers. In this case, we explored the use of several real-time speech recognition toolkits, including the Microsoft Recognition System, the Google App, and the Sphinx toolbox.

4. Saliency Module: This module uses the ITD maps extracted by the binaural receiver to detect the novelty of a sound source based on its spatial information.

5. Cognition: This module determines which source/speaker to attend to based on the outputs of the ASR and the novelty detector. Attention directs the module to focus on the digits coming from one of the two speakers. It also reports the correct answer (the largest pair of digits).
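A minimal sketch of the ITD-histogram and novelty computations performed by the binaural receiver and saliency modules is shown below. The bin count, coincidence window, and running-average update rule are illustrative assumptions; the actual jAER implementation is spike-based and differs in detail:

```python
import numpy as np

def itd_histogram(left_spikes, right_spikes, max_itd_us=800.0, n_bins=16):
    """Histogram of left-right spike-time differences (ITDs).

    left_spikes / right_spikes: spike timestamps in microseconds from the
    two cochlea channels; only differences within +-max_itd_us are counted.
    """
    diffs = [tl - tr
             for tl in left_spikes
             for tr in right_spikes
             if abs(tl - tr) <= max_itd_us]
    bins = np.linspace(-max_itd_us, max_itd_us, n_bins + 1)
    hist, _ = np.histogram(diffs, bins=bins)
    return hist.astype(float)

def normalize(hist):
    """Normalized ITD histogram (cf. the second panel of the figure)."""
    total = hist.sum()
    return hist / total if total > 0 else hist

def novelty(current, running_avg, alpha=0.1):
    """Saliency score: deviation of the current normalized histogram from a
    slowly updated running average; returns (score, new running average)."""
    score = float(np.abs(current - running_avg).sum())
    new_avg = (1.0 - alpha) * running_avg + alpha * current
    return score, new_avg
```

A speaker at a fixed azimuth produces a stable peak in the normalized histogram, so the novelty score rises when a new source appears at an unexpected ITD.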

The different blocks are executed on different laptops, and each module communicates with the cognition module through UDP packets containing ITD information and ADC samples.
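The inter-module messaging can be sketched as follows. The packet layout used here (a frame counter, a best-ITD estimate, and a run of 16-bit ADC samples) is a hypothetical example for illustration, not the actual jAER packet format:

```python
import struct

# Hypothetical layout: uint32 frame counter, float32 ITD estimate (us),
# followed by int16 ADC samples (little-endian throughout).
HEADER = struct.Struct("<If")

def pack_frame(counter, itd_us, samples):
    """Serialize one frame of ITD info plus ADC samples into a UDP payload."""
    body = struct.pack("<%dh" % len(samples), *samples)
    return HEADER.pack(counter, itd_us) + body

def unpack_frame(packet):
    """Inverse of pack_frame: recover counter, ITD estimate, and samples."""
    counter, itd_us = HEADER.unpack_from(packet)
    n = (len(packet) - HEADER.size) // 2
    samples = struct.unpack_from("<%dh" % n, packet, HEADER.size)
    return counter, itd_us, list(samples)
```

The payload would then be sent with a standard datagram socket, e.g. `socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(pack_frame(...), addr)`.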

Real-time implementation

Several components were adapted to work in real time: specifically, the audio input, the ITD calculations, beamforming, and the speech recognition.

The key difference between a real-time implementation and its original offline implementation is that the real-time version operates on small sections of audio ("frames") at a time, whereas the offline version can wait until the whole sound has been presented and then process it in one go. Here, each frame was just 15 ms long (because of the settings for sending the UDP packets in jAER), meaning that we were generally working with very short latencies despite the complexity of the processing.
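The contrast between the two modes can be illustrated as follows. The frame length follows from the 15 ms frames and the 9 kHz sampling rate mentioned in this writeup; the function names and the per-frame callback are ours:

```python
FRAME_MS = 15
SAMPLE_RATE = 9000
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # 135 samples per 15-ms frame

def frames(stream, frame_len=FRAME_LEN):
    """Yield fixed-length frames from a (possibly unbounded) sample stream."""
    buf = []
    for s in stream:
        buf.append(s)
        if len(buf) == frame_len:
            yield buf
            buf = []

def process_online(stream, per_frame):
    """Real-time style: handle each 15-ms frame as soon as it completes."""
    return [per_frame(f) for f in frames(stream)]

def process_offline(samples, per_frame):
    """Offline style: wait for the whole recording, then process in one pass."""
    return [per_frame(samples[i:i + FRAME_LEN])
            for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN)]
```

Both produce the same per-frame results on a complete recording; the online version simply does not wait for the recording to end, which is what keeps the latency near one frame (15 ms) plus processing time.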

Because we need real-time stereo audio samples, we had two choices. 1) Sampling the sound from two mono microphones using the laptop. This turned out to be quite difficult: we could only obtain a mono signal even with two microphones and different software packages such as Audacity, and the lack of audio input connectors on many laptops also makes stereo sampling difficult. 2) Sending ADC samples from the AER-EAR2 board. These ADC samples were not as clean as speech recorded on the laptop, and the initially low sampling rate made the recorded speech unclear. We finally got the sampling rate up to 9 kHz, which yielded intelligible speech.


Setup: ITD and binaural ADC samples are sent using UDP packets from the AER-EAR2 board. The ITD information is then used for demixing the audio mixture from each of the two microphones.