Automated multi-talker speech recognition using automated attention

Description: In our group project, we aim to implement human-like attention in computers. Attention could be useful primarily when a system is nearing its limits on available memory, computation time, or (in hardware) power consumption. We will focus on modelling attention in a relatively simple task, knowing that the same principles could be applied to more complex tasks.

The project will generate multi-talker audiovisual output (or a simplified version of it), and then combine auditory and visual features in hardware and software to allow speech recognition for one of the talkers.

Experimental setup: The listeners will be presented with 2 overlapping talkers who are spatially separated (either by different interaural time differences, ITDs, over headphones or by different loudspeakers in free field). Each talker will say several pairs of digits (e.g., “one three, five four, nine nine”). The listeners’ task is to report the largest two-digit number spoken (“nine nine” = ninety-nine in this example). For the project, we also built a model system to perform the same task as the human listeners.
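As an illustration of the headphone condition, the sketch below places two talkers at different lateral positions by delaying one ear relative to the other. It is a minimal sketch, not the stimulus code used in the project: the function names, the sample rate, and the ±500 µs ITD values are assumptions, and the delay is rounded to a whole number of samples.

```python
import numpy as np

def apply_itd(mono, itd_seconds, sample_rate=44100):
    """Return an (n, 2) stereo array; a positive ITD delays the right ear."""
    delay = int(round(abs(itd_seconds) * sample_rate))  # whole-sample delay
    lagging = np.concatenate([np.zeros(delay), mono])   # delayed copy
    leading = np.concatenate([mono, np.zeros(delay)])   # undelayed copy, same length
    if itd_seconds >= 0:
        left, right = leading, lagging
    else:
        left, right = lagging, leading
    return np.stack([left, right], axis=1)

def mix_talkers(talker_a, talker_b, itd_a=-500e-6, itd_b=500e-6, sample_rate=44100):
    """Overlap two mono talkers, each lateralised with its own (assumed) ITD."""
    a = apply_itd(np.asarray(talker_a, dtype=float), itd_a, sample_rate)
    b = apply_itd(np.asarray(talker_b, dtype=float), itd_b, sample_rate)
    out = np.zeros((max(len(a), len(b)), 2))
    out[:len(a)] += a
    out[:len(b)] += b
    return out
```

Sub-sample ITDs would need fractional-delay interpolation, which this sketch leaves out.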

System description

Our model system has the building blocks shown here:

It uses the following hardware module:

1. A binaural silicon cochlea, which outputs spikes

and the following software modules:

2. Binaural receiver: This module implements a spike-based algorithm in jAER that extracts interaural time differences (ITDs) in real time from the spikes of the hardware cochlea. The results are shown here. The first panel shows the unnormalized histogram of possible ITD values in the presence of one speaker. The second panel shows the same histogram after normalization. The third panel shows the saliency output, as explained in the saliency section.

3. Automatic Speech Recognizer: This module extracts the digits spoken by the two speakers. The recognition is performed by extracting the spectrogram from the audio stream, finding the vector of MFCC features, and comparing it with the stored templates of the digits.

4. Saliency Module: This module uses the ITD maps extracted by the binaural receiver to extract novelty of a sound source based on spatial information.

5. Cognition: This module determines which source/speaker to attend to, based on the outputs of the ASR and the novelty detector. Attention directs the module to focus on the digits coming from one of the 2 speakers. It also reports the correct answer (the largest pair of digits).
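The binaural receiver and saliency modules described above can be sketched roughly as follows. This is an illustrative Python sketch, not the jAER implementation: it assumes spike times are available as plain arrays in seconds, pairs every left spike with every right spike (the real algorithm may pair spikes per cochlear channel), and defines novelty as the positive deviation of the current normalized ITD histogram from a slowly adapting background. The names `itd_histogram`, `normalize`, and `NoveltyDetector`, and the values of `max_itd`, `n_bins`, and `rate`, are all assumptions.

```python
import numpy as np

def itd_histogram(left_spikes, right_spikes, max_itd=800e-6, n_bins=16):
    """Histogram of left-minus-right spike time differences within ±max_itd."""
    left = np.asarray(left_spikes, dtype=float)
    right = np.asarray(right_spikes, dtype=float)
    diffs = (left[:, None] - right[None, :]).ravel()   # all pairwise differences
    diffs = diffs[np.abs(diffs) <= max_itd]            # keep plausible ITDs only
    counts, edges = np.histogram(diffs, bins=n_bins, range=(-max_itd, max_itd))
    return counts.astype(float), edges

def normalize(counts):
    """Scale the histogram to sum to 1 (left unchanged if it is empty)."""
    total = counts.sum()
    return counts / total if total > 0 else counts

class NoveltyDetector:
    """Flags ITD bins that are more active than a slowly adapting background."""

    def __init__(self, n_bins=16, rate=0.1):
        self.background = np.full(n_bins, 1.0 / n_bins)  # start from a flat map
        self.rate = rate                                  # adaptation speed (assumed)

    def update(self, hist):
        novelty = np.maximum(hist - self.background, 0.0)
        # Exponential moving average: a persistent source gradually stops being novel.
        self.background = (1 - self.rate) * self.background + self.rate * hist
        return novelty
```

With this definition, a source that appears at a new ITD produces a large novelty value at first, and the value decays as the source persists.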

The different blocks are executed on different laptops, and each module communicates with the cognition module through UDP packets.
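A rough sketch of how a module might report to the cognition module over UDP, and how the cognition module might pick the answer, is given below. The message layout (JSON with `module` and `payload` fields), the port number, and all function names are assumptions made for illustration; the source does not describe the actual packet format.

```python
import json
import socket

def encode_message(module, payload):
    """Pack a module name and its payload into one UDP datagram (assumed JSON format)."""
    return json.dumps({"module": module, "payload": payload}).encode("utf-8")

def decode_message(data):
    """Inverse of encode_message."""
    msg = json.loads(data.decode("utf-8"))
    return msg["module"], msg["payload"]

def largest_pair(pairs):
    """Given (tens, units) digit pairs, return the largest two-digit number."""
    return max(10 * tens + units for tens, units in pairs)

def send_to_cognition(module, payload, host="127.0.0.1", port=9000):
    """Send one datagram to the cognition module (host and port are placeholders)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_message(module, payload), (host, port))

def receive_one(port=9000, timeout=1.0):
    """Block until one datagram arrives (or the timeout expires) and decode it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", port))
        sock.settimeout(timeout)
        data, _addr = sock.recvfrom(4096)
        return decode_message(data)
```

For the example utterance "one three, five four, nine nine", the ASR side could send `{"speaker": 0, "pairs": [[1, 3], [5, 4], [9, 9]]}`, and `largest_pair` on the received pairs gives 99. UDP is connectionless, so a lost packet is simply never seen by the receiver; that is usually acceptable when each frame's message supersedes the previous one.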

Realtime implementation

Several components were adapted to work in realtime, specifically, the audio input, the calculation of the MFCCs, and the speech recognition.

The key difference between a realtime implementation and its original offline implementation is that the realtime version operates on small sections of audio ("frames") at a time, whereas the offline version can wait until the whole sound is presented, then calculate it in one go. Here, each of the frames was just 10 ms long, meaning that we were generally working with very short latencies despite the complexity of the processing.
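The difference between the two styles can be illustrated with a deliberately trivial computation, the peak amplitude of a signal. This is only a sketch: the 16 kHz sample rate and the function names are assumptions, and the project's own code was written for Matlab rather than Python.

```python
import numpy as np

def frames(signal, sample_rate=16000, frame_ms=10):
    """Yield consecutive non-overlapping frames (160 samples at the assumed 16 kHz)."""
    frame_len = sample_rate * frame_ms // 1000
    # A trailing partial frame is dropped in this sketch.
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        yield signal[start:start + frame_len]

def offline_peak(signal):
    """Offline style: wait for the whole sound, then compute in one go."""
    return float(np.max(np.abs(signal)))

def realtime_peak(signal, sample_rate=16000, frame_ms=10):
    """Realtime style: update the answer as each 10 ms frame arrives."""
    peak = 0.0
    for frame in frames(signal, sample_rate, frame_ms):
        peak = max(peak, float(np.max(np.abs(frame))))
    return peak
```

Both functions return the same value when the signal length is a whole number of frames, but the realtime version has a usable estimate after every frame.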

Often, a program needs information from preceding frames. For example, speech identification takes place at the scale of whole words (here a maximum of approximately half a second), which is considerably longer than the single frame (10 ms) on which each calculation operates. Here, the solution was for each program to keep its own "persistent" memory covering as many frames as it required. Thus each component independently takes a single frame as input and gives a single frame of output, making the components easy to string together in realtime programs.
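The idea of per-component persistent memory can be sketched as a small stateful object that accepts one feature frame at a time but compares the most recent frames against stored word templates. This is a minimal sketch under stated assumptions: the class name `TemplateMatcher`, the mean squared distance, and the fixed-length templates are illustrative choices, not the project's recognizer. The default history of 50 frames corresponds to half a second at 10 ms per frame.

```python
from collections import deque
import numpy as np

class TemplateMatcher:
    """Takes one feature frame per call; remembers recent frames between calls."""

    def __init__(self, templates, history_frames=50):
        # templates: dict mapping a label to an array of shape (n_frames, n_features),
        # each assumed to be no longer than history_frames.
        self.templates = {k: np.asarray(v, dtype=float) for k, v in templates.items()}
        self.history = deque(maxlen=history_frames)  # the "persistent" memory

    def step(self, feature_frame):
        """Store the new frame, then return (best_label, distance) over recent frames."""
        self.history.append(np.asarray(feature_frame, dtype=float))
        best_label, best_dist = None, np.inf
        for label, tmpl in self.templates.items():
            n = len(tmpl)
            if len(self.history) < n:
                continue  # not enough context yet for this template
            recent = np.array(list(self.history)[-n:])
            dist = float(np.mean((recent - tmpl) ** 2))
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label, best_dist
```

Because the history lives inside the object, the caller only ever passes a single frame, which is what makes such components easy to chain.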

Another important factor for realtime computation is speed. Here, each program needed to be initialised in advance so that (a) sufficient memory would be allocated and (b) any preliminary calculations could be performed just once and their results stored in persistent memory.
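The sketch below shows this initialise-once pattern for a simplified cepstral feature extractor. It is not the project's MFCC code: the mel filterbank is omitted for brevity (so the output is a plain cepstrum-like vector rather than true MFCCs), and the class name, frame length, and number of coefficients are assumptions.

```python
import numpy as np

class CepstrumExtractor:
    """Precomputes everything it can at construction; step() handles one frame."""

    def __init__(self, frame_len=160, n_coeffs=13):
        # (b) Preliminary calculations, done once and kept for every later frame.
        self.window = np.hanning(frame_len)
        n_bins = frame_len // 2 + 1                      # size of the rfft output
        k = np.arange(n_coeffs)[:, None]
        n = np.arange(n_bins)[None, :]
        self.dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_bins))  # DCT-II basis
        # (a) Output buffer allocated in advance and reused on every call.
        self.out = np.zeros(n_coeffs)

    def step(self, frame):
        """Return cepstral coefficients for one frame (buffer is reused; copy to keep)."""
        spectrum = np.abs(np.fft.rfft(frame * self.window))
        log_spec = np.log(spectrum + 1e-10)              # small offset avoids log(0)
        np.dot(self.dct, log_spec, out=self.out)
        return self.out
```

Because `step` writes into the same buffer each time, a caller that needs to keep a result across frames should take a copy of it.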

Realtime auditory input to Matlab is not trivial. Several pre-existing solutions were written for earlier versions of Matlab and Windows and no longer work on recent versions. As such, anyone who tackles this problem is likely to waste many hours. To save others this effort, we have made the results of our search and our eventual solution available at Matlab File Central (no link yet, as it is currently under review, but searching "tgrabaudio" should locate it if this hasn't been updated).

Planning Notes


  1. Francisco Barranco - Visual lead
  2. 'trevor' - Auditory lead
  3. Kailash Patil - Auditory co-lead
  4. 'merve' - Auditory co-lead
  5. Shih-Chii Liu - AV Mentor
  6. Barbara Shinn-Cunningham (Boston) - Auditory Mentor
  7. 'Vasconcelos' (UCSD) - Visual Mentor
  8. 'Fred' () - Visual Mentor
  9. 'Jude Mitchel' (Salk) - Visual Mentor