Sensory Fusion for Lip Reading

Team: Soumyajit Mandal, Shih-Chii Liu, Arindam Basu, Tobi Delbruck, Bruno Umbria Pedroni

The overall goal of this project is to improve speech recognition performance from silicon cochlea spike data by adding lip movement information captured simultaneously by a silicon retina (DVS or DAVIS sensor). It consists of the following sub-tasks:

1. Acquire a reference dataset of speech and video samples from various speakers. We will run three sensors simultaneously on the same computer for this purpose: a 128-channel binaural silicon cochlea with spiking outputs, a conventional stereo microphone, and a 128 x 128 DVS silicon retina with spiking outputs.

2. Generate lip dynamics (smoothed motion profiles) from the DVS outputs.

3. First use Arindam's ELM chip (or a comparable ELM implementation) to classify digits, single words, and short sentences using only cochlear outputs. Then add lip dynamics information and quantify the resultant improvement in classification performance.

Matlab GUI for acquiring simultaneous audio and video data from a subject

Matlab GUI for plotting simultaneously acquired audio and video data

Simple lip-tracking algorithm

The following image shows the lip-tracking algorithm. It splits the mouth region (a user-defined rectangle around the lips) into two horizontal segments (upper and lower lip), each divided into multiple vertical boxes. The algorithm bins the DVS events into overlapping time bins and, for each frame (time bin), computes the mean (x, y) event location inside each box.
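A minimal NumPy sketch of this binning scheme follows; the function name `lip_profiles`, the parameter defaults, and the (t, x, y) event layout are illustrative assumptions, not the project's actual code:

```python
import numpy as np

def lip_profiles(events, roi, n_cols=5, bin_width=20e-3, bin_step=10e-3):
    """Track lip motion from DVS events by box-wise mean event location.

    events: array with rows (t, x, y); roi: (x0, y0, x1, y1) rectangle
    around the lips. The ROI is split into 2 rows (upper/lower lip) of
    n_cols boxes each. Events are binned into overlapping time bins
    (width bin_width, stride bin_step); each bin yields the mean (x, y)
    per box. Returns shape (n_bins, 2, n_cols, 2), NaN for empty boxes.
    """
    t, x, y = events[:, 0], events[:, 1], events[:, 2]
    x0, y0, x1, y1 = roi
    in_roi = (x >= x0) & (x < x1) & (y >= y0) & (y < y1)
    t, x, y = t[in_roi], x[in_roi], y[in_roi]
    starts = np.arange(t.min(), t.max() - bin_width, bin_step)
    col = ((x - x0) * n_cols / (x1 - x0)).astype(int)   # 0 .. n_cols-1
    row = ((y - y0) * 2 / (y1 - y0)).astype(int)        # 0 upper, 1 lower
    out = np.full((len(starts), 2, n_cols, 2), np.nan)
    for i, s in enumerate(starts):
        sel = (t >= s) & (t < s + bin_width)
        for r in range(2):
            for c in range(n_cols):
                box = sel & (row == r) & (col == c)
                if box.any():
                    out[i, r, c] = x[box].mean(), y[box].mean()
    return out
```

The overlap between consecutive bins (`bin_step < bin_width`) is what smooths the resulting motion profiles.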

Simple lip-tracking algorithm

Results for 2 different subjects speaking the 10 digits

1. The five traces on the vertical axis of each image are the distances measured from the right corner of the mouth (trace 1) to the left corner (trace 5); the middle trace (trace 3) corresponds to the center of the mouth.

2. The horizontal axis is time.

3. The distances are calculated by splitting the image into 10 sections (2 rows × 5 columns). The representative point for each section is the mean location of the spiking activity inside that section.
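Assuming the per-box mean points produced by the tracking step above (2 rows × 5 columns per time bin), the five distance traces could be computed with a small helper like the one below; the name `opening_profiles` and the exact array layout are hypothetical:

```python
import numpy as np

def opening_profiles(points):
    """Vertical lip opening per column from box-mean lip-tracker output.

    points: array of shape (n_bins, 2, 5, 2) holding the mean (x, y)
    position per box, with row 0 = upper lip and row 1 = lower lip.
    Returns shape (n_bins, 5): the vertical distance between the lower-
    and upper-lip points in each column, from the right corner of the
    mouth (column 0) to the left corner (column 4).
    """
    return points[:, 1, :, 1] - points[:, 0, :, 1]
```

Plotting these five traces against time reproduces the kind of per-digit distance profiles shown in the images.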

Heat map of the 10 digits spoken by 2 different subjects