Speech Recognition on ELM IC.

Team: Arindam Basu, David Anderson, Brandon Carroll, Yi Chen, Emina Alickovic

System Description

The overall goal of this project is to build a speech recognition system combining a spike-based silicon cochlea and the ELM (Extreme Learning Machine) chip. As illustrated by the system diagram below, the silicon cochlea listens to speech and converts it into multi-channel spike trains that represent the energy in different frequency bands. The ELM chip receives the spike trains from the silicon cochlea and performs classification based on features embedded in them.

[System diagram: system_diagram.png]

The main project goals at the Telluride workshop are:

1. Adapting the ELM algorithm and training it on pre-recorded spike data from the cochlea listening to the TI Digits database.

2. Feeding pre-recorded spike data into the ELM chip to realize real-time spoken-digit recognition.

3. Connecting the cochlea to the ELM chip so that spoken digits can be classified in real time when someone speaks a digit to the cochlea on site.

Silicon Cochlea

The spiking audio data we used was generated by playing files from the TI Digits dataset to a silicon cochlea supplied by Shih-Chii Liu. The outputs for each file were recorded and saved so that they could later be played back to the ELM chip. The data consists of 80 training files and 42 testing files for each digit, all from adult male speakers.

The silicon cochlea operates by first passing the microphone signal through a bank of cascaded second-order filter sections to separate the signal into different frequency bands. The output of each filter section then goes through a half-wave rectifier and into an integrate-and-fire neuron that generates the spike output. The cochlea is described in detail in this paper.
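This signal chain can be sketched in software roughly as follows; it is a simplified model for illustration only. The filter design, channel count, sample rate, and firing threshold are assumptions, and the chip uses a cascade of filter sections rather than the parallel band-pass bank used here.

    import numpy as np
    from scipy.signal import butter, lfilter

    FS = 16000                              # sample rate in Hz (assumed)
    CENTERS = np.geomspace(100, 6000, 16)   # channel centre frequencies (assumed)

    def channel_spikes(audio, fc, threshold=0.5):
        # Second-order band-pass section around fc (bandwidth is an assumption).
        b, a = butter(1, [0.8 * fc, 1.25 * fc], btype="band", fs=FS)
        band = lfilter(b, a, audio)
        rectified = np.maximum(band, 0.0)    # half-wave rectifier
        # Integrate-and-fire neuron: accumulate the rectified signal,
        # emit a spike and reset whenever the threshold is crossed.
        spikes, acc = [], 0.0
        for n, x in enumerate(rectified):
            acc += x / FS
            if acc >= threshold:
                spikes.append(n / FS)
                acc = 0.0
        return np.array(spikes)

    def cochlea(audio):
        # One array of spike times per frequency channel.
        return [channel_spikes(audio, fc) for fc in CENTERS]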

Extreme Learning Machine (ELM)

The extreme learning machine is a two-layer feed-forward neural network with fixed random input weights; only the second-stage weights are trained, using a linear-regression-style step, which gives good generalization performance and fast training. As shown in the photo below, an ELM chip was designed and fabricated in 0.35-μm CMOS to process spike inputs and realize the random projection of the processed input vectors. An FPGA connects the ELM chip to a PC, on which a GUI gives the user control and displays results. The photo below shows the ELM algorithm and the spike-based ELM system consisting of the ELM chip and the FPGA. See this paper for more information about the ELM chip.
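In software, the ELM training and prediction steps amount to something like the following numpy sketch. The hidden-layer size and sigmoid nonlinearity are assumptions for illustration; on the actual system the random projection is computed in analog hardware on the chip.

    import numpy as np

    rng = np.random.default_rng(0)

    def elm_train(X, T, n_hidden=256):
        # X: (samples, features) input vectors, T: (samples, classes) one-hot targets.
        W_in = rng.standard_normal((X.shape[1], n_hidden))   # fixed random input weights
        b = rng.standard_normal(n_hidden)                    # random biases
        H = 1.0 / (1.0 + np.exp(-(X @ W_in + b)))            # hidden-layer activations
        W_out = np.linalg.pinv(H) @ T                        # second-stage weights via pseudoinverse
        return W_in, b, W_out

    def elm_predict(X, W_in, b, W_out):
        H = 1.0 / (1.0 + np.exp(-(X @ W_in + b)))
        return np.argmax(H @ W_out, axis=1)                  # predicted class per sample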

Algorithm

The random projections coming out of the hidden layer of the ELM must be interpreted to determine which digit is being spoken. A 50-ms window counter gives a moving average of the input spikes on each input channel. A 50-ms delayed copy of each spike input channel is added to the input feature vector, creating an effective window of 100 ms, so the outputs of the ELM chip carry information about a 100-ms window of the input sound. To give the learning algorithm more context, we stack the two previous and two subsequent hidden-layer output vectors together with the current vector. The figure below depicts this (each colored square represents a hidden-layer output vector, and five of these are stacked into a single vector used for classification). The algorithm therefore makes classifications based on information from a 500-ms window centered on the moment under consideration.
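A rough software equivalent of this feature construction is sketched below; the frame step, boundary padding, and function names are assumptions rather than the actual hardware/firmware implementation.

    import numpy as np

    def frame_counts(spike_times, n_channels, duration, win=0.05):
        # Count spikes per channel in consecutive 50-ms frames.
        n_frames = int(np.ceil(duration / win))
        counts = np.zeros((n_frames, n_channels))
        for ch in range(n_channels):
            idx = np.minimum((np.asarray(spike_times[ch]) / win).astype(int), n_frames - 1)
            np.add.at(counts[:, ch], idx, 1)
        return counts

    def add_delay_channels(counts):
        # Append each channel delayed by one frame (50 ms), giving a 100-ms effective window.
        delayed = np.vstack([np.zeros((1, counts.shape[1])), counts[:-1]])
        return np.hstack([counts, delayed])

    def stack_context(hidden, k=2):
        # Stack the k previous and k following hidden-layer vectors with the current one
        # (k = 2 gives the five-frame, roughly 500-ms context described above).
        padded = np.vstack([np.repeat(hidden[:1], k, axis=0),
                            hidden,
                            np.repeat(hidden[-1:], k, axis=0)])
        return np.hstack([padded[i:i + len(hidden)] for i in range(2 * k + 1)])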

Two classifiers were trained (using the ELM pseudoinverse method) to label the data. The first was trained as a binary speech detector to distinguish between speech and silence. The second was trained to distinguish between the 11 digits, using only the parts of each recording that contained speech. The final classification is done by using the binary speech detector to mask the output of the digit classifier, and then taking a majority vote.
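The combination step can be sketched as follows, assuming per-frame label outputs from both classifiers; the function and variable names are illustrative and this is not the project's actual code.

    import numpy as np

    def classify_utterance(speech_flags, digit_frames):
        # speech_flags: (frames,) booleans from the speech detector;
        # digit_frames: (frames,) digit labels from the digit classifier.
        voiced = np.asarray(digit_frames)[np.asarray(speech_flags, dtype=bool)]
        if voiced.size == 0:
            return None                              # no speech detected
        labels, counts = np.unique(voiced, return_counts=True)
        return labels[np.argmax(counts)]             # majority vote over voiced frames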

Results

The table below lists the test accuracy and word error rate for the speech detector, the digit classifier, and the combination of the two. The test accuracy represents the percent of feature vectors that were classified correctly according to the target values. The word error rate is the percent of digits that were classified incorrectly after taking a majority vote across all the feature vectors in an utterance. The word error rate for the digit classifier was determined using the target speech detection values.

                  Test Accuracy    Word Error Rate
Speech Detector   96.6%            n/a
Digit Classifier  81%              13.4%
Combined          80%              14.7%
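For reference, the two metrics defined above can be computed from per-frame predictions roughly as follows; the data layout (one label array per utterance, with digit labels as small non-negative integers) is an assumption.

    import numpy as np

    def frame_accuracy(pred_frames, target_frames):
        # Percent of feature vectors classified correctly (one label array per utterance).
        pred = np.concatenate(pred_frames)
        target = np.concatenate(target_frames)
        return 100.0 * np.mean(pred == target)

    def word_error_rate(pred_frames, word_targets):
        # Percent of digits misclassified after a per-utterance majority vote.
        votes = [np.bincount(np.asarray(p, dtype=int)).argmax() for p in pred_frames]
        return 100.0 * np.mean(np.asarray(votes) != np.asarray(word_targets))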

The confusion matrix for the combined classifier is shown below.

GUI

A screenshot of the MATLAB GUI for interacting with the ELM chip is shown below. It allows the user to load recorded cochlea spike outputs for the different digits and send them to the ELM chip. The hidden-layer outputs return from the ELM chip to the computer and are classified by our algorithm. The GUI displays the majority-vote classification as well as the predicted digit at each time step. The cochlea spikes are also plotted below the digit classification.
