Results from the High Level Saliency Subgroup

Subproject 1: Top-down attention

Aim: To automatically learn task-specific attention fields for complex tasks like phoneme recognition.

Method: For this project we chose the task of broad phoneme classification. The features we used are the basic auditory spectrogram representation described in (1), which models the processing of speech all the way from the ear/periphery to the midbrain. This model transforms a one-dimensional time-varying signal into a two-dimensional frequency (128 channels) vs. time representation. To model the cognitive processing of sound and the top-down attentional processes involved, we chose a 3-layer multilayer perceptron. The hidden layer has 1500 nodes with a sigmoid nonlinearity, and the output layer has a softmax nonlinearity.
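A minimal sketch of this network in PyTorch is given below; the assumption that a single 128-channel spectrogram frame is fed to the network (rather than a stack of context frames) is ours, as the text does not specify the input context:

    import torch
    import torch.nn as nn

    N_CHANNELS = 128   # auditory spectrogram channels
    N_HIDDEN = 1500    # hidden nodes, as in the text
    N_CLASSES = 5      # broad phone classes

    # 3-layer MLP: input -> sigmoid hidden layer -> softmax output.
    model = nn.Sequential(
        nn.Linear(N_CHANNELS, N_HIDDEN),
        nn.Sigmoid(),
        nn.Linear(N_HIDDEN, N_CLASSES),
        nn.Softmax(dim=-1),
    )

    frame = torch.randn(1, N_CHANNELS)   # one spectrogram frame
    class_posteriors = model(frame)      # shape (1, 5)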

The audio database used is the TIMIT database, and the phone classes were mapped to 5 broad classes for this task, namely vowels, nasals, fricatives, stops, and silence. We use 3400 utterances for training, 296 for cross-validation, and 1344 for testing. For noise we use recordings of real noise from the Noisex database: factory floor, fighter jet (F-16), and Leopard tank noise, added at 20, 10, and 0 dB SNR.
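The exact mixing procedure is not stated in the text; a standard way to add a noise recording to an utterance at a target SNR is sketched below (the function name is illustrative):

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        """Add noise to speech at the target SNR (dB)."""
        noise = noise[:len(speech)]          # trim noise to utterance length
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2)
        # Scale noise so that 10*log10(p_speech / p_scaled_noise) == snr_db.
        scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + scale * noise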

Once the multilayer perceptron is trained for broad phoneme classification, we learn the attention fields for the different phoneme classes. To do this, the output-layer activation is fixed at 1 for the particular phoneme class and at 0 for the rest, as shown in Fig 1. This activation is then backpropagated to the input layer. An example of the attention field for vowels is shown in Fig 2.
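The text does not spell out the exact back-propagation rule; one plausible implementation, reusing the PyTorch model sketched above, takes the gradient of the clamped one-hot output with respect to the input and averages it over frames of the target class:

    import torch

    def attention_field(model, frames, class_idx):
        # frames: (n_frames, 128) spectrogram frames of the target class.
        frames = frames.clone().requires_grad_(True)
        out = model(frames)                  # (n_frames, 5) posteriors
        # Fix the target class activation to 1 and the rest to 0.
        target = torch.zeros_like(out)
        target[:, class_idx] = 1.0
        out.backward(target)                 # propagate back to the input
        return frames.grad.mean(dim=0)       # averaged attention field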

The derived attention field is used to weight the input auditory spectrum according to W = 1 + alpha*A, where W are the weights, A is the attention field (normalized to zero mean and unit variance), and alpha was set to 0.1.
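In code this weighting amounts to the following (a sketch; the spectrum is assumed to be stored with one 128-channel frame per row):

    import numpy as np

    def apply_attention(spectrum, A, alpha=0.1):
        # spectrum: (n_frames, 128) auditory spectrum; A: (128,) attention field.
        A = (A - A.mean()) / A.std()         # zero mean, unit variance
        W = 1.0 + alpha * A                  # W = 1 + alpha*A
        return spectrum * W                  # broadcasts over frames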

Results: The average recall rates (over the different noise types and SNRs) for the phoneme classes with and without the application of attention fields are shown below. Except for stops, all classes show significant improvements with the application of the attention field.

[Table: average recall rates per phoneme class, without attention vs. with attention]
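Recall here is per-class recall; a minimal sketch of how it is computed with scikit-learn (placeholder labels):

    import numpy as np
    from sklearn.metrics import recall_score

    y_true = np.array([0, 1, 2, 3, 4, 0, 1])    # placeholder frame labels
    y_pred = np.array([0, 1, 2, 4, 4, 0, 2])    # placeholder predictions
    per_class_recall = recall_score(y_true, y_pred, average=None,
                                    labels=np.arange(5))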

Subproject 2: High Level Saliency

Description: We explore the possibility of using spectrotemporally local features from salient regions for auditory scene analysis.

Method: Unlike previous studies on auditory saliency, we aim to extract not just temporal slices of salient regions but full spectrotemporal regions of saliency. We use the Kayser saliency model to compute the saliency map for the given audio. We use a bag-of-words model with 500 clusters to obtain a histogram of features, and an SVM classifier with a linear kernel.
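A sketch of this bag-of-words pipeline with scikit-learn is given below (variable and function names are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    K = 500   # codebook size, as in the text

    def clip_histogram(codebook, clip_features):
        # Map each local feature to its nearest cluster ("word") and
        # build a normalized histogram of word counts for the clip.
        words = codebook.predict(clip_features)
        hist, _ = np.histogram(words, bins=np.arange(K + 1))
        return hist / max(hist.sum(), 1)

    # train_feats: list of (n_local_features, feat_dim) arrays, one per clip.
    # codebook = KMeans(n_clusters=K).fit(np.vstack(train_feats))
    # X = np.array([clip_histogram(codebook, f) for f in train_feats])
    # clf = LinearSVC().fit(X, train_labels)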

Experiment 1: We use traditional MFCC features, where a 13-dimensional feature vector is extracted every 10 ms.
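For example, with librosa (an assumption; the original feature-extraction toolchain is not stated):

    import librosa

    y, sr = librosa.load("clip.wav", sr=16000)
    # 13 MFCCs with a 10 ms hop (160 samples at 16 kHz).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)
    frames = mfcc.T                      # one 13-dim vector per frame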

Experiment 2: We extract the histogram-of-gradients features typically used in visual object recognition (2). These features are extracted at pixels sampled uniformly from the entire auditory spectrogram, with a spacing of 5 pixels between feature-extraction points.
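A sketch of this grid sampling with scikit-image is shown below; the patch size around each sampling point is our assumption:

    import numpy as np
    from skimage.feature import hog

    def grid_hog(spectrogram, spacing=5, patch=16):
        # Describe a small patch around every `spacing`-th pixel with a
        # histogram of oriented gradients.
        h = patch // 2
        feats = []
        for i in range(h, spectrogram.shape[0] - h, spacing):
            for j in range(h, spectrogram.shape[1] - h, spacing):
                p = spectrogram[i - h:i + h, j - h:j + h]
                feats.append(hog(p, pixels_per_cell=(8, 8),
                                 cells_per_block=(1, 1)))
        return np.array(feats)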

Experiment 3: We again extract histogram-of-gradients features, but only in the salient regions given by the saliency map.
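The only change from Experiment 2 is where features are kept; a sketch is below (how the saliency map is binarized is an assumption, its mean is used as the threshold here):

    import numpy as np
    from skimage.feature import hog

    def salient_hog(spectrogram, saliency, spacing=5, patch=16):
        h = patch // 2
        thresh = saliency.mean()             # binarization threshold (assumption)
        feats = []
        for i in range(h, spectrogram.shape[0] - h, spacing):
            for j in range(h, spectrogram.shape[1] - h, spacing):
                if saliency[i, j] > thresh:  # keep only salient points
                    p = spectrogram[i - h:i + h, j - h:j + h]
                    feats.append(hog(p, pixels_per_cell=(8, 8),
                                     cells_per_block=(1, 1)))
        return np.array(feats)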

The task here is audio scene analysis. The database used was the BBC Sound Effects database. We used 5 classes, namely natural sounds, machine sounds, music, crowds, and animals. We used 1800 one-second segments randomly taken from all the files for training and 200 segments for testing.


The classification accuracy for the three experiments is given below.

[Table: classification accuracy with MFCC features, local (grid-sampled) features, and salient features]

Future Work:
1) Look into deriving optimal local features for audio scene analysis tasks.
2) Improve the saliency maps to include discriminative abilities.
3) Investigate better ways of integrating saliency detection into feature extraction and classification.


(1) X. Yang, K. Wang, and S. Shamma, "Auditory representations of acoustic signals," IEEE Trans. Inf. Theory, vol. 38, pp. 824-839, 1992.

(2) O. Ludwig, D. Delgado, V. Goncalves, and U. Nunes, "Trainable classifier-fusion schemes: An application to pedestrian detection," in Proc. 12th International IEEE Conference on Intelligent Transportation Systems, St. Louis, vol. 1, pp. 432-437, 2009.