Exploiting the statistics of an auditory scene to enhance the salience of novel acoustical objects

The basic strategy attempted here is to build a model of the preceding audio (in this case, the previous 2 s) and to compare the current audio (0.25 s) against this model to detect deviations. Here, the model consists simply of the mean and standard deviation of the cortical representation based on spectro-temporal receptive fields (Mesgarani, Shamma) for each frequency, rate, and scale. Deviations are quantified as the distance between the mean of the current audio and the mean of the model, normalized by the model's standard deviation. Further work in this project will develop more sophisticated, higher-order models that may more robustly capture complicated background noises and audio textures.
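As an illustration, the following is a minimal sketch of this deviation measure, assuming the cortical (or cochlear) representation has already been computed as an array of shape (time frames x channels), where each channel corresponds to one (frequency, rate, scale) combination. The function name, frame rate, and window parameters are illustrative, not taken from the actual implementation.

    import numpy as np

    def novelty(features, frame_rate, bg_dur=2.0, cur_dur=0.25, eps=1e-8):
        # features: array of shape (time_frames, n_channels); a hypothetical
        # stand-in for the precomputed cortical representation.
        n_bg = int(bg_dur * frame_rate)    # background model: previous 2 s
        n_cur = int(cur_dur * frame_rate)  # current audio: most recent 0.25 s
        background = features[-(n_bg + n_cur):-n_cur]
        current = features[-n_cur:]
        mu = background.mean(axis=0)       # model mean, per channel
        sigma = background.std(axis=0)     # model standard deviation, per channel
        # Distance between the current mean and the model mean, normalized by
        # the model's standard deviation (eps guards against silent channels).
        return np.abs(current.mean(axis=0) - mu) / (sigma + eps)

Collapsing the per-channel deviations (for example, by summing or taking the maximum across channels) would yield a single salience value per analysis step.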

Figure 1. - Spectrogram of an auditory scene, consisting of a boat motor and the sentence "She had your dark suit in greasy wash water all year" starting at about 6 s.

Figure 2. - Salience results using the cortical model, and using the same type of processing on the cochleagram (spectrogram). When the target is relatively loud, both representations can pull out the novel event. Here the peak SNR, computed over sixteen 50%-overlapping windows of the target, is roughly -6 dB.
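For concreteness, here is a sketch of how such a peak SNR might be computed, assuming the target and background waveforms are available separately (as they would be before mixing). The window construction follows the sixteen 50%-overlapping windows mentioned in the caption; all names are illustrative.

    import numpy as np

    def peak_snr_db(target, background, n_windows=16):
        # Sixteen 50%-overlapping windows spanning the target:
        # total span = win + (n_windows - 1) * win/2 = win * (n_windows + 1) / 2.
        win = 2 * len(target) // (n_windows + 1)
        hop = win // 2
        snrs = []
        for k in range(n_windows):
            seg = slice(k * hop, k * hop + win)
            p_sig = np.mean(target[seg] ** 2)        # target power in window
            p_noise = np.mean(background[seg] ** 2)  # background power in window
            snrs.append(10 * np.log10(p_sig / p_noise))
        return max(snrs)  # peak SNR over all windows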

Figure 3. - Salience results when the target is attenuated by an additional 9 dB, for a peak SNR of -15 dB. The spectrogram representation can no longer reliably pull out the novel event; in contrast, the cortical representation still yields a clear peak at the target onset.

Figure 4. - Results using Kayser et al.'s (2005) saliency map. The target is buried in the background, and the novelty is not detected.

Figure 5. - Reconstruction of the audio spectrogram after thresholding the cortical representation by the magnitude of the saliency measure. This is effectively another form of "salience map." A few background events get through at this threshold, but the target onset is well represented.
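A minimal sketch of this thresholding step is below, assuming the cortical representation and the per-unit saliency measure share the same shape, and that an inverse transform back to the spectrogram domain is available (passed in here as a placeholder, since the actual reconstruction procedure is not specified above).

    import numpy as np

    def salience_masked_spectrogram(cortical, salience, threshold, inverse_transform):
        # Zero out all units whose saliency magnitude falls below the
        # threshold, then reconstruct the spectrogram from what remains.
        mask = np.abs(salience) >= threshold
        masked = np.where(mask, cortical, 0.0)
        return inverse_transform(masked)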


Nima Mesgarani, Andrew Schwartz