A realtime auditory saliency model

Description: The project will use Kayser's auditory saliency model as a basis and adapt it as necessary to work in realtime.

Expected Outcome: Real-time demo of auditory saliency. Useful component for other projects, including a realtime implementation of the novel saliency-based speech-recognition method and the attention-driven two-talker speech-recognition demo.


In principle, this should be as similar to Kayser's auditory saliency model as possible.

One essential change to the available code is that it must be able to operate on short audio samples -- for realtime use, we cannot wait until the end of the audio section. Since the saliency model is calculated over larger time windows (320 ms), it must keep a built-in memory of enough preceding frames (each covering 10 ms of audio).
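The built-in frame memory described above can be sketched as a fixed-length ring buffer: with 10 ms frames and a 320 ms analysis window, 32 past frames must be retained. This is a minimal illustration, not Kayser's code; the class and method names are my own.

```python
from collections import deque

import numpy as np

FRAME_MS = 10                                # each incoming frame covers 10 ms
WINDOW_MS = 320                              # saliency is computed over 320 ms
FRAMES_PER_WINDOW = WINDOW_MS // FRAME_MS    # 32 frames per window


class FrameMemory:
    """Keeps just enough past spectrogram frames to fill one saliency window."""

    def __init__(self, n_bands, n_frames=FRAMES_PER_WINDOW):
        self.n_bands = n_bands
        # deque with maxlen silently discards the oldest frame on overflow
        self.buffer = deque(maxlen=n_frames)

    def push(self, frame):
        """Append one spectrogram frame (a vector of n_bands values)."""
        frame = np.asarray(frame, dtype=float)
        assert frame.shape == (self.n_bands,)
        self.buffer.append(frame)

    def ready(self):
        """True once a full 320 ms window has accumulated."""
        return len(self.buffer) == self.buffer.maxlen

    def window(self):
        """Return the current window as an (n_frames, n_bands) array."""
        return np.stack(self.buffer)
```

Each new frame slides the window forward by one step, so the realtime loop only ever touches the newest frame plus this small memory.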

Additionally, we fed the saliency model a reconstruction of the spectrogram based on MFCCs. This serves to de-emphasize the harmonics found in voiced speech.
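The idea behind the MFCC-based reconstruction can be sketched as follows: truncating the cepstral series and inverting it keeps the slowly varying spectral envelope while discarding the fine ripple that carries voiced-speech harmonics. This is an illustrative sketch of the general technique, not the project's actual code; the function name and coefficient count are assumptions.

```python
import numpy as np
from scipy.fftpack import dct, idct


def smooth_spectrum(log_mel_frame, n_ceps=13):
    """Reconstruct a smoothed log-spectrum frame from its first MFCCs.

    Zeroing the high-quefrency cepstral coefficients removes the fast
    spectral ripple (harmonics of voiced speech), leaving the envelope.
    """
    ceps = dct(log_mel_frame, type=2, norm='ortho')   # full cepstrum
    ceps[n_ceps:] = 0.0                               # keep low quefrencies only
    return idct(ceps, type=2, norm='ortho')           # back to a log-spectrum
```

Feeding such smoothed frames to the saliency model means harmonic peaks no longer dominate the feature maps.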

Model simplification

The offline model consists of various stages:

1. The spectrograms are mirrored at the edges to avoid edge artifacts.
2. A moving time window is used for the calculation of saliency over time.
3. Three features (intensity, onset detectors, and center-surround frequencies) are calculated, centered at each time and frequency, on four different scales.
4. The scales interact to generate a composite map for each feature.
5. Each feature map is normalized in a way that exaggerates relatively large peaks and diminishes uniformly large peaks.
6. The normalized feature maps are added together.
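Stage 5 can be illustrated with the Itti-style normalization on which this family of saliency models is based: a map with one dominant peak is boosted, while a map with many comparably large peaks is suppressed. This is a sketch of that general scheme under assumed parameters; Kayser's exact normalization may differ in detail.

```python
import numpy as np
from scipy.ndimage import maximum_filter


def normalize_map(fmap, size=5):
    """Itti-style map normalization (sketch).

    Scales the map by (M - m_bar)**2, where M is the global maximum and
    m_bar is the mean of the *other* local maxima: a lone strong peak
    keeps its weight, many equal peaks cancel each other out.
    """
    fmap = fmap - fmap.min()
    if fmap.max() > 0:
        fmap = fmap / fmap.max()            # scale to [0, 1], so M == 1
    # local maxima: points equal to the max of their neighbourhood
    local_max = maximum_filter(fmap, size=size)
    peaks = np.sort(fmap[(fmap == local_max) & (fmap > 0)])[::-1]
    # mean of local maxima other than the single global maximum
    m_bar = peaks[1:].mean() if peaks.size > 1 else 0.0
    return fmap * (1.0 - m_bar) ** 2
```

This is exactly the behaviour noted later in the text: within a window containing little else, even a quiet background event becomes the "dominant peak" and is amplified.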

The most computationally intensive section is stage 3, since it involves scaling the image and convolving it with filters based on 2D Gaussians. To minimise computation, the model was adapted as follows:

(a) The Gaussian-shaped features were replaced by rectangular features, so that the extent of each feature could be calculated from a simple mean or a difference of two means, which is much faster.
(b) A single feature size was used, at least initially. This not only simplifies the feature detection but also greatly simplifies the programming.
(c) With only one level of features, no interaction between scales was required (i.e., stage 4 was omitted).
(d) The realtime implementation naturally incorporates a moving window, so only one calculation per frequency per feature was required per frame.
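The rectangular-feature trick in (a) can be sketched as follows: using a cumulative sum, the mean over any box window costs a constant number of operations regardless of window size, and a center-surround feature becomes the difference of two such means. The function names and window half-widths below are illustrative assumptions, not the project's actual parameters.

```python
import numpy as np


def box_mean(x, half_width):
    """Mean of x over a rectangular window, via a cumulative sum.

    The cost per output point is constant, independent of window size --
    this is what makes rectangular features much cheaper than Gaussians.
    """
    n = len(x)
    c = np.concatenate(([0.0], np.cumsum(x)))
    lo = np.clip(np.arange(n) - half_width, 0, n)
    hi = np.clip(np.arange(n) + half_width + 1, 0, n)
    return (c[hi] - c[lo]) / (hi - lo)


def center_surround(frame, center_hw=2, surround_hw=8):
    """Rectangular center-surround feature along the frequency axis:
    a difference of two box means, standing in for the Gaussian-based
    filters of the original model (sizes here are illustrative)."""
    return box_mean(frame, center_hw) - box_mean(frame, surround_hw)
```

With a moving window, as noted in (d), only the newest frame's column of such features needs computing at each step.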


The following figures show a 10-second spectrogram (top panel) and its simplified saliency map (bottom panel), both calculated in realtime, based on a live news report on the radio, including general background noise.

[Figure: realtime spectrogram (top); realtime saliency map (bottom)]


The saliency map looks mostly like a low-quality version of the spectrogram. There is some emphasis of onsets, but the most noticeable effect is the normalization: even quiet background sounds are considered salient if there is little else of interest.

The key flaw here is the oversimplification of using only one feature level. The feature size used was rather small, and an important part of the Kayser model seems to be the interaction of different feature sizes, although it is currently unclear to the writer exactly what effects this interaction has.

However, the normalization over 320-ms windows is an inherent part of Kayser's model, and it will always exaggerate the saliency of background sounds that occur in the silences of the foreground sound.

As this behaviour is undesirable for our current purposes (focusing on key features of speech in background noise and detecting novel sounds), we decided that this particular saliency model was not going to satisfy our requirements, and that it was not worth completing the implementation with multiple feature levels.


The code is available on request (trevor.agus@…). It has not been provided here because it is based on Kayser's original code, which is not publicly available.


  1. 'Malcolm' (Yahoo!) - Lead
  2. 'Trevor' (CNRS & LPP & ENS) - Participant