Humans, other biological systems, robots, and computers all have limited processing resources. Engineers are well used to optimizing systems, often simplifying the task in order to achieve acceptable computation times. “Attention” refers to the biological equivalent of this resource allocation. Human attention is often described as “bottom-up” or “top-down”. In bottom-up attention, resources are automatically focused on particularly “salient” features. We have access to two computational models of auditory salience (Kalinli/Kayser?) and one model of visual salience (Itti-Koch). For top-down attention, there is the divisive-normalization model (Reynolds-Heeger), which describes a wide range of physiological data, and there are models of wandering attention (e.g., Moreno-Bote).
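
For concreteness, here is a minimal single-channel sketch of bottom-up salience in Python (NumPy/SciPy). It keeps only the center-surround contrast idea; the full Itti-Koch model also uses colour and orientation channels over a multi-scale pyramid, and the blur widths below are illustrative assumptions rather than the model's values.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def intensity_saliency(image, center_sigma=2.0, surround_sigma=8.0):
        # Center-surround contrast on one intensity channel: a crude
        # stand-in for the intensity pathway of the Itti-Koch model.
        center = gaussian_filter(image.astype(float), center_sigma)
        surround = gaussian_filter(image.astype(float), surround_sigma)
        saliency = np.abs(center - surround)
        return saliency / (saliency.max() + 1e-12)  # normalize to [0, 1]

    # A bright blob on a noisy background "pops out" as the most salient point.
    rng = np.random.default_rng(0)
    img = rng.normal(0.0, 0.1, (64, 64))
    img[28:36, 28:36] += 1.0
    peak = np.unravel_index(intensity_saliency(img).argmax(), img.shape)
    print(peak)  # near the centre of the blob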

In our group project, we aim to implement human-like attention in computers. This could be most useful when a system nears the limits of its available memory, computation time, or (in hardware) power consumption. We will focus on modelling attention in relatively simple tasks, knowing that the same principles could be applied to more complex ones.

Stage 1: An auditory stimulus

We want to develop an attentional task that humans find challenging, to serve as a benchmark for the models of attention we later implement in software. In this task, listeners will be presented with several overlapping talkers, each spatially separated from the others (either with different ITDs over headphones or with different loudspeakers in the free field). Each talker will say several pairs of digits (e.g., “one three, five four, nine nine”). The listeners’ task is to report the largest two-digit number (i.e., “nine nine” = ninety-nine, in this example). The parameters should be adjusted so that humans find the task challenging without any prior information, but relatively easy when they know which location to attend to.
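
A minimal sketch of the trial logic follows. The number of talkers, number of pairs, and digit ranges are placeholder parameters to be tuned; keeping the leading digit nonzero so that every pair reads as a genuine two-digit number is also our assumption.

    import random

    def make_trial(n_talkers=3, n_pairs=3, seed=None):
        # Each pair (a, b) is spoken "a b" and scored as the number 10*a + b;
        # the correct answer is the largest such number across all talkers.
        rng = random.Random(seed)
        talkers = [[(rng.randint(1, 9), rng.randint(0, 9)) for _ in range(n_pairs)]
                   for _ in range(n_talkers)]
        answer = max(10 * a + b for pairs in talkers for (a, b) in pairs)
        return talkers, answer

    talkers, answer = make_trial(seed=1)
    print(talkers, "->", answer)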

Stage 2: An audio-visual stimulus

Primarily for the benefit of the computational model, we would like to provide further information through the visual domain that could be used to direct auditory attention.

This could take the form of LEDs of variable brightness, reflecting the locations and envelopes of the simultaneous talkers. Alternatively, the visual stimuli could be presented on a computer screen, varying in brightness or (like the human mouth) in area.
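
One plausible mapping from a talker's waveform to LED (or on-screen) brightness is the smoothed speech envelope, sketched below: the magnitude of the analytic signal followed by a low-pass filter. The 8 Hz cutoff is an assumed value in the range of syllable rates, not a tested parameter.

    import numpy as np
    from scipy.signal import hilbert, butter, filtfilt

    def brightness_from_speech(x, fs, cutoff_hz=8.0):
        # Envelope = |analytic signal|, smoothed by a 2nd-order low-pass,
        # then scaled to a 0..1 brightness command for the LED.
        envelope = np.abs(hilbert(x))
        b, a = butter(2, cutoff_hz / (fs / 2))
        smooth = np.clip(filtfilt(b, a, envelope), 0.0, None)
        return smooth / (smooth.max() + 1e-12)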

Stage 3: Any computational solution

We need to set up sensors and software to

  • record the auditory and visual stimuli
  • bind the auditory and visual stimuli to some extent
  • enhance the auditory stimulus by attending to its location (e.g., through beamforming; see the sketch after this list)
  • identify the digits and report the correct answer (the largest pair of digits)
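
As a starting point for the enhancement step, here is a sketch of the simplest option, a delay-and-sum beamformer for a linear microphone array. The array geometry and far-field steering convention are our assumptions; delays are applied in the frequency domain so that fractional-sample shifts are exact.

    import numpy as np

    def delay_and_sum(mics, fs, mic_x, angle_deg, c=343.0):
        # mics: (n_mics, n_samples) recordings; mic_x: positions in metres
        # along the array axis. Align each channel to a far-field source at
        # angle_deg, then average; the steered direction adds coherently.
        n = mics.shape[1]
        freqs = np.fft.rfftfreq(n, d=1.0 / fs)
        delays = np.asarray(mic_x) * np.sin(np.deg2rad(angle_deg)) / c
        spectra = np.fft.rfft(mics, axis=1)
        phase = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        aligned = np.fft.irfft(spectra * phase, n=n, axis=1)
        return aligned.mean(axis=0)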

Potential bonuses:

  • use the silicon cochlea
  • add noise (or cope with environmental noise in the free field; a small mixing utility is sketched below)
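
A small utility for the noise bonus, assuming we control the mixing ourselves: rescale a noise recording so that the speech-to-noise power ratio hits a target SNR in dB.

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        # Tile/truncate the noise to the speech length, then scale it so
        # that 10*log10(P_speech / P_noise) equals the requested SNR.
        noise = np.resize(noise, speech.shape)
        gain = np.sqrt(np.mean(speech**2) /
                       (np.mean(noise**2) * 10 ** (snr_db / 10.0)))
        return speech + gain * noise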

Stage 4: A neuromorphic solution

Here, we can improve on our original solution by using biologically inspired models of attention, e.g.

  • pre-filter the sound and vision using models of saliency
  • attend to the appropriate stimuli using the divisive-normalization model of attention or a similar model (see the sketch after this list)
  • recognize the digits using the Kailash-Roi-Malcolm saliency-based speech-recognition approach
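
The attentional step could look like the following one-dimensional sketch in the spirit of the Reynolds-Heeger normalization model: attention multiplicatively scales the stimulus drive, and each unit is divided by a pooled suppressive drive plus a semi-saturation constant. The pool width and constants here are assumptions, not fitted values.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def normalization_model(stimulus_drive, attention_field,
                            sigma=0.1, pool_sigma=5.0):
        # Excitation = attention-scaled drive; suppression = spatial pool
        # of that excitation. Dividing one by the other yields the classic
        # attentional boost of the attended stimulus.
        excitation = stimulus_drive * attention_field
        suppression = gaussian_filter1d(excitation, pool_sigma)
        return excitation / (suppression + sigma)

    # Two equal inputs; attending near the second one boosts its response.
    x = np.zeros(100)
    x[20], x[70] = 1.0, 1.0
    attn = np.ones(100)
    attn[60:80] = 2.0
    r = normalization_model(x, attn)
    print(r[20], r[70])  # the attended location wins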

Stage 5: Evaluation of the neuromorphic solutions

We should try variations of the paradigm to explore the pros and cons of human-like attention, perhaps answering some of the following questions: Do biologically inspired models of attention bring anything useful to the worlds of hardware and software? What are the advantages of human-like attention (e.g., attention being grabbed by low-level saliency cues; not wasting resources on less interesting stimuli)?

What are the disadvantages of human-like attention (e.g., attention being grabbed by low-level saliency cues)?

Potential applications of computational attention:

  • Monitoring many streams of telephone calls for specific content (e.g., “… make bomb…”) or monitoring large numbers of CCTV feeds
  • Directing directional microphones toward the locations of salient auditory cues, rather than simply pointing straight ahead at all times