Final Report

Note: This is unpublished work in progress.

Project Participants

Adriano Claro Monteiro, Alain de Cheveigné, Anahita Mehta, Andreas Andreou, Alejandro Pasciaroni, Jesus Armando Garcia Franco, Byron Galbraith, Christian Denk, Daniel Neil, Dimitra Emmanouilidou, Bert Shi, Deniz Erdogmus, Greg Cohen, James O'Sullivan, Jonathan Tapson, Mehmet Ozdas, Amir Khosrowshahi, Lakshmi Krishnan, Mahmood Amiri, Michael Crosse, Jose L Pepe Contreras-Vidal, Qian Liu, Ryad Benjamin Benosman, Sergio Davies, Shih-Chii Liu, Thusitha Chandrapala, Timmer Horiuchi, Tobi Delbruck, Will Constable

Organizers: Shihab Shamma (Univ. of Maryland), Malcolm Slaney (Microsoft Research), Barbara Shinn-Cunningham (Boston University), Edmund Lalor (Trinity College, Dublin)


The goal of our project is to measure and understand neural responses to imagined sounds. In our three weeks in Telluride we looked at three different types of imagined sounds (modulated tones, music, and speech), collected a substantial amount of new EEG data, developed three types of decoders/classifiers, and investigated the spatial distribution of these signals.


This topic area aims to measure neuronal signals that reflect the perceptual, attentional and imaginative state of an individual brain. Specifically, we sought to develop reliable on-line decoding algorithms that extract from the EEG signal the sensory cortical responses corresponding to an auditory source amongst many in a complex scene, or to an imagined music and speech signal. The goal is to understand how the perception and coding of such complex signals are represented and shaped by top-down cognitive functions (such as attention and recall).

The basic scientific approaches needed are highly interdisciplinary, spanning development of signal-analysis algorithms and models of cortical function, to experimental EEG recordings during performance of challenging psychoacoustic tasks. While there is considerable research that touches upon issues addressed by this project, there are nevertheless several unique aspects to this work. For example, animal neurophysiological and imaging studies of auditory cortical activity cannot easily replicate the sophisticated behavioral tasks possible with humans, especially with speech and music. And since fMRI approaches in humans lack the temporal acuity necessary to track and extract auditory sensory responses, this leaves techniques such as MEG, EEG, and ECoG as the only scientifically feasible options. However, MEG requires expensive elaborate laboratory setups, and ECoG is obviously restricted to a few groups in the world with access to patients and interested surgical teams. Consequently, EEG is an accessible alternative for studying human auditory cognition. The biggest obstacle has been the perceived difficulty of recording clean and sustained signals that can be reliably associated with ongoing speech and music audio. This is especially important if one is to detect and interpret the relatively small response perturbations due to cognitive influences such as imagining sound or changing the attentional focus. We conducted pilot studies at the 2012 Telluride Workshop where we focused on demonstrating the feasibility of extracting the signals to which listeners attended in a complex mixture of sounds. This preliminary demonstration is described in more detail at  http://neuromorphs.net/nm/wiki/2012/att12

During our time together in Telluride we looked for evidence of imagined sounds in three related domains: speech, music, and modulated tones. Speech and music both consisted of longer, time-varying signals, and we applied similar analysis techniques to these two signal types. Using a number of different analysis techniques, we saw good evidence that we could decode imagined speech and music sounds. Modulated tones have a simpler temporal structure, and we expected to have more success with these highly repetitive signals. That turned out not to be the case. We discuss each experiment in turn.

We looked at three different approaches for decoding imagined speech or music: kernel-based support vector machine (SVM) for binary classification of imagined speech and imagined music, a machine-learning approach based on capturing the modulation patterns, and an approach based on decoding the EEG signal to recover the original (imagined) stimulus. The specific experiments are described below.

  1. Imagined Speech or Music via an SVM
  2. Imagined Speech/Music via Nearest Neighbors using DCT of Modulated Envelope
  3. LDA based classification using low pass filtered data
  4. Evidence for the representation of an imagined speech envelope in auditory cortex
  5. Steady-State Auditory Evoked Potentials (SSAEP)
  6. Musical Examples

Experimental Paradigm

The aim of the experiment was to determine a reliable measure of imagined audition using electroencephalography (EEG). The experimental paradigm was set up in a paired fashion: the first trial presented the sound stimulus, and the subsequent trial required the participant to imagine the stimulus that was played in the previous trial. This paradigm was used for all three stimulus types. One of the main challenges in recording EEG responses to imagined stimuli (especially music and speech) is that it is not possible to know exactly when the listener started to imagine the sound stimulus. Without an explicit cue, the listener may begin imagining the target stimulus at a different onset time in each trial, which would smear the average.

Our initial paradigm consisted of a visual countdown (4, 3, 2, 1) followed by a crosshair preceding each trial (both perceived and imagined). All trial triggers coincided with the onset of the crosshair. The time interval between each digit of the countdown was kept at the same tempo as the stimulus (in most cases, the time interval was 0.53 seconds). All trials were then averaged from the onset of the trigger that coincided with the onset of the crosshair. However, after a couple of pilot experiments, it was noticed that the results in the EEG were mainly driven by the rhythm of the preceding countdown stimulus. Hence, a visual progress bar replaced the digit countdown where the trigger came on at the end of the progress bar. Listeners were instructed to start imagining at the end of the progress bar. Before the progress bar came on the screen, it was briefly visually indicated whether the ensuing trial required the subject to listen or imagine (Fig. 1).

The experimental paradigm was set up in this manner to estimate as closely as possible when listeners started imagining the required stimulus, as that facilitates averaging across trials and minimizes smearing due to onset timing jitter. The stimuli were presented on a MacBook through PsychToolbox in Matlab (Brainard, 1997; Pelli, 1997). All audio samples were played through Klipsch S4 noise-isolating in-ear headphones. The EEG data were collected using the BrainVision 64-channel EEG recording system, digitally sampled at an A/D rate of 500 Hz. Listeners wore an 'Easy Cap' electrode cap (Falk Minow Services, Herrsching-Breitbrunn, Germany) fitted with 64 silver/silver chloride scalp electrodes.

The stimuli could be divided into 3 types, each consisting of various samples:

  • For Steady State Auditory Stimulation:
    • 500 Hz sawtooth modulated 100% at 4 Hz
    • 769.231 Hz sawtooth modulated 100% at 6 Hz
    • 300 Hz sawtooth modulated 100% at 40 Hz

Each of these stimuli was played for a duration of 20 seconds, after which the listeners were instructed to imagine it for the same duration in the same paradigm described earlier. This stimulus type was primarily used to exclude the possibility of motor artefacts: it is highly unlikely that a listener could produce a motor response to a 6 Hz modulated tone. Hence, we can be fairly confident that the imagined EEG response to these steady-state evoking stimuli would be purely due to auditory imagination.
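For reference, such a stimulus is easy to synthesize. The sketch below (Python with NumPy/SciPy; the experiments themselves used Matlab) assumes a sinusoidal modulation envelope and a 44.1 kHz sampling rate, neither of which is specified in this report:

```python
import numpy as np
from scipy.signal import sawtooth

def am_sawtooth(carrier_hz, mod_hz, dur_s=20.0, fs=44100, depth=1.0):
    """Generate a 100%-amplitude-modulated sawtooth tone (a sketch of
    the SSAEP stimuli; envelope shape and sample rate are assumptions)."""
    t = np.arange(int(dur_s * fs)) / fs
    carrier = sawtooth(2 * np.pi * carrier_hz * t)
    # depth=1.0 gives 100% modulation; envelope swings between 0 and 1
    envelope = 0.5 * (1 + depth * np.sin(2 * np.pi * mod_hz * t))
    return envelope * carrier

# e.g. the 500 Hz carrier modulated at 4 Hz (shortened to 1 s here)
stim = am_sawtooth(500.0, 4.0, dur_s=1.0)
```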

  • Music:
    • MIDI version of the Imperial March (Star Wars theme).
    • Simple musical tone sequences

We used two types of musical tone sequences, one slightly more rapid than the other. All music stimuli were approximately 3-4 seconds long. The time allowed for imagining the music was 5 seconds per trial.

  • Speech:
    • “The whole maritime population of Europe and America.”
    • “Twinkle-twinkle little star.”
    • “London bridge is falling down, falling down, falling down.”

All the speech stimuli were neutral, well-known phrases, also approximately 3-4 seconds long; the time allowed for imagining the speech was likewise 5 seconds per trial.

Fig. 1: Schematic diagram illustrating the experimental paradigm for each trial.

For the music and speech stimuli, the trials were presented in blocks of 5 (e.g., 5 trials of a single speech stimulus, both perceived and imagined). Each 5-trial block used the same stimulus; however, the different music and speech blocks were interleaved. The listener initiated each block by pressing any key on the keyboard. The preliminary data runs yielded approximately 75 trials per music and speech stimulus for some subjects.

EEG preprocessing

EEG pre-processing, epoching and averaging were carried out using the FieldTrip toolbox (Oostenveld et al., 2011), the EEGLAB toolbox (Delorme & Makeig, 2004) and the NoiseTools MATLAB toolbox (de Cheveigné and Simon, 2008). The data were downsampled to 100 Hz, filtered using a zero-phase-shift band-pass filter from 0.1 Hz to 30 Hz, and averaged across the 75 trials per condition.
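As a concrete illustration, this chain (downsample to 100 Hz, zero-phase 0.1-30 Hz band-pass, average across trials) might be sketched as follows in Python with NumPy/SciPy; the Butterworth design and its order are assumptions, since the report does not name the filter used:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def preprocess(trials, fs_in=500, fs_out=100, band=(0.1, 30.0)):
    """Downsample, zero-phase band-pass, then average across trials.

    trials: array of shape (n_trials, n_channels, n_samples) at fs_in.
    The 2nd-order Butterworth design is an assumption.
    """
    # Polyphase resampling, e.g. 500 Hz -> 100 Hz
    x = resample_poly(trials, fs_out, fs_in, axis=-1)
    # Zero-phase band-pass (sosfiltfilt runs the filter forwards and backwards)
    sos = butter(2, band, btype="bandpass", fs=fs_out, output="sos")
    x = sosfiltfilt(sos, x, axis=-1)
    # Average across trials -> (n_channels, n_samples)
    return x.mean(axis=0)
```

With 75 trials of 64-channel data this yields one 64 x nSamples average per condition.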

Next, Independent Component Analysis (ICA) was carried out to remove the eye and head artifact components from the EEG data.

Imagined Speech or Music via an SVM

Methods - Experimental Paradigm

This section describes EEG recording session #2 (subject: Ed). The subject was required to listen to a 3-second trial of speech or music and then prompted to imagine it immediately afterwards. The trials consisted of one of two different speech passages (Twinkle Twinkle and London Bridge) or one of two different musical passages (Imperial March and Generic Melody). To promote accurate entrainment to the tempo, rhythm and pitch of each of the four passages, each passage was presented 25 times in a row, with an imagined trial in between each perceived trial. To ensure the imagined trial was well time-locked to each trigger, a visual count-in (4, 3, 2, 1) was presented before each perceived and imagined trial.

Methods - EEG data acquisition and pre-processing

EEG data were recorded at 64 scalp locations using the  BrainVision system (Brain Products, Munich, Germany) with an online ground reference and digitized at a rate of 500 Hz. The raw data was read into MATLAB ( MathWorks, Natick, USA) using  EEGLAB (Delorme and Makeig, 2004) and epoched into 4 second windows starting from the trigger (i.e. no pre-stimulus data was included). This was done for only imagined trials which were labelled with their corresponding class. No offline re-referencing, re-sampling or filtering were applied to the data thereafter.

Methods - SVM Classification analysis

SVM classification was implemented using the  LIBSVM toolbox (Chang and Lin, 2011). A radial basis function (RBF) kernel was chosen to utilise the high dimensionality of the data. A 10-fold cross-validation was used for the parameter search and also to test the SVM classifier. The trial order was randomised each time the cross-validation was run so as not to bias the model and to validate the success rate of decoding.

To reduce the dimensionality of the data, a kernelised version of the raw, unfiltered EEG data was computed as follows:

  1. For each trial, the data from every electrode was concatenated into a single vector.
  2. The vectors from all trials were grouped into a single matrix.
  3. The auto-covariance of this matrix was calculated and, along with a column vector of class labels, used to train the SVM model.
  4. The cross-covariance of the training data matrix and the testing data matrix was calculated and used to test the model (see Fig. 1).
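The four steps above can be sketched in a few lines of NumPy; this is only the kernelisation stage, with the actual SVM training left to LIBSVM (or an equivalent such as scikit-learn):

```python
import numpy as np

def trial_kernels(train, test):
    """Sketch of the kernelisation steps above: concatenate each trial's
    electrodes into one vector, stack trials into a matrix, then form the
    trial-by-trial (cross-)covariance matrices fed to the SVM.

    train: array (n_train, n_electrodes, n_samples); test likewise.
    """
    D_train = train.reshape(len(train), -1)   # steps 1-2: one row per trial
    D_test = test.reshape(len(test), -1)
    K_train = D_train @ D_train.T             # step 3: auto-covariance, (n_train, n_train)
    K_test = D_test @ D_train.T               # step 4: cross-covariance, (n_test, n_train)
    return K_train, K_test
```

Each row of K_train, together with its class label, would then be passed as an N-dimensional feature vector to the RBF-kernel SVM.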

Figure 1.

The following section explains mathematically how the data is transformed.

For N trials and E electrodes, the neural response from each electrode e at time t = 0…T for trial n is represented as Rn(t, e), where n = 1…N and e = 1…E.

The responses from every electrode of trial n are concatenated into a single row vector,

  Dn = [Rn(0, 1), …, Rn(T, 1), Rn(0, 2), …, Rn(T, E)],

so the full data set forms an N × E(T+1) matrix D with one row per trial.

The auto-covariance of the neural responses, CDD = D Dᵀ, is then calculated to give the covariance between every pair of trials; this N × N matrix simultaneously reduces dimensionality by collapsing across electrodes and time.

It is also possible to think about this system in terms of correlations. In effect, a new EEG signal (a 64-dimensional vector over time, unwrapped into a one-dimensional signal) is correlated against each of the training signals (each also an unwrapped 64-dimensional signal). The job of the SVM is to look at all the correlation scores and learn whether high or low correlation values with these different memorized examples indicate that the new signal is in class A or class B.

The best results from the SVM parameter search are displayed in Table 1 below.

Parameter   Value
Cost (C)    4.5×10³
Gamma (γ)   5.5×10⁻⁸

Table 1.

Test results for 3 randomly picked 10-fold cross-validations are displayed in Fig. 2 below along with the average decoding accuracy. These represent how well the SVM classifier could differentiate between raw unfiltered EEG data recorded during imagined speech vs imagined music, imagined Twinkle Twinkle vs imagined London Bridge and imagined Imperial March vs imagined Generic Melody. For the Speech vs Music test, the trials from the two speech passages and the trials from the two music passages were mixed together.

Figure 2.

SVM Discussion

The ability of the SVM classifier to decode the imagined passage with an accuracy of 80–90% is quite an exciting result. Although SVM classification cannot tell us anything scientific about imagined speech/music, it does tell us that there is something detectable by EEG that is reproducible across multiple trials. This gives hope to future work that may seek to explore imagined speech/music through EEG. Furthermore, the ease with which SVM classification can be implemented, and the speed at which it can compute non-linear binary classifications of multi-dimensional data, make it a very suitable tool for analysing raw data (or even for implementing in real time).

We did not expect the SVM classifier to perform as well as it did, purely because we did not think there would be enough information in the data to allow it to classify individual trials. However, part of the reason it performs so well is the way in which the data are kernelised before being used to train and test the model. By calculating the covariance between every combination of trials in the training data, the SVM can learn the relationship between these covariances and the corresponding class labels. Then, given the cross-covariance between every combination of trials in the training and testing data, it can predict the class label based on prior knowledge of which label mapped to a particular covariance for a particular trial in the training data.

Imagined Speech/Music via Nearest Neighbors using DCT of Modulated Envelope

In this approach we treated each whole trial as one sample in a binary classification problem. The inputs to the system are the EEG data recorded during the two speech (or music) stimuli, either imagined or perceived. The goal is to classify a given trial as one of the two speech/music streams.

The DSS (Denoising Source Separation) and DCT (Discrete Cosine Transform) algorithms were used for dimensionality reduction. A model is made by averaging the features obtained by the training data. When classifying the test data samples, they are subjected to the same dimensionality reduction methods, and the features are used to do a nearest neighbor classification.

DSS for enhancing repeatable signals and dimensionality reduction

DSS is a tool with multiple applications. It is used here to find a signal subspace that enhances the repeatable part of the EEG signal across trials. It is calculated by using an averaged version of the signal across trials as the source for the bias covariance. Furthermore, by selecting from the DSS output a limited number of signals with the highest power, the number of signal channels can be effectively reduced, which amounts to a dimensionality reduction as well. It should be noted that when creating the covariance matrix of the bias function, signals from both classes should be used. The number of channels carried forward to the next level is a variable that affects classification performance. [What number did we use?]
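A minimal sketch of this use of DSS, assuming the standard two-stage whitening-plus-rotation formulation of de Cheveigné and Simon (2008); the number of retained components is a placeholder, since the report leaves it unspecified:

```python
import numpy as np

def dss(trials, n_keep=8):
    """Minimal DSS sketch: whiten by the total covariance, then rotate to
    maximise the power of the across-trial average (the repeatable part).
    n_keep is an assumed value, not the one used in the report.

    trials: (n_trials, n_channels, n_samples). Returns the trials projected
    onto the n_keep strongest components, shape (n_trials, n_keep, n_samples).
    """
    X = trials - trials.mean(axis=-1, keepdims=True)
    C0 = sum(x @ x.T for x in X) / len(X)       # total covariance
    mean = X.mean(axis=0)
    C1 = mean @ mean.T                          # bias: covariance of the average
    # Whiten with C0 (PCA + scaling), guarding tiny eigenvalues
    d, V = np.linalg.eigh(C0)
    keep = d > 1e-12 * d.max()
    W = V[:, keep] / np.sqrt(d[keep])
    # Rotate to maximise biased (repeatable) power in the whitened space
    d2, V2 = np.linalg.eigh(W.T @ C1 @ W)
    U = (W @ V2)[:, ::-1][:, :n_keep]           # strongest components first
    return np.einsum('ck,ncs->nks', U, X)
```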

DCT as a feature extracting mechanism

This is a simple DCT operation carried out for the signals across time for all the input channels. The number of DCT coefficients of the different channels stacked together form the feature vector. A classification model is created by averaging the EEG signals across all the trials in the training data.

The DCT was used to capture the overall shape of the EEG responses, over time and channel. The raw EEG data was low-pass filtered and downsampled to 25Hz. We formed a signature for a trial by assembling the EEG data into a matrix (nChannels by nTimes) and then computing the two-dimensional DCT. We could then select the most important components, zeroing out the remaining components, and then reconstruct a smoothed version of the original trial data.
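In Python with SciPy, the signature computation might look like this; the coefficient counts are illustrative, since the report treats them as swept parameters:

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_signature(trial, n_chan_coefs=8, n_time_coefs=16):
    """Sketch of the DCT signature described above: take the 2-D DCT of a
    (nChannels x nTimes) trial matrix, keep only the low-order coefficients,
    and zero the rest. The coefficient counts are assumptions."""
    C = dctn(trial, norm='ortho')
    mask = np.zeros_like(C)
    mask[:n_chan_coefs, :n_time_coefs] = 1.0
    return C * mask

def smoothed_trial(trial, **kw):
    """Reconstruct the highly smoothed version of the original trial."""
    return idctn(dct_signature(trial, **kw), norm='ortho')
```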

This processing chain is illustrated in the next figure. The first and second row show the processing for two different speech signals.

The first column shows the spectrogram for each signal. The second column shows the same type of spectrogram data, but after processing with DSS to find the most consistent signals over trials. The final column shows the result of the DCT model, a highly smoothed version of the original signals.

We formed a model for each condition (speech 1 or speech 2) by averaging the data for all trials in each condition. Then we computed the DCT model, using a range of channel and time counts. This gave us one (averaged) model for each condition.

To recognize the imagined (test) condition, we performed the same DCT calculation on the test data and then compared the result to the two models. The closest model determined the imagined class. This is a simple nearest-neighbor classifier. Interestingly, comparing the trial result to the DCT of each individual training example gave us worse results.
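The nearest-neighbor decision itself is then tiny; the Euclidean metric here is an assumption, as the report does not name the distance measure:

```python
import numpy as np

def classify(test_sig, model_a, model_b):
    """Nearest-neighbor decision: compare the test trial's DCT signature to
    the two class-averaged model signatures and pick the closer one
    (returns 0 for class A, 1 for class B). Euclidean distance is assumed."""
    da = np.linalg.norm(test_sig - model_a)
    db = np.linalg.norm(test_sig - model_b)
    return int(db < da)
```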


A parameter sweep was carried out to characterize classification performance as a function of the number of DSS channels and DCT coefficients used to form the feature vector. Classification accuracy of about 80% was obtained for certain parameter settings. It should be noted that the test and training data were separated randomly at the beginning of every test run. This result is for the two imagined speech signals.

LDA based classification using low pass filtered data

The input data and the goal of the system are the same as for the previous method. In this approach the input signal was subsampled to 20 Hz and then passed through several moving-average filters with different window sizes. A feature vector is created by stacking the filtered signals at each sample point. An LDA classifier is trained using all the feature vectors generated from the different trials. For each test trial a similar feature vector is created, and classification is done with LDA using the trained model.

Results: This method classified the two types of data at a highly significant level (p < 0.0001), but its performance was not high enough for practical use.
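A sketch of this pipeline, with assumed window sizes and a minimal two-class Fisher LDA standing in for the LDA classifier used here:

```python
import numpy as np

def ma_features(x, windows=(2, 4, 8)):
    """Sketch of the feature construction above: filter the subsampled signal
    with several moving-average windows (sizes are assumptions) and stack the
    filtered versions at every sample into one feature vector."""
    feats = [np.convolve(x, np.ones(w) / w, mode='same') for w in windows]
    return np.concatenate(feats)

def fisher_lda(X0, X1):
    """Two-class Fisher LDA: weight vector and threshold from class means and
    pooled within-class covariance (a minimal stand-in, not the exact
    implementation used in the report)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(len(Sw)), m1 - m0)
    thresh = w @ (m0 + m1) / 2
    return w, thresh  # predict class 1 when w @ x > thresh
```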

Evidence for the representation of an imagined speech envelope in auditory cortex

It is well established that neural activity tracks the slow (< 15 Hz) amplitude envelope of natural speech (Eggermont 2002, Luo and Poeppel 2007), and that this activity can be measured using electroencephalography (EEG) data (Lalor and Foxe, 2010). It is unknown, however, whether similar activity is induced in auditory cortex when imagining speech. In an effort to answer this question, we attempted to decode imagined speech from EEG data by employing the method of stimulus reconstruction, which has proven successful in several previous studies for analysing neural responses to continuous speech (Mesgarani et al 2009, Ding and Simon 2012, Mesgarani and Chang 2012).

The experimental paradigm consisted of a perceived speech trial, in which a subject passively listened to a speech stimulus, followed two seconds later by an imagined speech trial, in which they imagined hearing the same speech stimulus. Our stimuli consisted of two three-second speech segments, both read by the same male speaker. Pairs of perceived and imagined trials were presented 75 times for both speech segments.

The method of stimulus reconstruction uses linear regression to create a multivariate model of the input-output relationship between the EEG data and the amplitude envelope of the speech stimuli. We refer to this multivariate model as a decoder. A leave-one-out cross-validation approach was used, whereby each decoder was trained on 74 trials in order to reconstruct an estimate of the speech envelope from the EEG data of the remaining trial. Reconstruction accuracy was determined by computing a correlation coefficient (Pearson's r) between the actual and reconstructed speech envelopes for each three-second trial. For imagined speech, reconstruction accuracy was significantly above zero for both speech segments (p = 0.001, p = 0.034), with mean r-values of 0.062 and 0.035, respectively. Unsurprisingly, reconstruction accuracy for perceived speech was also significantly above chance (p < 0.005), with greater mean r-values of 0.124 and 0.092, respectively. Taking the average of all 75 reconstructions produced stronger correlation values for both imagined speech (r = 0.30, r = 0.19) and perceived speech (r = 0.51, r = 0.40; Figure 1). By analysing the topographic distribution of each decoder's parameter weights, it is possible to ascertain which electrodes contribute most to reconstruction accuracy across time. For perceived speech, as expected, two bilateral foci were present over the temporal cortex of both hemispheres, consistent with activation of auditory cortex. For imagined speech, a similar pattern was also evident (Figure 2).
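A minimal NumPy sketch of such a decoder; the lag count and ridge regularization value are assumptions, as the report does not state its exact regression settings:

```python
import numpy as np

def lagged(eeg, n_lags):
    """Design matrix of time-lagged EEG, shape (n_samples, n_channels * n_lags)."""
    n_ch, n_s = eeg.shape
    X = np.zeros((n_s, n_ch * n_lags))
    for l in range(n_lags):
        X[l:, l * n_ch:(l + 1) * n_ch] = eeg[:, :n_s - l].T
    return X

def train_decoder(eeg_trials, envelope, n_lags=10, ridge=1e3):
    """Regularised least-squares decoder from EEG to the speech envelope,
    in the spirit of the stimulus-reconstruction method above."""
    X = np.vstack([lagged(tr, n_lags) for tr in eeg_trials])
    y = np.tile(envelope, len(eeg_trials))
    XtX = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ y)

def reconstruction_accuracy(decoder, eeg, envelope, n_lags=10):
    """Pearson r between the actual and reconstructed envelopes."""
    est = lagged(eeg, n_lags) @ decoder
    return np.corrcoef(est, envelope)[0, 1]
```

Training on all trials but one and scoring the held-out trial, repeated per trial, gives the leave-one-out r-values reported above.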

These results provide evidence for the representation of a speech envelope in auditory cortex during imagined speech.

Figure 1. Perceived and Imagined reconstructions after averaging across all 75 trials.

Figure 2. Topographic distribution of decoder weights, indicating areas of neuronal activation across time.

Steady-State Auditory Evoked Potentials (SSAEP)

Steady-state auditory evoked potentials (SSAEP) are elicited when amplitude-modulated tones are presented to a listener. The auditory cortex entrains to the signal, phase-locking to the modulation frequency. These potentials are then detectable in EEG as increases in power at the modulation frequency. The best modulation frequency for eliciting SSAEPs is 40 Hz, though they have been detected at frequencies as low as 4 Hz (Alaerts et al. 2009).
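Detecting this power increase is straightforward. The sketch below, on synthetic data (signal parameters are illustrative, not from our recordings), averages the power spectrum across channels and compares power at candidate modulation frequencies:

```python
import numpy as np

def modulation_power(eeg, fs, freqs):
    """Average the power spectrum across channels and read off the power at
    each candidate modulation frequency (nearest FFT bin)."""
    spec = np.abs(np.fft.rfft(eeg, axis=-1)) ** 2
    spec = spec.mean(axis=0)                  # average across channels
    f = np.fft.rfftfreq(eeg.shape[-1], 1 / fs)
    return {fm: spec[np.argmin(np.abs(f - fm))] for fm in freqs}

# Hypothetical check on synthetic data: channels entrained at 4 Hz
fs, dur = 100, 20
t = np.arange(fs * dur) / fs
rng = np.random.default_rng(0)
eeg = rng.standard_normal((8, len(t))) + np.sin(2 * np.pi * 4 * t)
powers = modulation_power(eeg, fs, [4.0, 6.0, 40.0])
```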

SSAEPs were chosen as a paradigm for investigating imagined auditory stimuli for three reasons. First, they don't require precise triggering of stimulus onset, so the hardware demands were lower than for continuously varying signals such as music or speech. Second, the feature space can be constrained to just changes in frequency power. Third, whereas imagined speech may contain a strong motor component related to the muscle activity of speech production, imagined modulated tones should not contain any motor signals.


The SSAEP task consisted of five repetitions of perceived-imagined pairs for each presented tone. First the subject was instructed to listen to a presented tone for 20 seconds; then, after a two-second cue period, to imagine hearing that tone for an additional 20 seconds. The presented tones were:

  1. a 500 Hz sawtooth modulated at 4 Hz,
  2. a 769.231 Hz sawtooth modulated at 6 Hz, and
  3. a 300 Hz sawtooth modulated at 40 Hz.

All modulation was at 100% of signal amplitude. The first two carrier frequencies were chosen based on work by Hill and Schölkopf (2012), while the third was chosen arbitrarily to be different from the other two. The modulation frequencies of 4 Hz and 6 Hz were based on the hypothesis that subjects would not be able to successfully imagine modulations at higher frequencies. The 40 Hz modulation was selected as a baseline, given it is the best frequency for eliciting SSAEPs with perceived audio.

A total of six subjects performed the SSAEP experiment. All six performed the 4Hz and 6Hz tasks, while two subjects also performed the 40Hz task.


The attached figure shows the averaged frequency response of the raw EEG across all electrodes and trials for subject 5 for the three signals. The columns correspond to the perceived and imagined tasks, while the rows correspond to 3 Hz windows centered around the SSAEP frequency of interest for each signal. In each perceived case, the presented modulation frequency shows greater energy than the others at the expected frequency, especially for 40 Hz. In the imagined case, there may be evidence of a related response. In the 4 Hz task, additional power in the 3.25–3.75 Hz range could be the result of the imagined sound. Likewise, in the 6 Hz task, increased power is seen around 5.5 Hz. The spike at 6.5 Hz is present in all signals and is unlikely to be related to the specific task. Finally, in the 40 Hz task, while less visible, a difference in the 40 Hz signal relative to the others may be indicative of some imagined activity. Without additional trials and subjects, nothing definitive can be claimed about the imagined results, though this pilot study does suggest that additional data collection is warranted.

Musical Examples

The musical samples were designed under three experimental constraints: I) simple and easy for non-musicians to memorize and reproduce; II) short (about 3-4 s); III) having uncorrelated and pronounced envelopes.

Two approaches were used:

- Using fragments of a well-known musical theme: the Imperial March from the Star Wars films, created using a synthesized piano under MIDI control.

- Creating simple melodic fragments with acoustic samples from a pizzicato violin:

1. A repeated upward melodic fragment containing three notes (Db-Eb-F) at 120 BPM (beats per minute).

2. A repeated downward melodic fragment containing three notes (B-Bb-A) at 90 BPM.


We appreciate the support we received from BrainVision, who provided the EEG amplifier and supplies that were key to this work. MathWorks provided a number of Matlab licenses, which enabled us to run experiments.


Alaerts J, Luts H, Hofmann M, Wouters J. Cortical auditory steady-state responses to low modulation rates. Int J Audiol. 2009 Aug;48(8):582-93.

Brainard, D. H. (1997) The Psychophysics Toolbox, Spatial Vision 10:433-436.

C.-C. Chang and C.-J. Lin. LIBSVM : a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011.

Delorme A & Makeig S (2004) EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics. Journal of Neuroscience Methods 134:9-21.

de Cheveigné, A. and Simon, J. Z. (2008). "Denoising based on spatial filtering." Journal of Neuroscience Methods 171: 331-339.

Ding, N., and Simon, J.Z.. Emergence of neural encoding of auditory objects while listening to competing speakers. Proceedings of the National Academy of Sciences of the United States of America 109, 11854-11859. (2012)

Eggermont JJ. Temporal modulation transfer functions in cat primary auditory cortex: separating stimulus effects from neural mechanisms. Journal of Neurophysiology 87:305-321. (2002)

Hill NJ, and Schölkopf B (2012) An online brain–computer interface based on shifting attention to concurrent streams of auditory stimuli, Journal of Neural Engineering 9(2).

Lalor EC, Foxe JJ. Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution. European Journal of Neuroscience 31:189-193. (2010)

Luo H, Poeppel D. Phase Patterns of Neuronal Responses Reliably Discriminate Speech in Human Auditory Cortex. Neuron 54:1001-1010.(2007)

Mesgarani, N., David, S.V., Fritz, J.B., and Shamma, S.A.. Influence of Context and Behavior on Stimulus Reconstruction From Neural Activity in Primary Auditory Cortex. Journal of Neurophysiology 102, 3329-3339. (2009)

Mesgarani, N., and Chang, E.F.. Selective cortical representation of attended speaker in multi-talker speech perception. Nature 485, 233-U118. (2012)

Oostenveld R, Fries P, Maris E, Schoffelen JM. FieldTrip: Open Source Software for Advanced Analysis of MEG, EEG, and Invasive Electrophysiological Data. Comput Intell Neurosci. 2011.

Pelli, D. G. (1997) The VideoToolbox software for visual psychophysics: Transforming numbers into movies, Spatial Vision 10:437-442.