Tuesday, September 15, 2009

A multisensory cortical network for understanding speech in noise

It’s kind of ironic that we spend so much time and effort trying to eliminate the noise problem in fMRI research on auditory speech perception when we do most of our everyday speech comprehension in noisy environments. In fact, one could argue that we are getting a more veridical picture of the neural networks supporting speech perception when we use standard pulse sequences than when we use sparse or clustered acquisition. (I am TOTALLY going to use this argument the next time a reviewer says my study is flawed because I didn’t use sparse sampling!) This is why I think it is a brilliant strategy to develop a research program to study speech processing in noise using fMRI, as Lee Miller at UC Davis has done. Not only is it the case that speech is typically processed in noisy environments (ecological validity) and that processing speech in noise is often disproportionately affected by damage to the auditory/speech system (clinical applicability), but fMRI is also really noisy. Brilliant!

A recent fMRI study by Bishop and Miller tackled this issue by adding even more noise to speech stimuli. They asked subjects to judge the intelligibility of meaningless syllables (the signal) presented in a constant babble (the noise) created by mixing speech from 16 talkers. They manipulated the intensity (loudness) of the signal to create a range of signal-to-noise ratios. In addition, they presented visual speech in the form of a moving human mouth that was either synchronous with the speech signals or temporally offset. The visual speech information was blurred to preserve the gross temporal envelope information but obscure the fine details, so that syllable identification could not be achieved using visual information alone. They also had an auditory-only condition.
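For readers who want to see what "manipulating the intensity of the signal to create a range of signal-to-noise ratios" amounts to, here is a minimal sketch. It is not the authors' stimulus code: the sine wave standing in for a syllable, the Gaussian noise standing in for the 16-talker babble, the SNR levels, and the function names are all assumptions made up for illustration.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, computed from RMS amplitudes."""
    return 20 * math.log10(rms(signal) / rms(noise))


def scale_to_snr(signal, noise, target_db):
    """Rescale the signal (the noise is left untouched) to reach target_db."""
    gain = rms(noise) * 10 ** (target_db / 20) / rms(signal)
    return [gain * x for x in signal]


rng = random.Random(0)
n_samples = 4000

# Hypothetical stand-ins: Gaussian noise for the babble, a sine for the syllable.
noise = [rng.gauss(0, 1) for _ in range(n_samples)]
signal = [math.sin(2 * math.pi * 5 * t / n_samples) for t in range(n_samples)]

# Hypothetical SNR levels in dB; the paper's actual values are not given here.
levels = [-12, -6, 0, 6]

mixtures = {}
for level in levels:
    scaled = scale_to_snr(signal, noise, level)
    # The noise level is constant across conditions; only the signal changes.
    mixtures[level] = [s + n for s, n in zip(scaled, noise)]
    print(level, round(snr_db(scaled, noise), 6))
```

The design point is that the babble stays fixed and only the gain on the signal changes, which is what makes SNR usable as a parametric variable later on.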

They did a couple of analyses. The one that would seem most interesting and the one that was emphasized in the paper was the contrast between stimuli that were judged intelligible versus those that were judged unintelligible, collapsed across the visual speech conditions. Oddly, no region in the superior temporal lobes in either hemisphere showed an effect of intelligibility. This is completely unexpected given the very high-profile finding by Scott and her colleagues, who claim to have identified a pathway for intelligible speech in the left anterior temporal lobe (Scott et al. 2000). Instead, Bishop & Miller found an intelligibility effect bilaterally in the temporal-occipital boundary (in the vicinity of the angular gyrus), in the left medial temporal lobe, right hippocampus, left superior parietal lobule, left posterior intraparietal sulcus, left superior frontal sulcus, bilateral postcentral gyrus, and bilateral putamen -- not your typical speech network!



They then assessed audiovisual contributions to intelligibility by looking for regions that show both an intelligibility effect (intelligible > unintelligible) and an audiovisual effect (synchronous AV > temporally offset AV). This conjunction led to activation in a subset of the intelligibility network including left medial temporal lobe, bilateral temporal-occipital boundary, left posterior inferior parietal lobule, left precentral sulcus, bilateral putamen, and right postcentral gyrus.



The authors discuss the possible role of medial temporal lobe structures in speech “understanding” (e.g., “evaluating cross-modal congruence at an abstract representational level”) as well as the role of the temporal-occipital boundary (e.g., “object recognition … based on features from both auditory and visual modalities”).

But what interests me most about this study, and what I think is the most important contribution, is not the intelligibility contrast but their acoustic control. Recall that they parametrically manipulated the signal-to-noise ratio (SNR). They ran an analysis to see what correlated with this SNR variable. The goal was to see if SNR could explain their intelligibility effect. The answer was no: “the BOLD time course for our understanding network was not adequately explained by SNR variance.” But the regions that did correlate with SNR turned out to be a familiar set of speech-related regions: bilateral STG and STS, and left MTG!



What I think this study has actually shown is that phonological perceptibility is strongly correlated with activity in a bilateral superior temporal lobe network (SNR variable) and that the “understanding network” reflects those top-down (or higher-level) factors that influence how the phonological information is used (e.g., to make an intelligibility decision). Of interest in this respect is the high degree of overlap between the distributions of SNR values for stimuli judged to be intelligible (red) versus unintelligible (blue).



Because there was so much overlap in these distributions, the contrast between intelligible and unintelligible yielded no effect in regions that were responding to the phonemic information in the signal.
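To make that logic concrete, here is a toy simulation of the argument. It is a sketch under invented assumptions, not a reanalysis of the paper: the SNR levels, the slope linking SNR to the simulated response, and the weak dependence of the binary judgment on SNR are all hypothetical numbers chosen only so that the two judgment categories overlap heavily in SNR.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)
SNR_LEVELS = [-12, -9, -6, -3, 0]  # hypothetical levels, in dB
N_TRIALS = 4000

snr, response, judged = [], [], []
for _ in range(N_TRIALS):
    level = rng.choice(SNR_LEVELS)
    snr.append(level)
    # Simulated "region" whose response tracks SNR, plus measurement noise.
    response.append(0.2 * level + rng.gauss(0, 1))
    # Binary judgment only weakly tied to SNR, so the SNR distributions of
    # "intelligible" and "unintelligible" trials overlap heavily.
    p_intelligible = 0.5 + 0.02 * (level + 6)
    judged.append(1 if rng.random() < p_intelligible else 0)

r_snr = pearson(snr, response)        # parametric SNR relationship
r_judged = pearson(judged, response)  # binary intelligibility contrast
print("response vs SNR:     ", round(r_snr, 3))
print("response vs judgment:", round(r_judged, 3))
```

In this setup the simulated region should relate far more strongly to SNR than to the binary judgment, even though the region is, by construction, responding to the information in the signal. With fewer trials and noisier data, a weak judgment effect like this could easily fall below threshold.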

In sum, I think this study nicely supports the view that phonemic aspects of speech perception are bilaterally organized in the superior temporal lobe, but goes further to outline a network of regions that provide top-down/higher-level constraints on how this information is used in performing, in this case, a cross-modal integration task.

References

Bishop, C., & Miller, L. (2009). A Multisensory Cortical Network for Understanding Speech in Noise. Journal of Cognitive Neuroscience, 21 (9), 1790-1804. DOI: 10.1162/jocn.2009.21118

Scott, S., Blank, C., Rosen, S., & Wise, R. (2000). Identification of a pathway for intelligible speech in the left temporal lobe. Brain, 123 (12), 2400-2406. DOI: 10.1093/brain/123.12.2400

4 comments:

mcgyrus said...

Hi Greg,

I also found this paper interesting in terms of the use of the 'Understood'>'Heard' (U>H) contrast. The post-test run with a subset of participants shows that they were able to identify the 'heard' stimuli with mean 66% accuracy - pretty good going for a chance level of 25%. The 'understood' stimuli were identified with mean 87% accuracy. What this indicates to me (as the authors also note) is that the participants were responding 'Understood' in the scanner only when they could almost certainly identify the syllable, which in turn is reflected in a network of atypical activations for a speech study in the U>H contrast that could represent post-perceptual decision-making processes.

I should, however, point out that the authors do also report an activation in left anterior STS for the U>H contrast, albeit for auditory-only trials. This is very much in line with Scott et al. (2000). It's mentioned in the caption for Figure 2.

Finally, I think it is a bit too strong to interpret the bilateral activations in the SNR contrast as reflective of 'phonemic perceptibility', as the regressor hasn't been related to any behavioural measure of perception. I would rather have seen the authors modulate the SNR regressor for VCV identification scores (if these were available) instead of the fuzzier understood/heard judgement. This would perhaps have yielded a contrast image that could be more reflective of the participants' perceptual experience, and furthermore more interpretable in terms of the left-right balance of task-related processing.

Greg Hickok said...

Regarding the left anterior STS U>H result, it is puzzling that they found it only in the auditory-only condition. I don't know how to interpret that, but I don't think it supports Scott et al.'s position. Don't you think that a region that is critical to the perception of intelligible speech should activate whether there is visual information or not? Further, it is interesting that the aSTS showed up in the contrast that you suggest highlights "post-perceptual decision-making processes." No matter how you look at it, the result does not support Scott et al.'s view.

You are right that the SNR regressor was not related to any behavioral measure of perception, so my claim is probably too strong. However, I think it is a very safe bet to assume that if you asked Ss to rate perceptibility on an x-point scale (rather than a binary judgment), SNR would be strongly correlated with perceptibility ratings. Assuming this is true, the SNR variable does reflect phonemic perceptibility. Hey, that's a good idea for a study. Lee, you on it? :-)

Lee said...

Hi Greg and mcgyrus,
These are great suggestions, and we'll certainly keep them in mind for subsequent studies. Thanks so much for the input!
-Lee

Chris said...

Greg and mcgyrus,

Thank you so much for the encouraging comments! I apologize for the delay in my post!

Greg, I generally agree with you that the SNR regressor is a good indicator of available information or where target information is represented. However, how much of the SNR regressor activity has to do with 'phonemic perceptibility' or is specific to speech is difficult to say. We'd probably see similar results for target detection of non-speech stimuli (i.e., same paradigm, but a non-speech target), but the results would probably vary qualitatively and quantitatively in their spatial extent in earlier auditory areas, depending on which features are maintained in the non-speech target. In fact, it might be fun to use a paradigm like this to compare spatial extent of activity in early auditory areas using speech and non-speech stimuli (not terribly novel, maybe even a half-baked idea, but it might be fun). We took a stab at this using bird calls, but the data were underpowered and ill-suited to clarify this point (the experiment was meant to clarify other points), so we omitted them from the manuscript.

mcgyrus, thank you for pointing out the audio-only panel in Figure 2. I'd like to add that we did see the left anterior STS activity in the U>H contrast across all conditions prior to including the SNR regressor, but the activity was small (a handful of voxels) and borderline significant at FDR p<0.05. The SNR regressor accounted for enough variance to drop STS below significance across all conditions. However, we were interested in how our findings in anterior STS compared to other auditory-only studies and reported those data at a relaxed threshold.
Thank you again!
-Chris