It’s kind of ironic that we spend so much time and effort trying to eliminate the noise problem in fMRI research on auditory speech perception when we do most of our everyday speech comprehension in noisy environments. In fact, one could argue that we are getting a more veridical picture of the neural networks supporting speech perception when we use standard pulse sequences than when we use sparse or clustered acquisition. (I am TOTALLY going to use this argument the next time a reviewer says my study is flawed because I didn’t use sparse sampling!) This is why I think it is a brilliant strategy to develop a research program to study speech processing in noise using fMRI, as Lee Miller at UC Davis has done. Not only is it the case that speech is typically processed in noisy environments (ecological validity) and that processing speech in noise is often disproportionately affected by damage to the auditory/speech system (clinical applicability), but fMRI is really noisy. Brilliant!
A recent fMRI study by Bishop and Miller tackled this issue by adding even more noise to speech stimuli. They asked subjects to judge the intelligibility of meaningless syllables (the signal) presented in a constant babble (the noise) created by mixing speech from 16 talkers. They manipulated the intensity (loudness) of the signal to create a range of signal-to-noise ratios. In addition, they presented visual speech in the form of a moving human mouth that was either synchronous with the speech signals or temporally offset. The visual speech information was blurred to preserve the gross temporal envelope information but obscure the fine details, so that syllable identification could not be achieved using visual information alone. They also had an auditory-only condition.
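To make the SNR manipulation concrete, here is a minimal sketch of the general idea: hold the babble constant and rescale the signal until it sits at a chosen SNR. This is not the authors' stimulus code. The function names, the SNR levels, and the synthetic "babble" and "syllable" waveforms are all assumptions made up for illustration.

```python
import math
import random

def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def scale_signal_to_snr(signal, noise, snr_db):
    """Rescale `signal` so that 20*log10(rms(signal)/rms(noise)) equals snr_db.

    The noise is left untouched, mirroring a design in which the babble is
    constant and only the signal intensity changes.
    """
    target_rms = rms(noise) * 10 ** (snr_db / 20)
    gain = target_rms / rms(signal)
    return [gain * x for x in signal]

def measured_snr_db(signal, noise):
    """SNR in dB computed from the RMS amplitudes of signal and noise."""
    return 20 * math.log10(rms(signal) / rms(noise))

rng = random.Random(0)
n_samples = 2000

# Stand-in for 16-talker babble: the sum of 16 independent random streams.
# Real babble would be recorded speech; this is only a placeholder.
babble = [sum(rng.gauss(0, 1) for _ in range(16)) for _ in range(n_samples)]

# Stand-in for a syllable: a plain sinusoid (hypothetical).
syllable = [math.sin(2 * math.pi * 5 * t / n_samples) for t in range(n_samples)]

snr_levels_db = [-12, -6, 0, 6]  # hypothetical levels, not taken from the paper
mixtures = {}
achieved = {}
for snr in snr_levels_db:
    scaled = scale_signal_to_snr(syllable, babble, snr)
    mixtures[snr] = [s + b for s, b in zip(scaled, babble)]
    achieved[snr] = measured_snr_db(scaled, babble)
    print(snr, round(achieved[snr], 6))
```

Scaling the signal rather than the noise keeps the background identical across trials, so any change in the response can be attributed to the signal level.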
They did a couple of analyses. The one that would seem most interesting and the one that was emphasized in the paper was the contrast between stimuli that were judged intelligible versus those that were judged unintelligible, collapsed across the visual speech conditions. Oddly, no region in the superior temporal lobes in either hemisphere showed an effect of intelligibility. This is completely unexpected given the very high-profile finding by Scott and her colleagues, who claim to have identified a pathway for intelligible speech in the left anterior temporal lobe (Scott et al. 2000). Instead, Bishop & Miller found an intelligibility effect bilaterally in the temporal-occipital boundary (in the vicinity of the angular gyrus), in the left medial temporal lobe, right hippocampus, left superior parietal lobule, left posterior intraparietal sulcus, left superior frontal sulcus, bilateral postcentral gyrus, and bilateral putamen -- not your typical speech network!
They then assessed audiovisual contributions to intelligibility by looking for regions that show both an intelligibility effect (intelligible > unintelligible) and an audiovisual effect (synchronous AV > temporally offset AV). This conjunction led to activation in a subset of the intelligibility network including left medial temporal lobe, bilateral temporal-occipital boundary, left posterior inferior parietal lobule, left precentral sulcus, bilateral putamen, and right postcentral gyrus.
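For readers unfamiliar with conjunction analyses, the logic is just a voxel-wise AND. How the authors actually computed theirs is not described here, so the sketch below assumes one common approach: keep a voxel only if the smaller of its two statistics still passes threshold. The voxel labels, statistic values, and threshold are invented for illustration.

```python
def conjunction(map_a, map_b, threshold):
    """Return voxels whose statistic exceeds `threshold` in BOTH maps.

    Taking the minimum of the two statistics and thresholding it is the
    same as requiring each contrast to pass the threshold on its own.
    """
    shared = map_a.keys() & map_b.keys()
    return sorted(v for v in shared if min(map_a[v], map_b[v]) > threshold)

# Hypothetical statistics per voxel for each contrast (not real data).
intelligibility_t = {"voxel_1": 4.2, "voxel_2": 3.9, "voxel_3": 1.1, "voxel_4": 5.0}
synchrony_t = {"voxel_1": 3.5, "voxel_2": 0.8, "voxel_3": 4.4, "voxel_4": 3.2}

THRESHOLD = 3.1  # hypothetical cutoff
both = conjunction(intelligibility_t, synchrony_t, THRESHOLD)
print(both)  # ['voxel_1', 'voxel_4']
```

By construction the result can only be a subset of the voxels that pass the intelligibility contrast on its own, which is why the conjunction network is a subset of the intelligibility network.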
The authors discuss the possible role of medial temporal lobe structures in speech “understanding” (e.g., “evaluating cross-modal congruence at an abstract representational level”) as well as the role of the temporal-occipital boundary (e.g., “object recognition … based on features from both auditory and visual modalities”).
But what interests me most about this study, and what I think is the most important contribution, is not the intelligibility contrast but their acoustic control. Recall that they parametrically manipulated the signal-to-noise ratio (SNR). They ran an analysis to see what correlated with this SNR variable. The goal was to see if SNR could explain their intelligibility effect. The answer was no: “the BOLD time course for our understanding network was not adequately explained by SNR variance.” But the regions that did correlate with SNR turned out to be a familiar set of speech-related regions: bilateral STG and STS, and left MTG!
What I think this study has actually shown is that phonological perceptibility is strongly correlated with activity in a bilateral superior temporal lobe network (SNR variable) and that the “understanding network” reflects those top-down (or higher-level) factors that influence how the phonological information is used (e.g., to make an intelligibility decision). Of interest in this respect is the high degree of overlap in the distribution of SNR values judged to be intelligible (red) versus unintelligible (blue).
Because there was so much overlap in these distributions, the contrast between intelligible and unintelligible yielded no effect in regions that were responding to the phonemic information in the signal.
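This reasoning can be checked with a toy simulation. It is a sketch under made-up assumptions, not a model of the actual data: a hypothetical region responds linearly to SNR plus noise, while the intelligibility judgment depends only weakly on SNR and mostly on an independent factor, so the SNR distributions for the two judgments overlap heavily. All parameter values are invented.

```python
import math
import random
import statistics

rng = random.Random(1)
SNR_LEVELS_DB = [-12, -9, -6, -3, 0, 3, 6]  # hypothetical levels
N_TRIALS = 2000

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

snr, response, intelligible = [], [], []
for _ in range(N_TRIALS):
    s = rng.choice(SNR_LEVELS_DB)
    snr.append(s)
    # Assumed "SNR-tracking" region: response follows SNR plus noise.
    response.append(s + rng.gauss(0, 2))
    # Assumed judgment: weak SNR dependence, dominated by an independent
    # factor, which produces heavily overlapping SNR distributions.
    z = (s + 3) / 6  # SNR centered and scaled to roughly unit variance
    intelligible.append(0.1 * z + rng.gauss(0, 1) > 0)

judged_yes = [r for r, j in zip(response, intelligible) if j]
judged_no = [r for r, j in zip(response, intelligible) if not j]

snr_correlation = pearson(snr, response)
# Difference in mean response between judgments, in standard-deviation units.
contrast_effect_size = (
    statistics.mean(judged_yes) - statistics.mean(judged_no)
) / statistics.pstdev(response)

print("correlation with SNR:", round(snr_correlation, 3))
print("intelligible - unintelligible (SD units):", round(contrast_effect_size, 3))
```

Under these assumptions the region tracks SNR almost perfectly, yet the intelligible versus unintelligible contrast in that same region is small, because the two sets of trials cover nearly the same range of SNR values.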
In sum, I think this study nicely supports the view that phonemic aspects of speech perception are bilaterally organized in the superior temporal lobe, but goes further to outline a network of regions that provide top-down/higher-level constraints on how this information is used in performing, in this case, a cross-modal integration task.
Bishop, C., & Miller, L. (2009). A Multisensory Cortical Network for Understanding Speech in Noise. Journal of Cognitive Neuroscience, 21(9), 1790-1804. DOI: 10.1162/jocn.2009.21118
Scott, S. (2000). Identification of a pathway for intelligible speech in the left temporal lobe. Brain, 123(12), 2400-2406. DOI: 10.1093/brain/123.12.2400