Tuesday, January 26, 2010

Intelligible speech and hierarchical organization of auditory cortex

It has been suggested that auditory cortex is hierarchically organized, with the highest levels of this hierarchy -- for speech processing, anyway -- located in left anterior temporal cortex (Rauschecker & Scott, 2009; Scott et al., 2000). Evidence for this view comes from PET and fMRI studies that contrast intelligible speech with unintelligible speech and find a prominent focus of activity in the left anterior temporal lobe (Scott et al., 2000). The intelligible conditions (typically sentences) have included clear speech and noise vocoded variants, which are acoustically different but both intelligible, whereas the unintelligible conditions have included spectrally rotated versions of these stimuli. The idea is that regions responding to both intelligible conditions are exhibiting acoustic invariance, i.e., responding to the higher-order categorical information (phonemes, words), and therefore reflect high levels in the auditory hierarchy.

However, the anterior focus of activation contradicts lesion evidence which shows that damage to posterior temporal lobe regions is most predictive of auditory comprehension deficits in aphasia. Consequently, we have argued that the anterior temporal lobe activity in these studies is more a reflection of the fact that subjects are comprehending sentences -- which are known to activate anterior temporal regions more than words alone -- than intelligibility of speech sounds and/or words (Hickok & Poeppel, 2004, 2007). Therefore, our claim has been that the top of the auditory hierarchy for speech (regions involved in phonemic level processes) is more posterior.

To assess this hypothesis we fully replicated previous intelligibility studies using two intelligible conditions, clear sentences and noise vocoded sentences, and two unintelligible conditions, rotated versions of these. But instead of using standard univariate methods to examine the neural response, we used multivariate pattern analysis (MVPA) to assess regional sensitivity to acoustic variation within and across intelligibility manipulations.

We did perform the usual general linear model subtraction, intelligible minus unintelligible [(clear + noise vocoded) - (rotated + rotated noise vocoded)], and found robust activity in the left anterior superior temporal sulcus (STS), but also in the left posterior STS and in the right anterior and posterior STS. This shows that intelligible-speech activity is not restricted to anterior areas, or even to the left hemisphere: a broader bilateral network is involved.

Next we examined the pattern of response in various activated regions using MVPA. MVPA looks at the pattern of activity within a region rather than the pooled amplitude of the region as a whole. If different patterns of activity can be reliably demonstrated in a region, this is an indication that the manipulated features (e.g., acoustic variation in our case) are being coded or processed differently within the region.
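To make the logic concrete, here is a minimal sketch of an ROI-based pattern classifier of the general kind used in MVPA. This is illustrative only: the data are simulated, and the classifier (nearest-mean with leave-one-out cross-validation) is a stand-in for whatever pipeline the paper actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

def classify_loo(patterns_a, patterns_b):
    """Leave-one-out nearest-mean pattern classification.

    patterns_a, patterns_b: (n_trials, n_voxels) arrays of voxel
    responses within an ROI for two conditions. Returns the fraction
    of held-out trials assigned to the correct condition.
    """
    X = np.vstack([patterns_a, patterns_b])
    y = np.array([0] * len(patterns_a) + [1] * len(patterns_b))
    correct = 0
    for i in range(len(X)):
        train = np.delete(np.arange(len(X)), i)      # hold out trial i
        m0 = X[train[y[train] == 0]].mean(axis=0)    # mean pattern, cond A
        m1 = X[train[y[train] == 1]].mean(axis=0)    # mean pattern, cond B
        pred = 0 if np.linalg.norm(X[i] - m0) < np.linalg.norm(X[i] - m1) else 1
        correct += int(pred == y[i])
    return correct / len(X)

# Simulated ROI: 50 voxels whose *pattern* differs between conditions
# while the pooled amplitude is roughly matched, mimicking a region
# where a GLM contrast looks flat but MVPA still classifies.
base = rng.normal(size=50)               # shared response pattern
effect = rng.normal(scale=0.8, size=50)  # condition-specific pattern
cond_a = base + effect + rng.normal(scale=0.5, size=(20, 50))
cond_b = base - effect + rng.normal(scale=0.5, size=(20, 50))
acc = classify_loo(cond_a, cond_b)       # well above 50% chance
```

Because the two simulated conditions have roughly the same pooled amplitude, an amplitude-based comparison would show little difference, while the pattern classifier separates them easily; that is exactly the distinction between the GLM and MVPA analyses described here.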

The first thing we looked at was whether the pattern of activity in and immediately surrounding Heschl's gyrus was sensitive to intelligibility and/or acoustic variation. This is an important prerequisite for claiming acoustic invariance, and therefore higher-order processing, in downstream auditory areas: if you want to claim that invariance to acoustic features downstream reflects higher levels of processing in the cortical hierarchy, you need to show that earlier auditory areas are sensitive to those same acoustic features. So we defined early auditory cortex independently using a localizer scan: amplitude-modulated noise (8 Hz modulation rate) contrasted with scanner noise. The figure below shows the location of this ROI (roughly, as this is a group image; for all MVPA analyses, ROIs were defined in individual subjects) and the average BOLD amplitude for the various speech conditions. Notice that we see similar levels of activity for all conditions, especially clear speech and rotated speech, which appear to yield identical responses in Heschl's gyrus. This seems to provide evidence that rotated speech is indeed a good acoustic control for speech.

However, using MVPA we found that the pattern of activity in Heschl's gyrus (HG) easily distinguished clear speech from rotated speech; that is, HG is responding to these conditions differently. In fact, HG could distinguish every condition from every other, including the within-intelligibility contrasts such as clear vs. noise vocoded (both intelligible) and rotated vs. rotated noise vocoded (both unintelligible). It appears that HG is sensitive to the acoustic variation between our conditions. The figure below shows classification accuracy for the various MVPA contrasts in left and right HG. The thick black line indicates chance performance (50%), whereas the thinner line indicates the upper bound of the 95% confidence interval determined via a bootstrapping method.
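The paper's exact bootstrapping procedure isn't spelled out in the post; as a hedged illustration, one simple way to get such an upper bound is to resample the accuracies a coin-flip (chance) classifier would produce over the available test trials and take the 95th percentile. The trial count below is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def chance_upper_bound(n_trials, n_resamples=10000, alpha=0.05):
    """Upper bound of the (1 - alpha) interval on chance accuracy.

    Resamples the accuracies a 50/50 classifier would produce over
    n_trials test trials. An observed classification accuracy above
    this bound is unlikely to have arisen from a classifier that is
    actually performing at chance.
    """
    null_accs = rng.binomial(n_trials, 0.5, size=n_resamples) / n_trials
    return float(np.quantile(null_accs, 1.0 - alpha))

# With fewer trials the bound sits further above 50%, so sparse
# designs need larger effects to clear it.
bound = chance_upper_bound(n_trials=40)
```

This makes clear why the upper-bound line in the figures sits above 50%: with a finite number of test trials, even a chance classifier will sometimes score noticeably better than 50% by luck.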

Again, this highlights the fact that standard GLM analyses obscure a lot of information contained in areas that appear to be insensitive to the manipulations we impose.

So what about the STS? Here we defined ROIs in each subject using the clear minus rotated contrast, i.e., the conditions that showed no difference in average amplitude in HG. ROIs were anatomically categorized in each subject as "anterior" (anterior to HG), "middle" (lateral to HG), or "posterior" (posterior to HG). In a majority of subjects, we found peaks in anterior and posterior STS in the left hemisphere (but not in the mid STS), and peaks in the anterior, middle, and posterior STS in the right hemisphere. ROIs were defined using half of our data and MVPA was performed on the other half, which ensured complete statistical independence.
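The split-half logic can be sketched as follows (simulated data; the actual study defined ROIs from contrast peaks, not the simple top-voxel selection used here). Voxels are selected using one half of the runs, and only patterns from the held-out half would ever reach the classifier, so the selection step cannot bias the MVPA result:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated betas: (n_runs, n_trials, n_voxels) for two conditions.
n_voxels = 200
clear = rng.normal(size=(8, 10, n_voxels))
rotated = rng.normal(size=(8, 10, n_voxels))
clear[..., :30] += 1.0  # voxels 0-29 respond more to clear speech

# ROI-definition half: runs 0-3. Pick the 30 voxels with the largest
# clear-minus-rotated difference (a stand-in for a contrast peak).
diff = clear[:4].mean(axis=(0, 1)) - rotated[:4].mean(axis=(0, 1))
roi = np.argsort(diff)[-30:]

# MVPA half: runs 4-7, restricted to the ROI voxels. Only these
# held-out patterns would be passed to the classifier.
held_clear = clear[4:].reshape(-1, n_voxels)[:, roi]
held_rotated = rotated[4:].reshape(-1, n_voxels)[:, roi]
```

Without this split, selecting voxels and classifying on the same data would let noise that drove the selection leak into the classification, inflating accuracy.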

Here are the classification accuracy graphs for each of the ROIs. The left two bars in each graph show across-intelligibility contrasts (clear vs. rotated & noise vocoded vs. rotated NV). These comparisons should classify above chance if the area is sensitive to the difference in intelligibility. The right two bars show within-intelligibility contrasts (clear vs. NV, both intelligible; rotated vs. rotated NV, both unintelligible). These comparisons should NOT classify above chance if the ROI is acoustically invariant.

Looking first at the left hemisphere ROIs, notice that both anterior and posterior regions classify the across intelligibility contrasts (as expected). But the anterior ROI also classifies clear vs. noise vocoded, two intelligible conditions. The posterior ROI does not classify either of the within intelligibility contrasts. This suggests that the posterior ROI is the more acoustically invariant region.

The right hemisphere shows a different pattern in this analysis. The right anterior ROI shows a pattern that is acoustically invariant whereas the mid and posterior ROIs classify everything, every which way, more like HG.

If you look at the overall pattern within the graphs across areas, you'll notice a problem with the above characterization of the data: it categorizes a contrast as classifying or not and doesn't take into account the magnitude of the effects. For example, notice that as one moves from aSTS to mSTS in the right hemisphere, classification accuracy for the across-intelligibility contrasts rises (as it does in the left hemisphere), and that in the right aSTS clear vs. NV just misses significance, whereas in the mSTS clear vs. NV barely passes significance. We may be dealing with thresholding effects. This suggests that we need a better way of characterizing acoustic invariance, one that uses all of the data.

So what we did is calculate an "acoustic invariance index," which basically measures the magnitude of the intelligibility effect (the left two bars compared with the right two bars). This difference should be large if an area is coding features relevant to intelligibility. This measure was then corrected by the "acoustic effect" (the sum of the absolute differences in classification accuracy within intelligibility conditions). When you do this, here is what you get (acoustic invariance = positive values; range -1 to 1):
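One concrete form such an index could take (a hypothetical reconstruction; the exact formula in the paper may differ) is a normalized difference between the across-intelligibility and within-intelligibility classification effects, which, for accuracies at or above chance, is bounded between -1 and 1:

```python
def invariance_index(across, within, chance=0.5):
    """Hypothetical acoustic invariance index (illustrative only).

    across: accuracies for the two across-intelligibility contrasts
            (clear vs. rotated, NV vs. rotated NV)
    within: accuracies for the two within-intelligibility contrasts
            (clear vs. NV, rotated vs. rotated NV)
    Positive values indicate acoustic invariance: the region classifies
    intelligibility but not acoustic variation within it. Bounded in
    [-1, 1] when all accuracies are at or above chance.
    """
    intelligibility_effect = sum(a - chance for a in across)
    acoustic_effect = sum(abs(w - chance) for w in within)
    total = intelligibility_effect + acoustic_effect
    return (intelligibility_effect - acoustic_effect) / total if total else 0.0

# An invariant region: strong across-intelligibility classification,
# within-intelligibility contrasts essentially at chance.
invariant = invariance_index(across=[0.80, 0.85], within=[0.50, 0.52])
# An HG-like region: everything classifies, so the index is near zero.
hg_like = invariance_index(across=[0.90, 0.90], within=[0.85, 0.90])
```

The example values are invented, but they show the intended behavior: a region that classifies only intelligibility scores near 1, while a region that classifies everything (like HG) scores near 0.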

HG is the most sensitive to acoustic variation across conditions, and more posterior areas (pSTS in the left hemisphere, mSTS in the right) are the least sensitive. The aSTS ROIs fall in between these extremes. So left pSTS and right mSTS, as we've defined them anatomically, appear to be functionally homologous and represent the top of the auditory hierarchy for phoneme-level processing. I don't know what is going on in right pSTS.

What features are these areas sensitive to? My guess is that HG is sensitive to any number of acoustic features within the signals, aSTS is sensitive to suprasegmental prosodic features, and pSTS is sensitive to phoneme level features. Arguments for these ideas are provided in the manuscript.


Okada, K., Rong, F., Venezia, J., Matchin, W., Hsieh, I., Saberi, K., Serences, J., & Hickok, G. (2010). Hierarchical organization of human auditory cortex: Evidence from acoustic invariance in the response to intelligible speech. Cerebral Cortex. DOI: 10.1093/cercor/bhp318

Hickok, G., & Poeppel, D. (2004). Dorsal and ventral streams: A framework for understanding aspects of the functional anatomy of language. Cognition, 92, 67-99.

Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nat Rev Neurosci, 8(5), 393-402.

Rauschecker, J. P., & Scott, S. K. (2009). Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing. Nat Neurosci, 12(6), 718-724.

Scott, S. K., Blank, C. C., Rosen, S., & Wise, R. J. S. (2000). Identification of a pathway for intelligible speech in the left temporal lobe. Brain, 123, 2400-2406.


Ellen Lau said...

very cool study!

so in the end of your post you hypothesize about which features these areas are sensitive to--acoustic, prosodic, phonological. what about semantic features? besides the sort of trend towards classification of rotated conditions in pSTS, is there any reason from these data to conclude that pSTS is processing phonological rather than semantic information?

Greg Hickok said...

Hi Ellen,

Well, if you look at the classification accuracy in the left pSTS, you see that the rotated vs. rotated noise vocoded contrast (both unintelligible speech) just missed the 95% CI, which suggests that this region may be responding somewhat differently to these conditions. It has been argued that rotated speech contains some degree of phonemic information. Thus the rotated versus rotated noise vocoded conditions may differ mildly in their phonological content, but not in their semantic content, and this may be driving the trend toward classification. Notice that the left aSTS doesn't show this trend (the p-value wasn't even close). This is consistent with the idea that pSTS is coding phonological information.

Can we rule out some sort of semantic processing on the basis of these data? No, not really. But the fact that "semantic processing" seems to implicate more ventral and posterior areas in other studies, whereas phonological effects appear in the STS, helps constrain our interpretation.

tom said...

Hi Greg,

Congrats on the paper! Lots of really interesting stuff here. I'll need to read the paper a few times, but a couple of questions/comments popped up on first reading, and I hope you'll be happy to discuss them:

1) I read (and really liked) the recent von Kriegstein JoN paper that you mentioned a couple of weeks ago. They show some interesting effects of speaker-related vocal tract parameters (i.e. speaker identity information) on some of the ROIs you identify here in your study (specifically pSTS). In their paper they suggest that Vocal Tract Length (VTL) information, extracted in part by the right pSTS, is used to help constrain processing of faster speech dynamics in the left pSTG/S. It's my understanding that speaker identity info (including VTL) is abolished with spectrally rotated speech, but is present in vocoded speech. I'd like to suggest that your findings in the right pSTS might reflect processing of this information. What is particularly neat about this hypothesis is that it explains both (i) why R pSTS distinguishes between clear speech and vocoded speech and their rotated counterparts -- i.e. clear speech and vocoded speech contain this info, rotated speech doesn't -- and (ii) why classification accuracy in right pSTS is greater for vocoded > rotated vocoded than it is for clear speech > rotated speech. This is because the fast dynamic information present in vocoded speech is degraded relative to clear speech, so more use must be made of VTL info (i.e. right pSTS) to constrain processing in the left hemisphere. I can't think of another explanation for this finding; would be interested to hear your ideas.

2) The peak for the group-level intelligibility contrast is very anterior (y = 0), and seems pretty close to STS (some of your individual subject aSTS ROIs are within a couple of mms of this peak) - assuming that the intelligibility contrasts captures acoustic invariance, how does this fit with the posterior STS being at the top of the acoustic invariance hierarchy?

Best wishes,


Greg Hickok said...

Hi Tom,
Thanks for the thoughtful comments. Regarding your first point, I think your idea about VTL (or something similar) is a reasonable one. This is the kind of thing we need to look at in order to sort out what might be going on. One complication: I don't think rot vs. rotNV should classify in an area that is coding this kind of information, yet it does in right pSTS.

Regarding your second point, one thing that this study shows is that group level intelligibility contrasts are not necessarily capturing acoustic invariance in an optimal manner. We argued that MVPA does a better job and this approach points to posterior areas.