Sunday, April 5, 2009

Neural Models of Speech Recognition

There still seems to be some confusion about what exactly is being claimed by various neuro-theorists regarding the functional architecture of speech recognition. This goes for the Dual Stream model as well. I just got back from a great visit to the University of Chicago where I had the opportunity to spend a lot of time talking to Steve Small, Howard Nusbaum, and Matt Goldrick (who came down from Northwestern to hang out for a while). We had some great discussions and I learned a lot. One issue that came up in these discussions was that it is not clear what everyone's position is on how speech recognition happens, particularly in regard to the relative role of the sensory and motor systems. So here is my attempt to clarify this.

There are at least three types of models out there: 1. auditory models, 2. motor models, and 3. sensory-motor models.

Here's my simplified cartoon of an auditory model:

This is closest to my view. The access route from sound input to the conceptual system does not flow through the motor system although the motor system can modulate activity in the sensory system.

Here's a cartoon of a motor theory:

Something like this has been promoted by Liberman in the form of the Motor Theory of speech perception, as well as by Fadiga. One comment I'm getting a lot lately (including from Luciano) is that no one really believes in the motor theory. So here's a quote from Fadiga & Craighero, Cortex (2006), 42, 486-490:

According to Liberman’s theory … the listener understands the speaker when his/her articulatory gestures representations are activated by the listening to verbal sounds. p. 487

Liberman’s intuition … that the ultimate constituents of speech are not sounds but articulatory gestures that have evolved exclusively at the service of language, seems to us a good way to consider speech processing in the more general context of action recognition. p. 489

On this view, the route from acoustic speech input to the conceptual system flows through the motor system.

Here is my cartoon of a sensory-motor model:

This seems to be what Fadiga has in mind based on his comments on this blog, namely that it is the "matching" of the sensory and motor systems that is critical for recognition to happen.

As Brad Buchsbaum pointed out, both a motor theory and a sensory-motor theory would predict that damage to the motor-speech system should produce substantial deficits in speech recognition. As this prediction doesn't hold up empirically, these theories in their strong forms are wrong.


Yisrael said...

Maybe part of the confusion lies in differences about that little cloud 'conceptual network'. It is indeed a somewhat amorphous item. Some opinions (Pulvermuller for instance) have concept representation intrinsically built into the cortices that process them. The concept 'kick' being represented in the motor cortex that controls foot movement. Looking at concepts of this type and coming to the conclusion that the motor cortex is essential in speech perception is viewing only part of the picture. Obviously not all concepts will be represented in the motor cortex.
How do the opinions that contend that speech perception involves the motor cortex understand the 'conceptual network'? Leaving speech aside, how do they understand conceptual representation?

Karthik Durvasula said...

I think there is a misunderstanding here of what Liberman & Mattingly's (1985) stance was. And I think a huge part of that misunderstanding is (unfortunately) terminological.

The term "gesture" used by L&M does not stand for actual articulatory movement. For them, gesture is more abstract, and it means "abstract vocal tract configuration" - which is quite removed from the actual articulatory event. To take an example, the gesture "labial stop" could be made in a variety of ways, by controlling the upper lip/lower lip/jaw... So, for L&M, what is being perceived (for linguistic purposes, at least) is the abstract gesture "labial stop", and not the actual articulatory event.

That this is in fact the intended meaning is clear from "...the articulatory movements-the peripheral realizations of the gesture" (pg. 4 of their paper.)

The view of speech perception they had, according to me, was this:

   conceptual network
   vocal-tract gesture
        /       \
 acoustics     articulation

The above diagram is maybe a little simplistic (and crude, given there are no fancy graphics). Because, for them the object of perception is the "vocal tract gesture", so any sensory source should, in theory, be informative (including visual, tactile in the extreme case). So, this view in the end sounds a lot more like option 3 - the sensory motor model, but with the additional claim that sensory-motor integration results in the perception of abstract vocal tract gestures.

Why does all this make a difference: because as it stands, the point of view put forward in "the motor theory of speech perception revised" is not at odds with the aphasia data. Loss of motor control does not mean loss of perception ability. This point was highlighted in a short paper by Mattingly "In defence of the Motor Theory".

To be sure, L&M's theory has many other problems, but the one that is being raised consistently in this blog here, isn't one of them.

Another cause for confusion is the unfortunate title "motor theory of speech perception", what they clearly mean is "gestural theory of speech perception", where the term "gesture" has a very specific (abstract) definition.

Greg Hickok said...

I think people are actually a bit confused about the relation between a theory of speech sound perception (at issue here) and a theory of conceptual representation. These are independent issues. Let's clarify.

Motor theories of speech perception and motor (embodied) theories of conceptual semantics are talking about two different stages of processing. The former deals with the processing of speech sound patterns, and the latter with the meanings that are (arbitrarily) associated with those sound patterns. It is logically possible to have an auditory theory of speech perception and a motor theory of action semantics. Likewise, it is possible to have a motor theory of speech perception and a non-motor (non-embodied) theory of semantics. I drew a cloud to indicate that the organization of the conceptual system is whatever it is, and for the sake of this discussion, doesn't matter.

Too often, it seems that people talk about motor theories of speech and embodied theories of language semantics in the same breath as if the two were necessarily related. Even Rizzolatti and co-authors' leap from mirror neuron theories of action understanding (a theory of semantics) to motor theories of speech perception (a perceptual theory) confuses this issue. For example, in some of their writings on mirror neurons and action understanding they argue that non-mirror neuron systems can process the lower level visual information associated with gestures, but it is the mirror system that affords "understanding" (i.e., semantics). But the motor theory of speech perception is all about this lower level of perceptual processing and NOT about the understanding. Note that Liberman et al. tended to study mostly the perception of non-meaningful CV stimuli. So the association between mirror neuron theory of action understanding and the motor theory is based on a failure to notice that the two theories are talking about different levels of processing.

So, let's tackle one question at a time. Does the perceptual (~phonemic-level) stage of speech recognition go through the motor system?

Greg Hickok said...

Great discussion. Thanks Karthik. L&M did in fact use the term "intended gestures" and did not subscribe to the view that low level motor programs were the targets of speech perception. But whether we define my "motor" label in the box diagrams as truly motor or as abstract speech gestures, in either case, they were not auditory for L&M, so (i) the model architectures I depicted are still valid characterizations of the positions, and (ii) L&M are still wrong (unless "intended gesture" is actually an auditory representation, which is precisely what I believe!).

Brad Buchsbaum said...

I still think that one of the issues here is one of translation from "cognitive terms" to "neural terms".

The motor theory of speech perception is really about the "code" for speech perception, not its neuroanatomical locus. It would seem strange to say that there is a motor code for speech perception in auditory cortex, but it's not necessarily a contradiction. More to the point, Liberman et al. do not (or do they?) make any specific functional-neuroanatomical predictions about "where" such a code lives in the brain.

It is reasonable to assume that a motor code should reside in motor cortex, but this is an additional assumption -- a cognitive neuroscientific heuristic, a "linking proposition" -- that is added to the model so that it can be discussed in neural terms.

Is it plausible that a gestural code in auditory cortex might form the basis of speech perception? Such a model would be consistent with the motor theory of speech perception, but inconsistent with neuroanatomically-based motor theories, where the "motor code" underlying speech perception is truly thought to reside in motor or premotor cortex.

Karthik Durvasula said...

I think Brad is asking exactly the right question. From what I can make out from L&M's work, this interpretation/model is not at odds with the revised motor theory view. The following quote from Mattingly ("In defence of Motor Theory") clearly implies something similar:

"This view of the relation between the module and the motor systems that control the articulators also suggests an interpretation of the patient described by MacNeilage, Rootes and Chase (1967). This patient had severe congenital impairment of somesthetic perception and articulatory control. She could not organize the movements of her tongue or lips and had severe deficits in speech production. Yet, she was able to understand speech and perceived it categorically. Contrary to MacNeilage's assertion, the perceptual abilities of this patient pose no problem for the Revised Motor Theory. The Motor Theorist would say that although her somesthetic system was disabled, and her motor-control systems therefore functioned poorly, her language module, and so her ability to perceive phonetic gestures, were intact."

Mattingly clearly claims a separation between the motor system and the "module" for (abstract) gesture perception. This is why I think the name "Motor Theory of speech perception" is unfortunate. The "gestures" that they refer to do not seem "motoric" at all - they are more (abstract) positions in the vocal tract space - which are very different from actual motor actualisations.

On a side note, my reading of the word "gesture" (in the context of Motor Theory) is this - they are neither motoric nor auditory. They are a transform from motoric/sensory information to a much more abstract vocal tract position space.

Yisrael said...

This discussion has clarified some points yet confused me on others.
Greg, I agree entirely with your well stated distinction between embodied semantics and speech perception. Your last point is critical here. The level of speech that is under discussion is speech sounds. The conceptual system is really largely irrelevant to the model except for perhaps its role in identifying speech sounds as 'human'.
As for 'motor code' or 'gestural code', 'abstract vocal tract position space' and various 'modules', these are terms that may be helpful for theoretical discussions, particularly in cognitive psychology (what does 'code' here mean anyway? Some pattern of action potentials? Do we even know how to address this experimentally?). However, what kind of hypothesis do they generate? Aphasia study is helpful up to a point. We now have the ability to study real time neural processing. Where are these abstract representations and modules, and how do we distinguish them from the analog representation? For instance, how can we separate 'motor code' in the auditory cortex from 'auditory code' in the auditory cortex?

Greg Hickok said...

Thank you for clarifying Mattingly's view on cases like that reported by MacNeilage et al. I hadn't yet come across a direct comment from the MT camp on such data. It helps to see how they think about it.

So here then, is how I understand the revised motor theory:

1. It's not auditory
2. It's not motor
3. It is the perception of abstractly represented vocal tract "gestures"
4. The system is modular
5. The system is innate

#5 of course is needed to explain how an individual who has never produced a speech gesture in their life can nonetheless access the abstract neural representation of such gestures. This is a bit odd given that even supporters of innate mental modules typically assume that the system needs some external environmental trigger to get it organized. (Note that if the representation is auditory, the problem goes away.)

So now the question boils down to what is an abstract vocal tract gesture? You (Karthik) suggest it is abstract positions in vocal tract space. Why can't it be abstract positions in auditory space? This is consistent with lesion evidence: damage to auditory cortex (bilaterally) produces profound speech sound recognition deficits; damage anywhere else, doesn't.

Karthik Durvasula said...

Hi Greg,

The MTist I am guessing wouldn't be opposed to "external environmental triggers" - I am not clear on what exactly you mean by this. I am hazarding a guess here about the MTist position: while they do say the system is "innate", I am guessing they wouldn't be opposed to "fine-tuning" thru experience. So, while the original system is innate, what experience does is fine tune the system (narrow ranges...).

With respect to "abstract positions in vocal tract space" - this is the standard revised MT position (not my personal interpretation). This especially shows up in work done by Louis Goldstein & Catherine Browman under the theoretical banner of "Articulatory Phonology". I was just clarifying the actual MT view on the matter.

As far as I am concerned, I am not convinced of it one way or another (abstract auditory/vocal tract space/both) - but the evidence I have from my own theoretical phonology work on nasals leads me to believe at the moment that (at least) the abstract positional information about nasal segments is absolutely crucial in understanding their (phonological) behaviour across languages. This specific point, I think, raises a bigger point (at least for me) that neurolinguistic research needs to take phonological research / results more seriously.

I didn't say it before, but I love reading your blog. It has been extremely educational for me till now!

Karthik Durvasula said...

I was also thinking about the same point as your (Greg's) last comment for the last few days:

"This is consistent with lesion evidence: damage to auditory cortex (bilaterally) produces profound speech sound recognition deficits; damage anywhere else, doesn't."

And I was wondering if this necessarily shows that auditory information is primary. Isn't this also consistent with the statement that auditory information is the most information-rich? So, it is not that auditory information is given pride of place, it is just that the nature of the auditory input is such that it is much more information-rich than the other sources. Therefore, inferring the correct representations (whatever their nature is - gestural/featural...), and hence the correct lexical representations, is much more likely in the presence of such information.

So, the lesion results might be explained by this: there is a profound deficit not because your primary source (the only one you turn to) is damaged, but because your most informative source is damaged (and the others while equally important are just not that informative).

Does what I am saying make sense? If it does, then what I am saying boils down to this - I don't think the evidence is as strongly in favour of a purely auditory model (as opposed to a multi-sensory integration model) as has been thought.

Greg Hickok said...

The innate module comment was based on this: there are cases in which the ability to produce speech fails to develop. These folks have never produced a speech gesture. Yet they are still capable of perceiving speech normally. If speech perception relies on the activation of an abstract speech gesture representation, then it follows that this gesture representation must be innate because it is functioning in people who have never had the opportunity to shape their speech gesture system. If you buy that though, you then have to wonder how chinchillas can perceive speech sounds with human-like ability -- certainly THEY can't have innate speech-gesture representation systems. You might say, 'well, they do it differently, with ordinary auditory perceptual systems.' But then if the auditory system can handle speech all on its own, what evolutionary pressure led to the development of an innate speech gesture module? It all seems a bit of a stretch to me.

Your point about phonology is important and interesting. Yes, neuroscientists who study language need to pay more attention to linguistics! You suggest that data from phonology leads you to believe that gestural information is critical. I don't doubt that. But here's an important point (correct me if I'm wrong because I'm not a phonologist!): the data that drives phonological theory comes from how people produce speech sounds. It doesn't come from how people hear speech sounds. You are assuming that the phonology uncovered via studies of production, also applies to the "phonological processing" in speech perception. This may be true, but I don't think so. My guess is that most of speech perception involves recognizing chunks of speech on the syllable scale, not individual segments. In other words, while you clearly need to represent speech at the segmental (and even featural) level for production, you don't need to do this for perception. So it doesn't surprise me that phonologists find gesture or motor-related information relevant to understanding phonology: it is based on data from speech production!

This could be a completely naive view given that I am no phonologist, so correct me if I'm wrong. But I guess the main point is that we don't necessarily need to assume that in perception we have to analyze the signal in all its minute featural/segmental detail to access the mental lexicon. Like "whole word" reading, we might be able to just process speech in larger (syllabic) chunks.

Greg Hickok said...

Regarding the fact that auditory cortex lesions lead to speech perception deficits, you said:

"there is a profound deficit not because your primary source (the only one you turn to) is damaged, but because your most informative source is damaged (and the others while equally important are just not that informative)."

Hmm. If it is the more informative source, isn't that kind of like saying it is the primary source? How can a source be equally important if it is not that informative? Here are the facts that any theory of speech recognition has to explain. (1) damage to auditory areas produces profound speech recognition deficits. (2) damage to motor cortex, Broca's area, the entire left frontal lobe, the entire left hemisphere... doesn't produce significant speech recognition deficits. There is clearly an asymmetry here.

Karthik Durvasula said...

RE: "You are assuming that the phonology uncovered via studies of production, also applies to the "phonological processing" in speech perception."

This must be true (at least, to some extent).

If you assume the following simplified view of communication:

Speaker 1 ---------> Listener 2
| |
Listener 1 <----X--- speaker 2
phonological data

Yes, we are collecting the data immediately after speaker 2's production, but you do expect the data to be "shaped" by both listeners' perception.

And clearly, (at least) some patterns in phonological data have imprints of perception asymmetries. But, the specific biases observed in the data with respect to nasals cannot be ascribed to perceptual reasons.

Karthik Durvasula said...

typo: the vertical line above "phonological data" had to connect to "X", not to "Listener 1"