Thursday, October 8, 2009

The motor theory of speech perception makes no sense

The motor theory was born from the finding that the acoustic speech signal is ambiguous: the same sound, e.g., /d/, can be cued by different acoustic features. The speech gesture that produces /d/, on the other hand, was suggested to be unambiguous: it always involves placing the tip of the tongue against the roof of the mouth. Therefore, the argument goes, we must perceive speech by accessing the invariant motor gestures that produce it.

There is one thing that never made sense to me, though. If the acoustic signal is ambiguous, how does the motor system know which motor gesture to access? As far as I know, no motor theorist has proposed a solution to this problem. Am I missing something?

9 comments:

Matt Goldrick said...

This problem is explicitly discussed in Liberman and Mattingly (1985), p. 26:

"given the many-to-one relation between vocal-tract configurations and acoustic signals, a purely analytic solution to the problem of recovering movements from the signal seems to be impossible…the alternative to an analytic account of speech perception is, of course, a synthetic one, in which case the module compares some parametric description of the input signal with candidate signal descriptions."

At least in this version of motor theory, then, a critical component of the implausibility of an acoustic theory of speech perception is the (reported) non-existence of a function mapping from acoustic signals to intended targets. The analysis-by-synthesis account assumes that there is a well-formed forward model (from articulation to acoustics) but no inverse.

Greg Hickok said...

Matt, thanks for pointing me to this quote! So the problem is acknowledged at least, which means that the motor theory doesn't solve anything, right? It just shifts the problem to a different system.

Matt Goldrick said...

I wouldn't say this is "shifting the problem." The motor system is what's used to generate the possible acoustic signals.

Liberman + Mattingly are claiming that it's impossible to identify sounds based purely on acoustics. That's not a problem specific to a motor theory; it's a problem for any theory of perception. The motor theory addresses this by focusing on a problem that is well-formed--the mapping of articulation to acoustics.

Greg Hickok said...

This is a nice way of putting it Matt. It clarifies the MT position for me a bit, except I still don't see how it helps.

First -- and I'm no expert here, so correct my errors -- I don't see how the mapping between articulation and acoustics is well-formed. The same articulatory gesture that produces /d/ results in different acoustics in the context /di/ vs. /du/. The mapping that is well-formed is the one from articulation to perceived *phoneme*. And isn't it the case that to get to a perceived phoneme you have to run it through those pesky acoustics?

Second, even if the mapping from articulation to acoustics is a well-formed problem, you still have to get from an ambiguous acoustic signal to one or another gesture.

I must still be missing something so keep trying Matt! I'm sure I'll get it eventually. :-)

Matt Goldrick said...

Hopefully this will be less confusing!

By 'well-formed' I mean that the mapping from articulatory states to sounds is deterministic: give me a vocal tract configuration and I can tell you what sound will be produced. But the mapping is many-to-one, so it is not invertible; given a sound, I can't tell you what articulatory state produced it (without making some additional assumptions).
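To make this concrete, here is a toy sketch in Python. The articulation and acoustic labels are made up purely for illustration, not a model of real speech; the point is only that a deterministic, many-to-one forward map has no unique inverse.

    # Toy illustration: a deterministic but many-to-one forward map from
    # hypothetical articulatory configurations to acoustic patterns.
    FORWARD = {
        ("tongue_tip_alveolar", "vowel_i"): "pattern_A",
        ("tongue_tip_alveolar", "vowel_u"): "pattern_B",
        ("tongue_body_velar",   "vowel_i"): "pattern_A",  # same acoustics, different gesture
    }

    def synthesize(articulation):
        # Forward direction: every articulation yields exactly one acoustic pattern.
        return FORWARD[articulation]

    def invert(acoustic_pattern):
        # Inverse direction: one acoustic pattern can come from several articulations.
        return [a for a, s in FORWARD.items() if s == acoustic_pattern]

    print(invert("pattern_A"))  # two candidate articulations; the signal alone can't decide

A perfectly well-behaved forward function, in other words, need not have a well-behaved inverse.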

Stepping back to speech sound categories, Liberman + Mattingly's analysis is that the distinctions between these categories can be represented in terms of articulatory features. Given the lack of an inverse mapping from acoustics to these features, one cannot recover speech sounds from acoustics.

They believe the point holds for any theory of speech sound representation. There is a well-defined 'forward' mapping from speech sound categories to acoustics, but there is no well-defined inverse mapping from acoustics back to speech sound categories. So regardless of your theory of how sound category distinctions are represented, the acoustic signal presents a difficult identification problem.

A potentially helpful analogy is the inverse problem in ERP generation. Given sources, we can determine the pattern of potentials at the scalp--but we cannot unambiguously map from potential patterns to sources (without additional constraining information).

L&M's solution is to say that articulation provides the right set of constraints on the inverse mapping problem. Articulation provides a set of candidate gestures (the person could have said bi, di, or gi). These gestures are then used to generate potential acoustic signals, and the incoming signal is matched against them. But one could imagine alternative theories of how you generate candidate sound representations (not based on the motor system). This would be much in the same spirit as their solution (analysis by synthesis) but would rely on a different mechanism to generate (synthesize) new candidates. I'm sure their response would be that it's simpler to just 're-use' one's motor system to generate candidates.
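Here is a minimal sketch of that analysis-by-synthesis loop, under obviously simplified assumptions: the candidate set, the forward_model stand-in, and the acoustic vectors are all invented for illustration, not L&M's actual implementation.

    import numpy as np

    CANDIDATE_GESTURES = ["bi", "di", "gi"]

    def forward_model(gesture, n=16):
        # Stand-in for an articulation-to-acoustics synthesizer: each candidate
        # gesture deterministically yields one predicted acoustic vector.
        seed = sum(ord(c) for c in gesture)
        return np.random.default_rng(seed).normal(size=n)

    def analyze_by_synthesis(input_signal):
        # Synthesize predicted acoustics for each candidate and pick the
        # candidate whose prediction best matches the incoming signal.
        errors = {g: np.sum((input_signal - forward_model(g)) ** 2)
                  for g in CANDIDATE_GESTURES}
        return min(errors, key=errors.get)

    # A noisy token generated from "di" should be matched back to "di".
    noisy_input = forward_model("di") + np.random.default_rng(0).normal(scale=0.1, size=16)
    print(analyze_by_synthesis(noisy_input))

Note that nothing in the matching step requires the candidates to be motor; swap in any other candidate generator and it works the same way, which is just the alternative mentioned above.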

It's important to distinguish this from another type of motor theory--Fowler's Direct Realism perspective. This does not involve any analysis by synthesis. DR in fact denies (at some level) L&M's claim that the acoustic signal is truly ambiguous--its proponents claim there is enough in the acoustic speech stream to perform the inverse mapping to articulatory gestures. At least, I think so; I must admit I am confused by how this is supposed to work.

Greg Hickok said...

Brilliantly clear, Matt. Thank you! So we are still in a situation, though, where we have an ambiguous acoustic signal that cannot be uniquely mapped back onto a speech gesture. The MT doesn't solve this problem; it just provides a possible source of constraint, just as word context or sentence context provides constraint. We would all agree that a sentence context like "Hey, while you're up, will you grab me a *eer?" is going to bias an ambiguous sound * toward /b/. This information provides a constraint on perception. But it doesn't lead us to propose that the objects of speech perception are phrase-level propositions.

I have to admit I don't understand Fowler's theory either.

Fred said...

@Greg: there are some people---namely people who espouse something like an acoustics-based exemplar approach to phonetics/phonology/the lexicon---who DO think that the objects of perception (and probably storage) are "phrase-level propositions", or rather phrase-level acoustic trajectories, given a suitable definition of "phrase-level". A good reference is Coleman (2002) in a volume by Durand & Laks whose name escapes me at the moment.

@Matt: You say "a critical component of the implausibility of an acoustic theory of speech perception is the (reported) non-existence of a function mapping from acoustic signals to intended targets", but the problem only arises if you assume the targets are articulatory, no? If you have acoustic targets and manage to hit/approximate them in whatever way you can, then it seems to me that that's fine. The fact that people can do reasonably well with bite-blocks shows that we acquire multiple strategies to hit acoustic targets.

Greg Hickok said...

Great discussion! Fred, I was not aware anyone actually claimed such a thing. Interesting. I'm on board with the idea that objects of perception are not phonemes but something higher level. I'm not so sure about phrasal level though. My current favorite is the syllable.

I think raising the question of what the targets of speech production are is important. I believe they are indeed auditory targets. From this perspective, when non-auditory theorists (MT and Fowler-style models, for example) say that the objects of perception are not strictly motor but the "intended gestures" of the speaker, we can counter that the intended "gestures" (not being motor) are in fact sounds.

Fred said...

@Greg: if we're serious about the objects of perception (and maybe storage) being acoustic, then I'm not sure how syllables can do the work you want them to. I'll admit to not being up on current theory-of-syllables, but my impression is that syllables are typically defined articulatorily, and there are no reliable acoustic correlates of syllables. Unless you're using the term "syllable" to mean "approx this many milliseconds".

Also, here's something about the MT view that's not clear to me. If the object of speech perception (which presumably is what allows me to perceive speech AS speech and respond to it) is some kind of quasi-abstract "intention", then why is it that I perceive synthetic speech accurately? I presume the MTists don't want to attribute "articulatory intentions" to machines. But perhaps (probably?) I'm just confused about what's entailed by the motor theory.