Thursday, April 23, 2009

Motor influence of speech perception: The view from Grenoble

We had a nice visit with Grenoble's own Jean-Luc Schwartz at TB West this past week. Jean-Luc has been working on motor influences on speech perception for years (decades even) and has a very thoughtful and empirically solid perspective on the issue. Here is the structure of one argument that I found particularly interesting and compelling (Jean-Luc, I'm going to steal a couple of your ppt file images; I'm hoping you won't mind. And please correct my errors in summarizing your points!):

1. A classic problem in speech perception research, and one that led to the development of the motor theory, is the acoustic variability associated with individual phonemes caused by coarticulation: the /d/s in /di/ and /du/ have different acoustic signatures (but the same articulation).

2. However, as J-L noted, if you look in acoustic space for the whole syllable (e.g., in a plot of F1 vs. F2, I believe), you can capture the distinction between /di/ and /du/ quite nicely. In other words, you can solve the lack-of-invariance problem acoustically just by widening your temporal window from segment to syllable.

3. However -- and here is J-L's motor influence argument -- there is no reason why we should hear /di/ and /du/ as containing the same onset sound. If it were all just auditory, why wouldn't we just hear /di/ /da/ /du/ and /bi/ /ba/ /bu/ as six different acoustic categories instead of the two onset categories indicated in the figure below? Answer: the categorization comes from the way the phonemes are articulated, not from their acoustic consequences (the sketch below illustrates the point).
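To make points 2 and 3 concrete, here is a minimal Python sketch. The formant numbers, the three-feature space, and the helper names are all invented for illustration (they are not measured data and not from Jean-Luc's slides): a whole-syllable acoustic representation is enough to tell the six syllables apart, but nothing in that space by itself groups {di, da, du} against {bi, ba, bu}; the onset grouping has to be supplied as an extra equivalence map.

# Minimal sketch (invented, ballpark formant values; purely illustrative).
# Each syllable token is a point in a whole-syllable acoustic space:
# (F2 onset, vowel F1, vowel F2), all in Hz.
import numpy as np

syllables = {
    "di": (1800, 300, 2300), "da": (1700, 750, 1200), "du": (1600, 350,  800),
    "bi": (1100, 300, 2300), "ba": ( 900, 750, 1200), "bu": ( 800, 350,  800),
}
names = list(syllables)
centroids = np.array([syllables[s] for s in names], dtype=float)

# Point 2: with the whole-syllable window, a nearest-centroid classifier
# separates all six syllables -- no invariance problem at the syllable level.
token = np.array([1750.0, 320.0, 2250.0])          # a noisy "di"-like token
nearest = int(np.linalg.norm(centroids - token, axis=1).argmin())
print("classified syllable:", names[nearest])

# Point 3: the acoustic layout alone gives six categories. The fact that
# {di, da, du} share an onset is an extra equivalence map imposed on those
# categories (on J-L's account, supplied by how the sounds are articulated).
onset_map = {s: s[0] for s in names}
print("imposed onset category:", onset_map[names[nearest]])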

He also presented similarly structured arguments (i.e., that generalizations can be made over motor but not perceptual systems) using data from the distribution of vowels in the world's languages and from perceptual tendencies in the verbal transformation effect.

Jean-Luc is not arguing here for a hardcore motor theory. In fact, he argues that a pure motor theory is indefensible. Rather, the claim is that acoustic categories are modified by the motor system. I think this is a perfectly reasonable conclusion, and one that is consistent with my basic position -- that access to the lexicon is from auditory-phonological systems. One issue I did raise, however, is that while it seems clear that phonological categories (phonemes) are influenced by motor systems, there really is not any evidence that this information actually modifies perceptual categories. For example, maybe in our perceptual system all we really have is six different categories for di da du bi ba bu? It is only when you need to map these sounds onto articulatory gestures that the system needs to pick up on the fact that there are commonalities between the first three vs. the last three.

You might want to argue that this can't be right because we obviously hear di da du as all starting with /d/. But I'm not so sure. I think this may be a consequence of the fact that we have been taught, for the purpose of learning to read, that words are composed of individual phonemes. Again, I think it is critical to remember that when we listen to speech under ecologically valid conditions, we don't hear speech sounds, we hear words (i.e., meanings).

Here are a few recent papers by Jean-Luc and colleagues. Marc Sato, who has contributed to this blog, is among these colleagues, by the way. These folks are doing some really good work and are definitely worth following.

Sato M, Schwartz JL, Abry C, Cathiard MA, Loevenbruck H. Multistable syllables as enacted percepts: a source of an asymmetric bias in the verbal transformation effect. Percept Psychophys. 2006 Apr;68(3):458-74.

Ménard L, Schwartz JL, Boë LJ. Role of vocal tract morphology in speech development: perceptual targets and sensorimotor maps for synthesized French vowels from birth to adulthood. J Speech Lang Hear Res. 2004 Oct;47(5):1059-80.

Sato M, Baciu M, Loevenbruck H, Schwartz JL, Cathiard MA, Segebarth C, Abry C. Multistable representation of speech forms: a functional MRI study of verbal transformations. Neuroimage. 2004 Nov;23(3):1143-51.

Rochet-Capellan A, Schwartz JL. An articulatory basis for the labial-to-coronal effect: /pata/ seems a more stable articulatory pattern than /tapa/. J Acoust Soc Am. 2007 Jun;121(6):3740-54.

Sato M, Vallée N, Schwartz JL, Rousset I. A perceptual correlate of the labial-coronal effect. J Speech Lang Hear Res. 2007 Dec;50(6):1466-80.

Sato M, Basirat A, Schwartz JL. Visual contribution to the multistable perception of speech. Percept Psychophys. 2007 Nov;69(8):1360-72.

Basirat A, Sato M, Schwartz JL, Kahane P, Lachaux JP. Parieto-frontal gamma band activity during the perceptual emergence of speech forms. Neuroimage. 2008 Aug 1;42(1):404-13. Epub 2008 Apr 16.

12 comments:

Matt Goldrick said...

"One issue I did raise, however, is that while it seems clear that phonological categories (phonemes) are influenced by motor systems, there really is not any evidence that this information actually modifies perceptual categories. For example, maybe in our perceptual system all we really have is six different categories for di da du bi ba bu? It is only when you need to map these sounds onto articulatory gestures that the system needs to pick up on the fact that there are commonalities between the first three vs. the last three."

This seems like an eminently testable claim. Suppose we find a paradigm in which perceptual learning but not production learning occurs (i.e., discrimination increases but production is completely unchanged relative to baseline). We then train someone on a contrast in one vowel context and see if it generalizes to a novel (acoustically distinct) vowel context. On your theory, such generalization shouldn't occur.

One piece of data that might go against this prediction is Tanya Kraljic's work (with Brennan and Samuel) showing perceptual changes without production changes. Of course, in this situation it's adjustments to existing categories--in which case motor influences have already had time to kick in.

A perhaps better case would be to examine the learning of novel categories. Jessica Maye and Dan Weiss have shown that you can get generalization across place of articulation in learning a novel contrast. The question that hasn't been looked at is whether this paradigm involves any production learning. Fortunately someone's on top of that--my student Melissa Baese-Berk. So, perhaps soon we will know if perceptual learning in isolation can lead to generalization to other occurrences of the same segment in acoustically distinct contexts.

Greg Hickok said...

Hi Matt, I'm very happy to see that someone is investigating this question. I've thought previously about trying something myself along these lines, but never had the energy to do it. I believe Howard Nusbaum mentioned to me some existing (perhaps unpublished) data along these lines as well. Howard, if you are "listening" please fill us in.

Rajeev Raizada said...

The figure with /du/, /da/ and /di/ all sitting on a straight line is very similar to Harvey Sussman's locus equation model. The two dimensions are F2 vowel and F2 onset (e.g., the figure from his 1998 BBS paper). Sussman argues that the fact that different stop consonants sit on different vowel/onset lines means that there is an acoustic invariance after all, thus undercutting the need to postulate access to some kind of motor invariance. It's just that the acoustic invariance is in the relation between two F2 features, rather than being a function of just one feature.

Harvey Sussman said...

I have been working on the non-invariance issue in stop + vowel perception for over 15 years. My algorithm derives 'locus equations' from CV productions (e.g. beat, bet, bought, boot, bait ... deet, debt, dought, dote, date, doot ... geet, get, got, goat, gate, etc.). When F2 onsets (Hz) (y-axis) and their respective F2 vowel midpoints (x-axis) are plotted in a scatter plot for a given stop, uniquely linear and tight clusters of x,y data points emerge. A linear regression through these points characterizes each stop place category by slope and y-intercept (100% correct classification in discriminant analysis!). The variable F2 transitions, which Motor Theorists have been agonizing over for 60+ years, no longer present a problem because they have been normalized by virtue of displaying them as a phonetic equivalence class, not token by token as usually displayed.

Harvey Sussman said...

I have been studying the non-invariance issue in stop + vowel perception for almost two decades. My algorithm derives 'locus equations' (LEs) from productions of words beginning with [bdg] followed by 10 different vowels (e.g. beat, bet, bit, date, debt, dote, gate, get, git, got...). Onset frequencies of the F2 transition are plotted on the y-axis, and F2 midpoints of the following vowel on the x-axis, shown for each stop consonant. The resulting data points in the scatterplot are linear and tightly clustered, with R-squared values usually exceeding .90. Slopes and y-intercepts of regression lines fit to the data points can correctly classify CVs into stop categories at 100% accuracy. LEs demonstrate that the variable F2 transitions, which led Motor Theorists to abandon the auditory signal in favor of motor gestures, have been normalized in a self-organized fashion, at the level of the stop place category.

These orderly and contrastive distributions of FM onsets/offsets are suggestive of neural columns that encode an invariant emergent feature across lawful variations of an input signal (see barn owl ITD columns). This orderliness of CVs is seen with only F2 transitions. With burst and F3 information, the auditory signal is rich enough to account for perception without the 'phonetic voodoo' of motor theorists clouding the issue.
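To make the locus-equation procedure described above concrete, here is a minimal Python sketch. The slopes, intercepts, and noise level are made-up placeholders, not values from Sussman's papers; the point is only the shape of the computation: regress F2 onset on F2 vowel midpoint separately for each stop, and summarize each stop place category by its fitted (slope, y-intercept) pair.

# Minimal sketch of the locus-equation computation (all numbers invented).
import numpy as np

rng = np.random.default_rng(0)
f2_vowel = rng.uniform(800, 2300, size=60)      # hypothetical F2 vowel midpoints (Hz)

# Placeholder locus-equation parameters per stop: (slope, y-intercept in Hz).
params = {"b": (0.85, 200.0), "d": (0.40, 1100.0), "g": (0.65, 700.0)}

for stop, (slope, intercept) in params.items():
    # Synthesize noisy F2-onset values consistent with this stop's locus equation.
    f2_onset = slope * f2_vowel + intercept + rng.normal(0.0, 60.0, size=f2_vowel.size)
    est_slope, est_intercept = np.polyfit(f2_vowel, f2_onset, deg=1)
    r = np.corrcoef(f2_vowel, f2_onset)[0, 1]
    print(f"/{stop}/: slope={est_slope:.2f}, intercept={est_intercept:.0f} Hz, R^2={r**2:.2f}")

# Each stop place category is now summarized by one (slope, intercept) point;
# those points separate /b d g/ and can feed a discriminant analysis over CV tokens.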

Greg Hickok said...

Thanks Rajeev and Harvey! Sounds like great work. Is the BBS article the best reference to this work? Would you like to post a bibliography?

Harvey Sussman said...

Greg:

Here is another reference to the locus equation work in addition to the BBS article:
Sussman (2002) "Representation of phonological categories: A functional role for auditory columns" Brain & Language, 80, 1-13.

Thanks

Harvey

daniel kislyuk said...

From the point of view of Levelt's model of speech production (e.g. Cholin et al., 2006) there is no need, it seems, to pick up the commonality between the syllables even during production, for production might be mediated by a mental syllabary. However, if the ability to extract patterns is present in the articulatory networks as well as elsewhere in the brain, the common movement patterns might be extracted in the course of the system's functioning and then affect perception. That is, the ability to group /bi/ and /bu/ vs. /di/ and /du/ might not be a condition but a natural consequence of production.

Matt Goldrick said...

The assumption of a syllabary in Levelt's model doesn't preclude a sub-syllabic level of representation (involved in syllabary access). The WEAVER++ architecture, like all extant production theories, assumes a segmental level of representation is accessed during lexical retrieval. Among other patterns, this accounts for single-segment exchanges in errors (e.g., barn door -> darn bore) and single-segment priming effects (e.g., A. Meyer, 1991, JML).

Greg Hickok said...

There is good evidence for the existence of segment-level representations, as Matt points out. It is an open question to what extent this information is used during perception of speech.

Jean-Luc Schwartz said...

Hi everybody,

Some comments to the comments (and thanks again, Greg, for letting our arguments be discussed here).

- About Matt's point: this is quite well taken. We have two arguments there. One is indirect, the other one is incomplete (sorry!). The first one is evidence that the way you have structured your vowel system in production influences the way you structure it in perception (ongoing work with Lucie Ménard). Just showing that repertoires interfere. The generalization problem is different, and if new data appear, that is great. The other argument is functional. In the display that I presented to Greg and that he put on his blog, I mentioned that if you are able to learn the link between the 3 ellipses for a "b", then you will hopefully better identify a novel stimulus. For example, a stimulus between "bi", "ba" and "di" (say "be") might be more or less equidistant from the three ellipses. However, once you have learnt the global "category", identification improves, and you will know it is a "b" rather than a "d".

- This is where I differ from Harvey's point (which I know well, of course, and cited in my talk). I find the "linearity" argument too specific to address the general point, which is: how do you link various pieces of acoustic information into a common class? This is the basic question for Motor theory in perception. It is also the ground for phonemes, hence the claim that phonemes cannot be anything other than perceptuo-motor objects. Then it can be claimed that perception does not actually need phonemes (Matt's point), hence my first comment. But the general problem of binding different instances of a single phoneme (variability due to coarticulation) needs, in my view, more than the linearity-for-locus relation displayed by Sussman.

- For this “binding by production” question, I would not say things very differently from Daniel, I guess.

Jean-Luc
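A toy way to make Jean-Luc's functional argument above concrete (all numbers invented for illustration): suppose a novel "be"-like token matches each of the six syllable ellipses with the likelihoods below. Taken ellipse by ellipse, the best match is a near tie between a "d" and a "b" syllable, but once evidence is pooled over the learned /b/ and /d/ groupings, the decision is clear.

# Toy illustration of pooling evidence over a learned category (invented numbers).
likelihood = {"bi": 0.30, "ba": 0.28, "bu": 0.02,
              "di": 0.31, "da": 0.06, "du": 0.01}

best_single = max(likelihood, key=likelihood.get)
pooled = {"b": sum(v for s, v in likelihood.items() if s.startswith("b")),
          "d": sum(v for s, v in likelihood.items() if s.startswith("d"))}

print("best single ellipse:", best_single)                      # 'di' -- a near tie with 'bi'
print("pooled category:", max(pooled, key=pooled.get), pooled)  # /b/ wins clearly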

Karthik Durvasula said...

"maybe in our perceptual system all we really have is six different categories for di da du bi ba bu"

This assumption might be unnecessary. It is not clear that such a categorisation is impossible through auditory means alone. It depends on how strongly you want to interpret Kluender et al's work:

"Japanese quail can learn phonetic categories." (http://www.citeulike.org/user/kapfelba/article/513490)

They argue that Japanese quail can extract the category of 'd' or 'b' from syllable presentations, and recognise them in novel syllable contexts.

Even Kluender et al aren't sure of what acoustic properties/strategies the quail were using, and do not propose anything clear except "complex mapping". But the point is that categorisation of the different 'd' tokens (in the context of different vowels) is at least "possible" through auditory means alone.