1. A classic problem in speech perception research, and one that led to the development of the motor theory, is the acoustic variability associated with individual phonemes caused by coarticulation: the /d/ in /di/ and the /d/ in /du/ have different acoustic signatures (but the same articulation).

2. However, as J-L noted, if you look in acoustic space at the whole syllable (e.g., in a plot of F1 vs. F2, I believe), you can capture the distinction between /di/ and /du/ quite nicely. In other words, you can solve the lack of invariance problem acoustically just by widening your temporal window from segment to syllable.

3. However (and here is J-L's motor influence argument), on a purely auditory account there is no reason why we should hear /di/ and /du/ as containing the same onset sound. If it were all just auditory, why wouldn't we just hear /di/ /da/ /du/ and /bi/ /ba/ /bu/ as six different acoustic categories instead of the two onset categories indicated in the figure below? Answer: the categorization comes from the way the phonemes are articulated, not from their acoustic consequences.

He also presented similarly structured arguments (i.e., that generalizations can be made over motor but not perceptual systems) using data from the distribution of vowels in the world's languages and from perceptual tendencies in the verbal transformation effect.

Jean-Luc is not arguing here for a hardcore motor theory. In fact, he argues that a pure motor theory is indefensible. Rather, the claim is that acoustic categories are modified by the motor system. I think this is a perfectly reasonable conclusion, and one that is consistent with my basic position, namely that access to the lexicon is from auditory-phonological systems.

One issue I did raise, however, is that while it seems clear that phonological categories (phonemes) are influenced by motor systems, there really is not any evidence that this information actually modifies perceptual categories. For example, maybe all we really have in our perceptual system is six different categories for di da du bi ba bu. It is only when you need to map these sounds onto articulatory gestures that the system needs to pick up on the commonalities among the first three versus the last three.

You might want to argue that this can't be right because we obviously hear di da du as all starting with /d/. But I'm not so sure. I think this may be a consequence of the fact that we have been taught, for the purpose of learning to read, that words are composed of individual phonemes. Again, I think it is critical to remember that when we listen to speech under ecologically valid conditions, we don't hear speech sounds; we hear words (i.e., meanings).

Here are a few recent papers by Jean-Luc and colleagues. Marc Sato, who has contributed to this blog, is among these colleagues, by the way. These folks are doing some really good work and are definitely worth following.
Sato M, Schwartz JL, Abry C, Cathiard MA, Loevenbruck H. Multistable syllables as enacted percepts: a source of an asymmetric bias in the verbal transformation effect. Percept Psychophys. 2006 Apr;68(3):458-74.
Ménard L, Schwartz JL, Boë LJ. Role of vocal tract morphology in speech development: perceptual targets and sensorimotor maps for synthesized French vowels from birth to adulthood. J Speech Lang Hear Res. 2004 Oct;47(5):1059-80.
Sato M, Baciu M, Loevenbruck H, Schwartz JL, Cathiard MA, Segebarth C, Abry C. Multistable representation of speech forms: a functional MRI study of verbal transformations. Neuroimage. 2004 Nov;23(3):1143-51.
Rochet-Capellan A, Schwartz JL. An articulatory basis for the labial-to-coronal effect: /pata/ seems a more stable articulatory pattern than /tapa/. J Acoust Soc Am. 2007 Jun;121(6):3740-54.
Sato M, Vallée N, Schwartz JL, Rousset I. A perceptual correlate of the labial-coronal effect. J Speech Lang Hear Res. 2007 Dec;50(6):1466-80.
Sato M, Basirat A, Schwartz JL. Visual contribution to the multistable perception of speech. Percept Psychophys. 2007 Nov;69(8):1360-72.
Basirat A, Sato M, Schwartz JL, Kahane P, Lachaux JP. Parieto-frontal gamma band activity during the perceptual emergence of speech forms. Neuroimage. 2008 Aug 1;42(1):404-13. Epub 2008 Apr 16.