One account is motor-based: "AV speech elicits in the listener a motor plan for the production of the phoneme that the speaker might have been attempting to produce, and that feedback in the form of efference copy from the motor system ultimately influences the phonetic interpretation" (Skipper et al., 2007).
The other is that AV integration is achieved without the motor system, via cross-sensory integration in the superior temporal sulcus (STS) (Nath & Beauchamp, 2012).
I recently came across a 15-year-old study that makes a pretty strong case against the motor-based account. Rosenblum et al. (1997) asked whether individuals who do not yet know how to produce speech nonetheless show a McGurk effect. Their study population? Five-month-old infants. The paradigm? Habituation of looking time (present the same stimulus over and over and measure how long it takes the kid to get bored and stop looking). The basic result, across four experiments? Habituation to auditory syllables was modulated by visual speech information: prelingual infants show a McGurk effect.
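The habituation logic can be sketched in a few lines. This is purely illustrative, not the authors' analysis code: the 50% criterion and the sliding 3-trial window are assumed values, standing in for whatever threshold a given habituation study actually uses.

```python
def habituation_trial_count(looking_times, window=3, criterion=0.5):
    """Return the trial at which the infant counts as habituated: the first
    trial where mean looking time over the last `window` trials drops below
    `criterion` times the mean looking time of the first `window` trials.
    Returns None if the criterion is never met."""
    baseline = sum(looking_times[:window]) / window
    for i in range(window, len(looking_times) + 1):
        recent = sum(looking_times[i - window:i]) / window
        if recent < criterion * baseline:
            return i
    return None

# Simulated looking times (seconds) declining across repeated presentations
looking = [12.0, 11.5, 10.8, 8.0, 6.5, 5.0, 4.2, 3.9]
print(habituation_trial_count(looking))  # habituated by trial 7
```

The interesting measurement then comes at test: after habituation, a stimulus the infant perceives as *new* should produce a rebound in looking time, while one perceived as the *same* should not. That dissociation is what lets a looking-time paradigm reveal whether visual speech changed what the infant heard.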
AV integration seems to be primarily sensory-, not motor-driven.
Nath, A.R. and M.S. Beauchamp, A neural basis for interindividual differences in the McGurk effect, a multisensory speech illusion. Neuroimage, 2012. 59(1): p. 781-7.
Rosenblum, L.D., M.A. Schmuckler, and J.A. Johnson, The McGurk effect in infants. Percept Psychophys, 1997. 59(3): p. 347-57.
Skipper, J.I., et al., Hearing lips and seeing voices: how cortical areas supporting speech production mediate audiovisual speech perception. Cereb Cortex, 2007. 17(10): p. 2387-99.