Matt Davis and Ingrid Johnsrude have a very thoughtful new review, "Hearing speech sounds: Top-down influences on the interface between audition and speech perception" in the journal Hearing Research. The paper is available on the journal's web site already.
Matt and Ingrid review several critical bottom-up and top-down aspects of speech perception, from the perspectives of both perception research and cognitive neuroscience. They review four types of phenomena: grouping, segmentation, perceptual learning, and categorical perception. The paper makes a persuasive case for the nuanced interaction of top-down factors with bottom-up analysis.
Indeed, their review converges in interesting ways with the Hickok & Poeppel (2007) review, and with another forthcoming review by -- yes, sorry -- me (Poeppel), Bill Idsardi, and Virginie van Wassenhove: "Speech perception at the interface of neurobiology and linguistics", a paper in press at Philosophical Transactions of the Royal Society.
Across these three papers I think there is a fair amount of convergence -- a successful model bridging perception, computation, and brain must account for the subtle but principled interaction between the bottom-up processes we need (by logical necessity) and the top-down processes we (our brains) bring to the problem (by sheer damn luck).
Is this news? Well, there certainly is still controversy, although there may be no issue ... But it seems to me that, say, research on automatic speech recognition is not particularly well-informed by the processes that human brains actually execute in perception. (Neuromorphic engineering approaches are, it goes without saying, an exception.)
"The extensive network of connections that we have documented among various levels in the auditory system (...) may support mechanisms by which higher-level interpretations of speech are tested against incoming auditory information." (p.11)
Hmm... sounds like an analysis-by-synthesis model...
"Speech perception likely proceeds by reconciling interpretations generated on multiple time-scales, ..." (p.13)
Multiple time scales... No wonder you guys are so cozy with this review ;-)
Although, maybe someone can help me out, because I wasn't sure where they had mentioned the idea (or role) of multiple time scales in speech perception/recognition prior to this passage, which is the sentence before the conclusion.
Not so cozy, actually, in many respects. The behavioral evidence review is really nice; they make a pretty strong case for top-down influences on speech processing, among other points. But the emphasis on frontal systems as the primary source of this top-down influence is overstated, I think. For example, a lot of space is devoted to top-down lexical and contextual constraints on speech processing, appropriately so. Why couldn't this top-down influence be coming from temporal lobe systems, which many people believe support lexical-semantic processing, rather than from frontal systems?
Another point of disagreement is the section on linkages between speech perception and speech production. I'm definitely into strong connections in this respect, but I don't think the evidence is strong at all that speech production systems play a major role in speech perception. If you look at the evidence cited in that section, all of it has to do with perceptual effects on speech production: learning to talk, effects of auditory feedback on production, verbal working memory, adjusting production patterns to match speakers in your environment... The data make a strong case for perceptual influence on production, which I buy whole-heartedly, but they are thin on evidence for the reverse influence.
Similarly, the anatomical sections over-emphasize frontal contributions to speech perception. The effects they discuss are interesting -- that degraded speech leads to increased activation in frontal and temporal areas -- but how do we know this isn't just an attentional effect, rather than a top-down influence of the sort they discussed in their behavioral review?
They also cite Hickok & Poeppel 2004 as suggesting that "motor activity during speech perception reflects the activation of articulatory representations which permit the listener to derive the intended spoken gestures of the speaker." Not sure about David, but I don't believe it. What we actually said was that "there is a tight relation between speech perception and speech production... [but that] mapping of sensory representations of speech onto motor representations may not be an automatic consequence of speech perception, and indeed is not necessary for auditory comprehension." (p.91)
I was struck by their discussion of an auditory echoic memory buffer, which I guess is part of the HP model too. I hadn't thought about it before, but the existence of that kind of buffer is relevant for the arguments for purely bottom-up models. In Phil's and my review of those arguments (http://ling.umd.edu/~ellenlau/LING621.Final.pdf), our impression was that one of the things that made the MERGE people most uncomfortable about top-down models like TRACE was the idea that there was no place in the system that preserved a record of the 'true' input. In the visual case, I think that is a real possibility, since there are a number of studies showing top-down impacts on activity as low down as V1. But in the auditory case, where you need a buffer for independent reasons because your signal is extended over time, maybe an interesting byproduct is that you do get to temporarily keep this record of the 'true' input.
If so, though, it would make the analysis-by-synthesis computation in the two modalities really different, because in one case you can keep updating the candidate sets you generated at the top and comparing them against the real input at the 'bottom', while in the other, like TRACE, you would be updating activity at both the top and the bottom levels until they converged. Unless there's some pristine visual buffer we haven't found yet...
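The contrast between the two settling schemes can be made concrete with a toy sketch. To be clear, this is my illustration, not any published model: the vocabulary, the character-matching score, and the feedback rule are all hypothetical stand-ins for acoustic matching and lexical feedback.

```python
# Toy contrast between two top-down settling schemes.
# VOCAB, score(), and the feedback rule are illustrative assumptions only.

VOCAB = ["cat", "cap", "can"]  # hypothetical candidate set

def score(candidate, signal):
    """Crude stand-in for acoustic match: count matching characters."""
    return sum(c == s for c, s in zip(candidate, signal))

def buffer_scheme(echoic_buffer):
    """Auditory-style analysis-by-synthesis: an echoic buffer preserves the
    'true' input, so every top-down candidate is checked against the same
    fixed record, which can be re-consulted at any time."""
    return max(VOCAB, key=lambda cand: score(cand, echoic_buffer))

def trace_scheme(noisy_input, steps=5, feedback=1.0):
    """TRACE-style settling: top-down feedback boosts the currently winning
    candidate's activation, so the bottom-up record itself is altered as
    the network converges -- no pristine copy of the input survives."""
    acts = {w: float(score(w, noisy_input)) for w in VOCAB}
    for _ in range(steps):
        winner = max(acts, key=acts.get)
        acts[winner] += feedback  # feedback overwrites the 'bottom' level
    return max(acts, key=acts.get)

print(buffer_scheme("cap"))
print(trace_scheme("cap"))
```

Both schemes pick the same word here, but only the first keeps the raw input around: in `buffer_scheme` a late-arriving candidate could still be scored against the original signal, whereas in `trace_scheme` the activations it would be scored against have already been reshaped by feedback.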
At least speech production systems play a major role in speech perception when I hear functional auditory hallucinations: the sensory consequence (my inner voice) is used to guide the attention devoted to a nonverbal peripheral sound, which makes me hear my inner voice when it substitutes for the pitch of my inner voice.
As you probably understand, I don't know much about speech perception; this is just something I think because of my experiences of hearing voices like this.
More about my experiences of functional auditory hallucinations and the need to understand...: http://www.freewebs.com/stefan661/
My email address: firstname.lastname@example.org