It is a commonly held belief that speech perception involves the recovery of segmental information -- that is, the speech stream is analyzed in such a way that individual phonemes are recovered. So a typical story is that we analyze the spectro-temporal features to recover phonemes which are put together to form syllables then phonological words, enabling lexical-semantic access. We've suggested, as have others, that maybe the syllable is a basic unit of analysis, while at the same time leaving open the possibility that we might also access segmental information. For example, as in this figure from Hickok & Poeppel 2007:
Or this overly simplified cartoon from Hickok 2009:
So here's the question: what exactly is the evidence that we access segmental information in perception? Do we even need phonemes for speech perception? Why?
Let me play devil's advocate and claim that we don't extract or represent phonemes at all in speech perception (production is a different story). We do it all with syllables.
Convince me that I'm wrong.
Hickok, G. & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8, 393-402
Hickok, G. (2009). The functional neuroanatomy of language. Physics of Life Reviews, 6, 121-143.
The problem with the "all syllables all the time" approach is that there are languages which offer typological challenges to this idea, such as the language formerly known as Bella Coola:
Nuxalk allows words without any resonants at all, such as [sxs] 'seal fat' and [xɬpʼχʷɬtʰɬpʰɬːskʷʰt͡sʼ] 'he had had in his possession a bunchberry plant.' (examples from the Wikipedia page, see Nater 1984 and Bagemihl 1991 for details and discussion, the full cites are also on the Wikipedia page).
So, either you modify the definition of syllable very radically, or you believe that some languages don't parse all segments into syllables, leaving some segments outside of the syllable structure (Berber offers similar problems).
I think the question is ill posed. Unless one can change a phoneme without changing a syllable, which I don't think is possible, then the two aren't independent. Therefore they can't be disambiguated in the strong terms that I think you would want.
I think what you really want to say is more of a neural mechanism question. Something along the lines of: Is there a system which represents 'b' 'a' and 'ba' as three non-overlapping populations? Finding all three in one region would seem to be key, as there might certainly be a region that responds to 'ba' but not 'b' or 'a', just as there would be a region that responds to full words but not single syllables or phonemes.
Bill, interesting language for sure! Clearly this is a challenge for the standard notion of the syllable, but does it argue that the individual segments in such a language are extracted during perception? I don't think so.
Kevin, that's kind of the issue I'm getting at. Researchers study the perception of /ba/ and /da/ and think of it as an investigation of phoneme perception, but in fact, as you point out, these are two different syllables.
The question of necessary contextual processing of phones seems to me slightly different from the question of whether syllables are the only (?, fundamental? smallest?) unit of phonological perception. For example, the Rosenberg et al model (Demisyllable-Based Isolated Word Recognition System http://ieeexplore.ieee.org/iel6/29/26177/01164132.pdf?arnumber=1164132) doesn't process phonemes individually, but does process them by "half-syllables" which are (sometimes) smaller than syllables. Greg, how would this count vis a vis your question?
I'm not entirely sure what you are getting at with the demisyllable example, but just to be clear, I'm not committed to the idea that the unit has to be a syllable. My question is really just this: when listeners hear "pa", do they parse it into /p/ + /a/ as is commonly assumed, or is it just /pa/? If you want to argue the former, what is the evidence?
I can't claim any expertise in this field, but if I were looking for evidence one way or the other I'd probably look at the number of unique phonemes and syllables across languages. The one with lesser variance is likely a more fundamental unit in the brain, e.g. if some language has 10 phonemes and another has 70, but they both end up using 100 syllables, that suggests the brain is primarily grouping by syllables. Similarly, if there's a phoneme drop when you move into alphabetic written languages, that suggests consideration of phonemes is somewhat based on standardizing words into an alphabet. These suggestions both assume there are some cognitive upper limits on the units of speech processing; if these limits aren't approached in our languages this would probably be a dead end (unless you wanted to teach arbitrarily complex fictional vocabularies to children...)
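The variance comparison suggested above could be sketched roughly as follows. This is only an illustration: the inventory sizes are the hypothetical numbers from the thought experiment, not counts from real languages, and the coefficient of variation is one assumed choice of measure among several possible ones.

```python
from statistics import mean, pstdev

def coefficient_of_variation(counts):
    """Spread of inventory sizes relative to their mean (0 means identical sizes)."""
    return pstdev(counts) / mean(counts)

# Hypothetical inventory sizes taken from the thought experiment above,
# not measurements from real languages.
phoneme_counts = [10, 70]
syllable_counts = [100, 100]

cv_phonemes = coefficient_of_variation(phoneme_counts)
cv_syllables = coefficient_of_variation(syllable_counts)

print(cv_phonemes)   # 0.75
print(cv_syllables)  # 0.0
```

On these made-up numbers the syllable inventory varies less than the phoneme inventory, which is the pattern the argument would read as favoring syllables. Whether real cross-linguistic counts look anything like this is an open empirical question.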
Another direction to pursue might be priming, where e.g. hearing the word 'sun' causes the listener to react quicker to the word 'hot'. If you could prime on syllables (boxing->bottle) better than phonemes (star->stir), that would support this hypothesis.
To follow up on Bill's comment, there are some pretty strong reasons to believe that syllables are epiphenomenal metalinguistic constructs (Ohala, 1998) based on:
- their theoretical superfluousness (Steriade, 1999; Blevins, 2004; crucially syllabification is never contrastive)
- psycholinguistic inconsistency (Treiman and Derwing's work on English, e.g. the variable syllabification of 'lemon'; Lin (1997) on the unpredictable syllabification of Piro)
- variable usefulness cross-linguistically (juxtaposed with Bill's example, there are vowel-ful languages like Gokana /kee+ e + e + e + e + e/ 'wake + CAUSE + LOG + him + FOC' (Hyman, 1985))
- evidence that examples of syllable priming are in fact segment priming (Schiller, Costa, Colome, 2002)
Given the strong evidence that we don't use syllables, what are the possible things left mediating between phonetic cues and words? Either phoneme-like units or more articulatory-phonology type representations. That's an argument for another day, but I think it has to be one of those two, no?
(Hi Bill! Fred from Montreal/Ottawa here)
I've talked about this before in other comments on TB, but I'm becoming more and more convinced that we don't do any explicit parsing at all, either into syllables or segments. I pointed out in a comment to a previous post that illiterates and literates of non-alphabetic writing systems are horrible at phoneme monitoring (to be fair, the evidence is less overwhelming for syllable identification).
Virtually everything[*] that (psycho)linguists claim as evidence for sublexical representations is explainable from "word"-based exemplar models, i.e. models in which we remember something like full acoustic forms of words (maybe even multiword chunks). It's a strange source, but Daniel Silverman's book A Critical Introduction to Phonology does a great job of motivating this approach to linguistic knowledge.
In experimental work that seems to unequivocally show phoneme awareness, we're either (a) being called on to do something ecologically suspect (e.g. a phoneme-monitoring task, cf. the ref in my comment on the previous post to Démonet et al's work), or (b) generating novel outputs. And in the (b) case, it's still possible to generate the relevant forms without recourse to segments or syllables.
[*] Stefan Frisch is of the opinion that the existence of neologistic jargon aphasia strongly implies that we have sublexical representations. I haven't thought about it enough to decide whether I'm convinced, and just to be sure, I'd like to see some literature on NJA in illiterates or literates of yadda yadda.
So Marc, the best evidence that we use phoneme-like units is that we don't use syllables?
I imagine you are thinking of syllables not as strings of segments (or, better yet, bundles of features), but as some sort of perceptual unit with no internal structure. In this case /pa/ is actually "#", where "#" bears no contingent relation to /p/ + /a/, correct?
If so, I have a question: What would speech perception by syllables alone look like?
There's some evidence - speech errors, phoneme recovery tasks (I think they're known not to work for larger units), the fact that the target of phonological processes tends to be a segment (which I think is often overlooked) - but it is far from convincing; I don't think that work says much about the neural encoding of speech, either.
You are asking the wrong person for evidence for the phoneme, but I think the most promising finding is the fact that the time scale of LTG sensitivity to sound is on the order of 40ms, which is a tad smaller than a segment.
In retrospect, I guess the RTG's sensitivity to ~200ms units would then be evidence for syllables...
That's right Diogo, no internal structure. I suppose it would be the same as speech perception by phonemes (whatever THAT looks like) except the units are bigger. Think face perception: we seem to get the global configuration without much attention to internal details...
Speech errors don't count -- that's production, although this is good evidence that phonemes are used as a unit of analysis in production.
By phoneme recovery do you mean phonemic restoration effects? This could be taken as evidence that the relevant unit of analysis is larger than the phoneme, smiailr to raednig wtih the itnreanl srtcutrue of wrods srcabmeled, i.e., it is pattern recognition on a higher level.
If there's no internal structure within syllables at all, and no phoneme-sized units, this will cause some major problems in recognizing words and morphemes in resyllabification contexts, e.g. "act" (one syllable) doesn't have the same syllable as the first syllable of "ac.tor" (two syllables). Now, one can try to have a metric of "syllable similarity" (i.e. for which "act" is similar to "ac") but I think that you will find that the multi-dimensional notion of similarity necessitated by this will re-introduce a phoneme-level of representation.
Examples of resyllabification across word boundaries abound in French (and Korean), for example (from the Wikipedia article on French liaison) "premier étage" (= English "first floor") = /pʁə.mjɛ.ʁ‿e.taʒ/ where the liaison messes up two adjacent syllables: for recognition purposes [mjɛ] has to be equated with [mjɛʁ] for "premier" and [ʁe] has to be equated with [e] for "étage". The second case is the killer, as the [e] in "étage" can receive many different consonants as a result of liaison, and so we would have to equate a large number of syllables with the [e] syllable ([le], [ze], ...).
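The liaison problem can be made concrete with a toy sketch. It assumes, purely for illustration, that "étage" is stored as the two isolation-form syllables [e] [taʒ], and it compares lookup over whole, unanalyzed syllables with lookup over the segment string. The function names and the stored form are hypothetical, not a model anyone in this thread has proposed.

```python
# Heard syllables for "premier étage" with liaison, as transcribed above.
heard = ["pʁə", "mjɛ", "ʁe", "taʒ"]

# Hypothetical stored form for "étage", syllabified in isolation.
etage = ["e", "taʒ"]

def contains_syllables(stream, word):
    """True if `word` occurs as a contiguous run of whole, unanalyzed syllables."""
    n = len(word)
    return any(stream[i:i + n] == word for i in range(len(stream) - n + 1))

def contains_segments(stream, word):
    """True if the word's segment string occurs anywhere, ignoring syllable edges."""
    return "".join(word) in "".join(stream)

print(contains_syllables(heard, etage))  # False
print(contains_segments(heard, etage))   # True
```

With atomic syllables, [ʁe] simply is not [e], so the word is missed unless some extra similarity metric equates them. The segment-level match succeeds only because it looks inside the syllable, which is exactly the internal structure the syllable-only hypothesis denies.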
Following up on my last comment, such examples can be extended to virtually all sub-syllabic morphology. For example in the Russian phrase "к Ивану" (= English "to Ivan") the preposition (in this case) is the single consonant [k], and the case marker is the vowel [u], so the three syllables are [kɯ] [va] [nu], but the syllables of "Иван" (= English "Ivan") are [i] [van]. So in recognition we have to have some way to relate [kɯ] [va] [nu] with the morpheme sequence of syllables (that's the hypothesis) [k]-[i][van]-[u]. Note that (1) [k] is not a pronounceable syllable in Russian (and so is an abstract syllable in this analysis; it's pronounced [ko] in pre-jer contexts, so we could instead recover [ko]-[i][van]-[u] somehow), (2) we need to equate the phonemes [i] and [ɯ] in this example (due to a rule of Russian); in syllable terms we have to equate all words beginning with [i...] syllables (whatever that means without phonemes) with [Cɯ...] syllables where C is a hard consonant and the [...] is held constant (whatever that means in syllable terms).
Perhaps simpler examples are all English words with sub-syllabic suffixes (-t, -d, -z, -θ, ...) so that we have to equate the syllable [dip] "deep" with that of [dɛpθ] "depth", or [kæt] "cat" with [kæts] "cats". Again, parsing these into morphemic "syllables" would require abstract C-only syllables [dip][θ] and [kæt][s], and we're well on our way to reconstructing all of the individual phonemes as "abstract" syllables.
Thanks for the clarification, Greg! I have another question though. You say that the only difference between "perception by phones" vs "perception by syllables" would be that the units of perception are bigger. So, is it safe to assume that while /p + a/ and unit "#" would not be hierarchically related (ie, "#" has no internal structure), "#" is still abstract enough that different /pa/ tokens would be perceived as the same "#" unit?
The reason I am asking is that we do have an idea of what speech perception by segments would look like (even though it might turn out to be wrong), and this has to do with the end point of the process.
We have good reason to believe that words are internally represented as bundles of features (which here I will pretend are the same thing as segments, for the sake of the argument), and so if I want to retrieve the word /k+ae+t/ and its related meaning, it makes sense that I would have to reconstruct the string [k+ae+t] from the speech stream. To what degree [k+ae+t] would have to look like /k+ae+t/ is an open question, of course, and it might be the case that an "incomplete" sub-segmental representation might do the trick. The point, however, is that in this kind of model, we are using the same code for perception and storage.
However, if I am doing everything by "syllables" (in quotes because these would not be really syllables, but units with no internal structure), then what would I be retrieving? On the perceptual side, we have "syllable #", while on the storage side we have the entry /k+ae+t/. How do they relate to each other?
It seems that we would have to posit that "syllable #" would somehow be able to make contact with the meaning related to the entry /k+ae+t/ in the mental lexicon. The only mechanism that I can imagine would be some sort of associative memory of acoustic events and lexical entries (ie, some sort of episodic model). The problem here of course is the stuff that Bill was alluding to, which is that words can change their shape according to the surrounding context (sometimes quite dramatically), so it is hard to see how acoustic similarity alone would do it.
Of course, none of this is actually "evidence for" the role of segments in perception, I am just trying to figure out what the alternative model would look like.
I think you guys have made a good point that the notion of the syllable is not going to do what we need it to do in all situations. So let me back up.
When the Haskins folks first started looking at the acoustic features that drive speech perception, they found that there wasn't an acoustic pattern that uniquely mapped onto phonemes. Hence the motor theory was born. But what if phonemes aren't the relevant unit? Maybe there is a better mapping between acoustic patterns and something larger. Massaro has suggested it is the syllable (Oden & Massaro. Psychological Review. Vol 85(3), May 1978, 172-191), hence my suggestion. Maybe this is too restrictive though. Maybe it is a more flexible mapping between acoustic features and *something*, which may be morphemes, syllables, words -- whatever works in a given situation.
So [kaet] is a spectrotemporal pattern that is mapped onto one morpheme, whereas [kaets] is a pattern that is mapped onto two.
Regarding words as bundles of features, are you assuming articulatory features? If so, I would ask whether this might only be true of the representation of words on the production side. Or do you have in mind articulator-free features? In which case we might ask exactly what this buys us.
Regarding the relation between syllable # in perception and /k+ae+t/ in "storage"... what do you mean by "storage"? Why not store syllable # on the perception side and a bundle of articulatory features on the articulatory side. Lexical access happens by retrieving representations of the form "syllable #" which are linked to conceptual semantic representations. The link between perception and production is then a mapping between "#" and /k+ae+t/.
I don't see why the different acoustic shapes that a word can take are such a problem in principle. Yes, we have to figure out how different acoustic patterns can be mapped onto the same higher level representation -- same as the problem in vision -- but the same problem exists, perhaps more dramatically, at the phoneme level.
I've been doing a lot of thinking on this issue as part of my dissertation, and as far as I can gather, the best evidence for prelexical segmental representations of some grain size (probably bundles of features) being used in the normal course of speech recognition comes from the perceptual learning literature (e.g. McQueen et al. 2006).
The basic argument is that the types of generalizations formed by listeners require some prelexical segmental units for those generalizations to operate on or retune. I don't think, however, that this work really differentiates between bundles of features, phonemes, syllables, or some combination as the locus of perceptual learning.
McQueen, J. M., Cutler, A., & Norris, D. (2006). Phonological Abstraction in the Mental Lexicon. Cognitive Science: A Multidisciplinary Journal, 30(6), 1113-1126.
I think articulatory phonology (as contrasted with motor theory) does present an alternative to the syllable and the phoneme: the gesture. This allows for temporally overlapping representations (phonemes don't) with the flexibility you're speaking of.
Following up on Bill again, the re-syllabification point also gets at what I was trying to communicate earlier: Phonological rules tend to target (single) phonemes and not some other unit of representation. This is true of re-syllabification (only single phonemes re-syllabify), allomorphy (k->s due to English's -ity suffix), allophony (aspiration in English for initial voiceless plosives) and on and on. Across languages, processes like these generally target single phonemes. While the argument has been made that vowel harmony targets syllables, for example, or that allophonic processes target larger units because of co-articulation, I think it's hard to get around the need for the phoneme unit in describing these processes.
The problem with any articulatory-based account of perception is that it fails empirically: it can't explain speech perception in people who have lost their ability to articulate speech, in people who have failed to acquire the ability to speak, or in animals who don't even have the potential to speak. Sooner or later, the field is going to have to come to terms with these facts.
The fact that phonological rules apply to phonemes (I wouldn't dispute this) is not an argument for the involvement of phonemes in speech perception. After all, the phenomena that these rules capture hold of speech production -- it's therefore no surprise that theories stated over articulatory gestures seem to capture the relevant facts. It is a perfectly legit hypothesis that the same representations/processes apply for perception, but this hypothesis has been falsified by the sorts of data I mentioned above.
So I haven't seen any knock down evidence in this discussion that phoneme-size units are necessarily used in speech perception.
I think slips of the tongue constitute good evidence that phonemes are separately represented in the speech planning process (darn bore for barn door...). Maybe there is evidence of this sort in so called slips of the ear. Does anybody know whether phoneme exchanges occur, for example, in slips of the ear?
One can uncontroversially use an articulatory-phonology theory of representation with only a facilitatory impact of the motor system, rather than an essential role (i.e. do without motor-theory, which I agree is flawed). This also fits within the H&P model, but in lieu of cues->features->(phonemes)->words, you have cues (acoustic, motoric, contextual)->gestural scores (in the abstract sense)->words.
Regarding phonological rules: Rules must be acquired, therefore recognized in our input. So, at some level, either perceptual or at a level used for generalization extraction, there must be a phonemic representation. I think the null hypothesis is the former and even if the latter is true, this phonemic information is available to and must ultimately be used for perception as in the [grim binz]-> green beans experiments or people's ability to go from noncicity to [nonsik]. (This is only true if you believe phonological generalizations are stated over phonemes.)
I do realize that these two points work at cross purposes.
Cues (acoustic, motoric, contextual) -> gestural scores -> words.
I buy the multiple cue part, but why the gestural score? And what is an abstract gestural score? Does a patient with pre-lingually acquired bilateral anterior operculum lesions and a resulting anarthria (can't speak) have abstract gestural scores? Does a chinchilla have them?
My view is that as soon as you abstract away from the actual motor system, whether you call them gestural scores or "intended gestures" you are in the auditory system. That is, the commonality (parity) between auditory and motor speech is not in the gesture but in the auditory system. Put differently, the motor speech system is not aiming for an abstract gesture, it is aiming for a sound.
Regarding rules, how about this: Phonological rules are a description of the sensory-to-motor mappings that allow us to transform an acoustic representation of speech into a motor representation of speech (HP's "dorsal stream"). They do not describe the mapping between acoustic representations of speech and conceptual structures (HP's "ventral stream").
I'm not familiar with the green beans experiments. I'll have to look those up.
Three references that show that phonology or phonological mappings must also reside in the ventral stream:
Marslen-Wilson, W. D., Nix, A., & Gaskell, G. (1995). Phonological variation in lexical access: Abstractness, inference, and English place assimilation. Language and Cognitive Processes, 10, 285–308.
Gow, D. W. (2001). Assimilation and anticipation in continuous spoken word recognition. Journal of Memory and Language, 45, 133–159.
Sumner, M., & Samuel, A. G. (2005). Perception and representation of regular variation: The case of final /t/. Journal of Memory and Language, 52(3), 322–338.
I'll have a look at those. Thanks. Care to provide a brief summary of what's in them?
Jusczyk, Goodman, and Baumann (1999) found that 9-month-olds showed a preference for lists of words that shared an initial consonant (bow, boot, bat, …) over unrelated lists of words, which was taken as sensitivity to phonemes (more precisely: to the internal structure of syllables). Could this be a useful piece of evidence for the role of phonemes in perception?
Well, bow, boot, bat do have onsets in common, but I don't see why that necessarily means the infants are analyzing the perceptual events into discrete phonemic units. Maybe they just like to hear sounds that start similarly.
Re:"Maybe they just like to hear sounds that start similarly." How is similarity defined for this purpose if not in specifically phonemic terms? The bursts and formant transitions are different for /b/ before /æ/ and /u/. What mechanism equates the sound pattern at the start of [bæ...] with that of [bu...] (and doesn't equate it with other patterns belonging to other phoneme sequences)?
Bill: I'm not an expert here, so just curious... are you denying that sounds generated by bilabial stops have any acoustic similarity?
No, I'm not denying that there are common aspects to bilabial voiced stop bursts. But the effect that Nina describes is phonemically specific (at least in that case). There are many ways to carve up the notion of "acoustic similarity" and very few of them will settle on an equivalent to phonemic similarity. Why, for example, does this effect not extend to all voiced stops [b,d,g] which are certainly acoustically similar (albeit under a different metric for similarity)? How do we pick out the right metric for similarity here from the many to choose from; the one that effectively defines phoneme identity?
Remember we are trying to explain an empirical result that babies preferred lists of words starting with the same phoneme to unrelated lists of words. We don't need to solve all problems to explain the result. We only need to show that the same-onset lists were more acoustically similar than the different-onset lists.
Another issue to think about with respect to this finding is that we don't know what level of processing is driving the babies' preference. Suppose 9-month-olds are in the middle of learning a mapping between undecomposed syllables and the motor gestures that can reproduce them. Maybe it is the motor similarity that is driving the preference. Put differently, just because we present a stimulus perceptually doesn't mean that the response is directly output from perceptual computations.
My colleague Sven Mattys at U of Bristol has some evidence for the existence of phonemes (and syllables). Below is the reference and abstract of a paper that speaks to this issue, and a passage that refers to even more direct evidence for the role of phonemes in perception.
Mattys, S.L. & Melhorn, J.F (2005). How do syllables contribute to the perception of spoken English? Evidence from the migration paradigm. Language and Speech, 48, 223-253.
The involvement of syllables in the perception of spoken English has traditionally been regarded as minimal because of ambiguous syllable boundaries and overriding rhythmic segmentation cues. The present experiments test the perceptual separability of syllables and vowels in spoken English using the migration paradigm. Experiments 1 and 2 show that syllables migrate considerably more than full and reduced vowels, and this effect is not influenced by the lexicality of the stimuli, their stress pattern, or the syllables’ position relative to the edge of the stimuli. Experiment 3 confirms the predominance of syllable migration against a pseudosyllable baseline, and provides some evidence that syllable migration depends on whether syllable boundaries are clear or ambiguous. Consistent with this hypothesis, Experiment 4 demonstrates that CVC syllables migrate more in stimuli with a clear CVC-initial structure than in ambisyllabic stimuli. Together, the data suggest that syllables have a greater contribution to the perception of spoken English than previously assumed.
And a relevant passage that highlights some evidence for phonemes:
The migration paradigm is relevant to the quest for languages’ units of perception because its response patterns originate from auditory illusions rather than from conscious decision processes, a recognized advance in the study of perception (Fodor & Pylyshyn, 1981; Marcel, 1983; Morais & Kolinsky, 1994; Treisman, 1979). Specifically, Morais (1985) contends that a task that bypasses access to conscious representations in the production of a response provides greater insight into perceptual mechanisms than one that does not. The migration paradigm suits this category quite well. For instance, illiterate Portuguese speakers have no conscious awareness of phonemes, as measured by phoneme detection, deletion, and addition tasks (e.g., Morais, Cary, Alegria, & Bertelson, 1979), but, yet, they experience phoneme migration to the same extent as literate speakers do (Morais & Kolinsky, 1994). Thus, migrations involve speech properties that do not need to be accessible to conscious experience to recombine into a new percept, which consigns the method to low processing levels compatible with the perception stage.
I proposed the syllable, V, VC, and CV (where V=vowel and C=consonant or consonant cluster) as the unit of speech perception in 1972, and have supported this hypothesis in a series of experiments since that time.
Massaro, D.W. (1972). Preperceptual Images, Processing Time, and Perceptual Units in Auditory Perception. Psychological Review, 79(2), 124-145.
The question is does speech perception parse heard speech into sub-lexical units? Then the next question is, is that sub-lexical unit the phoneme or some other unit? I opt for some sort of hierarchy of syllable types integrated at the production end to controlled articulatory gestural routines.
>>Nuxalk allows words without any resonants at all, such as [sxs] 'seal fat' and [xɬpʼχʷɬtʰɬpʰɬːskʷʰt͡sʼ] 'he had had in his possession a bunchberry plant.' (examples from the Wikipedia page, see Nater 1984 and Bagemihl 1991 for details and discussion, the full cites are also on the Wikipedia page).<<
Well, first, I know it's clichéd to complain about Wiki for linguistics, but it does stink.
Two, this seems to be not typical of human languages. But more importantly, how prevalent is this in the language itself? Just because we have syllables and words without vowels, that doesn't mean the majority of the lexicon is like this.
I'm not sure what the complaint about Wikipedia is here. The Wikipedia examples are drawn directly from Bagemihl and Nater. Bagemihl is available on-line (but behind a paywall), Nater is not available online. So the most expedient way to cite this seemed to be through the Wikipedia entry. I did give the original sources (which are given in full on the Wikipedia page).
Nuxalk has many words without vowels. For morphological reasons it's more difficult to find all-obstruent words (those not containing any vowels, liquids or nasals). Very similar issues arise in Tashlhiyt Berber, an unrelated language (http://www.springer.com/education+%26+language/linguistics/book/978-1-4020-1076-7). The problem, however, is much more general as pointed out in earlier comments. There are many sub-syllabic morphemes in languages throughout the world. It remains completely unclear how to handle such cases if only syllables are allowed without any subsyllabic structures.
My complaints about wiki are general. Too often when I try to track down the cited sources they lead to no pages. I would be darned if I would rely on one wiki article for knowledge about one language. OK, so the language has many words that are 'all obstruent'. Care to give us a count? The majority of words? A large minority? What?
The problem with the usual phonological syllable is it is as static and segmented as other units, such as phonemes and features. Go to something dynamic and articulatory and you get something that is very hard to reference in discourse but comes closer to modelling controlled speech.