Wednesday, March 5, 2014

Hierarchical and Independent Levels of Representation in Speech Production: Discussion of the HSFC Model

Guest post by Matt Goldrick and Adam Buchwald

As detailed in a 2012 Talking Brains post, Greg and colleagues have proposed a model for speech production that aims to synthesize research from motor control, psycholinguistics, and neuroscience. This year, the inaugural issue of Language, Cognition, and Neuroscience (a re-christening of Language and Cognitive Processes) was guest edited by Albert Costa and F. Xavier Alario. It featured an article by Greg outlining a descendant of this model, the Hierarchical State Feedback Control model (HSFC). This target article was accompanied by a number of commentaries, including one co-authored by the two of us and Brenda Rapp, as well as a response by Greg.

We (Matt and Adam) wanted to take advantage of the extra space afforded by Talking Brains to continue this conversation. The H in HSFC emphasizes the key role of hierarchical representations in Greg's proposal. In this post, we'd like to articulate why psycholinguists and neuroscientists have argued that in addition to such hierarchical representations, distributed/parallel encoding plays a critical role in language production.

To orient the discussion, consider two classical types of neurocognitive representational structures from vision:

1) Hierarchical representations. In representations that have this type of structure, there is a mapping (a necessary relationship) between two sets of representations. Consider classic simple vs. complex cells (Hubel & Wiesel, 1962). Under this proposal, simple cells preferentially respond to oriented bars in particular locations in the visual field. By integrating responses over many simple cells, complex cells respond to oriented bars across multiple locations. Critically, there is a precise mapping between these two levels of representation; the response properties of complex cells are defined by a function stated over the response properties of simple cells. 

2) Parallel, independent representations. In representations that have this type of structure, the relationship between the two sets of representations is not defined by a direct mapping which spells out one level in terms of the other; rather, they are independent dimensions of structure. These dimensions can be linked or bound together, but they need not necessarily co-occur. Consider Treisman and colleagues' classic Feature-Integration Theory, which claims that some dimensions of visual stimuli are initially processed independently and only later bound together. This proposal provides a ready account of illusory conjunctions (Treisman & Schmidt, 1982). For example, if letter identity and color are coded independently, this can explain how a display with green Xs and brown Ts can give rise to the erroneous perception of a green T; this percept would be unlikely if letter identity and color were encoded in a single representation. Critically, the two types of information must be encoded independently (but in parallel) for these illusory conjunctions to occur during the later process of binding.  

The HSFC model emphasizes the role of hierarchical representations. There is abundant evidence that these play a role in speech production. With respect to speech motor control, many accounts adopt a syllable-sized, relatively coarse-grained specification of motor movements, which directly maps onto detailed information regarding the precise temporal and kinematic coordination involved in production. There is also evidence that there are multiple levels of segment-sized representations that specify different types of information. A classic distinction is between context-independent vs. position-specific aspects of sound structure. The context-independent representations encode information about the sounds (e.g., /t/ in table and stable), and these map to position-specific representations that spell out the details (e.g., table contains aspirated [tʰ] and stable contains unaspirated [t]). Evidence that these constitute distinct levels of representation includes data from individuals with acquired speech impairment (Buchwald & Miozzo, 2011). While this is not directly specified in the current HSFC model, it is clearly consistent with the overarching account as noted in Greg's response.

But what we'd like to emphasize is that parallel, independent representations also play a key role in language production. In particular, there are abundant reasons to believe that at certain levels of representation syllabic and segmental structure are not organized in a strict hierarchical fashion, but rather form parallel aspects of form representation. A number of results suggest that rather than syllables being defined as chunks of segments, syllable structure defines a frame; segments are then bound or linked to positions within this frame (see Goldrick, in press, for review and discussion of other dimensions of phonological structure).

To make this contrast explicit, consider the syllable "cat." Under a strictly hierarchical theory, this syllable could be defined by a mapping from [kaet] to the component segments [k-Onset] [ae-Nucleus] [t-Coda]. Under a theory utilizing independent representations, there is an [Onset]-[Nucleus]-[Coda] syllable frame and, independently, three segments /k/, /ae/, /t/. The syllable is represented by the bindings /k/-[Onset]; /ae/-[Nucleus]; /t/-[Coda].
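As a rough illustration of the two schemes for "cat," here is a minimal Python sketch. Every name in it (`HIERARCHICAL`, `BoundSyllable`, `bind`, `misbind`) is our own hypothetical shorthand, not part of the HSFC model or of any published implementation, and the onset/coda swap is a made-up example rather than an attested error.

```python
from dataclasses import dataclass

# Hierarchical scheme (hypothetical encoding): the syllable chunk maps
# directly onto position-tagged segments. A segment has no identity
# apart from the slot it fills.
HIERARCHICAL = {
    "kaet": [("k", "Onset"), ("ae", "Nucleus"), ("t", "Coda")],
}

# Independent scheme (hypothetical encoding): the frame and the segments
# are stored separately and linked by explicit bindings.
@dataclass
class BoundSyllable:
    frame: tuple     # syllable positions, e.g. ("Onset", "Nucleus", "Coda")
    segments: tuple  # segment identities, e.g. ("k", "ae", "t")
    bindings: dict   # position -> segment

def bind(frame, segments):
    """Link each segment to the frame position at the same index."""
    return BoundSyllable(frame, segments, dict(zip(frame, segments)))

def misbind(syllable, pos_a, pos_b):
    """Swap the segments bound to two positions.

    The frame and the set of segments are left untouched; only the
    bindings change.
    """
    swapped = dict(syllable.bindings)
    swapped[pos_a], swapped[pos_b] = swapped[pos_b], swapped[pos_a]
    return BoundSyllable(syllable.frame, syllable.segments, swapped)

cat = bind(("Onset", "Nucleus", "Coda"), ("k", "ae", "t"))
error = misbind(cat, "Onset", "Coda")  # illustrative swap only
```

In the second scheme a binding can change while the frame and the segments stay exactly as they were, which is the sense in which the two dimensions are independent yet linked. The dictionary version has no comparable operation: changing a segment's position means rewriting the chunk itself.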

The first form of evidence in favor of the independent representations perspective comes from illusory conjunctions in production. Speech errors can result in the mis-ordering of segments. In the majority of these errors, the segments occur in the wrong syllable but the correct syllable position (e.g., bad cat misproduced as "bad bat"). However, a substantial minority (more than 20% of errors in corpora of spontaneous speech; Vousden, Brown, & Harley, 2000) result in segments being produced in incorrect syllable positions (e.g., film misproduced as "flim"). Just as letter identity and color form independent, dissociable dimensions of visual representation, segment identity and syllable position form dissociable dimensions of phonological representation in production.

Evidence from priming points to a similar conclusion. Colored object naming is facilitated by segmental overlap between the color and object name, even when the segments occur in different syllable positions (e.g., green flag; Damian & Dumay, 2009). In addition, production of phrases made up of two nonsense words is facilitated when the two nonsense words have syllables with the same structure compared to nonwords that do not have matching structures -- even when there are no segments shared across the two syllables (Sevald, Dell, & Cole, 1995). For example, repeating two nonwords that both start with CVC syllables (e.g., KEM TIL.FER) or CVCC syllables (KEMP TILF.NER) is faster than repeating nonwords that start with syllables with contrasting consonant-vowel patterns (e.g., KEM TILF.NER or KEMP TIL.FLER). This occurs in spite of the syllables sharing no segments (e.g., KEM and TIL).
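To spell out what "same structure, no shared segments" means for the nonword pairs above, here is a small Python sketch. It rests on simplifying assumptions of our own: letters stand in for segments, the five orthographic vowels stand in for syllable nuclei, and a period marks a syllable boundary as in the examples. The function names are hypothetical and this is not the procedure used by Sevald, Dell, and Cole.

```python
VOWELS = set("AEIOU")  # assumption: orthographic vowels stand in for nuclei

def cv_pattern(syllable):
    """Return the consonant-vowel skeleton of a syllable, e.g. KEM -> CVC."""
    return "".join("V" if ch in VOWELS else "C" for ch in syllable.upper())

def first_syllable(nonword):
    """Take the text before the first period (the syllable boundary marker)."""
    return nonword.split(".")[0]

def compare(a, b):
    """Compare the first syllables of two nonwords on structure and content."""
    syl_a, syl_b = first_syllable(a), first_syllable(b)
    return {
        "same_structure": cv_pattern(syl_a) == cv_pattern(syl_b),
        "shared_letters": sorted(set(syl_a) & set(syl_b)),
    }

matching_cvc = compare("KEM", "TIL.FER")      # CVC vs. CVC
matching_cvcc = compare("KEMP", "TILF.NER")   # CVCC vs. CVCC
mismatch_a = compare("KEM", "TILF.NER")       # CVC vs. CVCC
mismatch_b = compare("KEMP", "TIL.FLER")      # CVCC vs. CVC
```

In all four pairs the first syllables share no letters, so under these assumptions the only thing that separates the matching from the mismatching pairs is the skeleton, which is the dimension the priming effect tracks.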

Based on data such as these, psycholinguistic theories (e.g., Shattuck-Hufnagel, 1992) have proposed that syllables and segments are not related in a strictly hierarchical fashion, but rather form independent-yet-linked dimensions of sound structure. That's not to say that the links are purely arbitrary; only certain segments can be associated to particular syllable positions (e.g., in English, /ng/ can be associated to coda but not onset). But segments are not merely the "elaborated" form of syllabic chunks; they form independent entities.

While hierarchical representations are a critical part of speech production, it's important to acknowledge the critical role of non-hierarchical representation. Mirroring other domains of processing, both representational schemas serve critical functions in the neurocognitive mechanisms supporting speech. 

Buchwald, A. & Miozzo, M. (2011). Finding levels of abstraction in speech production: Evidence from sound-production impairment. Psychological Science, 22, 1113-1119.
Damian, M. F., & Dumay, N. (2009). Exploring phonological encoding through repeated segments. Language and Cognitive Processes, 24, 685-712.
Goldrick, M. (in press). Phonological processing: The retrieval and encoding of word form information in speech production. In M. Goldrick, V. Ferreira, & M. Miozzo (Eds.) The Oxford handbook of language production. Oxford: Oxford University Press.
Hickok, G. (2014a). The architecture of speech production and the role of the phoneme in speech processing. Language, Cognition and Neuroscience, 29, 2-20.
Hickok, G. (2014b). Towards an integrated psycholinguistic, neurolinguistic, sensorimotor framework for speech production. Language, Cognition and Neuroscience, 29, 52-59.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1), 106-154.
Rapp, B., Buchwald, A., & Goldrick, M. (2014). Integrating accounts of speech production: The devil is in the representational details. Language, Cognition and Neuroscience, 29, 24-27.
Sevald, C. A., Dell, G. S., & Cole, J. S. (1995). Syllable structure in speech production: Are syllables chunks or schemas? Journal of Memory and Language, 34, 807-820.
Shattuck-Hufnagel, S. (1992). The role of word structure in segmental serial ordering. Cognition, 42, 213-259.
Treisman, A., & Schmidt, H. (1982). Illusory conjunctions in the perception of objects. Cognitive Psychology, 14, 107-141.
Vousden, J. I., Brown, G. D., & Harley, T. A. (2000). Serial control of phonology in speech production: A hierarchical model. Cognitive Psychology, 41, 101-175.


Greg Hickok said...

Hi Matt and Adam,
Thanks so much for your post. You make an excellent point, which I will not dispute. With respect to the HSFC, I have to say that my conceptualization of "hierarchical" is not terribly technical. All I really meant to emphasize is that there are at least two levels of control circuits in the brain involved in speech production, one driven by an auditory-motor loop and (capable of) coding larger chunks such as syllables and another driven by a somato-motor loop that is coding smaller chunks of information. I think your claim that syllables and segments are independent-but-related fits my functional neuroanatomical view precisely. I certainly do not mean to imply a view in which the syllable code is built up out of the segments, like complex cells from simple cells in the example you provided. So I think we are in perfect agreement on this and I thank you for clarifying things. It's an important point. It would be fun to see if we can get some traction on working out exactly how syllable frames might be coded neurally and related to segment-sized codes.

Now, where we may disagree, and it would be fun to discuss (more), is whether there is a role for segment-sized units in speech recognition.

Matt Goldrick said...

Thanks, Greg! Maybe broadening the discussion a bit beyond segments, it might be useful to think about the degree to which there's representational (or processing) parity across recognition and production. The default assumption in psycholinguistics has been yes, but it's not clear to me why that would need to be true conceptually (much less whether it's empirically valid).