Tuesday, June 1, 2010

An egregious act of methodological imperialism

In an 'Update' in a recent issue of TICS (Weak quantitative standards in linguistics research, DOI: 10.1016/j.tics.2010.03.005), Gibson & Fedorenko (GF) commit an egregious act of methodological imperialism, and an unwarranted one at that.

GF complain that one key source of data for theoretical linguistics (and particularly for syntax and semantics research), acceptability or grammaticality judgments, is not "quantitative." They advocate for some intuitive standard of what it means to do the 'right kind' of quantitative work, arguing that "multiple items and multiple naive experimental participants should be evaluated in testing research questions in syntax/semantics, which therefore require the use of quantitative analysis methods." They contend that the "lack of validity of the standard linguistic methodology has led to many cases in the literature where questionable judgments have led to incorrect generalizations and unsound theorizing." In a peculiar rhetorical twist, GF go on to highlight their worry: "the fact that this methodology is not valid has the unwelcome consequence that researchers with higher methodological standards will often ignore the current theories from the field of linguistics. This has the undesired effect that researchers in closely related fields are unaware of interesting hypotheses in syntax and semantics research."

Now, it's hardly new to express worries about grammaticality judgments. Why this is considered an 'Update' in a journal specializing in Trends is a bit mystifying -- the topic has been revisited for decades (e.g. Spencer 1972, Clark 1973, and many thereafter), and is at best an 'Outdate.' And other than some animosity towards theoretical linguistics from Ted and Evelina, two established and productive MIT psycholinguists, it's not clear what trend is being thematized by the journal, beyond the pretty banal point that in absolutely every domain of research there are, unfortunately, examples of bad research.

But do linguists really need to be told that there is observer bias? That experiments can be useful? That corpus analyses can yield additional data? I must say I found the school-marmish normativism very off-putting. Like all disciplines, linguistics relies on replication, convergent evidence (e.g. cross-linguistic validation), and indeed any source of information that elucidates the theoretical proposal being investigated. Some theories survive and are sharpened, others are invalidated. Is this different from any other field? GF seem to believe in a hierarchy of evidence and standards, in which some unspecified sense of quantitative analysis is considered 'higher' and 'better.' Would they be willing to extend that perspective to those of us who do neurobiological research? Are my data even better, because both quantitative and 'hard'? Not a conclusion we want to arrive at for cognitive neuroscience of language, I think.

Culicover & Jackendoff have published a response (Quantitative methods alone are not enough: Response to Gibson and Fedorenko. 10.1016/j.tics.2010.03.012) that tackles some of this. Their tone is pretty conciliatory, although they rightly point out that "theoreticians' subjective judgments are essential in formulating linguistic theories. It would cripple linguistic investigation if it were required that all judgments of ambiguity and grammaticality be subject to statistically rigorous experiments on naive subjects, especially when investigating languages whose speakers are hard to access. And corpus and experimental data are not inherently superior to subjective judgments." Their points are cogently made -- but it's hardly a spirited response. Their meta-commentary is too vanilla and of the "why can't we all be friends?" flavor.

On the other hand... I just read a very clever and appropriately aggressive and quantitative response to GF that I wish TICS had published. It's my understanding that TICS had a chance to look at this response, and I am baffled that they didn't publish this actually innovative and insightful commentary. It is by Jon Sprouse and Diogo Almeida (SA) at UC Irvine (The data say otherwise. A response to Gibson and Fedorenko.) SA analyzed data from more than 170 naïve participants rendering judgments on two types of phenomena that make frequent appearances in linguistics and psycholinguistics (wh-islands and center-embedding). Using a quantitative (resampling) analysis, they illustrate how many judgments and how many contrasts one needs to obtain a significant result given the effect sizes typical of these sorts of studies. Compellingly, they show that vastly different numbers of subjects and contrasts are necessary depending on the phenomenon under investigation: the kinds of contrasts that linguists tend to be worried about are clearly evident with very few data points, whereas surprisingly large data sets are necessary to achieve a satisfactory result for psycholinguistic phenomena. They conclude, in my view quite correctly, that the objects of study are simply quite different for linguistics and psycholinguistics. There may be controversy, but there is no issue ...
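
To make the logic of SA's resampling exercise concrete, here is a minimal sketch of power estimation by simulation. This is emphatically not their code or their data; the effect sizes and sample sizes below are invented purely for illustration. The point is simply that a large effect of the sort linguists trade in reaches conventional power with a handful of participants, while a small effect needs far more.

```python
# Minimal sketch of power estimation by resampling (illustrative only;
# not Sprouse & Almeida's code -- the effect sizes below are invented).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimated_power(effect_size, n, n_sims=2000, alpha=0.05):
    """Fraction of simulated experiments (n participants per condition)
    that detect a true difference of `effect_size` (in SD units)."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n)          # baseline condition
        b = rng.normal(effect_size, 1.0, n)  # contrast condition
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

# A large "linguistics-sized" effect vs. a small "psycholinguistics-sized"
# effect (both values hypothetical).
for label, d in [("large effect (d = 1.5)", 1.5), ("small effect (d = 0.3)", 0.3)]:
    for n in (5, 10, 20, 40, 80, 160, 320):
        if estimated_power(d, n) >= 0.8:
            print(f"{label}: ~{n} participants per condition for 80% power")
            break
```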

Readers should form their own opinions on this issue, but I urge them to look at this trio of brief commentaries.


Gibson, E., & Fedorenko, E. (2010). Weak quantitative standards in linguistics research. Trends in Cognitive Sciences. DOI: 10.1016/j.tics.2010.03.005

7 comments:

Greg Hickok said...

Well congrats to Gibson and Fedorenko for stirring the pot a bit. They have succeeded in getting the attention of the linguistics community and maybe some progress will come of it in the end.

So let me stir the pot a bit more, this time targeting the quantitative psycholinguistic soup. While quantitative measurements are all good, they can deceive if your measurement isn't valid. Linguistic theory is built largely on a single measure, acceptability judgments of native speakers of a language. It is an assumption of the theory that this is a valid measure of something important in terms of language knowledge. Psycholinguistic theory is built on quantification of lots of different measures (reading times, various judgments, ERPs, priming, etc.) and there are lots of debates about which measure reflects what. It may be that the noise in the measurement in psycholinguistics is worse than the lack of quantification in traditional linguistic experimentation (yes, they are experiments).

Pot gently stirred. Now let's put it in the food processor. The vast majority of psycholinguistic research involves reading, often in very unnatural ways. Yet, most psycholinguists generalize their theories to natural language processing. This strikes me as a major problem. For example, looking at the hemodynamic brain response, we know that Broca's area lights up like crazy during reading, but not so much during auditory sentence comprehension. What if the bulk of the field of psycholinguistics is measuring, primarily, Broca's area's contribution to sentence processing induced by the unnatural act of reading a sentence? Quantitative or not, this would render much of the work invalid *as a theory of natural language processing*. Of course, it would still be valid for reading, but again, that is not the goal of most psycholinguists.

Quantification matters. Task matters more.

David Poeppel said...

Colin Phillips has written a characteristically thoughtful analysis of the issue of data for linguistics and psycholinguistics -- it's a good and funny read as well. If you're into this sort of thing, you should definitely read Colin's piece about the Gibson/Fedorenko challenge. Colin tackles the "we need a bunch of naive subjects and many many stimuli so that we can do a quantitative analysis" argument and shows when it can and can't work. His paper is an excellent companion piece to the Sprouse and Almeida response to Gibson & Fedorenko.

To amplify Greg's point: to those of us who do cognitive neuroscience and are moving towards more naturalistic stimulation, the tasks employed by psycholinguistics are often very peculiar, and they in turn lead to neurophysiological data that seem to say much more about the tasks being executed than about the underlying computations that form the basis for doing the task in the first place.

Matt Goldrick said...

Jumping off from Greg's point and David's amplification, I think that the real issue--for linguists, psycholinguists, and cognitive neuroscientists alike--is less about quantification (although this is important) than about relating behavioral effects to the underlying cognitive and neural mechanisms.

It's important that we gain an accurate, detailed picture of behavior. This is where quantification may help, as well as gathering data from a larger, appropriately selected population of individuals with appropriately selected tasks. But what is in many ways the harder issue--and one that many researchers across disciplines have failed to take seriously enough--is that one needs to articulate (and empirically justify) the link between hypothesized cognitive/neural mechanisms and behavior.

I think it's uncontroversial that any behavior, be it well-formedness or grammaticality judgments, reaction times, etc., arises due to the complex interaction of many component processes. (For example, Chomsky (1980: 188) writes "the system of language is only one of a number of cognitive systems that interact in the most intimate way in the actual use of language.") Using behavioral data to inform theories therefore requires "a sufficiently detailed model of the cognitive systems of interest to guide the search for richly articulated patterns of performance" (Caramazza, 1986: 66).

Psycholinguists may make incredibly inaccurate assumptions, but many of them should get credit for at least attempting to articulate how their behavioral measures are linked to a complex set of interacting cognitive processes. In contrast, grammaticality and well-formedness judgments are often taken as relatively "direct" reflections of linguistic knowledge. In this paper (in press, see below) I discuss concrete empirical results in phonology that undermine this assumed direct relationship.

Of course, articulating these assumptions doesn't mean we're going to be correct. As Greg points out, the assumed relationships between laboratory psycholinguistic tasks and speech processing in more naturalistic situations have increasingly been called into question. But failing to even attempt to seriously consider the complex web of cognitive/neural mechanisms that underlie behavior seems to me likely to lead to erroneous conclusions.

References
Caramazza, A. (1986). On drawing inferences about the structure of normal cognitive systems from the analysis of patterns of impaired performance: The case for single-patient studies. Brain and Cognition, 5, 41-66.
Chomsky, N. (1980). Rules and representations. New York: Columbia University Press.
Goldrick, M. (in press). Utilizing psychological realism to advance phonological theory. In J. Goldsmith, J. Riggle, & A. Yu (Eds.) Handbook of phonological theory (2nd edition). Blackwell.

Greg Hickok said...

Matt, can you summarize the argument/observations you make in your forthcoming paper? Sounds interesting.

Jon Sprouse said...

In response to Matt's comment, I would say that it is important to distinguish cognitive/neural mechanisms (processes) from cognitive/neural representations (or, to use Marr's terminology, algorithms and computations). The properties of both the processes and the representations are necessary for a comprehensive cognitive science of language, and both are equally important aspects of what the brain does -- i.e., both representations and processes must ultimately be implemented in the brain.

My impression is that "psycholinguists" focus on mapping their data (reaction times, ERPs, etc) to processes, but often leave their representational assumptions unspecified. "Linguists" focus on mapping acceptability judgments to fully specified representations, but rarely discuss the processes that would be necessary to construct those representations. In short, the two fields are studying two different aspects of language: the representations and the processes. Ultimately, we all want to integrate these two aspects into a unified theory, but these normative proclamations that one type of data is superior to another get in the way of that by encouraging "psycholinguists" to ignore the representational properties uncovered by "linguists". Data is only valuable if it addresses your theoretical question, and the two fields have different theoretical questions.

Here's my take on the whole thing: "Linguists" believe that the best way to uncover the representational properties of sentences is to carefully compare the acceptability of minimally different pairs of sentences. The debate between informal and formal experiments is a red herring. The real question is whether "psycholinguists" know of a better way to get at the properties of representations. If they do, then they should show the "linguists". But to date, "psycholinguists" have rarely been interested in representational questions. And to be clear, reaction times and ERPs, which are interpreted as correlates of processes, are not going to cut it, since their interpretation is based on (often unstated) assumptions about what the representation is. If there is no better way than judgments, then instead of arguing about stats and experiments, we should be looking for ways to integrate representational and processing theories, which is no easy task (e.g., another thoughtful piece by Colin Phillips: Derivational order in syntax).

Matt Goldrick said...

Greg: Matt, can you summarize the argument/observations you make in your forthcoming paper?

I focus on wordlikeness/well-formedness judgments in phonology (the equivalent of grammaticality judgments), for example: "Is /ngah/ a possible word of English?" There are three issues I identify with such work.

1) Quantitative analysis of behavior. This relates to some of the points discussed above. For me, the critical issue is that quantitative analyses have permitted more nuanced discrimination of degrees of well-formedness (not simply whether a form is possible or impossible).

2) Dynamic weighting of multiple factors in judgments. Recent work has shown that well-formedness judgments are sensitive to multiple factors. Specifically, similarity to existing lexical items appears to exert an influence on judgments that is independent of purely "structural" well-formedness. For example, ratings of the well-formedness of the nonword /hing/ are independently influenced by the relative probability of /h/ in onset as well as the number of existing lexical items that have /h/ in onset. Critically, there is evidence that the relative weighting of these factors shifts across processing contexts (Shademan, 2006, 2007; Vitevitch, 2003). This is not a surprising result--in many other domains there is evidence for dynamic re-weighting of multiple factors in judgments (Vickers & Lee, 1998; see Ratcliff, Gomez, & McKoon, 2004 for a specific review of dynamic re-weighting in a language-related task). Because of this dynamic re-weighting, attributing variance in judgments to any particular factor (e.g., well-formedness) becomes a much more difficult problem (see the toy sketch after this list).

3) The interface of judgment processes with other cognitive processes. Wordlikeness judgments--like any other behavior--require the coordination of multiple cognitive processes. One must perceive the acoustic structure of the form, assign a phonological parse to it, etc. The issue is that these interactions are largely unspecified--making it unclear which process(es) give rise to variation in judgments. For example, speakers often perceive an illusory vowel within consonant clusters that do not occur in their native language (e.g., /mdef/ is heard as having two syllables; Berent, Steriade, Lennertz, & Vaknin, 2007). There has been a heated debate as to whether this reflects a mis-parse of the phonological structure (Berent et al. 2007) or misperception within more basic acoustic processes (Peperkamp, 2007; see Berent & Lennertz, 2007, as well as many subsequent articles by Berent and colleagues). In this particular case, researchers are sensitive to these alternative accounts and have designed a number of tests to distinguish them; but more often than not these possibilities are not even considered in studies of well-formedness judgments.
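
To make point 2 concrete, here is a toy sketch of the dynamic-weighting idea. This is not a published model--the function, the factor values, and the weights are all invented for illustration: the same nonword receives different predicted ratings once the task context shifts the relative weighting of a "structural" factor and a lexical-similarity factor.

```python
# Toy illustration (not a published model): a well-formedness judgment as a
# weighted combination of two factors, with context-dependent weights.
# All names and numbers below are hypothetical.

def judgment(phonotactic_prob, neighborhood_density, w_structural, w_lexical):
    """Predicted rating as a weighted sum of a 'structural' factor
    (phonotactic probability) and a lexical-similarity factor
    (neighborhood density)."""
    return w_structural * phonotactic_prob + w_lexical * neighborhood_density

# Invented factor values for the nonword /hing/.
hing = dict(phonotactic_prob=0.7, neighborhood_density=0.5)

# The same item gets different ratings if the task context re-weights the
# factors (cf. Shademan 2006, 2007; Vitevitch 2003) -- so variance in the
# rating cannot be attributed to "well-formedness" alone.
print(judgment(**hing, w_structural=0.8, w_lexical=0.2))  # context A: 0.66
print(judgment(**hing, w_structural=0.3, w_lexical=0.7))  # context B: 0.56
```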

Jon: I would say that it is important to distinguish cognitive/neural mechanisms (processes) from cognitive/neural representations (or to use Marr's terminology, algorithms and computations).

Two points:
1. Following Smolensky (2006), I would characterize the more abstract computational level of description as pertaining not solely to representations but to the structure of functions or relations. At this level of description, one specifies not just the structure of mental objects (various types of symbol structures) but also the relationships between those objects.
2. I do not believe there is an a priori way to distinguish process from representation. The basic point is that information is only represented by a system if the system actually uses that information in computation (Gallistel, 1990, Chapter 2). Suppose the presence of a noun phrase in a sentence is encoded by a mental representation. If no mental process makes use of this information, then the information is functionally absent. Process and content are inherently intertwined.

Matt Goldrick said...

(references from previous)
Berent, Iris, and Tracy Lennertz (2007). What we know about what we have never heard: Beyond Phonetics. Reply to Peperkamp. Cognition 104: 638-643.

Berent, Iris, Donca Steriade, Tracy Lennertz, and Vered Vaknin (2007). What we know about what we have never heard: Evidence from perceptual illusions. Cognition 104: 591-630.

Gallistel, C. R. (1990). The organization of learning. Cambridge, MA: MIT Press.

Peperkamp, Sharon (2007). Do we have innate knowledge about phonological markedness? Comments on Berent, Steriade, Lennertz, and Vaknin. Cognition 104: 631-637.

Ratcliff, Roger, Pablo Gomez, and Gail McKoon (2004). A diffusion model account of the lexical decision task. Psychological Review 111: 159-182.

Shademan, Shabnam (2006). Is phonotactic knowledge grammatical knowledge? In Donald Baumer, David Montero, and Michael Scanlon (eds.), Proceedings of the 25th West Coast Conference on Formal Linguistics, 371-379. Somerville, MA: Cascadilla Press.

Shademan, Shabnam (2007). Grammar and Analogy in Phonotactic Well-Formedness. Unpublished doctoral dissertation, University of California, Los Angeles.

Smolensky, Paul (2006). Computational levels and integrated connectionist/symbolic explanation. In Paul Smolensky and Géraldine Legendre, The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar (Vol. 2, Linguistic and Philosophical Implications), 503-592. Cambridge, MA: MIT Press.

Vitevitch, Michael S. (2003). The influence of sublexical and lexical representations in the processing of spoken words in English. Clinical Linguistics & Phonetics 17: 487-499.