Monday, June 14, 2010

Weak quantitative standards in linguistics research? The Debate between Gibson/Fedorenko & Sprouse/Almeida

The following is an exchange regarding the nature of linguistic data between Ted Gibson and Evelina Fedorenko in one corner and Jon Sprouse and Diogo Almeida in the other. The exchange was sparked by (i) the Gibson & Fedorenko TICS paper and (ii) the unpublished response by Jon Sprouse and Diogo Almeida to that paper. A preview commentary on the issue is provided by David here. The exchange below took place over several days via email. Those involved have allowed me to post it here. I've deleted the previous separate posts that contained bits of this debate. In a few days I'll post a poll to see what people think. Enjoy...

TED:
You note that one particular comparison from the linguistics / syntax literature gives a stronger effect than one particular comparison from the psycholinguistics literature (Gibson & Thomas, 1999). From this observation you conclude that the effects that linguists are interested in are larger than the effects that psycholinguists are interested in.

This is fallacious reasoning. You sampled one example from each of two literatures, and concluded that the literatures are interested in different effect sizes. You need to do a large random sample from each to make the conclusion you make.

Note that it is a tautology to show that you can find two comparisons with different effect sizes: this on its own doesn't demonstrate anything. I can show you the opposite effect-size comparison by selecting different comparisons. For example:

"Syntax" comparison: 2wh vs. 3wh, where 2wh is standardly assumed to be worse than 3wh.

1. a. 2wh: Peter was trying to remember what who carried.
   b. 3wh: Peter was trying to remember what who carried when.

"Psycholinguistics", where center-embedded is standardly assumed to be worse than right-branching:
2. a. Center-embedded: The ancient manuscript that the graduate student who the new card catalog had confused a great deal was studying in the library was missing a page.
   b. Right-branching: The new card catalog had confused the graduate student a great deal who was studying the ancient manuscript in the library which was missing a page.

Clearly the effect size in the comparison in (2) is going to be much larger than in (1). I don't think we want to draw the conclusion opposite to the one you made in your paper.

Indeed, the 3wh vs. 2wh comparison (a "syntax" question) is such a small effect that it is not even measurable (which is the point of Clifton et al. (2006) and Fedorenko & Gibson (2010)). This is contrary to what has been assumed in the syntax literature (and was the actual point of our TiCS letter).



JON:
Hi Ted,

Thanks for the comments. It is interesting to note that your comments apply equally well to your own TiCS letter and the longer manuscript that it advertises. I am sure there is a more diplomatic way to do this, but in the interest of brevity, I am going to use your own words to make the point:

You note that one particular comparison from the linguistics/syntax literature is difficult to replicate with naive subjects in a survey experiment. From this observation you conclude that the effects that linguists report are suspicious and that the resulting theory is unsound.

This is fallacious reasoning. You sampled one example from a paper that has 70-odd data points in it (Kayne 1983), and a literature that has thousands, and concluded that this one replication failure means the literature is suspect. You need to do a large random sample from the literature to make the conclusion that you make.

Note that it is a tautology to show that you can find replication failures: this on its own doesn't demonstrate anything. I can show you many such replication failures in all domains of cognitive science. These are never interpreted as a death-knell for a theory or a methodology, so why is this one replication failure such a big problem for linguistic theory and linguistic methodology?

For the record, the point of our letter was to be constructive -- we were trying to figure out how it is that you could claim so much from a single replication failure, especially given that several researchers have reported running hundreds of constructions in quantitative experiments (e.g., Sam Featherston, Colin Phillips) that corroborate linguists' informal judgments. I don't really care if the effect sizes of the two literatures are always of a different magnitude or not (indeed, it is theories, not effect sizes, that determine the importance of an effect). What I do care about is your claim that a single replication failure is more important than the hundreds of (unpublishable!) replications that we've found. Linguists are serious people, and we take these empirical questions seriously... but we haven't found any evidence of a serious problem with linguistic data.

My guess is that you, like many of us on this email, think that there are some challenges facing linguistic theory, especially with how to integrate it with real-time processing theories. Unfortunately, these problems are not the result of bad data (which would be an easy fix). The problem is that the science is hard: complex representational theories are difficult to integrate with real-time processing theories -- and that can't be resolved by attaching numbers to judgments.

-jon


TED & EV:
[Sprouse quote:]"This is fallacious reasoning. You sampled one example from a paper that has 70-odd data points in it (Kayne 1983), and a literature that has thousands, and concluded that this one replication failure means the literature is suspect. You need to do a large random sample from the literature to make the conclusion that you make.

Note that it is a tautology to show that you can find replication failures: this on its own doesn't demonstrate anything. I can show you many such replication failures in all domains of cognitive science. These are never interpreted as a death-knell for a theory or a methodology, so why is this one replication failure such a big problem for linguistic theory and linguistic methodology?"[End Sprouse quote]


We think it is misleading to refer to quantitative evaluations of claims from the syntax literature as "replications" or "replication failures". A replication presupposes the existence of a prior quantitative evaluation (an experiment, a corpus result, etc., i.e. data evaluating the theoretical claim). The claims from the syntax/semantics literature have mostly not been evaluated in a rigorous way. So it's misleading to talk about a "replication failure" when there was only an observation to start with.

In the cases that we allude to in the TiCS letter and in another, longer paper ("The need for quantitative methods in syntax", currently in submission at another journal; an in-revision version is available here), the quantitative experiments that have been performed don't support the claimed pattern in the original papers. The concern is that there are probably many more such cases in the literature, which makes interpreting the theoretical claims difficult.

Second, we didn't find just one example. We have documented several, most of which others have observed as well. Please see our longer paper. We are sure that there are many more.

In any case, we are not arguing that all or most judgments from the literature are incorrect. Suppose that 90% of the judgments are correct or are on the right track. The problem is in knowing which 90% to build theories on. If there are 70 relevant examples in a paper (as in the example paper that was suggested by Sprouse), that means that approximately 63 are correct. But which 63? 70 choose 63 is about 1.2 billion. That's a lot of potentially different theories. To be rigorous, why not do the experiments? As we observe in the longer article, it's not hard to do the experiments anymore, especially in English, with the arrival of Amazon's Mechanical Turk. (In addition, many new large corpora are now available – from different languages – that can be used to evaluate hypotheses about syntax and semantics.)
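
As an aside, the combinatorial arithmetic above is easy to check. Here is a two-line sketch in Python (nothing beyond the standard library is assumed; math.comb requires Python 3.8 or later):

    import math

    # Number of ways of choosing which 63 of the 70 reported judgments are the correct ones
    print(math.comb(70, 63))  # 1198774720, i.e. roughly 1.2 billion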

[Sprouse quote:]"What I do care about is your claim that a single replication failure is more important than the hundreds of (unpublishable!) replications that we've found. Linguists are serious people, and we take these empirical questions seriously... but we haven't found any evidence of a serious problem with linguistic data."


Aside from the issue with the use of the term “replication” in this context (as pointed out above), our experience in evaluating claims from the syntax / semantics literature is different from Sprouse's. When we run quantitative experiments evaluating claims from the syntax / semantics literature, we don't typically find exactly the patterns of judgments of the researchers who first made the claim. The results of experiments are almost always more subtle, such that we gain much information (such as effect sizes across different comparisons, relative patterning of different constructions, variability in the population, etc.) from doing the quantitative experiment.

[Sprouse quote:]"My guess is that you, like many of us on this email, think that there are some challenges facing linguistic theory, especially with how to integrate it with real-time processing theories. Unfortunately, these problems are not the result of bad data (which would be an easy fix). The problem is that the science is hard: complex representational theories are difficult to integrate with real-time processing theories -- and that can't be resolved by attaching numbers to judgments."


We never claimed that doing quantitative experiments would solve every interesting linguistic question. But we do think that it is a prerequisite, and that doing quantitative experiments will solve some problems. So we don't see the downside of more rigor in these fields.

Egregiously yours,

Ted Gibson & Ev Fedorenko


DIOGO:
Hi Ted, Hi Ev (and hi everyone else)

Thanks for the comments on our unpublished letter, and for pointing us to your longer article under review.

Jon has already touched upon most of the issues I was going to bring up. However, there is still at least one important point that I would like to raise here in response to some of the comments in your last e-mail, which are also made in your TICS letter and the longer manuscript you provided us with. Namely, I think you profoundly mischaracterize the way linguists work when you say things like:

"the prevalent method in these fields involves evaluating a single sentence/meaning pair, typically an acceptability judgment performed by just the author of the paper, possibly supplemented by an informal poll of colleagues." (from TICS letter)

"...syntax research, where there is typically a single experimental trial." (from Manuscript, p. 7)

"The claims from the syntax/semantics literature have mostly not been evaluated in a rigorous way. So it's misleading to talk about a "replication failure" when there was only an observation to start with." (from last e-mail)


Nothing could be further from the truth. It is simply inaccurate to claim that linguists have lower methodological standards than other cognitive scientists simply because linguists do not routinely run formal acceptability judgments. Linguists test their theories in the exact same way other scientists do: By running experiments for which they (a) carefully construct relevant materials, (b) collect and examine the resulting data, (c) seek systematic replication and (d) present the results to the scrutiny of their peers. There is no "extra" rigour that comes from being able to run inferential statistics beyond what you get from thoughtfully evaluating theories, and systematically investigating the data that bear upon them (in the case of linguistics, through repeated single subject informal experiments that any native speaker can run).

When linguists evaluate contrasts between two (or more) sentence types, they normally run several different examples in their heads, they look for potential confounds, and they consult other colleagues (and sometimes naive participants), who evaluate the sentence types in the same fashion. The fact that this whole set of procedures (aka, experiments) is conducted informally does not mean it is not conducted carefully and systematically. I cannot stress this enough: The notion that (a) linguists routinely test their theories with only one specific pair of tokens at a time, (b) proceed to publish papers based on the evaluation of this single data point, and (c) that the results of this single subject/token experiment receive no serious or systematic scrutiny by other linguists, is entirely without basis in reality (e.g., see Marantz 2005 and Culicover and Jackendoff's response to your TICS letter).

The only difference between linguists and other scientists is that, in order to evaluate the internal validity of their experiments (and again, they are experiments), linguists tend not to rely on inferential statistical methods. One of the possible reasons for that is that linguists normally look at contrasts that are fairly large, and it does not take many trials to be confident about one's own intuition in these cases. Incidentally, in these cases it does not take many trials for the stats to concur either: if the linguist were running a one-tailed sign test, 5 trials all going in the predicted direction would already yield a statistically significant result at the .05 level (linguists routinely evaluate more tokens than that, btw).
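
To make the sign-test arithmetic concrete, here is a minimal sketch in Python. It assumes scipy (version 1.7 or later, for binomtest) and uses the hypothetical five trials just mentioned:

    from scipy.stats import binomtest

    # A sign test is simply a binomial test against chance (p = 0.5).
    # Five informal trials, all going in the predicted direction:
    result = binomtest(k=5, n=5, p=0.5, alternative="greater")
    print(result.pvalue)  # 0.03125, below the conventional .05 threshold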

But what happens in the case where the hypothesized contrast is not that obvious? In these cases, linguists would do what any scientist does when confronted with unclear results: they would try to replicate the informal experiment (eg, by asking colleagues/naive subjects to evaluate instances of contrasts of the relevant type), or would seek alternative ways of testing the same question (eg, by running a formal acceptability judgment survey). It is understandable why linguists have historically preferred to take the first course of action: Replicating informal experiments is faster and cheaper, and systematic replication (the gold standard of scientific experimentation) provides the basis for the external validity of the results.

Sincerely,
Diogo

ps: I also think you are overstating the case that formal acceptability judgment experiments routinely reach different conclusions from established contrasts in the linguistic literature and you are overinterpreting the implications of the handful of replication failures that you cited. I won't go into detail here in the interest of brevity, but I would be happy to share my thoughts in a future e-mail if you are interested.


TED:
Dear Diogo:

Thanks for your thoughtful response to my earlier emails. Let me jump right to the point:

You said:
Linguists test their theories in the exact same way other scientists do: By running experiments for which they (a) carefully construct relevant materials, (b) collect and examine the resulting data, (c) seek systematic replication and (d) present the results to the scrutiny of their peers. There is no "extra" rigour that comes from being able to run inferential statistics beyond what you get from thoughtfully evaluating theories, and systematically investigating the data that bear upon them (in the case of linguistics, through repeated single subject informal experiments that any native speaker can run).

... The fact that this whole set of procedures (aka, experiments) is conducted informally does not mean it is not conducted carefully and systematically.


I am sorry to be so blunt, but this is just incorrect. There *is* extra rigor from (a) constructing multiple examples of the targeted phenomena, which are normed for nuisance factors; and (b) evaluating the materials on a naive experimental population.

Both points are very important, but the second point is one that I have found many language researchers underestimate. The problem with not evaluating your hypotheses on a naive population (with distractor materials etc.) is that there are unconscious cognitive biases on the part of the researchers and their friends which make their judgments on the materials close to worthless. (That sounds harsh, but unfortunately, it's true.) I know this first-hand. As we document in the longer paper (see pp. 16-20), if you read my PhD thesis, there are many cases of judgments that turned out not to be correct, probably because of cognitive biases on my part and on the part of the people that I asked. We provided one example of such an incorrect judgment from my thesis: it was argued that doubly nested relative clause structures are more complicated to understand when they modify a subject, as in (1), than when they modify an object, as in (2) (Gibson, 1991, examples (342b) and (351b) from pp. 145-147):

(1) The man that the woman that the dog bit likes eats fish.
(2) I saw the man that the woman that the dog bit likes.

That is, (1) was argued to be harder to process than (2). In doing this research, I asked lots of people and they all agreed. And I constructed various similar versions. The people that I asked pretty much uniformly agreed that (1) was worse than (2).

But if you do the experiment, with naive subjects, and lots of fillers etc, it turns out that there is no such effect. I ran that comparison about 5 times, and never found any difference. Both are rated as not very acceptable (relative to lots of other things) but there was never a difference in the predicted direction between these two structures.

The problem here was very likely a cognitive bias. I had a theory which predicted the difference, and all my informants had a similar theory (it's basically that more nesting leads to harder processing, as suggested by Miller & Chomsky (1963) and Chomsky & Miller (1963)). So we used that theory to get the judgment predicted by that theory.

If you read the literature on cognitive biases, this is a standard effect. To quote from our longer paper:

"In Evans Barston & Pollard's experiments (1983; cf. other kinds of confirmation bias, such as first demonstrated by Wason, 1960; see Nickerson, 1998, for an overview of many similar kinds of cognitive biases) experiments, people were asked to judge the acceptability of a logical argument. Although the experimental participants were sometimes able to use logical operations in judging the acceptability of the arguments that were presented, they were most affected by their knowledge of the plausibility of the consequents of the arguments in the real world, independent of the soundness of the argumentation. They thus often made judgment errors, because they were unconsciously biased by the real-world likelihood of events.

More generally, when people are asked to judge the acceptability of a linguistic example or argument, they seem unable to ignore potentially relevant information sources, such as world knowledge, or theoretical hypotheses. For example, understanding a theoretical hypothesis whereby structures with more open linguistic dependencies are more complex than those with fewer may lead an experimental participant to judge examples with more open dependencies as more complex ..." as in the examples discussed above.

One of the main points of our papers is that it's really not enough to just be careful and think hard. That is just not rigorous enough to avoid the effects of unconscious cognitive biases. In order to be rigorous, you really need some quantitative evaluation that comes from the analysis of naive subjects, either a corpus analysis or a controlled experiment.

You said:
The only difference between linguists and other scientists is that, in order to evaluate the internal validity of their experiments (and again, they are experiments), linguists tend not to rely on inferential statistical methods. One of the possible reasons for that is that linguists normally look at contrasts that are fairly large, and it does not take many trials to be confident about one's own intuition in these cases. Incidentally, in these cases it does not take many trials for the stats to concur either: if the linguist were running a one-tailed sign test, 5 trials all going in the predicted direction would already yield a statistically significant result at the .05 level (linguists routinely evaluate more tokens than that, btw).


The point that I made in response to Jon's earlier comments along these lines still holds. If you want to claim that linguists tend to examine effect sizes that are larger than the effect sizes that psycholinguists examine, then you need to show this. You can't just state it and expect others to accept your hypothesis. Personally, I highly doubt that it's true. I have read hundreds of syntax / semantics papers, and in most of them, there are lots of questionable judgments, which are probably comparisons with small effect sizes, or non-effects.

Ted Gibson


DIOGO & JON:
Dear Ted,

Thank you for your response. This is a joint reply by Jon and me.

[Gibson quote:] Let me jump right to the point: "I am sorry to be so blunt, but this is just incorrect. There *is* extra rigor from (a) constructing multiple examples of the targeted phenomena, which are normed for nuisance factors;"


We totally agree with the need for multiple items. In fact, we just told you that linguists routinely evaluate several instances of any proposed sentence type contrast. On this point, there is simply no difference between what linguists and psycholinguists do (see Marantz 2005).

What we completely disagree with is the priority you assign to the results from naive participants. There is no particular reason to assign your average pool of 30-odd college-aged students the role of arbiter of truth. Just finding a difference between what a researcher thinks is going to happen and the experimental results from a pool of naive subjects is not particularly informative, especially if the results are of the "failure to replicate" type. There are several reasons why one might get an unexpected null result that have nothing to do with "cognitive bias":

(1) The experiment is underpowered

For instance, in Gibson and Thomas (1999), you claim that, contrary to the initial motivating intuition, you did not find that (b) was rated better than (a):

a. *The ancient manuscript that the graduate student who the new card catalog had confused a great deal was studying in the library was missing a page.

b. ?The ancient manuscript that the graduate student who the new card catalog had confused a great deal was missing a page.

In our unpublished letter (see figure), we show that this is most likely an issue of power: the effect is definitely there, it is just small and requires a large sample to have a moderate chance of being detected (a rough power sketch follows below, after point (2)).

(2) There are problems with the experimental design, such as:

(i) The experiment uses a task that is not necessarily sensitive to the manipulation

For instance, why would you necessarily think that acceptability tasks should be equally sensitive to all processing difficulties? Acceptability judgments might simply not be the right dependent measure to use here.

(ii) The experiment uses a design or task that is not optimal to reveal the effect of interest

Sprouse & Cunningham (under review, p. 23, figure 8 sent attached) have data showing that the contrast between (a) and (b) above can be detected with a sample half the size of the one Gibson & Thomas (1999) used when one uses a magnitude estimation task with lower acceptability reference sentences (but not at all when higher acceptability reference sentences are used).
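
Here is the rough power sketch promised under point (1) above. It is only an illustration: the effect size is a hypothetical "small" standardized effect (Cohen's d = 0.2), not a value estimated from our reanalysis, and it assumes a one-sided paired/one-sample t-test computed with statsmodels:

    from statsmodels.stats.power import TTestPower

    # How many participants would be needed to detect a small within-subjects
    # effect (d = 0.2) with 80% power at alpha = .05, one-sided?
    n = TTestPower().solve_power(effect_size=0.2, alpha=0.05, power=0.8,
                                 alternative="larger")
    print(round(n))  # roughly 156, far larger than a typical informal sample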

None of these explanations invoke cognitive biases. We don't necessarily disagree that cognitive biases are a potential problem. We just think that before you invoke cognitive bias as an explanation, (1) you need positive evidence, and (2) a failure to replicate the results from an informal experiment in a formal experiment is not positive evidence. In fact, had you assigned the kind of priority to formal experimental results with naive participants that you seem to advocate in your previous e-mail, you would have been misled by the Gibson & Thomas (1999) data, and would have concluded that the contrast is not real. In fact, you yourself explained the result by appealing to the offline nature of the test, and not to cognitive biases, so why should cognitive bias be the null hypothesis for linguistics?

Furthermore, we can also find exactly the opposite pattern: a significant experimental effect that confirms the initial expectations but is nonetheless taken by the experimenter as evidence against them. Take the Wasow & Arnold (2005) paper, for example. In your longer manuscript you say this:

"Wasow & Arnold (2005) discuss three additional cases: the first having to do with extraction from datives; the second having to do with the syntactic flexibility of idioms; and the third having to do with acceptability of certain heavy-NP shift items that were discussed in Chomsky (1955 / 1975). The first of these seems particularly relevant to Phillips’ claim. In this example of faulty judgment data, Filmore (1965) states that sentences like (1), in which the first object in a double object construction is questioned, are ungrammatical:

(1) Who did you give this book?

Langendoen, Kalish-Landon & Dore (1973) tested this hypothesis in two experiments, and found many participants (“at least one-fifth”) who accepted these kinds of sentences as completely grammatical. Wasow & Arnold note that this result has had little impact on the syntax literature." (pp. 13-4)

And it shouldn't. If only one fifth of the sample in Langendoen et al. (1973) failed to show the expected contrast, the results are not problematic at all. In fact, they are actually highly significant, and overwhelmingly support the original proposal: a simple one-tailed sign test here would give you a p-value of 1.752e-09 and a 95% CI of (0.7, 1) for the probability of finding the result in the predicted direction. Let me stress this again: what the experiment is actually telling you is that the results support the linguist's informal experiments, not the contrary, as Wasow & Arnold seem to think.
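
For concreteness, here is a minimal sketch of that sign-test computation in Python (again assuming scipy). The exact counts from Langendoen et al. (1973) are not reproduced in this exchange, so the numbers below, 80 of 100 respondents patterning in the predicted direction, are purely illustrative:

    from scipy.stats import binomtest

    # Hypothetical counts: four fifths of an assumed sample of 100 go in the
    # direction the informal judgment predicts; test against chance (p = 0.5).
    result = binomtest(k=80, n=100, p=0.5, alternative="greater")
    print(result.pvalue)           # about 6e-10 for these illustrative counts
    print(result.proportion_ci())  # 95% CI for the proportion: lower bound well above 0.5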

The same is true of Wasow & Arnold's (2005) own acceptability experiment. They decided to test an intuition from Chomsky (1955) about the position of verb particles interacting with the complexity of the object NP. They tested the following paradigm, where the object in (a-b) is thought to be more complex than the object in (c-d).

a. The children took everything we said in. (1.8)
b. The children took in everything we said. (3.3)
c. The children took all our instructions in. (2.8)
d. The children took in all our instructions. (3.4)

According to Chomsky, (c) should sound more natural than (a), and (b) and (d) should be equally acceptable. And that is precisely what Wasow & Arnold (2005) found (see the mean acceptability on a 4-point scale next to each condition above), with highly significant results. These results were also replicated in another of their conditions, omitted here for brevity's sake. And yet, Wasow & Arnold (2005) claim the following:

"there was considerable variation in the responses. In particular, although the mean score for split examples with complex NPs was far lower than any of the others [ie, The results support Chomsky's intuitions], 17% of the responses to such sentences were scores of 3 or 4. That is, about one-sixth of the time, participants judged such examples to be no worse than awkward."

Again, the results were highly significant and support rather than undermine the original intuition from the linguist, and yet Wasow & Arnold (and, given the quote from your article, you too) seem to conclude the opposite from the experimental data presented.

So where is this extra rigor that one gets by simply running formal acceptability judgments? It just seems to us that running a formal acceptability experiment with naive participants does not protect one from being misled by one's results any more than informal experiments do.

Sincerely,
Diogo & Jon


TED & EV:
Dear Diogo & Jon:

The point is *not* that quantitative evidence will solve all questions in language research. The point is just that having quantitative data is a necessary but not sufficient condition. That's all.

Without the quantitative evidence you just have a researcher's potentially biased judgment. I don't think that that's good enough. It's not very hard to do an experiment to evaluate one's research question, so one should do the experiment. One is *never* worse off after doing the experiment. You might find that the issue is complex and harder to address than you thought before doing the experiment. But even that is useful information.

I don't have anything more to say on this for now. Some day, I would be happy to debate you in a public forum if you like.

Best wishes,

Ted (& Ev)


DIOGO:
Dear Ted,

Let me just add a few remarks to your last e-mail, and then I don't think I have anything more to say on the matter either. Thanks for engaging with us in this discussion.

[Gibson quote:] "The point is *not* that quantitative evidence will solve all questions in language research. The point is just that having quantitative data is a necessary but not sufficient condition. That's all."


And the point Jon and I are trying to make is that having quantitative data for linguistic research, while potentially useful, is not always necessary. The implication of your claim is also far from uncontroversial: it implies that linguistics, where quantitative methods are not widely used, fails to live up to a "necessary" scientific standard. We think this is both false and misguided.

[Gibson quote:] "Without the quantitative evidence you just have a researcher's potentially biased judgment. I don't think that that's good enough."


Here's the thing: a published judgment contrast in the linguistic literature, especially if it is a theoretically important one, has been replicated hundreds of times in informal experiments. When the contrast is uncontroversial, it will keep being replicated nicely and will attract no further attention. However, when the contrast is a little shaky, linguists are keenly aware of it, and weigh the theory it supports (or rejects) accordingly. Finally, when the contrast is not really replicable, it is actually challenged, because that is the one thing linguists do: they try to test their theories, and if some part of the theory is empirically weak, it will be challenged. I highly doubt that cognitive biases could play any significant role in this systematic replication process.

Now, here's where this methodology is potentially problematic: when a judgment contrast comes from a language for which there are very few professional linguists who are also native speakers, and for which access to naive native speakers is limited. In this case, it is possible that a published judgment contrast will go unreplicated, and if faulty, could lead to unsound conclusions. In these cases, I totally agree that having quantitative data is probably necessary. But note that the problem here is not the lack of quantitative data to begin with; the problem is the lack of systematic replication. Quantitative data only serves as a way around this problem.

[Gibson quote:] "It's not very hard to do an experiment to evaluate one's research question, so one should do the experiment."


The point is that linguists DO the experiment. They just do it informally.

[Gibson quote:] "One is *never* worse off after doing experiment. You might find that the issue is complex and harder to address than you thought before doing the experiment. But even that is useful information."


The question is not whether or not one is worse off after doing the formal experiment. The question is whether or not one is necessarily better off.

There is a very clear cost in running a formal experiment versus an informal experiment. Formal experiments with naive participants take time (IRB approval, advertising on campus, having subjects come to the lab and take the survey, or setting up a web interface so they can do it from home, etc.), and potentially money (if you don't have a volunteer subject pool, or if you use things like Amazon's Mechanical Turk). If you want linguists to adopt this mode of inquiry as "necessary", you have to show them that they would be better off doing so. That is the part where it is really not clear that they would be.

You can try to show this in two ways: (1) show linguists that they get interesting, previously unavailable data, or (2) show them that they are being misled by their informal data-gathering methods and that running the formal experiment really does fix that. Because otherwise, what is the point? If linguists just confirm over and over again that they keep getting the same results running naive participants as they get with their informal methods (and this is what linguists like Jon, Sam Featherston, Colin Phillips and others keep telling you happens), then why should they bother going through a much slower and much more costly method that does not give them any more information than their quick, informal, but highly replicable method does?

Best wishes,
Diogo

10 comments:

Shuichi Yatabe said...

Diogo's position (as well as Jon's perhaps) seems a little unstable to me.

On the one hand, he says "But what happens in the case where the hypothesized contrast is not that obvious? In these cases, linguists would do what any scientist does when confronted with unclear results: they would try to replicate the informal experiment (eg, by asking colleagues/naive subjects to evaluate instances of contrasts of the relevant type), or would seek alternative ways of testing the same question (eg, by running a formal acceptability judgment survey)."

But on the other hand, Diogo and Jon say "So where is this extra rigor that one gets by simply running formal acceptability judgments?", suggesting that there is no such extra rigor to be gained.

These two statements seem (to me) to contradict each other.

Shuichi Yatabe

Diogo said...

Hi Shuichi,

I don't think there is any contradiction there. In cases where the results from an informal acceptability judgment are not so clear, linguists can try (i) replication (which they do), or (ii) a different way of getting at the same question, which can include looking at different data that bears on the same question (which they also do) or looking at the same contrast from a different angle (e.g., doing a formal experiment, which they sometimes also do). It is not clear that any of these strategies is inherently better than the others. It's an empirical question, and so far, there is little evidence to claim that formal acceptability judgment experiments are the only way to go.

Shuichi Yatabe said...

Diogo,

Thanks for the explanation.

In the original context of the second quotation above, you're comparing Wasow & Arnold's formal experimentation with Chomsky's reliance on his own intuition, so I took you to be saying that doing formal experiments doesn't provide any extra rigor above and beyond what can be provided by the "method" of relying on one's own intuition alone. I'm glad to learn that that wasn't your intention.

Brian Barton said...

(Part 1 of 2)

As I am not involved in linguistic research, this debate has little impact on me, personally, which grants me the luxury of not having any stake. That allows me to think of this whole thing abstractly, and so I will illustrate my thinking in that abstract light.

The argument seems to boil down to this: One side says that, in order to achieve scientifically valid results, one must take all reasonable steps to ensure that all variables except the experimental one are held constant. This is not always achievable, yes, but it is a goal to strive for, one that allows other people to examine and replicate findings with as few potential alternative explanations as possible.

The arguments in favor here are generally accepted by all scientists of all stripes. Having just TA’d for a psychology research methods class, I can say this is exactly the sort of thing I try to drill into my students’ heads; I have paraphrased this paragraph many, many times.

The other side agrees that this is indeed something to strive for, but holds that “informal” experiments, meaning a very small sample composed of the investigator and perhaps a few colleagues, are a close enough approximation to ruling out all possible explanatory variables to get the job done in linguistics. In short, linguistics is a special case.

The arguments in favor here have two basic components: one, that there is little additional benefit to performing “formal” experiments over “informal” ones in a large number of cases, and two, that the additional costs of “formal” experiments outweigh the aforementioned negligible benefits. The benefits are argued to matter mainly when the effects are small, and the costs are time and money (seeking IRB approval, paying subjects, paying for databases, paying for testing equipment, etc.).

I believe those are fair characterizations of the positions, stripped of details of individual papers or effects or personal feelings of assault on one’s own methodology. It is this abstract scheme of things which I think could shed some light here. In the end, this debate boils down to how certain you want to be. Are thought experiments enough? Is narrowing the range of possible conclusions (without, e.g., ruling out experimenter bias) enough? Or is it necessary to do everything reasonably possible to rule out alternative explanations?

Brian Barton said...

(Part 2 of 2)

The question, for me, becomes, “what is ‘every reasonable measure?’” This is where I believe I differ from Diogo, and perhaps Jon. I don’t think that gaining IRB approval, getting naïve subjects (paid or not), paying for testing equipment, and so forth poses a high cost. To put it in perspective, an acceptable rate for paying a subject is $10 an hour for a behavioral experiment. And if linguistic experiments generally have large effects that require few trials, an hour should suffice for most experimental procedures, as should a small number of subjects, let’s say 10. To run 10 subjects, 4 computers capable of running, say, Matlab, available for $500 apiece plus Matlab licenses, would be reasonable. That’s a start-up cost on the order of $2,500. The time cost will vary based on a lot of factors, so I won’t try to estimate it, but I think it would be reasonable to conservatively assume that 3 months would be enough time to program up an experiment, purchase computers, and get it past the IRB. So, 3 months and $2,500 to start up, plus $100 an experiment, at 10 participants with one hour apiece. Even if my numbers are off by a large margin, that is a far cry from being very expensive or incredibly time consuming, and I’m only considering the extra time required for a “formal” experiment over an “informal” one. That’s less than a lot of campus-wide awards that are given out each year at many universities.

Let’s compare that to work that requires a big piece of equipment, like a Magnetic Resonance Imaging facility, in the case of a lab coming to a campus that already has a scanner (so we don’t consider the largely center-dependent costs of purchasing the scanner, maintaining it, etc.—just the cost to the lab to use it). Here, our scanner costs roughly $500 an hour to use, plus all the same costs I listed in the previous example (actually higher, because better computers are required to crunch data than to run experiments, but let’s ignore that for now). Those costs are substantial and they add up fast, and they have a lot of effects on the research (you’d better have your design down and know what effects to expect: no “let’s see what happens if I try this”), but you live with it because of the advantages.

Full disclosure: I work in a functional MRI laboratory, but I previously worked in a behavioral laboratory, so I’ve dealt with these two levels of cost, which almost certainly is why I chose those examples. That said, I personally feel that the costs of performing “formal” experiments are not much more than “informal” ones, and being able to rule out the additional potentially confounding variable of experimenter bias is worth that cost (aside: the important point here is not whether or not I am biased, which I try not to be, rather that other people can conclude that the possibility of my bias affecting my results has been minimized). I am in no place to decide whether this is true in the particular case of linguistics (perhaps funding is much more difficult to acquire than I am familiar with, or slowing yourself down even long enough to get IRB approval would put you at a significant competitive disadvantage relative to other linguists), but at this abstract level, I think that performing “formal” experiments falls into the category of “every reasonable measure."

Greg Hickok said...

Although this discussion represents probably the longest ever post on TB, I think it is an interesting issue that is very much worth discussing. It is certainly the case that many psychologists view linguistic research as less rigorous than a field that regularly calculates t-values. Ted and Ev's call for linguistics to run more experiments could result, if their urgings are followed, in a science that is viewed more favorably from the outside. (On the other hand, it unfortunately reinforced the incorrect view that linguistics is not really a science.) And all involved agree that for more than a few of the judgments on which linguistic theory depends, some experimentation is in order. Jon is in the vanguard of linguists doing exactly this.

But be careful what you wish for in terms of quantitative rigor. None of us has the time or resources to quantify every detail of our experiments. For example, do those of you doing language studies quantify the level of linguistic competence (or reading skill) of your subjects? Or do you trust that they are native speakers (proficient readers) because they say so? (For the most part we trust them.) Have you thrown out a subject because even though they said they were native, they spoke with an accent? (Yes.) Is this decision based on quantitative data or the experimenter's intuition? (Intuition.)

More to the point: When you design a rigorous quantitative psycholinguistic experiment that uses reading times to measure the effects of x on sentence processing, do you pre-test all your stimuli to make sure that a random sample of 30 subjects judges each of your stimuli as grammatical? Or do you use your own linguistic intuitions to determine whether your stimuli are part of your subjects' language? You use and trust your own intuitions! The same ones that linguists use.

There are some things that simply don't need to be experimentally tested: that the word "plant" is ambiguous, that a Necker cube is bi-stable, that "Who did you see Mary and?" is ungrammatical whereas "Who did you see Mary with?" is perfectly fine.

In the real world, with limited resources, we indeed need to make rational decisions about which aspects of our experiments to quantify. If we make a wrong choice now and then, we can rest assured that a reviewer will point it out, and we will then run Experiment 3B. This is how science works.

I suggest that the more dangerous situation (in terms of slowing the pace of a field) is weak science hiding behind the appearance of quantitative rigor.

JS said...

An important issue came up in the course of this exchange, but was not really dealt with. At one point, Diogo made the following reply to an argument from Ted & Ev's manuscript:

Ted & Ev's argument: "... Langendoen, Kalish-Landon & Dore (1973) tested this hypothesis in two experiments, and found many participants (“at least one-fifth”) who accepted these kinds of sentences as completely grammatical. Wasow & Arnold note that this result has had little impact on the syntax literature."

Diogo's response: "And it shouldn't. If only one fifth of the sample in Langendoen et al. (1973) failed to show the expected contrast, the results are not problematic at all. In fact, they are actually highly significant, and overwhelmingly support the original proposal: a simple one-tailed sign test here would give you a p-value of 1.752e-09 and a 95% CI of (0.7, 1) for the probability of finding the result in the predicted direction. Let me stress this again: what the experiment is actually telling you is that the results support the linguist's informal experiments, not the contrary, as Wasow & Arnold seem to think."

The implication here is that there is no distinction to be made between varying levels of acceptability; there is merely a binary choice between grammatical and ungrammatical. The theoretical implications are no different if 20% of subjects judge something acceptable versus 1% of subjects. It seems to me that this kind of black-and-white approach is the norm when it comes to theories of syntax. It also helps to justify the use of single observations. If linguistic theories were more probabilistic in their predictions of grammaticality, then there would be more motivation to get large samples from naive subjects.

Diogo and Jon said...

Thanks for the input, JS. I think there are two issues here that need to be teased apart.

The first is your assumption that linguists disregard the varying levels of acceptability. This just isn't true. Arguments made on the basis of effect sizes have been made throughout the history of linguistic theory. For instance, in the 1980s there was a classic (and infamous) distinction between Subjacency violations and ECP violations, both of which operate on what seems to be the exact same structural configuration, and both of which lead to extreme unacceptability. One of the pieces of evidence for the distinction was that the ECP violations were consistently judged worse than Subjacency violations (e.g., Huang 1982, Lasnik and Saito 1984, Chomsky 1986).

The second is your assumption that all of the variability in acceptability judgments should be accounted for by the theory of grammar. This is a strong claim to make, which you can see by applying it to psycholinguistics rather than linguistics. In the lexical access literature, repetition priming (reaction times to words in a lexical decision task are faster on the second presentation than the first) is as close to a blow-out effect as you are going to get. We just looked at some old repetition priming data that Diogo has. If we look at the individual subject data, it turns out that 4 out of 21 subjects show negative priming (i.e., they are slower for the second presentation when compared to the first). In other words, about 20% of the sample shows the opposite effect from what is theoretically predicted. Should psycholinguists interested in repetition priming be required to account for the 20% of the data that goes in the wrong direction?

Most psycholinguists would say no (or at the very least, not necessarily). The theory of lexical access is not a theory of all reaction times. It is one factor that affects reaction times, but we assume that there are many other factors that could have caused this variability. Reaction times are just a way of uncovering facts about lexical access mechanisms. The same goes for grammatical theory. Grammatical theory is not a theory of acceptability judgments. We assume that grammar is one major predictor of acceptability judgments, but a full theory of acceptability judgments would include many other factors. Acceptability judgments are just a way of uncovering facts about syntactic representations. Just like with reaction times, it is an empirical question whether any instance of variability in acceptability judgments should be included in the theory - one that can't be decided based on the mere existence of variability. As you would expect, this is an active research question in linguistics, with several different proposals on the table (see for instance the work of Sam Featherston and Frank Keller).

Colin Phillips said...

It is hard to resist the temptation to quote Henry Gleitman: "If an experiment is not worth doing, it is not worth doing well."

Let us not forget that one of the things that is at stake is how best to make use of scarce resources. Almost all of us are using money that comes from students' tuition, or from taxpayers' pockets, and when we are running experiments we are typically expending the valuable time of talented young researchers who carry out those experiments. We have a responsibility to make good use of those resources. And much of the time we are also competing for scarce resources by writing grant proposals that argue that our research is a high priority and represents good value for money.

Some linguistic facts are so obvious that it would be a waste of resources to test them experimentally. E.g., a test of whether English speakers find "John didn't leave" more acceptable than "John left not". Some other phenomena are far more subtle, and call for finer-grained tests. (But in those cases it would be suspect to experimentally confirm a subtle difference and then to use that to make categorical well-formedness judgments.) And then there are lots of other cases where one simply has to use one's judgment and one's conscience to figure out whether fancier testing is worthwhile. There is nothing unusual about this.

I might add that in discussions such as these one often encounters the complaint that experimental psychologists would take linguists more seriously if only they would spend more of their time doing the things that experimental psychologists spend their time doing. This claim does not stand up to closer scrutiny. If one controls for methodological practices by looking at linguists and psychologists who use very similar experimental methods in their research, and asks whether this eliminates cross-field skepticism and even scorn, then it becomes clear that the mistrust runs deeper, alas. Also, if one looks for the individuals in one scientific community who are taken most seriously by researchers from another field, it is clear that the best predictor of influence is shared interests, shared questions, or shared conclusions rather than shared methodologies. So the notion that linguists' arguments would be taken more seriously if they decorated them with a few simple judgment experiments is a myth.

Philip Hofmeister said...

I'm sorry I found this interesting conversation so late; I found both perspectives illuminating. But I wanted to add several points in defense of what Ted & Ev have advocated.

First of all, I find the claim that linguists, when making informal judgments, control for nuisance variables and use other scientific methods for controlling for personal bias a bit utopian. I've only been in linguistics for 10 years, but from experience with scores if not hundreds of syntacticians and semanticists, this is a mischaracterization of what happens probably 95% of the time. Most often, one or two examples are developed by the linguist, and posed to a few close friends or colleagues. So, if what Diogo and Jon describe actually happened, that would be a different story, but I don't think it does. What's more, there's no record of that data or an accompanying description of how the data was collected, what variation there was, etc. This makes it hard for future researchers to evaluate, and it is what separates other fields from standard practice in linguistics.

Second, with respect to "replicability", I think Jon and Diogo have slightly misrepresented how linguistic research advances. In particular, when a researcher publishes a paper listing judgments for some set of stimuli, those judgments are taken as truth, unless strong evidence (in the form of corpus examples, formal experiments, etc.) proves them otherwise. Submitting a paper with simply differing judgments from some prior author does not fly in the linguistics world. In this sense, I wouldn't call repetition of these judgments replication, so much as citing or referencing someone else's findings.

Finally, the strength of Ted & Ev's point can be seen in the promulgation of erroneous conclusions and data in the syntax literature. As one example, Haegeman's widely used 1997 syntax textbook contains some incredibly dubious judgments. Some of the sentences judged ungrammatical have serious confounds (such as the fact that the sentences can't have a coherent meaning), but the judgments assigned to the sentences have profound theoretical consequences. There are numerous other such examples where judgments for particular constructions vary across researchers and underlie strikingly different theories. For instance, in an investigation of "either . . . or" disjunctive constructions, I found that the judgments of several linguists contrasted starkly with what a naive group of subjects found acceptable with respect to the possible word order positions of "either". As Ted says, I'm sure there are many more such cases.

My two cents . . .