Wednesday, May 16, 2012

Decline Effect in Linguistics?

It seems to me that in the past few years, the empirical foundations of the social sciences, especially Psychology, have been coming under increased scrutiny and criticism. For example, there was the New Yorker piece from 2010 called "The Truth Wears Off" about the "decline effect," or how the effect size of a phenomenon appears to decrease over time. More recently, the Chronicle of Higher Education had a blog post called "Is Psychology About to Come Undone?" about the failure to replicate some psychological results.

These kinds of stories are concerning at two levels. At the personal level, researchers want to build a career and reputation around establishing new and reliable facts and principles. We definitely don't want the result that was such a nice feather in our cap to turn out to be wrong! At a more principled level, as scientists, our goal is for our models to approximate reality as closely as possible, and we don't want the course of human knowledge to be diverted down a dead end.

Small effects

But, I'm a linguist. Do the problems facing psychology face me? To really answer that, I first have to decide which explanation for the decline effect I think is most likely, and I think Andrew Gelman's proposal is a good candidate:
The short story is that if you screen for statistical significance when estimating small effects, you will necessarily overestimate the magnitudes of effects, sometimes by a huge amount.

I've put together some R code to demonstrate this point. Let's say I'm looking at two populations, and unknown to me as a researcher, there is a small difference between the two, even though they're highly overlapping. Next, let's say I randomly sample 10 people from each population, do a t-test for the measurement I care about, and write down whether or not the p-value < 0.05 and the estimated size of the difference between the two populations. Then I repeat this 1000 times. Some proportion (approximately equal to the power of the test) of the t-tests will have successfully identified a difference. But did those tests which found a significant difference also accurately estimate the size of the effect?
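
A minimal sketch of that procedure looks something like the following (the object names here are just illustrative, and this is not necessarily the exact code referred to above):

# A sketch of the simulation: two populations with a small true difference,
# repeatedly sampled, t-tested, and the estimated difference recorded.
set.seed(1)

n_sims    <- 1000  # number of simulated experiments
n         <- 10    # people sampled from each population
true_diff <- 0.1   # true difference between the population means

sig      <- logical(n_sims)  # did the t-test come out significant?
est_diff <- numeric(n_sims)  # estimated difference in means

for (i in seq_len(n_sims)) {
  a <- rnorm(n, mean = 1)
  b <- rnorm(n, mean = 1 + true_diff)
  sig[i]      <- t.test(b, a)$p.value < 0.05
  est_diff[i] <- mean(b) - mean(a)
}

mean(sig)                        # proportion significant: roughly the power of the test
mean(est_diff[sig]) / true_diff  # average exaggeration among the "successful" tests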

For the purpose of the simulation, I randomly generated samples from two normal distributions with standard deviation 1, and means 1 and 1.1. I did this for a few different sample sizes, 1000 times each. This figure shows how many times larger the estimated effect size was than the true effect for tests which found a significant difference. The size of each point shows the probability of finding a significant difference for a sample of that size.
So, we can see that for small sample sizes, the test has low power. That is, you are not very likely to find a significant difference, even though there is a true difference (i.e., you have a high rate of Type II error). Even worse, though, is that when the test has "worked," and found a significant difference when there is a true difference, you have both Type M (magnitude) and Type S (sign) errors. For small sample sizes (between 10 and 50 samples each from the two populations), the estimated effect size is between 5 and 10 times greater than the real effect size, and the sign is sometimes flipped!
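
Here's the same kind of sketch run across a range of sample sizes, tallying power along with Type M and Type S summaries (again, illustrative code with made-up helper names like summarise_n, not the code behind the figure):

# Power, average exaggeration (Type M), and sign errors (Type S)
# across sample sizes; a sketch, not the original simulation code.
set.seed(1)
true_diff <- 0.1

sim_one <- function(n, alpha = 0.05) {
  a <- rnorm(n, mean = 1)
  b <- rnorm(n, mean = 1 + true_diff)
  c(sig = as.numeric(t.test(b, a)$p.value < alpha),
    est = mean(b) - mean(a))
}

summarise_n <- function(n, n_sims = 1000, alpha = 0.05) {
  sims <- replicate(n_sims, sim_one(n, alpha))
  est  <- sims["est", sims["sig", ] == 1]   # estimates from significant tests only
  c(n      = n,
    power  = mean(sims["sig", ]),
    type_m = mean(abs(est)) / true_diff,    # how many times too big, on average
    type_s = mean(est < 0))                 # how often the sign is flipped
}

t(sapply(c(10, 25, 50, 100, 500), summarise_n))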

Simply choosing a stricter p-value threshold only helps insofar as it makes you less likely to conclude that you've found a significant difference in the first place (i.e., you ramp up your Type II error rate by reducing the power of your test), but it does nothing to ameliorate the size of the Type M errors when you do find a significant difference. This figure facets the results by p-value threshold.
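
Re-using the summarise_n helper from the sketch above with a stricter threshold shows the trade-off: power drops, but the surviving estimates are no less exaggerated (if anything, more so):

# Stricter threshold: lower power, but the Type M errors don't go away.
t(sapply(c(10, 25, 50, 100, 500), summarise_n, alpha = 0.01))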

So do I have to worry?

So, I think how much I ought to worry about the decline effect in my research, and in linguistic research in general, is inversely proportional to the size of the effects we're trying to chase down. If the true size of the effects we're investigating is large, then our tests are more likely to be well powered, and we are less likely to experience Type M errors.

And in general, I don't think the field has exhausted all of our sledgehammer effects. For example, Sprouse and Almeida (2012) [pdf] successfully replicated somewhere around 98% of the syntactic judgments from the syntax textbook Core Syntax (Adger 2003) using experimental methods (a pretty good replication rate if you ask me), and in general, the estimated effect sizes were very large. So one thing seems clear. Sentence 1 is ungrammatical, and sentences 2 and 3 are grammatical.
  1. *What did you see the man who bought?
  2. Who did you see who bought a cow?
  3. Who saw the man who bought a cow?
And the difference in acceptability between these sentences is not getting smaller over time due to the decline effect. The explanatory theories for why sentence 1 isn't grammatical may change, and who knows, maybe the field will decide at some point that its ungrammaticality is no longer a fact that needs to be explained, but the fact that it is ungrammatical is not a moving target.

Maybe I do need to worry

However, there is one phenomenon that I've looked at that I think has been following a decline effect pattern: the exponential pattern in /t d/ deletion. For reasons that I won't go into here, Guy (1991) proposed that if the rate at which a word-final /t/ or /d/ is pronounced in regular past tense forms like packed is given as p, the rate at which it is pronounced in semi-irregular past tense forms like kept is given as p^j, and the rate at which it is pronounced in monomorphemic words like pact is given as p^k, then j = 2 and k = 3.
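
Put another way, given observed retention rates for the three classes, j and k fall out as ratios of log rates. A tiny sketch, with made-up retention rates just for illustration:

# Under the exponential model the retention rates are p, p^j, and p^k,
# so the exponents are ratios of log retention rates.
exponents <- function(p_past, p_semiweak, p_mono) {
  c(j = log(p_semiweak) / log(p_past),
    k = log(p_mono)     / log(p_past))
}

# made-up retention rates, close to Guy's predicted pattern:
exponents(p_past = 0.80, p_semiweak = 0.64, p_mono = 0.51)
# j = 2.00, k = 3.02; Guy (1991) predicts j = 2, k = 3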

Here's a table of studies, and their estimates of j and k, plus some confidence intervals. See this code for how I calculated the confidence intervals.


Study                      Year  Dialect                  j (CI)              k (CI)
Guy                        1991  White Philadelphia       2.37 (1.17–4.74)    2.75 (1.86–4.26)
Santa Ana                  1992  Chicano Los Angeles      1.76 (1.35–2.29)    2.91 (2.51–3.39)
Bayley                     1994  Tejano San Antonio       1.51 (1.11–2.08)    2.99 (2.52–3.59)
Tagliamonte & Temple       2005  York, Northern England   1.12 (0.66–1.85)    1.43 (1.04–1.96)
Smith, Durham & Fortune    2009  Buckie, Scotland         0.64 (0.24–1.36)    2.33 (1.53–3.59)
Fruehwald                  2012  Columbus, OH             1.38 (0.76–2.48)    1.93 (1.59–2.35)


I should say right off the bat that none of these studies is a perfect replication of Guy's original study. They have different sample sizes, coding schemes, and statistical approaches. Mine, in the last row, is probably the most divergent, since I directly modeled and estimated the reliability of j and k using a mixed effects model, while the other studies calculated p^j and p^k and compared them to the maximum likelihood estimates for words like kept and pact.
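
For what it's worth, one simple way to get intervals like the ones in the table from raw token counts is to bootstrap the three retention rates and recompute the exponents each time. This is only a sketch of one possible approach (boot_jk and the counts below are invented for illustration), not necessarily the method behind the table above:

# A possible bootstrap for j and k from token counts
# (invented counts; one illustrative approach, not necessarily the one used above).
set.seed(1)
boot_jk <- function(kept_past, n_past, kept_semi, n_semi, kept_mono, n_mono,
                    n_boot = 5000) {
  reps <- replicate(n_boot, {
    p_past <- rbinom(1, n_past, kept_past / n_past) / n_past
    p_semi <- rbinom(1, n_semi, kept_semi / n_semi) / n_semi
    p_mono <- rbinom(1, n_mono, kept_mono / n_mono) / n_mono
    c(j = log(p_semi) / log(p_past),
      k = log(p_mono) / log(p_past))
  })
  apply(reps, 1, quantile, probs = c(0.025, 0.5, 0.975))
}

# e.g. 800/1000 regular past, 320/500 semiweak, 1020/2000 monomorphemic tokens retained
boot_jk(kept_past = 800,  n_past = 1000,
        kept_semi = 320,  n_semi = 500,
        kept_mono = 1020, n_mono = 2000)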

But needless to say, estimates of j and k have not hovered nicely around 2 and 3. 

26 comments:

  1. Daniel Ezra JohnsonMay 16, 2012 at 11:45 AM

    Joe, when I analyzed the Buckeye Corpus data you gave me, it showed that semi-irregular past tenses deleted MORE than monomorphemes, which I called "an unexpected result that deserves future investigation". But you're showing the usual ordering of the three categories. Any idea why the difference?

    Replies
    1. I can't be completely certain, but I now have code you can clone under the repository TD-Classifier on my github. My phone won't let me paste the link here. Check it out and send me a pull request if anything looks like it needs fixing.

  2. Daniel Ezra JohnsonMay 16, 2012 at 11:47 AM

    My other question is: don't your charts show that the problem of overestimation is alleviated by a large enough sample size? This contradicts Gelman to some extent.

    Replies
    1. I think Gelman says that smaller samples are more likely to have worse Type M errors. His suggestion of doing retrospective power analyses points to underpowered studies being the ones most likely to overestimate effect sizes.

  3. Daniel Ezra JohnsonMay 16, 2012 at 11:53 AM

    My third question is, why not try to replicate Guy's results in White Philadelphia first - you're adding another major (potential) variable when you compare them to global Englishes.

    Replies
    1. It's on my to-do list! But given Guy's model, the dialect differences would have to come down to fundamentally different morphological structures for the cases involved, which seems less likely than there simply not being the relationship Guy thought there was.

  4. Christian DiCanioMay 16, 2012 at 2:10 PM

    I think that there are two major contributors to the decline effect in psychology that might not pose a problem to linguists. The first is the effect size-publishability issue. Often, experiments are constructed for a debate involving small effects. If you observe a p value of .07 after having run 20 subjects, you might consider running 5 more subjects to try to get a p value under .05. The only reason why one might do this is to use the "magic" word, 'significant,' in one's paper. As a phonetician, I have never constructed an experiment in this fashion. Though, it is not uncommon in psychology. The result of this is that it is entirely possible to get Type M errors and false positives.

    The second issue has to do with the observability of the phenomena. Empirical studies in linguistics are often mostly behavioral. We observe speakers doing something that we can clearly measure, e.g. not pronouncing a past tense morpheme, and want to know the conditions in which they do it. It may be that we construct the wrong context and observe no difference in behavior (or a very small one), but we are usually basing the design of our experiments on things already observed in the literature (even impressionistically). Despite this, most of what we are interested in is measurable. This contrasts starkly with work in cognitive psychology where we are interested in the effects of more abstract dimensions, like phonological neighborhood density. As such abstract dimensions only play a role in speech production/perception insofar as they are predicted within a particular psycholinguistic theory, they are more ephemeral than things like vowel formants. I suppose this is also a reason why I've always considered the methodologies in linguistics to be much more akin to those in ethology than psychology.

  5. I tried replying to this earlier, but I guess Disqus doesn't like comments from mobile phones.

    I can't be 100% sure why the numbers have come out different, but between when I started working with the Buckeye data and now, I've gotten better at doing reproducible research. So, if you go get the Buckeye Corpus, you ought to be able to recreate the data set I'm working with now using the code here: https://github.com/JoFrhwld/TD-Classifier

    The python script crawls the actual Buckeye transcriptions. buckeCoder.R codes the morphological and phonological factors, and buckSubset.R takes a subset of the data.

  6. I think Gelman actually says that smaller sample sizes are more prone to Type M errors. His suggestion that we should do retrospective power analyses seems to point to the fact that underpowered studies are more likely to overestimate the effect size.

  7. It's on my to-do list! But, given Guy's explanatory model, the differences between dialects would have to be driven by fundamentally different morphological structures for the cases involved.

  8. Daniel Ezra JohnsonMay 17, 2012 at 10:46 AM

    Just FYI I have recently done exactly what you mention, adding subjects in search of a "significant" effect, in a syntactic acceptability experiment. I can see why it's a bad practice but it's not limited to psychology. Don't quite see why linguists wouldn't have the same pressures for a "significant" result - are our editors/referees that much more enlightened than those in psych?

  9. Daniel Ezra JohnsonMay 17, 2012 at 10:56 AM

    Sorry, a couple more questions about the exponential results!
    You say here: Fruehwald 2012, Columbus OH, j = 1.38, k = 1.93. But what about Fruehwald 2008:
    Buckeye Corpus (N = 13,414)
    Retention: .768, .588, .467
    Predicted: .768, .589, .453

    I know the data's not exactly the same, nor is the statistical analysis, but you're going from irregulars being almost exactly spot on the exponential prediction in 2008 (j = 2.01), and monomorphemes being fairly close to the prediction (k = 2.88), to these drastically different results in 2012: j = 1.38, k = 1.93. Or am I confused?

  10. Daniel Ezra JohnsonMay 17, 2012 at 11:05 AM

    Agreed. Although another interpretation is that all the divergent results in other places challenge "Guy's explanatory model" - and perhaps even challenge the procedure "assume what's true in Philadelphia is true everywhere, until proven otherwise".

  11. Ah! This is all entirely about the statistical methodology. With the data set I'm working with now, when I calculate the probabilities the same way as I know I did in 2008, it comes out

    Retention: 0.78  0.63  0.47
    j = 1.9, k = 3.1

    Not a huge difference due to coding. 

  12. Agreed. I guess I should have said that I think fundamentally different morphological structures in each dialect are less likely than the Exponential Model simply not being true.

  13. Daniel Ezra JohnsonMay 17, 2012 at 11:09 AM

    I don't mean that the exponential account is actually likely to be true only in Philadelphia. I always thought the exponential thing was one of the best results in sociolinguistics, because of the way it ties quantitative variationism to phonological theory. But... since then, I've heard that it doesn't actually conform to any extant version of Lexical Phonology, and now it seems very shaky on the data side as well.

  14. Yes, but it gave us the great methodological gift of "multiplication."

  15. Daniel Ezra JohnsonMay 17, 2012 at 11:37 AM

    Guy 1991 was so right in 2008.

  16. There is a good methodological lesson here (though admittedly an obvious one), namely that the choice of statistical method greatly influences the estimates.

  17. The "size-publishability" issue is just as bad in linguistics as in psychology. Linguistic reviewers often complain about non-sig results to me. Conversely, they also get mad (as do many speakers) when I say that an effect is significant but so tiny as to be of no practical significance. Cf. the "QWERTY" debacle. 


    I don't see the contrast in methodologies either. Non-pronunciation of past tense is in fact difficult to observe, and it's not even clear WHAT should be observed: should we code something as a deleted /t/ if the burst is inaudible but an ultrasound reveals a pronounced coronal gesture? (My short answer is that what is audible is what can be learned, so we should focus on that, though that ultrasound data might be of interest to understanding the underlying processes.) (TD) is more ephemeral than, say, response times. I don't think there's any valid generalization of the form "linguistic observables are more/less observable than cog. psych. observables".

  18. Daniel Ezra JohnsonMay 17, 2012 at 12:16 PM

    I'm not even sure what we're talking about.

    We're only talking about percent retention in three categories. What could affect this?
    1) coding of the categories
    2) what predictors are included in a regression model
    3) use of mixed model?

    Can't we use Guy's methodology to replicate Guy? Using Fruehwald 2012 methodology, even Guy 1991 doesn't come very close to 2 and 3.

  19. Well, my "methodology" for calculating j and k given the Guy (1991) data is just to calculate j and k. Nothing too special about that. I don't know why no one has ever reported the maximum likelihood exponents.

    As for what could push the estimates around so much, I'm writing up a blog post on that right now.

  20. One source of shrinking effects that I haven't seen mentioned explicitly in this discussion is simultaneous development in theory. That is, over time we might expect theory in a particular area to move from broad contrasts (and large effects) toward more subtle contrasts between variables that result in smaller effects.



    I'll give an example from a recent synthesis (Plonsky and Gass, 2011 in Language Learning). We looked at 174 studies in the interactionist tradition of second language acquisition from 1980-2009 and found the average d values for treatment-comparison contrasts to decline over the life of this line of research: 1980s = 1.62, 1990s = .82, 2000s = .52.

  21. Does nsamp refer to the number of "observations" within each of the two groups or the total number of "observations"?

  22. nsamp refers to the number of observations within each group. So for nsamp = 10, there were 10 observations in each group, for a total of 20 observations.

  23. Did you know that Jonah Lehrer, the author of "The Truth Wears Off", has recently resigned in disgrace after being revealed as a fraud, plagiarist, etc.?
