Thursday, May 17, 2012

On calculating exponents

In my post on the decline effect in linguistics, the question came up of how I've calculated the exponents for the Exponential Model in my papers. I think this is a point worth clarifying, but it's not likely to be interesting to a broad audience. You have been forewarned.

To recap as briefly as possible: in English, when a word ends in a consonant cluster whose final segment is /t/ or /d/, that /t/ or /d/ is sometimes deleted. This deletion can affect a whole host of different words, but the ones which have been of most interest to the field are the regular past tense (e.g., packed), the semiweak past tense (e.g., kept) and morphologically simplex words (e.g., pact), which I'll call mono. Other morphological cases which can be affected, and which I believe have occasionally and erroneously been categorized with the semiweak, are the no-change past tense (e.g., cost), the "devoicing" (or something) past tense (e.g., built), the stem-changing past tense (e.g., found), etc. For the sake of this post, I'm only looking at the main three cases: past, semiweak, and mono.

Now, Guy (1991) came up with a specific proposal where if you described the proportion of pronounced /t d/ for past as p, for semiweak as p^j and for mono as p^k, then j = 2 and k = 3. It is specifically whether or not j = 2 and k = 3 that I'm interested in here. If you've calculated the proportions of pronounced /t d/ for each grammatical class, you can calculate j by log(semiweak)/log(past) and k by log(mono)/log(past). The trick is in how you decide to calculate those proportions.
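To make the arithmetic concrete, here is a minimal sketch in R. The proportions are made up (they are not from the Buckeye data) and are chosen so that the exponents come out to exactly 2 and 3; the helper name `exponent` is my own.

```r
# Hypothetical retention proportions, invented for illustration only.
p_past     <- 0.8
p_semiweak <- 0.8^2   # what the Exponential Model predicts if j = 2
p_mono     <- 0.8^3   # what the Exponential Model predicts if k = 3

# If p_class = p_past^x, then x = log(p_class) / log(p_past).
exponent <- function(p_class, p_past) log(p_class) / log(p_past)

j <- exponent(p_semiweak, p_past)
k <- exponent(p_mono, p_past)
print(c(j = j, k = k))
```

With real data the three proportions are estimated, so j and k inherit whatever choices went into estimating them.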

For this post, you can play along at home. Here's code to get set up. It'll load the Buckeye data I've been using, and do some data prep.


So, how do you calculate the rate at which /t d/ are pronounced at the end of the word when you have a big data set from many different speakers? Traditional practice within sociolinguistics has been to just pool all of the observations from each grammatical class across all speakers.

So you come out with j = 1.91, k = 3.1, which is a pretty good fit to the proposal of Guy (1991).

The problem is that this isn't really the best way to calculate proportions like this. There are some words which are super frequent, and they therefore get more "votes" in the proportion of their grammatical class. And, some speakers talk more than others, and they get more "votes" towards making the over-all proportions look more similar to their own. One approach to ameliorate this is to first calculate the proportion for each word within a grammatical class within a speaker, then for each grammatical class within a speaker, then within a grammatical class. Here's the code for this nested proportion approach.
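As a rough illustration of why the two approaches can disagree, here is a toy sketch in base R. It is not the code behind the numbers in this post: the tokens are invented, the column names (`speaker`, `gram`, `word`, `retained`) are assumptions, and only one grammatical class is shown.

```r
# Toy token-level data, invented for illustration: one row per token,
# retained = 1 if the /t d/ was pronounced, 0 if it was deleted.
tokens <- data.frame(
  speaker  = rep(c("s1", "s2"), times = c(5, 4)),
  gram     = "mono",
  word     = c(rep("just", 4), "pact", rep("just", 2), rep("fact", 2)),
  retained = c(0, 0, 0, 1,  1,  0, 1,  1, 1),
  stringsAsFactors = FALSE
)

# Pooled: every token gets one "vote", so frequent words and talkative
# speakers dominate.
pooled <- aggregate(retained ~ gram, data = tokens, FUN = mean)

# Nested: average within word (within speaker), then within speaker,
# then within grammatical class.
by_word    <- aggregate(retained ~ gram + speaker + word, data = tokens, FUN = mean)
by_speaker <- aggregate(retained ~ gram + speaker, data = by_word, FUN = mean)
nested     <- aggregate(retained ~ gram, data = by_speaker, FUN = mean)

print(pooled)
print(nested)
```

In this toy set the frequent word "just" drags the pooled proportion down, while the nested proportion gives each word and each speaker equal weight.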

All of a sudden, we're down to j = 1.34 and k = 2.05, and I haven't even dipped into mixed-effects models black magic yet.

But when it comes to modeling the proposal of Guy (1991), calculating the proportions is really just a means to an end. I asked Cross Validated how to directly model j and k, and apparently you can do so using a complementary log-log link. So here is the mixed effects model for j and k directly.
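For anyone wondering why the complementary log-log link does the trick, here is the short derivation as I understand it. It assumes the modeled outcome is deletion, with probability d, so that retention is r = 1 - d; the symbols d, r and the eta's are my own labels.

```latex
\begin{align*}
\operatorname{cloglog}(d) &= \log\bigl(-\log(1 - d)\bigr) = \log(-\log r) \\
\eta_{\text{past}}     &= \log(-\log p) \\
\eta_{\text{semiweak}} &= \log\bigl(-\log p^{j}\bigr) = \log j + \log(-\log p) \\
\eta_{\text{mono}}     &= \log\bigl(-\log p^{k}\bigr) = \log k + \log(-\log p)
\end{align*}
```

So with past as the reference level, the semiweak and mono coefficients are log j and log k, and exponentiating them gives j and k directly.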

The model estimates look very similar to the nested proportions approach: j = 1.38, k = 2.11.

What if we fit the model without the by-word random intercepts?

Now we're a bit closer back to the original pooled proportions estimates: j = 1.57, k = 3.19.

My personal conclusion from all this is that the apparent j = 2, k = 3 pattern is driven mostly by the lexical effects of highly frequent words. This table recaps all of the results, plus the estimates of two more models. One has just a by-speaker random intercept, and the other is a flat model, which looks just like the maximum likelihood estimate of the fully pooled approach, because it is.

Method                              j      k
Pooled                              1.91   3.1
Nested                              1.34   2.05
~Gram+(Gram|Speaker)+(1|Word)       1.38   2.11
~Gram+(Gram|Speaker)                1.57   3.19
~Gram+(1|Speaker)                   1.84   3.14
~Gram                               1.91   3.1
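To see why the flat ~Gram model has to reproduce the pooled numbers, here is a small sketch using base R's glm with made-up counts (not the Buckeye data). It assumes the response is deletion, so that exponentiated coefficients are j and k.

```r
# Invented counts, chosen so that pooled retention is 0.8, 0.8^2 and 0.8^3.
counts <- data.frame(
  gram     = factor(c("past", "semiweak", "mono"),
                    levels = c("past", "semiweak", "mono")),
  retained = c(800, 640, 512),
  deleted  = c(200, 360, 488)
)

# Flat model: deletion ~ grammatical class, complementary log-log link.
flat <- glm(cbind(deleted, retained) ~ gram,
            family = binomial(link = "cloglog"), data = counts)

# exp() of the non-reference coefficients estimates j and k.
model_jk <- exp(coef(flat)[c("gramsemiweak", "grammono")])

# The same quantities from the pooled proportions.
p <- counts$retained / (counts$retained + counts$deleted)
pooled_jk <- log(p[2:3]) / log(p[1])

print(rbind(model = unname(model_jk), pooled = pooled_jk))
```

Because the flat model has one parameter per grammatical class, its fitted proportions are just the pooled proportions, so the two rows agree up to the fitting tolerance.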

The lesson is that it can matter a lot how you calculate your proportions.

Wednesday, May 16, 2012

Decline Effect in Linguistics?

It seems to me that in the past few years, the empirical foundations of the social sciences, especially Psychology, have been coming under increased scrutiny and criticism. For example, there was the New Yorker piece from 2010 called "The Truth Wears Off" about the "decline effect," or how the effect size of a phenomenon appears to decrease over time. More recently, the Chronicle of Higher Education had a blog post called "Is Psychology About to Come Undone?" about the failure to replicate some psychological results.

These kinds of stories are concerning at two levels. At the personal level, researchers want to build a career and reputation around establishing new and reliable facts and principles. We definitely don't want the result that was such a nice feather in our cap to turn out to be wrong! At a more principled level, as scientists, our goal is for our models to approximate reality as closely as possible, and we don't want the course of human knowledge to be diverted down a dead end.

Small effects

But, I'm a linguist. Do the problems facing psychology face me? To really answer that, I first have to decide which explanation for the decline effect I think is most likely, and I think Andrew Gelman's proposal is a good candidate:
The short story is that if you screen for statistical significance when estimating small effects, you will necessarily overestimate the magnitudes of effects, sometimes by a huge amount.

I've put together some R code to demonstrate this point. Let's say I'm looking at two populations, and unknown to me as a researcher, there is a small difference between the two, even though they're highly overlapping. Next, let's say I randomly sample 10 people from each population, do a t-test for the measurement I care about, and write down whether or not the p-value < 0.05 and the estimated size of the difference between the two populations. Then I do this 1000 more times. Some proportion (approximately equal to the power of the test) of the t-tests will have successfully identified a difference. But did those tests which found a significant difference also accurately estimate the size of the effect?

For the purpose of the simulation, I randomly generated samples from two normal distributions with standard deviations 1, and means 1 and 1.1. I did this for a few different sample sizes, 1000 times each. This figure shows how many times larger the estimated effect size was than the true effect for tests which found a significant difference. The size of each point shows the probability of finding a significant difference for a sample of that size.
So, we can see that for small sample sizes, the test has low power. That is, you are not very likely to find a significant difference, even though there is a true difference (i.e., you have a high rate of Type II error). Even worse, though, is that when the test has "worked," and found a significant difference when there is a true difference, you have both Type M (magnitude) and Type S (sign) errors. For small sample sizes (between 10 and 50 samples each from the two populations), the estimated effect size is between 5 and 10 times greater than the real effect size, and the sign is sometimes flipped!
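For readers without the linked code, here is a minimal re-sketch of that kind of simulation in base R. It is not the original script: the sample sizes, the seed, and the function names are my own choices, and the exact numbers will differ from the figure.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
true_diff <- 0.1     # means 1 and 1.1, both with sd 1

# One simulated "study": sample n from each population and run a t-test.
simulate_one <- function(n, alpha = 0.05) {
  a <- rnorm(n, mean = 1.0, sd = 1)
  b <- rnorm(n, mean = 1.1, sd = 1)
  c(estimate    = mean(b) - mean(a),
    significant = t.test(b, a)$p.value < alpha)
}

# Repeat the study and summarise only the "significant" results.
summarise_n <- function(n, reps = 1000) {
  sims <- t(replicate(reps, simulate_one(n)))
  sig  <- sims[, "significant"] == 1
  data.frame(
    n            = n,
    power        = mean(sig),                                     # share significant
    exaggeration = mean(abs(sims[sig, "estimate"])) / true_diff,  # Type M
    wrong_sign   = mean(sims[sig, "estimate"] < 0)                # Type S
  )
}

results <- do.call(rbind, lapply(c(10, 50, 100, 500, 1000), summarise_n))
print(results)
```

The `exaggeration` column is the average size of the significant estimates relative to the true difference, and `wrong_sign` is the share of significant results pointing the wrong way.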

Taking the approach of just choosing a smaller p-value threshold will help you out insofar as you will be less likely to conclude that you've found a significant difference when there is a true difference (i.e., you ramp up your Type II error rate, by reducing the power of your test), but that doesn't do anything to ameliorate the size of the Type M errors when you do find a significant difference. This figure facets by different p-value thresholds.

So do I have to worry?

So, I think how much I ought to worry about the decline effect in my research, and linguistic research in general, is inversely proportional to the size of the effects we're trying to chase down. If the true sizes of the effects we're investigating are large, then our tests are more likely to be well powered, and we are less likely to experience Type M errors.

And in general, I don't think the field has exhausted all of our sledgehammer effects. For example, Sprouse and Almeida (2012) [pdf] successfully replicated somewhere around 98% of the syntactic judgments from the syntax textbook Core Syntax (Adger 2003) using experimental methods (a pretty good replication rate if you ask me), and in general, the estimated effect sizes were very large. So one thing seems clear. Sentence 1 is ungrammatical, and sentences 2 and 3 are grammatical.
  1. *What did you see the man who bought?
  2. Who did you see who bought a cow?
  3. Who saw the man who bought a cow?
And the difference in acceptability between these sentences is not getting smaller over time due to the decline effect. The explanatory theories for why sentence 1 isn't grammatical may change, and who knows, maybe the field will decide at some point that its ungrammaticality is no longer a fact that needs to be explained, but the fact that it is ungrammatical is not a moving target.

Maybe I do need to worry

However, there is one phenomenon that I've looked at that I think has been following a decline effect pattern: the exponential pattern in /t d/ deletion. For reasons that I won't go into here, Guy (1991) proposed that if the rate at which a word final /t/ or /d/ is pronounced in past tense forms like packed is given as p, the rate at which it is pronounced in semi-irregular past tense forms like kept is given as p^j, and the rate at which it is pronounced in regular words like pact is given as p^k, then j = 2, k = 3.

Here's a table of studies, and their estimates of j and k, plus some confidence intervals. See this code for how I calculated the confidence intervals.


Study                      Year   Dialect                   j (lower, upper)     k (lower, upper)
Guy                        1991   White Philadelphia        2.37 (1.17, 4.74)    2.75 (1.86, 4.26)
Santa Ana                  1992   Chicano Los Angeles       1.76 (1.35, 2.29)    2.91 (2.51, 3.39)
Bayley                     1994   Tejano San Antonio        1.51 (1.11, 2.08)    2.99 (2.52, 3.59)
Tagliamonte & Temple       2005   York, Northern England    1.12 (0.66, 1.85)    1.43 (1.04, 1.96)
Smith & Durham & Fortune   2009   Buckie, Scotland          0.64 (0.24, 1.36)    2.33 (1.53, 3.59)
Fruehwald                  2012   Columbus, OH              1.38 (0.76, 2.48)    1.93 (1.59, 2.35)


I should say right off the bat that all of these studies are not perfect replications of Guy's original study. They have different sample sizes, coding schemes, and statistical approaches. Mine, in the last row, is probably the most divergent, as I directly modeled and estimated the reliability of j and k using a mixed effects model, while the others calculated p^j and p^k and compared them to the maximum likelihood estimates for words like kept and pact.

But needless to say, estimates of j and k have not hovered nicely around 2 and 3. 

Thursday, April 19, 2012

Come and see

Yesterday, as a pre-amble to an ordinary newsletter sent out via listserv to most PhD students at UPenn, we were offered this piece of advice:
Tip of the day: You should all know this by now: It is incorrect to say “come and see” or “come out and help”, or any other “come…and…” phrase. It is an infinitive phrase: “Come to see”, “Come out to help”, “Come to have fun”. Don’t aggravate anyone’s pet peeves; just write and say it correctly. You’re welcome.
Well, many of us linguistics graduate students felt this merited some kind of response. I don't know about other linguists out there, but if someone said this to me in a personal e-mail, or in conversation, I couldn't not respond.

And then, an amazing thing happened. We started drafting a letter in a Google document with 16 contributors. It was a little chaotic, but we marshaled together intuitions, data, and argumentation, and had drafted this message in about an hour's time.
To whom it may concern:

We were recently sent a grammar “tip” via the [redacted] listserv which read:
Tip of the day: You should all know this by now: It is incorrect to say “come and see” or “come out and help”, or any other “come…and…” phrase. It is an infinitive phrase: “Come to see”, “Come out to help”, “Come to have fun”. Don’t aggravate anyone’s pet peeves; just write and say it correctly. You’re welcome.
The linguistics graduate students felt that this required a response, as in fact, the cited examples “come and see” and “come out and help” are both grammatical and widely used constructions in American English.

The two constructions differ slightly in meaning. If one says,
  • Mary came and saw Tupac’s hologram perform.
it must be the case that the performance actually occurred; it cannot be the case that there were technical difficulties and the performance was cancelled. However,
  • Mary came to see Tupac’s hologram perform.
admits the possibility that the performance was cancelled due to technical difficulties. Therefore, asserting that the infinitive phrase is a uniformly appropriate replacement for the conjoined phrase is not an appropriate representation of the linguistic facts.

Phrases like “come and see” are not restricted to the spoken idiom, but are also used in the written language. They even occur in texts considered by some to be canonical, as the following examples show:
He saith unto them, “Come and see”. (John 1:39, King James Bible) 
“Then you may come and see the picture”. (Merry Wives of Windsor II:II, William Shakespeare) 
“Will you come and see me?” (Pride & Prejudice, chap. 26, Jane Austen)
Generally, grammatical prescriptivism contributes little to useful discourse, and may even cause intelligent language users to be unfairly stigmatized. Thus, while we appreciate [redacted]'s light-hearted "tips-of-the-day," we would encourage authors to keep an open mind about the breadth of possible language use, especially in public forums.

Sincerely,

Jana Beck*
Claire Crawford*
[redacted]*
Sabriya Fisher*
Aaron Freeman*
Lauren Friedman*
Josef Fruehwald*
Kyle Gorman*
Marielle Lerner*
Caitlin Light*
Laurel MacKenzie*
Brittany McLaughlin*
Hilary Prichard*
Kobey Shwayder*
Jon Stevens*
[redacted]*

*Department of Linguistics
Thinking about it some more, I think at least the past tense "came to see" even has the implicature that either the seeing was unsuccessful, or there is some other more relevant event than the seeing which the speaker is about to tell us about.

Anyway, I think we did a bang-up job, and produced a really excellent message, especially considering there were 16 authors!

Saturday, April 14, 2012

Linguistic Notation Inside of R Plots!

So, I've been playing around with learning knitr, which is a Sweave-like R package for combining LaTeX and R code into one document. There's almost no learning curve if you already use Sweave, and I find knitr's design and usage to be a lot nicer.

I wasn't going to make a blog post or tutorial about knitr, because the documentation is already pretty good, and contains a lot of tutorials.  However, I've just had a major victory in incorporating linguistic notations into plots using knitr, and I just had to share. I'll show you the payoff first, and then include the details.

First, I managed to successfully use IPA characters as plot symbols and legend keys.
The actual data in the plot is on car fuel economy, but that's not the point. Look at that IPA!

Then, I tried to expand on the principles that got me the IPA, and look what I produced.
Yes, that is a syntax tree overlaid on top of the plot. But why stop there when you could go completely crazy?

How to do it.

The important thing about making these plots is that they were easy given my pre-existing knowledge of R, LaTeX and what I've learned about knitr.  The crucial element here is that knitr supports tikz graphics. I don't know anything about tikz graphics, and I still don't, which means that if you don't know anything about tikz graphics, you can still make plots like these.

Like most linguists who use LaTeX, I already know how to include IPA characters and draw syntactic trees in a LaTeX document. It's as simple as
...
\usepackage{tipa}
\usepackage{qtree}
...
\textipa{D C P}
\Tree [.S NP VP ]
...

What is so cool about the tikz device is that it lets you define these notations in LaTeX syntax, and then incorporates them into R graphs. Here are the important code chunks to include in your knitr document to make it all work.

1 — Load the right R packages

Early on, load the ggplot2 and tikzDevice R packages.
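A minimal sketch of what that first chunk might contain, written as plain R that can go inside a `<<>>=` ... `@` chunk. The check for missing packages is my own addition, not something knitr requires.

```r
# Packages this walkthrough relies on.
needed <- c("ggplot2", "tikzDevice")

# Find out which of them are not installed, without raising an error.
missing_pkgs <- needed[!vapply(needed, requireNamespace, logical(1),
                               quietly = TRUE)]

if (length(missing_pkgs) > 0) {
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
} else {
  library(ggplot2)
  library(tikzDevice)
}
```

If both packages are already installed, the two `library()` calls on their own are all you need.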

2 — Define your LaTeX libraries

Then, you need to tell the tikz device which LaTeX packages you want to use.
<<>>=
    options(tikzLatexPackages = c(getOption("tikzLatexPackages"),
                                  "\\usepackage{tipa}",
                                  "\\usepackage{qtree}"))
@

3 — Define the plotting elements in LaTeX

We're done with the hard part. Now, it's as simple as faking up some data...
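<<>>=
    # drv may be stored as character rather than as a factor, in which
    # case `levels<-` would not relabel it; factor() handles both cases.
    # The labels map onto the sorted values "4", "f", "r".
    mpg$drv <- factor(mpg$drv, labels = c("\\textipa{D}",
                                          "\\textipa{C}",
                                          "\\textipa{P}"))
 
    # The same small qtree string on every row, used as the text label.
    mpg$tree <- "{\\footnotesize \\Tree [.S NP VP ]}"
@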

4 — Plot the data using the tikz device

...and plotting it, using the tikz device.
<<dev="tikz", fig.width=8, fig.height=5, out.width="0.9\\textwidth", fig.align="center">>=
    ggplot(mpg, aes(displ, hwy, label = drv, color = drv)) + 
            geom_text() + 
            stat_smooth()+
            xlab("\\textipa{IPA!}")    
@
Or, in the case of the syntactic trees,
<<dev="tikz", fig.width=8, fig.height=5, out.width="0.7\\textwidth", fig.align="center">>=
    ggplot(mpg, aes(displ, hwy, label = tree))+
            geom_text() + 
            stat_smooth()+
            xlab("TREES")
@

5 — Compile the .Rnw to a .tex document

Here's some source code to embed these plots in a beamer presentation. To compile a .tex document from the .Rnw source, you can run
library(knitr)
knit("./ling-plot.Rnw")
Then, just compile the .tex document however your little heart desires.

How to do it with one click

As if this weren't awesome and easy enough yet, it's possible to compile the whole document in one click using RStudio, as outlined on this knitr page. You'll need to download the development (i.e. not guaranteed to be stable) RStudio release, then set the compilation option to use knitr, and you're done!

I have to say that from a practical standpoint, I've found writing Sweave documents in RStudio to be a much better experience than what I was doing before, because I can run and debug the R code from within the .Rnw source document. No need to go flipping back and forth between a TeX editor and R.

P.S. I highlighted the code above at http://www.inside-r.org/pretty-r

Saturday, March 31, 2012

More on Philadelphia Homicide

I've been doing more analysis of the Philadelphia Homicide data that the Philadelphia Inquirer has published, and presented some of it at the Philadelphia UseR group yesterday. My slides [pdf] and source [knitr .Rnw] are on github.

I should be clear that I am not an expert on crime and murder. In fact, I'm not even fairly knowledgeable. If anyone out there with more expertise has strong criticism of my "analysis" (really, it's just a rough exploration of the data), I'll eat it, and I'll look forward to your own analysis of the data (again, it's right here). Here are some of the most striking patterns that I found.

Results

First, here is the total number of murders that occurred over the past 23 years, broken down by the day of the week. The weekends are worse than the weekdays.

Next, here is the total number of murders by hour of the day. The hour of the day was not included in the data until 2006, so this only represents murders between 2006 and 2011. The plot is centered around midnight, so the afternoon of Day 1 is on the left, and the morning of Day 2 is on the right.
It looks like there's something weird going on around 11pm and midnight, which I have to chalk up to the reporting patterns of the PPD. For some reason, it seems like murders which occurred in the midnight hour are more likely to be logged as occurring at 11pm.

Here is the most striking plot that I produced this time around. It plots, by month, the average frequency of murders. The y-axis represents 1 murder every X days.

Since 1988, the African American community has been living in a Philadelphia with approximately a murder every day, or every other day. The White community, on the other hand, has been living in a Philadelphia with a murder once a week.

I also did some meager statistical analysis, specifically Poisson regression with terms for the month (that is, January, February, etc., to look for a seasonal pattern), race of the victim, and weapon used. There was a significant month effect, but the coefficients didn't have much of a pattern to them. I did use number of days in the month as an offset in the regression, so it's not that. More importantly, there was an unsurprising main effect of race, but also a big interaction between race and weapon. Specifically, African American victims were way more likely to be killed by a gun.

Guns and knives are the two most common weapons used in murders in the data. White murder victims are 2.54x more likely to have been shot than stabbed, while African American murder victims are 7.19x more likely to have been shot than stabbed, meaning that the shot-to-stabbed ratio for African American murder victims is 2.83x the ratio for White murder victims.
Update: There was a pretty serious flaw in my regression: if there was a month where, say, no African Americans were murdered with a knife (and there were plenty), that month's data was missing, rather than 0. Filling in the data appropriately to reflect months with 0 murders for a particular race x weapon combination, the estimates are pretty different. White murder victims are 5.71x more likely to be murdered with a gun than a knife, while African American murder victims are 8.62x more likely to be murdered with a gun than a knife, meaning the shot-to-stabbed ratio for African American murder victims is 1.51x the ratio for White murder victims. So, that's a pretty serious revision, approximately halving the multiplier. I've already updated the linked code and slides.
So, gun deaths are an especially acute problem in the African American community. In fact, if you exclude gun deaths from the data, it actually looks like the racial disparity in murder rates has been narrowing.
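Here is a rough base R sketch of the zero-filling step described in the update, followed by a Poisson regression with a days-in-month offset. The counts are invented, the column names are assumptions, and the month term is dropped because the toy data only covers two months, so this is not the linked analysis.

```r
# Invented monthly counts: rows exist only where at least one murder was
# recorded, which is the flaw described in the update.
counts <- data.frame(
  month  = c("Jan", "Jan", "Jan", "Feb", "Feb"),
  race   = c("B", "B", "W", "B", "W"),
  weapon = c("gun", "knife", "gun", "gun", "knife"),
  n      = c(20, 3, 4, 18, 1),
  stringsAsFactors = FALSE
)
days <- data.frame(month = c("Jan", "Feb"), days = c(31, 28),
                   stringsAsFactors = FALSE)

# Build every month x race x weapon combination, then fill the gaps with 0.
full_grid <- expand.grid(month  = unique(counts$month),
                         race   = unique(counts$race),
                         weapon = unique(counts$weapon),
                         stringsAsFactors = FALSE)
filled <- merge(full_grid, counts, all.x = TRUE)
filled$n[is.na(filled$n)] <- 0
filled <- merge(filled, days)

# Poisson regression with log(days in month) as the offset.
fit <- glm(n ~ race * weapon + offset(log(days)),
           family = poisson, data = filled)

# Gun-to-knife rate ratio for the reference race level ("B").
gun_vs_knife_B <- exp(-coef(fit)[["weaponknife"]])
print(gun_vs_knife_B)
```

Without the zero-filling, the months with no knife murders would simply be absent, which inflates the estimated knife rate and shrinks the gun-to-knife ratio for that group.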


It is purely coincidental that I'm posting this on the same day that the Philadelphia Police Department are doing a gun buyback. You can bring in a gun and receive a $100 Shoprite voucher, no questions asked. Seems like a good initiative.

Analysis Discussion

I spent a bit of time trying to figure out what I thought the most meaningful way to represent the murder rate was. First, I calculated the murder frequency by counting the number of murders in each month, then dividing that by the number of days in the month, for (n murders/n days) = murders per day. But the resulting measure had values like 0.14 murders per day, which isn't too informative. What people want to know about murders, or at least what I want to know, is how often murders happen, not how many happened in a given time window. So, instead, I calculated (n days/n murders) = days per murder.

The y-axis for the murder rate figures is also on a logarithmic scale, which is reasonable given both the distribution of the data and the way people perceive timescales. From a human perspective, the difference between 1 day and 2 days feels larger than the difference between 3 weeks and 4 weeks. The y-axis is also flipped, to indicate that smaller numbers mean "more often". I managed the reversed log transformation by writing my own coordinate transformation using the new scales package. Here's the R code.
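In case the link goes stale, here is a base R sketch of the two ingredients: the days-per-murder measure and a reversed log transformation with its inverse. The function names are mine, and the example month is made up. As far as I know, a transform/inverse pair like this can be wrapped with `scales::trans_new("reverse_log", reverse_log, reverse_log_inverse)` and handed to a ggplot2 continuous scale, but that part is not run here.

```r
# Days per murder: how often murders happen, rather than how many per day.
days_per_murder <- function(n_murders, n_days) n_days / n_murders

# Reversed log10: larger values (rarer murders) map to smaller positions,
# so "more often" ends up at the top of the axis.
reverse_log         <- function(x) -log10(x)
reverse_log_inverse <- function(y) 10^(-y)

# A made-up month: 4 murders in 28 days is about 0.14 murders per day,
# or one murder every 7 days.
rate <- days_per_murder(n_murders = 4, n_days = 28)
print(rate)

# On the transformed axis, "every day" sits above "every 7 days".
print(reverse_log(c(1, 7, 28)))
```

The inverse is needed so that axis breaks chosen on the transformed scale can be labeled in the original units.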

Wednesday, March 7, 2012

Philadelphia Schools

I'm on spring break, and yesterday I took some time to check off some items on my to-do list, namely:
  1. Start getting acquainted with all the new features of ggplot2 [PDF].
  2. Get a handle on dealing with geographic data in R.
I've done some furtive geographic analysis using R [pdf], but the code behind it was very hacky. There is a whole field of geospatial data analysis out there that I am really ignorant of, and still am, but I've made a little bit of progress.

I mostly followed the tutorial laid out here for making maps in ggplot2. The most difficult part was getting the rgdal package installed. It's one of those packages that relies on other, non-R libraries being installed. I managed to get GDAL and Proj.4 installed (even though I honestly don't know what they do), and got rgdal installed (I had to work around an apparently non-standard installation location for Proj.4).

Now, it's all about getting some good data, and fortunately, I stumbled across opendataphilly.org yesterday as well! I found a shapefile of all schools in Philadelphia, and a separate data set about how many public and charter high school graduates in 2010 went on to postsecondary education of various sorts. Unfortunately, there weren't any shared IDs of any sort between the two data sets, so to join them I had to hack it by hand, mostly.
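For what it's worth, a join like this can be partly automated by normalizing the school names before merging and then fixing the leftovers by hand. This is only a sketch of that idea in base R: the data frames, column names, IDs, and the "Unmatched Academy" row are all hypothetical, and the real files surely need more normalization rules.

```r
# Normalize school names so that trivially different spellings line up.
normalize_name <- function(x) {
  x <- toupper(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("HIGH SCHOOL", "HS", x, fixed = TRUE)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Hypothetical stand-ins for the shapefile attributes and the graduate data.
shapes <- data.frame(shape_name = c("Northeast High School", "FRANKFORD HS"),
                     shape_id   = 1:2,
                     stringsAsFactors = FALSE)
grads  <- data.frame(report_name = c("NORTHEAST HS ", "Frankford High School",
                                     "Unmatched Academy"),
                     graduates   = c(652, 341, 100),
                     stringsAsFactors = FALSE)

shapes$key <- normalize_name(shapes$shape_name)
grads$key  <- normalize_name(grads$report_name)

# Keep every school from the graduate data; unmatched rows get NA shape_id.
joined    <- merge(shapes, grads, by = "key", all.y = TRUE)
unmatched <- joined[is.na(joined$shape_id), ]
print(unmatched$report_name)   # these still need fixing by hand
```

Whatever lands in `unmatched` is the part that still has to be hacked by hand.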

So, here is the result.
I'm not sure what I expected to see, which certainly weakens any conclusions I'd like to draw, but I am surprised at how little geographic patterning there is. I'm also almost certain that there are some data reporting problems. For example, that huge dark blue dot in the Northeast is Northeast High School, which reports that of their 652 graduates, 0 went on to any postsecondary education. I just don't think that can be true, and not because I'm an idealist. Northeast is right down the street from where I grew up, and while it's not a fancy prep school by any means, it has both a Magnet program and an International Baccalaureate program.

There's no way that zero students from Northeast went on to postsecondary education, a category which includes non-degree granting programs and specialized training programs. It's a lot more likely that they either didn't report the numbers, or the Pennsylvania Department of Education lost them, and then didn't distinguish between missing data and 0. Unfortunately, that calls all schools with reports of 0% postsecondary education into question, even though some schools probably did have 0 students go on to further education.

Looking at the distribution of the proportion of graduates going on to postsecondary education, the numbers are hugely bimodal (at least for the public schools).


Even after excluding the schools which reported 0 students going on to postsecondary education, there are still 3 schools with basically 0 students getting further education out of high school: Frankford (1/341), West Philly (1/208) and University City (2/205).

Excluding the schools which reported fewer than 1% of students going on to further education (assuming either that they have faulty data, or have acute problems of other sorts), I replotted the map (note that the colors now run from 50% to 100%).


Still no huge geographic patterns.

Here's the R code that I used (including links to the data).

Sunday, March 4, 2012

My Pocket Change

I'm playing around with some personal data collection, and using some cloud computing to visualize it. Following the directions in this blog post, I've written an R function which visualizes data it draws from a Google Docs spreadsheet, and uploaded it to OpenCPU's servers. The plots you're seeing in this post were actually generated by OpenCPU when you loaded this page, meaning they're live!


So, I've been logging, daily, my pocket change. The first plot shows the cumulative growth of the change in my change jar by 3 different measures: raw number of each kind of coin, total value as contributed by each kind of coin, and total mass contributed by each kind of coin (based on official data on how much each kind of coin should weigh).


This plot shows the proportional contribution each coin makes to each measure. The first panel shows what percent of all my coins belong to each type, the second panel shows how much each coin contributes to the over-all value proportionally, and the third how much each kind of coin contributes to the over-all mass.
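The three measures are simple to compute. Here is a base R sketch with invented counts (not my actual jar), using the standard masses of current US coins; it is not the function running on OpenCPU.

```r
# Invented counts; face values in dollars and standard masses in grams.
coins <- data.frame(
  coin   = c("penny", "nickel", "dime", "quarter"),
  count  = c(40, 10, 20, 30),
  value  = c(0.01, 0.05, 0.10, 0.25),
  mass_g = c(2.5, 5.0, 2.268, 5.67),
  stringsAsFactors = FALSE
)

coins$total_value <- coins$count * coins$value
coins$total_mass  <- coins$count * coins$mass_g

# Each coin type's share of the jar by number, by value and by mass.
shares <- data.frame(
  coin        = coins$coin,
  count_share = prop.table(coins$count),
  value_share = prop.table(coins$total_value),
  mass_share  = prop.table(coins$total_mass)
)
print(shares)
```

In a toy jar like this one, pennies account for a much larger share of the mass than of the value, which is exactly the kind of imbalance I'm curious about.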


So, depending on how long I keep this habit up, if you keep checking in on this post, you'll see new plots every day.

I have two primary motivations for logging my coins. First, last time I cashed in all my change, someone asked me how long it took me to save it up, and I had no idea! Second, I'm curious to see how much effort I'm putting into carrying around relatively heavy coins, like pennies, for their small contribution to the over-all value of my coin jar.