Val Systems: 2012

Tuesday, November 27, 2012

To take "Zombie Nouns" seriously, you must've had your brains eaten.

At first, I didn't feel like blogging about the NYT Column on "Zombie Nouns" because I feel like I've been spending too much time being critical here, arguing against usage advice like this is futile, and I knew Mark Liberman would cover it. In fact, I drafted this post all the way back during the summer, and just let it sit. But now, I've seen the column, nearly verbatim, pop up on TED-Ed as a fully animated "lesson", which presumably means some educators are actually assigning it to classrooms of fertile and impressionable minds! It really can't pass without comment now.

Helen Sword says that you should avoid using nominalizations, which she calls "zombie nouns." They're nouns that have been made out of other parts of speech. To take one of her examples, calibrate + ion = calibration.

What is so wrong about nominalizations? Not exactly clear. She seems to take aim at unnecessarily jargonistic writing, which frequently contains novel coinings of words of all types, including nominalizations. So sure, being jargonistic to obscure your other intellectual shortcomings is not so good. But is it really, actually, the mere use of nominalizations that's doing the damage there?

She also seems to take a page out of the anti-passive voice book, saying, "it fails to tell us who is doing what," which just like the passive, is just not true. For example, in the sentence

My criticism of her column is a day late and a dollar short.

It's very clear who is doing what, even though I used a nominalization (criticism).

But on top of the half-baked usage advice, there are some more reprehensible social attitudes being expressed. For example, she lists epistemology as a useful nominalization for expressing a complex idea, but heteronormativity as one that only out-of-touch academics who are enchanted by jargon use. First off, I would not want to use epistemology as an example when explaining what nominalizations are. What's it derived from? Episteme? Episteme has a Wikipedia page, so I guess it's that. Which brings me to the next issue here. It's embarrassing for me to admit, but whenever someone says or writes epistemology, I have to go look it up on Wikipedia. How does using epistemology not count as being out of touch with how ordinary people speak? Heteronormativity, on the other hand, is pretty easy to wrap your mind around. From Wikipedia:

Heteronormativity is a term to describe any of a set of lifestyle norms that hold that people fall into distinct and complementary genders (man and woman) with natural roles in life. It also holds that heterosexuality is the normal sexual orientation, and states that sexual and marital relations are most (or only) fitting between a man and a woman. Consequently, a "heteronormative" view is one that involves alignment of biological sex, sexuality, gender identity, and gender roles.

That's a pretty complex idea. But you know what? It's pretty easy to decode most of that meaning from the word itself, at least, if you're vaguely familiar with the politics of the time. Hetero(sexual) + normative + ity. It seems to me that she's saying more about her position on sex and gender politics here than she is about usage advice.

But who is this person, and why is she writing an opinion column in the New York Times, and getting the full TED treatment? Just like everyone, she's selling something: the icing on the cake, and my reason for blogging about this at all. She has a book out called The Writer's Diet, which has an accompanying online Writer's Diet Test. No, it's not diet as in "food for thought and inspiration," like a Chicken Soup for the Writer's Soul. It's diet as in dieting as in "drop 20 lbs and get the six pack abs you always wanted." Just paste a paragraph of your writing into the test, and it'll rate you along a five-point scale labeled:

lean

fit & trim

needs toning

flabby

heart attack territory

Ain't nothing like exploiting the collective dysmorphia of a nation to push your quarter-baked usage decrees. But in doing so, Sword actually clarifies the role that books like hers play. The analogy to the diet and weight loss industry is entirely apt. The dieting industry makes their money by sowing seeds of personal insecurity, then reaps their harvest with offers of unfounded, unscientific, and ultimately futile dieting pills, products, methods, 10 step plans, meals, regimes, books, magazines, etc.

I won't mince words. The NYT column and the TED-Ed video have the equivalent intellectual content of the magazines in the supermarket aisle promising you 5 super easy steps to trim your belly fat to get a sexy beach bod in time for the summer. And they serve the same purpose: to undermine the confidence of everyday folk, so that they may be taken advantage of by self-appointed gurus.

Thursday, November 15, 2012

Creative Work

Whenever I hear "creative" people describe their creative process, or more precisely their creative woes, I am always struck by the strong similarities to my own experiences trying to do science. I do consider myself as trying to do science.

Take, for example, this excellent statement on self-disappointment at the early stages of your career from Ira Glass.

Ira Glass on Storytelling from David Shiyang Liu on Vimeo.

This almost perfectly sums up how I felt about almost all of the early work I did in graduate school. I can't say that I've actually gotten to the point where the work I produce meets up with my own personal standards, but it has been on an upward trend, and I'd say Ira Glass' advice is spot on. If you want to write good papers, just write a lot of papers, and if you want to be good at giving talks, give a lot of talks, preferably in a context where you feel comfortable being bad or mediocre.

That last bit, being comfortable with being bad, is really reminiscent of things Brother Ali says in this interview.

Ill Doctrine: Brother Ali Meets the Little Hater from ANIMALNewYork.com on Vimeo.

There are a few things Brother Ali says that really resonate with me.

There was a moment where I was so stressed out. And I'm like, "Man, everything that I ever did that people liked, I just got lucky. I'm a fraud."

...

It's a weird weird thing to have what you create also be your livelihood. What we create is also our sense of self. What we create is also the way the world views us.

...

And so I start thinking about it. Ok, it's not that I'm blocked. It's not that I don't have anything to say. It's that I don't know how to say what I need to say. Or it's that I don't think that it's going to be received well. Or it's that the people that love me and have supported me and have, you know, gave me the little bit of freedom in my life that I have, I don't want to let them down and I don't want to hurt their feelings by saying what needs to be said.

I think almost all academics of any variety feel this way from time to time.

But I wonder if some people might not be surprised that I would feel so similarly to creative artists in the pursuit of my science, or might take it as evidence that what I do is not science. That is certainly doubted about Linguistics occasionally. But I think these people (probably strawmen) are mistaken in thinking that science is not a creative process. This was recognized by Max Weber in his 1918 essay "Science as a Vocation" (which I've blogged about before).

[I]nspiration plays no less a role in science than it does in the realm of art. It is a childish notion to think that a mathematician attains any scientifically valuable results by sitting at his desk with a ruler, calculating machines or other mechanical means. The mathematical imagination of a Weierstrass is naturally quite differently oriented in meaning and result than is the imagination of an artist, and differs basically in quality. But the psychological processes do not differ. Both are frenzy (in the sense of Plato's 'mania') and 'inspiration.'

He also suggests that the best science and the best art is produced by individuals devoted to the science and art for their own sake, rather than being driven by the express goal of producing something new, for the sake of novelty.

The distinction that Weber draws between art and science is that science is necessarily committed to the abandonment of old science. That is, art from the Renaissance is still, and always will be, art, but science from the same period is no longer science. It has been superseded by more recent developments.

Anyway, here's the song Brother Ali was talking about, which I'm sure almost all academics can identify with, except for the suicidal ideation, hopefully.

Wednesday, November 7, 2012

Nate Silver vs. the baseline

The 2012 election has been declared a victory for Nate Silver. As Rick Reilly said:

If Nate Silver told me it's going to rain marshmallows tomorrow, I'd stand outside with cups of hot chocolate.
— Rick Reilly (@ReillyRick) November 7, 2012

For me, as a data geek, this is nothing but good news. There's been a lot of talk about how Silver's high profile during the election could have broader effects on how everyday people think about data and prediction. There's also talk about how Silver's performance is a challenge to established punditry, as summed up in this XKCD comic.

Coming at this from the other side, though, I'm curious as a data person about how much secret sauce Silver's got. Sure, in broad qualitative strokes, he got the map right. But quantitatively, Silver's model also produced more detailed estimates about voting shares by state. How accurate were those?

Well, to start out, there is not some absolute sense of accuracy. When it comes to predicting which states would go to which candidates, it's easy to say Silver's predictions were maximally accurate. But what's trickier is to figure out how many he could have gotten wrong and still have us call his prediction accurate. For example, Ohio was a really close race. If Ohio had actually gone to Romney, but all of Silver's other predictions were right, could we call that a pretty accurate prediction? Maybe. But now let's say that he got all of the conventional battleground states right, but out of nowhere, California went for Romney. It's the same situation of getting one state wrong, but in this case it's a big state, and an anomalous outcome that Silver's model would have missed. Would his prediction be inaccurate in that case? What if it was Rhode Island instead? That would be equally anomalous, but would have a smaller impact on the final election result. Now let's imagine a different United States where all of the races in all of the states had razor thin margins, and Silver correctly predicted 30 out of 50. In that case, we might say it was an accurate prediction.

All of this is to say that the notion of "accuracy" is really dependent upon what you're comparing the prediction to, and what the goal of the prediction is.

So what I want to know is how much Silver's model improves his prediction over what's just immediately obvious from the available data. That is, I want to see how much closer Silver's prediction of the vote share in different states was than some other baseline prediction. For the baseline, I'll take the average of the most recent polls from that state, as handily provided by Nate Silver on the 538 site. I also need to compare both the averaging method and the 538 method to the actual outcomes, which I've copy-pasted from the NPR big board. (Note: I think they might still be updating the results there, so I might have to update this post at some future date with the final tally.)

First I'll look at the Root Mean Square Error for the simple average-of-polls prediction and the 538 prediction. I'll take Obama and Romney separately. The "Silver advantage" row is just the RMSE of the poll averaging prediction divided by the RMSE of the 538 prediction.

	Obama	Romney
Averaging Polls	3.3	4.1
538	1.8	1.7
Silver Advantage	1.8	2.4

So it looks like Silver has definitely got some secret sauce, effectively halving the RMSE of the stupid poll averaging prediction. I also tried out a version of the RMSE weighted by the electoral votes of each state, for a more results-oriented view of the accuracy. I just replaced the mean of the squared error by a weighted average of the squared error, weighted by the electoral votes of the state. The results come out basically the same.

	Obama	Romney
Averaging Polls	3.2	3.1
538	1.5	1.5
Silver Advantage	2.2	2.0
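For readers who want to see the arithmetic, here is a minimal Python sketch of the two error measures described above: the plain RMSE, and the version where the mean of the squared errors is replaced by an average weighted by electoral votes. The function name `rmse` and all of the numbers are made up for illustration (three hypothetical states); this is not the real polling data or the R code from the repository.

```python
import math

def rmse(predicted, actual, weights=None):
    """Root mean square error. If weights are given, the mean of the
    squared errors is replaced by a weighted average of them."""
    if weights is None:
        weights = [1.0] * len(actual)
    weighted_sq_err = sum(
        w * (p - a) ** 2 for p, a, w in zip(predicted, actual, weights)
    )
    return math.sqrt(weighted_sq_err / sum(weights))

# Hypothetical vote shares (percent) for three invented states.
actual = [52.0, 47.0, 60.0]
poll_average = [49.0, 45.0, 56.0]   # baseline prediction
forecast = [51.0, 46.0, 59.0]       # model prediction
electoral_votes = [10, 20, 5]       # hypothetical weights

baseline_rmse = rmse(poll_average, actual)
forecast_rmse = rmse(forecast, actual)

# The "advantage" is the baseline error divided by the model error.
advantage = baseline_rmse / forecast_rmse

# Same comparison, weighting each state's squared error by electoral votes.
weighted_baseline = rmse(poll_average, actual, electoral_votes)
weighted_forecast = rmse(forecast, actual, electoral_votes)
weighted_advantage = weighted_baseline / weighted_forecast

print(round(advantage, 2), round(weighted_advantage, 2))
```

An advantage above 1 means the model's error is smaller than the baseline's; a value around 2 corresponds to halving the error.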

So what was it about the 538 forecast that made it so much better than simply averaging polls? I think these plots might help answer that. They both plot the error in the 538 forecast against the error in poll averaging.

It looks like for both Obama and Romney, the 538 forecast did more to boost up the prediction in places where they outperformed their polls than to tamp it down where they underperformed. The effect is especially striking for Romney.

So, Silver's model definitely outperforms simple poll watching & averaging. Which is good for him, because it means he's actually doing something to earn his keep.

You can grab the data and R code I was working with at this github repository. There's also this version of the R code on RPubs.

Friday, July 27, 2012

Teens and Texting and Grammar

I'm just one man, one linguist, impotently shouting into the vast mediascape, "PLEASE POPULAR MEDIA! PLEASE DON'T RUN WITH THE TEEN TEXTING GRAMMAR STORY!"

There is a paper out in New Media and Society called Texting, techspeak, and tweens: The relationship between text messaging and English grammar skills. If you are a linguist, and you winced at the title, I have to warn you, you're not done wincing yet.

Is the key problem that the authors collected data on text messaging behaviors from self reports? No.

Is the key problem that the authors did not directly assess whether or not the teens in the study used "techspeak"? No. (Let's set aside the fact that high volumes of txtspeak are increasingly associated with out-of-touch adults.)

Is the key problem that the authors didn't include any figures plotting the relationship between any of their measures? No.

Is the key problem that the authors included no control group of teens who don't text, or adults who adopted texting late in life? No.

The key problem is that the authors appear to have no idea what grammar or language are. I quote:

Similar to synchronous online communications such as instant messaging, the speed, ease, and brevity of text messaging have created a perfect platform for adapting the English language to better suit attributes of the technology. This has led to an evolution in grammar, the basis of which we shall call ‘techspeak.’ This language differs from English in that it takes normal English words and modifies them [...]

The depth of misunderstanding and naiveté present in this quote about the relationship between actual language and grammar and the way we write is equivalent to thinking that the sun revolves around the Earth, and that stars are bright dots on a large dome in the sky. Mind you, the Earth-centric, skydome model of the universe is a perfectly reasonable one until you are exposed to the most basic, rudimentary scientific understanding of how the world works.

The authors of this paper appear not to have been exposed to the most basic, rudimentary scientific understanding about how language and grammar work.

From Appendix A of the paper, I present to you the 20 point "grammar" assessment used in the study.

There (is, are) two ways to make enemies.
One of the men forgot to bring (his, their) tools.
Gail and Sue (make, makes) friends easily.
The coach thought he had (tore, teared, torn) a ligament.
During the flood, we (dranked, drank, drunk, drunked) bottled water.
The boy called for help, and I (swum, have swam, swam) out to him.
Fortunately, Jim’s name was (accepted, excepted) from the roster of those who would have to clean bathrooms because he was supposed to go downtown to (accept, except) a reward for the German Club.
I don’t know how I could (lose, loose) such a big dress. It is so large that it is (lose, loose) on me when I wear it!
The man around the corner from the sandlots (come, comes) to our meetings.
The man and his little girls (was, were) not injured in the accident.
The pictures in this new magazine (shows, show) the rugged beauty of the West.
The orders from that company (is, are) on your desk there.
The (boys, boys’, boy’s, boys’s) hats were lost in the water because they were careless in not tying them to the side of the boat.
(Its, It’s, Its’) an honor to accept the awards certificates and medals presented to the club.
Worried, and frayed, the old man paced the floor waiting for his daughter. (Correct/Incorrect)
The boy yelled, ‘Please help me’! (Correct/Incorrect)
She got out of the car, waved hello, and walked into the house. (Correct/Incorrect)
When Suzie arrived at the dance, no one else was there. (Correct/Incorrect)
Dad and I enjoyed our trip to new york city. (Correct/Incorrect)
The boy’s mother picked him up from school. (Correct/Incorrect)

To quote what it was the authors were trying to assess:

The first portion of the assessment consisted of 16 questions designed to test the student’s grasp of verb/noun agreement, use of correct tense, homophones, possessives, and apostrophes. [...] The second portion of the assessment asked participants to indicate whether or not a sentence was correct, such as ‘The boy yelled, “Please help me”!’ (Correct/ Incorrect). This portion tested the student’s understanding of comma usage, punctuation, and capitalization.

Virtually none of these points (homophones, apostrophes, comma usage, punctuation and capitalization) fall under the purview of what is scientifically understood to be "grammar". Arnold Zwicky has suggested the term "garmmra" for such things. Punctuation, comma rules, spelling conventions, etc. are all only arbitrary decisions settled upon a long time ago, and have nothing, nothing to do with human language. You could, by fiat, swap periods and commas (like many cultures do with their numeral systems), insist that sentence initial adverbs be followed by a semicolon, and decide to revert back to the symbols <þ> and <ð> to spell the sounds we currently both spell with <th>, and you know how many things that would change about English grammar? Zero things.

The remaining points of assessment could be considered to be well within the domain of grammar (tense and subject/verb agreement), except the authors chose really poor, very variable items for the evaluation. The very first item involves verbal agreement with an expletive subject, and the rest involve cases of coordination, and agreement attraction! These are items which really lie on the outside edges of linguistic processing abilities, and there is no way that they could serve as reliable measures of fluency and grammatical competence. Search the work of any good writer, and I'm sure you'll find examples of both kinds of usage.

And then there's the second item: "One of the men forgot to bring (his, their) tools." Both possibilities are acceptable English, and have been for a long time.

The most depressing thing about this grammar assessment is where the researchers say they got it.

This assessment was adapted from a ninth-grade grammar review test.

I'm reminded of a piece I read called For Ebonics, the New Milennium Is Pretty Much Like the Old One, which said: "This suggests to me a catastrophic failure of the public school 'language arts' curriculum: people spend years in various language arts classes and leave with the same 19th-century folk notions that they started with."

So what have these authors actually found? Well, maybe it's the case that the more people write in a broader range of contexts for a broader range of purposes, the more the arbitrary, conventionalized aspects of the writing system of English will undergo natural drift. What effect will this have on English grammar, as it is represented in the minds of everyday English users? Probably just as much as the current writing system does: a minimal one.

And what about my plea to the popular media? Even if someone of note finds this post and reads it, I already know that it won't matter at all. Per my commentary on the coverage on vocal fry, no one is going to report on this piece because they care about science or facts. This research fits snugly into pre-existing biases about young people and the general decline of society, and frankly, these biases seem to have more to do with why these researchers did the study in the first place than science or facts. And there is no way that something so trivial as a bunch of experts on language and grammar is about to derail this train of garbage and nonsense.

UPDATE! There is, in fact, an actual paper on the topic of Instant Messaging and Grammar by Sali Tagliamonte and Derek Denis from 2008 called "Linguistic Ruin? LOL! Instant Messaging and Teen Language." Remember hearing about that in the news? Here are selections from their conclusions.

In a million and a half words of IM discourse among 71 teenagers, the use of short forms, abbreviations, and emotional language is infinitesimally small, less than 3% of the data.

Our foray into the IM environment through quantitative sociolinguistic analysis, encompassing four areas of grammar and over 20,000 individual examples, reveals that IM is firmly rooted in the model of the extant language, reflecting the same structured heterogeneity (variation) and the same dynamic, ongoing processes of linguistic change that are currently under way in the speech community in which the teenagers live.

UPDATE! See also Enregistering internet language by Lauren Squires (2010)

Wednesday, July 25, 2012

Don't worry, I'm a physicist.

Today, I came across a science news item from ABC (the Australian Broadcasting Corporation) with the title "Study opens book on English evolution." Oh goodness. Here are the opening paragraphs:

A study of 500 years of the English language has confirmed that 'the', 'of' and 'and' are the most frequently printed words in the modern era.

The study, by Slovenian physicist Matjaz Perc, also found the top dozen phrases most-printed in books include "at the end of the", "as a result of the" or "on the part of the".

That sound you hear is the stunned silence of linguists everywhere over the fact that you can get into the science news with the primary result that "'the' is the most common English word."

But to be fair, what the author was trying to argue is that the Zipfian distribution of word frequencies is a result of "preferential attachment," where frequent words get more frequent. He tried to demonstrate this by showing that the frequency of a word in a given year is predictive of its frequency in the future, specifically that relatively high frequency words will be even more frequent in the future. The key result is shown in Figure 4 in the paper, available here.
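To make the "frequent words get more frequent" idea concrete, here is a minimal Python sketch of a generic rich-get-richer process. It is a simple copying model in the style of Simon's classic model, swapped in purely as an illustration; it is not the analysis in Perc's paper. The function name, the parameter values, and the use of integer ids as "words" are all assumptions made for the example.

```python
import random
from collections import Counter

def simulate_rich_get_richer(n_tokens, new_word_prob, seed=0):
    """Grow a token sequence one token at a time. With probability
    new_word_prob the next token is a brand new word; otherwise it is a
    copy of a uniformly chosen earlier token, so each existing word is
    picked in proportion to its current count."""
    rng = random.Random(seed)
    tokens = [0]      # start with a single token of word 0
    next_word_id = 1
    while len(tokens) < n_tokens:
        if rng.random() < new_word_prob:
            tokens.append(next_word_id)
            next_word_id += 1
        else:
            tokens.append(rng.choice(tokens))
    return Counter(tokens)

counts = simulate_rich_get_richer(n_tokens=10_000, new_word_prob=0.1)

# Print rank, word id, and count for the ten most frequent "words".
for rank, (word_id, count) in enumerate(counts.most_common(10), start=1):
    print(rank, word_id, count)
```

Plotting the log of the rank against the log of the count for the full output is the usual way to check whether a process like this gives a roughly Zipfian, straight-line shape.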

Say what?

While that quantitative result may stand, the fact that Perc is a physicist probably contributed to some really bananas statements about language. In the first paragraph, he almost completely conflates human language and written language as being the same thing, and erases the validity and richness of cultures with unwritten languages.

Were it not for books, periodicals and other publications, we would hardly be able to continuously elaborate over what is handed over by previous generations, and, consequently, the diversity and efficiency of our products would be much lower than it is today. Indeed, it seems like the importance of the written word for where we stand today as a species cannot be overstated.

He also presents some results of English "coming of age" and reaching "greater maturity" around 1800 AD (Figure 3). Finally! It only took us like, what, a thousand years or so?

The discussion section kicks off with the statement

The question ‘Which are the most common words and phrases of the English language?’ alone has a certain appeal [...]

That may be true for physicists, but for people who are dedicated to studying language (what are they called again?) not so much. Fortunately, his ignorance of linguistics is actually a positive quality of this research!

On the other hand, writing about the evolution of a language without considering grammar or syntax, or even without being sure that all the considered words and phrases actually have a meaning, may appear prohibitive to many outside the physics community. Yet, it is precisely this detachment from detail and the sheer scale of the analysis that enables the observation of universal laws that govern the large-scale organization of the written word.

See, linguists are just too caught up in the details to see the big picture! Fire a linguist and your productivity goes up, amirite?

For real though?

But back to the substantive claim of the paper. Is the Zipfian distribution of words due to the rich getting richer? That is, are words like snowballs rolling down a hill? The larger they are, the more additional snow they pick up, and the larger still they get. Maybe, but maybe not.

Here's a little experiment that I was told about by Charles Yang, who read about it in a paper by Chomsky that I don't know the reference to. Right now, we're defining "words" as being all the characters between white spaces. But what if we redefined "words" as being all the characters between some other kind of delimiter? The example Charles used was "e". If we treat the character "e" as being the delimiter between words, and we apply this to a large corpus, we'll get back "words" like " " and " th" and less frequently "d and was not paralyz". What kind of distribution do these kinds of "words" have?

Well, I coded up this experiment (available here: https://github.com/JoFrhwld/zipf_by_vowels) where I compare the ordinary segmentation of the Brown corpus into words by using white spaces to segmentations using "a", "e", "i", "o" and "u." Here's the resulting log-log plot of the frequencies and ranks of the segmentations.

It all looks quite Zipfian. So are not only the characters between spaces, but also the characters between any arbitrary delimiters, subject to a rich-get-richer process? Keep in mind that the definition of "word" as being characters between spaces is relatable to representations in human cognition, while the definition of "word" as characters between arbitrary delimiters is not, especially not with English's occasionally idiosyncratic orthography.

Maybe it's possible for the results of my little experiment to be parasitic on a larger rich-get-richer process operating over normal words, but for now I'm dubious.
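For anyone who wants to try the delimiter experiment without the full setup, here is a minimal Python sketch of the segmentation and counting step. It uses a short placeholder string rather than the Brown corpus, the function name `rank_frequency` is made up for this example, and dropping empty chunks (which appear when two delimiters are adjacent) is an assumption, not necessarily what the code in the repository does.

```python
from collections import Counter

def rank_frequency(text, delimiter=None):
    """Split text on a delimiter (None means runs of whitespace) and
    return (rank, count, chunk) rows, most frequent chunk first."""
    chunks = text.split(delimiter)
    # Assumption: discard empty chunks produced by adjacent delimiters.
    counts = Counter(chunk for chunk in chunks if chunk != "")
    return [
        (rank, count, chunk)
        for rank, (chunk, count) in enumerate(counts.most_common(), start=1)
    ]

# Placeholder text; a real run would read in a large corpus instead.
sample = "the cat saw the dog and the dog saw the cat"

by_space = rank_frequency(sample)        # ordinary "words"
by_e = rank_frequency(sample, "e")       # "words" delimited by the letter e

for rank, count, chunk in by_space:
    print(rank, count, repr(chunk))
for rank, count, chunk in by_e:
    print(rank, count, repr(chunk))
```

With a real corpus, plotting the log of the rank against the log of the count for each delimiter gives the kind of comparison described above.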

Tuesday, July 10, 2012

Visualizing Graphical Models

I'm anticipating presenting research of mine based on Bayesian graphical models to an audience that might not be familiar with them. When presenting ordinary regression results, there's already the sort of statistical sniper questions along the lines of "What if the effect is actually being driven by this other correlate?" and "That effect might result from assumptions a, b, and c of the test." etc. Sometimes these questions are useful, but sometimes they seem to detract from the substantive issues at hand. And frequently, I see talks get way too bogged down in anticipating questions like this by cramming too much statistical detail into their talk, leaving not enough time to do justice to the theoretical importance of their results.

Add to this the customizability of graphical models, the number of possible distributions and parameter settings, and the notion that "Bayesian" = "subjective", and I'm really feeling stressed out by the presentational task ahead of me.

So, I'm trying to figure out a good way to make the model I've built fully available and accessible to someone who can't read JAGS code, with a little bit of presentational pizzazz, while also allowing me to focus in on the parameters of specific interest. I started off trying to use Graphviz to produce directed graphs, and wound up with this (an actual level in the model I'm hoping to present).

It's all a ton of spaghetti, it's difficult to highlight the particular parameters of interest, and it doesn't represent some important distinctions (like stochastic and deterministic nodes).

I've moved on from Graphviz to trying to build an interactive tree diagram using the JavaScript InfoVis Toolkit. It's been kind of slow going, since I don't know any JavaScript, and am still trying to sort out which functions are basic and which ones are defined by the toolkit. Click on the image below to visit the visualization.

It's getting there, but I'm not convinced yet that it'll do the job of making the whole model digestible. For one, I'm modeling effects at a few different levels. The token level is represented in this visualization, but I'm also looking at speaker level effects, treating the linguistic context as a within speaker variable, and at word level effects. The way I'm setting things up now, that's going to call for two more trees like this one.

Maybe the lesson here is that I should just fit and present a simpler model, but remember those sniper questions? I'm worried that if I leave out someone's favorite correlate, 1) I'll have to deal with it in the questions and 2) they'll leave unconvinced, or rather, they'll leave convinced that it was their favorite correlate doing the work all along. But these are really research anxieties that no visualization toolkit on earth could assuage.

Sunday, July 8, 2012

On "Welcome to the Internet"

I interrupt the regularly scheduled (linguistics/data/stats) programming to bring you a special message about a topic which has been really bothering me. This blog is my primary venue for writing publicly about anything, so even though Anita Sarkeesian's project on Tropes vs. Women in Video Games doesn't fit into any of my usual topics, I'm going to write about it here.

I think most people will have heard about what's going on here. Anita Sarkeesian puts together an excellent video series called Feminist Frequency which offers accessible feminist critiques of movies, TV shows, etc. She set up a Kickstarter project to help fund research and production of a new video series called Tropes vs. Women in Video Games. The project was a great success, raising over 26x the original goal, but the backlash from people on the internet has been really vile. You can look over a summary of links about the issue here.

I'm not writing about how vile I think the backlash is. Instead I'm writing about how much some people's reactions to the backlash have bothered me. I've read some of these online, and had them come up in conversation. They fall into a few categories.

"We can disagree without being disagreeable"

I have not heard one person in a respectable forum defend the backlash against Sarkeesian. However, I have heard a lot of "you might disagree with what she says, but you can do so in a civil manner." But at this moment, nobody can disagree with what Sarkeesian says, because she has not, in fact, said it yet. The whole backlash is not against what she said about misogyny in video games, but rather against her stated intention to say anything about misogyny in video games. What we are looking at is simply unvarnished hatred, and its exponents cannot make pretensions to having intellectual differences of opinion. That would require careful consideration of Sarkeesian's points, which again, is impossible, because she hasn't even had the opportunity to put them forward yet.

"Welcome to the internet"

I've heard more than one person say "welcome to the internet" about the harassment Sarkeesian is experiencing. As if what is happening to her just happens to everybody. A porn bot following you on twitter is a "welcome to the internet" moment. A spam comment on your blog including links to purportedly cheap viagra is a "welcome to the internet" moment. What we're observing with this backlash is not a "welcome to the internet" moment.

Even if we limit the discussion to the trolling comments on her blog and YouTube pages, the magnitude and intensity of the comments are already far beyond the average person's experience. And as Jay Smooth pointed out, it's also the case that members of marginalized groups tend to have a much worse experience with trolling like this. So this isn't just your plain vanilla internet, it's one that is especially bad for people who are already marginalized IRL.

But we can't really limit the discussion to high volume trollish comments. We have to also bring in the vandalism of her Wikipedia page, which included adding a lot of porn. We also have to bring in the meme-ification of her image with the goal of specifically attacking her in specifically sexual ways. We need to bring in the fact that people are sending her explicit threats of rape and violence. And we also need to bring in the creation of a flash game that invited the player to beat Sarkeesian's face in. This last one is especially disturbing to me, because I've been reading a lot of guys talking about how much they want to hit her. To quote YouTuber MundaneMatt (linked here just to provide substantiating evidence, I wouldn't advise visiting it):

She's got those eyes that make you just want to punch her in the face.

And to quote a user's review on Destructoid of the flash game (I'm not even linking to it this time):

The voice acting isn’t the best at riling up the player, especially as her videos do this quickly anyway.

We are far far outside the realm of "welcome to the internet" and deep into the very dark, very real topic of silencing women with rape and violence.

And of course, there's the internet vigilantism. Her site has been DDoS-ed, there have been attempted hacks of her e-mail and various social networks, and she's been dox-ed (her personal address and telephone number posted online). This is the kind of treatment reserved for people dubbed villains by the internet. It is more than atypical, it is specifically reserved for the worst of the worst. By no means is it "welcome to the internet." And what did she do worthy of being treated like such a villain?

I think it is justified, given the evidence, to say that what is happening to Anita Sarkeesian is uniquely bad, and it is happening to her because she is a woman.

The Mos Eisley Gambit

Closely related to "welcome to the internet" is the Mos Eisley Gambit, which is simply stating that on the internet at large (and in YouTube comments specifically) "you will never find a more wretched hive of scum and villainy." This is more and more easily believable the more you read about the Sarkeesian backlash.

But, I'm sorry, don't a lot of the same people who deploy the Mos Eisley Gambit also have a lot to say about how the internet is the future of free and open discourse? Wasn't there a whole collective kumbaya moment just a few months ago where "the internet defeated SOPA"? Wasn't the whole SOPA thing a backlash against the possibility of government censorship? Isn't the goal of the backlash against Sarkeesian to censor her? You can't have it both ways. You can't go around hailing the internet as a revolutionary space for free communication (a human right even) that must be protected at all costs, and be so flip about what's happening to Sarkeesian.

And what's more, the residents of this hive of scum and villainy don't actually live in the internet. The trolls, vandals and harassers are not internet pixies, they are real actual people. The images of Sarkeesian's likeness being raped by video game characters didn't just pop into existence of their own accord. A person, someone's next door neighbor, son, brother, sat down and spent time drawing the damn thing, and e-mailed it to her. The hive of scum and villainy is actually the real world we're all living in, and it's just reflected in the internet. Trolls are people too, and that's exactly the problem. You don't get away from the racist YouTube commenters by going outside, you ride the bus with them. Which is why, I think, hateful trolling is a worthwhile thing to worry about. It's not just about silly things that happen on the internet. It's about the attitudes and actions of real people who we all interact with every day.

Wednesday, July 4, 2012

Question: Work on -ly-less adverbs

I think I'm going to ask general information gathering questions that I have about linguistics research here on my blog, rather than as Facebook or Twitter posts. Then, I can add the answers I get back to the post.

What research is there on -ly-less adverbs? I think the most common one that comes up is "personal," as in

Don't take it personal.

Here are two more real life examples (the second one I heard just today, hence the question):

I go to South Jersey occasional.
I need a cigarette desperate.

I have some vague intuitions about restrictions on the -ly-less forms. Specifically, I think they're only possible post-verbally, so

*I personal took it.

And I doubt we'd ever see it with a sentential adverb, like

*Hopeful, we'll find an answer.
*We'll find an answer, hopeful.

But then, I don't really trust my intuitions, because I would have also rejected the "occasional" and "desperate" sentences above, which I heard come out of real people's mouths.

So, anyone know of any research on the topic?

Update
People came through for me! First, Mercedes Durham pointed me in the right direction on Twitter.

@JoFrhwld Hoping I'm not the 10th to say this: Tagliamonte and Ito in JSoc 2002 looked at it in York UK + Opdahl's 2 books r mentioned there
— Mercedes D (@drswissmiss) July 5, 2012

The Tagliamonte and Ito paper provides a great introduction to the topic of -ly~ø variation in adverbs. First, in the long view of history, the -ly adverbs are the innovation creeping in, not the zero forms. Here's how I understand it worked. There used to be a morpheme -lic which was used to create adjectives from nouns.

friend + lic
man + lic

And there was a separate morpheme -e that created adverbs from adjectives.

direct + e
open + e

Sometimes you'd get them stacking on top of each other

friend + lic + e
man + lic + e

And sometimes you'd wind up with the -lic+e morphemes coming together and behaving like one morpheme that turns adjectives into adverbs.

sweet + lice

This part sounds similar to a more modern situation. We have a morpheme -ate that turns nouns into verbs.

assassin + ate

And a morpheme -ion that turns verbs into nouns, which sometimes stacks on top of -ate.

delete + ion
assassin + ate + ion

But sometimes, we get -ation coming together and acting like one morpheme that turns verbs into nouns.

cause + ation (*causate)

Anyway, back to Old English. At some point the little -e morpheme that turned adjectives into adverbs got lost (probably as part of a larger language change that dropped a lot of word final unstressed e's). At that point, adjectives and derived adverbs just all sounded the same. That is, derived adverbs were all zero forms. But then, the fused form -lice started being used to make adverbs in more places than it used to be, and it eventually changed in pronunciation to modern day -ly.

On these historical issues, a lot of ink has been spilled including a whole two volume series on just this case of variation in adverb formation, and a few book chapters.

Tagliamonte & Ito also provide a lot of cool examples from other studies, like these ones from Appalachian and Ozark English (Christian, Wolfram & Dube, 1988).

I come from Virginia original.
It certain was some reason.

Their own study was on a large corpus of speech from York, England. After treating really separately (they argued the patterns in really had more to do with its use as a special intensifier and less to do with adverb formation), they found basically no age effects, but working class men strongly favored the zero form compared to everyone else.

As for language internal effects, they completely excluded preverbal adverbs as being invariantly -ly forms (per my intuition, but not per that one example above from Appalachian English). After that, they found that the concreteness of the verb had the strongest effect, with concrete verbs favoring the zero form a lot more than abstract verbs.

I noticed that both the examples that I felt were interesting enough to take a mental note of above involve abstract verbs + zero form adverbs. Maybe the fact that abstract verbs disfavor zero forms is what made them jump out at me.

Allison Shapp pointed me to work she's doing on -ly~ø variation in American English, and specifically (if I understood the poster right) African American English. They've found a big effect of education, where higher education favors the -ly form, and that African American speakers, who are likely to be speakers of African American English, favor the zero form.

So! That was a fruitful information gathering adventure! This is a really cool variable!

Wednesday, June 20, 2012

Have you been in a Wawa's?

I would like to pre-empt all discussion here by saying that this blog post is strictly motivated by linguistics, and has no relevance to the US presidential election.

This video of Mitt Romney speaking about his experience in a Wawa has been floating around my newsfeed this week.

Full Disclosure: Wawa is my convenience market of choice.

What strikes me most about this video is the fact that Romney repeatedly said "Wawa's" even though the name of the store is just "Wawa" with no "s." First, there's the linguistic issue of why this was such a natural mistake for Romney to make. Second, there's the sociolinguistic issue about why this particular mistake seems so egregious.

On the first point, there is clearly a strong tendency for store names to be formed in the possessive, indicating their ownership (or at least that's the origin). For example, "Macy's" was founded by Rowland Hussey Macy, "Wanamaker's" was founded by John Wanamaker, etc. However, not all stores which have names clearly formed in the genitive follow the ordinary orthographic rules for possessives. For example, "Starbucks" is named after the Moby Dick character Starbuck, but the official name doesn't have an apostrophe. Similarly, JCPenney, which today isn't formed in the genitive, used to go by "Penneys" according to this logo from Wikipedia, also lacking the apostrophe.

Perhaps this is some kind of specialized "commercial genitive," I don't know.

At the same time, there are a lot of store names which are not formed in the genitive for some reason or another. One example someone brought up to me is "Nordstrom," which has no "s" even though it was founded by John Nordstrom. It's a little mysterious to me why this might be, except its original name was "Wallin & Nordstrom" (as in Carl Wallin and John Nordstrom), and coordination structures wreak havoc on everything. A similar story could be told for "Barnes & Noble." In fact, the one kind of business that I know to be named after coordinated personal names are law firms (like "Dewey, Cheatum & Howe") which seem to never be formed in the genitive. There's also this blog post from Linguism which discusses the question of which store names get formed in the genitive and which don't, and he concludes that store names which are originally acronyms, like "Asda" and "Tesco" and foreign imports, like "Aldi" are less likely to be in the genitive.

At the same time, there is also a lot of asymmetric variation. People seem to be likely to form an officially non-genitive store name in the genitive, but not vice versa. How many of you would blink if someone said "I went shopping at Aldi's."? But no one would say "I went into a Starbuck."

Update: Ben Zimmer informed me on Twitter that the "Friendly Ice Cream" company officially changed their name to "Friendly's" perhaps because that's what all their customers called them anyway.

@GPHemsley @JoFrhwld @LiteralMinded Friendly changed to Friendly's and Church Chicken to Church's to match common customer (mis)perceptions.
— Ben Zimmer (@bgzimmer) June 22, 2012

@JoFrhwld @gphemsley @literalminded Hmm, I may be wrong about Church's Chicken: bit.ly/aPks92 But Friendly -> Friendly's is legit.
— Ben Zimmer (@bgzimmer) June 22, 2012

According to this slideshow from the Boston Globe, the name change happened in 1989.

The point of all this is that Romney was wading into very muddy linguistic waters when he started talking about Wawa, and it's not surprising he screwed it up.

Which brings us to the second point: Why was saying "Wawa's" such a big deal? I just said that I wouldn't blink an eye if someone said "Aldi's" and that's basically the same kind of error. But, and I'm trying to speak here as a Philadelphian and Wawa devotee, not as a partisan hack, when I heard him say "Wawa's" my reaction was "Oh, he doesn't know how it works."

In some ways, my reaction was similar to how I feel when someone screws up the correct use of determiners in proper names. For example, if someone said to me "I looked it up on the Wikipedia," I'd immediately know they were uninitiated to the internet. Similarly, if someone said "they were uninitiated to Internet," I'd immediately know they were hopelessly ignorant.

I think what it comes down to is that where there is variation, there is complexity, and where there is complexity, the ability to successfully navigate complexity the right way is an important social signal that you are the right kind of person. Consider, for example, the needlessly complex language surrounding Twitter, and the communal paroxysm of self satisfaction when a politician says "I sent out a twitter to my followers," or refers to the service as "Tweeter."

I don't think that reaction, or the reaction to Romney saying "Wawa's," is fundamentally different from the dirty word in linguistics: prescriptivism. A lot of prescriptivism is specific discrimination against politically, economically and socially marginalized people, but a lot of it also comes out of nowhere, and just turns into a really complex game that people play for the sake of showing they can play it. So be cautious, fellow linguists, because today's "Wawa's" and "Tweeters" are tomorrow's split infinitives and passive voice.

Monday, June 18, 2012

Overplotting solution for black-and-white graphics

I'm working on producing some black and white graphics of data which has a lot of overplotting. There are three basic groups, which if I made the plot in ordinary full color ggplot2 would look like this (the code for the reverse-log x-axis is available in this gist, and the code for stat_ellipse() is available in this github repository).

For a black and white image, however, it's trickier. I don't usually find grey color scales to be sufficiently different for a plot like this, so I'd go for different point shapes. Unfortunately, the default shape scale in ggplot2 isn't very distinct in this case.

My first strategy to improve things was to add a custom shape scale, with alternating empty vs solid point shapes.

Better, but not great. All the overplotting of the empty point shapes creates this awful indiscriminate mash in the middle of the clusters.

My solution to this problem was to use filled points. While point shapes 1 and 5 in R correspond to an empty circle and an empty diamond, respectively, point shapes 21 and 23 correspond to a filled circle and a filled diamond, respectively, where the fill color and the border color can be different. So, I used shapes 21 and 23 instead of 1 and 5, and set the fill color to be white.

I think it's a big improvement. Here's one more iteration, filling the points with a light grey shade instead of white, just for some aesthetic appeal.
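The filled-shape trick can be sketched roughly as follows. This is not the code behind the figures in this post: the data frame `toy`, its three groups, and the choice of shape 16 for the solid middle group are all made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data: three heavily overlapping clusters
# (not the data plotted in this post).
set.seed(1)
toy <- data.frame(
  x = c(rnorm(200, 0), rnorm(200, 0.5), rnorm(200, 1)),
  y = c(rnorm(200, 0), rnorm(200, 0.5), rnorm(200, 1)),
  group = rep(c("a", "b", "c"), each = 200)
)

# Shapes 21 (circle) and 23 (diamond) have a border colour and a separate
# fill colour; shape 16 is a solid circle and ignores the fill.
p <- ggplot(toy, aes(x, y, shape = group)) +
  geom_point(colour = "black", fill = "grey90") +
  scale_shape_manual(values = c(a = 21, b = 16, c = 23)) +
  theme_bw()

# print(p) draws the plot on the current graphics device
```

Swapping `fill = "grey90"` for `fill = "white"` gives the plain white-filled version. Because each filled point paints over whatever is underneath it, the outlines of overlapping points no longer pile up into a mash.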

Thursday, May 17, 2012

On calculating exponents

In my post on the decline effect in linguistics, the question came up of how I've calculated the exponents for the Exponential Model in my papers. I think this is a point worth clarifying, but it's not likely to be interesting to a broad audience. You have been forewarned.

To recap as briefly as possible, in English, when a word ends in a consonant cluster, which also ends in a /t/ or a /d/, sometimes that /t/ or /d/ is deleted. This deletion can affect a whole host of different words, but the ones which have been of most interest to the field are the regular past tense (e.g., packed), the semiweak past tense (e.g., kept) and morphologically simplex words (e.g., pact), which I'll call mono. Other morphological cases which can be affected, and which I believe have occasionally and erroneously been categorized with the semiweak, are no-change past tense (e.g., cost), "devoicing" (or something) past tense (e.g., built), stem changing past tense (e.g., found), etc. For the sake of this post, I'm only looking at the main three cases: past, semiweak, and mono.

Now, Guy (1991) came up with a specific proposal where if you described the proportion of pronounced /t d/ for past as p, for semiweak as p^j and for mono as p^k, then j = 2 and k = 3. It is specifically whether or not j = 2 and k = 3 that I'm interested in here. If you've calculated the proportions of pronounced /t d/ for each grammatical class, you can calculate j as log(semiweak) / log(past) and k as log(mono) / log(past). The trick is in how you decide to calculate those proportions.
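As a quick check of that arithmetic, here is a minimal sketch. The three proportions are hypothetical values, picked so that the Exponential Model holds exactly; they are not estimates from any data set.

```r
# Hypothetical retention proportions, chosen so that the model holds exactly.
past     <- 0.8
semiweak <- 0.8^2  # p^j with j = 2
mono     <- 0.8^3  # p^k with k = 3

# Recover the exponents from the proportions.
j <- log(semiweak) / log(past)
k <- log(mono) / log(past)
c(j = j, k = k)
```

This works because log(p^j) = j * log(p), so dividing by log(p) leaves just the exponent.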

For this post, you can play along at home. Here's code to get set up. It'll load the Buckeye data I've been using, and do some data prep.

So, how do you calculate the rate at which /t d/ are pronounced at the end of the word when you have a big data set from many different speakers? Traditional practice within sociolinguistics has been to just pool all of the observations from each grammatical class across all speakers.

So you come out with j = 1.91, k = 3.1, which is a pretty good fit to the proposal of Guy (1991).

The problem is that this isn't really the best way to calculate proportions like this. There are some words which are super frequent, and they therefore get more "votes" in the proportion of their grammatical class. And, some speakers talk more than others, and they get more "votes" towards making the over-all proportions look more similar to their own. One approach to ameliorate this is to first calculate the proportion for each word within a grammatical class within a speaker, then for each grammatical class within a speaker, then within a grammatical class. Here's the code for this nested proportion approach.

All of a sudden, we're down to j = 1.34 and k = 2.05, and I haven't even dipped into mixed-effects models black magic yet.
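The difference between the two ways of calculating proportions can be sketched on a tiny made-up data set. This is not the Buckeye data and not the code used for the numbers in this post; the data frame `toy` and its column names are hypothetical.

```r
# Hypothetical toy tokens: one row per token,
# retained = 1 if the final /t d/ was pronounced.
toy <- data.frame(
  speaker  = c("s1", "s1", "s1", "s1", "s1", "s1", "s2", "s2"),
  gram     = c("past", "past", "past", "past", "mono", "mono", "past", "mono"),
  word     = c("packed", "packed", "packed", "missed", "pact", "pact", "missed", "fact"),
  retained = c(1, 1, 1, 0, 1, 0, 1, 0)
)

# Pooled: every token gets one "vote" within its grammatical class.
pooled <- aggregate(retained ~ gram, data = toy, FUN = mean)

# Nested: average within word (per speaker and class), then within speaker,
# then within grammatical class.
by_word    <- aggregate(retained ~ gram + speaker + word, data = toy, FUN = mean)
by_speaker <- aggregate(retained ~ gram + speaker, data = by_word, FUN = mean)
nested     <- aggregate(retained ~ gram, data = by_speaker, FUN = mean)

pooled
nested
```

In the toy data, speaker s1 says "packed" three times, so the pooled past proportion is pulled toward that one word, while the nested version counts "packed" once for that speaker.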

But when it comes to modeling the proposal of Guy (1991), calculating the proportions is really just a means to an end. I asked Cross Validated how to directly model j and k, and apparently you can do so using a complementary log-log link. So here is the mixed effects model for j and k directly.

The model estimates look very similar to the nested proportions approach, j = 1.38, k = 2.11.
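To see why a complementary log-log link gets at j and k directly, here is a minimal sketch using a flat model with no random effects, so it is a simplification of the mixed effects model above and not the same thing. The counts are hypothetical, chosen so that retention is exactly 0.8, 0.8^2 and 0.8^3.

```r
# Hypothetical pooled counts out of 1000 tokens per grammatical class.
toy <- data.frame(
  gram     = factor(c("past", "semiweak", "mono"),
                    levels = c("past", "semiweak", "mono")),
  retained = c(800, 640, 512),
  deleted  = c(200, 360, 488)
)

# Model deletion with a cloglog link:
#   cloglog(P(deleted)) = log(-log(1 - P(deleted))) = log(-log(P(retained)))
# Since -log(p^j) = j * -log(p), the semiweak coefficient is log(j)
# and the mono coefficient is log(k), with past as the reference level.
fit <- glm(cbind(deleted, retained) ~ gram,
           family = binomial(link = "cloglog"), data = toy)

exp(coef(fit)[c("gramsemiweak", "grammono")])
```

Exponentiating the two class coefficients gives the estimates of j and k, with no need to calculate the proportions as a separate step.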

What if we fit the model without the by-word random intercepts?

Now we're a bit closer back to the original pooled proportions estimates, j = 1.57, k = 3.19.

My personal conclusion from all this is that the apparent j = 2, k = 3 pattern is driven mostly by the lexical effects of highly frequent words. This table recaps all of the results, plus the estimates of two more models. One has just a by-speaker random intercept, and the other is a flat model, which looks just like the maximum likelihood estimate of the fully pooled approach, because it is.

Method	j	k
Pooled	1.91	3.1
Nested	1.34	2.05
~Gram+(Gram|Speaker)+(1|Word)	1.38	2.11
~Gram+(Gram|Speaker)	1.57	3.19
~Gram+(1|Speaker)	1.84	3.14
~Gram	1.91	3.1

The lesson is that it can matter a lot how you calculate your proportions.

Wednesday, May 16, 2012

Decline Effect in Linguistics?

It seems to me that in the past few years, the empirical foundations of the social sciences, especially Psychology, have been coming under increased scrutiny and criticism. For example, there was the New Yorker piece from 2010 called "The Truth Wears Off" about the "decline effect," or how the effect size of a phenomenon appears to decrease over time. More recently, the Chronicle of Higher Education had a blog post called "Is Psychology About to Come Undone?" about the failure to replicate some psychological results.

These kinds of stories are concerning at two levels. At the personal level, researchers want to build a career and reputation around establishing new and reliable facts and principles. We definitely don't want the result that was such a nice feather in our cap to turn out to be wrong! At a more principled level, as scientists, our goal is for our models to approximate reality as closely as possible, and we don't want the course of human knowledge to be diverted down a dead end.

Small effects

But, I'm a linguist. Do the problems facing psychology face me? To really answer that, I first have to decide which explanation for the decline effect I think is most likely, and I think Andrew Gelman's proposal is a good candidate:

The short story is that if you screen for statistical significance when estimating small effects, you will necessarily overestimate the magnitudes of effects, sometimes by a huge amount.

I've put together some R code to demonstrate this point. Let's say I'm looking at two populations, and unknown to me as a researcher, there is a small difference between the two, even though they're highly overlapping. Next, let's say I randomly sample 10 people from each population, do a t-test for the measurement I care about, and write down whether or not the p-value < 0.05 and the estimated size of the difference between the two populations. Then I do this 1000 more times. Some proportion (approximately equal to the power of the test) of the t-tests will have successfully identified a difference. But did those tests which found a significant difference also accurately estimate the size of the effect?

For the purpose of the simulation, I randomly generated samples from two normal distributions with standard deviations 1, and means 1 and 1.1. I did this for a few different sample sizes, 1000 times each. This figure shows how many times larger the estimated effect size was than the true effect for tests which found a significant difference. The size of each point shows the probability of finding a significant difference for a sample of that size.

So, we can see that for small sample sizes, the test has low power. That is, you are not very likely to find a significant difference, even though there is a true difference (i.e., you have a high rate of Type II error). Even worse, though, is that when the test has "worked," and found a significant difference when there is a true difference, you have both Type M (magnitude) and Type S (sign) errors. For small sample sizes (between 10 and 50 samples each from the two populations), the estimated effect size is between 5 and 10 times greater than the real effect size, and the sign is sometimes flipped!
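The simulation described above can be sketched roughly like this for a single sample size. It is a reconstruction from the description, not the original code: the seed, the variable names, and the conventional 0.05 significance threshold are assumptions.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

true_diff <- 0.1   # population means 1 and 1.1, both with sd 1
n         <- 10    # observations sampled from each population
n_sims    <- 1000

# Each simulation returns the t-test p-value and the estimated difference.
sims <- replicate(n_sims, {
  a <- rnorm(n, mean = 1, sd = 1)
  b <- rnorm(n, mean = 1 + true_diff, sd = 1)
  c(p = t.test(b, a)$p.value, est = mean(b) - mean(a))
})

significant  <- sims["p", ] < 0.05
power        <- mean(significant)                                # share of significant tests
exaggeration <- mean(abs(sims["est", significant])) / true_diff  # Type M error
wrong_sign   <- mean(sims["est", significant] < 0)               # Type S error

c(power = power, exaggeration = exaggeration, wrong_sign = wrong_sign)
```

Only the significant runs feed into `exaggeration` and `wrong_sign`, which is the "screening for statistical significance" step. Looping over several values of `n` would give the pattern across sample sizes.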

Taking the approach of just choosing a smaller p-value will help you out insofar as you will be less likely to conclude that you've found a significant difference when there is a true difference (i.e., you ramp up your Type II error rate, by reducing the power of your test), but that doesn't do anything to ameliorate the size of the Type M errors when you do find a significant difference. This figure facets by different p-value thresholds.

So do I have to worry?

So, I think how much I ought to worry about the decline effect in my research, and linguistic research in general, is inversely proportional to the size of the effects we're trying to chase down. If the true size of the effects we're investigating is large, then our tests are more likely to be well powered, and we are less likely to experience Type M errors.

And in general, I don't think the field has exhausted all of our sledgehammer effects. For example, Sprouse and Almeida (2012) [pdf] successfully replicated somewhere around 98% of the syntactic judgments from the syntax textbook Core Syntax (Adger 2003) using experimental methods (a pretty good replication rate if you ask me), and in general, the estimated effect sizes were very large. So one thing seems clear. Sentence 1 is ungrammatical, and sentences 2 and 3 are grammatical.

*What did you see the man who bought?
Who did you see who bought a cow?
Who saw the man who bought a cow?

And the difference in acceptability between these sentences is not getting smaller over time due to the decline effect. The explanatory theories for why sentence 1 isn't grammatical may change, and who knows, maybe the field will decide at some point that its ungrammaticality is no longer a fact that needs to be explained, but the fact that it is ungrammatical is not a moving target.

Maybe I do need to worry

However, there is one phenomenon that I've looked at that I think has been following a decline effect pattern: the exponential pattern in /t d/ deletion. For reasons that I won't go into here, Guy (1991) proposed that if the rate at which a word final /t/ or /d/ is pronounced in past tense forms like packed is given as p, the rate at which it is pronounced in semi-irregular past tense forms like kept is given as p^j, and the rate at which it is pronounced in regular words like pact is given as p^k, then j = 2, k = 3.

Here's a table of studies, and their estimates of j and k, plus some confidence intervals. See this code for how I calculated the confidence intervals.

Study	Year	Dialect	j	j CI	k	k CI
Guy	1991	White Philadelphia	2.37	[1.17, 4.74]	2.75	[1.86, 4.26]
Santa Ana	1992	Chicano Los Angeles	1.76	[1.35, 2.29]	2.91	[2.51, 3.39]
Bayley	1994	Tejano San Antonio	1.51	[1.11, 2.08]	2.99	[2.52, 3.59]
Tagliamonte & Temple	2005	York, Northern England	1.12	[0.66, 1.85]	1.43	[1.04, 1.96]
Smith & Durham & Fortune	2009	Buckie, Scotland	0.64	[0.24, 1.36]	2.33	[1.53, 3.59]
Fruehwald	2012	Columbus, OH	1.38	[0.76, 2.48]	1.93	[1.59, 2.35]

I should say right off the bat that all of these studies are not perfect replications of Guy's original study. They have different sample sizes, coding schemes, and statistical approaches. Mine, in the last row, is probably the most divergent, as I directly modeled and estimated the reliability of j and k using a mixed effects model, while the others calculated p^j and p^k and compared them to the maximum likelihood estimates for words like kept and pact.

But needless to say, estimates of j and k have not hovered nicely around 2 and 3.

Thursday, April 19, 2012

Come and see

Yesterday, as a pre-amble to an ordinary newsletter sent out via listserv to most PhD students at UPenn, we were offered this piece of advice:

Tip of the day: You should all know this by now: It is incorrect to say “come and see” or “come out and help”, or any other “come…and…” phrase. It is an infinitive phrase: “Come to see”, “Come out to help”, “Come to have fun”. Don’t aggravate anyone’s pet peeves; just write and say it correctly. You’re welcome.

Well, many of us linguistics graduate students felt this merited some kind of response. I don't know about other linguists out there, but if someone said this to me in a personal e-mail, or in conversation, I couldn't not respond.

And then, an amazing thing happened. We started drafting a letter in a Google document with 16 contributors. It was a little chaotic, but we marshaled together intuitions, data, and argumentation, and had drafted this message in about an hour's time.

To whom it may concern:

We were recently sent a grammar “tip” via the [redacted] listserv which read:

Tip of the day: You should all know this by now: It is incorrect to say “come and see” or “come out and help”, or any other “come…and…” phrase. It is an infinitive phrase: “Come to see”, “Come out to help”, “Come to have fun”. Don’t aggravate anyone’s pet peeves; just write and say it correctly. You’re welcome.
The linguistics graduate students felt that this required a response, as in fact, the cited examples “come and see” and “come out and help” are both grammatical and widely used constructions in American English.

The two constructions differ slightly in meaning. If one says,

Mary came and saw Tupac’s hologram perform.

it must be the case that the performance actually occurred; it cannot be the case that there were technical difficulties and the performance was cancelled. However,

Mary came to see Tupac’s hologram perform.

admits the possibility that the performance was cancelled due to technical difficulties. Therefore, asserting that the infinitive phrase is a uniformly appropriate replacement for the conjoined phrase is not an appropriate representation of the linguistic facts.

Phrases like “come and see” are not restricted to the spoken idiom, but are also used in the written language. They even occur in texts considered by some to be canonical, as the following examples show:

He saith unto them, “Come and see”. (John 1:39, King James Bible)

“Then you may come and see the picture”. (Merry Wives of Windsor II:II, William Shakespeare)

“Will you come and see me?” (Pride & Prejudice, chap. 26, Jane Austen)
Generally, grammatical prescriptivism contributes little to useful discourse, and may even cause intelligent language users to be unfairly stigmatized. Thus, while we appreciate [redacted]'s light-hearted "tips-of-the-day," we would encourage authors to keep an open mind about the breadth of possible language use, especially in public forums.

Sincerely,

Jana Beck*
Claire Crawford*
[redacted]*
Sabriya Fisher*
Aaron Freeman*
Lauren Friedman*
Josef Fruehwald*
Kyle Gorman*
Marielle Lerner*
Caitlin Light*
Laurel MacKenzie*
Brittany McLaughlin*
Hilary Prichard*
Kobey Shwayder*
Jon Stevens*
[redacted]*

*Department of Linguistics

Thinking about it some more, I think at least the past tense "came to see" even has the implicature that either the seeing was unsuccessful, or there is some other more relevant event than the seeing which the speaker is about to tell us about.

Anyway, I think we did a bang up job, and produced a really excellent message, especially considering there were 16 authors!

Saturday, April 14, 2012

Linguistic Notation Inside of R Plots!

So, I've been playing around with learning knitr, which is a Sweave-like R package for combining LaTeX and R code into one document. There's almost no learning curve if you already use Sweave, and I find a lot of knitr's design and usage to be a lot nicer.

I wasn't going to make a blog post or tutorial about knitr, because the documentation is already pretty good, and contains a lot of tutorials. However, I've just had a major victory in incorporating linguistic notations into plots using knitr, and I just had to share. I'll show you the payoff first, and then include the details.

First, I managed to successfully use IPA characters as plot symbols and legend keys.

The actual data in the plot is on car fuel economy, but that's not the point. Look at that IPA!

Then, I tried to expand on the principles that got me the IPA, and look what I produced.

Yes, that is a syntax tree overlaid on top of the plot. But why stop there when you could go completely crazy?

How to do it.

The important thing about making these plots is that they were easy given my pre-existing knowledge of R, LaTeX and what I've learned about knitr. The crucial element here is that knitr supports tikz graphics. I don't know anything about tikz graphics, and I still don't, which means that if you don't know anything about tikz graphics, you can still make plots like these.

Like most linguists who use LaTeX, I already know how to include IPA characters and draw syntactic trees in a LaTeX document. It's as simple as

...
\usepackage{tipa}
\usepackage{qtree}
...
\textipa{D C P}
\Tree [.S NP VP ]
...

What is so cool about the tikz device is that it lets you define these notations in LaTeX syntax, and then incorporates them into R graphs. Here are the important code chunks to include in your knitr document to make it all work.

1 — Load the right R packages

Early on, load the ggplot2 and tikzDevice R packages.

<<>>=
    library(ggplot2)
    library(tikzDevice)
@

2 — Define your LaTeX libraries

Then, you need to tell the tikz device which LaTeX packages you want to use.

<<>>=
    options(tikzLatexPackages = c(getOption("tikzLatexPackages"),
                                  "\\usepackage{tipa}",
                                  "\\usepackage{qtree}"))
@
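
The reason for wrapping the new packages in c(getOption(...), ...) is that options() replaces a value wholesale, so whatever defaults tikzDevice has already set need to be carried along. Here's a minimal sketch of that append pattern in base R. It uses a made-up option name (demo.latexPackages, which is not a real tikzDevice option) so that it runs without tikzDevice installed.

```r
# "demo.latexPackages" is a hypothetical option name used only for illustration;
# the real option from the chunk above is "tikzLatexPackages".
options(demo.latexPackages = c("\\usepackage{tikz}"))

# Append to the existing value rather than overwriting it.
options(demo.latexPackages = c(getOption("demo.latexPackages"),
                               "\\usepackage{tipa}",
                               "\\usepackage{qtree}"))

pkgs <- getOption("demo.latexPackages")
writeLines(pkgs)   # writeLines shows the strings with single backslashes
```

If you assigned only the two new \usepackage lines, the earlier entry would be gone, which is why the existing value comes first inside c().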

3 — Define the plotting elements in LaTeX

We're done with the hard part. Now, it's as simple as faking up some data...

<<>>=
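    # In recent ggplot2 releases mpg$drv is a character vector, so convert it
    # to a factor first; otherwise levels<- won't relabel the values.
    mpg$drv <- factor(mpg$drv)
    levels(mpg$drv) <- c("\\textipa{D}",
                         "\\textipa{C}",
                         "\\textipa{P}")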
 
    mpg$tree <- "{\\footnotesize \\Tree [.S NP VP ]}"
@
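
One thing that's easy to trip over here is the doubled backslashes: inside an R string, "\\" is a single backslash, which is what LaTeX ends up seeing. Here's a small base R check of that, using a toy factor rather than the mpg data itself.

```r
# Toy stand-in for mpg$drv (the values are the three drive types in mpg).
drv <- factor(c("4", "f", "r", "f"))

# Relabel the levels with LaTeX markup; "\\" in R source is one backslash.
levels(drv) <- c("\\textipa{D}", "\\textipa{C}", "\\textipa{P}")

as.character(drv)              # printed with escapes, i.e. "\\textipa{D}" ...
cat(levels(drv), sep = "\n")   # what LaTeX actually receives: \textipa{D} ...
nchar(levels(drv)[1])          # counts the backslash once
```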

4 — Plot the data using the tikz device

...and plotting it, using the tikz device.

<<dev="tikz", fig.width=8, fig.height=5, out.width="0.9\\textwidth", fig.align="center">>=
    ggplot(mpg, aes(displ, hwy, label = drv, color = drv)) + 
            geom_text() + 
            stat_smooth()+
            xlab("\\textipa{IPA!}")    
@

Or, in the case of the syntactic trees,

<<dev="tikz", fig.width=8, fig.height=5, out.width="0.7\\textwidth", fig.align="center">>=
    ggplot(mpg, aes(displ, hwy, label = tree))+
            geom_text() + 
            stat_smooth()+
            xlab("TREES")
@

5 — Compile the .Rnw to a .tex document

Here's some source code to embed these plots in a beamer presentation. To compile a .tex document from the .Rnw source, you can run

library(knitr)
knit("./ling-plot.Rnw")

Then, just compile the .tex document however your little heart desires.
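
If you'd rather stay inside R for that last step too, something like the following could work. It's only a sketch: compile_rnw is a made-up helper name, it assumes knitr and a LaTeX installation are available, and it leans on knit() returning the path of the .tex file it wrote.

```r
# Hypothetical helper (not part of knitr): knit an .Rnw file, then run LaTeX on it.
compile_rnw <- function(rnw) {
  if (!requireNamespace("knitr", quietly = TRUE)) {
    stop("knitr needs to be installed first")
  }
  tex <- knitr::knit(rnw)   # writes the .tex file and returns its path
  tools::texi2pdf(tex)      # base R wrapper that runs LaTeX on the .tex file
  invisible(tex)
}

# Not run here, since it needs the .Rnw source and a TeX installation:
# compile_rnw("./ling-plot.Rnw")
```

knitr also ships knit2pdf(), which bundles the same two steps into one call.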

How to do it with one click

As if this weren't awesome and easy enough yet, it's possible to compile the whole document in one click using RStudio, as outlined on this knitr page. You'll need to download the development (i.e. not guaranteed to be stable) RStudio release, then set the compilation option to use knitr, and you're done!

I have to say that from a practical standpoint, I've found writing Sweave documents in RStudio to be a much better experience than what I was doing before, because I can run and debug the R code from within the .Rnw source document. No need to go flipping back and forth between a TeX editor and R.

P.S. I highlighted the code above at http://www.inside-r.org/pretty-r

Saturday, March 31, 2012

More on Philadelphia Homicide

I've been doing more analysis of the Philadelphia Homicide data that the Philadelphia Inquirer has published, and presented some of it at the Philadelphia UseR group yesterday. My slides [pdf] and source [knitr .Rnw] are on github.

I should be clear that I am not an expert on crime and murder. In fact, I'm not even fairly knowledgeable. If anyone out there with more expertise has strong criticism of my "analysis" (really, it's just a rough exploration of the data), I'll eat it, and I'll look forward to your own analysis of the data (again, it's right here). Here are some of the most striking patterns that I found.

Results

First, here is the total number of murders that occurred over the past 23 years, broken down by the day of the week. The weekends are worse than the weekdays.
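
For anyone who wants to make the same kind of tally, the day-of-week breakdown is just a table over the weekday of each date. Here's a rough sketch in base R, with a handful of toy dates standing in for the Inquirer data (they are post dates from this blog, not homicide records).

```r
# Toy dates only; these are NOT from the homicide data.
dates <- as.Date(c("2012-03-31", "2012-04-14", "2012-11-27"))

# wday counts from 0 = Sunday to 6 = Saturday, independent of locale.
wday <- as.POSIXlt(dates)$wday
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Using a factor with all seven levels keeps days with zero counts in the table.
counts <- table(factor(day_names[wday + 1], levels = day_names))
counts
```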

Next, here is the total number of murders by hour of the day. The hour of the day was not included in the data until 2006, so this only represents murders between 2006 and 2011. The plot is centered around midnight, so the afternoon of Day 1 is on the left, and the morning of Day 2 is on the right.

It looks like there's something weird going on around 11pm and midnight, which I have to chalk up to the reporting patterns of the PPD. For some reason, it seems like murders that occurred in the midnight hour are more likely to be logged as occurring at 11pm.
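
Centering the x-axis on midnight just means shifting the hours by 12 and wrapping around. Roughly, assuming the hour is stored as an integer from 0 to 23 (that layout is an assumption for the sake of the example):

```r
hour <- 0:23

# Shift so that noon (12) lands at position 0 and midnight (0) at position 12.
centered <- (hour - 12) %% 24

# Hours in plotting order: noon through 11pm first, then midnight through 11am.
hour[order(centered)]
```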

Here is the most striking plot that I produced this time around. It plots, by month, the average frequency of murders. The y-axis represents 1 murder every X days.

Since 1988, the African American community has been living in a Philadelphia with approximately a murder every day, or every other day. The White community, on the other hand, has been living in a Philadelphia with a murder once a week.

I also did some meager statistical analysis, specifically Poisson regression with terms for the month (that is, January, February, etc., to look for a seasonal pattern), race of the victim, and weapon used. There was a significant month effect, but the coefficients didn't have much of a pattern to them. I did use number of days in the month as an offset in the regression, so it's not that. More importantly, there was an unsurprising main effect of race, but also a big interaction between race and weapon. Specifically, African American victims were way more likely to be killed by a gun.
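
To make the model structure concrete, here is roughly what that kind of regression looks like in R. The data frame is simulated purely so the code runs: none of the numbers come from the Inquirer data, and the column names (n, month, race, weapon, days) and group labels are placeholders.

```r
set.seed(1)

# Simulated monthly counts; NOT the homicide data.
toy <- expand.grid(year   = 2006:2011,
                   month  = factor(month.abb, levels = month.abb),
                   race   = c("A", "B"),
                   weapon = c("gun", "knife"))

# Days per month (leap years ignored, since this is only a toy).
days_in_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
toy$days <- days_in_month[as.integer(toy$month)]
toy$n    <- rpois(nrow(toy), lambda = 5)

# Poisson regression with month, race, weapon, a race-by-weapon interaction,
# and log(days) as the offset so that longer months aren't counted as deadlier.
fit <- glm(n ~ month + race * weapon + offset(log(days)),
           family = poisson, data = toy)

# Exponentiated coefficients are rate ratios.
exp(coef(fit))[c("weaponknife", "raceB:weaponknife")]
```

Because the toy counts are random, the interaction here should come out near 1; the point is only the shape of the formula and the offset.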

Guns and knives are the two most common weapons used in murders in the data. White murder victims are 2.54x more likely to have been shot than stabbed, while African American murder victims are 7.19x more likely to have been shot than stabbed, meaning that, relative to being stabbed, African American murder victims are 2.83x more likely to have been shot than White murder victims.

Update: There was a pretty serious flaw in my regression: if there was a month where, say, no African Americans were murdered with a knife (and there were plenty of such months), that month's data was missing, rather than 0. After filling in the data to reflect months with 0 murders for a particular race x weapon combination, the estimates are pretty different. White murder victims are 5.71x more likely to have been murdered with a gun than a knife, while African American murder victims are 8.62x more likely to have been murdered with a gun than a knife, meaning that, relative to being stabbed, African American murder victims are 1.51x more likely to have been shot than White murder victims. So, that's a pretty serious revision, approximately halving the multiplier. I've already updated the linked code and slides.

So, gun deaths are an especially acute problem in the African American community. In fact, if you exclude gun deaths from the data, it actually looks like the racial disparity in murder rates has been narrowing.
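
The zero-filling fix from the update boils down to building every month x race x weapon combination and treating the ones that never show up in the counts as 0 rather than dropping them. Here's a sketch of one way to do that in base R, with a tiny invented table (the values, labels, and column names are placeholders, not the real counts).

```r
# Invented counts; the ("2011-02", "A", "knife") cell is absent on purpose.
counts <- data.frame(month  = c("2011-01", "2011-01", "2011-02"),
                     race   = c("A", "A", "A"),
                     weapon = c("gun", "knife", "gun"),
                     n      = c(3, 1, 2),
                     stringsAsFactors = FALSE)

# Every combination that could have occurred.
grid <- expand.grid(month  = unique(counts$month),
                    race   = unique(counts$race),
                    weapon = unique(counts$weapon),
                    stringsAsFactors = FALSE)

# Left join onto the full grid, then turn the unmatched cells into zeros.
filled <- merge(grid, counts, all.x = TRUE)
filled$n[is.na(filled$n)] <- 0
filled
```

Without the merge step, the model only ever sees the months where a combination had at least one murder, which inflates the estimated rate for rare combinations.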

It is purely coincidental that I'm posting this on the same day that the Philadelphia Police Department are doing a gun buyback. You can bring in a gun and receive a $100 Shoprite voucher, no questions asked. Seems like a good initiative.

Analysis Discussion

I spent a bit of time trying to figure out what I thought the most meaningful way to represent the murder rate was. First, I calculated the murder frequency by counting the number of murders in a month, then dividing that by the number of days in the month, for (n murders / n days) = murders per day. But the resulting measure had values like 0.14 murders per day, which isn't too informative. What people want to know about murders, or at least what I want to know, is how often murders happen, not how many happened in a given time window. So, instead, I calculated (n days / n murders) = days per murder.
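
In code, the two measures are just reciprocals of each other. Using a made-up month of 28 days with 4 murders (illustrative numbers, not taken from the data):

```r
n_murders <- 4    # illustrative only
n_days    <- 28

murders_per_day <- n_murders / n_days   # about 0.14
days_per_murder <- n_days / n_murders   # 7, i.e. one murder a week

round(murders_per_day, 2)
days_per_murder
```

One wrinkle with the second measure: a month with no murders gives n_days / 0, which R evaluates to Inf, so such months need separate handling before plotting.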

The y-axis for the murder rate figures is also on a logarithmic scale, which is reasonable given both the distribution of the data and how people perceive timescales. From a human perspective, the difference between 1 day and 2 days feels larger than the difference between 3 weeks and 4 weeks. The y-axis is also flipped, to indicate that smaller numbers mean "more often". I managed the reversed log transformation by writing my own coordinate transformation using the new scales package. Here's the R code.