Friday, July 27, 2012

Teens and Texting and Grammar

I'm just one man, one linguist, impotently shouting into the vast mediascape, "PLEASE POPULAR MEDIA! PLEASE DON'T RUN WITH THE TEEN TEXTING GRAMMAR STORY!"

There is a paper out in New Media and Society called Texting, techspeak, and tweens: The relationship between text messaging and English grammar skills. If you are a linguist, and you winced at the title, I have to warn you, you're not done wincing yet.

Is the key problem that the authors collected data on text messaging behaviors from self reports? No.

Is the key problem that the authors did not directly assess whether or not the teens in the study used "techspeak"? No. (Let's set aside the fact that high volumes of txtspeak are increasingly associated with out-of-touch adults.)

Is the key problem that the authors didn't include any figures plotting the relationship between any of their measures? No.

Is the key problem that the authors included no control group of teens who don't text, or adults who adopted texting late in life? No.

The key problem is that the authors appear to have no idea what grammar or language are. I quote:
Similar to synchronous online communications such as instant messaging, the speed, ease, and brevity of text messaging have created a perfect platform for adapting the English language to better suit attributes of the technology. This has led to an evolution in grammar, the basis of which we shall call ‘techspeak.’ This language differs from English in that it takes normal English words and modifies them [...]
The depth of misunderstanding and naiveté present in this quote about the relationship between actual language and grammar and the way we write is equivalent to thinking that the sun revolves around the Earth, and that stars are bright dots on a large dome in the sky. Mind you, the Earth-centric, skydome model of the universe is a perfectly reasonable one until you are exposed to the most basic, rudimentary scientific understanding of how the world works.

The authors of this paper appear not to have been exposed to the most basic, rudimentary scientific understanding about how language and grammar work.

From Appendix A of the paper, I present to you the 20 point "grammar" assessment used in the study.
  1. There (is, are) two ways to make enemies.
  2. One of the men forgot to bring (his, their) tools.
  3. Gail and Sue (make, makes) friends easily.
  4. The coach thought he had (tore, teared, torn) a ligament.
  5. During the flood, we (dranked, drank, drunk, drunked) bottled water.
  6. The boy called for help, and I (swum, have swam, swam) out to him.
  7. Fortunately, Jim’s name was (accepted, excepted) from the roster of those who would have to clean bathrooms because he was supposed to go downtown to (accept, except) a reward for the German Club.
  8. I don’t know how I could (lose, loose) such a big dress. It is so large that it is (lose, loose) on me when I wear it!
  9. The man around the corner from the sandlots (come, comes) to our meetings.
  10. The man and his little girls (was, were) not injured in the accident.
  11. The pictures in this new magazine (shows, show) the rugged beauty of the West.
  12. The orders from that company (is, are) on your desk there.
  13. The (boys, boys’, boy’s, boys’s) hats were lost in the water because they were careless in not tying them to the side of the boat.
  14. (Its, It’s, Its’) an honor to accept the awards certificates and medals presented to the club.
  15. Worried, and frayed, the old man paced the floor waiting for his daughter. (Correct/Incorrect)
  16. The boy yelled, ‘Please help me’! (Correct/Incorrect)
  17. She got out of the car, waved hello, and walked into the house. (Correct/Incorrect)
  18. When Suzie arrived at the dance, no one else was there. (Correct/Incorrect)
  19. Dad and I enjoyed our trip to new york city. (Correct/Incorrect)
  20. The boy’s mother picked him up from school. (Correct/Incorrect)
To quote what it was the authors were trying to assess:
The first portion of the assessment consisted of 16 questions designed to test the student’s grasp of verb/noun agreement, use of correct tense, homophones, possessives, and apostrophes. [...] The second portion of the assessment asked participants to indicate whether or not a sentence was correct, such as ‘The boy yelled, “Please help me”!’ (Correct/ Incorrect). This portion tested the student’s understanding of comma usage, punctuation, and capitalization.
Virtually none of these points (homophones, apostrophes, comma usage, punctuation and capitalization) fall under the purview of what is scientifically understood to be "grammar". Arnold Zwicky has suggested the term "garmmra" for such things. Punctuation, comma rules, spelling conventions, etc. are all only arbitrary decisions settled upon a long time ago, and have nothing, nothing to do with human language. You could, by fiat, swap periods and commas (like many cultures do with their numeral systems), insist that sentence initial adverbs be followed by a semicolon, and decide to revert back to the symbols <þ> and <ð> to spell the sounds we currently both spell with <th>, and you know how many things that would change about English grammar? Zero things.

The remaining points of assessment could be considered to be well within the domain of grammar (tense and subject/verb agreement), except the authors chose really poor, very variable items for the evaluation. The very first item involves verbal agreement with an expletive subject, and the rest involve cases of coordination, and agreement attraction! These are items which really lie on the outside edges of linguistic processing abilities, and there is no way that they could serve as reliable measures of fluency and grammatical competence. Search the work of any good writer, and I'm sure you'll find examples of both kinds of usage.

And then there's the second item: "One of the men forgot to bring (his, their) tools." Both possibilities are acceptable English, and have been for a long time.

The most depressing thing about this grammar assessment is where the researchers say they got it.
This assessment was adapted from a ninth-grade grammar review test.
I'm reminded of a piece I read called For Ebonics, the New Millennium Is Pretty Much Like the Old One, which said: "This suggests to me a catastrophic failure of the public school 'language arts' curriculum: people spend years in various language arts classes and leave with the same 19th-century folk notions that they started with."

So what have these authors actually found? Well, maybe it's the case that the more people write in a broader range of contexts for a broader range of purposes, the more the arbitrary, conventionalized aspects of the writing system of English will undergo natural drift. What effect will this have on English grammar, as it is represented in the minds of everyday English users? Probably just as much as the current writing system does: a minimal one.

And what about my plea to the popular media? Even if someone of note finds this post and reads it, I already know that it won't matter at all. Per my commentary on the coverage on vocal fry, no one is going to report on this piece because they care about science or facts. This research fits snugly into pre-existing biases about young people and the general decline of society, and frankly, these biases seem to have more to do with why these researchers did the study in the first place than science or facts. And there is no way that something so trivial as a bunch of experts on language and grammar is about to derail this train of garbage and nonsense.

UPDATE! There is, in fact, an actual paper on the topic of Instant Messaging and Grammar by Sali Tagliamonte and Derek Denis from 2008 called "Linguistic Ruin? LOL! Instant Messaging and Teen Language." Remember hearing about that in the news? Here are selections from their conclusions.
In a million and a half words of IM discourse among 71 teenagers, the use of short forms, abbreviations, and emotional language is infinitesimally small, less than 3% of the data.
Our foray into the IM environment through quantitative sociolinguistic analysis, encompassing four areas of grammar and over 20,000 individual examples, reveals that IM is firmly rooted in the model of the extant language, reflecting the same structured heterogeneity (variation) and the same dynamic, ongoing processes of linguistic change that are currently under way in the speech community in which the teenagers live.

UPDATE! See also Enregistering internet language by Lauren Squires (2010)

Wednesday, July 25, 2012

Don't worry, I'm a physicist.

Today, I came across a science news item from ABC (the Australian Broadcasting Corporation) with the title "Study opens book on English evolution." Oh goodness. Here are the opening paragraphs:
A study of 500 years of the English language has confirmed that 'the', 'of' and 'and' are the most frequently printed words in the modern era.

The study, by Slovenian physicist Matjaz Perc, also found the top dozen phrases most-printed in books include "at the end of the", "as a result of the" or "on the part of the".
That sound you hear is the stunned silence of linguists everywhere over the fact that you can get into the science news with the primary result that "'the' is the most common English word."

But to be fair, what the author was trying to argue is that the Zipfian distribution of word frequencies is a result of "preferential attachment," where frequent words get more frequent. He tried to demonstrate this by showing that the frequency of a word in a given year is predictive of its frequency in the future, specifically that relatively high frequency words will be even more frequent in the future. The key result is shown in Figure 4 in the paper, available here.
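To make "frequent words get more frequent" concrete, here is a minimal Python sketch of a generic rich-get-richer process. It is a toy, not Perc's analysis: the function name, the `p_new` parameter and its value, and the use of integers as stand-ins for word types are all assumptions made for illustration.

```python
import random
from collections import Counter


def simulate_rich_get_richer(n_tokens, p_new=0.1, seed=0):
    """Grow a token stream in which reuse is proportional to current frequency.

    p_new (an assumed, arbitrary value) is the chance of coining a new type.
    """
    rng = random.Random(seed)
    tokens = [0]      # start with a single token of word type 0
    next_type = 1
    while len(tokens) < n_tokens:
        if rng.random() < p_new:
            # Introduce a brand-new word type.
            tokens.append(next_type)
            next_type += 1
        else:
            # Copy a uniformly chosen earlier token; a type that already has
            # many tokens is proportionally more likely to be picked.
            tokens.append(rng.choice(tokens))
    return Counter(tokens)


counts = simulate_rich_get_richer(10_000)
# Frequencies ordered from the most common type to the least common.
ranked = [freq for _, freq in counts.most_common()]
print(ranked[:10])
```

Plotting `ranked` against rank on log-log axes is the usual way to eyeball whether a process like this gives a heavy-tailed, Zipf-like curve.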

Say what?

While that quantitative result may stand, the fact that Perc is a physicist probably contributed to some really bananas statements about language. In the first paragraph, he almost completely conflates human language and written language as being the same thing, and erases the validity and richness of cultures with unwritten languages.
Were it not for books, periodicals and other publications, we would hardly be able to continuously elaborate over what is handed over by previous generations, and, consequently, the diversity and efficiency of our products would be much lower than it is today. Indeed, it seems like the importance of the written word for where we stand today as a species cannot be overstated.
He also presents some results of English "coming of age" and reaching "greater maturity" around 1800 AD (Figure 3). Finally! It only took us like, what, a thousand years or so?

The discussion section kicks off with the statement
The question ‘Which are the most common words and phrases of the English language?’ alone has a certain appeal [...]
That may be true for physicists, but for people who are dedicated to studying language (what are they called again?) not so much. Fortunately, his ignorance of linguistics is actually a positive quality of this research!
On the other hand, writing about the evolution of a language without considering grammar or syntax, or even without being sure that all the considered words and phrases actually have a meaning, may appear prohibitive to many outside the physics community. Yet, it is precisely this detachment from detail and the sheer scale of the analysis that enables the observation of universal laws that govern the large-scale organization of the written word.
See, linguists are just too caught up in the details to see the big picture! Fire a linguist and your productivity goes up, amirite?

For real though?

But back to the substantive claim of the paper. Is the Zipfian distribution of words due to the rich getting richer? That is, are words like snowballs rolling down a hill? The larger they are, the more additional snow they pick up, the even larger they get. Maybe, but maybe not.

Here's a little experiment that I was told about by Charles Yang, who read about it in a paper by Chomsky that I don't know the reference to. Right now, we're defining "words" as being all the characters between white spaces. But what if we redefined "words" as being all the characters between some other kind of delimiter? The example Charles used was "e". If we treat the character "e" as being the delimiter between words, and we apply this to a large corpus, we'll get back "words" like " " and " th" and less frequently "d and was not paralyz". What kind of distribution do these kinds of "words" have?
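To make the re-segmentation concrete, here is a minimal Python sketch of the idea. It is only an illustration under stated assumptions: the function name `rank_frequency` and the toy sentence are made up, and a real run would read in a large corpus rather than one sentence.

```python
from collections import Counter


def rank_frequency(text, delimiter=None):
    """Split text on a delimiter and return (piece, count) pairs, most frequent first.

    delimiter=None means ordinary whitespace splitting; any other string
    (for example "e") is treated as the boundary between "words".
    """
    if delimiter is None:
        pieces = text.split()
    else:
        pieces = text.split(delimiter)
    return Counter(pieces).most_common()


# Toy text only (hypothetical); a real run would use a large corpus.
toy = "the cat sat on the mat and the dog sat on the log"

by_space = rank_frequency(toy)                # ordinary words
by_e = rank_frequency(toy, delimiter="e")     # "words" between e's

for rank, (piece, freq) in enumerate(by_e, start=1):
    print(rank, repr(piece), freq)
```

With a real corpus, the ranks and frequencies from each segmentation can be put on log-log axes and compared side by side.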

Well, I coded up this experiment (available here), where I compare the ordinary segmentation of the Brown corpus into words by using white spaces to segmentations using "a", "e", "i", "o" and "u." Here's the resulting log-log plot of the frequencies and ranks of the segmentations.

It all looks quite Zipfian. So are not only the characters between spaces, but the characters between any arbitrary delimiters subject to a rich-get-richer process? Keep in mind that the definition of "word" as being characters between spaces is relatable to representations in human cognition, while the definition of "word" as characters between arbitrary delimiters is not, especially not with English's occasionally idiosyncratic orthography.

Maybe it's possible for the results of my little experiment to be parasitic on a larger rich-get-richer process operating over normal words, but for now I'm dubious.

Tuesday, July 10, 2012

Visualizing Graphical Models

I'm anticipating presenting research of mine based on Bayesian graphical models to an audience that might not be familiar with them. When presenting ordinary regression results, there's already the sort of statistical sniper questions along the lines of "What if the effect is actually being driven by this other correlate?" and "That effect might result from assumptions a, b, and c of the test." etc. Sometimes these questions are useful, but sometimes they seem to detract from the substantive issues at hand. And frequently, I see talks get way too bogged down in anticipating questions like this by cramming too much statistical detail into their talk, leaving not enough time to do justice to the theoretical importance of their results.

Add to this the customizability of graphical models, the number of possible distributions and parameter settings, and the notion that "Bayesian" = "subjective", and I'm really feeling stressed out by the presentational task ahead of me.

So, I'm trying to figure out a good way to both make the model I've built fully available and accessible to someone who can't read JAGS code, has a little bit of presentational pizzazz, and also allows me to focus in on the parameters of specific interest. I started off trying to use Graphviz to produce directed graphs, and wound up with this (an actual level in the model I'm hoping to present).
It's all a ton of spaghetti, difficult to highlight the particular parameters of interest, and doesn't represent some important distinctions (like stochastic and deterministic nodes).
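For what it's worth, that last distinction can at least be marked in Graphviz with node shapes. Here is a minimal Python sketch that only builds DOT text. The three node names are hypothetical and are not the model I'm presenting, and the ellipse-for-stochastic, box-for-deterministic convention is an assumption.

```python
def to_dot(nodes, edges):
    """Return DOT source; stochastic nodes are ellipses, deterministic nodes are boxes."""
    shape = {"stochastic": "ellipse", "deterministic": "box"}
    lines = ["digraph model {"]
    for name, kind in nodes.items():
        lines.append(f'  "{name}" [shape={shape[kind]}];')
    for parent, child in edges:
        lines.append(f'  "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-node fragment, not the model discussed in this post.
nodes = {
    "beta": "stochastic",
    "mu": "deterministic",
    "y": "stochastic",
}
edges = [("beta", "mu"), ("mu", "y")]

dot_source = to_dot(nodes, edges)
print(dot_source)
```

Saving the printed text to a file such as model.dot and running `dot -Tpng model.dot -o model.png` should render it. That does nothing about the spaghetti problem on a full-sized model, though.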

I've moved on from Graphviz to trying to build an interactive tree diagram using the JavaScript InfoVis Toolkit. It's been kind of slow going, since I don't know any JavaScript, and am still trying to sort out what functions are basic and which ones are defined by the toolkit. Click on the image below to visit the visualization.

It's getting there, but I'm not convinced yet that it'll do the job of making the whole model digestible. For one, I'm modeling effects at a few different levels. The token level is represented in this visualization, but I'm also looking at speaker level effects, treating the linguistic context as a within speaker variable, and at word level effects. The way I'm setting things up now, that's going to call for two more trees like this one.

Maybe the lesson here is that I should just fit and present a simpler model, but remember those sniper questions? I'm worried that if I leave out someone's favorite correlate, 1) I'll have to deal with it in the questions and 2) they'll leave unconvinced, or rather, they'll leave convinced that it was their favorite correlate doing the work all along. But these are really research anxieties that no visualization toolkit on earth could assuage.

Sunday, July 8, 2012

On "Welcome to the Internet"

I interrupt the regularly scheduled (linguistics/data/stats) programming to bring you a special message about a topic which has been really bothering me. This blog is my primary venue for writing publicly about anything, so even though Anita Sarkeesian's project on Tropes vs. Women in Video Games doesn't fit into any of my usual topics, I'm going to write about it here.

I think most people will have heard about what's going on here. Anita Sarkeesian puts together an excellent video series called Feminist Frequency which offers accessible feminist critiques of movies, TV shows, etc. She set up a Kickstarter project to help fund research and production of a new video series called Tropes vs. Women in Video Games. The project was a great success, raising over 26x the original goal, but the backlash from people on the internet has been really vile. You can look over a summary of links about the issue here.

I'm not writing about how vile I think the backlash is. Instead I'm writing about how much some people's reactions to the backlash have bothered me. I've read some of these online, and had them come up in conversation. They fall into a few categories.

"We can disagree without being disagreeable"

I have not heard one person in a respectable forum defend the backlash against Sarkeesian. However, I have heard a lot of "you might disagree with what she says, but you can do so in a civil manner." But at this moment, nobody can disagree with what Sarkeesian says, because she has not, in fact, said it yet. The whole backlash is not against what she said about misogyny in video games, but rather against her stated intention to say anything about misogyny in video games. What we are looking at is simply unvarnished hatred, and its exponents cannot make pretensions to having intellectual differences of opinions. That would require careful consideration of Sarkeesian's points, which again, is impossible, because she hasn't even had the opportunity to put them forward yet.

"Welcome to the internet"

I've heard more than one person say "welcome to the internet" about the harassment Sarkeesian is experiencing. As if what is happening to her just happens to everybody. A porn bot following you on twitter is a "welcome to the internet" moment. A spam comment on your blog including links to purportedly cheap viagra is a "welcome to the internet" moment. What we're observing with this backlash is not a "welcome to the internet" moment.

Even if we limit the discussion to the trolling comments on her blog and YouTube pages, the magnitude and intensity of the comments are already far beyond the average person's experience. And as Jay Smooth pointed out, it's also the case that members of marginalized groups tend to have a much worse experience with trolling like this. So this isn't just your plain vanilla internet, it's one that is especially bad for people who are already marginalized IRL.

But we can't really limit the discussion to high volume trollish comments. We have to also bring in the vandalism of her Wikipedia page, which included adding a lot of porn. We also have to bring in the meme-ification of her image with the goal of specifically attacking her in specifically sexual ways. We need to bring in the fact that people are sending her explicit threats of rape and violence. And we also need to bring in the creation of a flash game that invited the player to beat Sarkeesian's face in. This last one is especially disturbing to me, because I've been reading a lot of guys talking about how much they want to hit her. To quote YouTuber MundaneMatt (linked here just to provide substantiating evidence, I wouldn't advise visiting it):
She's got those eyes that make you just want to punch her in the face.
And to quote a user's review on Destructoid of the flash game (I'm not even linking to it this time):
The voice acting isn’t the best at riling up the player, especially as her videos do this quickly anyway.
We are far far outside the realm of "welcome to the internet" and deep into the very dark, very real topic of silencing women with rape and violence.

And of course, there's the internet vigilantism. Her site has been DDoS-ed, there have been attempted hacks of her e-mail and various social networks, and she's been dox-ed (her personal address and telephone number posted online). This is the kind of treatment reserved for people dubbed villains by the internet. It is more than atypical, it is specifically reserved for the worst of the worst. By no means is it "welcome to the internet." And what did she do worthy of being treated like such a villain?

I think it is justified, given the evidence, to say that what is happening to Anita Sarkeesian is uniquely bad, and it is happening to her because she is a woman.

The Mos Eisley Gambit

Closely related to "welcome to the internet" is the Mos Eisley Gambit, which is simply stating that on the internet at large (and in YouTube comments specifically) "you will never find a more wretched hive of scum and villainy." This is more and more easily believable the more you read about the Sarkeesian backlash.

But, I'm sorry, don't a lot of the same people who deploy the Mos Eisley Gambit also have a lot to say about how the internet is the future of free and open discourse? Wasn't there a whole collective kumbaya moment just a few months ago where "the internet defeated SOPA"? Wasn't the whole SOPA thing a backlash against the possibility of government censorship? Isn't the goal of the backlash against Sarkeesian to censor her? You can't have it both ways. You can't go around hailing the internet as a revolutionary space for free communication (a human right even) that must be protected at all costs, and be so flip about what's happening to Sarkeesian.

And what's more, the residents of this hive of scum and villainy don't actually live in the internet. The trolls, vandals and harassers are not internet pixies, they are real actual people. The images of Sarkeesian's likeness being raped by video game characters didn't just pop into existence of their own accord. A person, someone's next door neighbor, son, brother, sat down and spent time drawing the damn thing, and e-mailed it to her. The hive of scum and villainy is actually the real world we're all living in, and it's just reflected in the internet. Trolls are people too, and that's exactly the problem. You don't get away from the racist YouTube commenters by going outside, you ride the bus with them. Which is why, I think, hateful trolling is a worthwhile thing to worry about. It's not just about silly things that happen on the internet. It's about the attitudes and actions of real people who we all interact with every day.

Wednesday, July 4, 2012

Question: Work on -ly-less adverbs

I think I'm going to ask general information gathering questions that I have about linguistics research here on my blog, rather than as Facebook or Twitter posts. Then, I can add the answers I get back to the post.

What research is there on -ly-less adverbs? I think the most common one that comes up is "personal," as in
  • Don't take it personal.
Here are two more real life examples (the second one I heard just today, hence the question):
  • I go to South Jersey occasional.
  • I need a cigarette desperate.
I have some vague intuitions about restrictions on the -ly-less forms. Specifically, I think they're only possible post-verbally, so
  • *I personal took it.
And I doubt we'd ever see it with a sentential adverb, like
  • *Hopeful, we'll find an answer.
  • *We'll find an answer, hopeful.
But then, I don't really trust my intuitions, because I would have also rejected the "occasional" and "desperate" sentences above, which I heard come out of real people's mouths.

So, anyone know of any research on the topic?

People came through for me! First, Mercedes Durham pointed me in the right direction on Twitter.
The Tagliamonte and Ito paper provides a great introduction to the topic of -ly~ø variation in adverbs. First, in the long view of history, the -ly adverbs are the innovation creeping in, not the zero forms. Here's how I understand it worked. There used to be a morpheme -lic which was used to create adverbs from nouns.
  • friend + lic
  • man + lic
And there was a separate morpheme -e that created adverbs from adjectives.
  • direct + e
  • open + e
Sometimes you'd get them stacking on top of each other

  • friend + lic + e
  • man + lic + e
And sometimes you'd wind up with the -lic+e morphemes coming together and behaving like one morpheme that turns adjectives into adverbs.
  • sweet + lice
This part sounds similar to a more modern situation. We have a morpheme -ate that turns nouns into verbs.
  • assassin + ate
And a morpheme -ion that turns verbs into nouns, which sometimes stacks on top of -ate.
  • delete + ion
  • assassin + ate + ion
But sometimes, we get -ation coming together and acting like one morpheme that turns verbs into nouns.
  • cause + ation (*causate)
Anyway, back to Old English. At some point the little -e morpheme that turned adjectives into adverbs got lost (probably as part of a larger language change that dropped a lot of word final unstressed e's). At that point, adjectives and derived adverbs just all sounded the same. That is, derived adverbs were all zero forms. But then, the fused form -lice started being used to make adverbs in more places than it used to be, and it eventually changed in pronunciation to modern day -ly.

On these historical issues, a lot of ink has been spilled, including a whole two-volume series on just this case of variation in adverb formation, and a few book chapters.

Tagliamonte & Ito also provide a lot of cool examples from other studies, like these ones from Appalachian and Ozark English (Christian, Wolfram & Dube, 1988).
  • I come from Virginia original.
  • It certain was some reason.
Their own study was on a large corpus of speech from York, England. After treating "really" separately (they argued the patterns in "really" had more to do with its use as a special intensifier and less to do with adverb formation), they found basically no age effects, but working class men strongly favored the zero form compared to everyone else.

As for language internal effects, they completely excluded preverbal adverbs as being invariantly -ly forms (per my intuition, but not per that one example above from Appalachian English). After that, they found that the concreteness of the verb had the strongest effect, with concrete verbs favoring the zero form a lot more than abstract verbs.

I noticed that both the examples that I felt were interesting enough to take a mental note of above involve abstract verbs + zero form adverbs. Maybe the fact that abstract verbs disfavor zero forms is what made them jump out at me.

Allison Shapp pointed me to work she's doing on -ly~ø variation in American English, and specifically (if I understood the poster right) African American English. They've found a big effect of education, where higher education favors the -ly form, and that African American speakers, who are likely to be speakers of African American English, favor the zero form.

So! That was a fruitful information gathering adventure! This is a really cool variable!
