Sunday, February 17, 2013

A difference between men and women.

This post was originally going to be a lot more mathy, with a bit of explanation of the source-filter model of speech production (plus an aside about dead dog heads mounted on compressed air tanks) and a whole description of my methods, but I felt like I was sort of burying the lede. Instead, I'm focusing on how invested people are in magnifying the difference between men and women.

It started off with me estimating the vocal tract lengths of the speakers in the Philadelphia Neighborhood Corpus. Given sufficient acoustic data from a speaker, some simplifying assumptions, and the acoustic theory of speech, you can roughly estimate how long a person's vocal tract (meaning the distance from the vocal cords to the lips) is. I went ahead and did this for the speakers in the PNC, and plotted the results over age.
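If you're curious, the basic trick is the uniform-tube approximation: treat the vocal tract as a tube closed at the glottis and open at the lips, so its resonances fall at odd multiples of c/4L, and solve for L. Here's a minimal sketch of that idea in Python. This is not my actual analysis code, and the formant values below are made up:

```python
# A minimal sketch of the uniform-tube approximation, not my actual method.
# For a tube closed at one end, the nth resonance is F_n = (2n - 1) * c / (4L),
# so each measured formant gives an estimate of the tube length L.

C = 35000.0  # speed of sound in warm, moist air, in cm/s (an assumed value)

def vtl_from_formants(formants_hz):
    """Estimate vocal tract length (cm) from measured formants F1, F2, ..."""
    estimates = [(2 * n - 1) * C / (4 * f)
                 for n, f in enumerate(formants_hz, start=1)]
    return sum(estimates) / len(estimates)

# Made-up formants for a schwa-like vowel:
print(vtl_from_formants([500, 1500, 2500]))  # 17.5 cm
```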


Pretty cool, right? There's nothing especially earth-shattering here. It's known that men, on average, have longer vocal tracts than women. I was a little bit surprised by how late in age the bend in the growth of vocal tracts was.

Here's the density distribution of vocal tract lengths for everyone over 25 in the corpus.



That's a pretty big effect size. Mark Liberman has recently posted about the importance of reporting effect sizes. He was focusing on how even though people are really obsessed with cognitive differences between men and women, the distributions of men and women are almost always highly overlapping.

Following Mark on this, I went ahead and calculated Cohen's d for these VTL estimates.
So, 1.71 is a fairly large Cohen's d effect size. I had heard that the difference in vocal tract length between men and women was disproportionately large given just body size differences. I managed to find some data on American male/female height differences, but the height effect size is not impressively smaller than the VTL effect size (1.64, about 96% of it).
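For anyone unfamiliar with it, Cohen's d is just the difference between the two group means divided by their pooled standard deviation. A quick sketch; the real numbers above came from the PNC data, while the values here are made up:

```python
import statistics

def cohens_d(x, y):
    """Cohen's d: difference in group means, scaled by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    pooled_sd = (((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / pooled_sd

# Hypothetical VTL estimates in cm, not the PNC data:
men = [16.5, 17.0, 17.8, 16.2, 17.4]
women = [14.1, 14.8, 15.0, 13.9, 14.5]
print(cohens_d(men, women))
```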



Compared to the effect that Mark was looking at (science test scores), these effect sizes are enormous. The effect size of height between men and women is about 23 times larger than the science test score differences which warranted a writeup in the New York Times.

Yet, still not big enough.

As I was thinking about how height difference is perhaps one of the largest statistical differences between men and women, it also struck me how often it is still not big enough for social purposes. Sociological Images has a good blog post about how, even though Prince Charles was about the same height as, if not shorter than, Princess Diana, in posed pictures he was made to look much taller than her. Here's an example of them on a postage stamp:

And in another post, they provide this picture of a reporter being comically boosted to appear taller than the woman he's interviewing.


My takeaway point is that when it comes to socially constructing large and inherent differences between men and women, even the largest statistical difference out there is still not good enough for people, and needs to be augmented and supported. Then take into account that most other psychological and cognitive differences have drastically smaller effect sizes, and it really brings into focus how the emphasis on gender differences must draw almost all of its energy from social motivations, rather than from evidence or data or facts.

Thursday, February 7, 2013

I recommend Lexicon Valley

Perhaps the most frustrating thing about being a linguist is the enormous gap between how little educated people actually know about language and how confident they are that they know a lot about it. If you keep up with this blog, you know I spend a lot of time venting this frustration here (etc. etc. etc.).

But I didn't start blogging in order to complain about how other people are getting it wrong. I started blogging to have an informal outlet for my passion for linguistics! I've been a little concerned about the negative tone of a few of my recent posts, so here's a more positive one.

But... it does start off with a complaint. At the LSA this year, David Pesetsky's plenary focused on the failure of linguistics (and more specifically, generative linguistics) to penetrate the popular science press. Instead, stories about physicists discovering that the most common English word is "the," and psychologists arguing that the structure of language is really just words strung together like beads on a string, get a lot more play. At the Q&A, Ray Jackendoff made the point that there is a folk linguistics, intricately tied up in social politics, that acts as a major roadblock to the popular advancement of real linguistic research. I've said similar things before.

What is to be done about this state of affairs is the topic of another blog post. Right now, I'd like to bring attention to a bright light of potential linguistics popularization.

Lexicon Valley


Lexicon Valley is a podcast hosted by Slate. I've been listening to it off and on since it started, and I have to say I've always enjoyed it. The hosts play two roles in a dialectic. Mike Vuolo is the patient intellectual, and I've always been impressed by the background research he's done. Bob Garfield is the voice of the untutored establishment, and, well, I think that description adequately sums up my opinion of what he brings to the show. It's actually an important role he plays, because without a vocal foil, Vuolo's research would fall rather flat. It's also important for the cause of linguists to have people hear brash knee-jerk reactions rebuked by careful research.

They have covered a few topics I know a little bit about, and I've always started listening to each show bracing myself for frustration and disappointment. It's a learned reaction I have from every other discussion of language in popular media. But Lexicon Valley usually carries through for me. They've done great shows on African American English, grammatical gender, and the English epicene pronoun, speaking to actual linguists in each case, and most recently they've just done a really good portrayal of Labov's department store study (Part 1, Part 2).

They did catch a lot of flak recently for their show on creaky voice. I was so nervous when I started listening to it, because the recent coverage creaky voice has gotten has been worse than terrible. Per usual, though, Vuolo's research and discussion were excellent. Garfield, on the other hand, spouted some really negative attitudes, and I think he deserves every criticism of sexism that he got. Even within the dialectic of the show, Garfield made a net negative contribution that time round. On the subsequent show, though, Vuolo read out some pretty harsh commentary about Garfield. Garfield offered a nonpology (something about how he can't be sexist because he has daughters), but it was good to have some of the criticism read out loud.

On average, modulo Garfield's frustrating attitudes, I would highly recommend the podcast, and would recommend recommending the podcast.

Could it be better?


While I think Lexicon Valley has done some great work so far, I don't think it has yet provided coverage of linguistics in quite the way Pesetsky dreams of. So far, they've mostly covered topics that are reactive to popular gripes or misconceptions about language. In some respects, it'd be hard for them to do otherwise, because the popular understanding of language science is far below that of almost any natural science, or so it seems from this angle.

I hope, though, that they might find a way to approach linguistic topics which are not just reactive. Just addressing the idea that there are functional elements which have no phonological realization would be enormous. Garfield could play the skeptic, believing that what you see is what you get.

So linguists, listen in, get a feel for the show, and maybe if you have a topic which could be nicely formatted into a 20 minute conversation, send it in to them!

Sunday, February 3, 2013

Does language "cool"?

A few months ago, I posted about how I was relatively unimpressed by a paper arguing that the observed Zipfian distribution of words in a corpus is due to "preferential attachment" aka the Matthew Effect aka the rich get richer. The author of that paper is apparently also a co-author of a paper called "Languages cool as they expand: Allometric scaling and the decreasing need for new words." The writeup in Inside Science summarizes it like this:
[A] recent analysis has found that as a language grows over time, it becomes more set in its ways. New words are always being added, according to this study, but few become widely used and part of the standard vocabulary.
My linguist hackles immediately went up at this statement, and that's because there is a large and fundamental difference between what a linguist understands the term "language" to refer to, and what the authors of the column and paper understand it to refer to. What the physicists and the reporter mean by "language" is roughly "a set of words," and in the context of the paper, they almost seem to mean "the set of words which have been published."

This "language is words" axiom is part of most people's folk linguistics that we have to train people out of when they take Intro to Linguistics. That's why it's a little hard to take the work of these physicists seriously at first glance. It is as if they were trying to write a serious paper on biological evolution with the assumption that traits acquired by an organism during its life were inheritable.

But there is an aspect of linguistic knowledge relating to the set of words and morphemes a speaker knows, which linguists call the "lexicon". So, I'll just go ahead and reread the paper mentally replacing each instance of "language" with "lexicon" in order to get through it.

Overall Thoughts

This paper seems to be a relatively competent (modulo Mark Liberman's concerns about OCR errors) description of the statistical properties of large corpora. But that's really as far as I think any of the claims can go. I am totally unconvinced that their results shed any light on language change, development, evolution, etc. I'm not even sure that the simplest statement that "the lexicon of languages has grown over the past 200 years" can be supported by the results reported.

The key problem that I see with the paper is the conflation of "new to the corpus" and "new to the lexicon." Here's how the problem of sampling language was described to me; I believe it goes back to Good (1953) and is key to Good-Turing smoothing. Say you are an entomologist working in a rain forest, trying to make a survey of insect life. You put out your net for a night to collect a sample, then count up all the species in your net. Some bug species are going to be a lot more frequent than others. You'll have some species that show up many times in the net, but even more species will show up with only one member. Now, let's say that you come back to the same rain forest two years later and repeat the sample. You are nearly guaranteed to observe new species in your net this time around, but the key question is whether they are just new to the net, or new to the rain forest. If they're new to the rain forest, did they migrate in, are they hybrids of two other species, or has a species you saw previously evolved so rapidly that you're now seeing it as different?

These are really interesting and important questions for our entomologist to answer, but you cannot arrive at a definitive answer based simply on the fact that this new species has now shown up in your net. In fact, depending on a few factors, the answer with the highest probability is that the new species is simply new to your net. The Good-Turing estimate of the probability that the very next bug you catch will belong to a new species is roughly the proportion of bugs you've already caught that belong to species you've seen only once.
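If you want to play with the idea, the estimate is simple to compute. Here's a minimal sketch, with an obviously made-up net:

```python
from collections import Counter

def p_next_is_new(sample):
    """Good-Turing estimate of the probability that the next observation
    belongs to a previously unseen species: the proportion of observations
    so far that belong to singleton species."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)

# A made-up night's catch: two common species, three singletons.
net = ["a"] * 5 + ["b"] * 2 + ["c", "d", "e"]
print(p_next_is_new(net))  # 3 singletons / 10 bugs = 0.3
```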

The situation gets even more confusing if you come back to the same rain forest two years later with a net twice the size.

The paper has a figure plotting the increase in lexicon size over time. My first thought when I saw it was that the overall size of the corpus at each time point must also be going up. Coming back to the entomologist in the rain forest, the number of species in his net is merely a sample of how many species there are in the forest. In exactly the same way, the number of words in a lexicon can only be estimated from the words which people happened to write down. As you increase the size of the net, you're going to find more species which were already in the forest, but not yet in your net. As you increase the size of your corpus, you're going to find more words which were already in the lexicon, but not yet in the corpus.
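You can watch this happen with a toy simulation: draw tokens from a completely fixed Zipfian lexicon, and the number of observed word types still grows as the sample gets bigger. (The lexicon size and the 1/rank weighting below are arbitrary assumptions of mine, not anything from the paper.)

```python
import random

random.seed(1)
lexicon = list(range(1, 50_001))       # a FIXED lexicon of 50,000 word types
weights = [1 / r for r in lexicon]     # Zipfian: p(rank r) proportional to 1/r

for n in (1_000, 10_000, 100_000):
    sample = random.choices(lexicon, weights=weights, k=n)
    # A bigger net catches more types, even though the lexicon never changed.
    print(f"{n:>7} tokens -> {len(set(sample)):>6} types")
```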

Now, you need to add to this that at any given point in time, the true maximum number of possible words you could potentially observe in any given language is ∞. Yes, in fact, the whole reason language is interesting to study is that given a finite set of mental objects and a finite set of operations to combine them, you can come up with an infinite set of strings, and that goes for words too, not just sentences. In 1951, "iPod" was a possible word of English; it just wasn't used, or at least not for the same purpose it is now.

Regarding the question of whether the "active" (as I'll call it) lexicons of languages have grown over the past 200 years: well, indeed, the overall number of printed words has also increased. Almost all of their results seem to have more to do with the technological development of publishing than with any other linguistic or cultural development. It is as if the entomologist said that over the past decade the biodiversity in his rain forest has exploded, when really what's going on is that his nets have been getting progressively larger.

Now, it might be the case that the active lexicon has grown more than would be expected given the increase in the size of the corpus year over year, but as far as I can tell, the authors did not try to estimate whether this was the case.

What about this cooling down?

The "cooling" effect referred to by the paper is the suggestion that as a language "grows" (which as I just said is dubious), the frequency with which particular words are used becomes more stable. Some words are more frequent than others, but words are less likely to move up and down in frequency over time/as the lexicon grows. Back to entomology, the suggestion is that as more species cram into a rainforest, each species is less likely to become more or less populous.

Again, though, the frequency, even the relative frequency, of a word in a corpus is merely an estimate of its true frequency. As the size of the corpus increases, so should the reliability of its frequency estimates, and we would predict decreasing volatility in those estimates. The authors check for this, and find exactly this relationship between corpus size and frequency volatility, but I can't tell whether there was any excess "cooling" left over. I wish they had said, "there was x proportion of cooling left unaccounted for after controlling for the size of the corpus," but I think this is perhaps another symptom of the corpus = the lexicon = the language assumption that I complained about before.
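The sampling-error point is easy to demonstrate: hold a word's true frequency perfectly constant, and its estimated frequency will still bounce around more in small corpora than in large ones. A toy simulation, with a frequency and corpus sizes I made up:

```python
import random
import statistics

random.seed(1)
true_freq = 0.001  # the word's TRUE relative frequency, held constant

for corpus_size in (10_000, 100_000):
    # Estimate the word's frequency in 30 simulated "years" of corpora.
    estimates = [
        sum(random.random() < true_freq for _ in range(corpus_size)) / corpus_size
        for _ in range(30)
    ]
    # Volatility shrinks roughly like 1 / sqrt(corpus size).
    print(corpus_size, statistics.stdev(estimates))
```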

The Allure of Big Data

The reporter who wrote the Inside Science article did what it appears the editors of Scientific Reports did not: asked a linguist to comment on the paper. Bill Kretzschmar was "underwhelmed," saying that most of these results are not new to linguists. I would take this as a word of warning about the allure of big data. The results discussed in this paper are not, by and large, new; they have just never been established with data of this scale. But unfortunately, a fact which is already known does not get more interesting when it is reestablished with data 100 or 1000 times larger than before.