Sunday, July 31, 2011

A Review of Project Nim

Project Nim is a new documentary out about the life of Nim Chimpsky, the chimpanzee that a group of researchers at Columbia tried to teach sign language. Here's a brief synopsis.
"Let's take a chimpanzee, put it in a house in the upper west side with a psychoanalyst who doesn't know anything about chimpanzees, language, language acquisition, or sign language. Also, she has 7 other children in that house. What could go wrong?"

To put Project Nim in some perspective, Nim Chimpsky was born in 1973, which is two years after the Stanford Prison Experiment, and one year before the first legislation requiring Institutional Review Boards for institutions carrying out human subjects research. This is not to say that most social science research was so by-the-seat-of-their-pants back then, but it was a different time.

I came away from this film with a few different lessons.

Don't sleep with your advis(or/ee).

Just don't do it. Twice in the film, two different interviewees said about two different sexual entanglements, "I don't think it affected the science." But, as I heard Christopher Hitchens once say about interview subjects, a guilty mind wants to confess.

The movie starts out with Nim being placed in the home of Stephanie LaFarge to be raised as a human child. Stephanie had 3 children of her own, and her husband had 4, bringing the total residency of her Manhattan brownstone to 7 human children, 2 adults, and 1 baby chimp. This frankly sounds a lot more like a reality TV show than a scientific experiment. Add to that the fact that they gave baby Nim alcohol and pot, and that Stephanie breastfed Nim, and I'm not sure MTV could even air it.

Why on earth was Stephanie LaFarge recruited to be Nim's mother? As far as I can tell, her only qualification was her sexual history with Project Nim PI, Herb Terrace. Her graduate degree was in psychoanalysis. She had no experience with chimpanzee research, or language research of any kind, and in fact, she was hostile to the scientific goals. She wouldn't keep logs, didn't have a project plan, and eventually tried to restrict the other researchers' access to Nim.

The second affair which came up was, again, between the PI, Herb Terrace, and the head teacher on the project, who was only an undergrad at the time. The fallout of this brief relationship led to the head teacher leaving the project.

First of all, I just don't think it's possible to pursue a relationship between a professor and an advisee (especially an undergraduate) in an ethical way. Given the power dynamic, some form of coercion is nearly impossible to avoid. I feel a little uneasy saying so in a public forum, which I think goes to show that this is not a problem that academia has left behind in the '70s.

Secondly, all sorts of strange and bad things happened to the science because of these relationships. Nim would never have had such a strange early childhood, and would have had greater constancy of care on the project, if the PI had not pursued inappropriate relationships.

Beware those with media savvy.

One frequently hears that scientists in general, and linguists in particular, don't do enough to popularize their research. Occasionally, we are scolded for holing up in our ivory towers, supposedly too arrogant to share our love of science broadly.

However, I think Project Nim has a lot to say about the perils of researchers who are a little too keen to popularize their research. One of the ASL teachers on the project described Herb Terrace as an "absentee landlord," who only showed up for photo-ops and media interviews. All in all, the project appears to have been planned far better from a media perspective than from a research perspective.

In case you were unaware, research, even really cool and good research, doesn't just show up on TV out of nowhere. It takes deliberate attempts on the part of the researcher or the university to drum up attention. And everything about this project seems perfectly constructed to be media fodder.

In the meantime, there were serious problems with the project, mostly having to do with Nim mauling research assistants, which Herb Terrace didn't really address, and which he had a hard time recollecting in the documentary interviews. After the most serious incident, in which Nim nearly bit through an interpreter's face, Terrace's reported reaction was worry that she would sue him, or that "it would get out."

It was a little hard for me not to think of Marc Hauser during the movie, another high profile non-human primate researcher who has recently fallen on hard times due to questionable ethics. The connection between Terrace and Hauser is tenuous, but they run together in my mind, I guess, because they both worked hard to popularize their research.

And this is why I, at least, am frequently wary of active researchers who are also active popularizers of their own research. It seems almost synonymous with sloppy research and compromised ethics in my mind.

Humans are not socialized chimpanzees

This certainly isn't a new lesson for me, because I've never really thought that humans are just socialized chimpanzees. However, I really like how this point was hammered home in a real way.

In discussions about "human nature," the notion that our "true" nature is somehow more brutish and violent seems to come up a lot. In this conception, society is merely a veneer over top our inner chimp.

Well, society didn't do too much to cover over Nim's external chimp. Our "true" human nature is manifest in the activity of all humans, meaning it must be very broad, and non-uniform, but non-arbitrary at the same time.

Interestingly, I've also heard of research trying to figure out if dogs are just socialized wolves. A bunch of researchers tried to raise wolf pups as if they were dogs, a much more achievable task, I think, than raising a chimp as a human. The results were much the same as for Nim. After infancy, the wolves went nuts and tore the place apart, and the experiment had to be abandoned.

Conclusion

I really liked the movie, and would suggest it to anyone who appreciates a good documentary.

Tuesday, July 26, 2011

The Philadelphian Dialect is Punk Rock

Before there were YouTube and the accent meme, there was, I guess, punk rock.

In this music video from 1988, the Dead Milkmen, a Philadelphia-area punk band, give a rather hyper-Philadelphian performance. For the most part, Philadelphians aren't that aware of what marks their dialect as distinct from other regions, nor are most non-Philadelphians aware that there is a unique Philadelphia dialect.

Now, I say hyper-Philadelphian for a few reasons. The lead singer on this song, Joe Genaro, is definitely a Philadelphia dialect speaker, born about an hour outside of the city in Wagontown, PA.



But, local dialect features are one of those things that tend to get leveled a little when singing, and there is no hint of that in this performance. Some features even seem exaggerated to me, which is fitting, since the video was shot in Philadelphia and the lyrics reference culturally relevant locations.

So here is Punk Rock Girl. Dialectal analysis immediately follows.


/ow/ fronting

/ow/ fronting is, perhaps, the most salient dialect feature on display in this song. It's certainly not unique to Philadelphia. In fact, it's what qualifies Philadelphia as the Northern-most Southern city. While Philadelphia has many other Northern features, like a very raised /ɔ/, stereotyped in coffee talk, we depart from the rest of the North by fronting /ow/, and Joe Genaro does this to an extreme degree in this song. Right off the bat at 0:28, he says
And she almost knocked me dead.
Then he immediately follows this up with
I tapped her on the shoulder
And said do you have a beau?
She looked at me and smiled and said she did not know

In fact, all of his /ow/s in this song are incredibly fronted, except for the two tokens in rollin and stolen, which, of course, are affected by the following /l/.

Canadian Raising

The song isn't filled with Canadian Raising tokens. In fact, there are only two, but one of them is so stressed and clear and wonderful. At 1:01, the waitress says
Well no, we only have it iced.

Canadian Raising continues to be a favorite variable of mine.

Short-a pattern

Philadelphia is known for its complicated pattern of tensing /æ/, which is similar to New York City's. The tense version pops up, as expected, in
0:46
Punk rock girl
Give me a chance
Punk rock girl
Let's go slam dance
and
1:54
We went to a shopping mall
And laughed at all the shoppers
and
2:01
We asked for Mojo Nixon

Unfortunately, mad, bad and glad, which are exceptionally tense, don't appear anywhere in the song. However, at 1:29, he says dad, which is definitely lax as expected.

Tokens of /æ/ which are lax in Philadelphia, but tense in many other dialects, show up in
1:03
So we jumped up on the table and shouted anarchy
and
1:24
Her father took one look at me and he began to squeal
and
2:26
Eat fudge banana swirl

/ey/ split

This one is pretty subtle. Most of his tokens of /ey/ don't sound very different from standard, but one word final token at 1:15 is pretty low, almost [æɪ].
On such a winter's day.

Data suggests that all /ey/ used to have this quality in Philadelphia, which is another reason why it's related to the Southern and Midland dialects. A sound change has been raising /ey/ higher and higher, but not in word final position.

on = dawn

Philadelphia maintains the distinction between cot and caught by raising the vowel in caught, similar to New York City. One way in which Philadelphia differs from New York City is in the vowel class of the word on. In most locations North of Philly, on rhymes with the man's name Don. But in most locations South of Philly, at least where a contrast is maintained, on rhymes with the woman's name Dawn. You can hear this in
0:38
I tapped her on the shoulder
1:03
So we jumped up on the table and shouted anarchy
And someone played a Beach Boys song on the jukebox
It was "California Dreamin"
So we started screamin
On such a winter's day

l-vocalization/darkening

Now, if you think you can reliably code l-vocalization embedded in a punk rock song, god bless you. But, there are a few tokens that are pretty clear. For instance, I don't think there's any /l/ in
0:38
I tapped her on the shoulder

The thing that makes Philadelphia pretty unique is our tendency to darken and vocalize /l/ intervocalically (so balance is pretty confusable with bounce) and in initial clusters (like cluster). I don't want to make any strong claim about being able to reliably hear it in this song, but listen to
2:12
We got into her car away we started rollin
I said how much you pay for this
Said nothin man it's stolen
and compare it to
0:49
Let's go slam dance

There is definitely not as much /l/ in rollin and stolen as there is in let's.

* * *

So, do you think I missed anything important? As a side note, I think I have the same shirt as the drummer.

Tuesday, July 19, 2011

Language Change, Animated

This is the visualization of language change that I've always wanted to produce! And now that I've made it, there are all sorts of aesthetic things I'd like to change, but c'est la out-of-the-box-tools-from-google!

I should note that the data underlying this graph would not exist but for the sweat, blood and tears of Bill Labov, Ingrid Rosenfelder, a team of undergraduate transcribers, the NSF, and 3 decades' worth of graduate research teams.

Depicted below is data from the in-development Philadelphia Neighborhood Corpus. We have analyzed 235 speakers who were interviewed as part of the Researching the Speech Community course between 1973 and 2010. That gives us dates of birth between 1889 and 1991, a 102 year timespan! Actually, raw data isn't depicted. Rather, it's the smoothing curve that I fit to F1 and F2.
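For the curious, the smoothing step could look roughly like the sketch below in R. This isn't the actual analysis: the data frame and its column names are made up for illustration, and plain loess stands in for whatever smoother was actually used.

# A minimal sketch: smooth F1 and F2 over date of birth for one vowel class.
# The data frame `pnc` and the column names (dob, F1, F2, vowel, sex) are assumptions.
smooth_vowel <- function(d) {
  dobs <- seq(min(d$dob), max(d$dob))
  data.frame(
    dob = dobs,
    F1  = predict(loess(F1 ~ dob, data = d), data.frame(dob = dobs)),
    F2  = predict(loess(F2 ~ dob, data = d), data.frame(dob = dobs))
  )
}

# e.g. the smoothed trajectory of pre-voiceless /ay0/ for men,
# one row per year of birth, which is what the animation steps through:
# ay0_men <- smooth_vowel(subset(pnc, vowel == "ay0" & sex == "m"))

A table like that, with date of birth as the time variable, is all an out-of-the-box motion chart needs.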

Hit play to watch it go. You can select particular vowels, and toggle on and off trails. You can also adjust how the bubbles are colored in the top right corner.



The particular vowels on display are /ay/ and /ay0/. /ay0/ is the pre-voiceless allophone, a personal favorite, and look at that thing go! I've also split up men and women, since that has been an important factor in this particular change. The other vowels are there just for context, and are held at fixed points.

Not displayed is the extreme uniformity of this change across speakers. This thing is changing fast, and everyone in our corpus is marching along in surprising uniformity. Can you say "speech community" anyone?

Saturday, July 16, 2011

More Dialects and Communication Density

I'm not sure if it was there before, but there's a tab on the Senseable City Lab's Connected States of America page with some of their data. Specifically, they provide an .svg of the United States with ID numbers which are cross-referenced to .csv files, which label the calling and sms-ing communities. Hopefully, they'll also publish rawer data eventually.

Data munging

So, I took some of the Atlas of North American English data which labels cities and their dialect classification. I don't think I'll look at finer-grained ANAE data, like particular vowels' quality, because I don't think that would work too well with the granularity of the data available from Senseable. I had to associate city names with counties to merge the data with the .svg, and thankfully Google Refine + Freebase was able to get me 2/3 of the way there. There are a few strange errors in the .svg file that no amount of automation was going to get around ("Orandge County, FL"? Really?). I also pulled the coordinate data out of the .svg so that I could do this all in R, which is where I feel the most comfortable.
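Once cities were matched to counties, the merge step could look roughly like this in R (a sketch; the file and column names are made up for illustration):

# A sketch of the merge step, with made-up file and column names.
anae  <- read.csv("anae_cities_with_counties.csv")   # city, county_id, dialect
calls <- read.csv("calling_communities.csv")         # county_id, community
sms   <- read.csv("sms_communities.csv")             # county_id, community

dialect_calls <- merge(anae, calls, by = "county_id")
dialect_sms   <- merge(anae, sms,   by = "county_id")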

For the ANAE data, I collapsed some sub-dialects together, like Inland North and North, and Inland South and South.

Mis-match Measure

So, I have counties with dialect classification, and counties with calling and sms-ing classifications. I want to come up with a way of evaluating the mis-match between these. Here's a sketch of how I did that.

for D in Dialects:
     for C in Calling_Communities:
          Within = D ∩ C
          Outside = C - D
          ratio(D,C) = |Outside| / |Within|

So, "Within" is the set of counties that are both in dialect D and calling community C. "Outside" is the set of counties that are in calling community C and in some other dialect than D. You might have thought that I'd also include the set of counties that are in dialect D and in some other calling community than C, but that's actually not so important. As I said before, these dialect regions are rather large, so I'd expect there to be many calling communities within one dialect. What's stranger is calling communities which span dialects.

So, for interpreting the ratio, as it reaches 0 or ∞, the fit between dialects and calling communities is pretty good. At 0, a calling community is contained entirely within a dialect. As it approaches ∞, a dialect is more and more marginally part of a calling community.

Next, I took abs(log(ratio(D,C))). Now I have a measure that runs from 0 to ∞, and the closer it is to 0, the bigger the mismatch. I also wanted to boost the match score of smaller dialect regions. I forget why, but it made sense at the time. So, I weighted these absolute log ratios by 1/|D|.
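Put together, the whole score could be computed roughly like this in R. This is a sketch, not the actual code: dialects and communities are assumed to be named lists of county IDs, and only pairs that share at least one county get scored.

# A sketch of the weighted mismatch score, with assumed data structures.
# `dialects` and `communities` are named lists of county-ID vectors.
dialect_match_scores <- function(dialects, communities) {
  sapply(dialects, function(D) {
    scores <- sapply(communities, function(C) {
      within  <- intersect(D, C)      # counties in both the dialect and the community
      outside <- setdiff(C, D)        # counties in the community but in another dialect
      if (length(within) == 0) return(NA_real_)  # skip non-overlapping pairs
      ratio <- length(outside) / length(within)
      abs(log(ratio)) / length(D)     # absolute log ratio, weighted by 1/|D|
    })
    median(scores, na.rm = TRUE)      # median score per dialect, as in the tables below
  })
}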

Results

Here are the median results per dialect compared to calling communities, from best to worst match:

  1. West - ∞
  2. St. Louis Corridor - 0.45
  3. Florida - 0.35
  4. Western New England - 0.19
  5. Eastern New England - 0.08
  6. Western PA - 0.07
  7. Texas - 0.06
  8. South - 0.03
  9. North - 0.02
  10. Midland - 0.01
  11. Mid-Atlantic - 0
  12. NYC - 0
And for the sms data:
  1. West  - ∞
  2. South - ∞
  3. St. Louis Corridor - 0.5
  4. Florida - 0.34
  5. Eastern New England - 0.17
  6. Western New England - 0.15
  7. Western PA - 0.07
  8. Texas - 0.06
  9. Midland - 0.05
  10. North - 0.02
  11. Mid-Atlantic - 0
  12. NYC - 0
I wouldn't put too much stock in the Mid-Atlantic and NYC scores. To a large degree this is due to them cannibalizing each other, and they're not that different dialectally anyway.

What's really interesting is the poor Midland and Northern scores. While I haven't worked out a measurement for which dialects are most mixed within calling communities, I suspect their poor scores are related to each other. 

Graphs!

In this first graph, each facet is for a calling community in which there is a Northern dialect county. The filled in bits are the counties which are within the calling community, and the colored counties are ones we have dialect data for.

Calling data
In 4 out of 7 calling communities in which there is a northern dialect county, there is also a Midland dialect county. That's basically along the entire border region between the two dialects.

Here's the same graph for sms-ing communities.
SMS data

Conclusions

Yup, these communication communities don't line up with dialect boundaries like you'd expect.

Monday, July 11, 2011

Communication Density and Dialect Boundaries

One linguistics topic which non-specialists are almost always interested in is dialect geography, and I don't think that's strictly due to their desire to have regional biases confirmed. It seems like almost everybody has a genuine interest in where and how people speak differently from themselves. Granted, once you move away from fairly shallow lexical differences into phonetic and phonological ones, a lot of people's eyes glaze over.

When it comes to explaining why dialect boundaries are in one place, rather than another, dialect geographers tend to have two answers. First, different regions have different historical settlement patterns. Bill Labov frequently points out that the current phonological boundary between the North and the Midland in the United States coincides with the boundary between where log cabins were built versus A-frame houses, which itself coincides with two different immigration streams with different points of origin on the East coast.

Second, there are differential rates of communication between regions. Language appears to be transferred crucially by face-to-face communication. If two regions have stronger ties of communication between themselves than with other regions, then we think they're probably going to have more similar dialects. This was basically Keelan Evanini's argument about why Erie, PA has a Western Pennsylvania dialect, even though it had historically been part of the North.

Given this second hypothesis about why dialect boundaries exist where they do, I was pretty excited to see these results coming out of the Senseable City Lab, which, in collaboration with AT&T and IBM Research, has produced maps illustrating how US counties cluster together in terms of cell phone traffic and sms traffic.

The lines between communication clusters are exactly those that I would expect to define dialect boundaries. So, I took the call and sms community maps, and superimposed the major dialect boundaries from the Atlas of North American English. Here are the results.

Communication clustering by Calls

Communication Clustering by SMS

Honestly, I'm a little disappointed with the outcome. I expected that very large dialect regions, like the West and the South, would contain many different communication clusters, so that's fine. Where both a dialect boundary and a communication boundary line up with a state boundary, I don't think it should be counted as an alignment. If there's any tendency for people to be more likely to move within state lines than across state lines, then this alignment along state lines is probably better explained by the first factor, settlement history, than by communication density.

The crucial place to look for an alignment between communication and dialects seems to be the Ohio, West Virginia, Pennsylvania trifecta. In neither map does it look like communication density lines up quite right. Certainly, Pennsylvania is cut in half into a Western and an Eastern region, but it seems like the Western PA dialect extends further East, almost to the threshold of Philadelphia.

Ohio doesn't seem to be sliced up quite right either. In the calls data, Cleveland clusters with the rest of the state, while with the SMS data, it clusters with Western PA. Dialectally, Cleveland is neither like the rest of Ohio nor Western PA. Rather, it is more similar to Toledo and Detroit to the West, and Buffalo to the East.

There are other unfortunate non-alignments, like how Baltimore is clustered with Virginia, while dialectally it's more similar to Philadelphia, and New England isn't chopped up communicationally the way it is dialectally.

I'll conclude by saying that first, pat answers to explain natural phenomena don't always work out, and second, these communication clusters make some dialect boundaries pretty mysterious. If everyone in Ohio is clustered together into a cell phone calling community, then why don't they all talk the same? The answer to this probably has to do with a third factor: meaningful social divisions which are distinct from communication divisions, but remember what I said about pat answers?

Sunday, July 10, 2011

Estimated international population of gay men

I recently learned about the "fraternal birth order effect," where apparently for every older brother a man has, his probability of being gay as an adult increases. Here's a wikipedia entry.

Now, apparently there's some debate over how real or how strong this effect is, so I'm almost certainly taking some numerical result a little too seriously. But, it occurred to me that data such as total fertility rates and birth sex ratios are available as international statistics. If this fraternal birth order effect is pretty strong and reliable, you should be able to estimate what percent of the male population of a country is gay.

So, I grabbed some data on international total fertility rates from here, and data on birth sex ratios here. Now, I have to make some assumptions. First, all of these calculations take the average total fertility rate as a country-level descriptor, but every country almost certainly has its own distribution of fertility rates across women. Second, I have to treat the probability of having a male baby as being independent from the sex of the prior babies a woman has had. Third, and most importantly, I'm treating fraternal birth order as the only determinant of sexual orientation.

These are all pretty drastic assumptions. For instance, there's some evidence that my second assumption (birth sex of babies from the same mother are independent processes) is false. From the UN data I have, here's the total fertility rate of the country by the sex ratio:


This seems to suggest that as women have more babies, they're more likely to have girls. Note: I've left out data from four countries with highly skewed birth sex ratios, since these countries apparently have high rates of abortion of female fetuses.
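(The plot itself is nothing fancy; a base R call roughly like the sketch below, with made-up file and column names, would produce it.)

# A minimal sketch of the plot above; the file and column names are assumptions.
un <- read.csv("un_fertility_and_sex_ratio.csv")   # country, tfr, sex_ratio
plot(sex_ratio ~ tfr, data = un,
     xlab = "Total fertility rate",
     ylab = "Birth sex ratio (males per female)")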

So, I'm thinking about this as a very rough back of the envelope estimate, not to be taken too seriously, but maybe some sort of indicator of the shape of the world.

Here's the math:

  • babies = 1, 2, ... total.fertility.rate
  • boy.probability = male.ratio/2
  • boy.babies = boy.probability^(babies)
  • prob.gay.first.born = 0.12 (more on this below)
  • prob.gay.n.born = prob.gay.(n-1).born * 1.3 (from wikipedia)
  • prob.gay = sum(prob.gay.1-to-n.born * boy.babies)
I hope that makes some sense. I grabbed 1.3 from wikipedia, which says "each older brother increases a man's odds of developing a homosexual orientation by 28–48%." I basically made up the probability that a first born son is gay. This was the one number that I couldn't seem to find, so I adjusted and played with it until the predicted percent of gay men in the United States was about 10%.
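To make the arithmetic concrete, the calculation could be sketched in R roughly as below. Rounding the fertility rate down to a whole number of births is a simplification for the sketch, and 0.12 and 1.3 are the numbers discussed above.

# A rough sketch of the back-of-the-envelope estimate described above.
# tfr = total fertility rate; sex_ratio = male births per female birth.
estimate_prop_gay <- function(tfr, sex_ratio, p_first = 0.12, step = 1.3) {
  births <- seq_len(floor(tfr))               # simplification: whole number of births
  p_boy <- sex_ratio / 2                      # the approximation from the list above
  all_boys <- p_boy ^ births                  # P(babies 1..n are all boys)
  p_gay_nth <- p_first * step ^ (births - 1)  # P(nth-born son is gay)
  sum(p_gay_nth * all_boys)
}

# e.g. a country with a fertility rate of 2 and a birth sex ratio of 1.05
# comes out at roughly 10%:
# estimate_prop_gay(2, 1.05)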

Here are my results for the top 10 countries for percent of gay men.
  1. Afghanistan (19%)
  2. Niger (18%)
  3. Liberia (18%)
  4. Mali (18%)
  5. Nigeria (18%)
  6. Burkina Faso (17%)
  7. Guinea (17%)
  8. Yemen (17%)
  9. Iraq (17%)
  10. Uganda (17%)
Unsurprisingly, the percent of gay men in a country is highly correlated with total fertility rate. I think this top 10 list highlights the importance of gay rights activism in Africa, especially in Uganda, which is considering making homosexuality a capital offense. 

And for the self-obsessed, the United States looked like this:
  • A smaller percentage of gay men than 100 countries, tied with 17 countries, and a larger percentage than 43.
