Monday, February 15, 2016

The Sound of Silence

One counterintuitive thing about doing linguistic analysis is how much time we spend analyzing structures and elements that are silent. That's probably pretty confusing for people when they first learn about it. For example, compare these three sentences:
  1. I want a cookie.
  2. I ate the cookie.
  3. I ate cookies all day.
In your grammar lessons in school, you probably learned that the words a and the were called "articles," but linguists usually call them "determiners." Most people describing the sentences in 1, 2 and 3 would say something like:
"Sentences 1 and 2 have determiners in them. Sentence 1 has a and sentence 2 has the. Sentence 3 doesn't have a determiner."
A linguist, on the other hand, when describing these three sentences would be likely to say:
"All three sentences have determiners. Sentence 1 has a, sentence 2 has the, and sentence 3 has a silent determiner."
Silent determiners are just scratching the surface of all of the possible silent words linguists have postulated. A really common kind of reaction to all the silent words is "Bullshit!" Actually, that was my reaction when I took my first syntax course, but eventually, I was convinced.[1] For a lot of the silent elements linguists have proposed, there's usually some good reason or evidence for doing so, but we do have to be careful not to over-hypothesize silent elements to make the data work. This is the really interesting tension of abstractness.

Here's an example I'm dealing with in my own work, involving how Philadelphians have traditionally pronounced their short-a in words like mat and man. Usually, Philadelphians have a "tense" or "nasal" sounding short-a when the sound following is an /m/ or an /n/. For example, man is tense, but mat is lax. But this only happens if the /m/ or /n/ is in the same syllable. So the word ham comes out tense, but the word hammer comes out lax, because the /m/ is in the next syllable, if you sound it out (ha-mer). Plan is tense, but planet is lax (pla-net).

One weird exception to this pattern is the word exam. Exam usually comes out lax, even though the /m/ is in the same syllable when you sound it out (ihg-zam). One way of analyzing this is just to say "Ok, exam is just an exception to the rule." But, I'm going to make a different argument, which is that every time a Philadelphian says exam, they're actually saying an abbreviated form of examination, and in examination, the /m/ is not in the same syllable (ihg-za-mih-ney-shun). So the word exam is lax, because it's really examination.
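Here's a rough Python sketch of the syllable rule as I've described it, just to make the logic concrete. It's a toy: the function name and the "sound it out" spellings with hyphens are my own illustrative assumptions, and it only covers the simplified /m/ and /n/ condition from this post, not the full Philadelphia system.

```python
NASALS = ("m", "n")

def predict_short_a(sounded_out):
    """Predict 'tense' or 'lax' for the short-a in a hyphen-syllabified word.

    Assumes the short-a is the first letter "a" in the sounded-out
    spelling (e.g. "ha-mer", "ihg-zam"). Purely illustrative.
    """
    for syllable in sounded_out.split("-"):
        pos = syllable.find("a")
        if pos == -1:
            continue
        # The sound right after the "a", but only within the same syllable.
        following = syllable[pos + 1:pos + 2]
        return "tense" if following in NASALS else "lax"
    raise ValueError("no short-a found in " + repr(sounded_out))

for word in ["man", "mat", "ham", "ha-mer", "plan", "pla-net",
             "ihg-zam", "ihg-za-mih-ney-shun"]:
    print(word, predict_short_a(word))
```

Fed the surface syllabification ihg-zam, the rule predicts tense, which is exactly the mismatch with how exam is actually pronounced. Fed ihg-za-mih-ney-shun, it predicts lax, which is what the abbreviation analysis needs.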

One objection to this argument is that examination looks like it's exam+ination, so how do we know that what people are saying is examination and not just exam without -ination added to it? Well, let's look at some other words that end in -ination. We have imagination. If we abbreviate imagination, like "I've totes got a wild imag," we get out ih-majh which means the same thing as imagination, and it has stress on the second syllable. This looks like exam (ihg-zam), which means the same thing as examination, and has stress on the second syllable. What happens if we never add -ination to imagination? We get out image (ih-mijh), which doesn't mean the same thing as imagination, and has stress on the first syllable. That doesn't look like exam at all. This makes it seem more likely that exam is an abbreviation like "I've totes got a wild imag", and is not just a bare word, like image.

But there is no word ehg-zum, so does that mean that the root exam can only ever appear attached to -ination? That's not so weird if you look at other words that end in -ination, like destination. Dest never shows up as its own word. But it seems to show up in related words like destined or destiny.

There's also a little bit of historical evidence that exam was originally an abbreviation. If you look at some of the earlier examples of exam in Google books, authors used to end it in a period, like you would for any abbreviation. For example, J.M. Barrie (more famous for writing Peter Pan) in his 1889 book An Edinburgh Eleven wrote:
I knew a Snell man who was sent back from the Oxford entrance exam., and he always held himself that the Biblical questions had done it.
That period followed immediately by a comma makes it pretty clear that in this sentence, exam is an abbreviation. And if you look at Google ngram patterns, exam seems to be increasing in frequency relative to examination.

So exam looks like it was historically an abbreviation of examination, and the fact that Philadelphians have traditionally pronounced it with a lax short-a suggests that it still is an abbreviation for us, unconsciously.

This is just one example of how language is not a "what-you-see-is-what-you-get" game. There is a lot of silent structure to language that we're not consciously aware of. It takes a mixture of clever reasoning and empirical data to work it out.

But I'm guessing there's still a few people saying "Bullshit!" out there.
1. I actually once wrote a post on my undergraduate blog about how I thought PRO was absurd, and Norvin Richards left some really nice and patient comments on it. I'm kind of embarrassed about that now.

Thursday, October 29, 2015


I'm just recently back from the New Ways of Analyzing Variation (NWAV) conference, the top variationist sociolinguistics conference in North America. This was NWAV44, hosted by the University of Toronto, and as usual it was a lot of fun, and really exhausting.

When you go to a conference regularly enough, you start making conference friends, people who you only know and really only ever see in the context of the conference. Seeing my conference friends again is always one of the things I look forward to when going to NWAV, and it's what makes it all the more disappointing when I can't go for some reason or another. Of course, nowadays, if you can't make it to NWAV in person, you can always follow along at home on the Twitter hashtag. Really, it's almost like a parallel conference going on on Twitter. I go to a lot of conferences that get tweeted about, but I feel like NWAV tends to have a much higher rate of twitter traffic, and this was especially true of NWAV44. The Manchester LEL twitter account speculated that this might be the most tweeted about conference of all time.

It definitely spawned the most parody accounts (@GreatHallBird, @nwavAVghost). But, I thought I'd take this up and pick over the Twitter traffic on the #NWAV44 stream. I pulled down all the tweets I could with the twitteR package, excluding retweets.

In the 4 days of the conference, there were 3,196 tweets on #NWAV44. Here's the distribution of tweets, binned into 10 minute intervals, color coded by what was happening at the conference at the time (according to the official schedule).

The highest number of tweets in any 10 minute period was 53, for a rate of 5.3 tweets a minute, during the second paper session on Saturday.
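For anyone who wants to try this kind of binning on their own tweets, here's a minimal Python sketch. It is not the code I used (that was in R, with the twitteR package); the function names are made up for illustration, and it assumes you already have a list of tweet timestamps as datetime objects.

```python
from collections import Counter

def bin_tweets(timestamps, minutes=10):
    """Count tweets per interval by flooring each timestamp.

    Assumes `minutes` divides 60 evenly (10 does).
    """
    counts = Counter()
    for ts in timestamps:
        floored = ts.replace(minute=ts.minute - ts.minute % minutes,
                             second=0, microsecond=0)
        counts[floored] += 1
    return counts

def peak_interval(counts, minutes=10):
    """Return (interval start, tweet count, tweets per minute) for the busiest bin."""
    start, n = max(counts.items(), key=lambda item: item[1])
    return start, n, n / minutes
```

A bin holding 53 tweets comes out as 53 / 10 = 5.3 tweets per minute, which is where the figure above comes from.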

There wasn't much tweeting during the poster session, which is too bad. First of all, it means a bit less exposure for people in the poster sessions. Second of all, posters are an intrinsically visual medium, and conventional wisdom is that tweets with pictures attached get more attention. I can't tell whether a tweet has a picture attached in this data, but I can tell if it has a link. I broke tweets down into two categories: tweets that have an "https" link and don't contain the words "slide" or "talk", and all others. It winds up looking like this:
might have image    n        average retweets    average favorites
yes                 378      1.29                3.31
no                  2,818    0.82                1.89
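The heuristic behind that table is simple enough to sketch in a few lines of Python. Again, this is an illustration rather than my actual code: the dictionary keys ("text", "retweets", "favorites") are assumed field names, and doing the matching case-insensitively is my own choice here.

```python
from statistics import mean

def might_have_image(text):
    """An "https" link, and no mention of "slide" or "talk"."""
    lowered = text.lower()
    return ("https" in lowered
            and "slide" not in lowered
            and "talk" not in lowered)

def summarize(tweets):
    """Group tweets by the heuristic and average their retweets and favorites."""
    groups = {}
    for tweet in tweets:
        key = "yes" if might_have_image(tweet["text"]) else "no"
        groups.setdefault(key, []).append(tweet)
    return {
        key: {
            "n": len(group),
            "average retweets": mean(t["retweets"] for t in group),
            "average favorites": mean(t["favorites"] for t in group),
        }
        for key, group in groups.items()
    }
```

It's a crude proxy, since plenty of "https" links aren't pictures, which is why the column is labeled "might have image."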

Maybe something to think about next time you're live tweeting a conference.

Another thing I was interested in was what the average tweeting rate was during any given 20 minute talk + 10 minute question period. So, I took each talk period, and counted up how many tweets were sent each minute. Here's what it looks like.

During a talk, it looks like there's a pretty steady average of 2 to 3 tweets per minute persisting into the Q&A period, dropping off precipitously during the last 5 minutes of Q&A as the speakers switched over. It's really pretty striking that during the meat of any given talk period, the most common number of tweets in a minute is something like 1 or 2, not 0.

One last thing I looked at was tweets about the birds. In the Great Hall in Hart House, there were at least two birds flying around the rafters and eating the crumbs after coffee sessions. It was a major topic of twitter conversation, and spawned the parody account @GreatHallBird. The volume of tweeting about the birds reached its peak on the third day of the conference, when about 8% of all tweets contained the string [bB]ird.

There were initially some competing ways of referring to the bird. Some people originally decided to call it "Ferdinand", but midday on the 24th, the @GreatHallBird account started tweeting, and eventually that standard became the most popular. Here is what people were calling the birds out of the options of just "[bB]ird", "[fF]erdinand" and "[gG]reat\s?[hH]all\s?[bB]ird".
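The three patterns above are ordinary regular expressions, so the counting can be sketched in Python like this. The function names are hypothetical, and checking the most specific pattern first is an assumption of this sketch, not necessarily how I tallied things.

```python
import re

BIRD = re.compile(r"[bB]ird")
FERDINAND = re.compile(r"[fF]erdinand")
GREAT_HALL_BIRD = re.compile(r"[gG]reat\s?[hH]all\s?[bB]ird")

def bird_share(texts):
    """Proportion of tweets containing the string [bB]ird."""
    if not texts:
        return 0.0
    return sum(1 for text in texts if BIRD.search(text)) / len(texts)

def name_used(text):
    """Label a tweet by how it refers to the bird, most specific pattern first."""
    if GREAT_HALL_BIRD.search(text):
        return "Great Hall Bird"
    if FERDINAND.search(text):
        return "Ferdinand"
    if BIRD.search(text):
        return "bird"
    return None
```

Note that "[gG]reat\s?[hH]all\s?[bB]ird" also matches the account handle @GreatHallBird, so @-mentions of the account count as uses of that name unless you filter them out.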

Unfortunately, I don't have access to historical twitter data to check how #NWAV44 compared to other large conferences, like maybe #ICPHS2015.

UPDATE: I just realized that I didn't filter out tweets by the @GreatHallBird account itself when estimating the rate of tweets about the birds. When I exclude its tweets, as well as any line initial @-mentions, nothing really changes that much qualitatively, but the third day of the conference tops out at about 7% tweets about the bird, instead of 8%.

Wednesday, August 5, 2015

Gender in the Wasteland

One thing I do with the bit of leisure time I have is play video games. I feel like I need to be a bit self defensive about it, given my age and place in life, but according to surveys the average age of gamers is somewhere between 30 and 37, so I fit right in there (although, the age distribution probably has a strong leftwards skew, so it'd be nice to know the median age). Many mainstream games have socially problematic themes, like violence being the only available recourse to progression in the game. A really cool video series on that point is the Grand Theft Auto Pacifist. Another common problem games have is their portrayal of gender (the topic of this post), and on that topic, you should obviously go watch Tropes vs Women in Video Games.

I think these socially problematic themes are a bigger deal than when they show up in other media. Video games are unique in requiring an alignment of the audience's motivations with the main character. For example, the experience of watching Breaking Bad and observing Walter White's moral descent would be very different from playing a Breaking Bad game and controlling Walter White. What I've learned playing video games is that while it may be projected on your TV screen, the game is really in your head.

I've recently been playing Fallout Shelter, a mobile game set in the post nuclear apocalypse Fallout universe from Bethesda Studios. Fallout 4 is maybe the most highly anticipated game coming out in time for Christmas this year, and Fallout Shelter is a fun diversion for fans to play in the meantime. Your role is the overseer of a Vault-Tec vault where dwellers escape the radioactive wasteland. You build out and populate your vault, and assign dwellers tasks like food, water and energy production.
It's kinda like an ant colony.

Our story begins when I assigned a dweller to the Science Station I'd built so she could produce RadAways (a treatment for the omnipresent radiation). I wanted to equip her with the Professor Outfit, which would boost her Intelligence stat, speeding up the production of RadAways. I scrolled through the inventory a few times, and couldn't find the Professor Outfit. The only possibility I considered was that I'd forgotten to unequip it from the dweller who was wearing it before. But as I messed around, it became clear that you're unable to equip female dwellers with the Professor Outfit... Yeah, I know.

Here are two dwellers, Alexander and Judy, in the Science Station.

When I select Alexander and scroll through the outfits inventory, the Professor Outfit is right there between the Nightwear and Radiation Suit. When I select Judy, it's just absent. It's not even greyed out, just invisible.

This is a problem. There is a very strong cultural expectation that "Professors are Men", and this expectation gets visited upon my female friends and colleagues in really unfortunate ways. The most common and obvious day-to-day experience I've been told about is people assuming that "Prof Smith" is a man in e-mails. Or it's assumed that they're administrative support staff instead of faculty at meetings. Sometimes people assume female profs are students showing up for the first day of class, instead of being the instructor. One friend of mine said they got student feedback on a course saying that they were a great teacher and would doubtless be successful "in whatever career she pursues."

These are all obvious examples of women not being taken seriously in their professional roles, and the "Professors are Men" expectation has some obvious impacts on their careers. Women are grossly underrepresented at the highest faculty levels, and get paid about 90% of what men at equivalent levels do. So what's it matter that in Fallout Shelter, Alexander can wear the Professor Outfit and Judy isn't even presented with the option?

An all too common scene.
Well, first, it's just reinforcing the "Professors are Men" expectation. Who is this message reaching? For one, me, and probably many of my students, and we're the ones who do a lot of the damage with the "Professors are Men" assumption. But it's also probably reaching a larger portion of women than even other games with poor gender portrayals do. It's a mobile game, and mobile games are disproportionately popular among women. Also, it has an undeniably The Sims-like element to it, a game which was also disproportionately popular among women. So, it's a fairly negative message about women, in all likelihood being disproportionately directed at women.

This is also a game with a bit of cultural reach. There is some speculation that it's out earning Candy Crush Saga, and it topped the App Store charts for a while. Bethesda is also a really big and popular game studio, and Fallout Shelter bears the "Editor's Choice" badge.

Preventing female dwellers from equipping the Professor Outfit is also the only gender based equipping restriction I've come across in the game. All of the other outfits can be worn by male and female dwellers. Both Alexander and Judy can wear the Combat Armor, but it's drawn a bit differently for their different bodies.

Sometimes, with things like gender representation, there is an in-game explanation for why things are the way they are, and we need to have a conversation about why gender representation in fictional worlds is an important issue for our real world. But I don't think that's what's going on here. The fact that the outfits are drawn differently for male and female dwellers, and the fact that the Professor Outfit is just unceremoniously absent without any kind of in-game explanation suggests to me that they just didn't bother drawing a Professor Outfit for female dwellers. That is, we're observing here a real world instantiation of the "Professors are Men" assumption rather than some kind of intentional fictional representation of that assumption.

But just because it isn't intentional doesn't mean it isn't sexist, and doesn't mean it shouldn't be changed. Being unintentionally overlooked is exactly the problem my female colleagues are facing. I've sent Bethesda an email briefly outlining this, and asking them to rectify it in an update. Don't know if I'll hear back from them, or if I'm better off yelling at no one in the Wasteland.

Edit: The pregnancy mechanics are pretty messed up too.

Video games are a broad medium though (this is more self defensiveness), and can be utilized for all sorts of purposes. For example, the Iñupiat Cook Inlet Tribal Council had the game Never Alone made, in which they embedded parts of their oral traditions for their younger generation. It's a really pretty and fun game. There's also the episodic game Life is Strange, which is essentially a game about friendship, family, the near constant threat of victimization women face, and the anxieties we all face in trying to make the right choices in life.

Thursday, December 4, 2014

A Silly "Name" Generator

tl;dr: It looks like names aren't well modeled as a Markov process, but you can install my R package that does model names as Markov processes and mess around with it.

I don't know how I wound up writing a "name" generator yesterday, but I did. And now it's an R package (just on github for the moment), so you can play around with it too.

I was messing around with some other research questions when I decided to see what would happen if I tried to model given first names as a Markov process. Here's a picture of a Markov chain that contains just the letters of my own name: Joe.

First, you start out in a start state. Then, you move, with some probability, either to the o, j, or e character. Then, you move, with some probability, to one of the other states (one of the other letters or end), or stay in the same state. In this figure, I've highlighted the path that my own name actually takes, but there are actually an infinite number of possible paths through these states, including "names" such as "Eej", "Jeoeojo", "Jojoe", etc.

A Markov chain for all possible names would look a lot like this figure, but would have one state for every letter. Now, I keep saying that you move from one state to the next with "some probability," but with what probability? If you have a large collection of names, you can estimate these probabilities from the data. You just calculate for each letter what the probability is of any following letter. So for the letter "j", you count how many times a name went from "j" to any other letter. For boys' names in 2013, that looks like this table.

from to count
j a 176452
j o 84485
j e 26118
j u 25616
j i 1920
j h 425
j r 121
j c 118
j d 98
j end 55
... ... ...
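If you'd rather see the counting spelled out, here's a small sketch in Python (the package itself is in R, and this is not its code). The function names are invented for illustration, and weighting each transition by the number of babies given the name is an assumption of this sketch.

```python
from collections import Counter, defaultdict

def transition_counts(name_counts):
    """Count letter-to-letter transitions, including start and end states.

    `name_counts` maps each name to how many babies received it.
    """
    counts = defaultdict(Counter)
    for name, n in name_counts.items():
        states = ["start"] + list(name.lower()) + ["end"]
        for current, following in zip(states, states[1:]):
            counts[current][following] += n
    return counts

def transition_probs(counts):
    """Turn the counts for each state into probabilities that sum to 1."""
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {nxt: c / total for nxt, c in followers.items()}
    return probs
```

With a made-up input like {"joe": 3, "jon": 1}, the "j" row has a single entry, "o" with a count of 4, and the "o" row splits 3 to 1 between "e" and "n".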

As it turns out, a whole bunch of name data is available in the babynames R package put together by Hadley Wickham. So, I wrote a few functions that estimate the transition probabilities from the data for a given year (from 1880 to 2013) for a given sex, and then generate random "names", or just return the most probable path through the states. How often does this return a for real name? Sometimes, but not usually. For example, the most probable path through character states for boys born in 1970 is D[anericha] with that "anericha" bit repeating for infinity. For boys born in 1940, it's just an infinite sequence of Llllllll...

So, that introduces a problem where the end state is just not a very likely state to follow any given letter, so when generating random names from the Markov chain, they come out really really long. I introduced an additional process that probabilistically kills the chain as it gets longer based on the probability distribution of name lengths in the data, but that's just one more hack that goes to show that names aren't well modeled as a Markov chain. 
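To see why the most probable path can loop forever, here's a toy Python sketch. The transition probabilities are invented, not estimated from the babynames data, and two things are simplifications on my part: the "most probable path" here just follows the single likeliest next state at each step, and random names are cut off with a hard length cap instead of the length-based kill process described above.

```python
import random

# Invented toy probabilities; "end" is never the likeliest next state.
TOY_PROBS = {
    "start": {"l": 0.6, "a": 0.4},
    "l": {"l": 0.5, "a": 0.3, "end": 0.2},
    "a": {"l": 0.5, "end": 0.5},
}

def greedy_path(probs, max_len=8):
    """Follow the likeliest next state until "end" or the length cap."""
    state, letters = "start", []
    while len(letters) < max_len:
        state = max(probs[state], key=probs[state].get)
        if state == "end":
            break
        letters.append(state)
    return "".join(letters).capitalize()

def random_name(probs, rng, max_len=12):
    """Sample a path through the chain, stopping at "end" or the length cap."""
    state, letters = "start", []
    while len(letters) < max_len:
        options = list(probs[state])
        weights = [probs[state][option] for option in options]
        state = rng.choices(options, weights=weights)[0]
        if state == "end":
            break
        letters.append(state)
    return "".join(letters).capitalize()

print(greedy_path(TOY_PROBS))
print(random_name(TOY_PROBS, random.Random(2013)))
```

In the toy table, "l" is most likely to be followed by another "l", so the greedy path never reaches the end state on its own, which is the same kind of trap as the Llllllll... result.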

Here's a little sample of random "names" generated by the transition probabilities for girls in 2013:
  • Elicia
  • Annis
  • Ttlila
  • Halenava
  • Amysso
  • Menel
  • Seran
  • Pyllula
  • Paieval
  • Anicrl

And here's a random sample of "names" generated by the transition probabilities for girls in 1913:
  • Lbeana
  • Peved
  • Math
  • Bysenen
  • Viel
  • Lelinen
  • Jabbesinn
  • Mabes
  • Drana
  • Lystha
The feeling I get looking at these is that they don't seem particularly gendered, even though there are clear gendered name trends. They don't even seem like they're from different times from each other. A lot of them aren't even orthographically valid. I don't know how they'd perform on "name likeness" tasks, but I don't even know what the point of doing such a task would be, since the Markov process has already failed at being a good model of names.

Maybe there's a lesson to be learned from the Markov process' failure to model names well, but for me it wound up being a silly diversion.

Update: I've now updated the package to generate names on bigram -> character transition probabilities. The generate_n_names2() function generates things that look more like names. It's kind of fun!
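For the curious, the idea of conditioning on the previous two characters can be sketched like this in Python. This is only an illustration of the idea, not the internals of generate_n_names2(): the function name is made up, and padding each name with two start symbols is an assumption of the sketch.

```python
from collections import Counter, defaultdict

def bigram_transition_counts(name_counts):
    """Count how often each character follows each pair of preceding characters.

    `name_counts` maps each name to how many babies received it.
    """
    counts = defaultdict(Counter)
    for name, n in name_counts.items():
        chars = ["start", "start"] + list(name.lower()) + ["end"]
        for i in range(2, len(chars)):
            context = (chars[i - 2], chars[i - 1])
            counts[context][chars[i]] += n
    return counts
```

Because the next letter now depends on two letters of context rather than one, sequences like "Ttl" become much less likely, since very few names start that way.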

10 generated girl names based on 2010 data:

  • Rookenn.
  • Lilein.
  • Hayla.
  • Dailee.
  • Bri.
  • Samila.
  • Abeleyla.
  • Eline.
  • An.
  • Rese.

10 generated boy names based on 2010 data:
  • Briah.
  • Dason.
  • Jul.
  • Messan.
  • Kiah.
  • Jax.
  • Se.
  • Frayden.
  • Dencorber.
  • Gel.

Wednesday, October 1, 2014

America's Ugliest Accent: Something's ugly alright.

I should really blog more often, instead of just when I feel compelled to slap down some nonsense, because the general tone of Val Systems turns towards scolding and away from my genuine positive passion for linguistics. That said, guess what I'm doing in this post!

If you read past the headline, it gets even worse. I won't always reply to examples I find of gross linguistic discrimination like this, because if I did it'd be a full time job. But I noticed that in the introduction they'd linked to a New York Times column that references a paper that I co-authored on the Philadelphia dialect. I didn't think the NYT column was appropriately respectful, and I said so on Language Log at the time.

The NYT columnist wasn't too happy about what I said, but I feel that I have an ethical obligation to the people who invite us into their homes and are generous with their time and stories, to provide them with a vigorous public defense if their communities and the way they speak are ridiculed as a result. Moreover, language shaming pieces like this Gawker tournament only poison the waters for future sociolinguistic research, especially if our names as researchers are attached onto them in some way.

And as I was writing up some notes for this response, and followed more links from the Gawker pieces, I was really shocked by how many articles they've linked to that are popular writeups of sociolinguistic research, usually including interviews with one or more sociolinguists! It's like half my facebook friends list in there! It feels so defeating to see these generally positive articles and interviews utilized to prop up an exercise as ugly and mean spirited as this one.

But what's the harm...?

Anticipating some reactions to this post, no, I'm not some grey humorless lump. But just because something is framed as a game doesn't make it fun, and it doesn't make it funny. For example, take Gawker's paragraph about New Orleans:
New Orleans is a steaming, fetid stew of aural bile, home to everything from the deep Cajun bayou accent to the Yat dialect, which derives from Irish, French, German, and even Italian into one completely incomprehensible mess. You need only watch this clip on the number of ways residents pronounce the city's name and neighborhoods and read this excellent article on the hodgepodge of New Orleans' accents to see how varied, and uniformly ugly, it all is.

Even if there was some intrinsic humor to that paragraph, instead of just raw nastiness, it would be important to reflect on who we're making fun of, and why. And on that count, I think the fella who compiled the Yat dictionary summarizes it well:
It's a working class language, probably, is what it amounts to.
And of course, that's what linguistic discrimination is really about. Maybe it's not always about class, but it's never really about language. It's about the kind of people who speak it. Predictably, the kinds of accents and languages which get dumped on the most, and get branded the "ugliest," always wind up being spoken by socially disadvantaged people. What exactly did this woman in particular do to deserve having a candid video of her slapped up on Gawker as an example of just how "ugly" the Chicago accent is? She works in a warehouse supermarket, that's what.

And this isn't a consequenceless game either. "America's Ugliest Accent Tournament" just puts a laughing face on a serious problem of discrimination that has economic and personal consequences for real people. To choose one example I'm familiar with, Anita Henderson did a study where she surveyed hiring managers in Philadelphia, playing them tapes of potential job applicants, and asked them to rate them on their job suitability. The topline summary from the abstract:
Those who sound Black are rated as less intelligent and ambitious and less favorably in job level.
In her textbook English with an Accent, Rosina Lippi-Green sums up my own opinion on the matter, but I've added some emphasis.
If as a nation we are agreed that it is not acceptable or good to discriminate on the grounds of skin color or ethnicity, gender or age, then by logical extension it is equally unacceptable to discriminate against language traits which are intimately linked to an individual's sense and expression of self

How's this different from these other examples?

A few of the supporting links from the Gawker piece are personal websites that are called "How to talk [City]" or "The [Dialect] Dictionary," put together by enthusiastic speakers of the area themselves. They tend to have a self deprecating tone, so isn't that similar to the Ugliest Accent Tournament? It sure as hell isn't! First of all, even if those personal sites do have a poking fun tone, the fact is the dialect must be important to the person putting together the site, or else they wouldn't have spent the time documenting it! Their self deprecating tone could either be due to the general difficulty of expressing seriously how important a topic is to you, or to their internalized linguistic insecurities driven by things like America's Ugliest Accent Tournament. Moreover, Gawker is a really large media organization, and should be taken to task if only due to their profile and influence.

Sociolinguists ask people what they think about accents and dialects too. It's a subfield sometimes called Perceptual Dialectology. Isn't that kind of the same? Don't even start! A goal of sociolinguists is to understand the social landscape of language as well as we can, and that includes people's sometimes crummy attitudes about it. But if we have a goal, it's to critique those attitudes, not revel in them in some kind of user engagement experiment so that we can go cash in our pageviews with advertisers.

The Fundamental Sociolinguistic Outlook

Real quick, let's contrast the overall tone of the America's Ugliest Accent Tournament with what I would call the fundamental sociolinguistic outlook on speakers. I think Bill Labov summed it up nicely at the end of his 2009 Haskins Prize lecture (go listen to it if you haven't already).

Versus Gawker
No matter who you are, you all sound disgusting.

What to do about it?

I for one will be writing a polite e-mail to Gawker asking them to remove the link that references my research, and to avoid linking to anything that references my research in the future. I'd encourage anyone else whose research they mentioned to do the same.

Thursday, September 25, 2014

The new iOS Health app is disappointing

I've been getting into using a few different health tracking apps, and have been getting tired of needing to punch the same data into 3 different places every time I step on a scale. So, I was reasonably excited about the new Health app in iOS8, which would act as one central repository for this information that the individual apps could pull from. The fact that release of the HealthKit API has been delayed, meaning my 3 different health apps can't access the Health data yet, is disappointing, but I'm pretty patient about these things.

However, the Health app itself is really disappointing all on its own. It is not a success of data reporting or visualization. For example, here is what the record of the number of steps I've taken each day for the past month looks like.

So, riddle me this: How many steps did I take yesterday? What was the date that I took the most steps? What day of the week was that weird dip? Not only are answers to basic questions like these not "glanceable," they are totally inaccessible. There is, in fact, no way within the Health app to find these answers, but back to that later.

Let's get a bit more detailed. What is the range of the y-axis? It looks like the bottom horizontal line corresponds to 1,500 steps. That's already a questionable data reporting decision. It should probably correspond to 0 steps. How about the top of the y-axis range? The top horizontal line looks like it corresponds to 13,951 steps, but I'm actually pretty sure that is the maximum number of steps in this data. But the maximum data point doesn't touch the top line?

But let's talk about how Apple really failed to meet baseline expectations with these graphs. When I realized I couldn't read the data precisely off the graph, my first instinct was to drag my finger across the line, assuming that more detailed contextual data would pop up. Sort of like how this Google Ngrams graph works. It should even work on mobile if you tap on it. Or, take this excellent bit of interactive visualization from the New York Times Upshot blog. Or any line graph out there with any bit of polish. Users are more or less trained by this point that hovering over line graphs activates some kind of additional contextual information, whether it's more detailed labeling, brushing, or like that NYT visualization, additional graphs! So you might expect that on the baddest touch screen device ever in the world (as Apple would have us believe), there's going to be some wild and crazy touch interaction, pinch-to-zoom pizzazz. Or at least it might have the same baseline functionality as some silly web widget that I can embed in my blog.

No such luck, and the data viz nerd in me sees this as one of the biggest missed opportunities I've seen in a while. It is just a static image, with some minimal transition animations when you switch between different time scales. If you tap on the graph, you get taken to the raw data, which looks like this.

As far as I can tell, this is the really raw data offered up by the motion co-processor. Ludicrously, you can select and delete any individual bout of steps. So, if I felt that actually, one of the 8 groups of steps logged all in the minute of 10:08 AM was inaccurate, I could delete it!

What really frustrates me about the fact that I can see this data is that I can't touch it. Data at this granularity is pointless other than to show off the fact that there's a lot of it. It needs to be a little bit aggregated before it gets interesting. And given that I don't like the Health visualizations as they are, I'd really go to town on this raw data. But conspicuously absent here is any export utility. I can look at, but not touch my own data. I guess I also couldn't access the data before iOS8, but they didn't waggle it tantalizingly in front of my nose like this!

So sure, maybe someone will make a third party app that will access the data from the HealthKit API and allow me to export it from there. As if what I'm really dying to do is clutter up my phone with an inevitably junky ad riddled app that contributes functionality that really should've been there in the first place.

To sum up, the static figures are poorly designed and minimally informative, but static figures are hardly what I would expect from a corporate entity like Apple anyway. On top of that, waving this raw data in my face is equal parts useless and infuriating.

Sunday, April 13, 2014

Baby Naming Trends: Now With More Linguistics!

This animated graph about the rise in boys' names ending in <n> has been making the rounds lately.

It comes from this blog post by David Taylor.

It's a really cool graph, but I tend to find analyses of baby names a bit frustrating because they almost always rely strictly on the written, or orthographic, forms of the names. It's not that the way people spell their children's names doesn't matter, but it's only half of the puzzle. For example, I'm named after my grandfather. He was German (more specifically, a Donauschwob), so he spelled his name <Josef>, and pronounced the initial sound like <y>, which in the IPA is /j/. When naming me, my parents had a whole bunch of options. Would they pronounce my name like my grandfather did, or like most English speakers would? And how would they spell it? They wound up settling on the English pronunciation, and the German spelling. I've made a little diagram displaying a very partial set of options my parents had in choosing my name.

And of course, Sarah Jessica Parker played a woman named /sændi/ who spelled it <SanDeE☆> in Steve Martin's LA Story, so clearly the spelling of proper names is an important expressive dimension, but still just half the picture.

So, I decided to look a bit more at popular linguistic structures in baby names. Hadley Wickham has already compiled the top 1000 baby names in the US per year since 1880, and Kyle Gorman has a nice Python module that syllabifies CMU dictionary entries. So I put together some sloppy code to analyze it. The biggest weakness of my approach is the number of names that aren't in the CMU dictionary: 2525 of the 6782 names in the data (about 37%) are missing, so this post should be understood as being for entertainment purposes only.

One other thing that bugged me about the name final <n> plot is that it seemed kind of arbitrary to focus on the final letter of the name. I suspect that it's a real trend that people noticed eyeballing lists of names, but that it wasn't compared against other kinds of trends. I went ahead and labeled name initial and name final syllables, codas, onsets and rhymes as being special, but I'm not going to single them out.

Kicking things off, there's a graph of popular syllables between 1880 and 2008. To be included in the graph, a syllable had to be in the top 3 most popular in any given year. The y-axis is how many times more frequent the syllable is than if syllable selection were random. It's not frequency weighted; that is, this is just the distribution over names that have that syllable, not over babies.
It's a bit chaotic, I know. It's a time like this that I wish I'd learned a little JavaScript so I could make an interactive version with brushing. Here's another version where each syllable gets a facet. They're ordered by their decreasing maximum ratio.
It looks like at the syllable level, name final /nə/ and /li/ for girls are both long time favorites, as well as more popular syllables than any boy's name final /n/ syllable. The most popular boy's name final /n/ syllable looks like it's always been /tən/, but maybe it's flagging a bit compared to the recent surges in /sən/ and /dən/. It also looks like popularity in syllables is pretty evenly split between name initial and name final syllables. For both boys and girls, some kind of initial between /e/ ~ /ɛ/ ~ /æ/ is pretty popular, but I can't be sure what's going on there, because the CMU dictionary has the same entry for both <Aaron> and <Erin>.
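
For anyone curious what that y-axis ratio amounts to, here's a minimal sketch in Python. It is not the code behind the graphs. It assumes that "random" means every distinct syllable is equally likely, and that each syllable is counted once per name (names, not babies). The function name `syllable_ratios` and the toy syllabifications are made up for illustration.

```python
from collections import Counter


def syllable_ratios(names_to_syllables):
    """Observed share of each syllable divided by its share under a uniform baseline.

    Assumption: "random selection" means every distinct syllable is equally likely.
    """
    counts = Counter()
    for syllables in names_to_syllables.values():
        # Count each syllable once per name: name types, not babies.
        counts.update(set(syllables))
    total = sum(counts.values())
    expected_share = 1 / len(counts)  # uniform baseline (an assumption)
    return {syl: (n / total) / expected_share for syl, n in counts.items()}


# Hand-made toy data, not taken from the real name lists.
toy = {
    "Aiden": ["eɪ", "dən"],
    "Jaden": ["dʒeɪ", "dən"],
    "Mason": ["meɪ", "sən"],
}
ratios = syllable_ratios(toy)
```

A ratio above 1 means the syllable turns up in more names than the uniform baseline would predict. In the toy data, /dən/ is the only syllable that clears that bar.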

But maybe the reason boy's name final /n/ isn't shining through like you might expect is phonological. A boy's name ending in a word final syllabic /n/ is necessarily going to pull the preceding consonant into the syllable with it. Looking at the plot above, it's not likely that the preceding consonant is totally random either, 'cause we've only got /t, s, d/ (all coronals) and vowels preceding the /n/. But for the hell of it, here's the same kind of plot as the ones above, but this time with syllable rhymes.
There's a lot less volatility in the rhymes data, probably because there are fewer different kinds of syllable rhymes. Complex rhymes don't seem to be that popular ever. We've mostly got vowels from open syllables, and syllabic consonants. At any rate, the popularity of name final /n/ for boys is pretty clear, taking over from /i/ (from names like Billy and Jonny). The boys' trend towards name final /n/ seems to be about on par with the trend for girls' names to end in /ə/.
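
To make the syllable-versus-rhyme difference concrete, here's a small Python sketch. It assumes each syllable is stored as an (onset, nucleus, coda) triple of phone lists. That representation, the `rhyme` helper, and the hand-written ARPABET-style transcriptions are all illustrative assumptions, not output from the CMU dictionary or from any particular syllabifier.

```python
from collections import Counter


def rhyme(syllable):
    """Drop the onset and keep nucleus + coda."""
    onset, nucleus, coda = syllable
    return " ".join(nucleus + coda)


# Hand-syllabified toy entries (illustrative, not looked up).
names = {
    "Mason": [(["M"], ["EY1"], []), (["S"], ["AH0"], ["N"])],
    "Aiden": [([], ["EY1"], []), (["D"], ["AH0"], ["N"])],
    "Billy": [(["B"], ["IH1"], []), (["L"], ["IY0"], [])],
}

# Tally the rhyme of each name's final syllable.
final_rhymes = Counter(rhyme(syllables[-1]) for syllables in names.values())
```

Mason and Aiden end in different syllables, but once the onset is dropped they land in the same rhyme bucket. That's how name final /n/ can look scattered in the syllable plots and still add up in the rhyme plot.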

I'd like to play around with this data a bit more if I get some time. It occurred to me that you could come up with a few different ways of generating popular names from different eras by randomly sampling popular syllables, or by estimating transition probabilities between syllables and going on a random walk that way.
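
Here's a rough Python sketch of the random walk idea, assuming a first-order chain over syllables with made-up start and end markers. The function names, the markers, the `max_syllables` cap and the toy names are all hypothetical. None of this is from the code I actually ran.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers


def transition_counts(names):
    """Count syllable-to-syllable transitions, including name boundaries."""
    counts = defaultdict(Counter)
    for syllables in names:
        path = [START] + list(syllables) + [END]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    return counts


def random_walk(counts, rng, max_syllables=6):
    """Walk from START, sampling each next syllable in proportion to its count."""
    name, state = [], START
    while len(name) < max_syllables:
        options = counts[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        if state == END:
            break
        name.append(state)
    return name


# Hand-made toy syllabifications, not real data.
toy_names = [["meɪ", "sən"], ["eɪ", "dən"], ["dʒeɪ", "dən"], ["dʒeɪ", "sən"], ["bɪ", "li"]]
counts = transition_counts(toy_names)
generated = random_walk(counts, random.Random(0))
```

Fitting one set of counts per era would give era-flavored names. The simpler version, sampling popular syllables independently, skips the transition table entirely but loses any sense of which syllables tend to follow which.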

All my code and the data are up on github, if anyone else wants to play around with it:
