Saturday, February 11, 2017

A Scottish Narrative of Personal Experience

In a web extra from Samantha Bee's show, we get an extended interview with the farmer Michael Forbes telling a story about how he chased Donald Trump Jr. away from his Aberdeenshire farm. It was actually an excellent example of a narrative of personal experience, very similar to the paradigmatic type I was taught about in my sociolinguistics classes.

It's been a long time since I've tried any kind of narrative analysis, so apologies for any errors, but I'm going to give it a shot here cause I think it's such a good example, and indicative of how rich even simple narratives like this can be.

Check out the video, it's worth it.

Here's a transcript of the narrative, with the contributions of the interviewer simplified a bit:

a. I chased his son once from here.
b. (Which one?) The one who's got the greasy hair.
c. (Which one?) Young Donald is it?
d. I was having a cup of tea.
e. And they knock knock knock on the door.
f. And eh, my mother answered it.
g. And she says "We've got visitors".
h. And she shut the door
i. So they tapped a bit louder, you know?
j. And, mother answered it again.
k. She says "I told you, we've got visitors"
l. Shut the door again.
m. Rattled louder.
n. Well, I answered it the next time.
o. I says "Get the fuck out of here"
p. I says "If you come back here again"
q. I says "I'll have you charged with harassment"
r. "Don't get angry mate, don't get a--"
s. I says "Fucking angry"
t. I says "I'll show you fucking angry"
u. And they go out the gate.
v. (You loved it) Yeah I did, Yeah
This satisfies the Labov & Waletzky definition of a narrative in that the sequence of clauses is the same as the temporal sequence of events, except for line (a). Lines (a-c) would be called the "Abstract", outlining the most reportable event of the narrative. After we have the abstract, we have the orientation in (d) I was having a cup of tea. Then it is just complicating action after complicating action until the resolution in (u) And they go out the gate.

One interesting thing about this narrative is it is devoid of any evaluation. All of the turns are devoted to what events happened, and what was said, but there is no turn contributing what his state of mind was, or any other kind of evaluation of the situation. You could imagine the addition of a "It was rude to keep knocking" turn, for example, but that is absent. I was told once that avoiding evaluation makes for a better narrative for the listener, and is more typical of working class narratives. The interviewer, Amy Hoggart, tries to extract an iota of evaluation out of him (You loved it), but he doesn't repeat the evaluation, he just agrees with it (Yeah I did, yeah).

The repetition of the knocking and answering events is really interesting as well. It serves the obvious narrative function of heightening the tension, to good effect as we can see on the face of Amy Hoggart at turn (n) when Michael finally answers the door. I would hazard a guess that if Donald Trump Jr. had knocked on the door twice, or four times, this story would (and should) still be told with three knocks. It's also interesting that in the first two knocking-answering events, DTJr and his entourage don't actually say anything. They knock, they are rebuffed, and the door is closed. There's no other exchange of words.

There's also a complex piece of cultural information being conveyed here, regarding when it is appropriate to call on someone. Michael's mother indirectly tells DTJr to go away twice by saying We've got visitors. No visitors were mentioned in the orientation of the narrative, so I'm assuming that there were no actual visitors in the house. The fact that you should not call on someone when they have visitors is presented as so obvious that it is a de facto instruction to leave within the narrative, and in the telling of the narrative it goes unexplained.

Finally, his verbs of quotation are awesome. He exclusively uses say in the historical present, and it seems like he is only reporting on speech which was said. D'Arcy (2012) has argued that the new verbs of quotation, like "be like" have been buoyed up by an increasing tendency to include quoted thought and mimesis in narratives. Mimesis isn't absent from Michael's telling of the story, but it is absent from the text. He physically acts out each knocking and each door shutting event, and he acts out chasing them out the front gate.

All in all, a pretty good narrative.

Monday, October 10, 2016

Tina Fey Nailed The Philly Accent

Saturday night, Tina Fey and Jimmy Fallon appeared on Weekend Update as "White Women from Suburban Philadelphia." Apparently suburban white women are "the swing vote within the swing vote," which was the premise for having them on the show. It seemed like it was also at least partially to make fun of Jimmy Fallon's softball treatment of Trump.

I'm blogging about the sketch in appreciation of Tina Fey's performance of the Philadelphia Accent. First off, you should watch the sketch:

Over at AV Club, they criticized both Fey and Fallon for their accents:
Despite how the two pronounced “hoagies,” the performances (and accents) were all over the place, with Fey not bothering to do hers once she launched a Fey-esque attack on Mike Pence’s anti-gay, anti-woman agenda.
The sketch might not have been funny, and Fallon clearly didn't know what he was doing, but Tina Fey was on point, in my professional opinion. It's not surprising Fey should be able to perform the accent since she's originally from Upper Darby, which is right next to Clifton Heights in Delaware County. That corner of DelCo actually has an interesting place in the study of the Philadelphia Dialect. R. Whitney Tucker wrote one of the earlier descriptions of the dialect in 1942 while he was at Pennsylvania Military College, now called Widener University. At the time, he said
I think that the real heart and centre of this [Philadelphia] dialect was originally, and still is, a few miles to the south, in the eastern part of Delaware County.
So how did Fey (and Fallon) deal with the accent? I previously blogged about Chris Matthews's native performance of the dialect by running through a list of the dialect features, but this time I ran the sketch through the FAVE-suite! So up ahead is a detailed phonetic analysis of Fey's performance.

Overall Vowels

First up, here's Fey's over-all vowel system. I look at plots like these all the time, so this is inherently meaningful to me, but I'll try to break it down a bit.

There are a few top-line things I see here. First off, she's moved her /ahr/ vowel (as in start and far) way up to be right next to her /ɔ/ (you can hear it when she says "in charge"). That is exactly correct, but her /owr/ could be higher, since pore, poor, pour are all merged in Philly, and the merged vowel is usually the highest, backest vowel in the system.

She's also got a nice separation of /ay/ (as in ride) from /ay0/ (as in write) in a Canadian Raising pattern (you can hear it when she says mice). This is one of the more established features of the Philly accent now. There's also an /ey/ (as in face) and /eyF/ (as in gay) separation. This is less commonly discussed, but before consonants, eight is pretty similar to eat in Philly, while word-finally (like gay) it's much lower, almost southern sounding. But, when I look at Fey's /ey/, /eyF/ difference in detail, it's not especially consistent.

One thing she looks to be overdoing is /uw/ fronting (as in scooter). She has it as far front as her /Tuw/ (as in do). Philadelphia is known for its /uw/ fronting, but usually it is much fronter following coronals than it is elsewhere, so I would expect to see /uw/ further back here.

She's also got some kind of /ɛ/ /æ/ merger that I can't account for. That's not a Philly thing that I know of. One last thing that caught my ear is some kind of consistent /ʌ/ backing (as in Trump). That isn't usually described as a feature or ongoing change of the dialect, but it sounds authentic to my ears.

Vowels in Detail

But let's get into more of the guts of the system. In the over-all vowel plot, there's a really good split between the tense and lax short-a (/æh/ and /æ/ respectively). But the Philadelphia split-æ system is complicated, and partially overlaps with related systems. So how'd she do in detail?

Based on limited data, I'd say she more or less nailed it. Her /æ/ before /s/ in jackass is tenser than the one before /k/ (although a friend on Facebook said they didn't think it sounded authentic). Her /æ/ in grandpa is a bit lax, but I think that's forgivable given the preceding consonant-liquid cluster. Most impressively, she's got a properly lax [æ] in her two repetitions of Indiana, and in banging. I would expect those to be the most likely to be messed up by someone trying to do an impersonation.

Incidentally, I think the fact that she correctly had lax [æ] in Indiana could be why the AV Club thought she'd stopped doing the accent, since most American accents would have tensing there.

Honestly, this distribution of data makes me think that she's just got the split system natively. I haven't tried to compare this performance to something less affected, so I can't tell if she just always does this.

I also looked at her vowel dynamics a bit.

A few things jump out. First, I think she did a really good job on the phonetic quality of /aw/ (as in clown). I've found that it's basically a falling diphthong, and the most advanced tokens have a later F1 maximum, which she pretty much nails here. She also has a clear Canadian Raising pattern between /ay/ and /ay0/, with maybe some monophthongization of the pre-voiced tokens that's producing a strange trajectory. It's also possible that she's doing the /ey/, /eyF/ split in the formant dynamics, but I wouldn't put too much stock in that.

Here's what her /ɛ/ /æ/ and /æh/ vowel dynamics are like:

Clearly very good dynamic differences between /æ/ and /æh/. I included /ɛ/ here (labelled "e" in the plot) just to get another look at that weird /æ/~/ɛ/ merger. It seems to be there more or less in the dynamics as well. I really don't know what's happening there.


I also did some quick and dirty coding of Fey's consonants. One of the things I noticed is that Fey was doing quite a bit of (dh) stopping (a dental stop in words like the and this). When I tallied it up, it looked like it was actually about 20% of her tokens, which is a bit low for Philadelphia.

             stop   deleted   fricative
count           4         4          13
proportion   0.19      0.19        0.62
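
The proportions in that table are just each count divided by the 21 coded tokens. As a quick sanity check, here is a minimal Python sketch of the arithmetic (the variable names are mine, not from any coding script used for the post):

```python
# Counts of (dh) realizations, copied from the table above.
counts = {"stop": 4, "deleted": 4, "fricative": 13}

# Total number of coded (dh) tokens: 4 + 4 + 13 = 21.
total = sum(counts.values())

# Proportion of each realization, rounded to two decimals as in the table.
proportions = {label: round(n / total, 2) for label, n in counts.items()}
```

Stops come out at 4 of 21 tokens, which is where the "about 20%" figure comes from.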

She also had only one /str/ sequence, but she backed it to a very clear [ʃtr] in street.

I haven't tried to touch her /l/ vocalization or darkening at all, cause I think I have a tough time with that perceptually, and I don't have a fancy script for analyzing it like I do for vowels.


Meanwhile there's Jimmy Fallon. I'm not going to go into detail with him, cause as he said in the sketch, his accent was all messed up. Here's his overall vowels, and his dynamics.

Yeah, so it looks like he's just vacating his back vowel space, and also monophthongizing a lot of things? Basically, this isn't anything.


In conclusion, Tina Fey nailed it, and Fallon didn't know if he was coming or going. 

Friday, September 2, 2016

Open Question: "I'm a big fan of yours"

Linguists, I've got a question. Is there any distinction to be made between (1) and (2)?
  1. I'm a big fan of Rihanna.
  2. I'm a big fan of Rihanna's.
I find both acceptable, but I think I would prefer (1)? For pronouns, though, I think they must be possessive pronouns.
  1. *I'm a big fan of you.
  2. I'm a big fan of yours.
  3. *I'm a big fan of her.
  4. I'm a big fan of hers.
But I think it's strictly animacy based, since for inanimates I think the possessive form is ruled out, or is at least worse.
  1. I'm a big fan of coffee.
  2. *I'm a big fan of coffee's.
  3. I'm a big fan of Star Trek.
  4. *I'm a big fan of Star Trek's.
  5. I'm a big fan of it.
  6. *I'm a big fan of its.
I don't think the acceptability of (1) is related to a sort of corporate entity reading of "Rihanna." Both of the following seem fine to me:
  1. I'm a friend of Joel.
  2. I'm a friend of Joel's.

Monday, February 15, 2016

The Sound of Silence

One counterintuitive thing about doing linguistic analysis is how much time we spend analyzing structures and elements that are silent. That's probably pretty confusing for people when they first learn about it. For example, compare these three sentences:
  1. I want a cookie.
  2. I ate the cookie.
  3. I ate cookies all day.
In your grammar lessons in school, you probably learned that the words a and the were called "articles," but linguists usually call them "determiners." Most people describing the sentences in 1, 2 and 3 would say something like:
"Sentences 1 and 2 have determiners in them. Sentence 1 has a and sentence 2 has the. Sentence 3 doesn't have a determiner."
A linguist, on the other hand, when describing these three sentences would be likely to say:
"All three sentences have determiners. Sentence 1 has a, sentence 2 has the, and sentence 3 has a silent determiner."
Silent determiners are just scratching the surface of all of the possible silent words linguists have postulated. A really common kind of reaction to all the silent words is "Bullshit!" Actually, that was my reaction when I took my first syntax course, but eventually, I was convinced.[1] For a lot of the silent elements linguists have proposed, there's usually some good reason or evidence for doing so, but we do have to be careful not to over-hypothesize silent elements to make the data work. This is the really interesting tension of abstractness.

Here's an example I'm dealing with in my own work, involving how Philadelphians have traditionally pronounced their short-a in words like mat and man. Usually, Philadelphians have a "tense" or "nasal" sounding short-a when the sound following is an /m/ or an /n/. For example, man is tense, but mat is lax. But this only happens if the /m/ or /n/ is in the same syllable. So the word ham comes out tense, but the word hammer comes out lax, because the /m/ is in the next syllable, if you sound it out (ha-mer). Plan is tense, but planet is lax (pla-net).

One weird exception to this pattern is the word exam. Exam usually comes out lax, even though the /m/ is in the same syllable when you sound it out (ihg-zam). One way of analyzing this is just to say "Ok, exam is just an exception to the rule." But, I'm going to make a different argument, which is that every time a Philadelphian says exam, they're actually saying an abbreviated form of examination, and in examination, the /m/ is not in the same syllable (ihg-za-mih-ney-shun). So the word exam is lax, because it's really examination.

One objection to this argument is that examination looks like it's exam+ination, so how do we know that what people are saying is examination and not just exam without -ination added to it? Well, let's look at some other words that end in -ination. We have imagination. If we abbreviate imagination, like "I've totes got a wild imag," we get out ih-majh which means the same thing as imagination, and it has stress on the second syllable. This looks like exam (ihg-zam), which means the same thing as examination, and has stress on the second syllable. What happens if we never add -ination to imagination? We get out image (ih-mijh), which doesn't mean the same thing as imagination, and has stress on the first syllable. That doesn't look like exam at all. This makes it seem more likely that exam is an abbreviation like "I've totes got a wild imag", and is not just a bare word, like image.

But there is no word ehg-zum, so does that mean that the root exam can only ever appear attached to -ination? That's not so weird if you look at other words that end in -ination, like destination. Dest never shows up as its own word. But it seems to show up in related words like destined or destiny.

There's also a little bit of historical evidence that exam was originally an abbreviation. If you look at some of the earlier examples of exam in Google Books, authors used to end it in a period, like you would for any abbreviation. For example, J.M. Barrie (more famous for writing Peter Pan) in his 1889 book An Edinburgh Eleven wrote:
I knew a Snell man who was sent back from the Oxford entrance exam., and he always held himself that the Biblical questions had done it.
That period followed immediately by a comma makes it pretty clear that in this sentence, exam is an abbreviation. And if you look at Google ngram patterns, exam seems to be increasing in frequency relative to examination.

So exam looks like it was historically an abbreviation of examination, and the fact that Philadelphians have traditionally pronounced it with a lax short-a suggests that it still is an abbreviation for us, unconsciously.

This is just one example of how language is not a "what-you-see-is-what-you-get" game. There is a lot of silent structure to language that we're not consciously aware of. It takes a mixture of clever reasoning and empirical data to work it out.

But I'm guessing there's still a few people saying "Bullshit!" out there.
1. I actually once wrote a post on my undergraduate blog about how I thought PRO was absurd, and Norvin Richards left some really nice and patient comments on it. I'm kind of embarrassed about that now.

Thursday, October 29, 2015


I'm just recently back from the New Ways of Analyzing Variation (NWAV) conference, the top variationist sociolinguistics conference in North America. This was NWAV44, hosted by the University of Toronto, and as usual it was a lot of fun, and really exhausting.

When you go to a conference regularly enough, you start making conference friends, people who you only know and really only ever see in the context of the conference. Seeing my conference friends again is always one of the things I look forward to when going to NWAV, and it's what makes it all the more disappointing when I can't go for some reason or another. Of course, nowadays, if you can't make it to NWAV in person, you can always follow along at home on the Twitter hashtag. Really, it's almost like a parallel conference going on on Twitter. I go to a lot of conferences that get tweeted about, but I feel like NWAV tends to have a much higher rate of Twitter traffic, and this was especially true of NWAV44. The Manchester LEL Twitter account speculated that this might be the most tweeted about conference of all time.

It definitely spawned the most parody accounts (@GreatHallBird, @nwavAVghost). But, I thought I'd take this up and pick over the Twitter traffic on the #NWAV44 stream. I pulled down all the tweets I could with the twitteR package, excluding retweets.

In the 4 days of the conference, there were 3,196 tweets on #NWAV44. Here's the distribution of tweets, binned into 10 minute intervals, color coded by what was happening at the conference at the time (according to the official schedule).

The highest number of tweets in any 10 minute period was 53, for a rate of 5.3 tweets a minute, during the second paper session on Saturday.
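
For anyone who wants to try the same binning on their own tweet data, here is a minimal sketch. It is in Python rather than the R I actually used, and it assumes you already have one timestamp per tweet; the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def bin_start(timestamp, minutes=10):
    """Round a timestamp down to the start of its bin (10 minutes by default)."""
    floored_minute = timestamp.minute - timestamp.minute % minutes
    return timestamp.replace(minute=floored_minute, second=0, microsecond=0)


def tweets_per_bin(timestamps, minutes=10):
    """Count how many tweets fall into each bin."""
    return Counter(bin_start(ts, minutes) for ts in timestamps)


def peak_bin(counts, minutes=10):
    """Return (bin start, tweet count, tweets per minute) for the busiest bin."""
    start, n = max(counts.items(), key=lambda item: item[1])
    return start, n, n / minutes


# Made-up timestamps purely for illustration, not the real #NWAV44 data.
example = [
    datetime(2015, 10, 24, 14, 3),
    datetime(2015, 10, 24, 14, 9, 59),
    datetime(2015, 10, 24, 14, 10),
]
example_counts = tweets_per_bin(example)
```

The rate is just the bin count divided by the bin width, so 53 tweets in a 10 minute bin works out to 5.3 tweets a minute.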

There wasn't much tweeting during the poster session, which is too bad. First of all, it means a bit less exposure for people in the poster sessions. Second of all, posters are an intrinsically visual medium, and conventional wisdom is that tweets with pictures attached get more attention. I can't tell whether a tweet has a picture attached in this data, but I can tell if it has a link. I broke tweets down into two categories: tweets that contain an "https" link and do not contain the words "slide" or "talk", and all others. It winds up looking like this:
might have image       n   average retweets   average favorites
yes                  378               1.29                3.31
no                 2,818               0.82                1.89
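
The "might have image" heuristic is simple enough to sketch. This is a Python approximation, not the R code behind the table, and the field names `text`, `retweets` and `favorites` are assumptions about how the tweets are stored.

```python
from statistics import mean


def might_have_image(text):
    """Heuristic from the post: has an https link, no mention of 'slide' or 'talk'."""
    lowered = text.lower()
    return "https" in lowered and "slide" not in lowered and "talk" not in lowered


def summarize(tweets):
    """Group tweets by the heuristic and average their retweets and favorites."""
    groups = {"yes": [], "no": []}
    for tweet in tweets:
        key = "yes" if might_have_image(tweet["text"]) else "no"
        groups[key].append(tweet)

    summary = {}
    for key, members in groups.items():
        if not members:
            continue  # skip empty groups rather than averaging nothing
        summary[key] = {
            "n": len(members),
            "average_retweets": mean(t["retweets"] for t in members),
            "average_favorites": mean(t["favorites"] for t in members),
        }
    return summary


# Made-up tweets for illustration only.
toy_tweets = [
    {"text": "Great poster! https://t.co/abc", "retweets": 2, "favorites": 4},
    {"text": "Slides from my talk https://t.co/xyz", "retweets": 1, "favorites": 1},
    {"text": "Coffee time", "retweets": 0, "favorites": 2},
]
toy_summary = summarize(toy_tweets)
```

It's a blunt instrument: any https link that isn't described as slides or a talk counts as a possible image, so the "yes" group will include some plain links too.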

Maybe something to think about next time you're live tweeting a conference.

Another thing I was interested in was what the average tweeting rate was during any given 20 minute talk + 10 minute question period. So, I took each talk period, and counted up how many tweets were sent each minute. Here's what it looks like.

During a talk, it looks like there's a pretty steady average of 2 to 3 tweets per minute persisting into the Q&A period, dropping off precipitously during the last 5 minutes of Q&A as the speakers switched over. It's really pretty striking that during the meat of any given talk period, the most common number of tweets in a minute is something like 1 or 2, not 0.

One last thing I looked at was tweets about the birds. In the Great Hall in Hart House, there were at least two birds flying around the rafters and eating the crumbs after coffee sessions. It was a major topic of Twitter conversation, and spawned the parody account @GreatHallBird. The volume of tweeting about the birds reached its peak on the third day of the conference, when about 8% of all tweets contained the string [bB]ird.

There were initially some competing ways of referring to the bird. Some people originally decided to call it "Ferdinand", but midday on the 24th, the @GreatHallBird account started tweeting, and eventually that standard became the most popular. Here is what people were calling the birds out of the options of just "[bB]ird", "[fF]erdinand" and "[gG]reat\s?[hH]all\s?[bB]ird".
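
Those three patterns overlap, since anything matching the Great Hall Bird pattern also matches plain "[bB]ird". Here is a minimal Python sketch of one way to sort tweets into the three options, assuming the most specific pattern should win; that ordering is my assumption for the sketch, and the function names are hypothetical.

```python
import re

# Patterns copied from the post, ordered from most to least specific so that
# "Great Hall Bird" is not swallowed by the plain "bird" pattern.
PATTERNS = [
    ("great hall bird", re.compile(r"[gG]reat\s?[hH]all\s?[bB]ird")),
    ("ferdinand", re.compile(r"[fF]erdinand")),
    ("bird", re.compile(r"[bB]ird")),
]


def bird_label(text):
    """Return the first matching label for a tweet, or None if no pattern matches."""
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return None


def share_mentioning_birds(texts):
    """Proportion of tweets containing the string [bB]ird."""
    if not texts:
        return 0.0
    bird = re.compile(r"[bB]ird")
    return sum(1 for text in texts if bird.search(text)) / len(texts)
```

Note that "[bB]ird" is a plain substring match, so it will also catch things like "Thunderbird" or "birdsong" if they ever turn up in a tweet.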

Unfortunately, I don't have access to historical Twitter data to check how #NWAV44 compared to other large conferences, like maybe #ICPHS2015.

UPDATE: I just realized that I didn't filter out tweets by the @GreatHallBird account itself when estimating the rate of tweets about the birds. When I exclude its tweets, as well as any line initial @-mentions, nothing really changes that much qualitatively, but the third day of the conference tops out at about 7% tweets about the bird, instead of 8%.

Wednesday, August 5, 2015

Gender in the Wasteland

One thing I do with the bit of leisure time I have is play video games. I feel like I need to be a bit self defensive about it, given my age and place in life, but according to surveys the average age of gamers is somewhere between 30 and 37, so I fit right in there (although, the age distribution probably has a strong leftwards skew, so it'd be nice to know the median age). Many mainstream games have socially problematic themes, like violence being the only available recourse to progression in the game. A really cool video series on that point is the Grand Theft Auto Pacifist. Another common problem games have is their portrayal of gender (the topic of this post), and on that topic, you should obviously go watch Tropes vs Women in Video Games.

I think these socially problematic themes are a bigger deal than when they show up in other media. Video games are unique in requiring an alignment of the audience's motivations with the main character. For example, the experience of watching Breaking Bad and observing Walter White's moral descent would be very different from playing a Breaking Bad game and controlling Walter White. What I've learned playing video games is that while it may be projected on your TV screen, the game is really in your head.

I've recently been playing Fallout Shelter, a mobile game set in the post nuclear apocalypse Fallout universe from Bethesda Studios. Fallout 4 is maybe the most highly anticipated game coming out in time for Christmas this year, and Fallout Shelter is a fun diversion for fans to play in the meantime. Your role is the overseer of a Vault-Tec vault where dwellers escape the radioactive wasteland. You build out and populate your vault, and assign dwellers tasks like food, water and energy production.
It's kinda like an ant colony.

Our story begins when I assigned a dweller to the Science Station I'd built so she could produce RadAways (a treatment for the omnipresent radiation). I wanted to equip her with the Professor Outfit, which would boost her Intelligence stat, speeding up the production of RadAways. I scrolled through the inventory a few times, and couldn't find the Professor Outfit. The only possibility I considered was that I'd forgotten to unequip it from the dweller who was wearing it before. But as I messed around, it became clear that you're unable to equip female dwellers with the Professor Outfit... Yeah, I know.

Here are two dwellers, Alexander and Judy, in the Science Station.

When I select Alexander and scroll through the outfits inventory, the Professor Outfit is right there between the Nightwear and Radiation Suit. When I select Judy, it's just absent. It's not even greyed out, just invisible.

This is a problem. There is a very strong cultural expectation that "Professors are Men", and this expectation gets visited upon my female friends and colleagues in really unfortunate ways. The most common and obvious day-to-day experience I've been told about is people assuming that "Prof Smith" is a man in e-mails. Or it's assumed that they're administrative support staff instead of faculty at meetings. Sometimes people assume female profs are students showing up for the first day of class, instead of being the instructor. One friend of mine said they got student feedback on a course saying that they were a great teacher and would doubtless be successful "in whatever career she pursues."

These are all obvious examples of women not being taken seriously in their professional roles, and the "Professors are Men" expectation has some obvious impacts on their careers. Women are grossly underrepresented at the highest faculty levels, and get paid about 90% of what men at equivalent levels do. So what's it matter that in Fallout Shelter, Alexander can wear the Professor Outfit and Judy isn't even presented with the option?

An all too common scene.
Well, first, it's just reinforcing the "Professors are Men" expectation. Who is this message reaching? For one, me, and probably many of my students, and we're the ones who do a lot of the damage with the "Professors are Men" assumption. But it's also probably reaching a larger portion of women than even other games with poor gender portrayals do. It's a mobile game, and mobile games are disproportionately popular among women. Also, it has an undeniably Sims-like element to it, and The Sims was also disproportionately popular among women. So, it's a fairly negative message about women, in all likelihood being disproportionately directed at women.

This is also a game with a bit of cultural reach. There is some speculation that it's out-earning Candy Crush Saga, and it topped the App Store charts for a while. Bethesda is also a really big and popular game studio, and Fallout Shelter bears the "Editor's Choice" badge.

Preventing female dwellers from equipping the Professor Outfit is also the only gender based equipping restriction I've come across in the game. All of the other outfits can be worn by male and female dwellers. Both Alexander and Judy can wear the Combat Armor, but it's drawn a bit differently for their different bodies.

Sometimes, with things like gender representation, there is an in-game explanation for why things are the way they are, and we need to have a conversation about why gender representation in fictional worlds is an important issue for our real world. But I don't think that's what's going on here. The fact that the outfits are drawn differently for male and female dwellers, and the fact that the Professor Outfit is just unceremoniously absent without any kind of in-game explanation suggests to me that they just didn't bother drawing a Professor Outfit for female dwellers. That is, we're observing here a real world instantiation of the "Professors are Men" assumption rather than some kind of intentional fictional representation of that assumption.

But just because it isn't intentional doesn't mean it isn't sexist, and doesn't mean it shouldn't be changed. Being unintentionally overlooked is exactly the problem my female colleagues are facing. I've sent Bethesda an email briefly outlining this, and asking them to rectify it in an update. Don't know if I'll hear back from them, or if I'm better off yelling at no one in the Wasteland.

Edit: The pregnancy mechanics are also pretty messed up.

Video games are a broad medium though (this is more self defensiveness), and can be utilized for all sorts of purposes. For example, the Iñupiat Cook Inlet Tribal Council had the game Never Alone made, in which they embedded parts of their oral traditions for their younger generation. It's a really pretty and fun game. There's also the episodic game Life is Strange, which is essentially a game about friendship, family, the near constant threat of victimization women face, and the anxieties we all face in trying to make the right choices in life.

Thursday, December 4, 2014

A Silly "Name" Generator

tl;dr: It looks like names aren't well modeled as a Markov process, but you can install my R package that does model names as Markov processes and mess around with it.

I don't know how I wound up writing a "name" generator yesterday, but I did. And now it's an R package (just on github for the moment), so you can play around with it too.

I was messing around with some other research questions when I decided to see what would happen if I tried to model given first names as a Markov process. Here's a picture of a Markov chain that contains just the letters of my own name: Joe.

First, you start out in a start state. Then, you move, with some probability, either to the o, j, or e character. Then, you move, with some probability, to one of the other states (one of the other letters or end), or stay in the same state. In this figure, I've highlighted the path that my own name actually takes, but there are actually an infinite number of possible paths through these states, including "names" such as "Eej", "Jeoeojo", "Jojoe", etc.

A Markov chain for all possible names would look a lot like this figure, but would have one state for every letter. Now, I keep saying that you move from one state to the next with "some probability," but with what probability? If you have a large collection of names, you can estimate these probabilities from the data. You just calculate for each letter what the probability is of any following letter. So for the letter "j", you count how many times a name went from "j" to any other letter. For boys' names in 2013, that looks like this table.

from   to     count
j      a     176452
j      o      84485
j      e      26118
j      u      25616
j      i       1920
j      h        425
j      r        121
j      c        118
j      d         98
j      end       55
...    ...      ...
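
To make the counting concrete, here is a minimal sketch of how transition counts like these could be tallied. It is in Python rather than R, it is not the code from my package, and it assumes the input is a mapping from each name to the number of babies given that name (the toy numbers are made up, not taken from babynames).

```python
from collections import Counter, defaultdict

START, END = "start", "end"


def transition_counts(name_counts):
    """Count letter-to-letter transitions, weighting each name by its frequency."""
    counts = defaultdict(Counter)
    for name, n in name_counts.items():
        # Wrap the letters in explicit start and end states.
        states = [START] + list(name.lower()) + [END]
        for previous, following in zip(states, states[1:]):
            counts[previous][following] += n
    return counts


def transition_probabilities(counts):
    """Turn counts into P(next state | current state) by normalizing each row."""
    probabilities = {}
    for previous, following in counts.items():
        row_total = sum(following.values())
        probabilities[previous] = {
            state: n / row_total for state, n in following.items()
        }
    return probabilities


# Made-up toy data: three babies named Joe, one named Jo.
toy_counts = transition_counts({"joe": 3, "jo": 1})
toy_probabilities = transition_probabilities(toy_counts)
```

In the toy data, "o" is followed by "e" three times and by the end state once, so the estimated probability of going from "o" to "e" is 0.75.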

As it turns out, a whole bunch of name data is available in the babynames R package put together by Hadley Wickham. So, I wrote a few functions that estimate the transition probabilities from the data for a given year (from 1880 to 2013) and a given sex, and then either generate random "names" or just return the most probable path through the states. How often does this return a for-real name? Sometimes, but not usually. For example, the most probable path through character states for boys born in 1970 is D[anericha], with that "anericha" bit repeating for infinity. For boys born in 1940, it's just an infinite sequence of Llllllll...

So, that introduces a problem where the end state is just not a very likely state to follow any given letter, so when generating random names from the Markov chain, they come out really really long. I introduced an additional process that probabilistically kills the chain as it gets longer based on the probability distribution of name lengths in the data, but that's just one more hack that goes to show that names aren't well modeled as a Markov chain. 
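
Here is a rough Python sketch of both ideas: following the most probable transition at every step, and randomly generating a name with a length-based kill switch. The stopping rule shown, the chance that a name ends at length k given that it has reached length k, is just one way to implement "kill the chain based on name lengths" and is an assumption of this sketch, not necessarily what the package does. The transition table and length counts are made-up toy values.

```python
import random

START, END = "start", "end"


def greedy_path(probabilities, max_steps=20):
    """Always take the most probable next state; cap the steps since it may loop."""
    state, letters = START, []
    for _ in range(max_steps):
        state = max(probabilities[state], key=probabilities[state].get)
        if state == END:
            break
        letters.append(state)
    return "".join(letters).capitalize()


def stop_probability(length_counts, k):
    """P(a name is exactly k letters long | it is at least k letters long)."""
    at_least_k = sum(n for length, n in length_counts.items() if length >= k)
    if at_least_k == 0:
        return 1.0  # longer than any observed name: always stop
    return length_counts.get(k, 0) / at_least_k


def random_name(probabilities, length_counts, rng):
    """Sample a path through the chain, killing it based on observed name lengths."""
    state, letters = START, []
    while True:
        options = probabilities[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        if state == END:
            break
        letters.append(state)
        if rng.random() < stop_probability(length_counts, len(letters)):
            break
    return "".join(letters).capitalize()


# Made-up toy chain in which "l" is most likely to be followed by another "l".
toy_probabilities = {
    "start": {"l": 1.0},
    "l": {"l": 0.6, "a": 0.3, "end": 0.1},
    "a": {"l": 0.5, "end": 0.5},
}
# Made-up counts of how many names have each length.
toy_lengths = {2: 1, 3: 2, 4: 1}
```

With the toy chain, `greedy_path` gets stuck on "l" exactly like the 1940 example, and `random_name` can never run past four letters, because the stop probability hits 1 at the longest observed length.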

Here's a little sample of random "names" generated by the transition probabilities for girls in 2013:
  • Elicia
  • Annis
  • Ttlila
  • Halenava
  • Amysso
  • Menel
  • Seran
  • Pyllula
  • Paieval
  • Anicrl

And here's a random sample of "names" generated by the transition probabilities for girls in 1913:
  • Lbeana
  • Peved
  • Math
  • Bysenen
  • Viel
  • Lelinen
  • Jabbesinn
  • Mabes
  • Drana
  • Lystha
The feeling I get looking at these is that they don't seem particularly gendered, even though there are clear gendered name trends. They don't even seem like they're from different times from each other. A lot of them aren't even orthographically valid. I don't know how they'd perform on "name likeness" tasks, but I don't even know what the point of doing such a task would be, since the Markov process has already failed at being a good model of names.

Maybe there's a lesson to be learned from the Markov process' failure to model names well, but for me it wound up being a silly diversion.

Update: I've now updated the package to generate names based on bigram -> character transition probabilities. The generate_n_names2() function generates things that look more like names. It's kind of fun!
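
The idea behind the bigram version is that the state is the previous two characters instead of just the previous one. Here is a minimal Python sketch of that idea; it is not the code behind generate_n_names2(), the padding symbols and function names are hypothetical, and the toy name counts are made up.

```python
import random
from collections import Counter, defaultdict

PAD, END = "^", "$"  # hypothetical padding and end-of-name symbols


def bigram_transition_counts(name_counts):
    """Count which character follows each pair of preceding characters."""
    counts = defaultdict(Counter)
    for name, n in name_counts.items():
        # Two pad symbols give the first letter a two-character context.
        chars = PAD * 2 + name.lower() + END
        for i in range(len(chars) - 2):
            counts[chars[i:i + 2]][chars[i + 2]] += n
    return counts


def generate_name(counts, rng, max_length=15):
    """Sample one name, conditioning each character on the previous two."""
    context, letters = PAD * 2, []
    while len(letters) < max_length:
        following = counts[context]
        nxt = rng.choices(list(following), weights=list(following.values()))[0]
        if nxt == END:
            break
        letters.append(nxt)
        context = context[1] + nxt  # slide the two-character window forward
    return "".join(letters).capitalize()


# Made-up toy data: two babies named Anna, one named Ann.
toy_counts = bigram_transition_counts({"anna": 2, "ann": 1})
```

Because the end state is now conditioned on two characters of context, it gets chosen at much more plausible points, which is a big part of why the output looks more name-like.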

10 generated girl names based on 2010 data:

  • Rookenn.
  • Lilein.
  • Hayla.
  • Dailee.
  • Bri.
  • Samila.
  • Abeleyla.
  • Eline.
  • An.
  • Rese.

10 generated boy names based on 2010 data:
  • Briah.
  • Dason.
  • Jul.
  • Messan.
  • Kiah.
  • Jax.
  • Se.
  • Frayden.
  • Dencorber.
  • Gel.
