Saturday, March 31, 2012

More on Philadelphia Homicide

I've been doing more analysis of the Philadelphia Homicide data that the Philadelphia Inquirer has published, and presented some of it at the Philadelphia UseR group yesterday. My slides [pdf] and source [knitr .Rnw] are on github.

I should be clear that I am not an expert on crime and murder. In fact, I'm not even fairly knowledgeable. If anyone out there with more expertise has strong criticism of my "analysis" (really, it's just a rough exploration of the data), I'll eat it, and I'll look forward to your own analysis of the data (again, it's right here). Here are some of the most striking patterns that I found.


First, here is the total number of murders that occurred over the past 23 years, broken down by the day of the week. The weekends are worse than the weekdays.

Next, here is the total number of murders by hour of the day. The hour of the day was not included in the data until 2006, so this only represents murders between 2006 and 2011. The plot is centered around midnight, so the afternoon of Day 1 is on the left, and the morning of Day 2 is on the right.
It looks like there's something weird going on around 11 PM and midnight, which I have to chalk up to the reporting patterns of the PPD. For some reason, it seems like murders which occurred in the midnight hour are more likely to be logged as occurring at 11 PM.
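
One way to get an hour axis centered on midnight is to reorder the hours so they run from noon to 11 AM. This is only a sketch of the idea, not the code behind the plot; the `hour` values are made up for illustration.

```r
# Hypothetical hours of the day (0-23); not taken from the Inquirer data.
hour <- c(13, 18, 23, 0, 2, 11)

# Shift so that noon lands at position 0 and midnight at position 12.
centered <- (hour + 12) %% 24

# Equivalent approach for plotting: order factor levels 12..23, then 0..11,
# so that midnight sits in the middle of the axis.
hour_f <- factor(hour, levels = c(12:23, 0:11))

table(hour_f)
```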

Here is the most striking plot that I produced this time around. It plots, by month, the average frequency of murders. The y-axis represents 1 murder every X days.

Since 1988, the African American community has been living in a Philadelphia with approximately a murder every day, or every other day. The White community, on the other hand, has been living in a Philadelphia with a murder once a week.

I also did some meager statistical analysis, specifically Poisson regression with terms for the month (that is, January, February, etc., to look for a seasonal pattern), race of the victim, and weapon used. There was a significant month effect, but the coefficients didn't have much of a pattern to them. I did use the number of days in the month as an offset in the regression, so it's not that. More importantly, there was an unsurprising main effect of race, but also a big interaction between race and weapon. Specifically, African American victims were way more likely to be killed by a gun.
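
For readers who want to see the shape of such a model, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the group labels "A" and "B", the simulated counts, and the simplified formula (month plus a race by weapon interaction). It also assumes the offset enters on the log scale, which is the usual way to model a rate with a log link.

```r
set.seed(1)

# Calendar of months with their lengths in days.
starts <- seq(as.Date("2006-01-01"), as.Date("2012-01-01"), by = "month")
calendar <- data.frame(
  year  = as.integer(format(head(starts, -1), "%Y")),
  month = factor(month.abb[as.integer(format(head(starts, -1), "%m"))],
                 levels = month.abb),
  ndays = as.numeric(diff(starts))
)

# Cross every month with placeholder race and weapon categories.
groups <- expand.grid(race = c("A", "B"), weapon = c("gun", "knife"))
toy <- merge(groups, calendar)  # no shared columns, so this is a cross join

# Simulated counts for illustration only; not the real homicide data.
toy$freq <- rpois(nrow(toy), lambda = 0.1 * toy$ndays)

# Poisson regression of monthly counts, with month length as exposure.
mod <- glm(freq ~ month + race * weapon + offset(log(ndays)),
           family = poisson, data = toy)

# Exponentiated coefficients read as multiplicative rate ratios.
exp(coef(mod)[c("raceB", "weaponknife", "raceB:weaponknife")])
```

The interaction term is the piece that corresponds to "one group's gun-to-knife multiplier differs from the other's".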

Guns and knives are the two most common weapons used in murders in the data. White murder victims are 2.54x more likely to have been shot than stabbed, while African American murder victims are 7.19x more likely to have been shot than stabbed, meaning that African American murder victims are 2.83x more likely than White murder victims to have been shot rather than stabbed.
Update: There was a pretty serious flaw in my regression, in that if there was a month where, say, no African Americans were murdered with a knife (and there were plenty of such months), that month's data was missing, rather than 0. After filling in the data appropriately to reflect months with 0 murders for a particular race x weapon combination, the estimates are pretty different. White murder victims are 5.71x more likely to have been murdered with a gun than a knife, while African American murder victims are 8.62x more likely to have been murdered with a gun than a knife, meaning African American victims are 1.51x more likely than White victims to have been shot rather than stabbed. So, that's a pretty serious revision, approximately halving the multiplier. I've already updated the linked code and slides.
So, gun deaths are an especially acute problem in the African American community. In fact, if you exclude gun deaths from the data, it actually looks like the racial disparity in murder rates has been narrowing.
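
The zero-filling fix described in the update can be sketched roughly as follows. The data frame, its column names, and the group labels are hypothetical stand-ins, not the real data or the linked code.

```r
# Hypothetical monthly counts; combinations with no murders are simply absent.
counts <- data.frame(
  month  = c("1988-01", "1988-01", "1988-01", "1988-02", "1988-02"),
  race   = c("A", "A", "B", "A", "B"),
  weapon = c("gun", "knife", "gun", "gun", "knife"),
  freq   = c(12, 3, 2, 9, 1),
  stringsAsFactors = FALSE
)

# Every month x race x weapon combination that could have occurred.
full_grid <- expand.grid(
  month  = unique(counts$month),
  race   = unique(counts$race),
  weapon = unique(counts$weapon),
  stringsAsFactors = FALSE
)

# Left join: combinations with no recorded murders come back as NA ...
filled <- merge(full_grid, counts, all.x = TRUE)

# ... and are then recorded as explicit zeros.
filled$freq[is.na(filled$freq)] <- 0

# The between-group multiplier is the ratio of the two gun-vs-knife
# multipliers reported in the post, before and after the fix.
before <- 7.19 / 2.54  # about 2.83
after  <- 8.62 / 5.71  # about 1.51
```

Without the explicit zeros, a count model only sees the months where a combination occurred at least once, which changes the estimated rates for rare combinations.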

It is purely coincidental that I'm posting this on the same day that the Philadelphia Police Department is holding a gun buyback. You can bring in a gun and receive a $100 ShopRite voucher, no questions asked. Seems like a good initiative.

Analysis Discussion

I spent a bit of time trying to figure out the most meaningful way to represent the murder rate. First, I calculated the murder frequency by counting the number of murders in each month, then dividing that by the number of days in the month: (n murders / n days) = murders per day. But the resulting measure had values like 0.14 murders per day, which isn't too informative. What people want to know about murders, or at least what I want to know, is how often murders happen, not how many happened in a given time window. So, instead, I calculated (n days / n murders) = days per murder.
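
The two measures are reciprocals of each other, which a couple of made-up months show. The counts below are hypothetical, chosen so that the second month matches the 0.14 murders per day mentioned above.

```r
# Hypothetical monthly counts, for illustration only.
n_murders <- c(30, 4)
n_days    <- c(31, 28)

murders_per_day <- n_murders / n_days   # second month: about 0.14
days_per_murder <- n_days / n_murders   # second month: 7, i.e. once a week

# Note: a month with zero murders would give Inf days per murder.
data.frame(n_murders, n_days, murders_per_day, days_per_murder)
```

Read this way, 0.14 murders per day becomes "a murder about once a week", which is easier to picture.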

The y-axis for the murder rate figures is also on a logarithmic scale, which is reasonable given both the distribution of the data and how people perceive timescales. From a human perspective, the difference between 1 day and 2 days feels larger than the difference between 3 weeks and 4 weeks. The y-axis is also flipped, to indicate that smaller numbers mean "more often". I managed the reversed log transformation by writing my own coordinate transformation using the new scales package. Here's the R code.
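
For reference, a reversed log transformation can be built along these lines with `scales::trans_new()`. This is a sketch under my own naming (`rev_log`, `rev_log_inv`, `reverse_log_trans`), with base 10 chosen as an assumption; it is not necessarily identical to the linked code.

```r
# Negated log10: smaller values (more frequent murders) map higher on the axis.
rev_log     <- function(x) -log10(x)
rev_log_inv <- function(y) 10^(-y)

# Build the transformation object only if the scales package is installed.
if (requireNamespace("scales", quietly = TRUE)) {
  reverse_log_trans <- scales::trans_new(
    name      = "reverse-log10",
    transform = rev_log,
    inverse   = rev_log_inv,
    breaks    = scales::log_breaks(base = 10),
    domain    = c(1e-100, Inf)
  )
}

# One murder per day plots above one murder per 30 days.
rev_log(1) > rev_log(30)
```

In ggplot2, an object like this can be passed to a continuous scale, for example `scale_y_continuous(trans = reverse_log_trans)`.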


  1. I am reorganizing part of my book so that I can mention this. Mind if I include your line chart?

  2. You hereby have my permission. I'll e-mail you the .pdf for high quality.

  3. hi, many thanks. I'll read it later on, but you have an error in the pdf you attached. It is from another presentation!

  4. Thanks! I've fixed it now.

  5. i'm trying to reproduce your example but i'm getting an error in estimating the model...

    > w.mod <- glm(freq ~ month * race * weapon,
    +        offset = ndays,        family = "poisson",
    +        data =
    Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
      contrasts can be applied only to factors with 2 or more levels

  6. it seems to be a problem with the offset parameter... if i remove it, the model is estimated... (in any case: congratulations for your blog! very interesting!)

  7. That's strange. Are you sure that you ran this piece of code before creating

    ndays <- data.frame( = seq(as.Date("1988-01-01"), 
        as.Date("2011-12-01"), by = "month"),
        ndays = as.numeric(diff(seq(as.Date("1988-01-01"), 
        as.Date("2012-01-01"), by = "month"))))
    philly <- join(philly, ndays, type="left")

    Since you say the issue is in the offset parameter, I'd have to guess that the problem has to be with the ndays column, and this is the code that creates that column.

