Saturday, March 31, 2012

More on Philadelphia Homicide

I've been doing more analysis of the Philadelphia Homicide data that the Philadelphia Inquirer has published, and presented some of it at the Philadelphia UseR group yesterday. My slides [pdf] and source [knitr .Rnw] are on github.

I should be clear that I am not an expert on crime and murder. In fact, I'm not even fairly knowledgeable. If anyone out there with more expertise has strong criticism of my "analysis" (really, it's just a rough exploration of the data), I'll eat it, and I'll look forward to your own analysis of the data (again, it's right here). Here are some of the most striking patterns that I found.

Results

First, here is the total number of murders that occurred over the past 23 years, broken down by the day of the week. The weekends are worse than the weekdays.

Next, here is the total number of murders by hour of the day. The hour of the day was not included in the data until 2006, so this only represents murders between 2006 and 2011. The plot is centered around midnight, so the afternoon of Day 1 is on the left, and the morning of Day 2 is on the right.
It looks like there's something weird going on around 11pm and midnight, which I have to chalk up to the reporting patterns of the PPD. For some reason, it seems like murders which occurred in the midnight hour are more likely to be logged as occurring at 11PM.
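
For anyone wanting to reproduce the midnight-centered axis, here is a minimal Python sketch (not the R code behind the plot). It assumes hours are stored as 0-23 integers, and the function name is made up for illustration.

```python
def midnight_centered_position(hour):
    """Map a 0-23 clock hour to a 0-23 axis position that starts at noon."""
    return (hour - 12) % 24

# Hours listed in plotting order: noon on the far left, 11am on the far right,
# with 11pm and midnight sitting next to each other in the middle.
axis_order = sorted(range(24), key=midnight_centered_position)
print(axis_order)
```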

Here is the most striking plot that I produced this time around. It plots, by month, the average frequency of murders. The y-axis represents 1 murder every X days.

Since 1988, the African American community has been living in a Philadelphia with approximately a murder every day, or every other day. The White community, on the other hand, has been living in a Philadelphia with a murder once a week.

I also did some meager statistical analysis, specifically Poisson regression with terms for the month (that is, January, February, etc., to look for a seasonal pattern), race of the victim, and weapon used. There was a significant month effect, but the coefficients didn't have much of a pattern to them. I did use the number of days in the month as an offset in the regression, so the month effect isn't just a matter of some months being longer than others. More importantly, there was an unsurprising main effect of race, but also a big interaction between race and weapon. Specifically, African American victims were way more likely to be killed by a gun.
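
Written out, a model of this kind might look roughly like the following. This is a sketch based only on the description above, not a transcription of the actual model formula, and the symbols are my own labels: y is the murder count in calendar month t for victim race r and weapon w, d_t is the number of days in that month, and m(t) is the month of the year.

```latex
\log \mathbb{E}\left[ y_{t,r,w} \right]
  = \log(d_t) + \beta_0 + \beta^{\text{month}}_{m(t)}
  + \beta^{\text{race}}_{r} + \beta^{\text{weapon}}_{w}
  + \beta^{\text{race} \times \text{weapon}}_{r,w}
```

The log(d_t) term is the offset: it has no coefficient of its own, so the remaining terms describe murders per day rather than murders per month.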

Guns and knives are the two most common weapons used in murders in the data. White murder victims are 2.54x more likely to have been shot than stabbed, while an African American murder victim is 7.19x more likely to have been shot than stabbed, meaning that African American murder victims are 2.83x more likely to have been shot than a White murder victim was.
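
The 2.83x figure is just the ratio of the two gun-to-knife ratios. A quick Python check of the arithmetic, using the two estimates quoted above (the function name is made up for illustration):

```python
def relative_ratio(ratio_a, ratio_b):
    """How many times larger ratio_a is than ratio_b."""
    return ratio_a / ratio_b

white_gun_vs_knife = 2.54  # estimate quoted in the post
black_gun_vs_knife = 7.19  # estimate quoted in the post

print(round(relative_ratio(black_gun_vs_knife, white_gun_vs_knife), 2))  # 2.83
```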
Update: There was a pretty serious flaw in my regression: if there was a month where, say, no African Americans were murdered with a knife (and there were plenty of those), that month's data was missing, rather than 0. After filling in the data appropriately to reflect months with 0 murders for a particular race x weapon combination, the estimates are pretty different. White murder victims are 5.71x more likely to have been murdered with a gun than a knife, while African American murder victims are 8.62x more likely to have been murdered with a gun than a knife, meaning that African American murder victims are 1.51x more likely to have been shot than a White murder victim was. So, that's a pretty serious revision, approximately halving the multiplier. I've already updated the linked code and slides.
So, gun deaths are an especially acute problem in the African American community. In fact, if you exclude gun deaths from the data, it actually looks like the racial disparity in murder rates has been narrowing.
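
The zero-filling fix described in the update can be sketched like this. It is a minimal Python illustration, not the R code linked from the post, and the records are made up purely to show the shape of the problem.

```python
from collections import Counter
from itertools import product

# Made-up records for illustration, one per homicide: (month, race, weapon).
records = [
    ("2011-01", "Black", "gun"),
    ("2011-01", "Black", "gun"),
    ("2011-01", "White", "knife"),
    ("2011-02", "Black", "knife"),
]

# Counting only what was observed leaves unseen combinations out entirely.
observed = Counter(records)

months = sorted({month for month, _, _ in records})
races = sorted({race for _, race, _ in records})
weapons = sorted({weapon for _, _, weapon in records})

# A Counter returns 0 for keys it has never seen, so looking up every
# month x race x weapon combination produces explicit zero rows.
filled = {combo: observed[combo] for combo in product(months, races, weapons)}

print(len(observed), "observed combinations,", len(filled), "after filling")
```

Without the explicit zeros, the regression only ever sees months where a combination had at least one murder, which is the flaw the update describes.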


It is purely coincidental that I'm posting this on the same day that the Philadelphia Police Department is holding a gun buyback. You can bring in a gun and receive a $100 Shoprite voucher, no questions asked. Seems like a good initiative.

Analysis Discussion

I spent a bit of time trying to figure out the most meaningful way to represent the murder rate. First, I calculated the murder frequency by counting the number of murders in each month, then dividing that by the number of days in the month: (n murders/n days) = murders per day. But the resulting measure had values like 0.14 murders per day, which isn't too informative. What people want to know about murders, or at least what I want to know, is how often murders happen, not how many happened in a given time window. So, instead, I calculated (n days/n murders) = days per murder.
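
Here is a small Python sketch of the two measures side by side (again, not the R code behind the plots). The function names are made up, the example month is hypothetical, and returning infinity for a month with no murders is my own assumption about how to handle that case.

```python
import calendar

def murders_per_day(year, month, n_murders):
    """Murders divided by the number of days in the month."""
    n_days = calendar.monthrange(year, month)[1]
    return n_murders / n_days

def days_per_murder(year, month, n_murders):
    """Days in the month divided by murders: 'one murder every X days'."""
    n_days = calendar.monthrange(year, month)[1]
    if n_murders == 0:
        return float("inf")  # assumption: no murders means no finite interval
    return n_days / n_murders

# Hypothetical example: 4 murders in February 2011 (a 28-day month).
print(round(murders_per_day(2011, 2, 4), 2))  # 0.14
print(days_per_murder(2011, 2, 4))            # 7.0
```

So a value like 0.14 murders per day is the same thing as roughly one murder a week, which is much easier to picture.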

The y-axis for the murder rate figures is also on a logarithmic scale, which is reasonable given both the distribution of the data and the way people perceive timescales. From a human perspective, the difference between 1 day and 2 days feels larger than the difference between 3 weeks and 4 weeks. The y-axis is also flipped, to indicate that smaller numbers mean "more often". I managed the reversed log transformation by writing my own coordinate transformation using the new scales package. Here's the R code.
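
The idea behind a reversed log transformation is just a pair of functions: one that maps a value to its position on the axis, and its inverse. Below is a minimal Python sketch of that pair, assuming base 10. It is not the linked R code, and the function names are made up.

```python
import math

def reverse_log10(days):
    """Axis position for a days-per-murder value: smaller values sit higher."""
    return -math.log10(days)

def reverse_log10_inverse(position):
    """Recover the days-per-murder value from an axis position."""
    return 10 ** (-position)

# On this scale, 1 day vs. 2 days is a bigger gap than 3 weeks vs. 4 weeks.
gap_one_to_two_days = reverse_log10(1) - reverse_log10(2)
gap_three_to_four_weeks = reverse_log10(21) - reverse_log10(28)
print(gap_one_to_two_days > gap_three_to_four_weeks)  # True
```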

Wednesday, March 7, 2012

Philadelphia Schools

I'm on spring break, and yesterday I took some time to check off some items on my to-do list, namely:
  1. Start getting acquainted with all the new features of ggplot2 [PDF].
  2. Get a handle on dealing with geographic data in R.
I've done some furtive geographic analysis using R [pdf], but the code behind it was very hacky. There is a whole field of geospatial data analysis out there that I was really ignorant of, and still am, but I've made a little bit of progress.

I mostly followed the tutorial laid out here for making maps in ggplot2. The most difficult part was getting the rgdal package installed. It's one of those packages that relies on other, non-R libraries being installed. I managed to get GDAL and Proj.4 installed (even though I honestly don't know what they do), and got rgdal installed (I had to work around an apparently non-standard installation location for Proj.4).

Now, it's all about getting some good data, and fortunately, I stumbled across opendataphilly.org yesterday as well! I found a shapefile of all schools in Philadelphia, and a separate data set about how many public and charter high school graduates in 2010 went on to postsecondary education of various sorts. Unfortunately, there weren't any shared IDs of any sort between the two data sets, so to join them I had to hack it by hand, mostly.

So, here is the result.
I'm not sure what I expected to see, which certainly weakens any conclusions I'd like to draw, but I am surprised at how little geographic patterning there is. I'm also almost certain that there are some data reporting problems. For example, that huge dark blue dot in the Northeast is Northeast High School, which reports that of its 652 graduates, 0 went on to any postsecondary education. I just don't think that can be true, and not because I'm an idealist. Northeast is right down the street from where I grew up, and while it's not a fancy prep school by any means, it has both a Magnet program and an International Baccalaureate program.

There's no way that zero students from Northeast went on to postsecondary education, a category which includes non-degree granting programs and specialized training programs. It's a lot more likely that they either didn't report the numbers, or the Pennsylvania Department of Education lost them, and then didn't distinguish between missing data and 0. Unfortunately, that calls all schools with reports of 0% postsecondary education into question, even though some schools probably did have 0 students go on to further education.

Looking at the distribution of the proportion of graduates going on to postsecondary education, the numbers are hugely bimodal (at least for the public schools).


Even after excluding the schools which reported 0 students going on to postsecondary education, there are still 3 schools with basically 0 students getting further education out of high school: Frankford (1/341), West Philly (1/208) and University City (2/205).

Excluding the schools which reported fewer than 1% of students going on to further education (assuming either that they have faulty data, or have acute problems of other sorts), I replotted the map (note that the colors now run from 50% to 100%).
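
As a quick check that the 1% cutoff catches the schools mentioned above, here is a minimal Python sketch using only the counts quoted in this post (the actual filtering was done in R on the full data set, so the variable names here are made up):

```python
# (graduates who went on to postsecondary education, total graduates),
# using only the figures quoted in the post.
schools = {
    "Northeast": (0, 652),
    "Frankford": (1, 341),
    "West Philly": (1, 208),
    "University City": (2, 205),
}

THRESHOLD = 0.01  # schools below 1% are dropped before replotting

proportions = {
    name: went_on / graduates
    for name, (went_on, graduates) in schools.items()
}
excluded = sorted(name for name, p in proportions.items() if p < THRESHOLD)

for name in excluded:
    print(f"{name}: {proportions[name]:.2%}")
```

All four fall under the cutoff, with University City the closest to it at just under 1%.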


Still no huge geographic patterns.

Here's the R code that I used (including links to the data).

Sunday, March 4, 2012

My Pocket Change

I'm playing around with some personal data collection, and using some cloud computing to visualize it. Following the directions in this blog post, I've written an R function which visualizes data it draws from a Google Docs spreadsheet, and uploaded it to OpenCPU's servers. The plots you're seeing in this post were actually generated by OpenCPU when you loaded this page, meaning they're live!


So, I've been logging, daily, my pocket change. The first plot shows the cumulative growth of the change in my change jar by 3 different measures: the raw number of each kind of coin, the total value contributed by each kind of coin, and the total mass contributed by each kind of coin (based on official data on how much each kind of coin should weigh).


This plot shows the proportional contribution each coin makes to each measure. The first panel shows what percent of all my coins belong to each type, the second panel shows how much each kind of coin contributes proportionally to the overall value, and the third shows how much each kind of coin contributes to the overall mass.
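
For the curious, the three proportions can be computed along these lines. This is a Python sketch rather than the R function running on OpenCPU. The masses are the current US Mint specifications, which I'm assuming match the "official data" mentioned above, and the jar contents are hypothetical, not my real counts.

```python
# Current US Mint specifications: (face value in cents, mass in grams).
COIN_SPECS = {
    "penny": (1, 2.500),
    "nickel": (5, 5.000),
    "dime": (10, 2.268),
    "quarter": (25, 5.670),
}

# Hypothetical jar contents (number of each coin), for illustration only.
jar = {"penny": 40, "nickel": 10, "dime": 15, "quarter": 20}

def proportions(totals):
    """Convert per-coin totals into shares that sum to 1."""
    grand_total = sum(totals.values())
    return {coin: amount / grand_total for coin, amount in totals.items()}

shares = {
    "count": proportions(jar),
    "value": proportions({c: n * COIN_SPECS[c][0] for c, n in jar.items()}),
    "mass": proportions({c: n * COIN_SPECS[c][1] for c, n in jar.items()}),
}

for measure, by_coin in shares.items():
    print(measure, {coin: round(share, 3) for coin, share in by_coin.items()})
```

With these made-up counts, pennies account for a much larger share of the jar's mass than of its value.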


So, depending on how long I keep this habit up, if you keep checking in on this post, you'll see new plots every day.

I have two primary motivations for logging my coins. First, last time I cashed in all my change, someone asked me how long it took me to save it up, and I had no idea! Second, I'm curious to see how much effort I'm putting into carrying around relatively heavy coins, like pennies, for their small contribution to the overall value of my coin jar.

Friday, March 2, 2012

A terrible 2000 words

I've only just started looking at the homicide data made available by the Philadelphia Inquirer in my free time (which is hard to come by lately). I've been thinking about what sorts of statistics I could do, or what kinds of additional data sets I could merge in, but I think these simple plots already tell a terrible story about what is happening to whom.



I should point out that the plot with month on the x-axis is also missing a whole year's worth of data, because apparently in 1991 the day of a reported homicide wasn't recorded.