Do the police in the US really shoot black people more often than white people? Is use of lethal force connected with race? How is crime related to race? What are the odds of getting shot by the police if you are white and if you are black? We're taking public data and Python with pandas to shed some light on these questions, propaganda and politics set far aside.
Knowing how 'political' the following narrative may appear and how much personal opinions on certain present-day subjects may vary, I'd like to introduce some preliminary reservations:
- The author is not a racist and does not think that representatives of some races should possess any privileges or preferences over those of other races. I see all people as brothers!
- The author does not intend to give any political or social color to this narrative by supporting this or that popular opinion on political or social aspects which exceed the framework of this research.
- The goal of this research is a purely statistical analysis of publicly available data to discover internal correlations and trends; I leave it to my readers to generalize.
- All data used here were taken from public sources explicitly cited in the text, so any reader may verify them if necessary. The author, however, bears no responsibility for the validity of the source data, taking them as-is, without any modifications. Any doubts about this research should therefore be directed at the source data, which the author cannot affect.
- I do not call myself a professional data scientist and use here only the most basic analytical tools (and sometimes, I fear, not altogether efficiently). I would appreciate any tips to optimize this or that or make this research more profound.
During the Soviet days, our fathers, mothers and grandparents were incessantly reminded how 'imperialists' oppressed and tyrannized other races; how even after serfdom had been abolished in Russia, American 'capitalists' had kept exploiting the labor of Africans and their progeny; how even then, in the 20th century, oppression was still going on after the formal abolition of slavery, showing itself in the most hideous forms of apartheid, humiliation, racism and hatred… Classic novels like Harriet Beecher Stowe's Uncle Tom's Cabin and Harper Lee's To Kill a Mockingbird did much to stoke the indignation of the champions of liberty around the world. Indeed, white supremacism and racism were rife in the US until about the 1960s or 1970s. But of course, it was also excellent fuel for Socialist propaganda, which laid it on with a shovel when depicting the atrocities of the 'capitalist sharks'. Starting in the mid-1950s, a strong movement against racial inequality arose in the US; it was eventually supported by the government and had changed the civil rights situation dramatically by the 1980s. Well, you can read about all of this in Wikipedia, say. But what now?
Almost everything that our parents used to read in the Pravda newspaper back in the 1960s is now pouring on us from all American mass media! Violence by the police and other law enforcement agencies! As we have all seen, after the killing of George Floyd the US streets were flooded by mass protests, which in some places degenerated into violence and looting under the Black Lives Matter banner. The final verdict that seems to be officially supported in the US is that the police use lethal force against black civilians because of
Like many of you (I am sure), I am eager to investigate some subject on my own, particularly if:
- the subject is being widely discussed and brings about much argument;
- the subject is covered by the mass media in an apparently lopsided manner (revealing propaganda of this or that point of view);
- there is sufficient source data available for analysis.
It's interesting to note that these three points are interrelated: 1) topical subjects almost invariably receive biased media coverage, for was there ever unbiased media? 2) such hot potatoes engender activist communities that start amassing and analyzing data to ground their views (or for fairness' sake); official bodies will also start releasing / declassifying materials, so they can't be blamed for foul play. We'll talk about such materials in a minute, but now for the goals.
Initially I asked myself the following questions:
- What are the statistics of lethal force used by the police against whites and blacks in absolute values (number of cases) and in unit values (per capita for each race)? Can one say the police kill blacks more frequently than whites?
- What are the statistics of crimes committed by representatives of both races (in absolute and per capita values)? Which race is statistically more prone to crime?
- Is there a connection between the use-of-force data and the crime data (both for the entire country and for each of the analyzed races)? Can one say the police kill in proportion to the number of crimes?
- How are the trends found for questions 1 — 3 distributed across the individual states?
That's it for the time being, although I can't say other questions won't be added during the research, which now only scrapes the surface.
Qualifications and Assumptions
Did you read the Disclaimer at the top? :) Besides that, here are some other assumptions and concessions made for the research, mostly to simplify things:
- The research concerns only the United States of America and no other countries.
- Throughout the narrative, I will use the shortened form 'Black' / 'Blacks' to denote people belonging to the Black / African-American population, and 'White' / 'Whites' to denote people belonging to the White American population. These abbreviated terms are used merely for shortness and must not be treated in a context of disrespect.
- The White population analyzed here ('Whites') includes the Hispanic/Latino population of the USA, but excludes Asians, American Indians / Natives, Native Hawaiians and other Pacific Islanders, and the mixed-race population, in accordance with the US population by race data in Wikipedia, which quotes the US Census Bureau. Since many readers have been telling me that lumping these groups together is not correct, let me stress once again that this is a least-evil solution: the crime source data don't single out the Hispanic/Latino ethnicity either, but use a strictly racial classification.
- The research has only Blacks and Whites as the object; population belonging to other races, as well as population whose race is unknown in the sources, fall out of the research. This is a deliberate constraint made for simplification, based on the fact that these two categories make up together some 80% of the total US population. I don't however completely rule out the chance of adding the other race categories in the future.
Let's see what source data we need for the research. In keeping with the mentioned goals, we need:
- data on committed crimes, with the perpetrators' race, crime categories and locations (states);
- data on use of lethal (police) force, with the victims' race and event locations (states);
- data on population by years and races (to calculate per capita values).
The FBI Crime Data Explorer public database is used as the source of crime data; it has an extended API and features data on US crimes, arrests and victims from 1991 to 2018.
The use-of-lethal-force data is taken from FatalEncounters, a public project run by the community. The downloadable dataset currently contains over 28 thousand entries starting from the year 2000, with detailed information on each fatality, a brief description of the circumstances, links to media stories, event locations, etc. There are other public sources with the same purpose, for example MappingPoliceViolence (ca. 8,400 entries from 2013 on), or the Washington Post database (ca. 5,600 entries from 2015 on). However, the FatalEncounters (FENC) database is so far the most comprehensive of these, with a 20-year record of events, so I opted for that one. As a sidenote, the FBI has also announced its Use Of Force project, but the database will go public only when the share of reporting agencies reaches a statistically valid figure.
Last but not least, the total population stats for different races in the US come from Wikipedia, which in turn quotes the official source, the US Census Bureau. The data, however, is available only for the 2010 to 2018 timeframe. This constraint made it necessary to: 1) set the final datapoint year to 2018; 2) use predicted population data for the 2000 to 2009 period, obtained by simple linear regression (which is justified by the roughly linear nature of population growth). We will thus investigate the period from 2000 (the starting year in the FENC use-of-force database) through 2018 (the end year in the population data). All conclusions are drawn from observations for these 19 years.
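To illustrate the regression step, here is a minimal sketch of how a straight line fitted to the known years can be extended back to 2000. The figures below are made-up placeholders rather than census values, the variable names are mine, and I use numpy here purely for illustration (the actual prediction was done in Excel):

```python
import numpy as np

# Made-up placeholder series (in millions), NOT real census figures.
known_years = np.arange(2010, 2019)               # 2010..2018
known_pop = 100.0 + 1.5 * (known_years - 2010)    # perfectly linear toy data

# Centre the years on 2010 so the least-squares fit is well conditioned,
# then fit population = slope * (year - 2010) + intercept.
slope, intercept = np.polyfit(known_years - 2010, known_pop, deg=1)

# Extend the fitted line backwards over the years with no data.
missing_years = np.arange(2000, 2010)             # 2000..2009
predicted_pop = slope * (missing_years - 2010) + intercept

for year, pop in zip(missing_years, predicted_pop):
    print(f"{year}: {pop:.1f}")
```

With real data the points will not lie exactly on a line, so it is worth plotting the fit against the known years before trusting the extrapolated values.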
Preparing Source Data
Before going further, we must download the source datasets and make them suitable for investigation.
The use-of-force source is simple enough: we just download the complete database from the FENC website and save it as a CSV file (you may leave it in the original Excel format, but I prefer CSV for consistency). Here is the direct link to the original file on Google Spreadsheets; you can download it as CSV from here.
FENC data fields (those used in the research are in bold)
- Unique ID
- Subject's name
- Subject's age
- Subject's gender
- Subject's race
- Subject's race with imputations
- Imputation probability
- URL of image of deceased
- Date of injury resulting in death (month/day/year)
- Location of injury (address)
- Location of death (city)
- Location of death (state)
- Location of death (zip code)
- Location of death (county)
- Full Address
- Agency responsible for death
- Cause of death
- A brief description of the circumstances surrounding the death
- Dispositions/Exclusions INTERNAL USE, NOT FOR ANALYSIS
- Intentional Use of Force (Developing)
- Link to news article or photo of official document
- Symptoms of mental illness? INTERNAL USE, NOT FOR ANALYSIS
- Unique ID formula
- Unique identifier (redundant)
- Date (Year)
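As a rough sketch of how such a file can be loaded and trimmed with pandas: the column names below are copied from the field list above, but the choice of these three columns, the sample rows and the race labels are my placeholders for illustration (the real file uses its own labels, and in practice you would pass the path of the downloaded CSV instead of the in-memory sample).

```python
import io
import pandas as pd

# Tiny made-up sample standing in for the downloaded file. The rows are
# placeholders and the race labels are assumed, not copied from FENC.
sample_csv = io.StringIO(
    "Unique ID,Subject's race,Location of death (state),Date (Year)\n"
    "1,White,TX,1999\n"
    "2,Black,CA,2005\n"
    "3,White,NY,2018\n"
    "4,Unknown,FL,2010\n"
    "5,Black,TX,2019\n"
)

# Read only the columns needed for counting cases by race, state and year.
columns = ["Subject's race", "Location of death (state)", "Date (Year)"]
fenc = pd.read_csv(sample_csv, usecols=columns)

# Keep the 2000-2018 study window and the two races under study.
in_window = fenc["Date (Year)"].between(2000, 2018)
in_scope = fenc["Subject's race"].isin(["White", "Black"])
fenc = fenc[in_window & in_scope]

# Count fatalities per year and race (years as rows, races as columns).
counts = (
    fenc.groupby(["Date (Year)", "Subject's race"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```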
The population data was saved from Wikipedia into Excel, where it was then prepended with 2000 to 2009 data predicted with simple linear (least-squares) regression. Download the Excel and resulting CSV using this link.
Population data fields (those used in the research are in bold)
- White_pop — population of Whites
- Black_pop — population of Blacks
- Asian_pop — population of Asians
- Native Hawaiian_pop — population of Native Hawaiians
- American Indian_pop — population of American Indians
- Unknown_pop — population of people with mixed / unknown races
The most interesting part is downloading and preparing crime data from the FBI database. I wrote a Python program for that purpose (download link), which connects to the FBI Crime Data Explorer (CDE) database using an individual API key (you can get one here). The API uses REST to handle requests to the various target endpoints and returns results in JSON. The Python script downloads and integrates data into a single pandas DataFrame which is then saved as a CSV file. The same file is merged with population data to calculate per capita crime counts. Download the resulting CSV here.
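The integration step can be pictured roughly as follows. This is only a sketch: the payload layout and key names (`year`, `offense`, `counts`) are hypothetical and do not reproduce the actual CDE response schema, and the network requests themselves are left out, so the code starts from JSON that has already been decoded into Python dicts.

```python
import pandas as pd

# Hypothetical decoded JSON responses, one per (year, offense) request.
# Both the structure and the numbers are invented for illustration.
responses = [
    {"year": 2017, "offense": "All Offenses", "counts": {"White": 10, "Black": 5}},
    {"year": 2018, "offense": "All Offenses", "counts": {"White": 12, "Black": 6}},
]


def to_row(payload):
    """Flatten one response into a flat dict usable as a DataFrame row."""
    row = {"Year": payload["year"], "Offense": payload["offense"]}
    row.update(payload["counts"])  # one column per race
    return row


# Integrate all responses into a single DataFrame.
crimes = pd.DataFrame([to_row(p) for p in responses])

# Serialize to CSV text; pass a file path to to_csv() to write a file instead.
csv_text = crimes.to_csv(index=False)
print(csv_text)
```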
Crime data fields (those used in the research are in bold)
- Year — crime year
- Offense — crime (offense) category, one of:
  - All Offenses
  - Assault Offenses
  - Drugs Narcotic Offenses
  - Larceny Theft Offenses
  - Murder And Nonnegligent Manslaughter
  - Sex Offenses
  - Weapon Law Violation
- Class — classifier (race for this research, but it could be age or gender)
- Offender/Victim — whether the data point is related to the offender or the victim(s), we filter by Offender here
- Asian — number of crimes committed by Asians
- Native Hawaiian — number of crimes committed by Native Hawaiians
- Black — number of crimes committed by Black / African-American people
- American Indian — number of crimes committed by American Indians
- Unknown — number of crimes committed by people of other / unknown races
- White — number of crimes committed by White people (including Hispanic/Latinos)
- White_pop — total population of Whites for that year
- Black_pop — total population of Blacks for that year
- Asian_pop — total population of Asians for that year
- Native Hawaiian_pop — total population of Native Hawaiians for that year
- American Indian_pop — total population of American Indians for that year
- Unknown_pop — total population of people of other / unknown races for that year
- Asian pro capita — per capita number of crimes committed by Asians
- Native Hawaiian pro capita — per capita number of crimes committed by Native Hawaiians
- Black pro capita — per capita number of crimes committed by Black / African-American people
- American Indian pro capita — per capita number of crimes committed by American Indians
- Unknown pro capita — per capita number of crimes committed by people of other / unknown races
- White pro capita — per capita number of crimes committed by White people (including Hispanic/Latinos)
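The per capita columns listed above can be derived by joining the crime counts with the population table and dividing one by the other. A minimal sketch, assuming both tables share a `Year` column to join on (the population field list does not show one, so that is my assumption) and using made-up numbers:

```python
import pandas as pd

# Made-up placeholder numbers, NOT real crime or census data.
crimes = pd.DataFrame({
    "Year": [2017, 2018],
    "Offense": ["All Offenses", "All Offenses"],
    "White": [200, 300],
    "Black": [100, 120],
})
population = pd.DataFrame({
    "Year": [2017, 2018],
    "White_pop": [1000, 1200],
    "Black_pop": [400, 480],
})

# Attach the population of each year to every crime row of that year.
merged = crimes.merge(population, on="Year", how="left")

# Per capita value = number of crimes / population of the same race and year.
for race in ["White", "Black"]:
    merged[f"{race} pro capita"] = merged[race] / merged[f"{race}_pop"]

print(merged[["Year", "White pro capita", "Black pro capita"]])
```

A left join keeps every crime row even if a year is missing from the population table; the per capita value for such a row simply comes out as NaN, which makes gaps easy to spot.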
All data analysis is done with Python 3.8 in an interactive Jupyter Notebook. I also use these packages:
- pandas 1.0.3 (for numerical data analysis)
- folium 0.11 (for US heat maps)
All this stuff is available via the WinPython distro that I've been using on my Windows machine for a number of years already due to its many advantages and features. You can use another one if you want (like Anaconda), or a pure Python installation with additional packages.
In fact, this analysis can easily be reproduced with any other stats / maths software like R, MATLAB, SAS or even Excel. Pick your weapon, as the phrase goes :)
We'll dive into the data analysis in the next part of this publication.