S0mbre Sep 18 2020 at 02:00

Crime, Race and Lethal Force in the USA — Part 2

14 min

2.3K

Big Data*Data Mining*Open source*Python*

Translation

In the previous part of this article, I talked about the research background, goals, assumptions, source data, and used tools. Today, without further ado, let's say together…

Chocks Away!

We start by importing the required packages and defining the root folder where the source data sit:

import pandas as pd, numpy as np

# root folder path (change for your own!)
ROOT_FOLDER = r'c:\_PROG_\Projects\us_crimes'

Lethal Force Fatalities

First let's look into the use of lethal force data. Load the CSV into a new DataFrame:

# FENC source CSV
FENC_FILE = ROOT_FOLDER + '\\fatal_enc_db.csv'

# read to DataFrame
df_fenc = pd.read_csv(FENC_FILE, sep=';', header=0, usecols=["Date (Year)", "Subject's race with imputations", "Cause of death", "Intentional Use of Force (Developing)", "Location of death (state)"])

You will note that not all the fields are loaded, but only those we'll need in the research: year, victim race (with imputations), cause of death (not used now but may come in useful later on), intentional use of force flag, and the state where the death took place.

It's worth understanding what «subject's race with imputations» means. The fact is, the official / media sources that FENC uses to glean data don't always report the victim's race, resulting in data gaps. To compensate these gaps, the FENC community involves third-party experts who estimate the race by the other available data (with some degree of error). You can read more on this on the FENC website or see notes in the original Excel spreadsheet (sheet 2).

We'll then give the columns handier titles and drop the rows with missing data:

df_fenc.columns = ['Race', 'State', 'Cause', 'UOF', 'Year']
df_fenc.dropna(inplace=True)

Now we hava to unify the race categories with those used in the crime and population datasets we are going to match with, since these datasets use somewhat different racial classifications. The FENC database, for one, singles out the Hispanic/Latino ethnicity, as well as Asian/Pacific Islanders and Middle Easterns. But in this research, we're focusing on Blacks and Whites only. So we must make some aggregation / renaming:

df_fenc = df_fenc.replace({'Race': {'European-American/White': 'White', 
                          'African-American/Black': 'Black', 
                          'Hispanic/Latino': 'White', 'Native American/Alaskan': 'American Indian',
                          'Asian/Pacific Islander': 'Asian', 'Middle Eastern': 'Asian',
                          'NA': 'Unknown', 'Race unspecified': 'Unknown'}}, value=None)

We are leaving only White (now including Hispanic/Latino) and Black victims:

df_fenc = df_fenc.loc[df_fenc['Race'].isin(['White', 'Black'])]

What's the purpose of the UOF (Use Of Force) field? For this research, we want to analyze only those cases when the police (or other law enforcement agencies) intentionally used lethal force. We leave out cases when the death was the result of suicide (for example, when sieged by the police) or pursuit and crash in a vehicle. This constraint follows from two criteria:

1) the circumstances of deaths not directly resulting from use of force don't normally allow of a transparent cause-and-effect link between the acts of the law enforcement officers and the ensuing death (one example could be when a man dies from a heart attack when held at gun-point by a police officer; another common example is when a suspect being arrested shoots him/herself in the head);
2) it is only intentional use of force that counts in official statistics; thus, for instance, the future FBI database I mentioned in the previous part of the article will collect only such cases.

So to leave only intentional use of force cases:

df_fenc = df_fenc.loc[df_fenc['UOF'].isin(['Deadly force', 'Intentional use of force'])]

For convenience we'll add the full state names. I made a separate CSV file for that purpose, which we're now merging with our data:

df_state_names = pd.read_csv(ROOT_FOLDER + '\\us_states.csv', sep=';', header=0)
df_fenc = df_fenc.merge(df_state_names, how='inner', left_on='State', right_on='state_abbr')

Type df_fenc.head() to peek at the resulting dataset:

	Race	State	Cause	UOF	Year	state_name	state_abbr
0	Black	GA	Gunshot	Deadly force	2000	Georgia	GA
1	Black	GA	Gunshot	Deadly force	2000	Georgia	GA
2	Black	GA	Gunshot	Deadly force	2000	Georgia	GA
3	Black	GA	Gunshot	Deadly force	2000	Georgia	GA
4	Black	GA	Gunshot	Deadly force	2000	Georgia	GA

Since we're not going to investigate the individual cases, let's aggregate the data by years and victim races:

# group by year and race
ds_fenc_agg = df_fenc.groupby(['Year', 'Race']).count()['Cause']
df_fenc_agg = ds_fenc_agg.unstack(level=1)
# cast numericals to UINT16 to save memory
df_fenc_agg = df_fenc_agg.astype('uint16')

The resulting table is indexed by years (2000 — 2020) and contains two columns: 'White' (number of white victims) and 'Black' (number of black victims). Let's take a look at the corresponding plot:

plt = df_fenc_agg.plot(xticks=df_fenc_agg.index, color=['olive', 'g'])
plt.set_xticklabels(df_fenc_agg.index, rotation='vertical')
plt.set_xlabel('')
plt.set_ylabel('Number of police victims')
plt

Intermediate conclusion:

White police victims outnumber black victims in absolute figures.

The average difference factor between the two is about 2.4. It's not a far guess that this is due to the difference between the population of the two races in the US. Well, let's look at per capita values then.

Load the population data:

# population CSV file (1991 - 2018 data points)
POP_FILE = ROOT_FOLDER + '\\us_pop_1991-2018.csv'
df_pop = pd.read_csv(POP_FILE, index_col=0, dtype='int64')

Then merge the data with our dataset:

# take only Black and White population for 2000 - 2018
df_pop = df_pop.loc[2000:2018, ['White_pop', 'Black_pop']]

# join dataframes and drop rows with missing values
df_fenc_agg = df_fenc_agg.join(df_pop)
df_fenc_agg.dropna(inplace=True)

# cast population numbers to integer type
df_fenc_agg = df_fenc_agg.astype({'White_pop': 'uint32', 'Black_pop': 'uint32'})

OK. Finally, create two new columns with per capita (per million) values dividing the abosulte victim counts by the respective race population and multiplying by one million:

df_fenc_agg['White_promln'] = df_fenc_agg['White'] * 1e6 / df_fenc_agg['White_pop']
df_fenc_agg['Black_promln'] = df_fenc_agg['Black'] * 1e6 / df_fenc_agg['Black_pop']

Let's see what we get:

	Black	White	White_pop	Black_pop	White_promln	Black_promln
Year
2000	148	291	218756353	35410436	1.330247	4.179559
2001	158	353	219843871	35758783	1.605685	4.418495
2002	161	363	220931389	36107130	1.643044	4.458953
2003	179	388	222018906	36455476	1.747599	4.910099
2004	157	435	223106424	36803823	1.949742	4.265861
2005	181	452	224193942	37152170	2.016112	4.871855
2006	212	460	225281460	37500517	2.041890	5.653255
2007	219	449	226368978	37848864	1.983487	5.786171
2008	213	442	227456495	38197211	1.943229	5.576323
2009	249	478	228544013	38545558	2.091501	6.459888
2010	219	506	229397472	38874625	2.205778	5.633495
2011	290	577	230838975	39189528	2.499578	7.399936
2012	302	632	231992377	39623138	2.724227	7.621809
2013	310	693	232969901	39919371	2.974633	7.765653
2014	264	704	233963128	40379066	3.009021	6.538041
2015	272	729	234940100	40695277	3.102919	6.683822
2016	269	723	234644039	40893369	3.081263	6.578084
2017	265	743	235507457	41393491	3.154889	6.401973
2018	265	775	236173020	41617764	3.281493	6.367473

The two rightmost columns now contain per million victim counts for both races. Time to visualize that:

plt = df_fenc_agg.loc[:, ['White_promln', 'Black_promln']].plot(xticks=df_fenc_agg.index, color=['g', 'olive'])
plt.set_xticklabels(df_fenc_agg.index, rotation='vertical')
plt.set_xlabel('')
plt.set_ylabel('Number of police victims\nper 1 mln. within race')
plt

We'll also display the basic stats for this data by running:

df_fenc_agg.loc[:, ['White_promln', 'Black_promln']].describe()

	White_promln	Black_promln
count	19.000000	19.000000
mean	2.336123	5.872145
std	0.615133	1.133677
min	1.330247	4.179559
25%	1.946485	4.890977
50%	2.091501	5.786171
75%	2.991827	6.558062
max	3.281493	7.765653

Intermediate conclusions:

Lethal force results on average in 5.9 per one million Black deaths and 2.3 per one million White deaths (Black victim count is 2.6 greater in unit values).

Data deviation (scatter) for Blacks is 1.8 higher than for Whites — you can see that the green curve representing White victims is considerably smoother.

Black victims peaked in 2013 at 7.7 per million; White victims peaked in 2018 at 3.3 per million.

White victims grow continuously from year to year (by 0.1 — 0.2 per million on average), while Black victims rolled back to their 2009 level after a climax in 2011 — 2013.

Thus, we can answer our first question:

— Can one say the police kill Blacks more frequently than Whites?
— Yes, it is a correct inference. Blacks are 2.6 times more likely to meet death by the hands of law enforcement agencies than Whites.

Bearing in mind this inference, let's go ahead and look at the crime data to see if (and how) they are related to lethal force fatalities and races.

Crime Data

Let's load our crime CSV:

CRIMES_FILE = ROOT_FOLDER + '\\culprits_victims.csv'
df_crimes = pd.read_csv(CRIMES_FILE, sep=';', header=0, 
                        index_col=0, usecols=['Year', 'Offense', 'Offender/Victim', 'White', 
                                       'White pro capita', 'Black', 'Black pro capita'])

Again, as before, we're using only the relevant fields: year, offense type, offender / victim classifier and offense counts for each race (absolute — 'White', 'Black' and per capita — 'White pro capita', 'Black pro capita').

Let's look what we have here (with df_crimes.head()):

	Offense	Offender/Victim	Black	White	Black pro capita	White pro capita
Year
1991	All Offenses	Offender	490	598	1.518188e-05	2.861673e-06
1991	All Offenses	Offender	4	4	1.239337e-07	1.914160e-08
1991	All Offenses	Offender	508	122	1.573958e-05	5.838195e-07
1991	All Offenses	Offender	155	176	4.802432e-06	8.422314e-07
1991	All Offenses	Offender	13	19	4.027846e-07	9.092270e-08

We won't need data on offense victims so far, so get rid of them:

# leave only offenders
df_crimes1 = df_crimes.loc[df_crimes['Offender/Victim'] == 'Offender']
# leave only 2000 - 2018 data years and remove redundant columns
df_crimes1 = df_crimes1.loc[2000:2018, ['Offense', 'White', 'White pro capita', 'Black', 'Black pro capita']]

Here's the resulting dataset (1295 rows * 5 columns):

	Offense	White	White pro capita	Black	Black pro capita
Year
2000	All Offenses	679	0.000003	651	0.000018
2000	All Offenses	11458	0.000052	30199	0.000853
2000	All Offenses	4439	0.000020	3188	0.000090
2000	All Offenses	10481	0.000048	5153	0.000146
2000	All Offenses	746	0.000003	63	0.000002
...	...	...	...	...	...
2018	Larceny Theft Offenses	1961	0.000008	1669	0.000040
2018	Larceny Theft Offenses	48616	0.000206	30048	0.000722
2018	Drugs Narcotic Offenses	555974	0.002354	223398	0.005368
2018	Drugs Narcotic Offenses	305052	0.001292	63785	0.001533
2018	Weapon Law Violation	70034	0.000297	58353	0.001402

Now we need to convert the per capita (per 1 person) values to per million values (in keeping with the unit data we use throughout the research). Just multiply the per capita columns by one million:

df_crimes1['White_promln'] = df_crimes1['White pro capita'] * 1e6
df_crimes1['Black_promln'] = df_crimes1['Black pro capita'] * 1e6

To see the whole picture — how crimes committed by Whites and Blacks are distributed across the offense types, let's aggregate the absolute crime counts by years:

df_crimes_agg = df_crimes1.groupby(['Offense']).sum().loc[:, ['White', 'Black']]

	White	Black
Offense
All Offenses	44594795	22323144
Assault Offenses	12475830	7462272
Drugs Narcotic Offenses	9624596	3453140
Larceny Theft Offenses	9563917	4202235
Murder And Nonnegligent Manslaughter	28913	39617
Sex Offenses	833088	319366
Weapon Law Violation	829485	678861

Or in a graph:

plt = df_crimes_agg.plot.barh(color=['g', 'olive'])
plt.set_ylabel('')
plt.set_xlabel('Number of offenses (sum for 2000-2018)')

We can observe here that:

drug offenses, assaults and 'All Offenses' dominate over the other offense types (murder, weapon law violations and sex offenses)
in absolute figures, Whites commit more crimes than Blacks (exactly twice as much for the 'All Offenses' category)

Again we realize that no robust conclusions can be made about 'race criminality' without population data. So we're looking at per capita (per million) values:

df_crimes_agg1 = df_crimes1.groupby(['Offense']).sum().loc[:, ['White_promln', 'Black_promln']]

	White_promln	Black_promln
Offense
All Offenses	194522.307758	574905.952459
Assault Offenses	54513.398833	192454.602875
Drugs Narcotic Offenses	41845.758869	88575.523095
Larceny Theft Offenses	41697.303725	108189.184125
Murder And Nonnegligent Manslaughter	125.943007	1016.403706
Sex Offenses	3633.777035	8225.144985
Weapon Law Violation	3612.671402	17389.163849

Or as a graph:

plt = df_crimes_agg1.plot.barh(color=['g', 'olive'])
plt.set_ylabel('')
plt.set_xlabel('Number of offenses (sum for 2000-2018) per 1 mln. within race')

We've got quite a different picture this time. Blacks commit more crimes for each analyzed category than Whites, approaching a triple difference for 'All Offenses'.

We will now leave only the 'All Offenses' category as the most representative of the 7 and sum up the rows by years (since the source data may feature several entries per year, matching the number of reporting agencies).

# leave only 'All Offenses' category
df_crimes1 = df_crimes1.loc[df_crimes1['Offense'] == 'All Offenses']
# could also have left assault and murder (try as experiment!)
#df_crimes1 = df_crimes1.loc[df_crimes1['Offense'].str.contains('Assault|Murder')]

# drop absolute columns and aggregate data by years
df_crimes1 = df_crimes1.groupby(level=0).sum().loc[:, ['White_promln', 'Black_promln']]

The resulting dataset:

	White_promln	Black_promln
Year
2000	6115.058976	17697.409882
2001	6829.701429	20431.707645
2002	7282.333249	20972.838329
2003	7857.691182	22218.966500
2004	8826.576863	26308.815799
2005	9713.826255	30616.569637
2006	10252.894313	33189.382429
2007	10566.527362	34100.495064
2008	10580.520024	34052.276749
2009	10889.263592	33954.651792
2010	10977.017218	33884.236826
2011	11035.346176	32946.454471
2012	11562.836825	33150.706035
2013	11211.113491	32207.571607
2014	11227.354594	31517.346141
2015	11564.786088	31764.865490
2016	12193.026562	33186.064958
2017	12656.261666	34900.390499
2018	13180.171893	37805.202605

Let's see how it looks on a plot:

plt = df_crimes1.plot(xticks=df_crimes1.index, color=['g', 'olive'])
plt.set_xticklabels(df_fenc_agg.index, rotation='vertical')
plt.set_xlabel('')
plt.set_ylabel('Number of offenses\nper 1 mln. within race')

Intermediate conclusions:

Whites commit twice as many offenses as Blacks in absolute numbers, but three times as fewer in per capita numbers (per 1 million population within that race).

Criminality among Whites grows more or less steadily over the entire period of investigation (doubled over 19 years). Criminality among Blacks also grows, but by leaps and starts, showing steep growth from 2001 to 2006, then abating slightly over 2007 — 2016 and plummeting again after 2017. Over the entire period, however, the growth factor is also 2, like with Whites.

But for the period of decrease in 2007 — 2016, criminality among Blacks grows at a higher rate than that among Whites.

We can therefore answer our second question:

— Which race is statistically more prone to crime?
— Crimes committed by Blacks are three times more frequent than crimes committed by Whites.

Criminality and Lethal Force Fatalities

We've now come to the most important part. Let's see if we can answer the third question: Can one say the police kills in proportion to the number of crimes?

The question boils down to looking at the correlation between our two datasets — use of force data (from the FENC database) and crime data (from the FBI database).

We start by routinely merging the two datasets into one:

# glue together the FENC and CRIMES dataframes
df_uof_crimes = df_fenc_agg.join(df_crimes1, lsuffix='_uof', rsuffix='_cr')
# we won't need the first 2 columns (absolute FENC values), so get rid of them
df_uof_crimes = df_uof_crimes.loc[:, 'White_pop':'Black_promln_cr']

The resulting combined data:

	White_pop	Black_pop	White_promln_uof	Black_promln_uof	White_promln_cr	Black_promln_cr
Year
2000	218756353	35410436	1.330247	4.179559	6115.058976	17697.409882
2001	219843871	35758783	1.605685	4.418495	6829.701429	20431.707645
2002	220931389	36107130	1.643044	4.458953	7282.333249	20972.838329
2003	222018906	36455476	1.747599	4.910099	7857.691182	22218.966500
2004	223106424	36803823	1.949742	4.265861	8826.576863	26308.815799
2005	224193942	37152170	2.016112	4.871855	9713.826255	30616.569637
2006	225281460	37500517	2.041890	5.653255	10252.894313	33189.382429
2007	226368978	37848864	1.983487	5.786171	10566.527362	34100.495064
2008	227456495	38197211	1.943229	5.576323	10580.520024	34052.276749
2009	228544013	38545558	2.091501	6.459888	10889.263592	33954.651792
2010	229397472	38874625	2.205778	5.633495	10977.017218	33884.236826
2011	230838975	39189528	2.499578	7.399936	11035.346176	32946.454471
2012	231992377	39623138	2.724227	7.621809	11562.836825	33150.706035
2013	232969901	39919371	2.974633	7.765653	11211.113491	32207.571607
2014	233963128	40379066	3.009021	6.538041	11227.354594	31517.346141
2015	234940100	40695277	3.102919	6.683822	11564.786088	31764.865490
2016	234644039	40893369	3.081263	6.578084	12193.026562	33186.064958
2017	235507457	41393491	3.154889	6.401973	12656.261666	34900.390499
2018	236173020	41617764	3.281493	6.367473	13180.171893	37805.202605

Let me refresh you memory on the individual columns here:

White_pop — White population
Black_pop — Black population
White_promln_uof — White lethal force victims per 1 million Whites
Black_promln_uof — Black lethal force victims per 1 million Blacks
White_promln_cr — Number of crimes committed by Whites per 1 million Whites
Black_promln_cr — Number of crimes committed by Blacks per 1 million Blacks

We next want to see how the police victim and crime curves compare on one plot. For Whites:

plt = df_uof_crimes['White_promln_cr'].plot(xticks=df_uof_crimes.index, legend=True)
plt.set_ylabel('Number of White offenses per 1 mln. within race')
plt2 = df_uof_crimes['White_promln_uof'].plot(xticks=df_uof_crimes.index, legend=True, secondary_y=True, style='g')
plt2.set_ylabel('Number of White UOF victims per 1 mln. within race', rotation=90)
plt2.set_xlabel('')
plt.set_xlabel('')
plt.set_xticklabels(df_uof_crimes.index, rotation='vertical')

The same on a scatter plot:

plt = df_uof_crimes.plot.scatter(x='White_promln_cr', y='White_promln_uof')
plt.set_xlabel('Number of White offenses per 1 mln. within race')
plt.set_ylabel('Number of White UOF victims per 1 mln. within race')

A quick look at the graphs shows that some correlation is present. OK, now for Blacks:

plt = df_uof_crimes['Black_promln_cr'].plot(xticks=df_uof_crimes.index, legend=True)
plt.set_ylabel('Number of Black offenses per 1 mln. within race')
plt2 = df_uof_crimes['Black_promln_uof'].plot(xticks=df_uof_crimes.index, legend=True, secondary_y=True, style='g')
plt2.set_ylabel('Number of Black UOF victims per 1 mln. within race', rotation=90)
plt2.set_xlabel('')
plt.set_xlabel('')
plt.set_xticklabels(df_uof_crimes.index, rotation='vertical')

And on a scatter plot:

plt = df_uof_crimes.plot.scatter(x='Black_promln_cr', y='Black_promln_uof')
plt.set_xlabel('Number of Black offenses per 1 mln. within race')
plt.set_ylabel('Number of Black UOF victims per 1 mln. within race')

Things are much worse here: the two trends duck and bob a lot, though the principle correlation is still visible, the proportion is positive, if non-linear.

We will make use of statistical methods to quantify these correlations, making correlation matrices estimated with the Pearson correlation coefficient:

df_corr = df_uof_crimes.loc[:, ['White_promln_cr', 'White_promln_uof', 
                         'Black_promln_cr', 'Black_promln_uof']].corr(method='pearson')
df_corr.style.background_gradient(cmap='PuBu')

We get this table:

	White_promln_cr	White_promln_uof	Black_promln_cr	Black_promln_uof
White_promln_cr	1.000000	0.885470	0.949909	0.802529
White_promln_uof	0.885470	1.000000	0.710052	0.795486
Black_promln_cr	0.949909	0.710052	1.000000	0.722170
Black_promln_uof	0.802529	0.795486	0.722170	1.000000

The correlation coefficients for both races are in bold: it is 0.885 for Whites and 0.722 for Blacks. Thus a positive correlation between lethal force victims and criminality is observed for both races, but it is more prominent for Whites (probably significant) and nears non-significant for Blacks. The latter result is, of course, due to the higher data heterogeneity (scatter) for Black crimes and police victims.

As a final step, let's try to estimate the probability of Black and White offenders to get shot by the police. We have no direct ways to do that, since we don't have information on the criminality of the lethal force victims (who of them was found to be an offender and who was judicially clear). So we can only take the easy path and divide the per capita victim counts by the per capita crime counts for each race and multiply by 100 to show percentage values.

# let's look at the aggregate data (with individual year observations collapsed)
df_uof_crimes_agg = df_uof_crimes.loc[:, ['White_promln_cr', 'White_promln_uof', 
                 'Black_promln_cr', 'Black_promln_uof']].agg(['mean', 'sum', 'min', 'max'])
# now calculate the percentage of fatal encounters from the total crime count in each race
df_uof_crimes_agg['White_uof_cr'] = df_uof_crimes_agg['White_promln_uof'] * 100. / 
                 df_uof_crimes_agg['White_promln_cr']
df_uof_crimes_agg['Black_uof_cr'] = df_uof_crimes_agg['Black_promln_uof'] * 100. / 
                 df_uof_crimes_agg['Black_promln_cr']

We get this table:

	White_promln_cr	White_promln_uof	Black_promln_cr	Black_promln_uof	White_uof_cr	Black_uof_cr
mean	10238.016198	2.336123	30258.208024	5.872145	0.022818	0.019407
sum	194522.307758	44.386338	574905.952459	111.570747	0.022818	0.019407
min	6115.058976	1.330247	17697.409882	4.179559	0.021754	0.023617
max	13180.171893	3.281493	37805.202605	7.765653	0.024897	0.020541

Let's show the means (in bold above) as a bar chart:

plt = df_uof_crimes_agg.loc['mean', ['White_uof_cr', 'Black_uof_cr']].plot.bar(color=['g', 'olive'])
plt.set_ylabel('Ratio of UOF victims to offense count')
plt.set_xticklabels(['White', 'Black'], rotation=0)

Looking at this chart, you can see that the probability of a White offender to be shot dead by the police is somewhat higher than that of a Black offender. This estimate is certainly quite tentative, but it can give at least some idea.

Intermediate conclusions:

Fatal encounters with law enforcement are connected with criminality (number of offenses committed). The correlation though differs between the two races: for Whites, it is almost perfect, for Blacks — far from perfect.

Looking at the combined police victim / crime charts, it becomes obvious that lethal force victims grow 'in reply to' criminality growth, generally with a few years' lag (this is more conspicuous in the Black data). This phenomenon chimes in with the reasonable notion that the authorities 'react' on criminality (more crimes > more impunity > more closeups with law enforcement > more lethal outcomes).

White offenders tend to meet death from the police more frequently than Black offenders, although the difference is almost negligible.

Finally, the answer to our third question:

— Can one say the police kills in proportion to the number of crimes?
— Yes, this proportion can be observed, though different between the two races: for Whites, it is almost perfect, for Blacks — far from perfect.

In the next (and final) part of the narrative, we will look into the geographical distribution of the analyzed data across the states.

Hubs:

If this publication inspired you and you want to support the author, do not hesitate to click on the button