30th October, 2023

Continuing from last time, my main idea was to see whether we could find any correlation between the density of shootings and population density by plotting the two together.

This would make it easier to see whether more shootings have been taking place in more populated areas, and whether the frequency of shootings is essentially a function of the local population.

Coding this has presented some challenges, because I haven't been able to get hold of the population density data I'm looking for in a usable form.

A lot of popular Python libraries contain county-level population data, but I can't plot county-level data when looking at the state level. Much of the population data online, meanwhile, gives city names rather than coordinates, which has been another challenge.

Figuring these two problems out, and coding the lookups required to match the data, is not working as of right now, but I am working on fixing it because I find it an interesting direction to be going in. A rough sketch of the kind of lookup I have in mind is below.
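
As a starting point, the sketch below joins the shootings data to a city-level population table by (city, state). The population file name and its column names are placeholders, since I haven't settled on a source yet; the idea is only to show the key normalisation needed for the lookup to match.

import pandas as pd

# Shootings data (has 'city' and 'state' columns) and a hypothetical
# city-level population table -- 'city_populations.csv' and its columns
# are placeholders for whatever census export ends up being used.
shootings = pd.read_excel(r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx')
populations = pd.read_csv('city_populations.csv')  # assumed columns: city, state, population

def normalise(names):
    # Lower-case and strip punctuation so that e.g. "St. Louis" matches "st louis"
    return names.str.lower().str.replace(r'[^a-z ]', '', regex=True).str.strip()

shootings['city_key'] = normalise(shootings['city'])
populations['city_key'] = normalise(populations['city'])
# Note: the shootings data uses two-letter state codes, so full state names
# in the population table would need converting (e.g. with the us library).

merged = shootings.merge(populations[['city_key', 'state', 'population']],
                         on=['city_key', 'state'], how='left')
print(merged['population'].isna().mean())  # fraction of records that failed to match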

Furthermore, I would also like to plot police station coordinates, to see whether there is any pattern in how far the shootings are from police stations.

 

The code below is not yet fully working, but it's a reference point for later work.

import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from sklearn.cluster import KMeans
import us
# Specify the full path to your Excel file using a raw string
excel_file_path = r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx'
# Read data from Excel file
df = pd.read_excel(excel_file_path)
# Extract latitude and longitude columns
latitude_column = 'latitude'  # Replace with your actual column name
longitude_column = 'longitude'  # Replace with your actual column name
# Drop rows with missing coordinates so K-means does not fail on NaNs
df = df.dropna(subset=[latitude_column, longitude_column])
latitudes = df[latitude_column].tolist()
longitudes = df[longitude_column].tolist()
# Perform K-means clustering on the coordinates
kmeans = KMeans(n_clusters=5, random_state=42)
df['cluster'] = kmeans.fit_predict(df[[latitude_column, longitude_column]])
# Create a map of the USA using Cartopy
fig, ax = plt.subplots(subplot_kw={'projection': ccrs.PlateCarree()}, figsize=(12, 9))
ax.set_extent([-125, -66, 24, 49])  # USA bounding box
# Plotting the coordinates with cluster colors
scatter = ax.scatter(df[longitude_column], df[latitude_column], s=10, c=df['cluster'], cmap='viridis', marker='o', alpha=0.7, edgecolor='k', transform=ccrs.Geodetic())
# Add colorbar
cbar = plt.colorbar(scatter, ax=ax, orientation='vertical', fraction=0.03, pad=0.05)
cbar.set_label('Cluster')
# Add map features
ax.coastlines(resolution='10m', color='black', linewidth=1)
ax.add_feature(cfeature.BORDERS, linestyle=':')
# Get and plot capital cities using the us library
# NOTE: not working yet -- us.states.lookup() looks up states rather than capital
# cities, and the us library does not carry capital coordinates, so this block
# needs a separate coordinates table before it can run.
# for state in us.STATES:
#     capital = us.states.lookup(state.capital)
#     ax.text(capital.longitude, capital.latitude, state.capital, transform=ccrs.PlateCarree(), fontsize=8, ha='right', va='bottom', color='blue')
# Draw state lines
ax.add_feature(cfeature.STATES, linestyle='-', edgecolor='black')
# Show the plot
plt.title('K-Means Clustering on the Map of the USA with State Capitals Highlighted')
plt.show()

27th October, 2023

Deciding to go in a different direction, I tried my hand at seeing what the geographical data could tell me, as I had exhausted my other avenues working on the historical data.

To do this I had to plot the data to see what we could find. Initially I need a map of the USA to plot against, otherwise plotting the points is not going to make sense.

import pandas as pd
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
# Specify the full path to your Excel file using a raw string
excel_file_path = r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx'
# Read data from Excel file
df = pd.read_excel(excel_file_path)
# Create a map of the USA using Cartopy
fig, ax = plt.subplots(subplot_kw={'projection': ccrs.PlateCarree()}, figsize=(12, 9))
ax.set_extent([-125, -66, 24, 49])  # USA bounding box
# Extract latitude and longitude columns
latitude_column = 'latitude'  # Replace with your actual column name
longitude_column = 'longitude'  # Replace with your actual column name
latitudes = df[latitude_column].tolist()
longitudes = df[longitude_column].tolist()
# Plotting the coordinates
ax.scatter(longitudes, latitudes, s=10, c='red', marker='o', alpha=0.7, edgecolor='k', transform=ccrs.Geodetic())
# Add map features
ax.coastlines(resolution='10m', color='black', linewidth=1)
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.add_feature(cfeature.STATES, linestyle='-', edgecolor='black')  # Draw state lines
# Show the plot
plt.title('Coordinates Plot on the Map of the USA')
plt.show()
plt.show()
This gives a good idea of how the points look on the map of the USA, and I have added state lines to get a better sense of how the points are spread across the country.
Shootings Plotted Against USA Map

It's clear from the initial plots that there are not a lot of shootings in the Midwest, where not a lot of people live. Most shootings seem to be concentrated around the more populated areas on the coasts.

23rd October, 2023

Continuing on from last time, I was running Monte Carlo simulations comparing the ages of white people shot with the ages of black people shot.

We saw that the Monte Carlo simulation showed that, over a large number of trials, the average age of black people shot came out roughly 7 years younger than that of white people in over 50% of the resamples.

We ran a similar Monte Carlo simulation on the Hispanic ages and found interesting results there as well. We had already got similar results from an analysis of variance and a t-test on the Hispanic ages: there is a statistically significant difference between the mean ages of Hispanic people shot and white people shot.

Plotting the means from the many Monte Carlo simulations as a frequency distribution, we find that, on average, a Hispanic person shot is 6.5 years younger than a white person shot.

Thus the data also suggest that, from the third quartile up to 95% of the simulations, the mean difference comes out at around 6.5 years younger for Hispanic people.
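
As a side note, once the simulated differences are collected, the percentile claims above can be read straight off the array. A minimal sketch, where the random array is only a stand-in for the stored white-minus-Hispanic differences in mean age:

import numpy as np

def summarise_differences(diffs):
    # Summarise the simulated differences in mean age from the Monte Carlo runs
    diffs = np.asarray(diffs)
    return {
        'mean difference': diffs.mean(),
        'quartiles': np.percentile(diffs, [25, 50, 75]).tolist(),
        '95th percentile': np.percentile(diffs, 95),
        'share of runs > 0': (diffs > 0).mean(),
    }

# Stand-in array only -- in practice this would be the stored differences
print(summarise_differences(np.random.normal(loc=6.5, scale=1.0, size=10_000)))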

 

20th October, 2023

Finally, I managed to fix the part of my code that was not letting me plot histograms for my Monte Carlo simulation on the ages of black and white people from the Washington Post shooting data.

So, having finally run the code, I can say with more confidence that the difference in means for the age groups was not just down to chance, as we had already seen in the t-tests; the Monte Carlo simulation takes this one step further and makes it more concrete.

Frequency Distribution of Difference in Means

From the above graph we can clearly tell that, over a large number of randomized simulations on our data using resampling and computing the difference in means for the two groups, we still find that the average age of black people in the data is roughly 7 years younger than that of white people.

Thus, after combining the results of the pairwise t-tests, Tukey's method, the analysis of variance, and the Monte Carlo simulations, we can say that the observed difference in mean ages between the black and white people who were shot is not occurring by chance.

 

18th October, 2023

Hi, so continuing from last time: I was having some difficulties trying to code a Monte Carlo simulation for the test between black ages and white ages. This was done because running multiple t-tests is not advised; it only increases the error.

The problem with running the Monte Carlo simulation so far had been that it was tricky to keep the test and control groups, as well as the sampling, under control, and this was leading to a lot of errors while running the code.

The errors were occurring mostly because of issues with grouping and scaling when plotting the histogram. While I continue to work on the plotting, I will give a brief overview of what my code is trying to achieve.

It starts by taking three data sets; we will be using them two at a time for the most part, keeping the white ages, or the 'wage' group, as the control group for our hypothesis testing. While this is the case for most tests, we are only doing it this way because running a lot of pairwise t-tests is not advised due to the accumulated error.

The solution to that is to take our control group and test group, draw a fixed number of samples from each group with replacement, and thus add randomness to the test. We use these samples to calculate the difference in the means of the two groups and store it.

This is done in a for loop over a huge number of iterations, so that we have enough data to plot a histogram of the frequency of the statistic, that is, the difference in means between the two data sets. The point of the test is to determine whether, over a large number of completely random simulations, we see a recurring pattern, i.e. whether there really is a difference in the means, and whether the Monte Carlo simulation reproduces it. A sketch of this loop is given at the end of this entry.

This would be a good marker for something more substantial, and visually easier to grasp, when it comes to deciding whether something is occurring by complete chance or something is causing it. For now, we would just like to test, and hopefully reject, the null hypothesis that the difference in means for the age groups is down to chance.
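
For reference, here is a minimal sketch of the resampling loop described above. The column names ('age', 'race') and race codes ('W', 'B') follow the Washington Post data, while the sample size and iteration count are placeholder choices rather than the exact settings used.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel(r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx')
white_ages = df.loc[df['race'] == 'W', 'age'].dropna().to_numpy()
black_ages = df.loc[df['race'] == 'B', 'age'].dropna().to_numpy()

rng = np.random.default_rng(42)
n_iterations = 10_000   # number of simulated trials
sample_size = 500       # fixed number of draws per group, per trial
diffs = np.empty(n_iterations)

for i in range(n_iterations):
    # Draw with replacement from each group and store the difference in means
    w_sample = rng.choice(white_ages, size=sample_size, replace=True)
    b_sample = rng.choice(black_ages, size=sample_size, replace=True)
    diffs[i] = w_sample.mean() - b_sample.mean()

# Frequency distribution of the simulated differences in means
plt.hist(diffs, bins=50, edgecolor='k')
plt.xlabel('Difference in mean age (white - black)')
plt.ylabel('Frequency')
plt.title('Monte Carlo distribution of the difference in means')
plt.show()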

16th October, 2023

Last time I was looking at t-tests to see whether the differences observed in the frequency distributions of ages for different races were purely down to chance or due to some other factor we could not yet determine.

The t-test values for the ages of Black, Hispanic, and White people show significant differences in means that are not occurring by chance, so we reject the null hypothesis.

The next thing we can do is look at the analysis of variance to determine if there is a significant difference between the different races.

Analysis of Variance, One way method

Here the p-value tells us that there is a significant difference across the race factor (5 degrees of freedom).

To further analyse which pairs contribute most to the differences, I tried Tukey's method in R to run a pairwise analysis across all the races and check which combinations show the most significant differences.

Anova pairwise for every race combination

Here we can see that the p-values for W-N, W-B, and W-H are the most significant, and all the rest just indicate chance. W-A also shows some degree of significance, but not as much as the first three. A rough Python equivalent of this analysis is sketched below.
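
The ANOVA and the Tukey comparison above were run in R; below is a rough Python equivalent for reference, assuming the same 'age' and 'race' columns from the Washington Post data.

import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_excel(r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx')
df = df.dropna(subset=['age', 'race'])

# One-way ANOVA across the race groups
groups = [group['age'].to_numpy() for _, group in df.groupby('race')]
f_stat, p_value = f_oneway(*groups)
print(f'F = {f_stat:.2f}, p = {p_value:.3g}')

# Tukey HSD pairwise comparisons between the races
tukey = pairwise_tukeyhsd(endog=df['age'], groups=df['race'], alpha=0.05)
print(tukey.summary())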

More analysis later.

13th October, 2023

Continuing from where we were last time: comparing age data for the different races and finding a discrepancy between the means for Black, White, and Hispanic people.

The case was that, on average, Black victims were found to be 7 years younger than White victims in the data. We wanted to confirm whether this was down to chance or whether there is an actual contributing factor.

To do this we can run t-tests and check whether the p-values are significant enough for us to reject the null hypothesis that there is no difference in the true means of the samples.

T test for ages of Blacks and Whites

Here we can see that, across the 95% confidence interval, black victims are between roughly 9 and 6.5 years younger on average.

The t-values are high and the p-values are very low, so we can say this is not just down to chance; that would be highly improbable.

T test for hispanic ages vs white ages

As we can see here, the two groups differ in average age, and the p-values indicate this is highly unlikely to have happened by chance; the t-values corroborate this as well.
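
For reference, a minimal sketch of these two-sample t-tests in Python (shown here as Welch's test, which does not assume equal variances; the exact variant behind the screenshots above is not recorded). The column names and race codes follow the Washington Post data.

import pandas as pd
from scipy.stats import ttest_ind

df = pd.read_excel(r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx')
white = df.loc[df['race'] == 'W', 'age'].dropna()

# Compare white ages against black and Hispanic ages in turn
for label, code in [('black', 'B'), ('hispanic', 'H')]:
    other = df.loc[df['race'] == code, 'age'].dropna()
    t_stat, p_value = ttest_ind(white, other, equal_var=False)  # Welch's t-test
    print(f'white vs {label}: mean difference = {white.mean() - other.mean():.1f} years, '
          f't = {t_stat:.2f}, p = {p_value:.3g}')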

 

11th October, 2023

Starting work on the police shooting data, we don't really have an easy way in, so we initially just look at a few key parameters of interest and see whether we can work our way through them and find any patterns or insight.

So, starting with some descriptive statistics for the analysis, we found that the average and median ages of the victims were roughly 37 and 35 respectively.

It also showed a standard deviation of 13 years. Looking at the distribution of the ages we found something like this.

Distribution for all ages

As we can see from this, there seems to be a right skewed distribution for the ages of the victims.
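
For reference, a minimal sketch of how these descriptive statistics and age histograms can be produced with pandas, assuming the same Excel file and 'age'/'race' columns; the per-race plots discussed below follow the same pattern.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel(r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx')

# Descriptive statistics for age: count, mean, std, quartiles (the 50% row is the median)
print(df['age'].describe())

# Overall age distribution
df['age'].dropna().plot.hist(bins=30, edgecolor='k', title='Distribution for all ages')
plt.xlabel('Age')
plt.show()

# One histogram for each race category in the data
for race, group in df.dropna(subset=['age', 'race']).groupby('race'):
    group['age'].plot.hist(bins=30, edgecolor='k', title=f'Age distribution: {race}')
    plt.xlabel('Age')
    plt.show()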

Looking further, we can check for any differences in the distributions of ages for victims of different races.

So trying the same experiment for some of the races separately, we found this.

Looking at the distribution of ages for Asians, there does not seem to be any visible pattern.

Age distribution for Asians

However, when looking at the distribution of ages for African Americans who were shot, we can see that, compared to the distribution with all races present, the distribution of African American ages is even more right skewed. This could be a sign that, on average, younger people of this race are being killed.

Distribution for African American ages
Distribution for Hispanic Ages

Above, for comparison, we also have the distribution of ages for all the Hispanic victims. As we can see, the graph for this race does not seem to be as skewed as the one for African Americans, possibly indicating that, relative to other races, the Hispanic victims are on average somewhat older.

The distributions for Native Americans and for those with their race categorized as other do not show any meaningful patterns when visualized.

Looking at the age distributions for white people who were shot,

Age distribution for white people

We can see that this graph is the most similar in spread to those for Hispanic and African American victims; however, it is not as skewed as either of them. In fact, it looks the least skewed of the three, indicating a stronger central tendency in the ages.

Just from the descriptive stats of these different slices of age and race, we can tell that, on average, younger Hispanic and Black people are being shot compared to white people.

Descriptive Statistics

I will later be looking into more tools to better quantify this discrepancy in the ages of victims across different races.

Also, most importantly, we need to account for the fact that not all of the variables have data for every row, so the data can be somewhat inconsistent.

 

6th October, 2023

This will be more of a summarizing article because most of our work on the project was completed before this.

We had decided, based on what we set out to do, that the question we chose to answer with this dataset was whether there was any way to predict diabetes, given that we had data for the corresponding inactivity and obesity levels.

Based on this, we looked for the counties that had data for all three parameters and then tried fitting simple linear regression models, which we found inadequate for predicting diabetes. There was also heteroskedasticity in the data when looking at the spread of residuals, and we ran a Breusch-Pagan test to confirm this.

On further inspection it was concluded that a quadratic fit was indeed better for predicting diabetes, and was further improved by using an interaction term.

These various fits, along with our final one, were tested using cross-validation with K=10 to obtain their respective test errors. Even through this we could conclude that our latest quadratic model with an interaction term was the best fit; however, with the limited data, only some 354 data points, we could only reach a correlation of about 0.42. A sketch of this comparison is below.
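
A minimal sketch of that comparison, a plain linear fit against a quadratic fit with the interaction term, scored with 10-fold cross-validation. The CSV file name and the column names ('obesity', 'inactivity', 'diabetes') are placeholders for the merged county-level data, not the actual names in our files.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

data = pd.read_csv('cdc_county_data.csv').dropna(subset=['obesity', 'inactivity', 'diabetes'])
X = data[['obesity', 'inactivity']]
y = data['diabetes']

models = {
    'linear': make_pipeline(LinearRegression()),
    # degree=2 adds the squared terms and the obesity x inactivity interaction term
    'quadratic + interaction': make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

for name, model in models.items():
    # 10-fold cross-validated mean squared error (sklearn reports it negated)
    mse = -cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error').mean()
    print(f'{name}: CV test MSE = {mse:.3f}')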

This is as far as we got and all we could manage. If there were more data, perhaps there would be a larger overarching trend that could be fitted more easily with relatively less complexity.

4th October, 2023

Just a little final exploration before we go ahead with the final model and the best correlation value we had obtained.

I just wanted to test whether it was possible to get a better fit by increasing the complexity of the model. Trying this for diabetes and obesity, I checked the p-values to see whether any of the higher-order terms were significant.

As seen here, the P values suggest that a quadratic fit would be best

Thus we can see here that all of the terms except the power of 2 have p-values of little significance and can be rejected; they are all clearly too high, with the quadratic term's p-value being the lowest. A sketch of this check is below.
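
A minimal sketch of this significance check: fit diabetes against increasing powers of obesity and read off each term's p-value. The file and column names are placeholders for the merged county data.

import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv('cdc_county_data.csv').dropna(subset=['obesity', 'diabetes'])

# Fit diabetes against obesity raised to increasing powers
model = smf.ols('diabetes ~ obesity + I(obesity**2) + I(obesity**3) + I(obesity**4)',
                data=data).fit()
print(model.pvalues)   # the quadratic term should show the lowest p-value
print(model.summary())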

This all agrees with our study so far and with the latest model we have built using quadratic terms for obesity and inactivity.

2nd October, 2023

Continuing in a slightly less than ideal fashion: I had done a lot of painstaking research on the data, performing various different tests, but my system crashed, so I will try to replicate what I can remember.

I had reached a bit of a dead end with the data, wondering why I was really trying to model diabetes on the basis of inactivity and obesity.

In a way, I had set out with a goal in mind and had spent all my time looking for it, when in fact it may not necessarily even exist.

So I was comparing how the fits change when the models are built from a different set of predictors, and when a different value is the one being predicted.

In doing these tests, I found that the multiple linear model using diabetes and obesity to predict inactivity was giving the highest R-squared I had encountered so far, 0.42. It was actually 0.39 for the plain linear model, already much higher than any previous linear model I had tried, and it jumps to 0.42 with the addition of the interaction term. A sketch of this model is below.
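
A minimal sketch of that reversed model, with and without the interaction term, using the statsmodels formula API. The CSV file name and column names are placeholders for the merged county-level data.

import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv('cdc_county_data.csv').dropna(subset=['obesity', 'inactivity', 'diabetes'])

# Additive model vs the same model with the diabetes:obesity interaction term
additive = smf.ols('inactivity ~ diabetes + obesity', data=data).fit()
interaction = smf.ols('inactivity ~ diabetes * obesity', data=data).fit()

print('R-squared without interaction:', round(additive.rsquared, 3))    # ~0.39 reported above
print('R-squared with interaction:   ', round(interaction.rsquared, 3))  # ~0.42 reported above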

This was particularly interesting because transformations did not improve the correlation but instead worsened it in nearly all cases where the model was made more complex, such as with log or polynomial terms.

Summary for model of inactivity

However, when I compared test errors for this using K-fold cross-validation, this model produced an error much higher than the 0.3 seen with the log models.