3rd November, 2023 – gautammarathe

Since I could not finalize the data for plotting the cities and states with dense populations, I worked in a different direction. The idea is that once that data is matched, I can plot the cities, look at heat maps of population density, and check whether the clusters of shootings sit close to the high-population areas.

To look at clustering, I started with K-means and picked California as the example, the same as in class, because it is one of the few states where the data is fairly isolated from other states and there are enough points to form legible clusters.
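
For anyone who wants to see what K-means is doing under the hood, here is a minimal sketch of the basic loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). This is only an illustration on made-up points. It is not how scikit-learn implements `KMeans`, which by default uses k-means++ initialization, and the function name `kmeans` and its parameters are my own for this example.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm on 2-D points.

    points    -- list of (x, y) tuples
    centroids -- list of starting (x, y) centroids (their count sets K)
    Returns (labels, centroids).
    """
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids

# Made-up points forming two obvious groups (not real data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels, centroids)
```

The result depends on the starting centroids, which is why library implementations usually run several initializations and keep the best one.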

K-means clustering with K=4 for the California data

As we can see from the plot, the clustering alone does not tell us much. We can't say whether these clusters are what we need until we have some other data to compare them against, which would make them more useful than clusters on their own.

This is where either population density data or police station data can come in.
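
Separately from that outside comparison, a quick internal check on the choice of K is also possible. The sketch below is only an assumption about how one might do it: it uses scikit-learn's `silhouette_score` on synthetic points (not the shootings data) to compare several values of K. The helper name `silhouette_by_k` is made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for (latitude, longitude) pairs: three tight blobs.
# These are NOT the shootings data, only made-up points for illustration.
rng = np.random.default_rng(0)
centers = np.array([[33.0, -117.0], [37.0, -122.0], [40.0, -117.5]])
points = np.vstack([c + rng.normal(scale=0.1, size=(50, 2)) for c in centers])

def silhouette_by_k(data, k_values):
    """Fit K-means for each K and return {K: mean silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=42)
        labels = model.fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return scores

scores = silhouette_by_k(points, range(2, 7))
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print("Best K by silhouette:", best_k)
```

A high silhouette score only says the points separate cleanly into that many groups. It does not say the groups mean anything, so it complements the population or police station comparison and does not replace it.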

# K-means clustering of fatal police shootings in California (K=4)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import cartopy.crs as ccrs
import cartopy.feature as cfeature

# Specify the full path to your Excel file using a raw string
excel_file_path = r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx'

# Read data from Excel file
df = pd.read_excel(excel_file_path)

# Filter shootings in the state of California and remove rows with missing
# latitudes or longitudes. The .copy() avoids a SettingWithCopyWarning when
# the cluster column is added below.
df_ca = df[(df['state'] == 'CA') & df['latitude'].notna() & df['longitude'].notna()].copy()

# Extract latitude and longitude columns
coordinates = df_ca[['latitude', 'longitude']]

# Perform K-means clustering with K = 4
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
df_ca['cluster'] = kmeans.fit_predict(coordinates)

# Create a map of California using Cartopy
fig, ax = plt.subplots(subplot_kw={'projection': ccrs.PlateCarree()}, figsize=(12, 9))

# California bounding box (the state runs from about 32.5N to 42N,
# so the northern limit has to be above 37 or the north is cut off)
ax.set_extent([-125, -113, 32, 42.5], crs=ccrs.PlateCarree())

# Plot the clustered coordinates, one colour per cluster
for cluster in range(4):
    cluster_data = df_ca[df_ca['cluster'] == cluster]
    ax.scatter(cluster_data['longitude'], cluster_data['latitude'],
               label=f'Cluster {cluster + 1}', s=20,
               transform=ccrs.PlateCarree())

# Add map features: coastline, country borders and state lines
ax.coastlines(resolution='10m', color='black', linewidth=1)
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.add_feature(cfeature.STATES, linestyle='-', edgecolor='black')
ax.legend()

# Show the plot
plt.title('K-means Clustering of Fatal Police Shootings in California (K=4)')
plt.show()
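
As a possible next step for the comparison mentioned earlier, here is a rough sketch of how each cluster centre could be matched to its nearest reference point using the haversine (great-circle) distance. Everything in it is an assumption for illustration: the helper names `haversine_km` and `nearest_reference` are my own, the reference points are approximate city-centre coordinates standing in for real population or police station data, and the example centroids are placeholders, not results from the script above.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Reference points: approximate city-centre coordinates, standing in for
# whatever population or police-station dataset ends up being used.
reference_points = {
    "Los Angeles": (34.05, -118.24),
    "San Francisco": (37.77, -122.42),
    "San Diego": (32.72, -117.16),
    "Sacramento": (38.58, -121.49),
}

def nearest_reference(lat, lon, references):
    """Return (name, distance_km) of the reference point closest to (lat, lon)."""
    name = min(references, key=lambda n: haversine_km(lat, lon, *references[n]))
    return name, haversine_km(lat, lon, *references[name])

# Placeholder centroids as (latitude, longitude). In the script above these
# would come from kmeans.cluster_centers_. The values here are made up.
example_centroids = [(34.0, -118.0), (37.5, -122.0)]
for lat, lon in example_centroids:
    name, dist = nearest_reference(lat, lon, reference_points)
    print(f"({lat}, {lon}) -> nearest: {name}, {dist:.0f} km")
```

If the cluster centres consistently land close to the densest cities, that would be a first hint that the clusters mostly follow population rather than anything specific to the shootings.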
