Continuing from last time, where we looked at the different types of clustering we can use to analyse the geographical data: we used K-means, which was not particularly helpful because we have to choose the number of clusters ourselves, and that requires some prior knowledge of the data to judge whether the result is reasonable.
Instead, I tried DBSCAN, a clustering method based on density: it groups points by how close they are to their local neighbours rather than by a global partition of the data.
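To make the density idea concrete, here is a minimal pure-Python sketch of how DBSCAN grows clusters. It is only an illustration, not scikit-learn's implementation: the function names `region_query` and `dbscan` and the toy points are made up for this example, and it uses plain Euclidean distance with a brute-force neighbour search.

```python
from math import dist

NOISE = -1  # same convention as scikit-learn: label -1 means "noise"


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Toy DBSCAN: returns one label per point, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # not dense enough; may still become a border point
            continue
        # points[i] is a core point, so it starts a new cluster
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point, keep expanding
    return labels


# Toy data: two tight groups and one isolated point
points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),
          (5, 5), (5, 5.1), (5.1, 5), (5.1, 5.1),
          (10, 0)]
labels = dbscan(points, eps=0.2, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Notice that the number of clusters is never passed in. It falls out of `eps` and `min_samples`, which is the main difference from K-means.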
So we applied DBSCAN to check whether it produces results different from K-means, or whether our estimate of 4 clusters from K-means was correct. To verify this, we plotted the DBSCAN clusters for the state of California.
The plot shows that the density-based clusters DBSCAN finds are not the same as the clusters K-means gave us. This further confirms that we need some sort of independent verification of the clusters.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
import cartopy.crs as ccrs
import cartopy.feature as cfeature
# Specify the full path to your Excel file using a raw string
excel_file_path = r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx'
# Read data from Excel file
df = pd.read_excel(excel_file_path)
# Filter shootings in the state of California and remove rows with missing latitudes or longitudes
# (.copy() so that adding the cluster column later does not trigger SettingWithCopyWarning)
df_ca = df[(df['state'] == 'CA') & (df['latitude'].notna()) & (df['longitude'].notna())].copy()
# Extract latitude and longitude columns
coordinates = df_ca[['latitude', 'longitude']]
# Perform DBSCAN clustering
dbscan = DBSCAN(eps=0.2, min_samples=5) # Adjust eps and min_samples based on your data
df_ca['cluster'] = dbscan.fit_predict(coordinates)
# Create a map of California using Cartopy
fig, ax = plt.subplots(subplot_kw={'projection': ccrs.PlateCarree()}, figsize=(12, 9))
ax.set_extent([-130, -113, 20, 50]) # [lon_min, lon_max, lat_min, lat_max]; a generous box around California
# Plotting the clustered coordinates
for cluster in df_ca['cluster'].unique():
    if cluster != -1: # Skip noise points (cluster = -1)
        cluster_data = df_ca[df_ca['cluster'] == cluster]
        latitudes = cluster_data['latitude'].tolist()
        longitudes = cluster_data['longitude'].tolist()
        ax.scatter(longitudes, latitudes, label=f'Cluster {cluster}', s=20)
# Add map features
ax.coastlines(resolution='10m', color='black', linewidth=1)
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.legend()
# Draw state lines
ax.add_feature(cfeature.STATES, linestyle='-', edgecolor='black')
# Show the plot
plt.title('DBSCAN Clustering of Fatal Police Shootings in California')
plt.show()
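One thing worth keeping in mind when adjusting eps: the code above passes raw latitude and longitude to DBSCAN, so eps=0.2 is measured in degrees, not kilometres. The sketch below uses the standard haversine formula to estimate what 0.2 degrees means on the ground. The helper name `haversine_km` and the reference point (latitude 37, longitude -120, roughly central California) are assumptions made for illustration only.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


eps_deg = 0.2            # the eps used in the DBSCAN call above
lat, lon = 37.0, -120.0  # assumed reference point, roughly central California

# Distance covered by moving eps_deg north versus eps_deg east from that point
north_south_km = haversine_km(lat, lon, lat + eps_deg, lon)
east_west_km = haversine_km(lat, lon, lat, lon + eps_deg)

print(f"{eps_deg} degrees north-south is about {north_south_km:.1f} km")
print(f"{eps_deg} degrees east-west is about {east_west_km:.1f} km")
```

At this latitude, 0.2 degrees comes out at roughly 22 km north to south but only about 18 km east to west, so the neighbourhood DBSCAN searches is not a true circle on the ground. If that matters, scikit-learn's DBSCAN can be given metric='haversine' with coordinates converted to radians and eps expressed as kilometres divided by the Earth's radius. I have not tried that on this data, so treat it as a possible next step rather than a result.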