8th November, 2023 – gautammarathe

Continuing from last time, where we looked at the different types of clustering we can use to analyse the geographical data: K-means was not particularly helpful, because we have to decide the number of clusters ourselves, and we would need some prior knowledge of the data to judge whether that choice is working properly.

Instead, I tried another approach: DBSCAN, a clustering method that works on the basis of density. It groups points by how close they are to each other locally rather than globally, and it does not need the number of clusters to be specified in advance.
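To illustrate the density idea on something smaller than the shootings data, here is a minimal sketch on made-up points. The coordinates and the eps and min_samples values are all assumptions chosen for the illustration, not values from the analysis. Two tight groups of points and a few isolated ones are passed to scikit-learn's DBSCAN without telling it how many clusters to expect.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Two dense groups of 50 points each (synthetic data, for illustration only)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.05, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.05, size=(50, 2))

# Five isolated points, far from both groups and from each other
isolated = np.array([[10.0, -10.0], [-10.0, 10.0], [20.0, 20.0],
                     [-20.0, -20.0], [15.0, 0.0]])

points = np.vstack([blob_a, blob_b, isolated])

# eps is the neighbourhood radius; min_samples is how many points
# (the point itself included) must fall inside it to count as dense
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)

n_clusters = len(set(labels) - {-1})   # label -1 marks noise, not a cluster
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```

The number of clusters comes out of the density structure itself, and the isolated points are labelled as noise instead of being forced into a group, which is the main practical difference from K-means.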

So we apply DBSCAN to check whether it produces results different from those of K-means, or whether our estimate of 4 clusters from K-means is correct. To verify this, we plotted the DBSCAN clusters for the state of California.

DBSCAN results of clusters in California

This shows that the clusters from our K-means run do not match the density-based clusters that DBSCAN now finds. This further cements that we need some sort of underlying verification before trusting a chosen number of clusters.
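One possible way to put a number on how far the two methods disagree, which was not part of this analysis, is the adjusted Rand index (ARI) between the two sets of labels. The sketch below uses synthetic points, and the helper name clustering_agreement and all parameter values are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score


def clustering_agreement(points, n_kmeans_clusters, eps, min_samples):
    """Return (ARI, number of DBSCAN clusters) for K-means vs DBSCAN (hypothetical helper)."""
    kmeans_labels = KMeans(n_clusters=n_kmeans_clusters, n_init=10,
                           random_state=0).fit_predict(points)
    dbscan_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    # DBSCAN noise (label -1) is treated by the score as one more group
    ari = adjusted_rand_score(kmeans_labels, dbscan_labels)
    n_dbscan_clusters = len(set(dbscan_labels) - {-1})
    return ari, n_dbscan_clusters


# Synthetic example: four tight, well-separated groups (not the shootings data)
rng = np.random.default_rng(0)
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
points = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 2)) for c in centres])

ari_k4, n_db = clustering_agreement(points, n_kmeans_clusters=4, eps=1.0, min_samples=5)
ari_k2, _ = clustering_agreement(points, n_kmeans_clusters=2, eps=1.0, min_samples=5)
print(f"DBSCAN clusters: {n_db}, ARI with k=4: {ari_k4:.2f}, ARI with k=2: {ari_k2:.2f}")
```

An ARI of 1 means the two partitions are identical, and values near 0 mean they agree no more than chance would. A low score only says that the two methods disagree, not which one is right, so it complements rather than replaces outside knowledge of the data. The script that produced the California plot follows.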

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
import cartopy.crs as ccrs
import cartopy.feature as cfeature

# Specify the full path to your Excel file using a raw string
excel_file_path = r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx'

# Read data from Excel file
df = pd.read_excel(excel_file_path)

# Filter shootings in the state of California and remove rows with missing latitudes or longitudes.
# .copy() gives an independent DataFrame, so adding the 'cluster' column below
# does not trigger pandas' SettingWithCopyWarning.
df_ca = df[(df['state'] == 'CA') & df['latitude'].notna() & df['longitude'].notna()].copy()

# Extract latitude and longitude columns
coordinates = df_ca[['latitude', 'longitude']]

# Perform DBSCAN clustering (eps is in degrees, since the inputs are raw latitude/longitude)
dbscan = DBSCAN(eps=0.2, min_samples=5)  # Adjust eps and min_samples based on your data
df_ca['cluster'] = dbscan.fit_predict(coordinates)

# Create a map using Cartopy
fig, ax = plt.subplots(subplot_kw={'projection': ccrs.PlateCarree()}, figsize=(12, 9))

# Map extent as [lon_min, lon_max, lat_min, lat_max]; this box is wider than California itself
ax.set_extent([-130, -113, 20, 50])

# Plotting the clustered coordinates
for cluster in df_ca['cluster'].unique():
    if cluster != -1:  # Skip noise points (cluster = -1)
        cluster_data = df_ca[df_ca['cluster'] == cluster]
        latitudes = cluster_data['latitude'].tolist()
        longitudes = cluster_data['longitude'].tolist()
        ax.scatter(longitudes, latitudes, label=f'Cluster {cluster}', s=20)

# Add map features
ax.coastlines(resolution='10m', color='black', linewidth=1)
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.legend()

# Draw state lines
ax.add_feature(cfeature.STATES, linestyle='-', edgecolor='black')

# Show the plot
plt.title('DBSCAN Clustering of Fatal Police Shootings in California')
plt.show()
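One thing to keep in mind about the script above: it passes raw latitude and longitude to DBSCAN, so eps=0.2 is measured in degrees, and a degree of longitude covers less ground than a degree of latitude at California's latitudes. A possible variation, not what the script above does, is to use scikit-learn's haversine metric so that eps can be given in kilometres. The helper name cluster_by_distance_km, the 20 km radius and the sample coordinates below are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius


def cluster_by_distance_km(latitudes, longitudes, eps_km, min_samples):
    """Run DBSCAN on lat/lon points with eps given in kilometres (hypothetical helper)."""
    # The haversine metric expects [latitude, longitude] pairs in radians
    coords_rad = np.radians(np.column_stack([latitudes, longitudes]))
    model = DBSCAN(
        eps=eps_km / EARTH_RADIUS_KM,  # convert kilometres to an angle in radians
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(coords_rad)


# Illustrative points: six near Los Angeles, six near San Francisco, one far from both
la = [(34.05 + 0.01 * i, -118.25) for i in range(6)]
sf = [(37.77 + 0.01 * i, -122.42) for i in range(6)]
lone = [(40.0, -120.0)]
lats, lons = zip(*(la + sf + lone))

labels = cluster_by_distance_km(lats, lons, eps_km=20, min_samples=5)
print(labels)
```

With the real data, the same helper could be called with df_ca['latitude'] and df_ca['longitude'], using an eps_km chosen to suit the data.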
