29th November, 2023

Continuing from last time's analysis, where we introduced one extra variable into the model, we now check how it affected the fit by analysing the model's diagnostic plots.

We look at the residuals vs fitted plot first to check for heteroskedasticity, which shows up as a ballooning of the residuals towards the higher fitted values.

Residuals vs Fitted

In the graph below we see the QQ plot of the residuals, and we do not see anything largely out of the ordinary here, unlike the heteroskedastic pattern in the residuals vs fitted graph.

QQ Plots

We see more of the same heteroskedastic pattern when the standardized residuals are plotted against the fitted values of the model.

Standardized Residuals vs Fitted

Finally, we take a look at the residuals vs leverage graph to check whether any outlier values are exerting enough influence to distort the final model, but we do not see any such behaviour on a large scale in this graph.
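For reference, here is a minimal Python sketch that reproduces these four diagnostic panels. It assumes the model was fit with statsmodels OLS; the helper name diagnostic_plots is mine, not part of any library.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

def diagnostic_plots(model):
    # model is assumed to be a fitted statsmodels OLS results object,
    # e.g. model = sm.OLS(y, sm.add_constant(X)).fit()
    fitted = model.fittedvalues
    resid = model.resid
    influence = model.get_influence()
    std_resid = influence.resid_studentized_internal
    leverage = influence.hat_matrix_diag

    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # 1. Residuals vs fitted: a widening funnel suggests heteroskedasticity
    axes[0, 0].scatter(fitted, resid, s=15)
    axes[0, 0].axhline(0, color='grey', linestyle='--')
    axes[0, 0].set(title='Residuals vs Fitted', xlabel='Fitted values', ylabel='Residuals')

    # 2. QQ plot of standardized residuals: points far off the line indicate non-normality
    sm.qqplot(std_resid, line='45', ax=axes[0, 1])
    axes[0, 1].set_title('Normal Q-Q')

    # 3. Scale-location: sqrt(|standardized residuals|) against fitted values
    axes[1, 0].scatter(fitted, np.sqrt(np.abs(std_resid)), s=15)
    axes[1, 0].set(title='Scale-Location', xlabel='Fitted values',
                   ylabel='sqrt(|standardized residuals|)')

    # 4. Residuals vs leverage: high-leverage points with large residuals are influential
    axes[1, 1].scatter(leverage, std_resid, s=15)
    axes[1, 1].set(title='Residuals vs Leverage', xlabel='Leverage',
                   ylabel='Standardized residuals')

    plt.tight_layout()
    plt.show()

Calling diagnostic_plots(model) on the fitted model should give roughly the same four panels discussed above.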

27th November, 2023

Previously we analysed the various plots we had obtained from relating hotel occupancy rates to hotel average daily rates, and looked at how the two could be used together to predict overall trends.

We saw that the linear model did not predict a great deal accurately, but it did reasonably well given how simple and inflexible it is. Given the other data we have available, we now try to extend the model by adding more variables.

To predict the hotel average daily rate we now use hotel occupancy, international flights, and total passengers at Logan Airport.

We expect these additions to relate well to hotel rates and occupancy numbers, since most travellers passing through the airport should be staying in hotels.

Linear Model Including Passengers

From the R squared value we can tell that the fit improved, as it is nearly 0.8 now, indicating a stronger relationship.

The p-value for the newly introduced term is also low enough to indicate that the term is significant.
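As a rough sketch of how this extended model could be fit in Python with statsmodels. The file name economic-indicators.xlsx and the column names avg_daily_rate, hotel_occup_rate, logan_intl_flights, and logan_passengers are placeholders of mine, not necessarily the exact headers in the spreadsheet.

import pandas as pd
import statsmodels.formula.api as smf

# Placeholder path and column names for the Boston economic indicators data
df = pd.read_excel(r'economic-indicators.xlsx')

# Hotel average daily rate modelled on occupancy plus the Logan Airport variables
model = smf.ols(
    'avg_daily_rate ~ hotel_occup_rate + logan_intl_flights + logan_passengers',
    data=df
).fit()

print(model.rsquared)   # R squared of the extended fit
print(model.pvalues)    # p-value of each term; small values suggest significance
print(model.summary())  # full table with coefficients and adjusted R squared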

24th November, 2023

Diving further into last time's analysis of the linear model we created to fit hotel rates given the available occupancy data: we found high R squared and adjusted R squared values, which indicate a strong relationship, although that does not mean one variable is causing the other.

We investigate this further by looking at the various diagnostic plots and what they actually tell us about the errors and outliers our model can or cannot handle. Since it is a linear model it is not terribly flexible, but in exchange it keeps the variance low and avoids overfitting.

Looking at the plots for our data, no major problems stand out right away.

Heteroskedasticity Test

Our residuals vs fitted plot does show some ballooning in the spread of the residuals as the fitted values increase, indicating that there may be some heteroskedasticity.
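Beyond eyeballing the plot, a formal check such as the Breusch-Pagan test could back this up. A minimal sketch, assuming model is the fitted statsmodels OLS result; the helper name breusch_pagan is mine.

from statsmodels.stats.diagnostic import het_breuschpagan

def breusch_pagan(model):
    # Regresses the squared residuals on the model's predictors;
    # small p-values are evidence of heteroskedasticity.
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
    return lm_pvalue, f_pvalue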

QQ Plots

Our QQ plot does not show any skewness; the residuals appear mostly normally distributed, which is fine for us for now.

Scale vs Location

The scale-location plot is essentially a standardized version of the residuals vs fitted plot, and it does not look any better than the first one; it still shows heteroskedasticity.

Leverage Plots

From the leverage plot we can tell whether there are any outliers in our data and how much pull they exert on the fit. Here it shows that there are not many outliers with high leverage.
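To put numbers on this, Cook's distance can be computed for each observation. A sketch assuming model is the fitted statsmodels result; the 4/n cutoff is just one common rule of thumb, and the helper name is mine.

import numpy as np

def influential_points(model):
    # Indices of observations whose Cook's distance exceeds the 4/n rule of thumb
    cooks_d, _ = model.get_influence().cooks_distance
    return np.where(cooks_d > 4 / len(cooks_d))[0]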

22nd November, 2023

Continuing from last time, where we further analysed the linear model for unemployment and total jobs, we could satisfactorily conclude that the data suited the linear model fit we were using and did not have to be changed, given the high R squared value.

We now move to a different pair of variables, looking at how the hotel average daily rates relate to the hotel occupancy rates we have been following, and how the two trend together.

So we fit a linear model to this pair and analyse the fit it gives us.

Linear Model for Hotel Rates Given Occupancy
Occupancy vs Rates

Clearly a linear model can more or less account for the overall trend in average rates as a function of occupancy.
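A short sketch of how this fit and the occupancy-vs-rates plot could be produced. The file name and the column names hotel_occup_rate and avg_daily_rate are placeholders of mine.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Placeholder path for the Boston economic indicators data
df = pd.read_excel(r'economic-indicators.xlsx')

# Keep only the two columns of interest and drop missing rows
pair = df[['hotel_occup_rate', 'avg_daily_rate']].dropna()

# Simple linear model: average daily rate as a function of occupancy
model = smf.ols('avg_daily_rate ~ hotel_occup_rate', data=pair).fit()
print(model.summary())

# Scatter of the raw data with the fitted line overlaid
pair = pair.sort_values('hotel_occup_rate')
plt.scatter(pair['hotel_occup_rate'], pair['avg_daily_rate'], s=15, label='Observed')
plt.plot(pair['hotel_occup_rate'], model.predict(pair), color='red', label='Linear fit')
plt.xlabel('Hotel occupancy rate')
plt.ylabel('Average daily rate')
plt.legend()
plt.show()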

20th November, 2023

In our previous analysis we looked at the international passenger data and how it could be combined with the total number of flights to find a model that fits the data we have and lets us predict the quantities we are interested in.

We found the fit to be reasonably good and linear in nature, with a very high R squared value indicating a strong relationship between the two that is unlikely to be due to chance.

Hence we chose to dig a little deeper, running a few more diagnostic checks on the same linear model to make sure we were not missing anything.

QQ Plots

 

Residuals vs Fitted
Scale vs Location for the Linear Model

18th November, 2023

For another variable analysis we looked at whether there was any relationship between the total number of passengers through Logan Airport and the number of international flights that month, and whether that in turn might affect hotel occupancies and other related data.

So we start by fitting an initial linear model that simply relates the total passengers to the number of flights.

Linear Model of Flights vs Passengers

Clearly the data shows a strong relationship between the two, with a very high R squared value, and the low p-value indicates that such an outcome would be extremely unlikely to arise by chance.

Looking at the plot of the model against the data, we can further verify that the model is a reasonably good fit for what we are trying to examine.

Linear Model Plotted for Flights

As we can see from the plot, the relationship is more or less linear in nature, so the model works for our current analysis.
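As a cross-check on the R squared reading, the same simple fit could be repeated with scikit-learn. This is only a sketch: the file name and the column names logan_intl_flights and logan_passengers are placeholders of mine.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Placeholder path and column names for the Boston economic indicators data
df = pd.read_excel(r'economic-indicators.xlsx')
pair = df[['logan_intl_flights', 'logan_passengers']].dropna()

X = pair[['logan_intl_flights']]   # predictor: number of international flights
y = pair['logan_passengers']       # response: total passengers through Logan

reg = LinearRegression().fit(X, y)
print('slope:', reg.coef_[0], 'intercept:', reg.intercept_)
print('R squared:', r2_score(y, reg.predict(X)))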

15th November, 2023

Moving on from last time, where we simply looked at what kind of data we could consider and the drawbacks of working with such a small data set of only about 200 rows.

Still working with this data set, we chose the unemployment numbers and total job numbers to see if we could find a model that fits them reasonably well without the pitfalls affecting the rest of the variables.

We started by fitting a few simple linear models to get an idea of how the variables are related and what they look like.

Fit of Unemployment and Jobs

Clearly the R squared value here shows that the two are strongly related, and the plots show the same.

Plot of Linear Model of Unemployment
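Since this is a single-predictor model, the R squared is just the square of the correlation between the two columns, which offers a quick sanity check. The file name and the column names unemployment_rate and total_jobs are placeholders of mine.

import pandas as pd

# Placeholder path and column names for the Boston economic indicators data
df = pd.read_excel(r'economic-indicators.xlsx')
pair = df[['unemployment_rate', 'total_jobs']].dropna()

r = pair['unemployment_rate'].corr(pair['total_jobs'])
print('correlation:', r)
print('implied R squared for the simple linear model:', r ** 2)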

 

13th November, 2023

To start the analysis process for Project 3, we are looking at public Boston data: a set of monthly economic indicators for the city of Boston collected from 2013 to 2020.

We initially wanted to get a sense of what was feasible here, because it was not going to be practical to examine nearly 8 different variables at once and study the kind of input each one provides.

We also wanted to keep interactions to a minimum to start with, which basically means exploring each parameter individually. Month by month, the data gives us information on a variety of indicators for the city of Boston that reflect its economic and social health.

After looking at the entire data set we determined that there are not many usable data points; it is a relatively small data set with only about 200 entries, and nearly half of them are empty for quite a few of the variables described in the Excel file.

Hence it would not make a lot of sense to fit statistical learning methods on this data and try to predict anything from those sparse columns. For now we will just look at the descriptive statistics of each column to see if there is anything worthwhile.
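A small sketch of that first pass over the spreadsheet; the file name economic-indicators.xlsx and the sheet layout are assumptions on my part.

import pandas as pd

# Load the Boston economic indicators workbook (placeholder file name)
df = pd.read_excel(r'economic-indicators.xlsx')

print(df.shape)            # roughly 200 rows of monthly observations
print(df.isna().sum())     # how many entries are empty in each column
print(df.describe().T)     # descriptive statistics for every numeric column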

Plotting the distributions for much of this data would not show a great deal, except that values cluster around certain time periods, since most of the data is indexed by time.

8th November, 2023

Continuing from last time, where we looked at the different types of clustering we can use for the geographical data: we used K-means, which is not particularly helpful on its own since we have to decide the number of clusters ourselves, and we would need some prior information to determine whether it is working properly.

Instead of this approach I tried another one, DBSCAN, which is a clustering method that works on the basis of density, i.e. how close points are to each other locally rather than globally.

So we apply DBSCAN to check whether it produces results different from K-means, or whether our K-means estimate of 4 clusters is correct. To verify this we plotted the DBSCAN results for the state of California.

DBSCAN results of clusters in California

This shows that our K-means clustering was not finding the same clusters, in terms of density, that DBSCAN now shows. This further cements the need for some external data to verify the clusters against.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
import geopandas as gpd
import cartopy.crs as ccrs
import cartopy.feature as cfeature

# Specify the full path to your Excel file using a raw string
excel_file_path = r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx'

# Read data from Excel file
df = pd.read_excel(excel_file_path)

# Filter shootings in the state of California and remove rows with missing latitudes or longitudes
df_ca = df[(df['state'] == 'CA') & (df['latitude'].notna()) & (df['longitude'].notna())].copy()

# Extract latitude and longitude columns
coordinates = df_ca[['latitude', 'longitude']]

# Perform DBSCAN clustering
dbscan = DBSCAN(eps=0.2, min_samples=5)  # Adjust eps and min_samples based on your data
df_ca['cluster'] = dbscan.fit_predict(coordinates)

# Create a map of California using Cartopy
fig, ax = plt.subplots(subplot_kw={'projection': ccrs.PlateCarree()}, figsize=(12, 9))
ax.set_extent([-130, -113, 20, 50])  # Map extent roughly covering California

# Plot the clustered coordinates
for cluster in df_ca['cluster'].unique():
    if cluster != -1:  # Skip noise points (cluster = -1)
        cluster_data = df_ca[df_ca['cluster'] == cluster]
        latitudes = cluster_data['latitude'].tolist()
        longitudes = cluster_data['longitude'].tolist()
        ax.scatter(longitudes, latitudes, label=f'Cluster {cluster}', s=20)

# Add map features
ax.coastlines(resolution='10m', color='black', linewidth=1)
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.legend()

# Draw state lines
ax.add_feature(cfeature.STATES, linestyle='-', edgecolor='black')

# Show the plot
plt.title('DBSCAN Clustering of Fatal Police Shootings in California')
plt.show()
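One caveat is that eps=0.2 above was picked by hand. A common way to ground that choice is a k-distance plot, sketched below; it reuses the coordinates frame from the block above, and the "elbow" in the sorted distances is a candidate value for eps.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# Distance to the 5th neighbour for every point (matching min_samples=5);
# kneighbors on the training set returns the point itself first, so take the last column
nn = NearestNeighbors(n_neighbors=5).fit(coordinates)
distances, _ = nn.kneighbors(coordinates)
k_distances = np.sort(distances[:, -1])

plt.plot(k_distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('Distance to 5th nearest neighbour')
plt.title('k-distance plot for choosing eps')
plt.show()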

6th November, 2023

Continuing from last time, we try the same comparison but with different numbers of clusters for K-means.

The point of this exercise is more or less to compare how the different clusterings look, but it is not strictly useful unless we have some other data to compare them against.

K = 4, K-means for California

K = 3, K-means clustering for California

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import geopandas as gpd
import cartopy.crs as ccrs
import cartopy.feature as cfeature

# Specify the full path to your Excel file using a raw string
excel_file_path = r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx'

# Read data from Excel file
df = pd.read_excel(excel_file_path)

# Filter shootings in the state of California and remove rows with missing latitudes or longitudes
df_ca = df[(df['state'] == 'CA') & (df['latitude'].notna()) & (df['longitude'].notna())].copy()

# Extract latitude and longitude columns
coordinates = df_ca[['latitude', 'longitude']]

# Perform K-means clustering (set k = 4 to reproduce the other map above)
k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
df_ca['cluster'] = kmeans.fit_predict(coordinates)

# Create a map of California using Cartopy
fig, ax = plt.subplots(subplot_kw={'projection': ccrs.PlateCarree()}, figsize=(12, 9))
ax.set_extent([-130, -106, 20, 50])  # Map extent roughly covering California

# Plot the clustered coordinates
for cluster in range(k):
    cluster_data = df_ca[df_ca['cluster'] == cluster]
    latitudes = cluster_data['latitude'].tolist()
    longitudes = cluster_data['longitude'].tolist()
    ax.scatter(longitudes, latitudes, label=f'Cluster {cluster + 1}', s=20)

# Add map features
ax.coastlines(resolution='10m', color='black', linewidth=1)
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.legend()

# Draw state lines
ax.add_feature(cfeature.STATES, linestyle='-', edgecolor='black')

# Show the plot
plt.title(f'K-means Clustering of Fatal Police Shootings in California (K={k})')
plt.show()
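To give the choice between K = 3 and K = 4 a bit more backing, the within-cluster inertia can be plotted against K, the usual elbow heuristic. This sketch reuses the coordinates frame from the block above.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Within-cluster sum of squares for a range of K values
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(coordinates)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow plot for choosing K')
plt.show()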

3rd November, 2023

Since my data for plotting the densely populated cities and states could not be finalized yet, I worked in a different direction: once my data is matched, I want to plot the cities, look at heat maps of population density, and use that to see whether the clusters and high-population areas show any relation or proximity.

To start on the clustering we went with K-means, and I picked California as the example, as in class, because it is one of the few states where the data is fairly isolated from other states and there are enough points to form legible clusters.

K-means clustering with K=4 for the data of California

As we can see, the clustering alone does not tell us much, and we cannot tell whether this clustering is even what we need unless we have some other data to compare it to, which would make it more useful than just clusters on their own.

This is where either population density data or police station data can be input.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import geopandas as gpd
import cartopy.crs as ccrs
import cartopy.feature as cfeature

# Specify the full path to your Excel file using a raw string
excel_file_path = r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx'

# Read data from Excel file
df = pd.read_excel(excel_file_path)

# Filter shootings in the state of California and remove rows with missing latitudes or longitudes
df_ca = df[(df['state'] == 'CA') & (df['latitude'].notna()) & (df['longitude'].notna())].copy()

# Extract latitude and longitude columns
coordinates = df_ca[['latitude', 'longitude']]

# Perform K-means clustering with K = 4
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df_ca['cluster'] = kmeans.fit_predict(coordinates)

# Create a map of California using Cartopy
fig, ax = plt.subplots(subplot_kw={'projection': ccrs.PlateCarree()}, figsize=(12, 9))
ax.set_extent([-125, -113, 32, 37])  # Map extent around California

# Plot the clustered coordinates
for cluster in range(4):
    cluster_data = df_ca[df_ca['cluster'] == cluster]
    latitudes = cluster_data['latitude'].tolist()
    longitudes = cluster_data['longitude'].tolist()
    ax.scatter(longitudes, latitudes, label=f'Cluster {cluster + 1}', s=20)

# Add map features
ax.coastlines(resolution='10m', color='black', linewidth=1)
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.legend()

# Draw state lines
ax.add_feature(cfeature.STATES, linestyle='-', edgecolor='black')

# Show the plot
plt.title('K-means Clustering of Fatal Police Shootings in California (K=4)')
plt.show()

1st November, 2023

Continuing from last time, I am trying to see whether we can correlate our geographical data with how close the incidents are to densely populated urban areas.

This work involved a lot of matching data from different sources to make sure they are all aligned, so the geographical data can be plotted together.

This is a big hurdle at the moment because the data sources and Python libraries I have found for US state-level data do not have the granularity needed to show population density. There is a way to map it at the county level, but that would require county boundary coordinates, which are very difficult to find.

So for now I am trying to show density by creating my own data set of cities with populations greater than 1 million, finding their geographical coordinates manually, and mapping them to the city names.

This will help me process the data efficiently and be able to overlay the densely populated cities over the data of the shootings.

I wish I could show the output of the overlay, but the data-cleaning code still has some minor issues that I am fixing, because names do not always match 1:1 when creating a new data set and trying to join it with an existing one.

This is essentially a large lookup process that needs to complete properly before the plotting functions can run without errors.

import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from geopy.geocoders import Nominatim

# Specify the full path to your Excel file using a raw string
excel_file_path = r'C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx'

# Read data from Excel file
df = pd.read_excel(excel_file_path)

# Create a map of the USA using Cartopy
fig, ax = plt.subplots(subplot_kw={'projection': ccrs.PlateCarree()}, figsize=(12, 9))
ax.set_extent([-125, -66, 24, 49])  # USA bounding box

capitals = ["Washington, D.C.", "Montgomery", "Juneau", "Phoenix", "Little Rock", "Sacramento", "Denver", "Hartford", "Dover", "Tallahassee", "Atlanta", "Honolulu", "Boise", "Springfield", "Indianapolis", "Des Moines", "Topeka", "Frankfort", "Baton Rouge", "Augusta", "Annapolis", "Boston", "Lansing", "St. Paul", "Jackson", "Jefferson City", "Helena", "Lincoln", "Carson City", "Concord", "Trenton", "Santa Fe", "Albany", "Raleigh", "Bismarck", "Columbus", "Oklahoma City", "Salem", "Harrisburg", "Providence", "Columbia", "Pierre", "Nashville", "Austin", "Salt Lake City", "Montpelier", "Richmond", "Olympia", "Charleston", "Madison", "Cheyenne"]

# Geocode each capital (note: bare city names can be ambiguous, so results should be checked)
geolocator = Nominatim(user_agent="my_geocoder")
coordinates = []
for capital in capitals:
    location = geolocator.geocode(capital)
    if location:
        coordinates.append((capital, location.latitude, location.longitude))
    else:
        coordinates.append((capital, None, None))

# Print the results
for data in coordinates:
    print(data)

# Extract latitude and longitude columns
latitude_column = 'latitude'  # Replace with your actual column name
longitude_column = 'longitude'  # Replace with your actual column name
latitudes = df[latitude_column].tolist()
longitudes = df[longitude_column].tolist()

# Plotting the coordinates
ax.scatter(longitudes, latitudes, s=10, c='red', marker='o', alpha=0.7, edgecolor='k', transform=ccrs.Geodetic())

# Add map features
ax.coastlines(resolution='10m', color='black', linewidth=1)
ax.add_feature(cfeature.BORDERS, linestyle=':')

# Label the capital cities using the geocoded coordinates collected above
for capital, lat, lon in coordinates:
    if lat is not None:
        ax.text(lon, lat, capital, transform=ccrs.PlateCarree(), fontsize=8, ha='right', va='bottom', color='blue')

# Draw state lines
ax.add_feature(cfeature.STATES, linestyle='-', edgecolor='black')

# Show the plot
plt.title('Coordinates Plot on the Map of the USA with State Capitals Highlighted')
plt.show()