gautammarathe

After having split the data across the different levels of granularity by grouping the data by the time index and using different counts and to plot the frequency by time and check for seasonality in data.

We plot the time series data for different granularities and we see what we can see from the visual trends.

Clearly the difference in granularity can be seen the details of the graphs for the differences in the weekly and monthly graphs.

We can also check for stationarity by using the ADF test and we can use the P value from the test to confirm stationarity.

After Performing the ADF tests on the data we can tell very clearly that they are not stationary and we can check on how the different components of the monthly view compare when decomposed into it’s different components.

Decomposition of the Time Series for Monthly

December 6, 2023December 9, 2023

6th December, 2023

So looking at the time series data we had to start the analysis in a way that would enable us to look at trends. Since the data that we have is present in the a granular level, we cannot use it directly. We need to apply some level of transformations for the data to be accessible for us initially.

This can be done in the form of using the existing time data and we can use it to a perform a group by on the data level of the required granularity that we need and hence we can check for trends for the respective granularity level wether that is daily, weekly or monthly.

We start of by doing a monthly, 6monthly, and yearly, split. We also use grouped by day just to check out the levelof granularity plotting that is evident at the daily stage.

We used the following code below to make sure we accurately split our plotting across the diffferent levels of granularity to plot the time series model.

import pandas as pd

import os

# Set the directory where your CSV files are located

directory_path = r”C:\Users\91766\Desktop\stats3″

# Get a list of all files in the directory starting with “tpm” and ending with “.csv”

csv_files = [file for file in os.listdir(directory_path) if file.startswith(‘tmp’) and file.endswith(‘.csv’)]

# Initialize an empty DataFrame to store the concatenated data

concatenated_df = pd.DataFrame()

# Loop through each CSV file and concatenate its data to the main DataFrame

for file in csv_files:

file_path = os.path.join(directory_path, file)

df = pd.read_csv(file_path)

df[[‘date’, ‘time’]] = df[‘OCCURRED_ON_DATE’].str.split(‘ ‘, expand=True)

df[‘date’] = pd.to_datetime(df[‘date’])

concatenated_df = pd.concat([concatenated_df, df], ignore_index=True)

# Display the concatenated DataFrame

#print(concatenated_df)

#print(df[‘date’])

# Save the concatenated DataFrame to an Excel file

#output_excel_path = r”C:\Users\91766\Desktop\stats3\concatenated_data.csv”

#concatenated_df.to_csv(output_excel_path, index=False)

#print(f”Concatenated data saved to {output_excel_path}”)

grouped_by_day = concatenated_df.groupby(pd.to_datetime(concatenated_df[‘date’]).dt.date).size().reset_index(name=’count’)

grouped_by_day.to_excel(r”C:\Users\91766\Desktop\stats3\grouped_by_day.xlsx”, index=False)

# Output 2: Group by month

grouped_by_month = concatenated_df.groupby(pd.to_datetime(concatenated_df[‘date’]).dt.to_period(‘M’)).size().reset_index(name=’count’)

grouped_by_month.to_excel(r”C:\Users\91766\Desktop\stats3\grouped_by_month.xlsx”, index=False)

# Output 3: Group by 6 months

grouped_by_6_months = concatenated_df.groupby(pd.to_datetime(concatenated_df[‘date’]).dt.to_period(‘6M’)).size().reset_index(name=’count’)

grouped_by_6_months.to_excel(r”C:\Users\91766\Desktop\stats3\grouped_by_6_months.xlsx”, index=False)

grouped_by_week = concatenated_df.groupby(pd.to_datetime(concatenated_df[‘date’]).dt.to_period(‘W’)).size().reset_index(name=’count’)

grouped_by_week.to_excel(r”C:\Users\91766\Desktop\stats3\grouped_by_week.xlsx”, index=False)

print(“complete”)

December 4, 2023December 9, 2023

4th December, 2023

Due to the severely limited nature of the data that we were working with in the analyze boston data set we further moved onto the work where we would be looking into another dataset with large enough data sets because that would help us train more accurate and reliable models and provide us a way to test the model as well on the previous data.

So the alternate dataset that we are looking at is crime reporting data set because we have data across multiiple columns using multiple factors and introducing time and location data.

We would mostly like to look at the time data and try to create some sort of time series and analyse that for further insight.

We can try this but this would initiallly require a lot of data transformation because the data in the crime reporting data set is at a very granular reoprting level and considering that we have day level for the last 8 years it’s going to be very low level data. Plottng this as a timeseries would not yield much because trying to predict something like this at a daily level would not be very useful unless done very accurately.

Besides the analyses would not be of much use as it would merely indicate how crimes occurr as a function of time and that is not the reality for crime in real life because there are a variety of factors that go into it but we can highlight from historical plotting of data some sort of trends to avoid certain stretches of time.

December 1, 2023December 9, 2023

1st December, 2023

We look at a final few linear models before we move on to some level of time series forecasting because the nature of the data largely shows the same results and I find it largely due to the nature and manner of the data collection is what causes the the outcomes in the high R squared values and the adjust R values.

So as a final linear model in the economic indicators data we can look at the same linear model that we have been analysing before but we can add to its complexity by adding interaction terms and other things such as the logan passengers and international flights data to futher enhance the complexity of the model and perhaps better enchance the prediction capabilities of the model.

We add the passenger term to our 3 term linear model and therefore making it 4 terms now.

Linear Model combining Passenger and Flight Data

Thus from the value we can tell while the value of P are low enough for passsenger data they are the only significant values as the passenger data even though marginally does increase the R square adjusted value to be into the 0.8s and it was not there before.

The international flights and passenger data have high enough P values that we can ignore them while considering our model so we can just stick with out initial 2 parameter model.

November 29, 2023December 9, 2023

29th November, 2023

So continuing on from last time’s analysis which we used to introudce one new extra variable to a model and we tried to check how it affected the fit and we can do so by analysing it’s underlying characteristic graphs.

We look at the residuals vs fitted curves first to check for heteroskedasticity and we can check that by the balloning of the values at the end.

In the below grah we see the QQ plots of the normalized behavior and we don’t see anything largely out of the ordinary here as we did with the heteroskedastic nature of the residuals vs fitted graph.

We see more of the same heteroskedastic nature from the standardized nature of the residuals plotted against the fitted values of the model.

We take a look at the residuals vs leverage graph to check how the outlier values are affeccting any chances of making a good read in the final model but we do not see any such behavior on a large scale here in the below graph.

November 27, 2023December 9, 2023

27th November, 2023

We concluded from our previous analyses of the various plots that we had obtained from the analysis of the hotel occupation rates and the hotel average daily rates and how we could relate the two to predict overall trends and how they can be seen.

So we saw that the linear model did not accurately predict a lot but it did do reasonably well for the flexibility that the model offered in terms of it’s simplicity. Compared to all our other data that we have we try to complicate the model further by making it so that we can add more variables.

To predict the hotel rate we try to do it using occupancy and international flights and total passengers and logan airport.

We hope these additions as most travelers should be staying in hotels should relate well to hotel rates and occupancy numbers.

From the R squared values we are able to tell that the the fit value got better because it is nearly 0.8 now and it shows a higher relation.

The p values for the introudced new term also shows that it’s significant due to a low enough P value.

November 24, 2023December 9, 2023

24th November, 2023

Diving further into the analysis of last time of the linear model that we created to fit the hotel rates given we have the available occupancies for the related data. We found a high R squared value and adjust R value which meant it showed a high degree of relation however it does not mean it’s causing it.

We further investigate this phenomenon by looking at what we can do to further analyse the model by looking at the various different plots and what they actually tell us in terms of the erros and outliers that our model is equipped to handle or not since it is a linear model it is not terribly flexible but it does so by not introducing a lot of bias into the model and causing any amount of overfitting or large variances.

Looking at the plots for our data we can tell that there are not any important problems right away that we can tell of.

Our Residuals vs Fitted do show some amount of balloning in terms of the way the data is spread out with increase in value indicating that there may be some heteroskedasticity.

Our QQ plots do not show any skewness and show mostly normalization in behavior which is fine for us for now.

This is just a normalized version of the residuals vs fitted curves and this does not show any better than our first which still shows heteroskedaticity.

From leverage plots we are able to tell if there are any outliers in our data and how they affect the fit in terms of the leverage or pull they provide. It provides that there’s not a lot of outliers pulling a lot of leverage.

November 22, 2023December 9, 2023

22nd November, 2023

Continuing on from last time where we further analysed the linear model for unemployemnet and total jobs we could satsifactoryily conclude that the data was actually good for the linear model fit that we would be using and it did not actually have to be changed due to the higher value of R squared that is shown.

We further ccomplicate the model by looking at how hotel occcupancy rates can be affected by the hoted avg rates that have been following and how that trends differently.

So we put this in a linear model and try to analyse the fit and model it gives us.

Linear Model for Hotel Rates Given Occupancy

Clearly a linear model can more or less account for overall trends in predicting how average rates can trend on the basis of occupancy.

November 20, 2023December 9, 2023

20th November, 2023

In our previous analysis we were looking at what we could do with international passenger data and how we could be using it to be combined with total number off flights data to make sure that we could find a model that could fit the data we had in order to be able to predict the type of data that we would be looking at and how the fit would really work.

We found the fit to be reasoanbly good and it was linear in nature with a very high R squared term indicating a high degree of relation between the two without a lot of chance to indicate the same.

Hence we chose to complicate the subject further by looking at a few more analyses of the same linear model to make sure we were not missing out on anything.

November 18, 2023December 9, 2023

18th November, 2023

For another variable analysis we were looking at if we could see somehow if there was any relation to be found between the numbers of total passengers through logan airport and if that varied through the number of international flights that month and if at all that would be affecting hotel occupancies and other related data.

So we start by first inputting an intital linear model comparing simply the total passengers and number of flights.

Clearly the data shows high relation between the two with a very high R squared value and clearly the low P value indicates that it is extremely rare to get this outcome by chance.

Looking at the plot of the model against the data we can further verify that it shows the model is a resoanbly good fit for what we are trying to look at.

As we can see from the plot that it is more or less linear in nature to get this sort of outcome and hence it works for our current analysis.

November 15, 2023December 9, 2023

15th November, 2023

Moving on from last time where we were simply just looking at the type of the data that we could be considering and all the drawbacks with using data that is not actually useful because of the amount of data present as it is nearly just 200 rows of data.

Still using this we decided to choose unemployment numbers and job numbers to see if we could find any models that could fit it reasoably well without any of the pitfalls of the rest of the modelling.

So we looked at initially just fitting a few simple linear models to give us an idea of how things were related and how they looked really.

Clearly the R squared value here shows that it’s highly related and the plots show the same as well.

November 13, 2023December 9, 2023

13th November, 2023

So to start with the analysis process for the project 3 we are looking at Public Boston data which is the analysis of the various economic indicators for the city of boston, collected over months at a time for 7 years from 2013 to 2020.

We initially wanted to check what was possible in terms of making it possible to look at the data we wanted to see in terms of what we were actually really looking at because it was not going to be possible to look at nearly 8 different variables and looking at the type of input that would be shown by each.

Also we wanted to make sure that we kept the interaction to a minimum to start off with, which is basically individiually explore each parameter. What is happening in this data is that by month wise it does give us the information for a variety of different things for the city of boston that help indicate it’s major economic and social health.

After looking at the entire data set we determined that there are not a lot of data points that we can use as it is a relatively small data set with only about 200 entries and nearly half of them are empty for a quite a few of the variables that have been described in the excel.

Hence it would not make a lot of sense to try and fit statistical learning methods on this data to try and fit or predict any sort of model. However we will so far just look at the descriptive statistics of each of the columns to see if we can find anything worthwhile to look at.

Clearly plotting the distributions for a lot of this data would not show much except that there is a tendency around certain time periods and regions since a lot of this data is mostly indexed by time.

November 12, 2023November 12, 2023

Project 2

https://github.com/Tiyasa-Saha/MTH522-Project-2/blob/main/Project%202.ipynb

Project 2

November 10, 2023November 12, 2023

Feedback Updated Project 1

Stats Project 1

November 8, 2023November 9, 2023

8th November, 2023

Continuing from last time where we were taking a look at the different types of clustering that we can use for the analysis of the geographical data, we used K means where it’s not particularly helpful since we have to ourselves decide the number of clusters for which we would need to have some data before hand to determine if it is working properly.

Instead of this approach I tried another one which involved using DBSCAN which is a cclustering method working on the basis of density and how close to each other points are locally rather than globally.

So we apply DBSCAN to check if it can produce results different to that of KMeans or is our estimate of 4 cluster from K Means correct? To verify this we plotted the DBSCAN data for the state of california.

DBSCAN results of clusters in california

This shows that the data in our K means clustering was not showing the same clusters in terms of density that now DBSCAN shows when using this. Hence this further cements that we need some sort of underlying verification.

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.cluster import DBSCAN

import geopandas as gpd

import cartopy.crs as ccrs

import cartopy.feature as cfeature

# Specify the full path to your Excel file using a raw string

excel_file_path = r’C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx’

# Read data from Excel file

df = pd.read_excel(excel_file_path)

# Filter shootings in the state of California and remove rows with missing latitudes or longitudes

df_ca = df[(df[‘state’] == ‘CA’) & (df[‘latitude’].notna()) & (df[‘longitude’].notna())]

# Extract latitude and longitude columns

coordinates = df_ca[[‘latitude’, ‘longitude’]]

# Perform DBSCAN clustering

dbscan = DBSCAN(eps=0.2, min_samples=5) # Adjust eps and min_samples based on your data

df_ca[‘cluster’] = dbscan.fit_predict(coordinates)

# Create a map of California using Cartopy

fig, ax = plt.subplots(subplot_kw={‘projection’: ccrs.PlateCarree()}, figsize=(12, 9))

ax.set_extent([-130, -113, 20, 50]) # California bounding box

# Plotting the clustered coordinates

for cluster in df_ca[‘cluster’].unique():

if cluster != -1: # Skip noise points (cluster = -1)

cluster_data = df_ca[df_ca[‘cluster’] == cluster]

latitudes = cluster_data[‘latitude’].tolist()

longitudes = cluster_data[‘longitude’].tolist()

ax.scatter(longitudes, latitudes, label=f’Cluster {cluster}’, s=20)

# Add map features

ax.coastlines(resolution=’10m’, color=’black’, linewidth=1)

ax.add_feature(cfeature.BORDERS, linestyle=’:’)

ax.legend()

# Draw state lines

ax.add_feature(cfeature.STATES, linestyle=’-‘, edgecolor=’black’)

# Show the plot

plt.title(‘DBSCAN Clustering of Fatal Police Shootings in California’)

plt.show()

November 6, 2023November 9, 2023

6th November, 2023

Continuing from last time we try to do the same comparison but across multiple different types of clustering.

The point of this exercise is just more or less compare how the different clusters look but it’s not strictly useful unless we have any other data for comparison that we are looking a.

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

import geopandas as gpd

import cartopy.crs as ccrs

import cartopy.feature as cfeature

# Specify the full path to your Excel file using a raw string

excel_file_path = r’C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx’

# Read data from Excel file

df = pd.read_excel(excel_file_path)

# Filter shootings in the state of California and remove rows with missing latitudes or longitudes

df_ca = df[(df[‘state’] == ‘CA’) & (df[‘latitude’].notna()) & (df[‘longitude’].notna())]

# Extract latitude and longitude columns

coordinates = df_ca[[‘latitude’, ‘longitude’]]

# Perform K-means clustering with K = 4

kmeans = KMeans(n_clusters=3, random_state=42)

df_ca[‘cluster’] = kmeans.fit_predict(coordinates)

# Create a map of California using Cartopy

fig, ax = plt.subplots(subplot_kw={‘projection’: ccrs.PlateCarree()}, figsize=(100, 28))

ax.set_extent([-130, -106, 20, 50]) # California bounding box

# Plotting the clustered coordinates

for cluster in range(4):

cluster_data = df_ca[df_ca[‘cluster’] == cluster]

latitudes = cluster_data[‘latitude’].tolist()

longitudes = cluster_data[‘longitude’].tolist()

ax.scatter(longitudes, latitudes, label=f’Cluster {cluster + 1}’, s=20)

# Add map features

ax.coastlines(resolution=’10m’, color=’black’, linewidth=1)

ax.add_feature(cfeature.BORDERS, linestyle=’:’)

ax.legend()

# Draw state lines

ax.add_feature(cfeature.STATES, linestyle=’-‘, edgecolor=’black’)

# Show the plot

plt.title(‘K-means Clustering of Fatal Police Shootings in California (K=4)’)

plt.show()

November 3, 2023November 9, 2023

3rd November, 2023

So further since my data was not able to be finalized for the plotting of the data of the cities and states with dense populations, I worked in a different direction to see that if once my data is actually matched I can make it so that I can plot the cities and then look at the heat maps of the population densities and accordingly work with that to see if the clusters and high population areas show any relation or closeness.

So to figure out clustering we went with K means one to start with and I picked California as the example as the same in class because it’s one of the few states where the data is seemingly isolated from other states and it has enough to seemily form some sort of legible clusters.

K means clustering with K=4 for the data of california

As we can see from the clustering, the clustering alone does not make a lot of sense to us and we can’t tell if this clustering is what we even need unless we have some other data to compare it to and make it more useful than just clusters on it’s own.

This is where either population density data or police station data can be input.

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

import geopandas as gpd

import cartopy.crs as ccrs

import cartopy.feature as cfeature

# Specify the full path to your Excel file using a raw string

excel_file_path = r’C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx’

# Read data from Excel file

df = pd.read_excel(excel_file_path)

# Filter shootings in the state of California and remove rows with missing latitudes or longitudes

df_ca = df[(df[‘state’] == ‘CA’) & (df[‘latitude’].notna()) & (df[‘longitude’].notna())]

# Extract latitude and longitude columns

coordinates = df_ca[[‘latitude’, ‘longitude’]]

# Perform K-means clustering with K = 4

kmeans = KMeans(n_clusters=4, random_state=42)

df_ca[‘cluster’] = kmeans.fit_predict(coordinates)

# Create a map of California using Cartopy

fig, ax = plt.subplots(subplot_kw={‘projection’: ccrs.PlateCarree()}, figsize=(12, 9))

ax.set_extent([-125, -113, 32, 37]) # California bounding box

# Plotting the clustered coordinates

for cluster in range(4):

cluster_data = df_ca[df_ca[‘cluster’] == cluster]

latitudes = cluster_data[‘latitude’].tolist()

longitudes = cluster_data[‘longitude’].tolist()

ax.scatter(longitudes, latitudes, label=f’Cluster {cluster + 1}’, s=20)

# Add map features

ax.coastlines(resolution=’10m’, color=’black’, linewidth=1)

ax.add_feature(cfeature.BORDERS, linestyle=’:’)

ax.legend()

# Draw state lines

ax.add_feature(cfeature.STATES, linestyle=’-‘, edgecolor=’black’)

# Show the plot

plt.title(‘K-means Clustering of Fatal Police Shootings in California (K=4)’)

plt.show()

November 1, 2023November 9, 2023

1st November, 2023

Continuing from last time I am trying to see if we can somehow correlate our geographical data with how far or how close we are from densely populated urban areas.

This work involved a lot of mapping data from different sources to make sure that they are all aligned to be able to plot the grographical data in accordance.

This is a big hurdle for me at the moment because the data sources and libraries of python that I have found for using US state wise data do not have the required granularity of being able to show population density. There is a way to map it by county level but that would require boundaries in terms of coorrdinates to follow which is immensely difficult to find for a county.

So for now I am trying to show density by creating my own dataset of cities with populations greater than 1 million and finding their geographical coordinates manually and mapping them to the city names

This will help me process the data efficiently and be able to overlay the densely populated cities over the data of the shootings.

I wish I could show the output for the overlay however the code for the data cleaning still has some minor issues which I am trying to fix for the time being because it does not always translate as 1:1 when creating a new data set and trying to match it with an existing one.

This is in fact just a huge lookup proocess that needs to complete properly before the plotting functions can work without errors.

import pandas as pd

import matplotlib.pyplot as plt

import geopandas as gpd

import cartopy.crs as ccrs

import cartopy.feature as cfeature

import us

# Specify the full path to your Excel file using a raw string

excel_file_path = r’C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx’

# Read data from Excel file

df = pd.read_excel(excel_file_path)

# Create a map of the USA using Cartopy

fig, ax = plt.subplots(subplot_kw={‘projection’: ccrs.PlateCarree()}, figsize=(12, 9))

ax.set_extent([-125, -66, 24, 49]) # USA bounding box

capitals = [“Washington, D.C.”, “Montgomery”, “Juneau”, “Phoenix”, “Little Rock”, “Sacramento”, “Denver”, “Hartford”, “Dover”, “Tallahassee”, “Atlanta”, “Honolulu”, “Boise”, “Springfield”, “Indianapolis”, “Des Moines”, “Topeka”, “Frankfort”, “Baton Rouge”, “Augusta”, “Annapolis”, “Boston”, “Lansing”, “St. Paul”, “Jackson”, “Jefferson City”, “Helena”, “Lincoln”, “Carson City”, “Concord”, “Trenton”, “Santa Fe”, “Albany”, “Raleigh”, “Bismarck”, “Columbus”, “Oklahoma City”, “Salem”, “Harrisburg”, “Providence”, “Columbia”, “Pierre”, “Nashville”, “Austin”, “Salt Lake City”, “Montpelier”, “Richmond”, “Olympia”, “Charleston”, “Madison”, “Cheyenne”]

geolocator = Nominatim(user_agent=”my_geocoder”)

coordinates = []

for capital in capitals:

location = geolocator.geocode(capital)

if location:

coordinates.append((capital, location.latitude, location.longitude))

else:

coordinates.append((capital, “Not found”, “Not found”))

# Print the results

for data in coordinates:

print(data)

# Extract latitude and longitude columns

latitude_column = ‘latitude’ # Replace with your actual column name

longitude_column = ‘longitude’ # Replace with your actual column name

latitudes = df[latitude_column].tolist()

longitudes = df[longitude_column].tolist()

# Plotting the coordinates

ax.scatter(longitudes, latitudes, s=10, c=’red’, marker=’o’, alpha=0.7, edgecolor=’k’, transform=ccrs.Geodetic())

# Add map features

ax.coastlines(resolution=’10m’, color=’black’, linewidth=1)

ax.add_feature(cfeature.BORDERS, linestyle=’:’)

# Get and plot capital cities using the us library

for state in us.STATES:

capital = us.states.lookup(state.capital)

ax.text(capital.longitude, capital.latitude, state.capital, transform=ccrs.PlateCarree(), fontsize=8, ha=’right’, va=’bottom’, color=’blue’)

# Draw state lines

ax.add_feature(cfeature.STATES, linestyle=’-‘, edgecolor=’black’)

# Show the plot

plt.title(‘Coordinates Plot on the Map of the USA with State Capitals Highlighted’)

plt.show()

October 30, 2023November 9, 2023

30th October, 2023

Continuing from last time, my major ideas were to see if we could find any correlation between the density of shootings by having a related plot of population density with it.

This would help us identify more easily if more shootings have been taking place in more populated areas and in a way we could be able to work out if the frequency of shootings is a function of the population of the people.

Trying to code this has some challenges that I have been facing because I can’t seem to really actualize the data I’m looking for in terms of population densities hence it has been a challenge.

A lot of popular python libraries contain data regarding county level population, but I can’t plot data at a county level when looking at a state level. A lot of population data online however does not have coordinates but rather names for the cities which has been another challenge.

Trying to figure the two out for now and coding the lookups required to match the data do not seem to be working as of right now, but I am working on fixing it because I do find it an interesting direction to be going in.

Furthermore I would maybe like to have police station coordinate data plotted to see if there is any pattern of data there in terms of the distances from police stations.

The code below is not yet fully working but It’s a reference point for the work later

import pandas as pd

import matplotlib.pyplot as plt

import geopandas as gpd

import cartopy.crs as ccrs

import cartopy.feature as cfeature

from sklearn.cluster import KMeans

import us

# Specify the full path to your Excel file using a raw string

excel_file_path = r’C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx’

# Read data from Excel file

df = pd.read_excel(excel_file_path)

# Extract latitude and longitude columns

latitude_column = ‘latitude’ # Replace with your actual column name

longitude_column = ‘longitude’ # Replace with your actual column name

latitudes = df[latitude_column].tolist()

longitudes = df[longitude_column].tolist()

# Perform K-means clustering

kmeans = KMeans(n_clusters=5, random_state=42)

df[‘cluster’] = kmeans.fit_predict(df[[latitude_column, longitude_column]])

# Create a map of the USA using Cartopy

fig, ax = plt.subplots(subplot_kw={‘projection’: ccrs.PlateCarree()}, figsize=(12, 9))

ax.set_extent([-125, -66, 24, 49]) # USA bounding box

# Plotting the coordinates with cluster colors

scatter = ax.scatter(df[longitude_column], df[latitude_column], s=10, c=df[‘cluster’], cmap=’viridis’, marker=’o’, alpha=0.7, edgecolor=’k’, transform=ccrs.Geodetic())

# Add colorbar

cbar = plt.colorbar(scatter, ax=ax, orientation=’vertical’, fraction=0.03, pad=0.05)

cbar.set_label(‘Cluster’)

# Add map features

ax.coastlines(resolution=’10m’, color=’black’, linewidth=1)

ax.add_feature(cfeature.BORDERS, linestyle=’:’)

# Get and plot capital cities using the us library

for state in us.STATES:

capital = us.states.lookup(state.capital)

ax.text(capital.longitude, capital.latitude, state.capital, transform=ccrs.PlateCarree(), fontsize=8, ha=’right’, va=’bottom’, color=’blue’)

# Draw state lines

ax.add_feature(cfeature.STATES, linestyle=’-‘, edgecolor=’black’)

# Show the plot

plt.title(‘K-Means Clustering on the Map of the USA with State Capitals Highlighted’)

plt.show()

October 27, 2023November 9, 2023

27th October, 2023

Deciding to go in a different direction, I tried my hand at looking at what the geographical data could tell me as I had exhausted all my other avenues working on the historical data.

To do this I had to plot the data to look at what we could find, inititally I need a map of the USA to plot this against or plotting the points is not going to make sense.

import pandas as pd

import matplotlib.pyplot as plt

import cartopy.crs as ccrs

import cartopy.feature as cfeature

# Specify the full path to your Excel file using a raw string

excel_file_path = r’C:\Users\91766\Desktop\fatal-police-shootings-data.xlsx’

# Read data from Excel file

df = pd.read_excel(excel_file_path)

# Create a map of the USA using Cartopy

fig, ax = plt.subplots(subplot_kw={‘projection’: ccrs.PlateCarree()}, figsize=(12, 9))

ax.set_extent([-125, -66, 24, 49]) # USA bounding box

# Extract latitude and longitude columns

latitude_column = ‘latitude’ # Replace with your actual column name

longitude_column = ‘longitude’ # Replace with your actual column name

latitudes = df[latitude_column].tolist()

longitudes = df[longitude_column].tolist()

# Plotting the coordinates

ax.scatter(longitudes, latitudes, s=10, c=’red’, marker=’o’, alpha=0.7, edgecolor=’k’, transform=ccrs.Geodetic())

# Add map features

ax.coastlines(resolution=’10m’, color=’black’, linewidth=1)

ax.add_feature(cfeature.BORDERS, linestyle=’:’)

ax.add_feature(cfeature.STATES, linestyle=’-‘, edgecolor=’black’) # Draw state lines

# Show the plot

plt.title(‘Coordinates Plot on the Map of the USA’)

plt.show()

This gives a good idea of how the points would look on the map of the USA and I have added state lines to get a better understanding of how the spread is concerning the country.

It’s clear to see from the initial plots that the there’s not a lot shootings in the mid west where not a lot of people live. A lot of shootings seem to be concentrated around more populated areas on the coasts.

October 25, 2023November 9, 2023

25th October

MENTAL HEALTH DAY OFF

October 23, 2023November 2, 2023

23rd October, 2023

So continuing on from last time, where I was doing monte carlo simulations on that comparison of white ages of people shot and black ages of people shot.

We saw that the monte carlo simulation showed that over a large number of trials, the data will show that for nearly over 50% of the time, the average ages of black people shot would be on an average 7 years younger than that of white people’s ages.

We do a similar monte carlo simulation on the group of hispanic people’s ages and we find interesting results for that as well. We had gotten similar results with an analysis of variance and T test when done on hispanic ages that there was a statistical significance to the difference in the means between the ages of hispanic people shot and white people shot.

Plotting the means of the several monte carlo simulations as a frequency distribution we get that on an average the hispanic person shot is 6.5 years younger than a white person shot.

Thus the data also prove that for 3rdd quartile or 95% of the time we will have a mean difference of 6.5 years younger for hispanic people.

October 20, 2023October 30, 2023

20th October, 2023

Finally I managed to fix the part of my code that was not letting me plot histograms for my monte carlo simulation on the data set for the ages of black and whites from the washington post shooting data.

So finally having run the code I can tell with more confidence that the difference in means for the age groups was not just by chance as we had seen in the T tests, but we went one strp further to make it more concrete using the monte carlo simulation.

Frequency Distribution of Difference in Means

From the above graph we can clearly tell that over a large number of randomized simulations on our data using resampling and finding the difference in means for the two groups, we still find that the average age of black people in the data is roughy 7 years younger than that of the white people.

Thus after having combined the results of pair wise T tests, T tests, Tukey Methods, Difference in Covariances, Monte Carlo Simulations, we can say that the observed differences in the mean ages of the black people as compared to white people who were shot is not occuring by chance.

October 18, 2023October 30, 2023

18th October, 2023

Hi, so continuing from last time, I was having some difficulties trying to code a monte carlo simulation for the test between black ages and white ages. This was to be done because it is not advised to do multiple T tests, this is just going to increase the error.

The problem with running a monte carlo simulation so far had been that it was tricky to control the test and control groups, as well as the sampling and this was leading to a lot of increased errors while running the code.

The errors were ocurring more often than not because of the issues of grouping and scaling for the plotting the histogram. While I continue to work on the plotting of the histogram, I will do a brief overview of what my code is trying to achieve with.

It starts with taking 3 data sets, we will be using them 2 at a time for the most part and keeping the white ages or the ‘wage’ group as the control group for our hypotheses testing and while this is the case for most tests, we are only doing this as doing a lot of pariwise T tests is not advises due to the errors.

The solution to that is to take our control group and test group, draw a fixed size of samples from each of the groups with replacement and thus adding a nature of randomness to the test. We use this to calculate the difference in the means of the two groups and we store them.

This is done using the for loop for a huge number of iterations so that we have enough data to be able to plot a histogram for the frequency of the value, that is the difference in the means of the values for the two data sets that we have been using. The point of the test is to determine if on an average, over a large number of simulations that are completely random, if we see a recurring pattern that is, there is actually a diffference in the means, and it is the result for the monte carlo simulation as well.

This would be a good marker to represent something more substantial and visaully easy to grasp when it comes to defining if something is ocurring by complete chance or there is something causing it. So far we would just like to understand and reject the Null Hypotheses, that the difference in means for the age groups is by chance.

October 16, 2023

16th October, 2023

Last time I was looking at t tests to see if the differences observed in the frequency distributions for ages of different races was purely by chance or due to some other factors that we could not determine yet.

The t test values for ages of Black, Hispanic and White people show significant differences in means which is not occurring by chance and hence rejecting the null hypotheses.

The next thing we can do is look at the analysis of variance to determine if there is a significant difference between the different races.

Here we can tell that by the P value that there is a significant difference between the 5 degrees of freedom of race.

To further analyse which pairs are contributing the most to differences, we have tried tukey method in R to do a pairwise analysis on all the races to check which of the combinations are more significant in the differences.

Anova pairwise for every race combination

Here we can see the P values for the W-N, W-B, W-H are the most significant of the values and all the rest just indicate chance. W-A also show some degree of significance but not as much as the first 3 mentioned.

More analysis later.

October 13, 2023October 16, 2023

13th October, 2023

Well continuing from where we were last time comparing age data for different races and finding a discrepancy between the means for Black, Whites and Hispanics.

The case was that on average Blacks were found to be 7 years younger than Whites, in the data. We wanted to confirm if this is the case by chance or is there an actual contributing factor.

To do this we can conduct T tests to check if the P values are significant for us to reject the null hypotheses that there is no difference in the true means of the samples.

Here we can see that the ages for black people vary from nearly 9 years younger to 6.5 years younger for 95% of the interval.

The T values are high and the P values are significantly low and we can say this is not just down to chance, it’s highly improbable.

As we can see here, the two ages are different on an average and the P values indicate this is highly unlikely to have happened by chance, the T values corroborate this as well.

October 11, 2023

11th October. 2023

Starting work on the Police shooting data, we don’t really have a very easy way through so we initially just look at a few key parameters of our interest and see if we can work our way through them and look for any patterns or insight.

So starting with some descriptive statistics for the analysis, we found that average and median ages of the victims were roughly 37 and 35 respectively.

It also showed a standard deviation of 13 years. Looking at the distribution of the ages we found something like this.

As we can see from this, there seems to be a right skewed distribution for the ages of the victims.

Looking further, what if we could look for any differences in the distributions for the ages of different races of victims.

So trying the same experiment for some of the races separately, we found this.

As you can see for the distribution of ages of Asians, there does not seem to be any visible pattern.

However when looking at data for the distributions of ages for African Americans who were shot, we can see that in terms of the distribution that had all races present, distribution for African American ages is even more right skewed, hence it could be a sign that on an average, younger people of this race are being killed.

Above we also have for comparison the distribution of the ages of all the people who are Hispanic and as we can see the graph for this race does not seem to be as skewed as those for African Americans, hence possibly indicating an average of relatively older Hispanic people being victims as compared to other races.

The distributions for native Americans and those with their races categorized as other do not show any meaningful patterns when visualized.

Looking at the age distributions for white people who were shot,

We can see that this graph is the most similar in spread to the one for Hispanic and African Americans, however it is not as skewed in nature as that for African Americans or Hispanic, in fact it almost looks the least skewed of the 3 thus indicating a central tendency of data points in terms of age.

Just from the descriptive stats of these different sections of age and race, we can tell that on average younger hispanic and black people are being shot as opposed to white people.

Will later be looking into more tools to see how to better quantify this discrepancy in the data when looking at the ages of the victims across different races.

Also most importantly need to take into account for the fact that not all of the variables have data for every row and thus can be somewhat inconsistent.

October 8, 2023

A report on Predicting Diabetes Prevalence from Obesity and Inactivity

MTH522_Project_1

October 6, 2023October 8, 2023

6th October, 2023

This will be more of a summarizing article because most of our work on the project was completed before this.

We had decided that based on what we had set out to do, the question we chose to answer with this dataset, was if there was any way to predict diabetes given that we had the data for corresponding inactivity and obesity levels.

Based off of this we looked for the counties that had data for all three parameters, and then we tried by fitting simple linear regression models which we found not to be adequate in predicting diabetes, and there was heteroskedasticity present in the data when looking at spread of residuals, also we did BP test to confirm this.

On further inspection it was concluded that a quadratic fit was indeed better for predicting diabetes, and was further improved by using an interaction term.

These various fits along with our final one, were tested using cross-validation with K=10 to provide their respective test errors. Even through this we could conclude that our latest quadratic model with an interaction term was the best fit, however due to the limited data only some 354 data points, we could only come up with up to a 0.42 correlation.

This is as far as we got and was all we could manage, if there were more data perhaps there would be a greater overarching trend that would be fit more easily with relatively less complexity.

October 4, 2023October 8, 2023

4th October, 2023

Just a little bit of final exploration before we go ahead with our final model with our best correlation value that we had obtained.

I just wanted to test if it was possible to get a better fit by increasing the complexity of the model, just by trying this for diabetes and obesity, I would check for the P values to see if any of them were significant,

As seen here, the P values suggest that a quadratic fit would be best

Thus we can see here, that all except the power 2 have the p values of not much significance and hence can be rejected, they are all clearly too high and the quadratic being the lowest.

This all coincides with our study so far the latest model we have built with quadratic models of obesity and inactivity.

October 2, 2023October 8, 2023

2nd October, 2023

So continuing in a little less than ideal fashion because there was a lot of painstaking research that I did on the data by performing various different tests. However my system crashed so I will try to replicate what I can and remember.

I had reached a bit of a dead end with the data wondering why I was really trying to model diabetes on the basis of inactivity and obesity.

In a way I had set out with a goal in mind to look for and had spent all my time looking for it, when the fact is that it may not necessarily even exist.

So I was comparing how the fits are when the models are built from a different set of predictors and what is the value to be predicted.

In doing these tests, I was finding that the multiple linear model of diabetes and obesity, for predicting inactivity was giving the highest R squared term I had encountered so far of 0.42. Actually it was 0.39 for a linear model which was much higher than any other previous linear model I had encountered, and it jumps to 0.42 with the addition of the interaction term.

This was particularly interesting because transformations did not help better the correlation but instead worsened it, in nearly all cases where the model was made more complex, such as log or polynomials.

However when I was comparing test errors for this, using K fold cross validation this model produced an error much higher than that of the 0.3 seen in the log models.

September 29, 2023

29th September, 2023

Continuing where I left off last time which was the comparison of different models with their test errors using the K fold cross validation approach.

To start off, I used the simplest linear model with diabetes modeled on inactivity and obesity. Further down the line a complicate the model further by adding interaction terms or square terms on the basis of the tests from yesterday to improve the model fit.

All of these were used to calculate the test error resulting from a K = 10 fold.

As is seen from this, I was achieving my lowest test errors for the log models but I believe there is something further I need to investigate into this because of how much the drop off in error is, when compared to its other simpler models.

September 27, 2023October 8, 2023

27th September, 2023

So working on just fine tuning my model a bit further I was looking at what terms I need to keep in my model for the most significance and remove those that do not affect the relation but increase complexity. Here I found that for modelling diabetes on inactivity and obesity, we are better off with their 2nd power rather than any higher order polynomials.

A look at P values for all the terms of the model

As you can see only the 2nd power ones are of significance to us and we reduce the model down to just those and we see that the correlation has not changed below.

Finally I also did a K fold cross validation test to check for the test error of the model and compare it to that of the linear model.

As you can see, even though it’s minimal there is a slight edge to the non linear model.

September 25, 2023October 8, 2023

25th September, 2023

Before in general moving onto splines and other smoothing methods and applying more transforms, I wanted to continue looking at how a polynomial representation of the model would affect the fit.

Since the last time I was comparing fits just based off a polynomial for obesity, this will compare both, using just inactivity, and using inactivity and obesity together.

We see here on the basis of the P values that we actually only need up to a quadratic function.

Next I tried combining polynomials for both inactivity and obesity to see if it affected the fit in a meaningful way.

Polynomial fit for both Inactivity and Obesity

As seen here by the R square values we can tell that it was not a significant increase.

After this I tried taking log functions of the polynomials to see if that would alter my results significantly.

As we can see from the R squared values again this was not producing anything drastically different in terms of what the fit was.

To further modify things I have tried to add an interaction term and as well as taking a log of diabetes as well.

Results forr adding an interaction term as well — Results for adding an interaction term as well

As we can see from our latest test that the addition of the interaction term changed things rather significantly as compared to the previous transforms and this also helped increase the R squared value.

September 22, 2023October 8, 2023

22nd September, 2023

I was working on comparing and finding which models are better to predict the data and venturing into non linear modelling, I was trying out various combinations and seeing which was the best fit across the models.

First up trying to model diabetes as just a factor of obesity, and then it’s subsequent powers of 2,3,4,5.

This will also be done to a log of obesity, and a log of the obesity and diabetes both.

I will use all the above to compare the fits by P values and see which one is turning out best.

I will also further be carrying out tests for Inactivity data the same way but while writing this I realised that it contains some states with no data for inactivity hence making it difficult to run the poly function.

But for now focusing on trying to fit on the basis of transformations to obesity,

As you can see from the P values of the compared models, you can see that the most appropriate fit would be achieved by using the quadratic model rather than any higher ones, however it is interesting to notice that the log model and the 5th power non linear fit are similar in P value.

Will continue with more tests and see how the fits are turning out.

September 20, 2023October 8, 2023

20th September 2023

So continuing on from my previous test using a linear model, varying to accommodate an interaction term, and then further varying it to be the log of the function gave us a higher R squared value, for me the highest so far.

So I went ahead and wrote a function and tried to use the bootstrap method to verify the coefficients of our assumed function.

function(data , index) coef(lm(log(diabetes)~log(inactive) +log(obese) + obese*inactive , data = data, subset=index))

I now ran one bootstrap verification on the entire data set taking a data set of 363 with replacement.

As you can see 2 different samples of the same data provided varying results, we want them to be averaged over a large number of randomly sampled data sets.

Here this was done over a varying number of times to see if it provided any benefit in finding a more precise coefficient.

As you can see it did not vary much over 10 different samples, it found the supposed coefficients somewhere between 1 and 5 different samples and their aggregates.

September 18, 2023October 8, 2023

18th September, 2023

Hello,

Starting off where I left last time, in a search of answers of some sort, the linear model was only doing so much. So we tried multiple regression variables modelling diabetes as an effect of inactivity and obesity. This showed a minor increase in the R squared term, it was roughly the same while trying any quadratic factors for the same model.

However when trying to explore further by introducing the interaction term for inactivity and obesity,

Summary for model with interaction terms

Here we see a further increase in the R squared factor for the model which is better for us. However in my efforts to further increase this by trying different variations of the models and factors, I tried out using log of diabetes and the predictors, to see if that could help our case.

Summary of Log transformation — Log Transformation of Linear Model

Hence it is evident that this led to a further increase in the R squared which did not seem to be happening with our higher powered terms.

Now I further introduced the interaction variable to the log transformed model
to see if it could help improve the accuracy.

Summary of log transformations and interaction terms

As you can see this model produced my highest yet R squared of 0.42.

This felt like a few steps in the right direction.

September 15, 2023October 8, 2023

15th September, 2023

I was going to write this post as a continuation to the last one but however today in class we spoke about collinearity as mentioned by someone, so I wanted to check my multiple linear regression model for any.

VIF for Inactivity and obeseity — VIF for Inactivity and obesity

This showed that the VIF values for both predictors were low and hence do not show any signs of collinearity.

Also while I continue to look for a better fitting model, I tried a simple comparison between the simple linear model for diabetes against inactivity, against the multiple linear model.

Since the P value shown is low, we can reject the hypotheses that both these models represent the data equally well and that the multiple linear regression model is in fact better.

Will continue looking for better transformations to apply to the make the fit better.

September 13, 2023October 8, 2023

13th September, 2023

Writing this post in a continuation to the previous tests that I was running.

While before I was only looking visually at the graph of residuals vs fitted values to determine if the variances are generally equal over the spread, but now instead I have tried the breusch pagan test.

It will plot the data of the residuals against the fitted data and look for any sort of relation between the two, this will essentially compute a p value for the data of residual vs fitted. This is useful for us in determining if there is really no pattern in the data and we accept a high p value from the null hypotheses point of view, which is the assumption that the data is homoskedastic. This does not turn out to be the case in the following.

Diabetes vs Inactivity data
Obesity vs Inactivity data
Modelling Diabetes as a factor of both Inactivity and Obesity

But however the model for the Diabetes vs Obesity data clearly shows that the P value is high enough for it’s model to be considered to have equal variances for the most part, or the null hypotheses is accepted, which is that the data is homoskedastic.

The scaled graph provides a better estimate in this case to visually verify the nature of homoskedasticity and the extent of it.

September 11, 2023October 8, 2023

September 11th, 2023

For my first exploration of the CDC diabetes data, I used some basic data cleaning and joining to see if there was anything common across the three data sets provided. This join yielded only about 350 or so entries which reduced the data set significantly but it still has enough number of observations to be applicable for central limit theorem,

I then proceeded to make some simple scatterplots taking two of the variables at a time, to get a better look of the data and to see if it exhibited any trends.

There was nothing too obvious about these so I decided to start with the simplest, a linear model regression which would be used to model for Y which would be diabetes, in which case there is no diabetes data, it Inactivity that was used against obesity as a predictor.

I will insert a brief of the summaries of the linear models generated and their graphs.

Diabetic vs Obese

As you can see the Correlation is very low 0.14 and the coefficients do not have a significant enough p value either. On further examination

The residuals and fitted values show a pattern of ballooning with increase in value of the data and is not what we are looking for, it should be relatively random with no pattern to it so as to be equally spaced. Our graphs however indicate heteroscedacity in the data. The QQ plot does indicate some degree of normality in the residuals but plotting them against the fitted values shows that the variances are not equal. However there are no values of residuals that are having a significant impact on the coefficients either.

Diabetic vs Inactive

The grey area in the plot indicates the standard errors.

Again the P values are very low with a high error and low correlation.

Summary of Diabetes modeled on Inactivity

The data for the residuals is following the same trends, with near normalized residuals and again heteroscedacity as seen in the residual vs fitted data due to the ballooning effect as seen. Again no large effects on the coefficients as seen from the last graph of leverage.

Graphs for Diabetes modeled on inactivity

Inactivity vs Obesity

This again does not show any signs of a good correlation. As well with the rest of the data.

Multiple Linear Regression

Trying to account for diabetes on account of both, Inactivity and Obesity as the two predictor variables.

As you can the error is much lower than any of the above methods and the correlation while not much higher than any of the previous models, is still higher by 0.1.

Investigating further

While the residuals are close to normalized, the residuals vs fitted do still show some spread as the data values increase but significantly lesser than those of before. I still think there could be some hetereoscedacity but will investigate that further next time.

September 6, 2023

Test

test

September 6, 2023

Hello world!

Welcome to UMassD WordPress. This is your first post. Edit or delete it, then start blogging!