8th December, 2023

Having split the data across the different levels of granularity by grouping on the time index and counting occurrences, we can plot the frequency over time and check for seasonality in the data.

We plot the time series data at the different granularities and see what the visual trends reveal.

Monthly Time Series Modelling
Weekly Time Series Modelling
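Below is a rough sketch of how these plots can be produced with matplotlib, assuming the grouped_by_month and grouped_by_week frames from the grouping code in the 6th December entry are still in memory; the column names and figure layout are assumptions, not the exact plotting code used.

import matplotlib.pyplot as plt

# Assumes grouped_by_month and grouped_by_week from the grouping script in the 6th December entry
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Monthly counts: coarser granularity, so the line is smoother
axes[0].plot(grouped_by_month['date'].dt.to_timestamp(), grouped_by_month['count'])
axes[0].set_title('Monthly Time Series')

# Weekly counts: finer granularity, so the same series looks noisier
axes[1].plot(grouped_by_week['date'].dt.to_timestamp(), grouped_by_week['count'])
axes[1].set_title('Weekly Time Series')

plt.tight_layout()
plt.show()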

The difference in granularity is clearly visible in the level of detail of the weekly graph compared with the monthly graph.

We can also check for stationarity using the ADF test, using the p-value from the test to assess whether the series is stationary.

ADF Test
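As a minimal sketch, the ADF test can be run with statsmodels as below; it assumes the grouped_by_month frame from the 6th December entry, with the monthly totals in its count column.

from statsmodels.tsa.stattools import adfuller

# Run the Augmented Dickey-Fuller test on the monthly counts (assumed to be in grouped_by_month)
result = adfuller(grouped_by_month['count'])

# adfuller returns the test statistic, p-value, lags used, number of observations and critical values
print(f"ADF statistic: {result[0]:.4f}")
print(f"p-value: {result[1]:.4f}")
for level, crit in result[4].items():
    print(f"Critical value ({level}): {crit:.4f}")

# A p-value above 0.05 means we cannot reject the unit-root null, i.e. the series is non-stationary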

After performing the ADF tests on the data, we can tell very clearly that the series are not stationary. We can also look at how the monthly view breaks down when decomposed into its different components.

Decomposition of the Time Series for Monthly
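A short sketch of one way to produce this decomposition with statsmodels is below; the additive model and the 12-month seasonal period are assumptions on our part.

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Build a monthly-indexed series from the grouped counts (assumes grouped_by_month as before)
monthly_series = grouped_by_month.set_index(
    grouped_by_month['date'].dt.to_timestamp()
)['count']

# Additive decomposition into trend, seasonal and residual components
decomposition = seasonal_decompose(monthly_series, model='additive', period=12)

# Plot the observed series together with its trend, seasonal and residual parts
decomposition.plot()
plt.show()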

6th December, 2023

So, looking at the time series data, we had to start the analysis in a way that would let us look at trends. Since the data we have is at a very granular level, we cannot use it directly; we first need to apply some transformations to make the data accessible.

This can be done by using the existing time data to perform a group-by at the required level of granularity, so that we can check for trends at that level, whether daily, weekly or monthly.

We start off with a monthly, six-monthly and weekly split. We also group by day, just to see the level of granularity evident in the daily plots.

We used the code below to make sure we accurately split our plotting across the different levels of granularity for the time series model.

import pandas as pd
import os

# Set the directory where your CSV files are located
directory_path = r"C:\Users\91766\Desktop\stats3"

# Get a list of all files in the directory starting with "tmp" and ending with ".csv"
csv_files = [file for file in os.listdir(directory_path) if file.startswith('tmp') and file.endswith('.csv')]

# Initialize an empty DataFrame to store the concatenated data
concatenated_df = pd.DataFrame()

# Loop through each CSV file, split the timestamp into date and time, and concatenate it to the main DataFrame
for file in csv_files:
    file_path = os.path.join(directory_path, file)
    df = pd.read_csv(file_path)
    df[['date', 'time']] = df['OCCURRED_ON_DATE'].str.split(' ', expand=True)
    df['date'] = pd.to_datetime(df['date'])
    concatenated_df = pd.concat([concatenated_df, df], ignore_index=True)

# Display the concatenated DataFrame
#print(concatenated_df)
#print(df['date'])

# Save the concatenated DataFrame to a CSV file
#output_excel_path = r"C:\Users\91766\Desktop\stats3\concatenated_data.csv"
#concatenated_df.to_csv(output_excel_path, index=False)
#print(f"Concatenated data saved to {output_excel_path}")

# Output 1: Group by day
grouped_by_day = concatenated_df.groupby(pd.to_datetime(concatenated_df['date']).dt.date).size().reset_index(name='count')
grouped_by_day.to_excel(r"C:\Users\91766\Desktop\stats3\grouped_by_day.xlsx", index=False)

# Output 2: Group by month
grouped_by_month = concatenated_df.groupby(pd.to_datetime(concatenated_df['date']).dt.to_period('M')).size().reset_index(name='count')
grouped_by_month.to_excel(r"C:\Users\91766\Desktop\stats3\grouped_by_month.xlsx", index=False)

# Output 3: Group by 6 months
grouped_by_6_months = concatenated_df.groupby(pd.to_datetime(concatenated_df['date']).dt.to_period('6M')).size().reset_index(name='count')
grouped_by_6_months.to_excel(r"C:\Users\91766\Desktop\stats3\grouped_by_6_months.xlsx", index=False)

# Output 4: Group by week
grouped_by_week = concatenated_df.groupby(pd.to_datetime(concatenated_df['date']).dt.to_period('W')).size().reset_index(name='count')
grouped_by_week.to_excel(r"C:\Users\91766\Desktop\stats3\grouped_by_week.xlsx", index=False)

print("complete")

 

4th December, 2023

Due to the severely limited nature of the data we were working with in the Analyze Boston dataset, we moved on to looking at another dataset with enough data to train more accurate and reliable models, while also giving us a way to test the model against the previous data.

The alternative dataset we are looking at is a crime reporting dataset, since it has data across multiple columns, covering multiple factors, and introduces time and location data.

We would mostly like to look at the time data, build some form of time series from it, and analyse that for further insight.

We can try this, but it will initially require a lot of data transformation, because the crime reporting dataset is recorded at a very granular level; with day-level records for the last 8 years, it is very low-level data. Plotting this as a time series would not yield much, because predicting something like this at a daily level would not be very useful unless done very accurately.

Besides, the analysis would not be of much use on its own, as it would merely indicate how crimes occur as a function of time, and that is not the reality of crime in real life, where a variety of other factors are involved. Still, by plotting historical data we can highlight some trends, such as certain stretches of time to avoid.

1st December, 2023

We look at a final few linear models before moving on to some level of time series forecasting, because the data largely shows the same results, and I find that the nature and manner of the data collection is largely what causes the high R-squared and adjusted R-squared values.

So, as a final linear model on the economic indicators data, we can take the same linear model we have been analysing and add to its complexity with interaction terms and additional variables, such as the Logan passengers and international flights data, to further enhance the model and perhaps improve its predictive capabilities.

We add the passenger term to our three-term linear model, making it a four-term model.

Linear Model combining Passenger and Flight Data
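As a hedged sketch of what this expanded model could look like in statsmodels, the formulas below add the Logan passengers and international flights terms to the earlier predictors; the file name, the target and the predictor_1 to predictor_3 column names are placeholders standing in for the actual economic indicators columns, not the real ones.

import pandas as pd
import statsmodels.formula.api as smf

# Placeholder file and column names for the economic indicators data
econ = pd.read_csv("economic_indicators.csv")

# Original three-term model plus the Logan passengers term (the fourth term discussed above)
model_4_terms = smf.ols(
    "target ~ predictor_1 + predictor_2 + predictor_3 + logan_passengers",
    data=econ,
).fit()

# One way to add further complexity: include international flights and an interaction term
model_interaction = smf.ols(
    "target ~ predictor_1 + predictor_2 + predictor_3 + logan_passengers * logan_intl_flights",
    data=econ,
).fit()

# The summaries report the coefficient p-values and adjusted R-squared discussed below
print(model_4_terms.summary())
print(model_interaction.summary())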

From the output we can tell that while the p-value for the passenger data is low enough, it is the only significant term, and the passenger data, even if only marginally, does push the adjusted R-squared into the 0.8s, where it was not before.

Overall, the international flights and passenger data have high enough p-values that we can ignore them when considering our model, so we can stick with our initial two-parameter model.