29th September, 2023

Continuing where I left off last time, which was comparing different models by their test errors using the K fold cross validation approach.

To start off, I used the simplest linear model, with diabetes modeled on inactivity and obesity. From there I complicate the model by adding interaction terms or square terms, on the basis of the tests from yesterday, to improve the model fit.

For each of these models I calculated the test error from K fold cross validation with K = 10.

K fold cross validation comparison

As seen from this, I was achieving my lowest test errors with the log models. However, I believe I need to investigate this further, because the drop in error compared to the simpler models is so large.
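The comparison in this entry can be sketched roughly as follows. This is only an illustration under stated assumptions: the data frame `cdc` is synthetic stand-in data (not the CDC data), the three formulas are examples of the kinds of models described, and `cv.glm()` from the `boot` package is one common way to get a K = 10 estimate, not necessarily the one used for the figure.

```r
library(boot)  # cv.glm() comes from the recommended 'boot' package

set.seed(1)
n <- 363
# synthetic stand-in data; column names follow the ones used elsewhere in these posts
cdc <- data.frame(inactive = runif(n, 10, 25), obese = runif(n, 12, 25))
cdc$diabetes <- 2 + 0.2 * cdc$inactive + 0.15 * cdc$obese + rnorm(n)

# candidate models, from simplest to more complex
formulas <- list(
  linear      = diabetes ~ inactive + obese,
  interaction = diabetes ~ inactive * obese,
  squared     = diabetes ~ inactive + obese + I(inactive^2) + I(obese^2)
)

# 10-fold cross validation estimate of the mean squared prediction error
cv_error <- sapply(formulas, function(f) {
  fit <- glm(f, data = cdc)          # a gaussian glm gives the same fit as lm
  cv.glm(cdc, fit, K = 10)$delta[1]  # first element is the raw CV estimate
})
print(cv_error)
```

One thing worth keeping in mind when reading such a table: for a model fitted to log(diabetes), this error is measured on the log scale, so it is not directly comparable with the error of a model fitted to diabetes itself. That may be part of why the log models look so much better.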

27th September, 2023

Working on fine tuning my model a bit further, I was looking at which terms I need to keep in the model because they are significant, and which to remove because they do not affect the relation but increase complexity. Here I found that for modelling diabetes on inactivity and obesity, we are better off with their 2nd powers rather than any higher order polynomials.

A look at P values for all the terms of the model

As you can see, only the 2nd power terms are of significance to us, so we reduce the model down to just those. The fit below shows that the correlation has not changed.

Fit with removed linear terms
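A minimal sketch of this reduction step, assuming synthetic stand-in data in a data frame called `cdc` (a hypothetical name) with the columns `diabetes`, `inactive` and `obese`. It reads the P values off the full fit and then refits with only the squared terms.

```r
set.seed(2)
n <- 363
# synthetic stand-in data, not the CDC data
cdc <- data.frame(inactive = runif(n, 10, 25), obese = runif(n, 12, 25))
cdc$diabetes <- 2 + 0.2 * cdc$inactive + 0.15 * cdc$obese + rnorm(n)

# full model with linear and squared terms
full <- lm(diabetes ~ inactive + obese + I(inactive^2) + I(obese^2), data = cdc)
print(round(summary(full)$coefficients[, "Pr(>|t|)"], 4))  # P value of each term

# reduced model: linear terms removed, squared terms kept
reduced <- lm(diabetes ~ I(inactive^2) + I(obese^2), data = cdc)

r2 <- c(full = summary(full)$r.squared, reduced = summary(reduced)$r.squared)
print(round(r2, 3))
```

Because the reduced model is nested in the full one, its R squared can never be higher, so the question is only how much is lost by dropping the linear terms.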

Finally, I also ran a K fold cross validation to estimate the test error of this model and compare it to that of the linear model.

K fold Cross Validation

As you can see, even though it is minimal, there is a slight edge to the non linear model.

25th September, 2023

Before moving on to splines and other smoothing methods and applying more transforms, I wanted to continue looking at how a polynomial representation of the model would affect the fit.

Last time I was comparing fits based only on a polynomial in obesity. This time I compare two cases: using just inactivity, and using inactivity and obesity together.

Inactive to degree 6 Polynomial

We see here, on the basis of the P values, that we actually only need terms up to a quadratic.

Next I tried combining polynomials for both inactivity and obesity to see if it affected the fit in a meaningful way.

Polynomial fit for both Inactivity and Obesity

As seen here from the R squared values, this was not a significant increase.

After this I tried taking log functions of the polynomials to see if that would alter my results significantly.

Log of Polynomial Functions

As we can see from the R squared values, this again did not produce anything drastically different in terms of fit.

To modify things further, I tried adding an interaction term as well as taking the log of diabetes.

Results for adding an interaction term as well

As we can see from this latest test, the addition of the interaction term changed things rather significantly compared to the previous transforms, and it also helped increase the R squared value.

22nd September, 2023

I was working on comparing models to find which ones predict the data better, and venturing into non linear modelling. I tried out various combinations to see which gave the best fit.

First up, I tried to model diabetes as a function of just obesity, and then of its subsequent powers 2, 3, 4 and 5.

This will also be done with the log of obesity, and with the log of both obesity and diabetes.

I will use all of the above to compare the fits by P values and see which one turns out best.

I will also carry out the same tests for the inactivity data, but while writing this I realised that it contains some states with no data for inactivity, which makes it difficult to run the poly function.

But for now, focusing on fits based on transformations of obesity:
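The difficulty with poly can be reproduced in a small sketch, assuming the missing inactivity values are stored as NA (an assumption; the post does not say how they are encoded). The data frame `cdc` and its values are synthetic. poly() stops with an error when its input contains missing values, so one workaround is to drop the incomplete rows before fitting.

```r
set.seed(3)
# synthetic stand-in data with two missing inactivity values
cdc <- data.frame(obese = runif(50, 12, 25), inactive = runif(50, 10, 25))
cdc$diabetes <- 2 + 0.2 * cdc$inactive + 0.15 * cdc$obese + rnorm(50)
cdc$inactive[c(5, 17)] <- NA

# poly() refuses missing values, so this fit signals an error
res <- try(lm(diabetes ~ poly(inactive, 2), data = cdc), silent = TRUE)
print(inherits(res, "try-error"))

# keep only rows where both variables are present, then the same formula runs
complete <- cdc[complete.cases(cdc[, c("diabetes", "inactive")]), ]
fit <- lm(diabetes ~ poly(inactive, 2), data = complete)
print(nrow(complete))
```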

Fit comparison for non linear models

As you can see from the P values of the compared models, the most appropriate fit is achieved by the quadratic model rather than any higher order one. However, it is interesting to notice that the log model and the 5th power non linear fit are similar in P value.

Will continue with more tests and see how the fits are turning out.

20th September, 2023

Continuing on from my previous tests: starting from a linear model, varying it to accommodate an interaction term, and then taking the log of the function gave us a higher R squared value, for me the highest so far.

So I went ahead and wrote a function to use the bootstrap method to verify the coefficients of our assumed function.

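# returns the coefficients of the model refit on the rows in index (the name boot.fn is only illustrative)
boot.fn <- function(data, index) coef(lm(log(diabetes) ~ log(inactive) + log(obese) + obese*inactive, data = data, subset = index))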

I then ran one bootstrap verification on the entire data set, drawing a sample of 363 with replacement.

Verify Bootstrap with the entire data

As you can see, 2 different samples of the same data provided varying results, so we want the estimates averaged over a large number of randomly sampled data sets.
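A sketch of both steps, the single resamples and the repeated bootstrap, assuming synthetic stand-in data in a hypothetical data frame `cdc`. The statistic function uses the same formula as the function above; `boot()` from the `boot` package and R = 200 replicates are choices made for this illustration.

```r
library(boot)

set.seed(5)
n <- 363
# synthetic stand-in data, not the CDC data
cdc <- data.frame(inactive = runif(n, 10, 25), obese = runif(n, 12, 25))
cdc$diabetes <- 2 + 0.2 * cdc$inactive + 0.15 * cdc$obese + rnorm(n)

# statistic: coefficients of the model refit on the resampled rows
boot_fn <- function(data, index)
  coef(lm(log(diabetes) ~ log(inactive) + log(obese) + obese * inactive,
          data = data, subset = index))

# two single resamples of all 363 rows, drawn with replacement
once  <- boot_fn(cdc, sample(n, n, replace = TRUE))
again <- boot_fn(cdc, sample(n, n, replace = TRUE))
print(rbind(once, again))  # the two rows differ from each other

# boot() repeats the resampling R times and reports bias and standard error
result <- boot(cdc, boot_fn, R = 200)
print(result)
```

Note that in an R formula `obese * inactive` expands to both main effects plus their interaction, so this model has six coefficients including the intercept.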

Here this was done a varying number of times to see if it provided any benefit in finding a more precise coefficient.

Bootstrap with varying tries

As you can see, it did not vary much over 10 different samples; it found the supposed coefficients somewhere between 1 and 5 different samples and their aggregates.

18th September, 2023

Hello,

Starting off where I left off last time, in search of answers of some sort: the linear model was only doing so much. So we tried multiple regression, modelling diabetes as an effect of inactivity and obesity. This showed a minor increase in R squared, and it stayed roughly the same when trying quadratic terms for the same model.

Linear Multiple Regression Summary

However, I explored further by introducing the interaction term for inactivity and obesity:

Summary for model with interaction terms

Here we see a further increase in R squared for the model, which is better for us. In my efforts to increase this further by trying different variations of the models and factors, I tried using the log of diabetes and of the predictors to see if that could help our case.

Summary of Log transformation
Log Transformation of Linear Model

It is evident that this led to a further increase in R squared, which did not seem to be happening with our higher powered terms.

Now I further introduced the interaction variable to the log transformed model to see if it could help improve the accuracy.

Summary of log transformations and interaction terms

As you can see, this model produced my highest R squared yet, 0.42.

This felt like a few steps in the right direction.
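The sequence of models in this entry can be lined up side by side as in the sketch below. The data frame `cdc` holds synthetic stand-in data, so the numbers it prints say nothing about the real fit. The last formula is the one from the bootstrap function in the 20th September post; the others are my reading of the descriptions above.

```r
set.seed(6)
n <- 363
# synthetic stand-in data, not the CDC data
cdc <- data.frame(inactive = runif(n, 10, 25), obese = runif(n, 12, 25))
cdc$diabetes <- 2 + 0.2 * cdc$inactive + 0.15 * cdc$obese + rnorm(n)

models <- list(
  additive        = lm(diabetes ~ inactive + obese, data = cdc),
  interaction     = lm(diabetes ~ inactive * obese, data = cdc),
  log_log         = lm(log(diabetes) ~ log(inactive) + log(obese), data = cdc),
  log_interaction = lm(log(diabetes) ~ log(inactive) + log(obese) + obese * inactive,
                       data = cdc)
)

# R squared of each fit, in the order the models were tried
r2 <- sapply(models, function(m) summary(m)$r.squared)
print(round(r2, 3))
```

A caveat on reading the result: R squared for a model of log(diabetes) describes variance explained on the log scale, so comparing it with R squared for a model of diabetes itself is only a rough guide.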

15th September, 2023

I was going to write this post as a continuation of the last one, but today in class we spoke about collinearity, which someone brought up, so I wanted to check my multiple linear regression model for any.

VIF for Inactivity and obesity

This showed that the VIF values for both predictors were low and hence did not show any signs of collinearity.
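For reference, the VIF of a predictor is 1 / (1 - R squared), where R squared comes from regressing that predictor on the other predictors. The sketch below computes it by hand on synthetic stand-in data; the helper `vif_manual` is a hypothetical name written for this example, not the function that produced the output in the figure.

```r
set.seed(7)
n <- 363
# synthetic stand-in data with mildly correlated predictors
inactive <- runif(n, 10, 25)
obese <- 8 + 0.5 * inactive + rnorm(n, sd = 2)
cdc <- data.frame(inactive, obese)

# VIF = 1 / (1 - R^2) from regressing one predictor on the others
vif_manual <- function(predictor, others, data) {
  aux <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vif <- c(inactive = vif_manual("inactive", "obese", cdc),
         obese    = vif_manual("obese", "inactive", cdc))
print(round(vif, 3))
```

With only two predictors the two VIF values are always identical, because both come from the same pairwise correlation. A value of 1 means no collinearity at all, and larger values mean more.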

Also, while I continue to look for a better fitting model, I tried a comparison between the simple linear model for diabetes against inactivity and the multiple linear model.

Comparison of variance tables

Since the P value shown is low, we can reject the hypothesis that both these models represent the data equally well, and conclude that the multiple linear regression model is in fact better.

I will continue looking for better transformations to apply to make the fit better.
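A comparison of variance tables like this one can be sketched with `anova()` on two nested fits. The data frame `cdc` is synthetic stand-in data, built so that obesity genuinely helps, which means the small P value it prints is by construction and not a finding.

```r
set.seed(8)
n <- 363
# synthetic stand-in data, not the CDC data
cdc <- data.frame(inactive = runif(n, 10, 25), obese = runif(n, 12, 25))
cdc$diabetes <- 2 + 0.2 * cdc$inactive + 0.15 * cdc$obese + rnorm(n)

simple   <- lm(diabetes ~ inactive, data = cdc)
multiple <- lm(diabetes ~ inactive + obese, data = cdc)

# F test of the null hypothesis that the extra predictor adds nothing
tab <- anova(simple, multiple)
print(tab)
p_value <- tab[2, "Pr(>F)"]
```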

13th September, 2023

This post is a continuation of the previous tests that I was running.

Before, I was only looking visually at the graph of residuals vs fitted values to determine if the variances are generally equal over the spread. Now I have instead tried the Breusch-Pagan test.

It looks for any sort of relation between the size of the residuals and the model, by regressing the squared residuals on the predictors, and computes a p value for that relation. This is useful for determining whether there is really no pattern in the residuals. The null hypothesis is that the data is homoskedastic, so a high p value means we cannot reject it. Homoskedasticity does not turn out to be the case for the following models.

Diabetes vs Inactivity data
Obesity vs Inactivity data
Modelling Diabetes as a factor of both Inactivity and Obesity

Diabetes vs Inactivity BPTest
Obesity vs Inactivity BPTest
BPtest for multiple regression model

However, the model for the Diabetes vs Obesity data clearly shows a P value high enough for its model to be considered to have equal variances for the most part. In other words, we fail to reject the null hypothesis that the data is homoskedastic.

Obesity vs Diabetes BPTest
Diabetes modeled for obesity graphs

The scaled graph provides a better way in this case to visually verify the homoskedasticity and the extent of it.
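The Breusch-Pagan test used in this entry can be sketched by hand in base R. This follows the studentized form of the test: regress the squared residuals on the predictors and compare n times the R squared of that regression with a chi-squared distribution. The data here is synthetic and deliberately built so that the spread grows with inactivity, so a low p value is expected by construction.

```r
set.seed(9)
n <- 363
# synthetic stand-in data whose noise grows with the predictor
inactive <- runif(n, 10, 25)
diabetes <- 2 + 0.3 * inactive + rnorm(n, sd = 0.1 * inactive)
cdc <- data.frame(inactive, diabetes)

fit <- lm(diabetes ~ inactive, data = cdc)

# auxiliary regression: do the squared residuals depend on the predictor?
aux <- lm(resid(fit)^2 ~ inactive, data = cdc)

bp_stat <- n * summary(aux)$r.squared                    # test statistic
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)   # df = number of predictors
print(c(statistic = bp_stat, p.value = bp_p))
```

A low p value rejects the null hypothesis of homoskedasticity. A high one only means the test found no evidence against it.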

11th September, 2023

For my first exploration of the CDC diabetes data, I used some basic data cleaning and joining to see what was common across the three data sets provided. This join yielded only about 350 or so entries, which reduced the data set significantly, but it still has enough observations for the central limit theorem to be applicable.
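A join of this kind might look like the sketch below. Everything in it is a toy stand-in: the three small tables, their values, and the key column `FIPS` are assumptions made for illustration, since the post does not show the actual column names. Chaining `merge()` with its default inner join keeps only the rows present in all three tables, which is why the joined data ends up smaller than its inputs.

```r
# toy tables with made-up values; FIPS is an assumed county identifier
diabetes_df <- data.frame(FIPS = c(1001, 1003, 1005, 1007),
                          diabetes = c(9.1, 8.4, 11.2, 10.3))
obesity_df  <- data.frame(FIPS = c(1001, 1003, 1007),
                          obese = c(18.2, 17.5, 19.9))
inactive_df <- data.frame(FIPS = c(1001, 1005, 1007),
                          inactive = c(16.0, 17.8, 15.1))

# inner joins: a county survives only if it appears in every table
cdc <- Reduce(function(x, y) merge(x, y, by = "FIPS"),
              list(diabetes_df, obesity_df, inactive_df))
print(cdc)
print(nrow(cdc))
```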

I then proceeded to make some simple scatterplots taking two of the variables at a time, to get a better look of the data and to see if it exhibited any trends.

Diabetic vs Obesity
Diabetic vs Inactive
Inactivity vs Obesity

There was nothing too obvious about these, so I decided to start with the simplest approach, a linear regression with diabetes as the response Y. For the pair that does not involve diabetes, inactivity was the response and obesity the predictor.

Below are brief summaries of the linear models generated, along with their graphs.

Diabetic vs Obese 

 

Diabetes fit for obesity
Diabetes modeled on Obesity Summary

As you can see, the correlation is very low at 0.14, and the coefficients do not have a significant enough p value either. On further examination:

Diabetes modeled for obesity graphs

The residuals vs fitted values show a pattern of ballooning as the values increase, which is not what we are looking for. They should be relatively random with no pattern, so that the spread is even. Our graphs, however, indicate heteroscedasticity in the data. The QQ plot does indicate some degree of normality in the residuals, but plotting them against the fitted values shows that the variances are not equal. However, there are no residual values that are having a significant impact on the coefficients either.
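The four diagnostic graphs discussed here are what `plot()` draws for an `lm` fit by default. The sketch below uses synthetic stand-in data built to balloon in the same way, and adds a rough numeric check of the spread. The call to `pdf(NULL)` is only there so the sketch runs without a display; drop it and `dev.off()` to see the plots interactively.

```r
set.seed(11)
n <- 363
# synthetic stand-in data whose noise grows with obesity
obese <- runif(n, 12, 25)
diabetes <- 2 + 0.3 * obese + rnorm(n, sd = 0.08 * obese)
cdc <- data.frame(obese, diabetes)

fit <- lm(diabetes ~ obese, data = cdc)

pdf(NULL)             # discard graphics so this runs headless
par(mfrow = c(2, 2))
plot(fit)             # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
dev.off()

# rough companion to the first panel: residual spread in the lower
# and upper halves of the fitted values
upper <- fitted(fit) > median(fitted(fit))
spread <- c(lower_sd = sd(resid(fit)[!upper]), upper_sd = sd(resid(fit)[upper]))
print(round(spread, 3))
```

If the variances were equal, the two standard deviations would be about the same; a clearly larger value in the upper half matches the ballooning seen in the residuals vs fitted graph.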

Diabetic vs Inactive

Diabetes modeled for Inactivity

The grey area in the plot indicates the standard errors.

Again the P values are very low with a high error and low correlation.

Summary of Diabetes modeled on Inactivity

The residuals follow the same trends, with near normal residuals and again heteroscedasticity, as seen from the ballooning effect in the residuals vs fitted plot. Again there are no large effects on the coefficients, as seen from the last graph, of leverage.

Graphs for Diabetes modeled on inactivity

Inactivity vs Obesity

Inactivity modeled on obesity

As with the rest of the data, this again does not show any signs of a good correlation.

Inactivity Summary for Obesity
Graphs for Obesity vs Inactivity

Multiple Linear Regression

Trying to account for diabetes using both Inactivity and Obesity as the two predictor variables.

Summary of Multiple Linear Regression on the data

As you can see, the error is much lower than in any of the above models, and the correlation, while not much higher than in the previous models, is still higher by 0.1.

Investigating further

Graphs for the multiple Regression

While the residuals are close to normal, the residuals vs fitted plot still shows some spread as the data values increase, though significantly less than before. I still think there could be some heteroscedasticity, but will investigate that further next time.