September 11th, 2023

For my first exploration of the CDC diabetes data, I did some basic data cleaning and joined the three data sets provided to find the entries common to all of them. The join yielded only about 350 entries, which reduces the data set significantly, but that is still enough observations for the central limit theorem to apply.
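To make the join step concrete, here is a minimal sketch of an inner join across three tables. The county codes, the values, and the `inner_join` helper are all made up for illustration and are not taken from the CDC files; the real data would be keyed on whatever identifier the three sets share.

```python
# Three hypothetical tables keyed by a made-up county code.
# The values are illustrative percentages, not real CDC figures.
diabetes = {"01001": 9.1, "01003": 8.4, "01005": 11.2}
obesity = {"01001": 33.0, "01005": 36.5, "01007": 31.8}
inactivity = {"01001": 27.5, "01003": 24.0, "01005": 29.9}


def inner_join(*tables):
    """Keep only the keys present in every table.

    Returns a dict mapping each shared key to a tuple of its values,
    in the same order as the tables were passed in.
    """
    common = set(tables[0]).intersection(*tables[1:])
    return {key: tuple(t[key] for t in tables) for key in sorted(common)}


joined = inner_join(diabetes, obesity, inactivity)
print(len(joined), "rows survive the join")
```

Only keys that appear in all three tables survive, which is why a join like this can shrink the data set so sharply.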

I then made some simple scatterplots, taking two of the variables at a time, to get a better look at the data and to see if it exhibited any trends.

Scatterplot: Diabetic vs Obesity
Scatterplot: Diabetic vs Inactive
Scatterplot: Inactivity vs Obesity

 

Nothing too obvious stood out in these, so I decided to start with the simplest approach: a simple linear regression with diabetes as the response Y. For the pair that has no diabetes variable, inactivity was used as the response, with obesity as the predictor.
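For reference, a simple linear regression of this kind can be sketched in a few lines of plain Python. This is only an illustration of the technique: the function name `simple_linear_fit` and the sample percentages are hypothetical and do not come from the CDC data or from the models summarized in this post.

```python
import math


def simple_linear_fit(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r), where r is the Pearson correlation.
    For a one-predictor model, R-squared equals r squared.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r


# Made-up sample values (percent obese, percent diabetic), illustration only.
obesity_pct = [30.1, 32.4, 28.7, 35.0, 31.2, 33.8]
diabetes_pct = [8.2, 9.0, 7.9, 8.8, 9.4, 8.5]

intercept, slope, r = simple_linear_fit(obesity_pct, diabetes_pct)
print(f"intercept={intercept:.3f} slope={slope:.3f} r={r:.3f} R^2={r * r:.3f}")
```

The same function covers all three pairings; only the choice of predictor and response changes.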

Below are brief summaries of the linear models generated, along with their graphs.

Diabetic vs Obese 

 

Diabetes fit for obesity
Diabetes modeled on Obesity Summary

As you can see, the correlation is very low (0.14), and the p-values of the coefficients are not significant either. On further examination of the diagnostic plots:

Diabetes modeled for obesity graphs

The residuals vs fitted plot shows the residuals ballooning as the fitted values increase, which is not what we are looking for: the residuals should be scattered randomly around zero with roughly constant spread and no pattern. Our graphs instead indicate heteroscedasticity in the data. The Q-Q plot does indicate some degree of normality in the residuals, but plotting them against the fitted values shows that the variances are not equal. However, no observations have enough influence to significantly affect the coefficients.
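The ballooning pattern can also be checked numerically rather than by eye. One common option is the Breusch-Pagan test in its studentized (Koenker) form: regress the squared residuals on the predictor and use n times the R-squared of that auxiliary regression as the test statistic. The sketch below is not the check used for the plots in this post; the helper names and the synthetic data are assumptions made purely for illustration.

```python
import math


def ols(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """R-squared of the simple regression of y on x."""
    a, b = ols(x, y)
    mean_y = sum(y) / len(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


def breusch_pagan(x, y):
    """Studentized Breusch-Pagan test with a single predictor.

    Returns (lm, p_value). Under constant variance, lm is approximately
    chi-square with 1 degree of freedom, whose upper tail probability
    is erfc(sqrt(lm / 2)).
    """
    a, b = ols(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    lm = max(0.0, len(x) * r_squared(x, sq_resid))  # guard against rounding
    return lm, math.erfc(math.sqrt(lm / 2))


# Synthetic data for illustration only: the noise grows with x in the
# first series and stays constant in the second.
x = list(range(1, 21))
y_fanning = [2 + 0.5 * xi + (-1) ** xi * 0.1 * xi for xi in x]
y_constant = [2 + 0.5 * xi + (-1) ** xi * 1.0 for xi in x]

lm_fan, p_fan = breusch_pagan(x, y_fanning)
lm_const, p_const = breusch_pagan(x, y_constant)
print(f"fanning noise:  LM={lm_fan:.2f}, p={p_fan:.4g}")
print(f"constant noise: LM={lm_const:.2f}, p={p_const:.4g}")
```

A small p-value is evidence against constant variance, which is the numerical counterpart of the fan shape in a residuals vs fitted plot.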

Diabetic vs Inactive

Diabetes modeled for Inactivity

The grey area in the plot indicates the standard errors.

Again, the p-values are very low, with a high error and a low correlation.

Summary of Diabetes modeled on Inactivity

The residuals follow the same trends: they are close to normally distributed, and the residuals vs fitted plot again shows heteroscedasticity through the same ballooning effect. Again, the last graph (leverage) shows no observations with a large effect on the coefficients.
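The leverage graph summarizes how much each observation can pull the fitted line. As a rough sketch of what sits behind such a plot, leverage and Cook's distance for a one-predictor model can be computed directly. The function name `influence_measures` and the data are hypothetical; the last point is planted deliberately as an influential outlier.

```python
def influence_measures(x, y):
    """Leverage and Cook's distance for the fit y = a + b * x.

    Uses p = 2 estimated coefficients (intercept and slope).
    Returns (leverage, cooks), two lists aligned with the input.
    """
    n, p = len(x), 2
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in resid) / (n - p)  # residual variance estimate
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    cooks = [
        (r * r / (p * s2)) * h / (1 - h) ** 2
        for r, h in zip(resid, leverage)
    ]
    return leverage, cooks


# Made-up data: nine points on the line y = 2x plus one planted outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 10]

leverage, cooks = influence_measures(x, y)
worst = cooks.index(max(cooks))
print(f"most influential observation: index {worst}, x={x[worst]}")
```

A common rule of thumb treats a Cook's distance above 1 as worth a closer look; when no point comes near that, the coefficients are not being driven by a handful of observations.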

Graphs for Diabetes modeled on inactivity

 

Inactivity vs Obesity

Inactivity modeled on obesity

This again does not show any sign of a good correlation, as with the rest of the data.

Inactivity Summary for Obesity
Graphs for Obesity vs Inactivity

Multiple Linear Regression

Next, I tried to model diabetes using both inactivity and obesity as the two predictor variables.

Summary of Multiple Linear Regression on the data

As you can see, the error is much lower than for any of the models above, and the correlation, while not much higher than in the previous models, is still higher by 0.1.
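For readers who want to see the mechanics, a two-predictor least-squares fit can be written out by centering the variables and solving the resulting 2x2 normal equations. This is a minimal sketch under that assumption, not the code behind the summary in this post; `fit_two_predictors` and the sample values are made up for illustration.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1 * x1 + b2 * x2.

    Centers all variables, solves the 2x2 normal equations by Cramer's
    rule, and returns (b0, b1, b2, r_squared).
    """
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * v for a, v in zip(c1, cy))
    s2y = sum(b * v for b, v in zip(c2, cy))
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    sse = sum((yi - (b0 + b1 * a + b2 * b)) ** 2 for a, b, yi in zip(x1, x2, y))
    sst = sum(v * v for v in cy)
    return b0, b1, b2, 1 - sse / sst


# Made-up sample values (percentages), for illustration only.
inactivity_pct = [24.0, 27.5, 22.1, 29.9, 25.3, 26.8]
obesity_pct = [30.1, 32.4, 28.7, 35.0, 31.2, 33.8]
diabetes_pct = [8.2, 9.0, 7.9, 8.8, 9.4, 8.5]

b0, b1, b2, r2 = fit_two_predictors(inactivity_pct, obesity_pct, diabetes_pct)
print(f"b0={b0:.3f} b1={b1:.3f} b2={b2:.3f} R^2={r2:.3f}")
```

Adding a predictor can never lower R-squared on the same data, so a small gain over the single-predictor models is expected; whether it is meaningful depends on the size of the gain and on the residual diagnostics.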

Investigating further

Graphs for the multiple Regression

While the residuals are close to normally distributed, the residuals vs fitted plot still shows some spread as the fitted values increase, though significantly less than before. I still think there could be some heteroscedasticity, but I will investigate that further next time.
