September 11th, 2023 – gautammarathe

For my first exploration of the CDC diabetes data, I did some basic data cleaning and joined the three provided data sets to see what they had in common. The join yielded only about 350 entries, which reduces the data set significantly, but that is still enough observations for the central limit theorem to apply.
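The join described above can be sketched as follows. This is only a minimal illustration, not the code used for this post: the county-code keys, the variable names, and all the values are made up, and I am assuming the three data sets share some common identifier to join on.

```python
# Hypothetical tables keyed by a county code; every value here is invented
# for illustration and does not come from the CDC data.
diabetes = {"01001": 8.7, "01003": 9.1, "01005": 11.2}
obesity = {"01001": 33.0, "01005": 36.1, "01007": 34.5}
inactivity = {"01001": 16.4, "01003": 17.0, "01005": 19.9}


def inner_join(*tables):
    """Keep only the keys present in every table; return key -> tuple of values."""
    common = set(tables[0]).intersection(*tables[1:])
    return {key: tuple(table[key] for table in tables) for key in sorted(common)}


joined = inner_join(diabetes, obesity, inactivity)
print(len(joined), "rows survive the join")
```

Only keys that appear in all three tables survive, which is why an inner join can shrink the data so sharply.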

I then made some simple scatterplots, taking two of the variables at a time, to get a better look at the data and to see whether it exhibited any trends.

Nothing too obvious stood out, so I decided to start with the simplest approach: a simple linear regression with diabetes as the response Y. Where there is no diabetes data, inactivity was modeled with obesity as the predictor.
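For reference, a simple linear regression of this kind can be written out by hand. The sketch below is an assumption-laden illustration, not the tool used for this post: the function names `fit_line` and `r_squared` are my own, and the numbers are invented rather than taken from the CDC data.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(y, fitted):
    """Share of the variance in y that the fitted values explain."""
    y_bar = mean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Invented percentages, purely illustrative.
obesity = [30.1, 32.4, 33.0, 35.2, 36.8, 38.5]
diabetes = [7.9, 8.8, 8.1, 9.6, 9.0, 10.4]

intercept, slope = fit_line(obesity, diabetes)
fitted = [intercept + slope * x for x in obesity]
residuals = [y - f for y, f in zip(diabetes, fitted)]
print(f"intercept={intercept:.3f} slope={slope:.3f} R^2={r_squared(diabetes, fitted):.3f}")
```

The `fitted` and `residuals` lists are what the diagnostic plots (residuals vs. fitted, QQ plot) are drawn from.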

Below are brief summaries of the linear models generated, along with their graphs.

Diabetic vs Obese

As you can see, the correlation is very low (0.14), and the coefficients do not have significant p-values either. On further examination:

The residuals vs. fitted plot shows the residuals ballooning as the fitted values increase, which is not what we are looking for: the residuals should be scattered randomly, with roughly equal spread and no pattern. Our graphs, however, indicate heteroscedasticity in the data. The QQ plot does indicate some degree of normality in the residuals, but plotting them against the fitted values shows that the variances are not equal. However, no individual observations are having a significant impact on the coefficients either.
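One rough way to put a number on the ballooning is to compare the residual spread at low and high fitted values. The sketch below is in the spirit of a Goldfeld-Quandt comparison but is not a formal test, and it is not what produced the plots in this post. The function name `spread_ratio` and the data are made up for illustration.

```python
def spread_ratio(fitted, residuals):
    """Mean squared residual in the upper half of fitted values divided by
    the mean squared residual in the lower half."""
    # Order the residuals by their fitted value, smallest first.
    ordered = [r for _, r in sorted(zip(fitted, residuals))]
    half = len(ordered) // 2
    low, high = ordered[:half], ordered[-half:]
    mean_sq_low = sum(r * r for r in low) / len(low)
    mean_sq_high = sum(r * r for r in high) / len(high)
    return mean_sq_high / mean_sq_low


# Invented residuals that fan out as the fitted value grows.
fitted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
residuals = [0.1, -0.1, 0.1, -1.0, 1.0, -1.0]
print(spread_ratio(fitted, residuals))
```

A ratio near 1 is what equal variances would look like; a ratio well above 1 is consistent with the fan shape seen in a residuals vs. fitted plot.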

Diabetic vs Inactive

The grey area in the plot indicates the standard errors.

Again, the p-values are very low, with a high error and low correlation.

Summary of Diabetes modeled on Inactivity

The residuals follow the same trends: they are close to normal, and the residuals vs. fitted plot again shows heteroscedasticity through the same ballooning effect. Again, the leverage plot (the last graph) shows no observations with a large effect on the coefficients.

Graphs for Diabetes modeled on inactivity

Inactivity vs Obesity

This again does not show any sign of a good correlation, as with the rest of the data.

Multiple Linear Regression

Here I try to account for diabetes using both inactivity and obesity as the two predictor variables.

As you can see, the error is much lower than in any of the models above, and the correlation, while not much higher than in the previous models, is still higher by 0.1.
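For two predictors, the least squares coefficients have a closed form that can be computed directly. The sketch below is a hand-rolled illustration under my own naming (`fit_two_predictors` is hypothetical), not the routine used for this post, and the data is synthetic, built so that the true coefficients are known.

```python
from statistics import mean


def fit_two_predictors(x1, x2, y):
    """Least squares fit of y = b0 + b1 * x1 + b2 * x2 via the normal equations."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 ** 2  # zero when the two predictors are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Synthetic data generated exactly as y = 1 + 2 * x1 + 3 * x2.
inactivity = [1.0, 2.0, 3.0, 4.0]
obesity = [2.0, 1.0, 4.0, 3.0]
diabetes = [1 + 2 * a + 3 * b for a, b in zip(inactivity, obesity)]

b0, b1, b2 = fit_two_predictors(inactivity, obesity, diabetes)
print(f"b0={b0:.3f} b1={b1:.3f} b2={b2:.3f}")
```

The `det` term shows why strongly correlated predictors are a problem: as the two predictors become more collinear, `det` approaches zero and the coefficient estimates become unstable.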

Investigating further

While the residuals are close to normal, the residuals vs. fitted plot still shows some spread as the fitted values increase, though significantly less than before. I still think there could be some heteroscedasticity and will investigate that further next time.
