This will be more of a summary article, because most of our work on the project was completed before this point.
Given what we had set out to do, the question we chose to answer with this dataset was whether diabetes can be predicted from the corresponding inactivity and obesity levels.
Based on this, we looked for the counties that had data for all three parameters and started by fitting simple linear regression models. These turned out not to be adequate for predicting diabetes: the spread of the residuals showed heteroskedasticity, which we confirmed with a Breusch-Pagan (BP) test.
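As a rough illustration of the heteroskedasticity check, the sketch below fits a linear model and computes the studentized (Koenker) form of the Breusch-Pagan statistic, n times the R^2 of an auxiliary regression of the squared residuals on the same predictors. It is not our original analysis code: the data are synthetic stand-ins, the variable names `inactivity`, `obesity`, and `diabetes` are assumptions, and for brevity both predictors go into one additive linear model.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients; X must already contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def breusch_pagan_lm(X, y):
    """Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary
    regression of the squared residuals on the same regressors."""
    resid = y - X @ fit_ols(X, y)
    sq = resid ** 2
    aux_fitted = X @ fit_ols(X, sq)
    ss_res = np.sum((sq - aux_fitted) ** 2)
    ss_tot = np.sum((sq - sq.mean()) ** 2)
    return len(y) * (1.0 - ss_res / ss_tot)

# Synthetic stand-in data (NOT the county dataset), built so that the
# noise spread grows with inactivity, i.e. it is heteroskedastic.
rng = np.random.default_rng(0)
n = 354
inactivity = rng.uniform(10, 35, n)
obesity = rng.uniform(15, 40, n)
noise = rng.normal(0, 1, n) * 0.1 * (inactivity - 9)
diabetes = 2 + 0.15 * inactivity + 0.10 * obesity + noise

X = np.column_stack([np.ones(n), inactivity, obesity])
lm_stat = breusch_pagan_lm(X, diabetes)

# 5% critical value of a chi-square distribution with 2 degrees of
# freedom (one per predictor, excluding the intercept).
CHI2_CRIT_DF2_5PCT = 5.991
print("BP statistic:", lm_stat)
print("reject constant variance:", lm_stat > CHI2_CRIT_DF2_5PCT)
```

A statistic above the critical value suggests that the residual variance depends on the predictors, which is the same conclusion a residual-spread plot would point to.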
On further inspection, we concluded that a quadratic fit predicted diabetes better, and that adding an interaction term improved it further.
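To make the model form concrete, here is a minimal sketch of a quadratic fit with an interaction term next to a plain linear fit. The exact terms we used are not listed in this summary, so the design matrix below (intercept, both predictors, both squares, and their product) is an assumption, as are the variable names and the synthetic data.

```python
import numpy as np

def linear_design(x1, x2):
    """Columns: intercept, x1, x2."""
    return np.column_stack([np.ones_like(x1), x1, x2])

def quadratic_interaction_design(x1, x2):
    """Columns: intercept, x1, x2, x1^2, x2^2, x1*x2 (assumed model form)."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

def r_squared(X, y):
    """In-sample R^2 of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)

# Synthetic stand-in data (NOT the county dataset) with some curvature
# and an interaction between the two predictors.
rng = np.random.default_rng(1)
n = 354
inactivity = rng.uniform(10, 35, n)
obesity = rng.uniform(15, 40, n)
diabetes = (5 + 0.01 * (inactivity - 20) ** 2
            + 0.02 * inactivity * obesity
            + rng.normal(0, 0.5, n))

r2_lin = r_squared(linear_design(inactivity, obesity), diabetes)
r2_quad = r_squared(quadratic_interaction_design(inactivity, obesity), diabetes)
print("linear R^2:", r2_lin)
print("quadratic + interaction R^2:", r2_quad)
```

Because the linear model is nested inside the quadratic one, the in-sample R^2 can never get worse when terms are added, so this comparison alone does not show that the larger model generalizes better; that requires held-out data.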
These fits, along with our final one, were evaluated using 10-fold cross-validation (K=10) to estimate their respective test errors. By this measure too, our latest quadratic model with an interaction term was the best fit; however, with limited data (only some 354 data points), we could only reach a correlation of at most 0.42.
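The cross-validation comparison can be sketched as follows. This is an illustrative version under stated assumptions rather than our original code: the helper `kfold_mse` is hypothetical, mean squared error is assumed as the test-error measure, and the data are synthetic stand-ins with the same number of points (354).

```python
import numpy as np

def linear_design(x1, x2):
    """Columns: intercept, x1, x2."""
    return np.column_stack([np.ones_like(x1), x1, x2])

def quadratic_interaction_design(x1, x2):
    """Columns: intercept, x1, x2, x1^2, x2^2, x1*x2 (assumed model form)."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

def kfold_mse(design_fn, x1, x2, y, k=10, seed=0):
    """Average held-out mean squared error over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    fold_errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds only, then score on the held-out fold.
        X_train = design_fn(x1[train], x2[train])
        X_test = design_fn(x1[test], x2[test])
        beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
        fold_errors.append(np.mean((y[test] - X_test @ beta) ** 2))
    return float(np.mean(fold_errors))

# Synthetic stand-in data (NOT the county dataset).
rng = np.random.default_rng(2)
n = 354
inactivity = rng.uniform(10, 35, n)
obesity = rng.uniform(15, 40, n)
diabetes = (5 + 0.01 * (inactivity - 20) ** 2
            + 0.02 * inactivity * obesity
            + rng.normal(0, 0.5, n))

cv_lin = kfold_mse(linear_design, inactivity, obesity, diabetes, k=10)
cv_quad = kfold_mse(quadratic_interaction_design, inactivity, obesity, diabetes, k=10)
print("10-fold CV MSE, linear:", cv_lin)
print("10-fold CV MSE, quadratic + interaction:", cv_quad)
```

Using the same seed for both models keeps the fold assignment identical, so the two test errors are compared on exactly the same splits.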
This is as far as we got. With more data, there might be a clearer overarching trend that could be fit more easily with a less complex model.