If you have followed along with my journey toward becoming a data scientist, you may have seen some posts describing what it means to be one. Working on this project with Chandler Ellsworth really helped me understand the techniques all data scientists need to know.

As in any data science project, we first had to get our data and manipulate it into a form that could be analyzed. Sadly, all the data was saved in a numeric format, even though the vast majority of it should have been stored as factors. After correcting for that, we were able to perform some exploratory data analysis. This not only lets the data scientist find trends but also gives our users (or the people who are trying to understand our results) background information that will eventually help explain our final conclusions. Then came the modeling, where we worked out which variables were important. We fit logistic regression, LASSO, classification tree, random forest, linear discriminant analysis, and support vector machine models. We chose these because classifying who could get diabetes and who would not is a supervised learning problem, specifically classification, and each of these is a classification model. After tuning each model's parameters to get the best results we could, we compared the models at the end of our report and recommended the one with the lowest log loss.
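To make the final comparison step concrete, here is a minimal sketch of how binary log loss is computed and how the lowest value picks the model. It is written in plain Python rather than the R code used in the project, and the model names, labels, and predicted probabilities are made-up values for illustration only, not results from our report.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Average binary log loss; lower is better."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities so log(0) can never occur.
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical labels (1 = diabetes, 0 = no diabetes) and predictions.
y_true = [1, 0, 1, 0]
predictions = {
    "model_a": [0.9, 0.2, 0.7, 0.1],  # confident and mostly right
    "model_b": [0.6, 0.4, 0.6, 0.4],  # right, but less confident
}

scores = {name: log_loss(y_true, p) for name, p in predictions.items()}
best = min(scores, key=scores.get)  # model with the lowest log loss
print(scores, best)
```

Log loss rewards a model for being both correct and confident, which is why the hypothetical `model_a` comes out ahead here even though both models rank every case on the right side of 0.5.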

All in all, it was a great project for understanding how to take data and turn it into a solution or answer for our target audience. That said, I would have handled some parts of this project differently. First, having to change all of the data into factors (to do our exploratory analysis and make better plots) and then change it back to save computational time in our models was not fun. Maybe I should have kept much more of the data numeric from the beginning. On top of that, we should have considered not using a support vector machine as our model of choice. This model takes a very long time to run, which makes it frustrating to check that it still works after each update.

This leads to the most difficult part of this project. The models took a very long time to run, and making md files for each education level took an extremely long time. Rather than spending a quick five or ten minutes knitting our Rmd file into a usable md file for GitHub, I would have to run our render code and do other things while it processed in the background. It got to the point where I would run it before going to bed or stepping away from my computer (e.g. to sleep, eat, or socialize) and have it knit our files before I got back to my laptop. Maybe if my computer were newer, I would not have had this issue.

I had some major takeaways from this project. First, I did not realize how many people have diabetes. According to the CDC, about 37.3 million Americans have diabetes. This is an astounding number that is hard for me to believe. Maybe processed food has something to do with this? Second, it was really nice to learn some machine learning techniques, especially using the caret package in R. Having the cross-validation features and being able to fit many different models with similar code is really nice. If only these models could run faster. That leads me to my last takeaway: make sure that what you have done is correct. The worst thing you can do is run or knit these documents to publish, realize you have an error, and then have to fix it and wait all over again for your files. Always check your work, because those extra 60 seconds could potentially save hours.
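For readers who have not seen cross-validation before, the idea can be sketched in a few lines. This is plain Python for illustration, not caret's implementation, and the helper name `k_fold_indices` is my own. It also skips the random shuffling that a real workflow would normally do first.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k (train, holdout) pairs."""
    # Deal the rows out to the folds like cards: 0, k, 2k, ... go to fold 0.
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    splits = []
    for i, holdout in enumerate(folds):
        # Train on every fold except the one being held out.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, holdout))
    return splits

splits = k_fold_indices(n_rows=10, k=5)
for train, holdout in splits:
    print("train on", train, "evaluate on", holdout)
```

Each row is held out exactly once, so every model is scored on data it was not fit to. It also means each model is fit k times, which is a big part of why the tuning took so long.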

For those who are interested in viewing the project, please view our website to see the results. If you would like to see the machine learning techniques and how this was made, feel free to check out our GitHub repository that contains all of the information.

As always, feel free to connect with me on LinkedIn or contact me via email and I would be happy to hear what machine learning techniques you would potentially use for this project or any other suggestions you would like to share.

