Don’t be Led Astray on Kaggle!
Kaggle is a great platform for learning and sharing, so why the word of caution?
Like many others out there, I use Kaggle to hone my data science skills. It’s a great platform for finding datasets, sharing code and discovering fantastic methods that others have shared. This time, I completed a project (here) with the Diabetes Health Indicators cleaned dataset before perusing the notebooks shared by others, and I was initially confused. My answers to the research questions were different, and my models performed poorly in comparison. Why?
I started asking why and realised I had done nothing (majorly/obviously) wrong. In fact, I spotted errors in many notebooks I checked and was affirmed in my decisions. The three biggest errors are discussed below:
Many Kagglers used Pearson’s r for all features.
This dataset contains binary/categorical and continuous variables. Pearson’s correlation coefficient is insufficient for the task because it measures linear relationships between continuous variables. In my notebook, I also use Cramér’s V (categorical vs. categorical) and the point-biserial correlation (binary vs. continuous). On reflection, I also need to explore methods for quantifying non-linear dependence, such as CANOVA.
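To make the two extra measures concrete, here is a minimal sketch in plain Python. It is not the code from my notebook: the function names and the toy columns (high_bp, diabetes, bmi) are made up for illustration, and the Cramér’s V shown has no bias correction. In practice a library such as SciPy (for example scipy.stats.chi2_contingency and scipy.stats.pointbiserialr) would likely be the better choice.

```python
from collections import Counter
from math import sqrt
from statistics import mean, pstdev


def cramers_v(x, y):
    """Cramér's V for two categorical sequences (no bias correction).

    Assumes both variables take at least two distinct values.
    """
    n = len(x)
    joint = Counter(zip(x, y))
    row_totals = Counter(x)
    col_totals = Counter(y)
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * k))


def point_biserial(binary, values):
    """Point-biserial correlation between a 0/1 variable and a continuous one."""
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    p = len(group1) / len(values)  # share of rows with binary == 1
    return (mean(group1) - mean(group0)) / pstdev(values) * sqrt(p * (1 - p))


# Hypothetical toy data, invented purely for illustration.
high_bp = [0, 0, 0, 0, 1, 1, 1, 1]
diabetes = [0, 0, 0, 1, 0, 1, 1, 1]
bmi = [22.0, 24.0, 23.0, 31.0, 25.0, 33.0, 35.0, 30.0]

v_bp_diabetes = cramers_v(high_bp, diabetes)
r_diabetes_bmi = point_biserial(diabetes, bmi)
print(f"Cramér's V (high_bp vs diabetes): {v_bp_diabetes:.2f}")
print(f"Point-biserial r (diabetes vs bmi): {r_diabetes_bmi:.2f}")
```

The idea is simply to pick the measure by the pair of variable types rather than running one coefficient over every column.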
No justification for models used.
The dataset contains 234034 rows with features which, at a glance, are dependent on one another. Are these factors to consider in model selection? Of course!
Support Vector Machine, the first algorithm I learnt to implement, has a training time complexity of O(n²) to O(n³) in the number of rows; far too slow for a dataset of this size! The assumptions of the logistic regression algorithm, another favorite, were not met, so it was not a good choice either. I did not see this thought process in many notebooks, even where feature selection was rigorous.
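A rough back-of-the-envelope sketch shows why that complexity matters here. It compares the full dataset against a hypothetical 10,000-row subsample (a size I picked only for illustration) and ignores constant factors, so it says nothing about actual runtimes on any particular machine or library.

```python
N_ROWS = 234_034      # row count quoted above
SAMPLE_ROWS = 10_000  # hypothetical subsample size, for illustration only


def relative_cost(n, baseline, exponent):
    """How many times more work n rows need than `baseline` rows under O(n**exponent)."""
    return (n / baseline) ** exponent


quadratic = relative_cost(N_ROWS, SAMPLE_ROWS, 2)
cubic = relative_cost(N_ROWS, SAMPLE_ROWS, 3)
print(f"O(n^2): about {quadratic:,.0f}x the work of the subsample")
print(f"O(n^3): about {cubic:,.0f}x the work of the subsample")
```

Even in the optimistic quadratic case, training on every row costs hundreds of times more than training on the subsample, which is why row count belongs in the model selection discussion.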
The biggest issue: inappropriate performance metrics!
This is what triggered a near-meltdown whilst comparing other notebooks with mine. After all my hard work, the macro-average F1-score of my chosen model was 0.56 (0.79 weighted average). Why were others boasting performance above 0.8? THE METRIC! For a classification problem with a severely imbalanced dataset (83–17), accuracy is a poor metric because it rewards a model for ignoring the minority class. A score of 0.83 is achievable by simply predicting “no diabetes” for the entire test set.
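The point about the 83–17 split can be checked with a small sketch. The labels below are synthetic (100 rows in the same 83–17 proportion, not the real test set), and the helper functions are written out by hand for illustration; scikit-learn's accuracy_score and f1_score would normally do this job.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1-score treating `label` as the positive class (0.0 if undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


# Synthetic labels in the 83-17 proportion: 0 = no diabetes, 1 = diabetes.
y_true = [0] * 83 + [1] * 17
# A "model" that predicts "no diabetes" for every row.
y_pred = [0] * 100

baseline_accuracy = accuracy(y_true, y_pred)
baseline_macro_f1 = macro_f1(y_true, y_pred)
print(f"accuracy: {baseline_accuracy:.2f}")
print(f"macro F1: {baseline_macro_f1:.2f}")
```

On these synthetic labels the do-nothing baseline reaches 0.83 accuracy but only about 0.45 macro F1, because it never identifies a single positive case.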
Conclusion
Justifying and questioning every method/decision used in each step of the process instills confidence in the final results and facilitates learning.
Also, no reliable predictions can be made from this data as it stands, which suggests that the dataset itself is problematic, the collection method needs to change, or the cleaning approach for most of the variables needs to be revised.
All code and additional methods can be found in the GitHub repository here.