The “Robust” Data Scientist: Winning with Messy Data and Pingouin
This article explores the craft of using robust statistics in data science workflows, showing what to do when data fail the tests for standard assumptions.
# Introduction

A harsh truth to begin with: textbook data science often falls apart in the real world. Concepts and techniques are taught on carefully curated, neatly bell-curved variables, but as soon as we venture into real projects, we run into outliers, heavily skewed distributions, and unstable variances. A previous article on building an exploratory data analysis (EDA) pipeline with Pingouin showed how to use statistical tests to detect when data violate assumptions such as homoscedasticity and normality. But what if the tests fail? Throwing the data away isn't the solution: going robust is. This article explores the craft of using robust statistics in data science workflows.
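To make the idea of "going robust" concrete, here is a minimal sketch that is not taken from the article and uses only Python's standard library rather than Pingouin. It compares the mean and standard deviation with the median and the median absolute deviation (MAD) on a small made-up sample. The data values and the `mad` helper are illustrative assumptions.

```python
import statistics


def mad(values):
    """Median absolute deviation: the median of |x - median(x)|."""
    center = statistics.median(values)
    return statistics.median([abs(x - center) for x in values])


# Hypothetical measurements, invented for illustration only.
clean = [10, 11, 12, 13, 14]
messy = clean + [500]  # the same sample plus one extreme outlier

for name, sample in (("clean", clean), ("messy", messy)):
    print(
        name,
        "mean:", round(statistics.mean(sample), 2),
        "stdev:", round(statistics.stdev(sample), 2),
        "median:", statistics.median(sample),
        "MAD:", mad(sample),
    )
```

In this toy sample, a single outlier drags the mean and standard deviation far from the bulk of the data, while the median and MAD barely move. That resistance to extreme values is the property robust methods rely on.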
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at KDnuggets.