Common data problems in machine learning

Common Data Problems in Machine Learning: Summary

Data quality and representativeness are the dominant factors behind ML performance, unexpected behavior, and deployment failures. This summary condenses the key concepts: why data issues matter, a catalog of common problems with causes/impacts/detections/mitigations, diagnostic workflows, tooling, practical checklists, and future directions.

Theoretical foundations (why data problems matter)

  • i.i.d. assumption: Many guarantees break when training and test samples are dependent or non-identically distributed.
  • Generalization & sample complexity: Insufficient or biased samples increase expected error; complexity of hypothesis class matters.
  • Bias–variance: Noisy or incorrect data raise irreducible error and estimator variance; overfitting amplifies problems.
  • Causal vs associational: Confounding and selection bias produce spurious correlations; causal thinking helps identify robust features.

Catalog of common data problems (what / why / impact / detection / mitigation)

1) Missing data. Why: sensor failures, merging, redaction. Impact: bias, lost samples, broken pipelines. Detect: null patterns, correlations with target, missingness diagnostics. Mitigate: drop (MCAR), imputation (mean/iterative/KNN/MICE), missingness indicators, domain-specific encodings.
2) Noisy/incorrect labels. Why: human error, ambiguity, shifting definitions. Impact: limits achievable accuracy, mislearned boundaries, calibration issues. Detect: annotator agreement, model–label disagreements, noise-estimation tools (e.g., Cleanlab). Mitigate: re-annotation/active learning, robust losses, probabilistic/soft labels, model-based relabeling.
3) Class imbalance. Why: natural rarity or sampling bias. Impact: poor minority performance, misleading overall accuracy. Detect: class frequency tables, per-class metrics. Mitigate: resampling (SMOTE/undersample), class weights, proper metrics (PR-AUC, per-class F1), algorithm choice.
4) Data leakage. Why: using future or target-derived information, improper preprocessing order. Impact: over-optimistic validation and production failures. Detect: large validation–production gaps, suspiciously predictive features. Mitigate: strict train/val/test separation, pipelines that fit transforms only on training data, review feature provenance.
5) Distribution shift. Types: covariate shift, prior shift, concept drift. Why/Impact: nonstationarity causes performance degradation across time or domains. Detect: monitoring, two-sample tests (KS, MMD), drift detectors. Mitigate: domain adaptation, importance weighting, continual learning, invariant/causal features, scheduled retraining.
6) Outliers & anomalies. Why: rare events, measurement error, fraud. Impact: distorts estimators, harms some models. Detect: IQR/Z-score, isolation forest, robust PCA, visual EDA. Mitigate: remove/cap, robust models, treat as separate anomaly class.
7) Duplicate & near-duplicate data. Why: scraping/aggregation errors, repeated observations. Impact: inflated metrics if duplicates cross folds, biased learning. Detect: hashing, fuzzy match, embedding clustering. Mitigate: deduplicate, group splits by entity ID.
8) Sampling & selection bias. Why: non-representative collection (surveys, logs). Impact: poor generalization, fairness issues. Detect: compare sample vs population demographics, domain knowledge. Mitigate: reweighting (IPW), stratified/targeted collection, causal adjustments.
9) High dimensionality. Why: many features vs samples; sparse signal. Impact: overfitting, high variance, compute cost. Mitigate: feature selection, regularization (L1/elastic net), dimensionality reduction (PCA, autoencoders).
10) Multicollinearity & redundant features. Impact: unstable coefficients, reduced interpretability. Detect: correlation matrices, VIF. Mitigate: drop/combine features, use regularized models.
11) Inconsistent formatting & preprocessing errors. Why: mixed units, encoding inconsistencies, parsing bugs. Impact: silent garbage-in results. Detect/Mitigate: unit checks, validation rules, standardize formats, use data validation frameworks.
12) Temporal & spatial dependencies. Why: autocorrelation, seasonality, location effects. Impact: leakage if split randomly, nonstationary performance. Detect: ACF/PACF, rolling evaluation. Mitigate: time-aware splits (walk-forward/TimeSeriesSplit), careful lagging of features.
13) Batch effects & site/vendor effects. Why: processing or vendor differences (genomics, imaging). Impact: confounding correlated with label. Detect: PCA colored by batch, ANOVA. Mitigate: batch correction (ComBat), randomized batch allocation.
14) Measurement error & sensor drift. Impact: biased features and degradation over time. Mitigate: calibration, redundancy, drift detection and recalibration, robust estimators.
15) Privacy constraints & small data. Impact: limited data access and feature availability. Mitigate: federated learning, synthetic data, differential privacy, transfer learning.
16) Adversarial & poisoning attacks. Impact: targeted degradation or security failures. Mitigate: provenance checks, adversarial training, anomaly detection on training data.

Detection techniques & diagnostic workflow

  • Start with a data inventory and provenance (sources, collection methods, update cadence).
  • Run summary statistics: distributions, missingness matrices, cardinalities.
  • Visual EDA: histograms, boxplots, correlation heatmaps, PCA/t-SNE colored by label or batch.
  • Split-aware checks: compare train/validation/test distributions, group-aware splits.
  • Label analysis: annotator agreement, confusion matrices, high-loss samples.
  • Time-aware diagnostics: rolling metrics, drift detectors, two-sample tests (KS, MMD).
  • Automate checks with tools (Great Expectations, TFDV, Evidently).

Mitigation strategies: practical patterns

  • Improve data (collect more, targeted re-labeling, better coverage) where possible; this is often the highest-ROI step.
  • Use robust modeling: regularization, robust loss functions, tree-based models for outliers.
  • Resampling or weighting for imbalance and selection bias (SMOTE, class weights, IPW).
  • Domain adaptation and importance weighting for covariate shift; continual learning for drift.
  • Operationalize pipelines to avoid leakage (transform inside training folds, version data and transforms).

Tooling & best practices

  • Validation & monitoring: Great Expectations, Deequ, TensorFlow Data Validation, Evidently.
  • Labeling: Cleanlab, Prodigy, Label Studio.
  • Drift & observability: Evidently, WhyLabs, Fiddler, Seldon Analytics.
  • Pipeline & data versioning: TFX, Kubeflow, MLflow, DVC.
  • Practices: dataset/versioning, data contracts, continuous drift monitoring, canary/holdout datasets, group-aware splits.

Representative case studies

  • Healthcare: inter-reader label noise, small datasets, cross-hospital domain shift → consensus labels, transfer learning, harmonization, federated learning.
  • Fraud detection: extreme imbalance, adversarial behavior, nonstationarity → anomaly detection, cost-sensitive learning, frequent updates.
  • NLP sentiment: label ambiguity, duplicates, domain shift → active re-labeling, deduplication, in-domain fine-tuning.
  • IoT sensors: missing data and sensor drift → temporal imputation, redundancy, drift detection and recalibration.

Current trends & research frontiers

  • Data-centric AI: prioritizing dataset improvement over model tinkering.
  • Automation: integrated automated validation and drift detection in MLOps.
  • Privacy & synthetic data: differential privacy, synthetic datasets with utility–privacy trade-offs.
  • Research: theoretical understanding of label noise in deep nets, provable domain adaptation, explainable dataset diagnostics, robustness in federated settings.

Practical checklist before training

  • Data inventory: sources, timestamps, ownership, update cadence.
  • Missingness: proportions, patterns; consider MCAR/MAR/MNAR.
  • Label quality: annotator agreement, classifier–label discordance.
  • Duplicates: remove or group by entity when splitting.
  • Distribution checks: feature-wise comparisons across splits; two-sample tests.
  • Temporal checks: ensure no future leakage; use time-aware validation.
  • Outliers: visualize and decide remove/transform/model separately.
  • Class balance & metrics: pick appropriate evaluation metrics and sampling strategies.
  • Provenance & lineage: track transformations and dataset versions.
  • Monitoring plan: deploy drift detection and retraining triggers.

Conclusion

Systematic attention to data issues (missing values, label noise, leakage, shift, imbalance, and others) is essential for reliable, fair, and robust ML. Best practice is data-centric: measure and improve datasets, instrument continuous monitoring, and integrate domain expertise into data curation and validation.

Common Data Problems in Machine Learning — A Deep Dive

Machine learning (ML) models are only as good as the data they are trained on. Data issues are a leading cause of poor model performance, unexpected behavior, and deployment failures. This article gives a practical, theoretically informed overview of the most common data problems in machine learning: how they arise, how to detect them, how to fix or work around them, and the trade-offs, tools, and future directions involved.

Contents

  • Introduction and historical context
  • Theoretical foundations (i.i.d., generalization, bias-variance, sample complexity)
  • Catalog of common data problems
  • Missing data
  • Noisy labels / label errors
  • Class imbalance
  • Data leakage
  • Distribution shift: covariate shift, prior shift, concept drift
  • Outliers and anomalies
  • Duplicate and near-duplicate data
  • Sampling bias and selection bias
  • High dimensionality and curse of dimensionality
  • Multicollinearity and redundant features
  • Inconsistent formatting and feature engineering errors
  • Temporal and spatial dependencies (time-series-specific issues)
  • Batch effects and vendor/site effects
  • Measurement error and sensor drift
  • Privacy constraints and small data
  • Adversarial and poisoned data
  • Detection techniques and diagnostic workflows
  • Mitigation strategies and code examples
  • Tooling and best practices (data-centric ML, MLOps)
  • Case studies and examples
  • Future directions and research frontiers
  • Practical checklist
  • Conclusion and references

Introduction and historical context

Early statistical modeling and classical econometrics emphasized careful data collection and diagnostic testing (residual analysis, specification tests). As ML scaled to large datasets, the emphasis shifted to model complexity and architecture (deep networks, ensembles), sometimes deemphasizing the quality and representativeness of input data. In recent years, the community has moved back toward a data-centric view: practitioners recognize that improving data quality, labels, and coverage often yields bigger gains than tweaking models.

Prominent recent themes:

  • Data-centric AI (focus on improving datasets)
  • MLOps & continuous monitoring for data drift
  • Synthetic data, privacy-preserving training, and domain adaptation
  • Tools for data validation (Great Expectations, Deequ, TensorFlow Data Validation)

Theoretical foundations

Understanding why data problems matter requires a few theoretical ideas.

  • i.i.d. assumption: Many models assume training and test samples are independent and identically distributed. Violations (distribution shift, dependence) break guarantees.
  • Generalization: Learning theory (VC dimension, Rademacher complexity) shows that sample complexity depends on hypothesis class complexity and data distribution. Biased or insufficient data increase expected generalization error.
  • Bias–variance tradeoff: Noisy or mislabeled data increase irreducible error and can increase the variance of estimators. Overfitting amplifies the impact of poor data.
  • Causal vs. associational modeling: Confounders and selection bias can produce spurious correlations. Causal frameworks (do-calculus, potential outcomes) help identify when associations will not generalize under interventions.

Practical consequence: Many mitigation strategies rely on either improving data (collect more, re-label, balance) or on methods robust to data issues (regularization, robust loss, domain adaptation).


Catalog of common data problems

Below we detail each problem with causes, impacts, detection methods, and mitigation strategies.

1) Missing data

  • What: Features or labels are missing (NaN, empty strings).
  • Why: Sensor failures, privacy redaction, merging heterogeneous sources, annotation omissions.
  • Types:
  • MCAR (Missing Completely At Random): missingness independent of observed/unobserved data.
  • MAR (Missing At Random): missingness depends only on observed variables.
  • MNAR (Missing Not At Random): missingness depends on the unobserved value itself.
  • Impact: Bias if handled incorrectly; loss of sample size; broken pipelines.
  • Detection: Summary of null counts, patterns, correlation between missingness and Y.
  • Mitigation:
  • Simple: drop rows/columns (only if the data loss is small and the missingness is MCAR).
  • Imputation: mean/median/mode, KNN, iterative imputation (MICE), model-based imputation.
  • Feature engineering: add indicators for missingness.
  • Domain-specific: treat missing as informative category.
  • Code snippet (pandas + IterativeImputer):

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

df = pd.read_csv("data.csv")
num_cols = df.select_dtypes(include="number").columns

# Record the indicator before imputing; afterwards isna() would be all False
df["was_missing_feat1"] = df["feat1"].isna().astype(int)

imp = IterativeImputer(random_state=0)
df[num_cols] = imp.fit_transform(df[num_cols])
```
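The detection steps listed above (null counts, missingness patterns, correlation between missingness and Y) can be sketched in a few lines of pandas. This is a minimal illustration on synthetic data: the column names `age`, `income`, and `target` and the missingness rates are assumptions made up for the example, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "income" is missing more often when target == 1
rng = np.random.default_rng(0)
n = 1000
target = rng.integers(0, 2, size=n)
age = rng.normal(40.0, 12.0, size=n)
income = rng.normal(50.0, 10.0, size=n)

# Drop income with probability 0.40 for positives and 0.05 for negatives
drop = rng.random(n) < np.where(target == 1, 0.40, 0.05)
income[drop] = np.nan
df = pd.DataFrame({"age": age, "income": income, "target": target})

# 1) Fraction of missing values per column
missing_rate = df.isna().mean()

# 2) Does missingness relate to the target?
income_missing = df["income"].isna().astype(int)
corr_with_target = income_missing.corr(df["target"])
rate_by_class = income_missing.groupby(df["target"]).mean()

print(missing_rate)
print(f"corr(missing, target) = {corr_with_target:.2f}")
print(rate_by_class)
```

A clearly non-zero correlation between the indicator and the target suggests the values are not MCAR, in which case dropping rows is risky and keeping a missingness indicator is usually worth considering.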

2) Noisy or incorrect labels

  • What: Wrong class labels or regression targets, low inter-annotator agreement, systematic labeling errors.
  • Why: Human errors, ambiguous items, label distribution shift over time, incorrect mappings.
  • Impact: Limits upper bound on accuracy; models can learn wrong decision boundaries; calibration suffers.
  • Detection:
  • Confusion across annotators (kappa), label entropy.
  • Find instances where model strongly disagrees with label.
  • Use agreement heuristics and cleanlab to estimate label noise.
  • Mitigation:
  • Re-annotate targeted samples (active learning).
  • Use noise-tolerant training objectives (e.g., label smoothing as a regularizer, noise-robust losses).
  • Probabilistic / soft labels reflecting annotator disagreement.
  • Model-based methods: iterative relabeling, meta-learning for label noise.
  • Tools: Cleanlab, CrowdTruth.
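As a rough sketch of two of the detection ideas above (annotator agreement via kappa, and flagging instances where the model strongly disagrees with the label), the following uses only scikit-learn on synthetic data. The dataset, the 5% flip rate, and the use of the clean labels as a stand-in for a second annotator are all assumptions for illustration; dedicated tools such as Cleanlab implement more careful versions of this idea.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_predict

# Synthetic, well-separated two-class data (assumption for the demo)
X, y_true = make_blobs(
    n_samples=600, centers=[[-3.0, -3.0], [3.0, 3.0]],
    cluster_std=1.0, random_state=0,
)

# Simulate annotation errors by flipping 30 labels (5%)
rng = np.random.default_rng(0)
flipped = rng.choice(len(y_true), size=30, replace=False)
y_noisy = y_true.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Out-of-fold probabilities, so no sample is scored by a model that saw it
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y_noisy, cv=5, method="predict_proba"
)

# Probability the model assigns to each sample's *given* label
p_given = proba[np.arange(len(y_noisy)), y_noisy]

# The lowest-confidence samples are candidates for re-annotation
suspects = np.argsort(p_given)[:30]
hit_rate = np.isin(suspects, flipped).mean()

# Agreement between two label sets (clean labels play the second annotator)
kappa = cohen_kappa_score(y_true, y_noisy)
print(f"share of suspects that were truly flipped: {hit_rate:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

On real data the suspects are only candidates: hard or ambiguous examples also get low confidence, so the list is best used to prioritize human review rather than to relabel automatically.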

3) Class imbalance

  • What: One or a few classes are much rarer than others.
  • Why: Natural rarity (fraud, disease), sampling procedure.
  • Impact: Poor minority class performance; accuracy paradox (high accuracy by predicting majority).
  • Detection: Class frequency table, per-class metrics (precision, recall, F1).
  • Mitigation:
  • Resampling: undersampling majority, oversampling minority (SMOTE, ADASYN).
  • Cost-sensitive learning: class weights in loss.
  • Use proper metrics: PR-AUC, per-class recall, F1; ROC-AUC is useful but can look optimistic when the positive class is very rare.
  • Algorithm choices: tree-based models, anomaly detection for extreme imbalance.
  • Code example (SMOTE with imbalanced-learn):

```python
from imblearn.over_sampling import SMOTE

# Resample the training split only; never the validation or test data
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```
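Cost-sensitive learning and imbalance-aware metrics, both mentioned above, can be illustrated without resampling. The sketch below is one possible setup, assuming a synthetic dataset with roughly 3% positives; it contrasts a majority-class baseline (the accuracy paradox) with plain and class-weighted logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption for the demo): about 3% positives
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Accuracy paradox: always predicting the majority class looks accurate
baseline_pred = np.zeros_like(y_test)
baseline_acc = accuracy_score(y_test, baseline_pred)
baseline_recall = recall_score(y_test, baseline_pred)

# Plain vs. cost-sensitive (class-weighted) logistic regression
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
pr_auc_weighted = average_precision_score(
    y_test, weighted.predict_proba(X_test)[:, 1]
)

print(f"baseline accuracy={baseline_acc:.3f}, minority recall={baseline_recall:.3f}")
print(f"minority recall: plain={recall_plain:.3f}, weighted={recall_weighted:.3f}")
print(f"PR-AUC (weighted model)={pr_auc_weighted:.3f}")
```

Class weighting typically trades precision for recall on the minority class, so the right setting depends on the relative cost of false positives and false negatives in the application.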

4) Data leakage

  • What: Information from outside the training dataset, or from the future, leaks into training, so the model sees information it would not have at inference time.
  • Examples: including target-derived features, improper scaling before split, using future timestamps in training.
  • Why: Incorrect pipeline ordering, naive feature engineering, using aggregated labels.
  • Impact: Over-optimistic validation, catastrophic failure in production.
  • Detection: Unrealistic performance gap between validation and test/production; features with extremely high importance that are suspicious.
  • Mitigation:
  • Strict separation of training/validation/test datasets.
  • Use pipeline primitives (scikit-learn Pipeline) so transformations are fit only on training folds.
  • Review feature provenance; remove features that are temporally downstream of the label.
  • Code example (correct pipeline):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

clf_pipeline = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Safe: imputer and scaler statistics are learned from the training data only
clf_pipeline.fit(X_train, y_train)
```
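To see how much an improperly ordered preprocessing step can inflate validation scores, here is a small experiment on pure noise. It swaps in supervised feature selection rather than scaling, because the effect is much easier to observe; the data shapes and `k=20` are arbitrary choices for the demonstration. Since the labels carry no signal, honest cross-validated accuracy should sit near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: 100 samples, 5000 features, labels unrelated to the features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))
y = np.repeat([0, 1], 50)

# Leaky: features are selected using ALL labels, then cross-validated
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Honest: selection lives inside the pipeline and is refit on each training fold
honest_pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_score = cross_val_score(honest_pipeline, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky_score:.2f}")
print(f"honest CV accuracy: {honest_score:.2f}")
```

Passing the whole pipeline to `cross_val_score` is what keeps the held-out fold out of the selection step; the same pattern applies to imputers, scalers, and encoders.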

5) Distribution shift (covariate shift, prior shift, concept drift)

  • What: Training and test distributions differ.
  • Covariate shift: P(X) changes, P(Y|X) stable.
  • Prior/class-probability shift: P(Y) changes, P(X|Y) stable.
  • Concept drift: P(Y|X) changes (most pernicious).
  • Why: Nonstationary environments (finance, user behavior), different data collection.
  • Impact: Model performance degrades over time or on new domains.
  • Detection: Performance monitoring, two-sample tests (KS, MMD), drift detection algorithms.
  • Mitigation:
  • Domain adaptation (importance weighting for covariate shift).
  • Continual learning and online updating.
  • Retraining schedules, monitoring, A/B tests.
  • Use invariant features or causal features.
  • Importance weighting example (simple):

```python
# Fit propensity model p(train|x) to reweight
# More advanced: kernel mean matching, domain adversarial networks
```
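A minimal sketch of the propensity-based reweighting idea, preceded by a two-sample KS test to confirm that a shift exists. Everything here is an illustrative assumption: the one-dimensional Gaussian data, the variable names, and the choice of logistic regression as the domain classifier. The classifier estimates p(target|x) = 1 - p(train|x), and the odds ratio approximates the density ratio used as the importance weight.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression

# Synthetic covariate shift (assumption): feature mean moves from 0 to 1
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(2000, 1))
X_target = rng.normal(loc=1.0, scale=1.0, size=(2000, 1))

# 1) Detect the shift with a two-sample Kolmogorov-Smirnov test
ks_stat, p_value = ks_2samp(X_train[:, 0], X_target[:, 0])

# 2) Domain classifier: label 0 = training sample, 1 = target sample
X_all = np.vstack([X_train, X_target])
domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_target))])
domain_clf = LogisticRegression().fit(X_all, domain)

# 3) Density-ratio weights for training samples:
#    w(x) ~ p(target|x) / p(train|x), corrected for the sample-size ratio
p_target = domain_clf.predict_proba(X_train)[:, 1]
weights = p_target / (1.0 - p_target) * (len(X_train) / len(X_target))
weights /= weights.mean()  # normalize to mean 1 for numerical convenience

# Sanity check: the weighted training mean should move toward the target mean
unweighted_mean = X_train[:, 0].mean()
weighted_mean = np.average(X_train[:, 0], weights=weights)
print(f"KS statistic={ks_stat:.3f}, p-value={p_value:.2e}")
print(f"train mean={unweighted_mean:.2f}, reweighted mean={weighted_mean:.2f}")
```

The resulting `weights` can be passed as `sample_weight` to the `fit` method of most scikit-learn estimators. Weights become unstable when the two distributions barely overlap, so clipping large weights is a common safeguard.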

6) Outliers and anomalies

  • What: Extreme observations that do not conform to expected patterns.
  • Why: Rare events, measurement errors, fraud.
  • Impact: Distorts model estimates (esp. mean-based), harms some algorithms.
  • Detection: Boxplots, z-score, IQR rule, isolation forest, robust PCA.
  • Mitigation:
  • Remove or cap (winsorize) if erroneous.
  • Robust models (quantile regression, tree-based models).
  • Treat as separate class (anomaly detection).
  • Code example (IsolationForest):

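```python
from sklearn.ensemble import IsolationForest

# contamination is the expected share of outliers; tune it per dataset
iso = IsolationForest(contamination=0.01, random_state=0)
outlier_labels = iso.fit_predict(X)  # -1 = outlier, 1 = inlier
```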
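For a single numeric feature, the IQR rule and capping mentioned above need nothing beyond NumPy. This sketch uses made-up values with two injected extremes; the 1.5 multiplier is the conventional choice, not a requirement.

```python
import numpy as np

# Hypothetical measurements around 100, plus two injected extreme values
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(100.0, 5.0, size=500), [250.0, -40.0]])

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Capping: clip extremes to the fences instead of dropping the rows
capped = np.clip(values, lower, upper)

print(f"fences: [{lower:.1f}, {upper:.1f}], flagged: {is_outlier.sum()}")
print(f"max before={values.max():.1f}, after={capped.max():.1f}")
```

Capping keeps the sample size intact, which matters when the flagged rows carry valid values in other columns; removal is more appropriate when the whole record is known to be erroneous.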

7) Duplicate and near-duplicate data

  • What: Duplicate rows or very similar items (e.g., scraping duplicates).
  • Why: Aggregation errors, data scraping, multiple observations of same entity.
  • Impact: Inflated evaluation metrics if duplicates appear across train/test splits; biased learning.
  • Detection: Hashing, fuzzy matching, clustering on embeddings.
  • Mitigation:
  • Remove duplicates.
  • When splitting, group by entity id to keep duplicates in same fold.
  • Code snippet:

```python
# Drop exact duplicates on the identifying columns, keeping the first occurrence
df.drop_duplicates(subset=["text", "user_id"], inplace=True)
```
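Exact-match deduplication misses items that differ only in case, punctuation, or spacing. One way to combine hashing-based detection with the group-aware splitting described above is sketched below. The toy DataFrame, the `normalize` helper, and the `dup_key` column are hypothetical names introduced for illustration; looser near-duplicates would need fuzzy matching or embedding similarity instead of a hash.

```python
import hashlib

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy corpus containing trivial near-duplicates
df = pd.DataFrame({
    "text": [
        "Great product!", "great product", "Terrible.", "terrible",
        "Okay I guess", "Fast shipping", "fast  shipping!", "Would buy again",
    ],
    "label": [1, 1, 0, 0, 1, 1, 1, 1],
})


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


# Hash of the normalized text acts as a duplicate-group key
df["dup_key"] = df["text"].map(
    lambda t: hashlib.md5(normalize(t).encode("utf-8")).hexdigest()
)

# Group-aware split: all rows sharing a key land on the same side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["dup_key"]))

train_keys = set(df.iloc[train_idx]["dup_key"])
test_keys = set(df.iloc[test_idx]["dup_key"])
print(f"unique groups: {df['dup_key'].nunique()}")
print(f"groups shared across splits: {len(train_keys & test_keys)}")
```

The same `groups` argument works with `GroupKFold` for cross-validation, and an entity identifier such as a user or patient ID can serve as the group key when one is available.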

8) Sampling bias and selection bias

  • What: Training data is not representative of the target population due to how it was collected.
  • Examples: Survey non-response, web logs only for active users.
  • Impact: Models that do not generalize, fairness issues.
  • Detection: Compare sample demographics to population, domain knowledge.
  • Mitigation:
  • Reweighting (inverse probability weighting), stratified sampling, targeted ...
