Title: How to Prepare Data for AI Models — A Comprehensive Guide
Table of contents
- Introduction: Why data preparation matters
- Brief history and shift to data-centric AI
- Theoretical foundations: statistics, information theory, and causality
- The data lifecycle: from problem definition to monitoring
- Step-by-step data preparation workflow
- 1. Define objectives and success metrics
- 2. Data collection and ingestion
- 3. Data storage, formats, and metadata
- 4. Exploratory data analysis (EDA)
- 5. Data cleaning
- 6. Labeling and annotation
- 7. Handling class imbalance and rare events
- 8. Feature engineering and representation
- 9. Data augmentation and synthetic data
- 10. Data splitting, leakage prevention, and cross-validation
- 11. Scaling, normalization, encoding
- 12. Outliers and missing values
- 13. Creating reproducible pipelines
- Tooling and libraries
- Domain-specific examples and code snippets
- Tabular data (pandas + sklearn pipeline)
- Image data (PyTorch / torchvision / albumentations)
- Text data (Hugging Face + tokenization)
- Time-series data
- Quality assurance, testing, and evaluation
- Governance, privacy, ethics, and regulation
- Deployment and monitoring considerations
- Advanced/modern approaches
- Active learning, weak supervision, data programming
- Synthetic data, generative models, domain adaptation
- Data-centric ML and automation
- Checklist: Practical best practices
- Future directions and implications
- References and further reading
Introduction: Why data preparation matters
Data is the foundation of every successful AI model. High-quality, well-prepared data often matters more than marginal tweaks to model architecture or training hyperparameters. Clean, representative, and well-documented data reduces bias, improves generalization, and speeds iteration. Preparing data is not a one-time activity but an ongoing discipline that spans collection, curation, validation, documentation, and monitoring.
Brief history and shift to data-centric AI
Historically, AI research focused heavily on modeling: inventing better architectures, optimizers, and loss functions. Over time, the returns on architecture alone diminished, especially for practical applications. The industry has seen a pronounced shift toward data-centric AI: improving datasets, labels, and preprocessing to yield better models with less effort. Pioneers like Andrew Ng advocate “fix the data” as a priority: accurate, consistent labels and diverse, high-quality examples can often outperform complex model changes.
Theoretical foundations: statistics, information theory, and causality
Data preparation is guided by principles from statistics, information theory, and causal inference:
- Sampling and representativeness: ensure training data reflects the real-world distribution to avoid sampling bias.
- Bias-variance tradeoff: data augmentation, feature selection, and model complexity interact to control overfitting/underfitting.
- Information content: feature selection, encoding, and transformations aim to maximize signal and reduce noise.
- Causality: distinguishing correlation from causation helps avoid spurious predictors that break under distribution shifts.
The data lifecycle: from problem definition to monitoring
A typical data lifecycle for AI projects:
- Define the problem and metrics
- Acquire and ingest raw data
- Annotate and label
- Clean, explore, and preprocess
- Split and create training/validation/test sets
- Train and evaluate models
- Deploy and monitor in production
- Continuously collect feedback, retrain, and update data
Step-by-step data preparation workflow
1. Define objectives and success metrics
- Determine the prediction target, available inputs, tolerable latency, and success criteria (e.g., F1-score, AUC, accuracy on key segments).
- Identify deployment constraints (on-device vs. server), privacy requirements, and fairness goals.
- Determine the minimum viable dataset size and a strategy for incremental data collection.
2. Data collection and ingestion
- Sources: sensors, databases, logs, APIs, third-party datasets, web scraping, public datasets.
- Ensure legal/commercial rights for data usage.
- Raw data capture considerations: timestamps, provenance, unique IDs, and versioned snapshots.
- Ingest into centralized storage (data lake, SQL/NoSQL databases, object storage) with consistent schemas and metadata.
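The capture considerations above (timestamps, provenance, unique IDs) can be sketched as a thin wrapper applied at ingestion time. This is a minimal illustration using only the standard library; the function name `wrap_with_provenance`, the field names, and the example source URI are assumptions made for the example, not part of any specific ingestion framework.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def wrap_with_provenance(payload, source):
    """Attach capture metadata to one raw record without altering the payload."""
    # Canonical JSON (sorted keys) so the same payload always hashes identically.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return {
        "record_id": str(uuid.uuid4()),                          # unique ID
        "ingested_at": datetime.now(timezone.utc).isoformat(),   # UTC timestamp
        "source": source,                                        # where it came from
        "content_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "payload": payload,                                      # raw data, unchanged
    }


# Hypothetical sensor reading and source URI, for illustration only.
record = wrap_with_provenance(
    {"device": "sensor-7", "temp_c": 21.4}, source="mqtt://plant-a"
)
```

Storing the content hash alongside the payload makes later duplicate detection cheap, since two records with the same hash carry the same data even if they arrived through different sources.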
3. Data storage, formats, and metadata
- Recommended formats: Parquet/ORC/Feather for tabular, TFRecord for TensorFlow ecosystems, plain CSV for small tasks, JPEG/PNG for images with metadata in CSV/JSON, JSONL for text entries.
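- Use schemas (Avro, Parquet schema) to enforce structure and a metadata catalog to track lineage and feature definitions; table formats such as Delta Lake add a transaction log and table history on top of the stored files.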
- Maintain dataset versions: commit snapshots or use systems like DVC, Quilt, Delta Lake, or LakeFS.
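Tools such as DVC or LakeFS handle versioning at scale, but the underlying idea of a content-addressed snapshot can be sketched in a few lines. The following is a simplified illustration, not how any of those tools works internally; `dataset_fingerprint` and the toy rows are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a short, order-independent hash identifying a dataset snapshot."""
    # Hash each row in canonical form, then sort the row hashes so that
    # reordering rows does not change the fingerprint.
    row_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode("utf-8")).hexdigest()
        for r in rows
    )
    digest = hashlib.sha256("\n".join(row_hashes).encode("utf-8"))
    return digest.hexdigest()[:12]


v1 = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]
v2 = [{"id": 1, "label": "cat"}, {"id": 2, "label": "cat"}]  # one label changed

# A manifest like this can be committed next to the training code.
manifest = {"v1": dataset_fingerprint(v1), "v2": dataset_fingerprint(v2)}
```

Recording the fingerprint with every training run makes it possible to tell later exactly which snapshot a model was trained on, even when a single label was edited.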
4. Exploratory data analysis (EDA)
- Summary statistics: mean, median, variance, distribution histograms.
- Visualizations: class distribution, feature correlations, pairplots, time-series plots.
- Check distributions across subgroups (time, geography, user segments).
- Identify suspicious patterns, drift, or missingness.
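A first EDA pass often reduces to per-subgroup summaries and missingness rates. The sketch below uses only the standard library on a hypothetical list of records; in practice the same summary is usually produced with a pandas `groupby`.

```python
import statistics
from collections import defaultdict

# Hypothetical records; None marks a missing value.
rows = [
    {"region": "north", "income": 52.0},
    {"region": "north", "income": 48.0},
    {"region": "north", "income": None},
    {"region": "south", "income": 31.0},
    {"region": "south", "income": 35.0},
    {"region": "south", "income": 90.0},
]


def summarize_by_group(rows, group_key, value_key):
    """Per-group count, missing rate, mean, and median of one numeric column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    summary = {}
    for name, values in groups.items():
        present = [v for v in values if v is not None]
        summary[name] = {
            "n": len(values),
            "missing_rate": 1 - len(present) / len(values),
            "mean": statistics.mean(present) if present else None,
            "median": statistics.median(present) if present else None,
        }
    return summary


summary = summarize_by_group(rows, "region", "income")
```

A large gap between mean and median within a group (as in the "south" rows here) hints at skew or outliers, and a missing rate that differs across groups suggests the data is not missing at random.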
5. Data cleaning
- Remove duplicate entries and resolve conflicting records using source priority rules.
- Unify formats: timestamps, units, currencies, text normalization.
- Fix obvious errors (e.g., impossible ages), but document edits and keep raw copies.
- Standardize categorical values and normalize free-text fields.
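The cleaning steps above can be illustrated with a small sketch that normalizes categorical values, unifies units, and resolves duplicates by source priority. The source names, priority order, alias table, and field names are all hypothetical and would come from your own data contracts.

```python
# Lower number = more trusted source (hypothetical priority rule).
SOURCE_PRIORITY = {"crm": 0, "web_form": 1, "legacy_import": 2}

# Hypothetical mapping from observed spellings to a canonical value.
COUNTRY_ALIASES = {"usa": "US", "u.s.": "US", "us": "US"}

raw = [
    {"id": "u1", "source": "legacy_import", "country": " usa ", "height": 180, "unit": "cm"},
    {"id": "u1", "source": "crm", "country": "US", "height": 1.80, "unit": "m"},
    {"id": "u2", "source": "web_form", "country": "U.S.", "height": 165, "unit": "cm"},
]


def clean(record):
    """Return a cleaned copy; the raw record is left untouched."""
    out = dict(record)
    key = out["country"].strip().lower()
    out["country"] = COUNTRY_ALIASES.get(key, out["country"].strip())
    if out["unit"] == "m":  # unify heights to centimetres
        out["height"] = out["height"] * 100
        out["unit"] = "cm"
    out["height"] = round(out["height"], 1)
    return out


def deduplicate(records):
    """Keep one record per id, preferring the most trusted source."""
    best = {}
    for rec in records:
        current = best.get(rec["id"])
        if current is None or (
            SOURCE_PRIORITY[rec["source"]] < SOURCE_PRIORITY[current["source"]]
        ):
            best[rec["id"]] = rec
    return list(best.values())


cleaned = deduplicate([clean(r) for r in raw])
```

Because `clean` works on a copy, the `raw` list stays available as the unedited record of what was received, which matches the advice to document edits and keep raw copies.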
6. Labeling and annotation
- Labeling formats: categorical labels, bounding boxes, segmentation masks, language annotations, entity tags.
- Use annotation tools: Labelbox, Supervisely, CVAT, Prodigy, Amazon SageMaker Ground Truth, Label Studio.
- Define clear labeling guidelines, edge cases, examples, and quality checks.
- Measure inter-annotator agreement (Cohen’s kappa for two annotators, Fleiss’ kappa for more than two) and resolve disagreements through adjudication or guideline updates.
- Strategies: in-house annotators, crowdsourcing, expert labeling, or semi-automated labeling.
- Consider multi-label/soft labels for ambiguity; capture label confidence.
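Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from two label lists using only the standard library; the annotator data is made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    classes = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on ten items.
annotator_1 = ["pos"] * 5 + ["neg"] * 5
annotator_2 = ["pos"] * 4 + ["neg"] * 5 + ["pos"]

kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 8 of 10 items while chance agreement is 0.5, so kappa comes out at roughly 0.6. Raw percent agreement (0.8) would overstate how consistent the labels are.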
7. Handling class imbalance and rare events
- Resampling: undersample the majority class or oversample the minority class (SMOTE, ADASYN); resample only the training split, never validation or test data.
- Cost-sensitive learning: weighted loss functions.
- Data augmentation focused on minority classes.
- Generate synthetic examples where appropriate using generative models.
- Use metrics suited to imbalanced data (precision, recall, F1, PR AUC) rather than accuracy; ROC AUC can look optimistic when positives are very rare.
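Two of the points above can be made concrete with a small sketch: why accuracy misleads on imbalanced data, and how inverse-frequency class weights for cost-sensitive learning are derived. The weight formula used here, n_samples / (n_classes * class_count), is the same heuristic scikit-learn applies for `class_weight="balanced"`; the helper names and the toy labels are assumptions for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: 5% positives, and a model that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
weights = balanced_class_weights(y_true)
```

The always-negative model reaches 95% accuracy yet has zero recall and zero F1. The computed weights give each positive example about 19 times the weight of a negative one, which is what a weighted loss would use to counter the imbalance.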
8. Feature engineering and representation
- Domain-driven features: ratios, aggregations, time-based features (e.g., rolling means), categorical grouping.
- Feature selection: univariate tests, mutual information, recursive feature elimination, L1 regularization.
- Encoding categorical variables: one-hot, ordinal, target encoding (careful with leakage), embedding layers.
- Interaction features and polynomial terms when appropriate.
- Dimensionality reduction: PCA and truncated SVD for modeling; t-SNE and UMAP mainly for exploration and visualization.
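The leakage warning attached to target encoding can be illustrated with an out-of-fold variant: each row is encoded with the target mean of its category computed only from rows in other folds, so a row's own label never influences its feature. This is a minimal sketch; the round-robin fold assignment and the function name are simplifying assumptions, and real pipelines usually shuffle before folding and add smoothing.

```python
from collections import defaultdict


def out_of_fold_target_encode(categories, targets, n_folds=3):
    """Encode each row with its category's target mean from the other folds."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total, seen = 0.0, 0
        # Accumulate statistics from every row NOT in the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                seen += 1
        fallback = total / seen if seen else 0.0
        # Encode the held-out rows using only those statistics.
        for i in range(fold, n, n_folds):
            c = categories[i]
            # Categories unseen outside this fold get the out-of-fold mean.
            encoded[i] = sums[c] / counts[c] if counts[c] else fallback
    return encoded


# Hypothetical categorical feature and binary target.
categories = ["a", "a", "b", "a", "b", "c"]
targets = [1, 0, 1, 1, 0, 1]

encoded = out_of_fold_target_encode(categories, targets)
```

A naive encoding would give every "a" row the value 2/3, which includes that row's own target. The out-of-fold version assigns different values to different "a" rows depending on which fold they fall in, which is noisier but does not leak the label.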
9. Data augmentation and synthetic data
- Images: rotations, flips, crops, photometric transforms, MixUp, CutMix.
- Text: backtranslation, synonym replacement, span masking, controlled paraphrasing.
- Tabular: conditional GANs, interpolation (SMOTE), simulation models.
- Synthetic data can address privacy and scarcity but must preserve statistical properties and not introduce artifacts.
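As one concrete augmentation from the list above, MixUp forms a convex combination of two training examples and of their one-hot labels, with the mixing weight drawn from a Beta(alpha, alpha) distribution. The sketch below works on plain Python lists to stay dependency-free; the function signature and the `alpha=0.2` default are illustrative assumptions, and real implementations operate on tensors per batch.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


rng = random.Random(0)  # seeded generator for reproducibility
x_mix, y_mix, lam = mixup(
    [0.0, 1.0, 2.0], [1.0, 0.0],   # example 1: features, one-hot label
    [4.0, 3.0, 2.0], [0.0, 1.0],   # example 2: features, one-hot label
    rng=rng,
)
```

The mixed label remains a valid probability vector (its entries sum to 1), so it can be used directly with a cross-entropy loss that accepts soft targets.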
10. Data splitting, leakage prevention, and cross-validation
- Holdout splits: train / validation / test. The test set must remain untouched until final evaluation.
- Time-series: use time-based splits (no peeking into future).
- Grouped splits: ensure samples from same user/device are not in both train and test to avoid leakage.
- Cross-validation: k-fold, stratified k-fold for classification, and nested cross-validation when the same data is used both to tune hyperparameters and to estimate performance.
- Avoid data leakage via feature construction that uses future or test-set dependent information.
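Grouped and time-based splits can both be sketched with the standard library. The grouped split below hashes a group ID (for example, a user ID) so that every sample from the same group lands in the same split, and stays there across reruns; the time-based split uses a simple cutoff. The function names, the event structure, and the 20% test fraction are assumptions for illustration.

```python
import hashlib


def assign_split(group_id, test_fraction=0.2):
    """Deterministically map a group (e.g., a user ID) to 'train' or 'test'."""
    digest = hashlib.sha256(str(group_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


def time_based_split(rows, cutoff, time_key="ts"):
    """Train on rows strictly before the cutoff, test on the rest."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test


# Hypothetical event log: 50 users, 10 events each, integer timestamps.
events = [{"user": f"user-{i % 50}", "ts": i} for i in range(500)]

# Grouped split: no user appears on both sides.
train = [e for e in events if assign_split(e["user"]) == "train"]
test = [e for e in events if assign_split(e["user"]) == "test"]

# Time-based split: everything in `future` happens after everything in `past`.
past, future = time_based_split(events, cutoff=400)
```

Hash-based assignment has the side benefit that newly collected data for an existing user automatically joins that user's original split, so the test set is not contaminated as the dataset grows.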
11. Scaling, normalization, encoding
- Scale numerical features: standard scaling (zero mean, unit variance) or min-max scaling.
- Normalize per-feature or per-sample depending on model (neural networks often benefit from feature-wise scaling).
- Fit scalers on training data only and apply to validation/test.
- Pipeline transformations (scikit-learn Pipeline, TF Transform) ensure consistency.
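The rule of fitting scalers on training data only can be shown with a small dependency-free scaler. `StandardScaler1D` is a hypothetical class written for this example; it follows the same fit/transform pattern as scikit-learn's `StandardScaler`, which would normally be used instead.

```python
import statistics


class StandardScaler1D:
    """Zero-mean, unit-variance scaling for a single numeric feature."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        # Population standard deviation; fall back to 1.0 for constant features.
        self.scale_ = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]


train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [20.0, 8.0]

scaler = StandardScaler1D().fit(train)   # statistics come from training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)     # reuse the training mean and scale
```

The scaled test values fall outside the range of the scaled training values, which is expected: the test set is described in the training set's units rather than being re-centered on its own statistics, which would leak information about the test distribution.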
12. Outliers and missing values
- Detect with boxplots, z-scores, robust statistics, isolation forests.
- Imputation strategies: mean/median/mode, k-nearest neighbors, MICE (multiple imputation by chained equations), model-based imputation.
- For deep learning, consider using missing indicators and letting models learn patterns.
- Decide whether ...