What is data preprocessing in machine learning?

May 19, 2026

Abstract
Data preprocessing is the set of techniques and operations applied to raw data to make it suitable for machine learning (ML). It transforms imperfect, heterogeneous, and often noisy inputs into clean, consistent, and informative representations that models can learn from effectively. Good preprocessing improves model accuracy, robustness, and interpretability, and it makes training more stable and computationally efficient. This article provides an end-to-end survey: definitions, history, theoretical foundations, concrete techniques, code examples, best practices, pitfalls, tools, state of the art, and future directions.

Table of contents

Overview and historical context
Definitions and objectives
Why preprocessing matters (intuitions and examples)
Key preprocessing tasks and techniques
- Exploratory data analysis (EDA)
- Data cleaning
- Missing value handling
- Outlier detection and treatment
- Encoding categorical variables
- Scaling and normalization
- Transformations (log, Box–Cox, power transforms)
- Feature engineering and creation
- Feature selection
- Dimensionality reduction
- Imbalanced data handling
- Time-series-specific preprocessing
- Text preprocessing
- Image preprocessing and augmentation
Theoretical foundations (statistics, linear algebra, information theory)
Practical pipelines and code examples (scikit-learn, imbalanced-learn)
Best practices and checklist
Common pitfalls and how to avoid them
Tools and infrastructure (feature stores, pipelines, AutoML)
Current research directions and future implications
Summary and recommended reading

Overview and historical context

Early ML efforts (pre-2000s) placed heavy emphasis on feature engineering because models were relatively simple and tooling for data cleaning was sparse. As datasets grew and complex models (e.g., deep learning) matured, preprocessing remained essential but shifted: deep nets ingest raw modalities such as pixels and raw text, yet still need normalization, augmentation, tokenization, and careful handling of labels and metadata.

In modern production ML, preprocessing is a core discipline: data pipelines, feature stores, and reproducible preprocessing logic power robust systems. Research on automated preprocessing (AutoML), fairness-aware transformations, and privacy-preserving preprocessing (differential privacy) reflects ongoing evolution.

Definitions and objectives

Data preprocessing in ML: the set of operations that convert raw inputs into forms suitable for model training and inference. Objectives include:

Remove or correct faulty data (cleaning).
Convert data types/representations so models can use them (encoding, normalization).
Improve signal-to-noise ratio (filtering, outlier removal).
Reduce dimensionality and redundant features (feature selection, PCA).
Create new informative features (feature engineering).
Make datasets balanced/representative (resampling and weighting).
Ensure preprocessing is reproducible, non-leaking, and applicable in production.

Why preprocessing matters (intuition and examples)

Many algorithms are sensitive to feature scale (K-means, K-NN, SVM). Without scaling, features with larger numeric ranges dominate.
Missing values break many algorithms or lead to biased estimates.
Categorical variables must be encoded numerically; naive integer mapping can impose spurious ordinal relationships.
Unaddressed class imbalance can push models toward predicting the majority class almost always.
Irrelevant or noisy features increase variance and reduce generalization.
Feature engineering (interaction terms, date/time decomposition) can create simple signals that dramatically improve performance.

Example: Predicting house prices — adding engineered features such as "age of house", "rooms per area", or "distance to city center", or log-transforming prices, often yields much better models than feeding raw columns only.

Key preprocessing tasks and techniques

Below are the primary tasks you'll encounter, with practical notes.

1) Exploratory Data Analysis (EDA)

Summary stats (mean, median, std), distributions, missingness maps.
Visualizations: histograms, boxplots, pairplots, correlation heatmaps.
Purpose: detect anomalies, relationships, non-linearities, and modeling signals.

2) Data cleaning

Fix incorrect types (strings vs numerics).
Remove duplicates.
Standardize text (case, spacing).
Remove or annotate corrupt records.

3) Handling missing values

Strategies:
- Deletion: drop rows or columns (only safe if missingness is small and MCAR).
- Imputation: mean/median/mode, forward/backfill (time series), model-based (KNN, MICE).
- Flags: add binary indicators signaling missingness (captures informative missingness).
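The pattern of missingness is important: MCAR (missing completely at random), MAR (missing at random, i.e., explainable by other observed variables), MNAR (missing not at random, i.e., dependent on the unobserved value itself).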
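The imputation and flag strategies above can be combined in a single step. The following is a minimal sketch on a toy array (the values are illustrative only), assuming scikit-learn's SimpleImputer with add_indicator=True is an acceptable choice for the data at hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing entries (illustrative values only).
X_toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 40.0],
])

# Median imputation, plus one binary indicator column for each feature
# that contained missing values when the imputer was fitted.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X_toy)

# Columns: [feature_0, feature_1, feature_0_was_missing, feature_1_was_missing]
print(X_imputed)
```

In a real workflow the imputer would be fitted on the training split only and then applied to validation and test data.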

4) Outlier detection and treatment

Methods: z-score, IQR rule, robust methods (median absolute deviation), clustering, isolation forest.
Options: clip/truncate, transform, separate modeling of outliers, or remove if erroneous.
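As one illustration of the IQR rule and the clipping option, the sketch below flags values outside 1.5 × IQR of the quartiles and then clips them. The 1.5 multiplier is the conventional default, not a requirement, and the data are made up for the example.

```python
import numpy as np

# Illustrative measurements with one obviously extreme value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# IQR rule: anything beyond 1.5 * IQR from the quartiles is flagged.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)

# Treatment option: clip to the fences instead of dropping rows.
clipped = np.clip(values, lower, upper)

print("fences:", lower, upper)
print("outliers:", values[is_outlier])
print("clipped:", clipped)
```

Whether to clip, transform, or remove depends on whether the extreme value is an error or a genuine observation.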

5) Encoding categorical variables

One-hot / dummy encoding (suitable for low-cardinality nominal features).
Ordinal encoding (only when order is meaningful).
Target encoding / mean encoding (useful for high-cardinality but must avoid leakage via CV).
Hashing trick (scales to very high cardinality).
Embeddings (learned as part of a model, especially in deep learning).

Important: Use cross-validated or within-fold encoding to prevent target leakage when using target-based encodings.
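To make the difference between ordinal and one-hot encoding concrete, here is a small sketch on a made-up color column. It also shows how handle_unknown="ignore" treats a category that was not seen during fitting; the category names are purely illustrative.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# A nominal feature: there is no meaningful order among colors.
colors = [["red"], ["green"], ["blue"], ["green"]]

# Ordinal encoding assigns integers by sorted category name
# (blue=0, green=1, red=2), which implies an order that does not exist.
ordinal = OrdinalEncoder().fit_transform(colors)

# One-hot encoding gives each category its own binary column.
onehot_encoder = OneHotEncoder(handle_unknown="ignore")
onehot = onehot_encoder.fit_transform(colors).toarray()

# An unseen category is encoded as all zeros rather than raising an error.
unseen = onehot_encoder.transform([["purple"]]).toarray()

print(ordinal.ravel())
print(onehot)
print(unseen)
```

For a linear model, the ordinal version would treat "red" as twice "green", which is why it is reserved for features with a real order.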

6) Scaling and normalization

StandardScaler (z = (x - mean) / std): centers and scales.
MinMaxScaler (scales to [0,1]): preserves the shape of the distribution but is sensitive to outliers, which compress the remaining values into a narrow range.
RobustScaler (uses median and IQR): robust to outliers.
L2-normalization (makes feature vectors unit-norm): for text or where direction matters.
When to scale: before distance-based models and before regularized linear models. Tree-based models are largely insensitive to monotonic rescaling of individual features, so scaling is usually unnecessary for them, though some workflows still scale for consistency across mixed feature types.
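The behavior of the three scalers in the presence of an outlier can be compared directly. This is a small sketch on a single made-up feature, not a benchmark.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with four ordinary values and one extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(x)  # zero mean, unit variance
minmax = MinMaxScaler().fit_transform(x)      # squeezed into [0, 1]
robust = RobustScaler().fit_transform(x)      # (x - median) / IQR

print("standard:", standard.ravel().round(3))
print("minmax:  ", minmax.ravel().round(3))
print("robust:  ", robust.ravel().round(3))
```

With min-max scaling the four ordinary values end up at or below roughly 0.03 because the outlier defines the upper bound, whereas the robust scaler keeps them spread between -1 and 0.5.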

7) Transformations

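Log transform (reduces right skew; requires positive values, or log1p for non-negative values).
Box–Cox (strictly positive inputs) and Yeo–Johnson (also handles zero and negative values): families of power transforms.
Rank or quantile transforms (map the distribution to uniform or approximately normal).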
Goal: stabilize variance and improve linearity between features and target.
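A quick way to see what these transforms do is to measure skewness before and after. The sketch below uses synthetic log-normal data, which is an assumption chosen because it is strongly right-skewed; real features will behave differently.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (log-normal), for illustration only.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000).reshape(-1, 1)

# log1p is safe for non-negative values, including zeros.
x_log = np.log1p(x)

# Yeo-Johnson estimates a power parameter from the data and also
# standardizes the output by default.
x_yj = PowerTransformer(method="yeo-johnson").fit_transform(x)

skew_raw = skew(x.ravel())
skew_log = skew(x_log.ravel())
skew_yj = skew(x_yj.ravel())
print("skew raw / log1p / yeo-johnson:", skew_raw, skew_log, skew_yj)
```

As with scalers, a PowerTransformer should be fitted on training data only, since its parameter is estimated from the data.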

8) Feature engineering

Combining features (ratios, differences, interactions).
Domain-specific features: e.g., time-of-day cycles, moving averages in time-series, image gradients.
Automated feature generation: featuretools, Deep Feature Synthesis.

9) Feature selection

Filter methods: correlation, mutual information, univariate statistical tests.
Wrapper methods: recursive feature elimination (RFE), forward/backward selection.
Embedded methods: regularization (L1), tree-based feature importances.
Goal: reduce dimensionality, remove collinear/uninformative variables, improve generalization.
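A filter method can be placed inside a pipeline so that the selection is re-fitted within every cross-validation fold. The following sketch assumes synthetic data from make_classification and an arbitrary choice of k=5; both are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 20 features, of which 5 carry signal.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_redundant=0, random_state=0,
)

# Selection is a pipeline step, so each CV fold ranks features using
# only its own training portion.
selector_pipe = Pipeline([
    ("select", SelectKBest(score_func=mutual_info_classif, k=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

fs_scores = cross_val_score(selector_pipe, X_syn, y_syn, cv=5)
print("accuracy per fold:", fs_scores)

# Fit once on all data just to inspect which columns were kept.
selector_pipe.fit(X_syn, y_syn)
kept = selector_pipe.named_steps["select"].get_support()
print("selected feature indices:", kept.nonzero()[0])
```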

10) Dimensionality reduction

PCA (Principal Component Analysis): orthogonal linear projections maximizing variance.
SVD (used in latent semantic analysis).
t-SNE, UMAP (non-linear, for visualization).
Autoencoders (learned non-linear compressions).
Use to reduce noise and storage, or for visualization.

11) Handling imbalanced data

Resampling: undersampling the majority class, oversampling the minority class (by duplicating samples).
Synthetic sampling: SMOTE, ADASYN (create synthetic minority samples).
Cost-sensitive learning: class weights in loss functions.
Evaluation metrics: use precision, recall, F1, AUC, PR curves instead of only accuracy.
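The points about class weights and metrics can be illustrated with a made-up 90/10 label vector. The sketch computes "balanced" class weights and shows why accuracy alone is misleading for a classifier that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: 90 negatives, 10 positives.
y_imb = np.array([0] * 90 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_imb
)
print("class weights:", weights)

# A degenerate model that always predicts the majority class.
y_pred = np.zeros_like(y_imb)
acc = accuracy_score(y_imb, y_pred)
rec = recall_score(y_imb, y_pred)
print("accuracy:", acc, "minority recall:", rec)
```

The degenerate model reaches 90% accuracy while finding none of the minority samples, which is what recall and precision-recall curves expose.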

12) Time series preprocessing

Stationarity transforms: differencing, seasonal adjustment.
Lag creation, rolling windows (mean, std).
Time-based splitting for CV: avoid shuffling across time.
Handling timestamps: decompose into cyclical features (sin/cos for hour/day).
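The items above can be sketched together on a small synthetic hourly series. The column names (timestamp, value, lag_1, roll_mean_3) are hypothetical, and the window sizes are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Synthetic hourly series covering two days (illustrative values).
ts = pd.DataFrame({
    "timestamp": pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(48), unit="h"),
    "value": np.arange(48, dtype=float),
})

# Cyclical encoding: hour 23 and hour 0 end up close together.
hour = ts["timestamp"].dt.hour
ts["hour_sin"] = np.sin(2 * np.pi * hour / 24)
ts["hour_cos"] = np.cos(2 * np.pi * hour / 24)

# Lag and rolling features use only past values (note the shift),
# so no information from the current or future rows leaks in.
ts["lag_1"] = ts["value"].shift(1)
ts["roll_mean_3"] = ts["value"].shift(1).rolling(window=3).mean()
ts = ts.dropna().reset_index(drop=True)

# Time-aware splitting: every training index precedes every test index.
splits = list(TimeSeriesSplit(n_splits=3).split(ts))
for train_idx, test_idx in splits:
    print("train up to", train_idx.max(), "-> test", test_idx.min(), "to", test_idx.max())
```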

13) Text preprocessing

Tokenization, lowercasing, stop-word removal (problem-dependent).
Stemming/lemmatization, n-grams.
Vectorization: CountVectorizer, TF-IDF, word embeddings (Word2Vec, GloVe), contextual embeddings (BERT).
Handling OOV (out-of-vocab): hashing, subword tokenization.
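To illustrate the out-of-vocabulary point, the sketch below contrasts a fitted TF-IDF vocabulary, which silently drops unseen tokens, with the hashing approach, which is stateless and maps every token to a bucket. The sentences and the bucket count are made up for the example.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

train_docs = ["the cat sat", "the dog barked"]
new_doc = ["the zebra sat"]  # "zebra" never appeared in training

# TF-IDF learns a fixed vocabulary; unseen tokens are ignored at transform time.
tfidf = TfidfVectorizer()
tfidf.fit(train_docs)
tfidf_row = tfidf.transform(new_doc)

# Hashing needs no fitting: each token is hashed into one of n_features
# buckets, so unseen tokens still contribute (at the cost of possible collisions).
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False)
hashed_row = hasher.transform(new_doc)

print("vocabulary:", sorted(tfidf.vocabulary_))
print("non-zero TF-IDF entries:", tfidf_row.nnz)
print("non-zero hashed entries:", hashed_row.nnz)
```

Subword tokenization addresses the same problem differently, by splitting unseen words into known pieces.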

14) Image preprocessing and augmentation

Resizing, center-cropping, color normalization (per-channel mean/std), scaling to [0,1].
Data augmentation: flips, rotations, crops, color jitter, mixup, random erasing.
Whitening or standardization per ImageNet statistics when using pretrained models.
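A minimal NumPy sketch of the normalization steps follows, using a random array as a stand-in for a decoded RGB image. The mean and standard deviation are the commonly used ImageNet channel statistics; whether they apply depends on the pretrained model in question.

```python
import numpy as np

# Stand-in for a decoded 224x224 RGB image with uint8 pixels.
rng = np.random.default_rng(0)
image_uint8 = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Step 1: scale pixel values to [0, 1].
image = image_uint8.astype(np.float32) / 255.0

# Step 2: per-channel standardization with the widely used ImageNet statistics.
imagenet_mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
imagenet_std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
normalized = (image - imagenet_mean) / imagenet_std

# A simple augmentation: horizontal flip (reverse the width axis).
flipped = normalized[:, ::-1, :]

print(normalized.shape, normalized.dtype)
```

Augmentations such as flips are applied only to training data; validation and test images receive the deterministic steps alone.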

Theoretical foundations

Understanding the math/statistics behind preprocessing helps choose the right methods.

Statistical assumptions: Many models assume IID data, linearity, homoscedasticity, or Gaussianity. Preprocessing attempts to make data align with model assumptions (e.g., log transform for heteroscedastic errors).
Scaling and distance measures: Distance-based methods (K-NN, K-means, SVM with RBF) are sensitive to feature scale; features with larger variance dominate Euclidean distance. Standardization removes scale differences.
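PCA: PCA finds the eigenvectors of the covariance matrix (equivalently, the right singular vectors of the centered data) and projects the data onto these principal components. Each successive component captures the maximum remaining variance, and together they form an orthogonal basis.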
Regularization: L1 (Lasso) induces sparsity, L2 penalizes large weights. Preprocessing (standardization) ensures regularization penalizes features equitably.
Information-theoretic measures: Mutual information can rank features by information about the target.
Bias–variance interplay: More preprocessing and feature engineering can decrease bias (help model capture signal) but can increase variance if overfitting to idiosyncrasies of data.
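The PCA statement above can be checked numerically. The sketch below, on synthetic correlated data, compares the variances obtained from an SVD of the centered data, from the eigenvalues of the covariance matrix, and from scikit-learn's PCA; all three should agree up to floating-point tolerance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data: random features mixed by a random matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Route 1: SVD of the centered data matrix.
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
var_from_svd = S**2 / (len(X_demo) - 1)

# Route 2: eigenvalues of the sample covariance matrix, largest first.
cov = np.cov(X_centered, rowvar=False)
var_from_eig = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Route 3: scikit-learn's PCA.
pca_demo = PCA(n_components=5).fit(X_demo)

print(var_from_svd)
print(var_from_eig)
print(pca_demo.explained_variance_)

# The principal directions (rows of Vt) are orthonormal.
print(np.round(Vt @ Vt.T, 6))
```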

Practical pipelines and code examples

Below are concrete code examples using Python, scikit-learn, and imbalanced-learn. These show best practices: using Pipeline and ColumnTransformer to avoid leakage and to make transformations reproducible.

Tabular preprocessing pipeline (numeric + categorical)

Python

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Example dataset
df = pd.read_csv("train.csv")  # replace with your file
X = df.drop("target", axis=1)
y = df["target"]

numeric_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_cols),
    ("cat", categorical_transformer, categorical_cols)
])

clf = Pipeline([
    ("preproc", preprocessor),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42))
])

scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("AUC scores:", scores, "mean:", scores.mean())

Avoiding target leakage with target-encoding (use CV)

Python

# Use category_encoders with cross-validation to prevent leakage
from sklearn.model_selection import KFold
import category_encoders as ce
import numpy as np

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros(len(X))

for tr_idx, val_idx in kf.split(X):
    tr_X, tr_y = X.iloc[tr_idx], y.iloc[tr_idx]
    val_X = X.iloc[val_idx]
    te = ce.TargetEncoder(cols=["high_card_col"])
    te.fit(tr_X, tr_y)
    oof[val_idx] = te.transform(val_X)["high_card_col"].values
# Now 'oof' contains properly cross-validated target encoding

Handling imbalanced data with SMOTE in pipeline

Python

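# Reuses 'preprocessor', 'X', and 'y' from the tabular example above
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score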

pipeline = ImbPipeline([
    ("preproc", preprocessor),
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000))
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")
print("F1 scores:", scores, "mean:", scores.mean())

Text pipeline with TF-IDF and classifier

Python

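# Assumes 'df' has a free-text column "text" and a binary "label" column
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline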

text_pipeline = make_pipeline(
    TfidfVectorizer(max_features=20000, ngram_range=(1,2)),
    LogisticRegression(class_weight='balanced', max_iter=1000)
)

scores = cross_val_score(text_pipeline, df["text"], df["label"], cv=5, scoring="f1")
print("Text F1 mean:", scores.mean())

PCA for dimensionality reduction before modeling

Python

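from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Intended for all-numeric input; n_components must not exceed the
# number of input features (or training samples).
pca_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values first
    ("scaler", StandardScaler()),                   # PCA is scale-sensitive
    ("pca", PCA(n_components=20)),
    ("svc", SVC())
])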

Best practices and checklist

Always split data (train/validation/test) before any fit-based preprocessing to avoid leakage.
Prefer Pipeline and ColumnTransformer (scikit-learn) so the same transformations apply in production.
Use cross-validation for parameter selection, and ensure encoding/imputation is fit inside folds.
Log and version preprocessing code; keep transformations deterministic (set random_state where applicable).
Consider adding missing-value indicators when missingness could be informative.
Use domain knowledge: not everything should be automated—domain features often matter most.
Monitor data drift and model performance in production; adapt preprocessing when input distributions change.
For high-cardinality categoricals, consider hashing or target encoding (careful with leakage).
Use robust scalers when outliers are expected.
For time-series, use time-aware splitting and avoid shuffling.

Common pitfalls and how to avoid them

Data leakage: fitting imputers, scalers, or encoders on full dataset before splitting. Fix: use pipeline and split first.
Target leakage: using features derived from future info or from the target. Fix: feature audit, temporal splitting.
Improper handling of categorical cardinality causing explosion of features with one-hot encoding. Fix: use hashing or embedding or limit cardinality.
Overfitting in feature selection: selecting features on full dataset then testing on the same set. Fix: perform selection within each cross-validation fold.
Ignoring distribution shifts: models trained on one distribution may fail when input distribution changes. Fix: monitoring, recalibration, retraining strategies.
Inadequate evaluation metrics for imbalanced problems: accuracy is misleading. Fix: use precision/recall/F1/AUC-PR.

Tools and infrastructure

Libraries: pandas, numpy, scikit-learn, imbalanced-learn, category_encoders, featuretools, xgboost/lightgbm/catboost, spaCy, NLTK, Hugging Face Transformers, albumentations (images).
Production: MLflow, Airflow, Kubeflow, TFX (TensorFlow Extended) for pipelines.
Feature stores: Feast, Tecton (for maintaining consistent feature definitions between training and serving).
AutoML: Auto-sklearn, H2O AutoML, Google AutoML, AutoGluon — these often include automated preprocessing steps.

Current research directions and future implications

Auto-preprocessing and AutoML: automated selection of imputations, encodings, and transformations.
Differentiable preprocessing: making preprocessing steps part of end-to-end learnable pipelines (e.g., learned imputations, differentiable tokenizers).
Causally-aware preprocessing: avoiding spurious correlations and ensuring robustness via causal features.
Fairness-aware preprocessing: transformations that reduce bias (reweighing, disparate impact removal).
Privacy-preserving preprocessing: differential privacy mechanisms applied during preprocessing to protect sensitive information.
Synthetic data and generative methods to augment scarce or privacy-sensitive datasets.
Feature stores and standardized feature contracts across organizations to reduce errors and drift.

Examples across domains

Tabular (banking, insurance): imputation, target encoding of categorical features, robust scaling, feature crosses.
NLP: tokenization, subword vocabularies (BPE), contextual embeddings, sequence truncation/padding.
Computer vision: resizing, normalization with dataset means, color augmentation, random crops, use of pretrained normalization stats.
Time series (IoT, finance): timestamp features, lag features, rolling aggregates, differencing to remove trend.
Healthcare: careful missingness handling (missingness is often informative), privacy-preserving aggregation, clinical code embeddings.

Monitoring and maintenance in production

Continuously monitor feature distributions (mean/std, missingness, cardinality) and model performance.
Detect drift and set alerts for retraining triggers.
Maintain atomic, versioned preprocessing code and store transformation metadata (scalers’ means/stds, encoders’ vocabularies).
Ensure online and offline preprocessing match exactly — use shared libraries or feature stores.
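One simple way to monitor a numeric feature for drift is a two-sample Kolmogorov–Smirnov test between a training reference and recent production values. The sketch below uses synthetic data, and the alert threshold of 0.01 is an assumption for illustration, not a recommended setting.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Reference sample from training time and two hypothetical production windows.
train_feature = rng.normal(loc=0.0, scale=1.0, size=2000)
live_stable = rng.normal(loc=0.0, scale=1.0, size=2000)
live_shifted = rng.normal(loc=0.5, scale=1.0, size=2000)  # mean has moved

ALERT_P_VALUE = 0.01  # hypothetical threshold; tune per feature and sample size

def drift_alert(reference, current, threshold=ALERT_P_VALUE):
    """Return (alert, p_value) for a two-sample KS test."""
    result = ks_2samp(reference, current)
    return result.pvalue < threshold, result.pvalue

stable_alert, stable_p = drift_alert(train_feature, live_stable)
shifted_alert, shifted_p = drift_alert(train_feature, live_shifted)
print("stable window:  alert =", stable_alert, "p =", stable_p)
print("shifted window: alert =", shifted_alert, "p =", shifted_p)
```

With large production samples even tiny shifts become statistically significant, so in practice a test like this is usually paired with an effect-size check such as the difference in means or missingness rate.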

Summary

Data preprocessing is not ancillary — it is central to successful ML. It encompasses cleaning, transforming, encoding, augmenting, and selecting features so models learn robust, generalizable patterns. Good preprocessing avoids leakage, harmonizes modalities, addresses distributional issues, and incorporates domain insight. Use reproducible pipelines, consider the statistical assumptions behind methods, and monitor deployed systems for drift. Advances in automation, privacy, and fairness will continue shaping preprocessing practice.
