How does machine learning work?
===============================
Abstract
Machine learning (ML) is a set of methods that enable computers to learn patterns from data and make predictions or decisions without being explicitly programmed with task-specific rules. This article gives an end-to-end overview of how ML works: historical context, core concepts and mathematical foundations, algorithm families, practical workflow (data, training, evaluation, deployment), modern advances (deep learning, transformers, foundation models), evaluation and optimization techniques, key applications, limitations and ethical issues, and future directions. Concrete examples and code snippets (scikit-learn, PyTorch) illustrate typical ML workflows.
Contents
- Introduction and intuitive view
- Brief historical timeline
- Problem formulation and core concepts
- Types of learning
- Common algorithms and models
- Theoretical foundations
- Training and optimization
- Data engineering & feature representation
- Model selection, evaluation, and metrics
- Practical pipeline: from data to production
- Modern advances and current state-of-the-art
- Examples (code)
- Challenges, risks, and ethics
- Future directions
- Further reading and resources
- Conclusion
- Introduction and intuitive view
At its simplest, machine learning is about mapping inputs to outputs using data. Instead of hand-writing rules, we collect examples (data) and use algorithms to find functions that generalize from those examples to new cases.
Illustrative example:
- Given many images labeled "cat" or "dog", learn a function f(image) → {cat, dog} that classifies new images correctly.
- Given past customer purchases and features, learn to predict churn probability.
Key intuition:
- Use data (observations) to estimate unknown relationships.
- Choose a family of functions (models), measure how well they fit the data (loss), and adjust parameters to minimize loss.
- Ensure the learned function generalizes to unseen data (avoid overfitting).
- Brief historical timeline
- 1950s: Early ideas (Turing). Perceptron (Rosenblatt, 1958) — early binary linear classifier.
- 1960s-70s: Symbolic AI and the limitations of the perceptron (Minsky & Papert).
- 1980s: Backpropagation popularized (Rumelhart, Hinton, Williams) enabling training of multi-layer neural networks.
- 1990s: Statistical learning theory (Vapnik) and Support Vector Machines; kernel methods.
- 1990s-2000s: Ensemble methods mature: bagging and AdaBoost (mid-1990s), then Random Forests and gradient boosting (early 2000s).
- 2012: AlexNet — deep convolutional networks revive interest in deep learning.
- 2014–2020s: Rapid advances in deep learning (GANs, ResNets, Transformers). Rise of large-scale pretrained models (BERT, GPT).
- 2020s: Foundation models, self-supervised learning, wide adoption in industry.
- Problem formulation and core concepts
Formal supervised learning:
- Data: D = {(x1, y1), ..., (xn, yn)} where xi ∈ X (feature space) and yi ∈ Y (labels).
- Goal: find f: X → Y that minimizes expected loss (risk) R(f) = E_{(x,y)∼P}[L(f(x), y)].
- Empirical Risk Minimization (ERM): minimize the average loss on the training data: R_emp(f) = (1/n) ∑_{i=1}^{n} L(f(xi), yi).
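To make the ERM definition concrete, here is a minimal sketch in plain Python. The toy data, the tiny hypothesis class of slopes, and the helper names (`squared_loss`, `empirical_risk`) are illustrative assumptions, not part of any library.

```python
# Toy training set: pairs (x_i, y_i) generated by y = 2x (illustrative values).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def squared_loss(y_pred, y_true):
    """L(f(x), y) for regression."""
    return (y_pred - y_true) ** 2

def empirical_risk(f, data, loss):
    """R_emp(f) = (1/n) * sum of loss(f(x_i), y_i) over the training set."""
    return sum(loss(f(x), y) for x, y in data) / len(data)

# A small hypothesis class: linear functions f_w(x) = w * x for a few slopes w.
candidate_slopes = [0.0, 1.0, 2.0, 3.0]
risks = {
    w: empirical_risk(lambda x, w=w: w * x, data, squared_loss)
    for w in candidate_slopes
}

# ERM picks the hypothesis with the smallest empirical risk.
best_w = min(risks, key=risks.get)
print(risks)
print("ERM choice: w =", best_w)
```

Real learners search a continuous parameter space with an optimizer rather than enumerating candidates, but the quantity being minimized is the same.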
Common elements:
- Model (hypothesis class): family of functions parameterized by θ (e.g., linear functions, decision trees, neural nets).
- Loss function L(ypred, ytrue): e.g., squared error for regression, cross-entropy for classification.
- Optimization method: how to find θ that minimizes loss (gradient descent, coordinate descent, etc.).
- Regularization: penalties or constraints to control complexity and prevent overfitting.
- Generalization: performance on new, unseen data.
Key tradeoffs:
- Bias-variance tradeoff: simple models (high bias) underfit; complex models (high variance) overfit.
- Computational cost vs accuracy.
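The bias-variance tradeoff can be sketched with k-nearest-neighbor regression, where k controls model complexity. In this toy setup (the data, the alternating "noise", and all function names are assumptions for illustration), k = 1 memorizes the training set and reaches zero training error, while k equal to the training-set size always predicts the global mean.

```python
import math

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, points, k):
    """Mean squared error of the k-NN predictor on the given points."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in points) / len(points)

# Training targets: sin(x) plus a fixed alternating perturbation standing in for noise.
train = [
    (i * 0.5, math.sin(i * 0.5) + (0.3 if i % 2 == 0 else -0.3))
    for i in range(12)
]
# Held-out points lie between the training inputs and use the noise-free target.
held_out = [(i * 0.5 + 0.25, math.sin(i * 0.5 + 0.25)) for i in range(11)]

print("k   train MSE   held-out MSE")
for k in (1, 3, len(train)):
    print(k, round(mse(train, train, k), 3), round(mse(train, held_out, k), 3))
```

Comparing the two columns shows the pattern described above: the most flexible setting fits the training noise, the least flexible one ignores the signal, and an intermediate k sits between the extremes.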
- Types of learning
- Supervised learning: learn f(x)→y from labeled data. Tasks: classification, regression.
- Unsupervised learning: find structure in unlabeled data (clustering, density estimation, dimensionality reduction).
- Semi-supervised learning: combine a small labeled dataset with a large unlabeled one.
- Self-supervised learning: create surrogate tasks from unlabeled data (e.g., masked language modeling) for pretraining.
- Reinforcement learning (RL): learn a policy for taking sequential actions that maximizes cumulative reward, through interaction with an environment.
- Online learning: models update incrementally as streaming data arrives.
- Transfer learning & domain adaptation: leverage knowledge from one domain/task to another.
- Common algorithms and models
Broad families and representative methods:
Linear models
- Linear regression (ordinary least squares)
- Logistic regression
- Linear discriminant analysis (LDA)
Instance-based methods
- k-Nearest Neighbors (k-NN)
Tree-based methods
- Decision trees (CART)
- Random Forests (bagging ensembles)
- Gradient Boosted Trees (XGBoost, LightGBM, CatBoost)
Kernel methods
- Support Vector Machines (SVM)
- Kernel ridge regression
Probabilistic models
- Naive Bayes
- Gaussian Mixture Models (GMM)
- Hidden Markov Models (HMM)
Dimensionality reduction
- PCA (Principal Component Analysis)
- t-SNE, UMAP (nonlinear visualization)
- Autoencoders (neural)
Neural networks and deep learning
- Fully connected networks (MLP)
- Convolutional Neural Networks (CNNs) for images
- Recurrent Neural Networks (RNNs), LSTM/GRU for sequences
- Transformers for sequences & attention-based models
- Generative models: GANs, VAEs, diffusion models
Reinforcement learning
- Q-learning, Deep Q-Networks (DQN)
- Policy gradient, Actor-Critic, PPO
- Model-based RL
Ensembles and hybrid systems
- Bagging, boosting, stacking
- Theoretical foundations
Statistics and probability:
- Estimation, bias, consistency, variance.
- Maximum Likelihood Estimation (MLE) and Bayesian inference (posterior estimation).
Optimization:
- Convex vs non-convex optimization.
- Gradient descent (GD), stochastic gradient descent (SGD), momentum, Adam, RMSProp.
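- Convergence guarantees exist for convex problems; for the non-convex objectives of deep learning, the same methods are applied largely heuristically, with much weaker guarantees.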
Generalization theory:
- VC dimension, Rademacher complexity, PAC learning.
- Regularization (L1, L2), capacity control.
- Uniform convergence and bounds on generalization error.
Information theory:
- Entropy and KL divergence underlie common loss functions (cross-entropy) and the divergences used to train generative models.
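The link between cross-entropy and KL divergence can be checked numerically: for discrete distributions p and q, H(p, q) = H(p) + KL(p || q). The sketch below uses made-up probability vectors and helper names chosen for illustration.

```python
import math

def entropy(p):
    """H(p) = -sum p_i log p_i (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # "true" distribution (illustrative values)
q = [0.5, 0.3, 0.2]  # a model's predicted distribution

print("H(p, q)          =", cross_entropy(p, q))
print("H(p) + KL(p||q)  =", entropy(p) + kl_divergence(p, q))
```

Because H(p) does not depend on the model, minimizing cross-entropy over q is equivalent to minimizing KL(p || q).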
Linear algebra:
- Singular Value Decomposition (SVD), eigenanalysis underpin PCA and many algorithms.
- Training and optimization
Objective: minimize loss over parameters θ.
Gradient-based optimization:
- Full-batch GD: θ ← θ − η ∇_θ L(θ) (uses gradient over all data)
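- Stochastic Gradient Descent (SGD): θ ← θ − η ∇_θ L(θ; xi, yi) (one update per training example)
- Mini-batch gradient descent (the common choice): averages the gradient over a small batch, a compromise between the stability of full-batch GD and the speed of per-example SGD.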
- Adaptive optimizers: Adam, Adagrad, RMSProp.
Pseudocode: Mini-batch SGD
```
initialize θ
for epoch in 1..N_epochs:
    shuffle training data
    for batch in minibatches:
        g = (1/|batch|) sum_{(x,y)∈batch} ∇_θ L(f(x;θ), y)
        θ = θ - η g
```
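A runnable counterpart to the pseudocode, sketched in plain Python for a one-dimensional linear model with squared loss. The function name `minibatch_sgd`, its default hyperparameters, and the noise-free toy data are assumptions chosen so that the correct answer (w = 3, b = 1) is known in advance.

```python
import random

def minibatch_sgd(data, lr=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0  # initialize parameters
    for _ in range(epochs):
        rng.shuffle(data)  # new batch composition every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of (1/|batch|) * sum (w*x + b - y)^2 with respect to w and b.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noise-free toy data from y = 3x + 1 with x in [-1, 1].
points = [(x / 10, 3 * (x / 10) + 1) for x in range(-10, 11)]
w, b = minibatch_sgd(points)
print("w =", round(w, 4), "b =", round(b, 4))
```

Deep learning frameworks compute the gradients automatically, but the loop structure (shuffle, batch, gradient, update) is the same.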
Regularization techniques:
- L2 (weight decay), L1 (sparsity)
- Early stopping (monitor validation loss)
- Dropout (neural networks)
- Data augmentation
- Batch normalization (primarily stabilizes training; it also has a mild regularizing effect)
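Early stopping is the easiest of these techniques to sketch without a framework. The helper below is a hypothetical illustration: it takes a list of per-epoch validation losses (made-up numbers here) and reports when training would stop under a simple patience rule.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation curve: improves, then rises as the model overfits.
history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=2)
print("stop at epoch", stop_epoch, "and restore weights from epoch", best_epoch)
```

In a real training loop the model's parameters would be checkpointed at each new best epoch so they can be restored after stopping.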
Hyperparameters:
- Learning rate, batch size, architecture choices, regularization strength.
- Often tuned via grid search, random search, Bayesian optimization, or AutoML.
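Grid search and random search can be sketched in a few lines. Here `validation_loss` is a hypothetical stand-in for "train a model with these hyperparameters and return its validation loss"; its minimum is placed at a known point purely for illustration, and the search ranges are assumptions.

```python
import itertools
import random

def validation_loss(learning_rate, l2_strength):
    """Hypothetical objective with its minimum at (0.1, 0.01)."""
    return (learning_rate - 0.1) ** 2 + (l2_strength - 0.01) ** 2

def grid_search(grid):
    """Evaluate every combination of the listed values; return (score, params)."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = validation_loss(**params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

def random_search(n_trials, seed=0):
    """Sample hyperparameters at random; return the best (score, params)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        # Sampling on a log scale is a common choice for rates and penalties.
        params = {
            "learning_rate": 10 ** rng.uniform(-4, 0),
            "l2_strength": 10 ** rng.uniform(-4, 0),
        }
        score = validation_loss(**params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

grid = {"learning_rate": [0.001, 0.01, 0.1, 1.0], "l2_strength": [0.0, 0.01, 0.1]}
grid_score, grid_params = grid_search(grid)
rand_score, rand_params = random_search(n_trials=50)
print("grid:  ", grid_params, grid_score)
print("random:", rand_params, rand_score)
```

Grid search cost grows multiplicatively with each added hyperparameter, which is one reason random search or Bayesian optimization is often preferred in higher dimensions.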
Loss function examples:
- Regression: Mean Squared Error (MSE) = (1/n) ∑ (yi − ŷi)^2
- Classification: Cross-Entropy Loss (log loss)
- Ranking: pairwise hinge loss, NDCG-based losses
- Reinforcement learning: policy gradient losses, temporal-difference errors
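The two most common losses above can be written directly from their definitions. This is a minimal sketch with illustrative function names; the clipping constant `eps` is an assumption added to avoid taking log(0).

```python
import math

def mean_squared_error(y_true, y_pred):
    """MSE = (1/n) * sum (y_i - yhat_i)^2."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average log loss; y_true in {0, 1}, p_pred is the predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to keep the logarithm finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(mean_squared_error([1, 2, 3], [1, 2, 5]))
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```

Cross-entropy penalizes confident wrong predictions much more heavily than hesitant ones, which is part of why it is preferred over accuracy as a training objective.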
- Data engineering & feature representation
Data is central. Common steps:
- Data collection: instrumentation, logging, surveys, scraping.
- Data cleaning: remove duplicates, fix errors, handle missing values.
- Feature engineering: create informative features (categorical encoding, polynomial features, domain transformations).
- Normalization/scaling: e.g., standard scaling, min-max scaling for numerical features.
- Categorical encoding: one-hot, ordinal, target encoding, embeddings.
- Text/image/audio preprocessing: tokenization, normalization, augmentation.
- Data augmentation: generate variants to increase robustness (flipping images, noise, cropping).
- Label quality: noisy labels degrade models; consider label cleaning or robust loss.
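Three of the preprocessing steps above (standard scaling, min-max scaling, one-hot encoding) are sketched below in plain Python. The function names are illustrative, and the code assumes each numeric column has at least two distinct values so the denominators are non-zero.

```python
import statistics

def standard_scale(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(categories):
    """Return the sorted vocabulary and one indicator vector per input item."""
    vocab = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocab)}
    rows = [[1 if index[c] == j else 0 for j in range(len(vocab))] for c in categories]
    return vocab, rows

print(standard_scale([10, 20, 30]))
print(min_max_scale([10, 20, 30]))
print(one_hot(["red", "blue", "red"]))
```

In a real pipeline the scaling statistics and the category vocabulary would be computed on the training set only and then reused unchanged on validation and test data, to avoid leaking information.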
Feature representation:
- Classical models (linear models, trees, kernel methods) typically rely on handcrafted features.
- Deep learning extracts hierarchical features automatically from raw inputs (pixels, text tokens).
- Model selection, evaluation, and metrics
Splitting data:
- Training set: used to fit model parameters.
- Validation set: used to tune hyperparameters.
- Test set: held out until the end and used once for a final, unbiased estimate of generalization performance.
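A three-way split can be sketched as follows. The function name, the default fractions, and the fixed seed are assumptions for illustration; the important properties are that the data is shuffled once and the three subsets do not overlap.

```python
import random

def train_val_test_split(items, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_fraction)
    n_val = round(len(items) * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

For time-ordered data a random shuffle is usually inappropriate; a chronological cut avoids training on the future.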
Cross-validation:
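- k-fold CV (common when the dataset is small): split the data into k folds and rotate which fold serves as the validation set, training on the rest.
- Stratified CV: keeps class proportions roughly equal across folds, which matters for imbalanced classes.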
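The fold rotation in k-fold cross-validation can be sketched as an index generator. This is an unstratified illustration with a hypothetical function name; it assumes the samples have already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

Each sample appears in exactly one validation fold, and the k validation scores are typically averaged to give the cross-validated estimate.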
Metrics: Classification
- Accuracy, Precision, Recall, F1-score
- Confusion matrix
- ROC curve and AUC-ROC...
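The threshold-based classification metrics above all derive from the four confusion-matrix counts. The sketch below uses made-up labels and illustrative function names, and returns 0.0 when a ratio's denominator is zero (one common convention, assumed here).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(confusion_counts(y_true, y_pred))
print(metrics)
```

On imbalanced data, accuracy can look high while precision or recall is poor, which is why the metrics are usually reported together.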