computer vision

Computer Vision: A Deep Dive (Summary)

Computer vision (CV) enables machines to interpret visual data (images, video, depth, infrared, event streams) by combining optics, signal processing, geometry, machine learning and neuroscience to convert pixels into semantic understanding and actionable outputs.

Historical Evolution & Key Milestones

Classic era (1960s–2000s): geometry, hand-crafted features, filtering, optical flow, stereo, SfM.
Feature/statistical era (2000s–2012): SIFT, SURF, HOG, bag-of-words, probabilistic models.
Deep learning era (2012–present): CNN breakthrough (AlexNet), R-CNN family, FCNs, ViTs and large multimodal foundation models.
Notable milestones: early edge/stereo methods, Bundler/SfM, SIFT/HOG/DPM, AlexNet (2012), R-CNN/Mask R-CNN, ViT and vision-language models (CLIP).

Foundations: Image Formation & Modalities

Pinhole camera model: x ~ K [R | t] X, intrinsics (K), extrinsics (R, t), and lens distortion requiring calibration.
Modalities: RGB, grayscale, depth (LiDAR/stereo), infrared/thermal, hyperspectral, event cameras.
Sampling & aliasing: continuous radiance → sampled images; anti-aliasing and demosaicing matter.

Low-Level Vision & Signal Processing

Core ops: convolution (smoothing, derivatives), Gaussian/Laplacian pyramids, morphological ops.
Tasks: edge detection (Canny), denoising (BM3D, DL denoisers), deblurring, super-resolution, color-space transforms.

Geometry & 3D Vision

Epipolar geometry, fundamental/essential matrices: x'ᵀ F x = 0.
Stereo/disparity → depth (Z = fB/d); SfM, MVS, bundle adjustment, PnP for pose estimation.
Optical flow (Lucas–Kanade, Horn–Schunck; FlowNet/RAFT), SLAM for real-time mapping & localization.

Classical Methods vs Deep Learning

Classical: detectors (Harris, FAST), descriptors (SIFT, ORB), HOG+SVM, DPM, graph-cuts, superpixels, tracking with Kalman/particle filters.
Deep learning: end-to-end learned hierarchical features (CNNs), transformers, self-/contrastive learning, foundation vision-language models.
Deep model advantages: larger-scale learning, robustness to many variations; limitations include data needs and explainability.

Modern Architectures & Typical Tasks

Classification: ResNet, EfficientNet, ViT.
Detection: Faster R-CNN, YOLO family, DETR.
Segmentation: U-Net, DeepLab, Mask R-CNN.
Tracking: SORT/DeepSORT, ByteTrack, Siamese trackers.
Depth/3D: MiDaS, NeRF; optical flow: RAFT; generative/restoration: GANs, diffusion models.

Typical CV Pipeline

Data acquisition (sensors, calibration)
Preprocessing (demosaic, denoise, rectify, normalize)
Data augmentation
Model selection & training (backbone, loss, optimizer)
Inference optimization (quantization, pruning, ONNX/TensorRT)
Postprocessing (NMS, association) and deployment with monitoring

Datasets & Evaluation Metrics

Key datasets: ImageNet, COCO, Pascal VOC, Cityscapes, KITTI, NYU Depth, ADE20K, Waymo/nuScenes, DAVIS, CLIP-scale image-text corpora.
Metrics: classification (accuracy/top-5), detection (AP/mAP COCO-style), segmentation (mIoU, Dice), depth (RMSE), flow (EPE), tracking (MOTA/MOTP), calibration (ECE), latency/FLOPs/energy.
Beware leaderboard overfitting and domain shift from benchmarks to real-world data.

Tools, Frameworks & Hardware

Software: OpenCV, PyTorch, TensorFlow, TorchVision, Detectron2, MMDetection, Albumentations, Open3D.
Inference/export: ONNX, TensorRT, TVM.
Hardware: GPUs, TPUs, edge devices (Jetson, Coral), NPUs, FPGAs/ASICs; sensors: global vs rolling shutter, HDR, event cameras, LiDAR.

Applications & Use Cases

Autonomous vehicles, robotics, medical imaging, manufacturing quality control, retail (visual search), agriculture, surveillance (with ethical concerns), AR/entertainment, remote sensing, document analysis.

Challenges, Limitations & Ethics

Technical: domain shift, long-tail events, annotation cost, robustness/adversarial vulnerability, real-time constraints.
Ethical & societal: bias, privacy concerns (face recognition), safety/accountability, dual-use risks, environmental cost of training.
Mitigations: dataset curation, fairness audits, privacy-preserving methods, transparency, regulation compliance.

Future Directions

Foundation multimodal models and vision-language integration.
Self-supervised/few-shot learning to reduce labels.
Improved 3D & neural scene representations (NeRF + real-time 3D perception).
Event-based vision, continual learning, certifiable robustness, efficient on-device inference (TinyML).
Explainability, standards, and stronger safety/regulation frameworks.

Best Practices & Engineering Tips

Use pretrained backbones, careful validation splits, and domain-relevant augmentations.
Apply learning-rate schedules, mixed precision, and experiment tracking for reproducibility.
Profile & optimize inference (quantize/prune), monitor deployed models for drift, and evaluate robustness.

Key Takeaways

Foundations matter: image formation, geometry and low-level signal processing underpin modern methods.
Deep learning dominates but classical methods and geometry remain important for constraints, efficiency and interpretability.
Practical systems require attention to data, evaluation, deployment, and ethical implications, not just model accuracy.

Computer Vision — A Deep Dive

Computer vision (CV) is the interdisciplinary field that enables computers to interpret, understand, and act upon visual data from the world: images, video, depth, infrared, and other modalities. It combines ideas from optics, signal processing, geometry, machine learning, and neuroscience to convert pixels into semantic understanding and actionable insights.

This article provides a comprehensive overview: history, theoretical foundations, core algorithms, deep-learning revolution, practical pipelines, datasets & benchmarks, evaluation metrics, hardware & tooling, applications, limitations and ethics, and future directions. Code snippets and examples are included to show how common tasks are implemented.

Table of Contents

Introduction & History
Image Formation & Cameras
Low-Level Vision & Signal Processing
Geometry & 3D Vision
Feature-Based & Classical Methods
Machine Learning & the Deep Learning Revolution
Modern Architectures & Models
Typical Computer Vision Pipeline
Datasets, Benchmarks & Evaluation Metrics
Tools, Frameworks & Hardware
Applications & Industry Use Cases
Challenges, Limitations & Ethics
Future Directions
Practical Examples (code)
Recommended Reading & Resources
Summary

Introduction & History

Computer vision emerged in the 1960s as an effort to replicate aspects of human vision with machines. Early work emphasized geometric inference and modeling of image formation. Over decades it evolved through three main phases:

Classic vision era (1960s–2000s): Emphasis on geometry, handcrafted features, filtering, segmentation, optical flow, stereo, structure-from-motion (SfM).
Feature-based & statistical era (2000s–2012): Bag-of-words models, hand-designed descriptors such as SIFT and SURF, probabilistic models, and statistical learning approaches.
Deep learning era (2012–present): Breakthroughs with convolutional neural networks (CNNs) — AlexNet (2012) demonstrated that end-to-end learned hierarchical features outperform hand-engineered pipelines on large datasets. Since then, deep models (CNNs, RNNs, Transformers) dominate most tasks.

Key milestones:

1970s–1980s: Early edge detection, Hough transform, stereo vision fundamentals.
1999–2004: SIFT introduces scale-invariant local features.
2005–2008: HOG, SURF, deformable part models; Bundler and other foundational SfM tools.
2012: AlexNet marks deep learning breakthrough.
2014–2016: R-CNN family for detection; fully convolutional networks (FCN) for segmentation.
2020s: Vision Transformers (ViT) and large multimodal foundation models.

Image Formation & Cameras

Understanding how images are formed is fundamental.

Pinhole camera model:

A point in 3D space X = [X, Y, Z, 1]^T projects to an image point x = [u, v, 1]^T via:

x ~ K [R | t] X

where K is the intrinsic matrix (focal lengths, principal point, skew), R and t are rotation and translation (extrinsics). The ~ denotes equality up to scale.

Intrinsic matrix K:

K = [[fx, s, cx], [0, fy, cy], [0, 0, 1]]
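As a minimal sketch of the projection x ~ K [R | t] X, the NumPy snippet below projects two 3D points into pixel coordinates. The helper name `project_points` and the numeric intrinsics (800 px focal length, principal point at (320, 240), zero skew) are illustrative assumptions, not values taken from a real camera.

```python
import numpy as np

def project_points(K, R, t, points_3d):
    """Project Nx3 world points to Nx2 pixel coordinates with x ~ K [R | t] X."""
    points_3d = np.asarray(points_3d, dtype=float)
    # Append a 1 to each point to get homogeneous coordinates X = [X, Y, Z, 1]^T.
    X_h = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])
    # Build the 3x4 projection matrix P = K [R | t].
    P = K @ np.hstack([R, t.reshape(3, 1)])
    x_h = (P @ X_h.T).T  # Nx3 homogeneous image points
    # Divide by the third coordinate to resolve the "equality up to scale".
    return x_h[:, :2] / x_h[:, 2:3]

# Illustrative intrinsics (assumed values): fx = fy = 800 px, cx = 320, cy = 240, s = 0.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)      # camera axes aligned with world axes
t = np.zeros(3)    # camera centre at the world origin

pixels = project_points(K, R, t, [[0.0, 0.0, 2.0], [0.5, -0.25, 2.0]])
print(pixels)
```

A point on the optical axis lands on the principal point, and moving a point sideways at fixed depth moves its pixel position in proportion to fx / Z.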

Lens distortion: Real lenses bend rays away from the ideal pinhole model. Typical models combine radial and tangential terms whose coefficients are estimated during calibration.
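One common parameterization of radial plus tangential distortion uses two radial coefficients (k1, k2) and two tangential coefficients (p1, p2) applied to normalized image coordinates. The sketch below assumes that form; the function name `distort_normalized` and the coefficient values are illustrative, and real calibrations often use more terms.

```python
import numpy as np

def distort_normalized(xy, k1=0.0, k2=0.0, p1=0.0, p2=0.0):
    """Apply radial (k1, k2) and tangential (p1, p2) distortion to normalized image points."""
    xy = np.asarray(xy, dtype=float)
    x, y = xy[:, 0], xy[:, 1]
    r2 = x * x + y * y                         # squared distance from the optical axis
    radial = 1.0 + k1 * r2 + k2 * r2 * r2      # radial scaling factor
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return np.stack([x_d, y_d], axis=1)

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.5, 0.5]])
# A negative k1 (assumed value) pulls points toward the centre: barrel distortion.
barrel = distort_normalized(points, k1=-0.2)
print(barrel)
```

Points near the image centre barely move, while points far from it are displaced the most, which is why distortion is most visible at image corners.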

Image sampling and aliasing: Continuous scene radiance is sampled on a pixel grid and quantized. Sampling detail that is finer than the grid can represent causes aliasing, so images are low-pass filtered (blurred) before downsampling.

Modalities:

RGB, grayscale
Depth (LiDAR, structured light, stereo disparity)
Infrared, thermal
Event cameras (asynchronous brightness change)
Hyperspectral

Low-Level Vision & Signal Processing

Low-level tasks operate on pixels and local neighborhoods.

Key operations:

Linear filtering (convolution): smoothing, sharpening, derivatives
Gaussian blur, Laplacian, Sobel operators
Edge detection: Canny, Marr-Hildreth
Noise models and denoising: Gaussian, Poisson; algorithms include BM3D, non-local means, DL-based denoisers
Image restoration: deblurring, super-resolution
Color spaces and transformations: RGB, HSV, YUV, Lab
Histograms & equalization
Morphological operations: dilation, erosion, opening, closing
Image pyramids and scale-space: Gaussian/Laplacian pyramids, DoG used in SIFT

Convolution is the basic linear operation:

(I * K)(x, y) = Σ_u Σ_v I(x - u, y - v) K(u, v)
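The double sum can be written directly as two loops. The sketch below is a deliberately naive NumPy version restricted to the "valid" region (no padding); the function name `convolve2d_valid` and the test image are illustrative, and production code would normally call an optimized library routine instead.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Naive 2D convolution over the 'valid' region (no padding)."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    # Flipping the kernel turns the sliding dot product (correlation) into convolution.
    flipped = kernel[::-1, ::-1]
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * flipped)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # intensity ramp: I[y, x] = 5*y + x
box = np.full((3, 3), 1.0 / 9.0)                   # 3x3 mean (box) filter for smoothing
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

smoothed = convolve2d_valid(image, box)     # smoothing a linear ramp leaves it unchanged
grad_x = convolve2d_valid(image, sobel_x)   # constant response on a constant-slope ramp
print(smoothed)
print(grad_x)
```

Because the kernel is flipped, the Sobel response here has the opposite sign to what a correlation-based filter routine would return for the same kernel.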

Geometry & 3D Vision

Many vision tasks rely on geometric constraints.

Key concepts:

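Epipolar geometry: The relationship between two views of the same scene. If a 3D point X projects to x in the first image and x' in the second, the fundamental matrix F (pixel coordinates) or the essential matrix E (calibrated, normalized coordinates) constrains the correspondence: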

x'^T F x = 0
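The constraint can be checked numerically on synthetic data. The sketch below assumes the second camera's coordinates are related to the first by X2 = R X1 + t, builds F = K^-T [t]_x R K^-1 from that pose, and evaluates x'^T F x for a true correspondence. The intrinsics, pose, and helper names are all made up for illustration.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x such that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_pose(K1, K2, R, t):
    """F = K2^-T [t]_x R K1^-1 for a second camera with X2 = R X1 + t."""
    E = skew(t) @ R                       # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def project(K, X_cam):
    """Project a 3D point in camera coordinates to homogeneous pixel coordinates."""
    x = K @ X_cam
    return x / x[2]

K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
theta = np.deg2rad(5.0)                   # small rotation about the y axis (assumed)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([-0.2, 0.0, 0.0])            # sideways translation (assumed)

X1 = np.array([0.3, -0.1, 4.0])           # a point in the first camera's frame
x1 = project(K, X1)
x2 = project(K, R @ X1 + t)

F = fundamental_from_pose(K, K, R, t)
residual = float(x2 @ F @ x1)             # close to zero for a true correspondence
wrong = float((x2 + np.array([0.0, 25.0, 0.0])) @ F @ x1)   # point moved off its epipolar line
print(residual, wrong)
```

In practice F is usually estimated from noisy point matches rather than from a known pose, and the residual is used to separate inlier matches from outliers.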

Stereo vision & disparity: Depth Z related to baseline B and disparity d:

Z = f * B / d
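A short sketch of the depth formula, assuming a rectified stereo pair with the focal length in pixels and the baseline in metres. The function name, the guard against zero disparity, and the numbers are illustrative choices.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, min_disparity=1e-6):
    """Convert disparity (pixels) to depth (metres) with Z = f * B / d."""
    d = np.asarray(disparity_px, dtype=float)
    depth = np.full(d.shape, np.inf)
    valid = d > min_disparity             # zero or negative disparity has no finite depth
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

# Assumed rig: 800 px focal length, 12 cm baseline.
depth = disparity_to_depth([64.0, 32.0, 8.0, 0.0], focal_px=800.0, baseline_m=0.12)
print(depth)
```

Halving the disparity doubles the depth, so a fixed disparity error of one pixel matters far more for distant points than for nearby ones.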

Structure-from-Motion (SfM): Recover camera poses and sparse 3D structure from multiple images via bundle adjustment.
Multi-view stereo (MVS): Dense reconstruction from multiple viewpoints.
Pose estimation: PnP (Perspective-n-Point) solves camera pose from 3D-2D correspondences.
Optical flow: Dense motion field between frames; e.g., Lucas-Kanade, Horn-Schunck; modern CNNs (FlowNet, RAFT).
SLAM (Simultaneous Localization and Mapping): Real-time pose and map estimation for robotics. Visual SLAM uses monocular or stereo cameras.

Mathematical tools: projective geometry, homogeneous coordinates, Lie groups (SO(3), SE(3)), optimization (bundle adjustment via non-linear least squares), and RANSAC for robust model fitting.
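To illustrate RANSAC on the simplest possible model, the sketch below fits a 2D line to points contaminated with gross outliers: sample a minimal set (two points), count the points that agree with the resulting line, and keep the largest consensus set. The iteration count, inlier tolerance, and synthetic data are assumptions chosen for the example; the same loop structure applies to homographies or fundamental matrices with a different minimal solver.

```python
import numpy as np

def ransac_line(points, n_iters=200, inlier_tol=0.05, seed=0):
    """Fit y = a*x + b robustly: sample 2 points, count inliers, keep the best model."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if abs(x2 - x1) < 1e-12:
            continue                       # skip degenerate (vertical) samples
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        residuals = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = residuals < inlier_tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refine with ordinary least squares on the consensus set only.
    a, b = np.polyfit(points[best_inliers, 0], points[best_inliers, 1], deg=1)
    return a, b, best_inliers

rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.01, size=x.size)   # inliers near y = 2x + 1
y[::5] += rng.uniform(1.0, 3.0, size=y[::5].size)        # every 5th point becomes an outlier
a, b, inliers = ransac_line(np.column_stack([x, y]))
print(a, b, int(inliers.sum()))
```

A plain least-squares fit on all 50 points would be pulled toward the outliers; restricting the final fit to the consensus set is what makes the estimate robust.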

Feature-Based & Classical Methods

Before deep learning, pipelines were built from carefully engineered modules:

Feature detection & description:

Detectors: Harris corner, FAST, MSER
Descriptors: SIFT, SURF, ORB, BRIEF, BRISK
Matching: nearest neighbor, ratio test, cross-check
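The nearest-neighbour ratio test can be sketched with brute-force distances. The toy 3-dimensional descriptors and the 0.75 threshold below are assumptions for illustration; real descriptors such as SIFT are 128-dimensional, and the second set must contain at least two descriptors for the test to be defined.

```python
import numpy as np

def match_ratio_test(desc_a, desc_b, ratio=0.75):
    """Return (index_a, index_b) pairs that pass the nearest-neighbour ratio test."""
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(desc_a), len(desc_b)).
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(dists):
        order = np.argsort(row)
        best, second = order[0], order[1]
        # Keep the match only if the best distance is clearly smaller than the runner-up.
        if row[best] < ratio * row[second]:
            matches.append((i, int(best)))
    return matches

desc_a = np.array([[1.0, 0.0, 0.0],      # distinctive: one clear neighbour in desc_b
                   [0.5, 0.5, 0.0]])     # ambiguous: two similarly close neighbours
desc_b = np.array([[0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
matches = match_ratio_test(desc_a, desc_b)
print(matches)
```

The ambiguous descriptor is dropped rather than matched to its nearest neighbour, which trades a few lost matches for far fewer false ones.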

Object detection & recognition (classical):

Sliding windows with HOG+SVM (Dalal & Triggs)
Deformable Part Models (DPM)

Segmentation:

Thresholding, region growing, graph cuts (Boykov-Kolmogorov), mean shift, watershed
Superpixels: SLIC

Tracking:

Correlation filters (MOSSE), Kalman filters, particle filters
Multi-object tracking (MOT) typically follows tracking-by-detection: per-frame detections are linked into tracks using data association (Hungarian algorithm) and appearance models
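The data-association step can be sketched with box overlap as the matching score. Note the swap: to stay dependency-free, this example uses a greedy highest-IoU-first assignment rather than the Hungarian algorithm, which would instead minimize the total assignment cost. Box format, threshold, and function names are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate_greedy(tracks, detections, iou_threshold=0.3):
    """Greedily pair track boxes with detection boxes in order of decreasing IoU."""
    candidates = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    used_tracks, used_dets, pairs = set(), set(), []
    for score, ti, di in candidates:
        if score < iou_threshold:
            break                          # remaining candidates overlap too little
        if ti in used_tracks or di in used_dets:
            continue                       # each track and detection is used at most once
        pairs.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    unmatched_dets = [di for di in range(len(detections)) if di not in used_dets]
    return sorted(pairs), unmatched_dets

tracks = [(10, 10, 50, 50), (100, 100, 140, 140)]            # boxes from the previous frame
detections = [(102, 98, 142, 138), (12, 11, 52, 51), (300, 300, 320, 320)]
pairs, new_objects = associate_greedy(tracks, detections)
print(pairs, new_objects)
```

Unmatched detections are candidates for new tracks; trackers in the SORT family additionally predict each track's box with a Kalman filter before computing overlaps.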

Limitations: Classical pipelines are sensitive to illumination and viewpoint changes and generalize poorly across domains, which motivated the move to learned representations.

Machine Learning & the Deep Learning Revolution

Deep learning replaced handcrafted features by learned hierarchical representations.

Convolutional Neural Networks (CNNs)

Key properties: local receptive fields, weight sharing, translation equivariance, hierarchical feature extraction.
Typical layers: convolution, ReLU, batch normalization, pooling, fully-connected, upsampling/deconvolution.
Landmark networks: LeNet, AlexNet, VGG, ResNet, Inception, MobileNet.
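Weight sharing and translation equivariance can be demonstrated without a deep learning framework. The sketch below applies one shared 3x3 filter (as a sliding cross-correlation, which is what convolution layers compute in practice) to an image and to a copy shifted by one pixel; the random image and filter are placeholders.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Each output depends only on a local kh x kw receptive field.
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.random((3, 3))    # 9 shared weights, independent of the image size

shifted = np.roll(image, shift=1, axis=1)     # move the image one pixel to the right
out = correlate2d_valid(image, kernel)
out_shifted = correlate2d_valid(shifted, kernel)

# Away from the wrapped border column, the response map moves by the same one pixel.
equivariant = np.allclose(out_shifted[:, 1:], out[:, :-1])
print(equivariant)
```

Pooling and strided layers weaken exact equivariance, so in full networks the property holds only approximately.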

Why CNNs work:

Learn Gabor-like filters at early layers and more abstract features at deeper layers.
End-to-end training with backpropagation and large labeled datasets (ImageNet).
Regularization and normalization techniques (dropout, batch norm), together with residual connections as in ResNet, enable very deep networks.

Detection & segmentation deep methods:

R-CNN family: R-CNN → Fast R-CNN → Faster R-CNN (with Region Proposal Network)
YOLO & SSD: Single-stage detectors focusing on speed.
Mask R-CNN: Instance segmentation by adding mask head to Faster R-CNN.
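FCN, U-Net, DeepLab: Semantic segmentation architectures. FCN replaces fully-connected layers with convolutions, U-Net pairs an encoder-decoder with skip connections, and DeepLab uses dilated (atrous) convolutions to enlarge the receptive field without losing resolution.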

Self-supervised & Unsupervised methods:

Contrastive learning: SimCLR, MoCo
Non-contrastive/self-distillation: BYOL, DINO
Masked image modeling for ViTs: MAE

Multimodal & foundation models:

CLIP (contrastive language-image pretraining) aligns images and text, enabling zero-shot classification.
Flamingo, GPT-4V, and similar models integrate vision and language for richer tasks such as captioning and visual question answering.
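The zero-shot classification step in a CLIP-style model reduces to comparing one image embedding against one text embedding per candidate label. The sketch below shows only that scoring step, using hand-written placeholder vectors in place of real encoder outputs; the temperature value and the prompts are assumptions for illustration.

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one image and several text prompts."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature   # scaled cosine similarities
    logits -= logits.max()                          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Placeholder embeddings; a real system would obtain these from trained image/text encoders.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = np.array([[0.9, 0.1, 0.0],
                      [0.1, 0.9, 0.0],
                      [0.0, 0.1, 0.9]])
image_emb = np.array([0.8, 0.3, 0.1])

probs = zero_shot_probs(image_emb, text_embs)
predicted = labels[int(np.argmax(probs))]
print(predicted, probs)
```

Because the classes are defined only by their text prompts, adding a new class means adding a new prompt, with no retraining of the encoders.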

Transformers in vision:

ViT splits images into patches treated as tokens. Transformers scale well and benefit from large-scale pretraining.
Hybrid CNN-Transformer backbones and transformer decoders for detection (DETR).
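The patch-splitting step of a ViT can be sketched as a pure reshape. The example assumes a 224x224 RGB image and 16x16 patches, a commonly used configuration; the learned linear projection, class token, and position embeddings that follow in a real model are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an HxWxC image into flattened, non-overlapping patch tokens."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # Separate the grid index from the within-patch index along each spatial axis.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)        # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, patch_size=16)               # shape (196, 768)
print(tokens.shape)
```

Each of the 196 rows is one token; the sequence length grows quadratically with image side length, which is one reason windowed variants such as Swin Transformer exist.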

Modern Architectures & Models

Representative architectures:

Image classification:

ResNet, DenseNet, EfficientNet, Vision Transformer (ViT), ConvNeXt, Swin Transformer

Object detection:

Two-stage: Faster R-CNN, Mask R-CNN
One-stage: YOLOv3/v4/v5/v8, SSD, RetinaNet (focal loss)
Transformer-based: DETR, Deformable DETR

Segmentation:

FCN, U-Net, DeepLabv3+, HRNet, Segmenter, MaskFormer

Tracking:

SORT, DeepSORT, ByteTrack, Siamese trackers (SiamRPN)

Depth & 3D:

Monocular depth prediction: MiDaS, DPT
Neural radiance fields (NeRF) for novel view synthesis

Optical flow:

FlowNet, PWC-Net, RAFT

Generative & restoration:

GANs (pix2pix, CycleGAN), diffusion models for image synthesis and restoration

Loss functions & training tricks:

Cross-entropy, focal loss (addresses class imbalance), IoU-based losses, dice/soft-Jaccard, contrastive losses, perceptual losses (VGG-based), adversarial losses.
Data augmentation: random crop, flip, color jitter, RandAugment, MixUp, CutMix, mosaic for detection.
Transfer learning and fine-tuning widely used.
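As an example of how a loss can address class imbalance, the sketch below implements a binary focal loss in NumPy: the factor (1 - p_t)^gamma shrinks the contribution of samples the model already classifies confidently. The defaults gamma = 2 and alpha = 0.25 are commonly quoted values and should be treated as assumptions to tune, not fixed constants.

```python
import numpy as np

def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=float)
    p_t = np.where(targets == 1, probs, 1.0 - probs)       # probability of the true class
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)   # per-class weighting
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

targets = np.array([1, 0, 0, 0])
easy = binary_focal_loss([0.95, 0.05, 0.05, 0.05], targets)   # confident and correct
hard = binary_focal_loss([0.30, 0.60, 0.05, 0.05], targets)   # two poorly classified samples
print(easy, hard)
```

Setting gamma to 0 removes the down-weighting and leaves an alpha-weighted cross-entropy, which makes the relationship between the two losses easy to verify.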

Typical Computer Vision Pipeline

A practical CV system usually follows these stages:

Data acquisition

Cameras, sensors; synchronization and calibration for multi-sensor rigs.

Preprocessing

Demosaicing, color conversion, denoising, rectification, normalization.

Data augmentation

Apply label-preserving transforms (crops, flips, color jitter) so the model generalizes beyond the exact images it was trained on.

Model selection & training

Select backbone, head; choose loss, optimizer, learning rate schedule.

Inference

Optimize for latency: quantization, pruning, TensorRT/ONNX, batching.

Postprocessing

NMS for detection, morphological operations, tracking association.

Evaluation & deployment

Evaluate with test sets & metrics; monitor in deployment for domain shifts.
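Of the inference optimizations listed above, quantization is the easiest to show in a few lines. The sketch below uses symmetric per-tensor int8 quantization with a single scale, which is only one of several schemes; deployment toolchains typically add calibration data, per-channel scales, or zero points. Function names and weights are illustrative.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: map floats to int8 with a single scale."""
    x = np.asarray(x, dtype=np.float32)
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0   # avoid division by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return q.astype(np.float32) * scale

weights = np.array([-1.27, -0.5, 0.0, 0.3, 1.0], dtype=np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = float(np.max(np.abs(restored - weights)))   # bounded by about scale / 2
print(q, scale, max_error)
```

Storage drops from 32 to 8 bits per value, at the cost of a rounding error of at most about half the scale per weight.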

Example: Object detection pipeline

Input image → backbone feature map → proposal head or dense head → bounding box regression + classification → postprocess with NMS → optionally track across frames.
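The NMS postprocessing step in this pipeline can be sketched as a greedy loop: keep the highest-scoring box, discard remaining boxes that overlap it too much, and repeat. Boxes are assumed to be (x1, y1, x2, y2) with positive area, and the 0.5 IoU threshold is an illustrative default.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; boxes are (x1, y1, x2, y2). Returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)              # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]   # drop boxes that overlap the kept one too much
    return keep

boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In multi-class detectors this is usually run per class so that overlapping boxes of different classes do not suppress each other.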

Datasets, Benchmarks & Evaluation Metrics

Large-scale datasets propelled modern CV.

Key datasets:

ImageNet: 1k-class classification (ILSVRC) — catalyst for deep learning.
COCO (Common Objects in Context): object detection, instance segmentation, keypoint detection.
Pascal VOC: earlier object detection/segmentation tasks.
OpenImages: large-scale ...
