
What is a convolutional neural network?

Convolutional Neural Networks (CNNs): Summary

A Convolutional Neural Network (CNN or ConvNet) is a class of deep neural networks designed for grid-like data (most notably 2D images, also 1D time series and 2D spectrograms). CNNs rely on three core ideas (local connectivity, parameter sharing through convolutions, and hierarchical feature learning) to efficiently learn spatially structured features and are the dominant approach in computer vision and many other domains.

History & motivation

Early neuroscience-inspired ideas and LeNet (1990s) established local receptive fields and weight sharing.
AlexNet (2012) catalyzed the deep-learning revolution for vision; later advances include VGG, Inception, ResNet, DenseNet, MobileNet, EfficientNet, and the rise of Vision Transformers (ViT).
Motivation: images have strong local correlations; convolutions reduce parameters and improve generalization compared to dense layers.

Core concepts

Local connectivity: each neuron sees a local patch (receptive field) rather than the whole image.
Parameter sharing: a small kernel (e.g., 3×3) is applied across locations; same weights detect features everywhere.
Hierarchical features: early layers learn edges/textures; deeper layers form parts and objects.
Translational equivariance vs invariance: convolutions are equivariant to translations; pooling/global pooling and learned invariances produce translation-insensitive representations useful for classification.

Mathematical foundations (concise)

Convolution vs cross-correlation: most frameworks implement cross-correlation (no kernel flip) but learnable kernels make the distinction negligible.
Stride, padding, dilation: stride reduces spatial resolution, padding controls borders, dilation (atrous) increases receptive field without extra params.
Output size (2D): H_out = floor((H_in + 2p - K_h) / s) + 1 (similarly for W_out).
Parameters & MACs: params = C_out × C_in × K × K (+ C_out bias); MACs ≈ H_out × W_out × C_out × C_in × K × K.
Receptive field: grows with kernel sizes, strides and dilations. RF_l = RF_{l-1} + (k_l - 1) * stride_total_{l-1}; stride_total accumulates multiplicatively.
Efficient variants: depthwise separable conv (depthwise + 1×1 pointwise), grouped conv reduce cost and parameters.

Common building blocks

Convolutional layers: 1×1 for channel mixing, 3×3 common for spatial pattern learning.
Nonlinearities: ReLU, LeakyReLU, PReLU, ELU, GELU, Softmax (output).
Pooling: max/average pooling and global average pooling to downsample and encourage invariance.
Normalization: BatchNorm (common), LayerNorm, GroupNorm, InstanceNorm.
Skip connections / residual blocks: identity or projection skips (ResNet) that enable very deep models.
Regularization: Dropout, spatial dropout, label smoothing, weight decay.
Attention / non-local modules: capture long-range dependencies beyond local conv receptive fields.

Representative architectures

LeNet, AlexNet, VGG, Inception (GoogLeNet), ResNet, DenseNet
MobileNet, ShuffleNet, EfficientNet for efficiency and mobile deployment
YOLO / SSD / Faster R-CNN (detection), U-Net / DeepLab (segmentation), Mask R-CNN (instance seg)
Vision Transformer (ViT) and hybrid CNN-Transformer models

Training, optimization & regularization

Losses: cross-entropy for classification, smooth L1 + classification for detection, Dice/IoU for segmentation.
Optimizers: SGD with momentum (often best generalization), Adam/AdamW for faster convergence; use weight decay appropriately.
LR schedules: step decay, cosine annealing, cyclical LR, warmup are critical for performance.
Data augmentation & transfer learning: flips, crops, color jitter, Mixup/CutMix, AutoAugment; pretraining (ImageNet or self-supervised) + fine-tuning is common.
Regularization: dropout, weight decay, BatchNorm-induced regularization, early stopping.

Practical examples & tips

Simple CNNs can be implemented in PyTorch or Keras; common patterns: conv → norm → activation → pool → FC.
Hyperparameter tips: start from pretrained models for small data, use mixed precision (FP16), scale learning rate with batch size, use strong augmentation and weight decay (~1e-4), prefer SGD+m for best generalization but AdamW is a robust default.
Compute output shapes and params using the formulas above; calculate receptive field when context matters.

Applications

Computer vision: classification, detection, segmentation, super-resolution, style transfer, face recognition, pose estimation.
Medical imaging, remote sensing, audio (spectrograms), time-series (1D convs), some NLP tasks and multimodal models.

Deployment & efficiency

Use GPUs/TPUs for training; optimize inference for edge with quantization (8-bit), pruning, knowledge distillation, and efficient architectures.
Deployment toolchains: TensorRT, ONNX, TensorFlow Lite, Core ML, OpenVINO.

Interpretability

Visualize first-layer filters, activation maximization, feature inversion; use saliency maps, Grad-CAM, SHAP, LIME for explanations.

Limitations & failure modes

Need lots of labeled data (mitigated by transfer/self-supervised learning).
Vulnerable to adversarial examples and dataset biases; limited interpretability in deep layers.
Local convolutions struggle with very long-range dependencies (addressed by dilations, large kernels, attention/transformers).

Current trends & future directions

Self-supervised learning (SimCLR, MoCo, BYOL), Vision Transformers and hybrids, Neural Architecture Search (NAS), compound scaling (EfficientNet).
Efficiency improvements (sparsity-aware ops, better depthwise separable designs), robustness and interpretability advances, and wider multimodal models (e.g., CLIP).

Conclusion

CNNs remain a foundational and highly efficient approach for spatial data due to their inductive biases (locality and translation structure), extensive architectural and training advances, and practical deployment strategies. Emerging methods (self-supervision, attention/transformers, NAS, efficiency-centric architectures) continue to broaden their capabilities and use cases.


Resources

But what is a convolution? (3Blue1Brown, 23:01, 3.6M views)

Convolutional Neural Networks (CNNs) explained (deeplizard, 8:37, 1.5M views)

Simple explanation of convolutional neural network | Deep Learning Tutorial 23 (Tensorflow & Python) (codebasics, 23:54, 1.4M views)


What is a Convolutional Neural Network (CNN)?

A Convolutional Neural Network (CNN or ConvNet) is a class of deep neural networks particularly well suited to data with a grid-like topology: most commonly 2D images (height × width), but also 1D signals (time series) and 2D time-frequency representations (audio spectrograms). CNNs rely on three central ideas that make them powerful and efficient for such data: local connectivity, parameter sharing (convolutions), and hierarchical feature learning. They have become the dominant approach in computer vision and are widely used in many other domains.

This article is a deep dive into CNNs: history, core concepts, mathematical foundations, architectural components, training and regularization techniques, applications, state of the art, practical implementation examples (PyTorch/TensorFlow), deployment tips, limitations, and future directions.

Table of contents

History and motivation
Core concepts and intuition
Local connectivity
Parameter sharing (convolution)
Hierarchical feature learning
Translational equivariance vs invariance
Mathematical foundations
Discrete convolution vs cross-correlation
Stride, padding, dilation
Output shape and parameter count
Receptive field
Depthwise separable and grouped convolutions
Common CNN building blocks
Convolutional layers
Nonlinearities
Pooling and downsampling
Normalization
Fully connected layers
Skip connections and residual blocks
Attention and gated mechanisms
Popular architectures (historical and modern)
Training, optimization, and regularization
Losses and metrics
Optimizers and learning rate schedules
Data augmentation & transfer learning
Regularization methods
Practical examples: code (PyTorch, Keras)
Applications across domains
Deployment, efficiency, and hardware
Interpretability and visualization methods
Limitations and failure modes
Current trends and future directions
References and recommended reading

History and motivation

1980s–1990s: Foundational ideas about local receptive fields and weight sharing came from computational neuroscience. Yann LeCun and collaborators developed the LeNet family of networks, which used convolutions for digit recognition and was one of the first successful applications of backpropagation to a real-world task (LeNet-5, 1998).
2012: AlexNet (Krizhevsky et al., 2012) demonstrated a dramatic improvement in ImageNet classification using deep convolutional networks on GPUs — this is widely regarded as the start of the deep learning revolution in computer vision.
2014–2016: VGG, Inception, ResNet, and variants pushed depth, width, and architectural innovations (e.g., residual connections) that greatly increased performance and stability.
2017–present: Continued innovations include efficient architectures (MobileNet, EfficientNet), further development of object detection frameworks (R-CNN family, YOLO) and segmentation networks (U-Net, DeepLab), and the arrival of Transformer-based models in vision (Vision Transformer).

Motivation: images have strong local correlations and structure; using dense fully connected layers is inefficient. Convolutions exploit locality and translation structure to dramatically reduce parameters and improve generalization.

Core concepts and intuition

Local connectivity

Instead of connecting each input pixel to every neuron in the next layer (as in fully connected layers), convolutional layers connect each neuron only to a local region of the input (its receptive field). This reflects the assumption that nearby pixels are more strongly correlated than distant ones.

Parameter sharing

A convolutional kernel (filter) of small spatial size (e.g., 3×3) is applied across the entire image. The same set of weights is used at every spatial location. This means the network detects the same feature regardless of location, greatly reducing the number of parameters.
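To make the saving concrete, here is a minimal sketch that counts weights for a dense layer and for a 3×3 convolution. The sizes (a 32×32 RGB input mapped to 16 feature maps of the same spatial size) and the function names are illustrative assumptions, not values from this article.

```python
def dense_weight_count(h, w, c_in, c_out):
    """Weights of a fully connected layer mapping an h*w*c_in input
    to an h*w*c_out output (biases ignored)."""
    return (h * w * c_in) * (h * w * c_out)


def conv_weight_count(k, c_in, c_out):
    """Weights of a k x k convolution: one small kernel per
    (input channel, output channel) pair, reused at every location."""
    return k * k * c_in * c_out


# Hypothetical example sizes: 32x32 RGB image, 16 output feature maps.
dense = dense_weight_count(32, 32, 3, 16)
conv = conv_weight_count(3, 3, 16)
print(f"dense: {dense:,} weights, conv: {conv:,} weights")
```

The convolution's weight count does not depend on the image size at all, which is the practical consequence of sharing one kernel across locations.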

Hierarchical feature learning

Early layers learn simple features (edges, colors, textures). Deeper layers compose these into higher-level features (parts, objects). The network thus builds a feature hierarchy.

Translational equivariance vs invariance

Equivariance: Convolution preserves spatial relationships — a translated input produces a translated feature map.
Invariance: Pooling, global pooling, and learned invariance produce representations that are insensitive to small translations or distortions — useful for classification.
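The difference between the two properties can be checked on a toy 1D signal. The following is a rough sketch in plain Python (the signal, kernel, and helper names are made up for illustration): shifting the input shifts the feature map by the same amount, while a global max over the feature map does not change.

```python
def cross_correlate_1d(x, k):
    """Valid 1D cross-correlation: slide k over x without padding."""
    n = len(x) - len(k) + 1
    return [sum(x[i + u] * k[u] for u in range(len(k))) for i in range(n)]


def shift_right(x, n):
    """Translate a list n steps to the right, filling with zeros."""
    return [0] * n + x[:len(x) - n]


signal = [0, 0, 1, 3, 1, 0, 0, 0, 0]   # toy input with one "bump"
kernel = [1, 2, 1]                      # toy smoothing filter

out = cross_correlate_1d(signal, kernel)
out_shifted = cross_correlate_1d(shift_right(signal, 2), kernel)

# Equivariance: the feature map moves with the input.
print(out_shifted == shift_right(out, 2))
# Invariance after global max pooling: the pooled value is unchanged.
print(max(out) == max(out_shifted))
```

The equality holds exactly here because the bump stays away from the borders; near the edges, padding and cropping make equivariance only approximate.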

Mathematical foundations

Discrete convolution vs cross-correlation

Most deep learning frameworks implement discrete cross-correlation, but call it convolution. Given input feature map X and kernel (filter) K, cross-correlation at location (i,j):

Y[i, j] = \sum_{u=0}^{K_h-1} \sum_{v=0}^{K_w-1} X[i+u, j+v] \, K[u, v]

where K_h and K_w are the kernel height and width.

True convolution flips the kernel; cross-correlation does not. The practical difference is negligible because kernels are learned.
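As a sketch of the distinction, the plain-Python code below implements the cross-correlation sum for a single-channel input with no padding and stride 1, and obtains true convolution by flipping the kernel first. The function names and the toy input are assumptions for illustration only.

```python
def cross_correlate_2d(x, k):
    """Valid 2D cross-correlation of a single-channel input x with kernel k."""
    kh, kw = len(k), len(k[0])
    h_out = len(x) - kh + 1
    w_out = len(x[0]) - kw + 1
    return [
        [
            sum(x[i + u][j + v] * k[u][v] for u in range(kh) for v in range(kw))
            for j in range(w_out)
        ]
        for i in range(h_out)
    ]


def flip_kernel(k):
    """Reverse the kernel along both spatial axes."""
    return [row[::-1] for row in k[::-1]]


def convolve_2d(x, k):
    """True convolution: cross-correlation with the flipped kernel."""
    return cross_correlate_2d(x, flip_kernel(k))


x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 0],
     [0, -1]]   # asymmetric kernel, so the flip matters

xcorr = cross_correlate_2d(x, k)
conv = convolve_2d(x, k)
print(xcorr)
print(conv)
```

For a fixed, hand-designed kernel the two results differ; for a learned kernel the network simply learns the flipped weights, which is why the naming mismatch rarely matters in practice.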

Stride, padding, dilation

Stride (s): step size with which the kernel moves. s > 1 reduces spatial resolution.
Padding (p): adding zeros (or other padding modes) around input to control output size and preserve borders.
Dilation (rate): spaced-apart sampling of kernel elements (atrous convolution) to increase receptive field without increasing parameters.

Output spatial size formula for 2D conv: given input size H_in × W_in, kernel size K_h × K_w, padding p, and stride s:

H_out = floor((H_in + 2p - K_h) / s) + 1
W_out = floor((W_in + 2p - K_w) / s) + 1
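The output-size formula translates directly into a small helper. This is a minimal sketch for one spatial dimension; the function name is hypothetical, and like the formula above it does not account for dilation.

```python
def conv_output_size(size_in, kernel, stride=1, padding=0):
    """Output length along one spatial dimension:
    floor((size_in + 2*padding - kernel) / stride) + 1."""
    return (size_in + 2 * padding - kernel) // stride + 1


# A 7x7 kernel with stride 2 and padding 3 halves a 224-pixel side.
print(conv_output_size(224, kernel=7, stride=2, padding=3))  # 112
# A 3x3 kernel with stride 1 and padding 1 preserves the size.
print(conv_output_size(56, kernel=3, stride=1, padding=1))   # 56
```

Apply it once for the height and once for the width when the kernel, stride, or padding differ per axis.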

Example: parameters and FLOPs

For a convolution with Cin input channels, Cout output channels, kernel K × K:

Number of weight parameters: Cout × Cin × K × K
Optional bias, one per output channel: + Cout
For one forward pass over spatial size Hout × Wout, number of multiply-adds (MACs): Hout × Wout × Cout × Cin × K × K

Example: a 3×3 conv with Cin=64 and Cout=128 that produces a 56×56 output feature map: params = 128×64×3×3 = 73,728 (approx 74k), or 73,856 with biases. MACs = 56×56×128×64×9 = 231,211,008 (approx 231M).
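The counts for this example can be reproduced with two short helpers. This is a sketch with hypothetical function names; it assumes a square K × K kernel and that the 56×56 size refers to the output feature map.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Learnable parameters of a k x k convolution."""
    weights = c_out * c_in * k * k
    return weights + (c_out if bias else 0)


def conv_macs(h_out, w_out, c_in, c_out, k):
    """Multiply-accumulate operations for one forward pass."""
    return h_out * w_out * c_out * c_in * k * k


print(conv_params(64, 128, 3, bias=False))  # 73728
print(conv_params(64, 128, 3, bias=True))   # 73856
print(conv_macs(56, 56, 64, 128, 3))        # 231211008
```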

Receptive field

The receptive field of a neuron in a deeper layer is the region of the input image that can affect its value. It grows with the number of layers, kernel sizes, strides, and dilations, and it determines how much context a neuron can "see." For a stack of 3×3 convolutions with stride 1 and no dilation, the receptive field grows by 2 pixels per layer (one on each side).

A quick formula for receptive field (RF): start with RF_0 = 1 and stride_total_0 = 1. For layer l with kernel size k_l and stride s_l:

RF_l = RF_{l-1} + (k_l - 1) * stride_total_{l-1}
stride_total_l = stride_total_{l-1} * s_l
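The recurrence can be run layer by layer. The sketch below assumes each layer is described by a hypothetical (kernel_size, stride) pair and, like the formula, ignores dilation.

```python
def receptive_field(layers):
    """Receptive field of the last layer of a conv/pool stack.

    layers: list of (kernel_size, stride) tuples, first layer first.
    """
    rf, stride_total = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * stride_total   # widen by (k - 1) input-space steps
        stride_total *= stride              # strides accumulate multiplicatively
    return rf


# Three 3x3 convs with stride 1: grows by 2 per layer.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# A stride-2 layer makes every later layer widen the field twice as fast.
print(receptive_field([(3, 1), (3, 2), (3, 1)]))  # 9
```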

Depthwise separable and grouped convolutions

Depthwise separable conv (MobileNet): factorizes a standard conv into a depthwise conv (one spatial filter per input channel) followed by a pointwise (1×1) conv that mixes channels. This needs Cin × K × K + Cin × Cout weights instead of Cout × Cin × K × K, giving far fewer parameters and FLOPs.
Grouped conv (AlexNet's early GPU split, ResNeXt): splits the channels into g groups that are convolved separately and then concatenated. This divides the weights and MACs by g and parallelizes well; ResNeXt uses the savings to widen the network at a similar cost.
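A quick way to compare these variants is to count weights for the same layer shape. The helpers below are an illustrative sketch (hypothetical names, biases ignored, square kernels assumed), using Cin=64, Cout=128, K=3 as example sizes.

```python
def standard_conv_weights(c_in, c_out, k):
    """Every output channel has a k x k filter for every input channel."""
    return c_out * c_in * k * k


def depthwise_separable_weights(c_in, c_out, k):
    """Depthwise k x k filter per input channel, then a 1x1 pointwise mix."""
    return c_in * k * k + c_in * c_out


def grouped_conv_weights(c_in, c_out, k, groups):
    """Each output channel only sees the c_in / groups channels of its group."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return c_out * (c_in // groups) * k * k


print(standard_conv_weights(64, 128, 3))            # 73728
print(depthwise_separable_weights(64, 128, 3))      # 8768
print(grouped_conv_weights(64, 128, 3, groups=4))   # 18432
```

With groups=1 the grouped count equals the standard count, so a standard convolution is the single-group special case.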

Common CNN building blocks

Convolutional layer

Kernel size: 1×1, 3×3, 5×5, etc.
Number of filters: channels in output.
Stride, padding, dilation options.

1×1 conv is used for channel mixing and dimension reduction/increase (bottleneck).

Nonlinear activation functions

ReLU (Rectified Linear Unit): max(0, x) — popular for simplicity and gradient stability.
Leaky ReLU, PReLU: variants allowing small negative slope.
ELU: smooth for negative inputs, which pushes mean activations toward zero.
SELU: a scaled ELU designed for self-normalizing networks.
GELU: used in Transformers, smooth nonlinearity.
Softmax: used at the output for multi-class classification.
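For reference, several of these functions fit in a few lines of plain Python. This is a scalar sketch using only the standard library; the 0.01 default slope for Leaky ReLU is a common choice assumed here, and the GELU shown is the exact erf-based form rather than the tanh approximation.

```python
import math


def relu(x):
    """max(0, x)."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope."""
    return x if x > 0 else slope * x


def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0), leaky_relu(-2.0), gelu(0.0))
print(softmax([1.0, 2.0, 3.0]))
```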

Pooling

Max pooling: retains the maximum activation in a region — adds local translation invariance.
Average pooling: averages activations.
Global average pooling: replaces FC layers in classification heads to reduce parameters.

Pooling reduces spatial resolution; strided convolutions can also downsample.
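The pooling operations above can be sketched on a single-channel feature map stored as a list of rows. The function names and the 4×4 example values are assumptions for illustration.

```python
def max_pool_2d(x, size=2, stride=2):
    """Keep the maximum of each size x size window."""
    h_out = (len(x) - size) // stride + 1
    w_out = (len(x[0]) - size) // stride + 1
    return [
        [
            max(
                x[i * stride + u][j * stride + v]
                for u in range(size)
                for v in range(size)
            )
            for j in range(w_out)
        ]
        for i in range(h_out)
    ]


def global_average_pool(x):
    """Collapse a whole feature map to its mean value."""
    values = [v for row in x for v in row]
    return sum(values) / len(values)


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]

print(max_pool_2d(fmap))          # [[4, 2], [2, 8]]
print(global_average_pool(fmap))  # 2.6875
```

In a classification head, global average pooling is applied per channel, turning a C × H × W tensor into a length-C vector.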

Normalization

Batch Normalization (BatchNorm): normalizes activations across a mini-batch; speeds training and stabilizes learning.
LayerNorm, InstanceNorm, GroupNorm: alternatives useful in small-batch or style-transfer settings.
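As a rough sketch of what BatchNorm computes in training mode, the function below normalizes the activations of one channel gathered across a mini-batch, then applies the learnable scale and shift. The name, the eps default, and the omission of running statistics (used at inference time) are simplifying assumptions.

```python
import math


def batch_norm_channel(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations across a mini-batch.

    values: all activations of this channel in the batch, flattened.
    gamma, beta: learnable scale and shift.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


normalized = batch_norm_channel([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

LayerNorm, InstanceNorm, and GroupNorm use the same normalize-then-scale step but compute the mean and variance over different axes, which is why they do not depend on the batch size.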

Fully connected layers

Historically used at the end for classification. Many modern architectures replace large FC layers with global pooling + small FC or even no FC at all.

Residual connections (skip connections)

Introduced in ResNet, residual connections (identity or projection skips) allow gradients to flow more easily and enable training much deeper networks.

Residual block pseudocode: y = x + F(x) where F(x) is a small conv stack (e.g., two 3×3 convs). This architecture mitigates vanishing gradients and allows training hundreds of layers.
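The pseudocode can be turned into a tiny runnable sketch. Here activations are plain lists and F is any function that returns a list of the same length; both choices are simplifications assumed for illustration, standing in for feature maps and a conv stack.

```python
def residual_block(x, f):
    """Compute y = x + F(x) elementwise."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for an identity skip")
    return [xi + fi for xi, fi in zip(x, fx)]


x = [1.0, 2.0, 3.0]

# If F outputs zeros, the block is exactly the identity.
print(residual_block(x, lambda v: [0.0 for _ in v]))
# Otherwise F only has to learn a correction on top of x.
print(residual_block(x, lambda v: [0.1 * vi for vi in v]))
```

When the shapes of x and F(x) differ (for example after a stride-2 conv), a projection skip replaces x with a learned 1×1 convolution of x before the addition.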

Dropout, spatial dropout

Dropout randomly zeroes activations during training to reduce co-adaptation. Spatial dropout zeros entire feature maps, which is often more appropriate for conv layers.
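The two variants differ only in what gets dropped as a unit. The sketch below uses the common "inverted dropout" convention, in which surviving activations are scaled by 1/(1 - p) during training so nothing changes at inference; that convention, the function names, and the rng parameter are assumptions, not details from the text above. It requires 0 <= p < 1.

```python
import random


def dropout(values, p=0.5, rng=random):
    """Zero each activation independently with probability p."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def spatial_dropout(feature_maps, p=0.5, rng=random):
    """Zero entire feature maps (channels) with probability p."""
    keep = 1.0 - p
    out = []
    for fmap in feature_maps:
        if rng.random() < keep:
            out.append([[v / keep for v in row] for row in fmap])
        else:
            out.append([[0.0 for _ in row] for row in fmap])
    return out


rng = random.Random(0)                       # seeded for repeatability
print(dropout([1.0] * 8, p=0.5, rng=rng))
maps = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(4)]
print(spatial_dropout(maps, p=0.5, rng=rng))
```

Because neighbouring pixels in a feature map are strongly correlated, dropping single activations removes little information; dropping a whole map forces the network not to rely on any one channel.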

Attention mechanisms

Attention (non-local blocks, self-attention) has been incorporated into CNNs to model long-range dependencies that convolutions might miss.

Popular architectures (historical and modern)

LeNet-5 (1998): early convnet for digit recognition.
AlexNet (2012): revived CNNs for ImageNet; used ReLU, dropout, ...
