What is a convolutional neural network?

Convolutional Neural Networks (CNNs): Summary

A Convolutional Neural Network (CNN or ConvNet) is a class of deep neural networks designed for grid-like data (most notably 2D images, also 1D time series and 2D spectrograms). CNNs rely on three core ideas (local connectivity, parameter sharing via convolutions, and hierarchical feature learning) to efficiently learn spatially structured features, and they are the dominant approach in computer vision and many other domains.

History & motivation

  • Early neuroscience-inspired ideas and LeNet (1990s) established local receptive fields and weight sharing.
  • AlexNet (2012) catalyzed the deep-learning revolution for vision; later advances include VGG, Inception, ResNet, DenseNet, MobileNet, EfficientNet, and the rise of Vision Transformers (ViT).
  • Motivation: images have strong local correlations; convolutions reduce parameters and improve generalization compared to dense layers.

Core concepts

  • Local connectivity: each neuron sees a local patch (receptive field) rather than the whole image.
  • Parameter sharing: a small kernel (e.g., 3×3) is applied across locations; the same weights detect features everywhere.
  • Hierarchical features: early layers learn edges/textures; deeper layers form parts and objects.
  • Translational equivariance vs invariance: convolutions are equivariant to translations; pooling/global pooling and learned invariances produce translation-insensitive representations useful for classification.

Mathematical foundations (concise)

  • Convolution vs cross-correlation: most frameworks implement cross-correlation (no kernel flip), but learnable kernels make the distinction negligible.
  • Stride, padding, dilation: stride reduces spatial resolution, padding controls borders, dilation (atrous) increases receptive field without extra params.
  • Output size (2D): H_out = floor((H_in + 2p - K_h) / s) + 1 (similarly for W_out).
  • Parameters & MACs: params = C_out × C_in × K × K (+ C_out bias); MACs ≈ H_out × W_out × C_out × C_in × K × K.
  • Receptive field: grows with kernel sizes, strides and dilations. RF_l = RF_{l-1} + (k_l - 1) × stride_total_{l-1}; stride_total accumulates multiplicatively.
  • Efficient variants: depthwise separable conv (depthwise + 1×1 pointwise) and grouped conv reduce cost and parameters.

Common building blocks

  • Convolutional layers: 1×1 for channel mixing, 3×3 common for spatial pattern learning.
  • Nonlinearities: ReLU, LeakyReLU, PReLU, ELU, GELU, Softmax (output).
  • Pooling: max/average pooling and global average pooling to downsample and encourage invariance.
  • Normalization: BatchNorm (common), LayerNorm, GroupNorm, InstanceNorm.
  • Skip connections / residual blocks: identity or projection skips (ResNet) that enable very deep models.
  • Regularization: Dropout, spatial dropout, label smoothing, weight decay.
  • Attention / non-local modules: capture long-range dependencies beyond local conv receptive fields.

Representative architectures

  • LeNet, AlexNet, VGG, Inception (GoogLeNet), ResNet, DenseNet
  • MobileNet, ShuffleNet, EfficientNet for efficiency and mobile deployment
  • YOLO / SSD / Faster R-CNN (detection), U-Net / DeepLab (segmentation), Mask R-CNN (instance seg)
  • Vision Transformer (ViT) and hybrid CNN-Transformer models

Training, optimization & regularization

  • Losses: cross-entropy for classification, smooth L1 + classification for detection, Dice/IoU for segmentation.
  • Optimizers: SGD with momentum (often best generalization), Adam/AdamW for faster convergence; use weight decay appropriately.
  • LR schedules: step decay, cosine annealing, cyclical LR, warmup are critical for performance.
  • Data augmentation & transfer learning: flips, crops, color jitter, Mixup/CutMix, AutoAugment; pretraining (ImageNet or self-supervised) + fine-tuning is common.
  • Regularization: dropout, weight decay, BatchNorm-induced regularization, early stopping.

Practical examples & tips

  • Simple CNNs can be implemented in PyTorch or Keras; common patterns: conv → norm → activation → pool → FC.
  • Hyperparameter tips: start from pretrained models for small data, use mixed precision (FP16), scale learning rate with batch size, use strong augmentation and weight decay (~1e-4), prefer SGD with momentum for best generalization, but AdamW is a robust default.
  • Compute output shapes and params using the formulas above; calculate receptive field when context matters.

Applications

  • Computer vision: classification, detection, segmentation, super-resolution, style transfer, face recognition, pose estimation.
  • Medical imaging, remote sensing, audio (spectrograms), time-series (1D convs), some NLP tasks and multimodal models.

Deployment & efficiency

  • Use GPUs/TPUs for training; optimize inference for edge with quantization (8-bit), pruning, knowledge distillation, and efficient architectures.
  • Deployment toolchains: TensorRT, ONNX, TensorFlow Lite, Core ML, OpenVINO.

Interpretability

  • Visualize first-layer filters, activation maximization, feature inversion; use saliency maps, Grad-CAM, SHAP, LIME for explanations.

Limitations & failure modes

  • Need lots of labeled data (mitigated by transfer/self-supervised learning).
  • Vulnerable to adversarial examples and dataset biases; limited interpretability in deep layers.
  • Local convolutions struggle with very long-range dependencies (addressed by dilations, large kernels, attention/transformers).

Current trends & future directions

  • Self-supervised learning (SimCLR, MoCo, BYOL), Vision Transformers and hybrids, Neural Architecture Search (NAS), compound scaling (EfficientNet).
  • Efficiency improvements (sparsity-aware ops, better depthwise separable designs), robustness and interpretability advances, and wider multimodal models (e.g., CLIP).

Conclusion

CNNs remain a foundational and highly efficient approach for spatial data due to their inductive biases (locality and translation structure), extensive architectural and training advances, and practical deployment strategies. Emerging methods (self-supervision, attention/transformers, NAS, efficiency-centric architectures) continue to broaden their capabilities and use cases.

Deep Article

What is a Convolutional Neural Network (CNN)?

A Convolutional Neural Network (CNN or ConvNet) is a class of deep neural networks particularly well-suited to processing data with a grid-like topology: most commonly 2D images (height × width), but also 1D signals (time series) and 2D time-frequency representations (audio spectrograms). CNNs leverage three central ideas that make them powerful and efficient for such data: local connectivity, parameter sharing (convolutions), and hierarchical feature learning. They have become the dominant approach in computer vision and are widely used in many other domains.

This article is a deep dive into CNNs: history, core concepts, mathematical foundations, architectural components, training and regularization techniques, applications, state of the art, practical implementation examples (PyTorch/TensorFlow), deployment tips, limitations, and future directions.


Table of contents

  • History and motivation
  • Core concepts and intuition
      • Local connectivity
      • Parameter sharing (convolution)
      • Hierarchical feature learning
      • Translational equivariance vs invariance
  • Mathematical foundations
      • Discrete convolution vs cross-correlation
      • Stride, padding, dilation
      • Output shape and parameter count
      • Receptive field
      • Depthwise and separable convolutions
  • Common CNN building blocks
      • Convolutional layers
      • Nonlinearities
      • Pooling and downsampling
      • Normalization
      • Fully connected layers
      • Skip connections and residual blocks
      • Attention and gated mechanisms
  • Popular architectures (historical and modern)
  • Training, optimization, and regularization
      • Losses and metrics
      • Optimizers and learning rate schedules
      • Data augmentation & transfer learning
      • Regularization methods
  • Practical examples: code (PyTorch, Keras)
  • Applications across domains
  • Deployment, efficiency, and hardware
  • Interpretability and visualization methods
  • Limitations and failure modes
  • Current trends and future directions
  • References and recommended reading

History and motivation

  • 1980s–1990s: Foundational ideas in local receptive fields and weight sharing came from computational neuroscience. Yann LeCun and collaborators developed LeNet (1990s) that used convolutions for digit recognition, one of the first successful applications of backpropagation to real-world tasks (LeNet-5, 1998).
  • 2012: AlexNet (Krizhevsky et al., 2012) demonstrated a dramatic improvement in ImageNet classification using deep convolutional networks on GPUs — this is widely regarded as the start of the deep learning revolution in computer vision.
  • 2014–2016: VGG, Inception, ResNet, and variants pushed depth, width, and architectural innovations (e.g., residual connections) that greatly increased performance and stability.
  • 2017–present: Continued innovations include efficient architectures (MobileNet, EfficientNet), further development of object detection frameworks (R-CNN family, YOLO) and segmentation networks (U-Net, DeepLab) that first appeared in the mid-2010s, and the arrival of Transformer-based models in vision (Vision Transformer).

Motivation: images have strong local correlations and structure; using dense fully connected layers is inefficient. Convolutions exploit locality and translation structure to dramatically reduce parameters and improve generalization.


Core concepts and intuition

Local connectivity

Instead of connecting each input pixel to every neuron in the next layer (as in fully connected layers), convolutional layers connect each neuron only to a local region of the input (its receptive field). This reflects the assumption that nearby pixels are more strongly correlated than distant ones.

Parameter sharing

A convolutional kernel (filter) of small spatial size (e.g., 3×3) is applied across the entire image. The same set of weights is used at every spatial location. This means the network detects the same feature regardless of location, greatly reducing the number of parameters.
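To make the parameter saving concrete, here is a rough back-of-the-envelope sketch. The image size (224 × 224 RGB) and the layer widths are illustrative assumptions, not values from this article, and the two layers are not equivalent in what they compute: the dense layer yields 64 numbers, the conv layer 64 feature maps. The point is only how the weight counts scale.

```python
def dense_params(h, w, c_in, units):
    """Weights in a fully connected layer where every unit sees every input value."""
    return h * w * c_in * units


def conv_params(c_in, c_out, k):
    """Weights in a k x k convolution; note that image size does not appear."""
    return c_out * c_in * k * k


# Assumed example input: a 224 x 224 RGB image.
h, w, c_in = 224, 224, 3
dense = dense_params(h, w, c_in, units=64)  # 64 units, each connected to all pixels
conv = conv_params(c_in, c_out=64, k=3)     # 64 shared 3 x 3 filters

print(f"dense: {dense:,} weights, conv: {conv:,} weights")
```

Because the kernel is shared across locations, the convolution's weight count stays the same if the image grows, while the dense layer's count grows with every added pixel.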

Hierarchical feature learning

Early layers learn simple features (edges, colors, textures). Deeper layers compose these into higher-level features (parts, objects). The network thus builds a feature hierarchy.

Translational equivariance vs invariance

  • Equivariance: Convolution preserves spatial relationships — a translated input produces a translated feature map.
  • Invariance: Pooling, global pooling, and learned invariance produce representations that are insensitive to small translations or distortions — useful for classification.
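The difference between the two properties can be seen in a tiny 1D sketch. The signal and the difference filter below are made-up illustrations, and the input is zero at its borders so that shifting it loses nothing; with real images, borders and strides make equivariance only approximate.

```python
def cross_correlate_1d(x, k):
    """Valid (no padding) 1D cross-correlation with stride 1."""
    n = len(x) - len(k) + 1
    return [sum(x[i + u] * k[u] for u in range(len(k))) for i in range(n)]


signal = [0, 0, 1, 3, 2, 0, 0, 0]
shifted = [0] + signal[:-1]   # the same pattern moved one step to the right
kernel = [1, 0, -1]           # illustrative difference filter

out = cross_correlate_1d(signal, kernel)
out_shifted = cross_correlate_1d(shifted, kernel)

# Equivariance: the response pattern moves by the same one step.
equivariant = out_shifted[1:] == out[:-1]
# Invariance after global max pooling: the pooled value does not change at all.
invariant = max(out_shifted) == max(out)

print(out, out_shifted, equivariant, invariant)
```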

Mathematical foundations

Discrete convolution vs cross-correlation

Most deep learning frameworks implement discrete cross-correlation but call it convolution. Given an input feature map X and a kernel (filter) K of size k_h × k_w, the cross-correlation at location (i, j) is:

$$Y[i, j] = \sum_{u=0}^{k_h - 1} \sum_{v=0}^{k_w - 1} X[i+u,\, j+v] \, K[u, v]$$

True convolution flips the kernel; cross-correlation does not. The practical difference is negligible because kernels are learned.
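A minimal pure-Python sketch of the two operations, assuming a single channel, no padding, and stride 1 (function names are illustrative, not a framework API). It shows that true convolution is just cross-correlation with the kernel rotated by 180 degrees.

```python
def cross_correlate2d(x, k):
    """Valid 2D cross-correlation: Y[i][j] = sum_u sum_v X[i+u][j+v] * K[u][v]."""
    kh, kw = len(k), len(k[0])
    h_out, w_out = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + u][j + v] * k[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(w_out)]
            for i in range(h_out)]


def flip(k):
    """Rotate the kernel by 180 degrees (reverse rows and columns)."""
    return [row[::-1] for row in k[::-1]]


def convolve2d(x, k):
    """True convolution: cross-correlation with the flipped kernel."""
    return cross_correlate2d(x, flip(k))


x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 0],
     [0, -1]]

corr = cross_correlate2d(x, k)
conv = convolve2d(x, k)
print(corr, conv)
```

For a fixed, hand-designed kernel the two results differ. For a learned kernel, training can simply arrive at the flipped weights, which is why the distinction rarely matters in practice.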

Stride, padding, dilation

  • Stride (s): step size with which the kernel moves. s > 1 reduces spatial resolution.
  • Padding (p): adding zeros (or other padding modes) around input to control output size and preserve borders.
  • Dilation (rate): spaced-apart sampling of kernel elements (atrous convolution) to increase receptive field without increasing parameters.

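Output spatial size for a 2D convolution: given input size H_in × W_in, kernel size K_h × K_w, padding p (on each side), and stride s, with dilation 1:

$$H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2p - K_h}{s} \right\rfloor + 1, \qquad W_{\text{out}} = \left\lfloor \frac{W_{\text{in}} + 2p - K_w}{s} \right\rfloor + 1$$

With dilation d, replace each kernel size K with the effective size d(K − 1) + 1.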
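The output-size formula is easy to check with a small helper. This is a sketch for one spatial dimension; the dilation argument `d` is an addition to the formula as stated above, using the standard effective kernel size d(K − 1) + 1, and the example layer settings are illustrative assumptions.

```python
def conv_output_size(n_in, k, s=1, p=0, d=1):
    """Output length along one dimension: floor((n_in + 2p - k_eff) / s) + 1."""
    k_eff = d * (k - 1) + 1            # effective kernel size under dilation d
    return (n_in + 2 * p - k_eff) // s + 1


# 7x7 conv, stride 2, padding 3 on a 224-pixel side
print(conv_output_size(224, k=7, s=2, p=3))
# 3x3 conv, stride 1, padding 1 keeps the size ("same" padding)
print(conv_output_size(56, k=3, s=1, p=1))
# 3x3 conv with dilation 2 needs padding 2 to keep the size
print(conv_output_size(32, k=3, p=2, d=2))
```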

Example: parameters and FLOPs

For a convolution with C_in input channels, C_out output channels, and a K × K kernel:

  • Number of weight parameters: C_out × C_in × K × K
  • Optional bias, one per output channel: + C_out
  • Multiply-adds (MACs) for one forward pass over an output of spatial size H_out × W_out: H_out × W_out × C_out × C_in × K × K (one MAC is commonly counted as two FLOPs)

Example: a 3×3 conv with C_in = 64 and C_out = 128 that produces a 56×56 output feature map (stride 1, padding 1) has 128 × 64 × 3 × 3 = 73,728 weights (about 74k) and needs 56 × 56 × 128 × 64 × 9 = 231,211,008 MACs (about 231M).
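The same arithmetic as a small helper, as a sketch under the assumption that the 56×56 size refers to the output feature map (true for stride 1 with padding 1). The function name is illustrative.

```python
def conv_cost(c_in, c_out, k, h_out, w_out, bias=True):
    """Return (parameter count, multiply-adds) for one k x k conv layer."""
    weights = c_out * c_in * k * k
    params = weights + (c_out if bias else 0)
    macs = h_out * w_out * weights      # one MAC per weight per output position
    return params, macs


params, macs = conv_cost(c_in=64, c_out=128, k=3, h_out=56, w_out=56, bias=False)
print(f"params: {params:,}  MACs: {macs:,}")
```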

Receptive field

The receptive field of a neuron in a deeper layer is the region of the input image that can affect its value, so it determines how much context the neuron can "see." It grows with depth, kernel size, stride, and dilation. For a stack of 3×3 convolutions with stride 1, the receptive field grows by 2 pixels per layer (one on each side): 3, 5, 7, and so on.

A quick recurrence for the receptive field (RF), ignoring dilation: start with RF_0 = 1 and stride_total_0 = 1. For layer l with kernel size k_l and stride s_l:

$$\mathrm{RF}_l = \mathrm{RF}_{l-1} + (k_l - 1)\,\mathrm{stride\_total}_{l-1}$$
$$\mathrm{stride\_total}_l = \mathrm{stride\_total}_{l-1} \cdot s_l$$
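The recurrence translates directly into a loop. This sketch follows the formula as given (no dilation) and takes a list of (kernel, stride) pairs ordered from input to output; the example stacks are illustrative.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
    rf, stride_total = 1, 1
    for k, s in layers:
        rf += (k - 1) * stride_total   # growth is scaled by the stride accumulated so far
        stride_total *= s
    return rf


three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])
conv_pool_conv = receptive_field([(3, 1), (2, 2), (3, 1)])
print(three_convs, conv_pool_conv)
```

The second stack shows why downsampling matters for context: after a stride-2 pooling layer, each later 3×3 conv widens the receptive field by 4 input pixels instead of 2.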

Depthwise separable and grouped convolutions

  • Depthwise separable conv (MobileNet): factorizes a standard conv into a depthwise conv (one spatial filter per input channel) followed by a pointwise (1×1) conv that mixes channels. This needs far fewer parameters and MACs: C_in × K × K + C_in × C_out weights instead of C_out × C_in × K × K.
  • Grouped conv (AlexNet's early two-GPU split, ResNeXt): splits channels into G groups that are convolved separately and then concatenated. Parameters and MACs drop by a factor of G and the groups can run in parallel; ResNeXt uses many groups to keep capacity high at a similar cost.
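A rough comparison of weight counts for the three variants, as a sketch that ignores biases. The channel sizes and the group count are illustrative assumptions.

```python
def standard_conv_params(c_in, c_out, k):
    return c_out * c_in * k * k


def depthwise_separable_params(c_in, c_out, k):
    depthwise = c_in * k * k     # one k x k filter per input channel
    pointwise = c_in * c_out     # 1 x 1 conv that mixes channels
    return depthwise + pointwise


def grouped_conv_params(c_in, c_out, k, groups):
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # Each group maps c_in/groups input channels to c_out/groups output channels.
    return groups * (c_out // groups) * (c_in // groups) * k * k


c_in, c_out, k = 64, 128, 3
standard = standard_conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
grouped = grouped_conv_params(c_in, c_out, k, groups=4)
print(standard, separable, grouped)
```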

Common CNN building blocks

Convolutional layer

  • Kernel size: 1×1, 3×3, 5×5, etc.
  • Number of filters: channels in output.
  • Stride, padding, dilation options.

1×1 conv is used for channel mixing and dimension reduction/increase (bottleneck).
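A 1×1 convolution applies the same small matrix to the channel vector at every pixel. The sketch below uses nested lists in [channel][row][column] layout and made-up weights to show channel mixing and a reduction from 3 channels to 2; it omits bias and is not meant as an efficient implementation.

```python
def conv1x1(x, weights):
    """x: [C_in][H][W] nested lists; weights: [C_out][C_in]. Returns [C_out][H][W]."""
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w_row[c] * x[c][i][j] for c in range(c_in))
              for j in range(width)]
             for i in range(height)]
            for w_row in weights]


x = [[[1, 2], [3, 4]],        # channel 0
     [[10, 20], [30, 40]],    # channel 1
     [[0, 1], [0, 1]]]        # channel 2
weights = [[1, 0, 0],         # output 0 copies channel 0
           [0.5, 0.5, 0]]     # output 1 averages channels 0 and 1

y = conv1x1(x, weights)
print(y)
```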

Nonlinear activation functions

  • ReLU (Rectified Linear Unit): max(0, x) — popular for simplicity and gradient stability.
  • Leaky ReLU, PReLU: variants allowing small negative slope.
  • ELU, SELU: smooth for negative inputs; ELU pushes mean activations toward zero, and SELU is scaled so that activations self-normalize.
  • GELU: used in Transformers, smooth nonlinearity.
  • Softmax: used at the output for multi-class classification.
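For reference, several of these functions written out as scalar Python, as a sketch. The default slope and alpha values are common choices rather than values from this article, and GELU is written in its exact form using the error function.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def gelu(x):
    """x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


print([relu(v) for v in (-2.0, 3.0)], softmax([1.0, 2.0, 3.0]))
```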

Pooling

  • Max pooling: retains the maximum activation in a region — adds local translation invariance.
  • Average pooling: averages activations.
  • Global average pooling: replaces FC layers in classification heads to reduce parameters.

Pooling downsamples spatial resolution; strided convolutions can downsample as well.
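A minimal sketch of max pooling and global average pooling on a single feature map, using plain nested lists. The 4×4 values are made up for illustration.

```python
def max_pool2d(x, size=2, stride=2):
    """Max over each size x size window of a 2D feature map (no padding)."""
    h_out = (len(x) - size) // stride + 1
    w_out = (len(x[0]) - size) // stride + 1
    return [[max(x[i * stride + u][j * stride + v]
                 for u in range(size) for v in range(size))
             for j in range(w_out)]
            for i in range(h_out)]


def global_avg_pool(x):
    """Collapse a whole feature map to one number: its mean."""
    values = [v for row in x for v in row]
    return sum(values) / len(values)


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]

pooled = max_pool2d(fmap)
gap = global_avg_pool(fmap)
print(pooled, gap)
```

Applied per channel, global average pooling turns a C × H × W tensor into a length-C vector, which is why it can stand in for large fully connected layers in a classification head.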

Normalization

  • Batch Normalization (BatchNorm): normalizes each channel's activations using statistics computed over the mini-batch (and, in conv layers, over spatial positions); speeds training and stabilizes learning.
  • LayerNorm, InstanceNorm, GroupNorm: alternatives useful in small-batch or style-transfer settings.
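The core BatchNorm computation for a single channel can be sketched in a few lines. This covers the training-mode forward pass only; it assumes the channel's activations have already been gathered across the batch and all spatial positions into one flat list, and it leaves out the running statistics used at inference time.

```python
def batch_norm_channel(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations, then apply the learned scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # biased (population) variance
    inv_std = 1.0 / (var + eps) ** 0.5
    return [gamma * (v - mean) * inv_std + beta for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm_channel(activations)
print(normalized)
```

The other normalization layers differ mainly in which values are pooled to compute the mean and variance: over all channels of one sample (LayerNorm), one channel of one sample (InstanceNorm), or a group of channels of one sample (GroupNorm).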

Fully connected layers

Historically used at the end for classification. Many modern architectures replace large FC layers with global pooling + small FC or even no FC at all.

Residual connections (skip connections)

Introduced in ResNet, residual connections (identity or projection skips) allow gradients to flow more easily and enable training much deeper networks.

A residual block computes y = x + F(x), where F is a small conv stack (e.g., two 3×3 convs). Because the identity path passes gradients through unchanged, this design mitigates vanishing gradients and makes networks with hundreds of layers trainable.
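The skip connection itself is only an elementwise addition. In the toy sketch below, `toy_f` is a hypothetical stand-in for the conv stack F (a scale followed by ReLU on a flat list), chosen only to keep the example short.

```python
def residual_block(x, f):
    """y = x + F(x): add the block's input back onto the output of the inner function."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("an identity skip requires F(x) to have the same shape as x")
    return [xi + fi for xi, fi in zip(x, fx)]


def toy_f(x, weight=0.1):
    """Hypothetical stand-in for the conv stack: scale, then ReLU."""
    return [max(0.0, weight * v) for v in x]


x = [1.0, -2.0, 3.0]
y = residual_block(x, toy_f)
identity = residual_block(x, lambda v: [0.0] * len(v))
print(y, identity)
```

When F outputs zeros the block reduces to the identity, so a deep stack of such blocks can start out close to a shallow network. When shapes differ (for example after a stride or a channel change), a projection skip such as a 1×1 conv is applied to x instead of the plain identity.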

Dropout, spatial dropout

Dropout randomly zeroes activations during training to reduce co-adaptation. Spatial dropout zeros entire feature maps, which is often more appropriate for conv layers.
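A sketch of both variants in plain Python. It assumes the "inverted dropout" convention, in which surviving activations are scaled by 1/(1 − p) during training so that nothing needs to change at inference time. That scaling is not stated in the text above, and the function names are illustrative.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Zero each activation with probability p; scale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def spatial_dropout(feature_maps, p=0.5, training=True, rng=random):
    """Drop whole channels: feature_maps is a list of 2D maps, one per channel."""
    if not training or p == 0.0:
        return [[row[:] for row in fm] for fm in feature_maps]
    keep = 1.0 - p
    out = []
    for fm in feature_maps:
        scale = 1.0 / keep if rng.random() < keep else 0.0   # one decision per channel
        out.append([[v * scale for v in row] for row in fm])
    return out


rng = random.Random(0)   # seeded only to make the demo repeatable
print(dropout([1.0] * 8, p=0.5, rng=rng))
```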

Attention mechanisms

Attention (non-local blocks, self-attention) has been incorporated into CNNs to model long-range dependencies that convolutions might miss.


Popular architectures (historical and modern)

  • LeNet-5 (1998): early convnet for digit recognition.
  • AlexNet (2012): revived CNNs for ImageNet; used ReLU, dropout, ...
