- Convolutional Neural Networks
- Convolution Operation
- Motivation
- Pooling
- Convolution and Pooling as an Infinitely Strong Prior
- Variants of the Basic Convolution Function
- Types of Convolution
- Structured Outputs
- Data Types
- Efficient Convolution Algorithms
- Random and Unsupervised Features
Convolutional networks, also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known grid-like topology.
The convolution operation combines the input with a kernel (a set of weights) to produce an output map:

- The 1-D discrete convolution operation is given by:

  $$s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a)$$

- The 2-D discrete convolution operation is given by:

  $$S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n)\, K(i - m, j - n)$$

- The 2-D convolution operation can be visualized as the kernel sliding over the input, computing a weighted sum at each position.

- The convolution operation is commutative.
- The commutative property arises because the kernel is flipped relative to the input.

- The analogous function that does not flip the kernel is called the cross-correlation operation.
- Cross-correlation is not commutative.
- Convolution operation:

  $$S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i - m, j - n)\, K(m, n)$$

- Cross-correlation operation:

  $$S(i, j) = (K \star I)(i, j) = \sum_m \sum_n I(i + m, j + n)\, K(m, n)$$

- A 1-D convolution operation can be represented as a matrix-vector product.
- The kernel matrix is obtained by composing the kernel weights into a Toeplitz matrix.
- A Toeplitz matrix has the property that the values along each diagonal are constant.
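The Toeplitz construction can be sketched in NumPy; the signal and kernel values below are arbitrary illustrative choices:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input signal (illustrative)
w = np.array([1.0, 0.0, -1.0])            # kernel (illustrative)

n, k = len(x), len(w)
# Each row of T holds the flipped kernel shifted right by one position,
# so every diagonal of T is constant (the Toeplitz property).
T = np.zeros((n - k + 1, n))
for i in range(n - k + 1):
    T[i, i:i + k] = w[::-1]

# Matrix-vector product reproduces the "valid" 1-D convolution.
assert np.allclose(T @ x, np.convolve(x, w, mode="valid"))
```
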

- To extend the concept of a Toeplitz matrix to 2-D input, we need to convert the 2-D input into a 1-D vector.
- The kernel is composed into a matrix as before, but this time the result is a block-circulant matrix.
- A circulant matrix is a special case of a Toeplitz matrix where each row is equal to the row above shifted by one element.
- A matrix that is circulant with respect to its sub-matrices is called a block-circulant matrix.
- If each of the sub-matrices is itself circulant, the matrix is called a doubly block-circulant matrix.

- In traditional Neural Networks, every output unit interacts with every input unit.
- Convolutional networks, however, typically have sparse interactions, achieved by making the kernel smaller than the input. Sparse interactions:
  - Reduce memory requirements
  - Improve statistical efficiency
- In a deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input.

- Parameter sharing refers to using the same parameter for more than one function in a model.
- In a convolutional neural net, each member of the kernel is used at every position of the input, i.e. the parameters used to compute different output units are tied together (their values are always the same).
- Sparse interactions and parameter sharing combined can greatly improve the efficiency of a linear function for detecting edges in an image.
- Parameter sharing in a convolutional network provides equivariance to translation.

- A translation of the input image results in a corresponding translation of the output map.
- The convolution operation by itself is not equivariant to changes in scale or rotation.
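Translation equivariance can be checked numerically. In this small sketch the example values are assumptions; the signal is zero at the boundaries, so a circular `np.roll` acts like a pure translation:

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0, 2.0, 0.0, 0.0])  # zero boundary (illustrative)
w = np.array([1.0, -1.0])                     # kernel (illustrative)

conv_then_shift = np.roll(np.convolve(x, w, mode="full"), 1)
shift_then_conv = np.convolve(np.roll(x, 1), w, mode="full")

# Equivariance: shifting then convolving equals convolving then shifting.
assert np.allclose(conv_then_shift, shift_then_conv)
```
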

- A typical convolutional layer consists of three stages:
  - Convolution
  - Activation (detector stage)
  - Pooling
- A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs.
- Common summary statistics include the maximum, the average, and a weighted average over a rectangular neighbourhood.

- Pooling makes the representation approximately invariant to small translations of the input: small shifts in the input do not cause large changes in the output map.
- This allows detection of a particular feature when we only care about its existence, not its exact position in the image.
- Pooling reduces the input size to the next layer, in turn reducing the number of computations required in subsequent layers.
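A minimal NumPy sketch of non-overlapping max pooling; the 2×2 pool size and the feature-map values are illustrative assumptions, and the map dimensions are assumed divisible by the pool size:

```python
import numpy as np

def max_pool(fmap, size=2):
    # Reshape into (rows of blocks, block height, cols of blocks, block width)
    # and take the maximum over each block.
    h, w = fmap.shape
    return fmap.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 0., 5., 6.],
                 [1., 2., 7., 8.]])

# Each output unit summarizes a 2x2 neighbourhood by its maximum,
# halving each spatial dimension.
pooled = max_pool(fmap)
```
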

- Classification layers require inputs of fixed size.
- Pooling can produce a fixed-size output by scaling the pooling region size and stride with the input size.

- Pooling over separate feature channels (rather than spatial locations) can be used to develop invariance to certain transformations of the input.
- For example, units in a layer may learn differently rotated versions of a feature, which are then pooled over; this property is used in maxout networks.

Assumptions about the weights (before learning), in terms of their acceptable values and ranges, are encoded in the prior distribution over the weights.
| S.No. | Prior Type | Description |
|---|---|---|
| 1. | Weak | High variance, low confidence. |
| 2. | Strong | Narrow range of values about which we are confident before learning begins. |
| 3. | Infinitely strong | Marks certain values as forbidden completely, assigning them zero probability. |
- Convolution and pooling impose an infinitely strong prior through the following restrictions on the weights:
  - Adjacent units must have the same weights, shifted in space.
  - Except for a small, spatially contiguous region, all weights must be zero.
  - Each learned feature should be invariant to small translations (the prior imposed by pooling).
- If a task relies on preserving precise spatial information, then using pooling on all features can increase training error.
In practical implementations of the convolution operation, certain modifications deviate from the standard discrete convolution:
- A convolutional layer generally applies several different kernels to the input, since convolution with a single kernel can extract only one kind of feature.
- The input is generally not a grid of single real values but a grid of vector-valued observations (e.g. the RGB channels of a pixel).
- Multi-channel convolution is commutative only if the number of output channels equals the number of input channels.
- Stride is the number of positions the kernel shifts over the input between successive output computations.
- Strided convolutions can be used to compute features at a coarser resolution.
- The effect of strided convolution is the same as that of a convolution followed by a downsampling stage.
- Strides can be used to reduce the representation size.
- Example: a 2-D convolution with a (3 × 3) kernel and a stride of 2 computes an output only at every second position in each dimension.
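The equivalence between a strided convolution and a stride-1 convolution followed by downsampling can be checked in a small 1-D sketch (input and kernel values are illustrative assumptions):

```python
import numpy as np

x = np.arange(10, dtype=float)     # input (illustrative)
w = np.array([1.0, 2.0, 1.0])      # kernel (illustrative)
stride = 2

full = np.convolve(x, w, mode="valid")   # ordinary stride-1 convolution
downsampled = full[::stride]             # keep every second output

# Direct strided convolution: compute only the outputs that survive
# downsampling (the kernel is flipped, as in true convolution).
strided = np.array([np.dot(x[i:i + len(w)], w[::-1])
                    for i in range(0, len(x) - len(w) + 1, stride)])

assert np.allclose(strided, downsampled)
```
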

- Convolutional networks can implicitly zero-pad the input V to make it wider.
- Without zero padding, the width of the representation shrinks by the kernel width minus one at each layer.
- Zero padding the input allows the kernel width and the output size to be controlled independently.
Three common zero-padding strategies are:
- Valid: no zero padding; the output shrinks by the kernel width minus one at each layer.
- Same: just enough zero padding to keep the output the same size as the input.
- Full: enough zeros are added for every input pixel to be visited kernel-width times, making the output wider than the input.
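The three strategies correspond to NumPy's 1-D convolution modes, so the resulting output sizes can be checked directly:

```python
import numpy as np

x = np.ones(8)   # input width m = 8 (illustrative)
w = np.ones(3)   # kernel width k = 3 (illustrative)

assert np.convolve(x, w, mode="valid").size == 8 - 3 + 1   # m - k + 1
assert np.convolve(x, w, mode="same").size == 8            # m
assert np.convolve(x, w, mode="full").size == 8 + 3 - 1    # m + k - 1
```
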
| Convolution Type | Properties | Advantages and Disadvantages |
|---|---|---|
| Unshared Convolution | 1. No parameter sharing. 2. Each output unit performs a linear operation on its neighbourhood, but parameters are not shared across output units. 3. Captures local connectivity while allowing different features to be computed at different spatial locations. | Advantages: 1. Reduced memory consumption. 2. Increased statistical efficiency. 3. Less computation for forward and back-propagation. Disadvantage: requires many more parameters than the convolution operation. |
| Tiled Convolution | 1. A compromise between unshared and traditional convolution. 2. Learns a set of t kernels and cycles them through space. 3. Makes use of parameter sharing. | Advantage: reduces the number of parameters in the model. |
| Traditional Convolution | 1. Equivalent to tiled convolution with t = 1. 2. Has the same connectivity as unshared convolution. | Maximum parameter sharing: a single kernel is reused at every spatial position. |
- Convolutional networks can be trained to output high-dimensional structured output rather than just a classification score.
- To produce an output map of the same size as the input map, only 'same'-padded convolutions can be stacked.
- The output of the first labelling stage can be refined successively by another convolutional model.
- If the models use tied parameters, this gives rise to a type of recursive (recurrent) model.

| Variable | Description |
|---|---|
| X | Input image tensor |
| Y | Tensor of probability distributions over labels for each pixel |
| H | Hidden representation |
| U | Tensor of convolution kernels |
| V | Tensor of kernels used to produce an estimate of the labels |
| W | Kernel tensor to convolve over Y to provide input to H |
The data used with a convolutional network usually consist of several channels, each channel being the observation of a different quantity at some point in space or time.
- When the output is variable-sized, no extra design change needs to be made.
- When the output must be of fixed size (e.g. for classification), a pooling stage whose kernel size scales proportionally with the input size can be used.

The Fourier transform is a tool that breaks a waveform (a function or signal) into an alternate representation characterized by sines and cosines.

- Convolution can be computed by converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting the result back to the time domain using an inverse Fourier transform.
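The convolution theorem can be verified directly with NumPy's FFT routines (signal and kernel values are illustrative; both are zero-padded to the length of the full linear convolution):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # input (illustrative)
w = np.array([1.0, -1.0, 0.5])       # kernel (illustrative)

n = len(x) + len(w) - 1              # length of the full linear convolution
X = np.fft.rfft(x, n)                # transform both signals, zero-padded
W = np.fft.rfft(w, n)
via_fft = np.fft.irfft(X * W, n)     # multiply point-wise, then invert

# Matches direct convolution in the time domain.
assert np.allclose(via_fft, np.convolve(x, w))
```
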

- When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable.
- A separable kernel also requires fewer parameters, since it can be stored as the d vectors.
| Kernel Type | Runtime complexity (d-dimensional kernel, w elements wide) |
|---|---|
| Traditional Kernel | O(w^d) multiplications per output element |
| Separable Kernel | O(w × d) multiplications per output element |
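The equivalence behind this saving can be sketched in NumPy: a 2-D convolution with an outer-product kernel is reproduced by two cheaper 1-D passes. The kernel and image values below are illustrative assumptions:

```python
import numpy as np

u = np.array([1.0, 2.0, 1.0])    # column-direction 1-D kernel (illustrative)
v = np.array([1.0, 0.0, -1.0])   # row-direction 1-D kernel (illustrative)
K = np.outer(u, v)               # full (3 x 3) separable kernel

img = np.random.default_rng(0).random((6, 6))

# Two 1-D passes: convolve every row with v, then every column with u.
rows = np.apply_along_axis(np.convolve, 1, img, v)
separable = np.apply_along_axis(np.convolve, 0, rows, u)

# Direct full 2-D convolution with K, via scatter-add, for comparison.
direct = np.zeros((img.shape[0] + 2, img.shape[1] + 2))
for i in range(img.shape[0]):
    for j in range(img.shape[1]):
        direct[i:i + 3, j:j + 3] += img[i, j] * K

assert np.allclose(separable, direct)
```
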
To reduce the cost of convolutional network training, we can use features that are not learned in a supervised way:
- Random Initialization:
- Layers consisting of convolution followed by pooling naturally become frequency selective and translation invariant when assigned random weights.
- Randomly initialize several CNN architectures and just train the last classification layer.
- Once a winner is determined, train that model using a more expensive approach (supervised approach).
- Hand-designed kernels:
  - Used to detect edges at a certain orientation or scale.
- Unsupervised training:
  - Unsupervised pre-training may offer a regularization effect.
  - It may also allow training of larger CNNs because of the reduced computational cost.
Instead of training an entire convolutional layer at a time, we can train a small model on image patches:
- Train the first layer in isolation.
- Extract all features from the first layer only once.
- Once the first layer is trained, its output is stored and used as input for training the next layer.
- We can train very large models and incur a high computational cost only at inference time.