[DL] 13. Convolution and Pooling Variants (Dilated Convolution, SPP, ASPP)

jun94 · jun-devpBlog · Dec 28, 2020

1. Dilated Convolution

Dilated convolution, also often called atrous convolution, was introduced at ICLR 2016. Its major idea is that, when performing convolution, the kernel samples not the directly adjacent pixels but pixels that are a certain distance apart.

This distance is called the 'dilation rate r,' and the dilated convolution can be mathematically represented as follows.

Equation 1. Dilated convolution with dilation rate r:

$(F *_{r} k)(\mathbf{p}) = \sum_{\mathbf{s} + r\,\mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t})$

Compared with the equation of standard convolution, only the term for the dilation rate r is added, and, as one might notice, if r is one, then Equation 1 is just a standard convolution.
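To make the definition concrete, here is a minimal 1-D sketch in numpy (the function name `dilated_conv1d` is mine, not from the post): the kernel taps are spaced r samples apart, and r = 1 recovers the ordinary convolution.

```python
import numpy as np

def dilated_conv1d(x, w, r):
    """'Valid' 1-D dilated convolution: y[i] = sum_j x[i + r*j] * w[j]."""
    k = len(w)
    span = r * (k - 1) + 1          # receptive field of the dilated kernel
    n_out = len(x) - span + 1
    return np.array([sum(x[i + r * j] * w[j] for j in range(k))
                     for i in range(n_out)])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])

# With r = 1 this is just the standard convolution: sums of 3 adjacent samples.
print(dilated_conv1d(x, w, r=1))
# With r = 2 the kernel taps are 2 samples apart: x[i] + x[i+2] + x[i+4].
print(dilated_conv1d(x, w, r=2))
```

Note that the dilated kernel still performs only k multiplications per output element; only the span it covers grows with r.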

Then what is the role of the dilation rate? This question is best answered with the visualization below.

Standard convolution, when r = 1

As mentioned before, dilated convolution becomes the standard convolution if r = 1.

However, things change when r is larger than one. The following illustrates the case r = 2: we can observe that the convolution computes a weighted sum over input pixels that are spaced a distance of 2 apart.

Dilated Convolution, when r = 2

What is the advantage of using dilated convolution instead of the standard one?

As we have seen, dilated convolution increases the receptive field without increasing computation. In other words, it can achieve the receptive-field growth of pooling without additional computation and without losing resolution.

Figure 1. Illustration of 3x3 Dilated convolution. Left panel: dilation rate r = 1, equivalent to the normal convolution, center panel: dilation rate r=2, right panel: r=3, from link
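The receptive-field claim can be checked with a small calculation (the helper `receptive_field` is my own illustration): stacking 3-tap layers with dilation rates 1, 2, 4 grows the receptive field roughly exponentially, while every layer still costs only 3 multiplications per output element.

```python
def receptive_field(kernel_sizes, rates):
    """1-D receptive field of stacked convolutions with given dilation rates."""
    rf = 1
    for k, r in zip(kernel_sizes, rates):
        rf += (k - 1) * r           # each layer adds (k - 1) * r to the field
    return rf

# Three 3-tap layers, identical per-layer cost:
print(receptive_field([3, 3, 3], [1, 1, 1]))  # rates 1, 1, 1 -> 7
print(receptive_field([3, 3, 3], [1, 2, 4]))  # rates 1, 2, 4 -> 15
```

Doubling the rate at each layer is exactly the multi-scale context-aggregation scheme of the original dilated-convolution paper.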

2. Spatial Pyramid Pooling (SPP)

Given the feature maps from the previous convolution layers, SPP performs max-pooling operations multiple times with increasing kernel sizes (coarser pyramid levels). Afterward, the result of each pooling operation is flattened into a vector, and those vectors are concatenated to produce the final fixed-length representation.

Figure 2. Illustration of the Spatial Pyramid Pooling (SPP), from [1]

With multiple pooling layers at different scales, SPP captures information at varying image scales, making the representation robust to scale changes. As a result, SPP creates a multi-scale feature representation that the later parts of the network can use for classification.
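A minimal numpy sketch of this idea (the function `spp` and the pyramid levels 4×4, 2×2, 1×1 are my assumptions for illustration): each level max-pools the feature map into an n × n grid whose bins adapt to the input size, so the concatenated output length depends only on the channel count and the levels, never on H or W.

```python
import numpy as np

def spp(feature_map, levels=(4, 2, 1)):
    """Spatial Pyramid Pooling sketch: max-pool a (C, H, W) map into an
    n x n grid per pyramid level and concatenate, giving a fixed-length
    vector regardless of the spatial size H x W."""
    c, h, w = feature_map.shape
    outputs = []
    for n in levels:
        # Bin boundaries adapt to the input size (adaptive max pooling).
        hs = np.linspace(0, h, n + 1, dtype=int)
        ws = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                patch = feature_map[:, hs[i]:hs[i+1], ws[j]:ws[j+1]]
                outputs.append(patch.max(axis=(1, 2)))
    return np.concatenate(outputs)

# Different spatial sizes, same output length: C * (16 + 4 + 1) = 168.
print(spp(np.random.rand(8, 13, 17)).shape)  # (168,)
print(spp(np.random.rand(8, 20, 20)).shape)  # (168,)
```

This fixed-length property is what lets SPP feed fully connected classification layers from inputs of arbitrary size.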

3. Atrous Spatial Pyramid Pooling (ASPP)

Another approach to building a multi-scale representation by looking at the input image at varying scales is Atrous Spatial Pyramid Pooling (ASPP). In other words, ASPP extends the SPP concept by using dilated convolutions instead of max pooling.

Figure 3. Illustration of ASPP, from [1]

The difference is that ASPP does not decrease the resolution: since there is no max-pooling operation in ASPP, its input and output sizes are the same.

More specifically, while the output of SPP is a single vector, the output of ASPP remains a set of 2D feature maps, as shown below.

Figure 4. Illustration of ASPP(2)

Further, as the dilated convolutions replace the standard convolutions in ASPP, the receptive fields are increased without extra computation.
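The two properties above can be checked in a small numpy sketch (function names and the rates 1, 2, 4 are my assumptions, not from the post): several 'same'-padded dilated convolutions run in parallel at different rates, and their outputs are stacked along a channel axis, so the spatial resolution is preserved while each branch sees a different receptive field.

```python
import numpy as np

def dilated_conv2d_same(x, w, r):
    """'Same'-padded 2-D dilated convolution of an (H, W) map with a
    (k, k) kernel: the output keeps the input's spatial size."""
    k = w.shape[0]
    pad = r * (k - 1) // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += w[i, j] * xp[i * r:i * r + x.shape[0],
                                j * r:j * r + x.shape[1]]
    return out

def aspp(x, kernels, rates=(1, 2, 4)):
    """ASPP sketch: parallel dilated convolutions at several rates,
    stacked along a channel axis. Resolution is unchanged."""
    return np.stack([dilated_conv2d_same(x, w, r)
                     for w, r in zip(kernels, rates)])

x = np.random.rand(16, 16)
kernels = [np.random.rand(3, 3) for _ in range(3)]
print(aspp(x, kernels).shape)  # (3, 16, 16): same H x W as the input
```

In contrast to the SPP sketch, the output here is a stack of 2D feature maps rather than a flattened vector, matching the distinction drawn above.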

Reference

[1] RWTH Aachen, computer vision group

Any corrections, suggestions, and comments are welcome.
