[Review] 4. Faster R-CNN

jun94 · Published in jun-devpBlog · Apr 8, 2021

1. Improvement from Fast R-CNN by introducing a Region Proposal Network

  • Introduces a Region Proposal Network (RPN) to replace external region proposal algorithms such as Selective Search and EdgeBoxes, which cost about 2 and 0.2 seconds per image, respectively.
  • Region proposal algorithms are an order of magnitude slower because they run on the CPU, while the detection network runs on the GPU.
  • The RPN takes the full-image convolutional features as input and produces object boundaries with corresponding objectness scores at each grid point.
  • With the RPN, the time spent computing proposals drops from 2 (or 0.2, depending on the region proposal algorithm) seconds to about 10 ms per image.
  • RPN is trained end-to-end to output region proposals, which are fed into Fast R-CNN for detection.
  • In order to merge the RPN and Fast R-CNN and train them in an end-to-end manner, the two networks share their convolutional features; the RPN then acts as an ‘attention’ mechanism that tells the detector where to look.

2. Faster R-CNN Architecture

Figure 1. Faster R-CNN architecture consists of separately trained and merged RPN and Fast R-CNN, from [2]

Faster R-CNN consists of two modules: one is the RPN, a fully-convolutional network producing region proposals, and the other is the Fast R-CNN detector, which takes the proposals from the RPN as input and produces the object detection results.
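As a rough sketch (in PyTorch-style Python; `backbone`, `rpn`, and `fast_rcnn_head` are hypothetical placeholders rather than code from the paper), the combined forward pass looks like:

```python
def faster_rcnn_forward(image, backbone, rpn, fast_rcnn_head):
    # Shared convolutional features, computed once for the whole image
    features = backbone(image)              # e.g. [1, C, H, W]
    # Module 1: the RPN proposes rectangular regions plus objectness scores
    proposals, objectness = rpn(features)   # e.g. [N, 4] boxes, [N] scores
    # Module 2: the Fast R-CNN detector classifies and refines each proposal
    boxes, labels, scores = fast_rcnn_head(features, proposals)
    return boxes, labels, scores
```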

3. Region Proposal Networks (RPN)

The RPN takes an image of arbitrary size and produces a set of object proposals, each consisting of a rectangular location and a confidence score for objectness (foreground vs. background).

Fig 2 illustrates how the RPN works in a sliding-window fashion. The process of the RPN, in words, is as follows:

(1) A certain spatial area (window) of the feature maps from the preceding convolutional network in Fig 2 is extracted and fed into the RPN.

(2) By convolving the window with an n⨯n convolutional layer (shared between classification and regression), the intermediate feature maps are obtained. Afterward, two sibling 1⨯1 convolutional layers (not shared) are applied to produce the respective feature maps for objectness classification and box regression.

(3) Lastly, these feature maps are fed into the spatially shared fully-connected layers to make the region proposals. Note that there are actually two FC layers at the same level, one for classification and one for regression.

Figure 2. RPN producing objectness predictions at a given grid point for k anchor boxes.
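A minimal PyTorch sketch of this head is below (an illustrative re-implementation, not the authors' code; the 256-d intermediate feature and k anchors per grid point follow the description above, and the per-position FC layers of step (3) are written as 1⨯1 convolutions, which is how the paper notes they are naturally implemented):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head: an n x n conv followed by two sibling 1 x 1 convs."""
    def __init__(self, in_channels=256, k=9, n=3):
        super().__init__()
        # (2) shared n x n convolution producing the intermediate feature map
        self.shared = nn.Conv2d(in_channels, 256, kernel_size=n, padding=n // 2)
        # (3) sibling per-position layers: 2 objectness scores and 4 box offsets per anchor
        self.cls = nn.Conv2d(256, 2 * k, kernel_size=1)
        self.reg = nn.Conv2d(256, 4 * k, kernel_size=1)

    def forward(self, features):
        x = torch.relu(self.shared(features))
        return self.cls(x), self.reg(x)
```

Applied to an H⨯W feature map, one forward pass yields 2k objectness scores and 4k regression outputs at every grid point.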

After taking a look at the paper, one question came to my mind:

3.1 Does RPN actually avoid enumerating filters of multiple scales or aspect ratios?

Figure 3. Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on the feature map. (c) Pyramids of reference boxes are used in the regression functions, from [1]

The answer to this question is as follows and is well illustrated in Fig 3.

(a) is very time-consuming as it requires an image pyramid.

(b) applies multiple filters of different sizes to detect objects at various scales. Suppose we have N filters of different sizes (e.g., 3⨯3, 5⨯5, etc.); then this costs N × T, where T is the time to convolve the feature map with a single filter.

(c) For regression (and the same applies to classification), since the RPN uses a fully-connected layer, BBOX predictions of many shapes can be made in parallel. Say we want k proposals for each grid point of the feature map. The flattened k⨯256 feature matrix is given by the intermediate layer of the RPN. To produce the RPN predictions, we only need to multiply it by the 256⨯4 weight matrix of the FC layer, which yields k predictions at different scales or aspect ratios.

This is obvious because of the matrix multiplication: [k, 256] × [256, 4] = [k, 4]
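A toy example of that multiplication (random numbers stand in for the learned FC weights and the features; k = 9 matches the default 3 scales ⨯ 3 aspect ratios in the paper):

```python
import torch

k = 9                         # anchor shapes per grid point (3 scales x 3 ratios)
feat = torch.randn(k, 256)    # flattened k x 256 feature matrix for one grid point
w_reg = torch.randn(256, 4)   # FC weights predicting the 4 box coordinates

offsets = feat @ w_reg        # [k, 256] x [256, 4] -> [k, 4]
print(offsets.shape)          # torch.Size([9, 4])
```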

4. Training RPN

As RPN only predicts the presence of an object (regardless of what class it belongs to), a binary class label is assigned to each anchor whose

  • IoU (Intersection over Union) with a ground-truth box is the highest among all anchors, or
  • IoU is higher than 0.7 with any ground-truth box.

Anchors meeting either of the above conditions are labeled positive, while anchors whose IoU is lower than 0.3 for every ground-truth box are labeled negative; the remaining anchors do not contribute to the training objective.
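These labels then enter the RPN's multi-task loss from [1] (the original equation image is not reproduced in this post, so it is written out here):

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)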

where i indicates the index of an anchor, and p_i is the predicted probability of anchor i being an object. t_i denotes the 4 encoded coordinates representing the location of the predicted BBOX relative to anchor i. The same notations with * denote the corresponding ground truth.

Remark that log loss and smooth L1 loss are chosen as the classification and regression losses, respectively.
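For completeness, the smooth L1 loss (as defined in the Fast R-CNN paper) is

\text{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}

and L_{reg}(t_i, t_i^*) applies it element-wise to the difference t_i − t_i^*.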

5. ROI warping

Faster R-CNN replaces ROI pooling with ROI warping, mainly for the following two reasons:

  • Due to the quantization, ROI pooling loses a lot of spatial information, which particularly hurts bounding box regression.
Figure 4. Illustration of why quantization and RoI pooling do not work for RPN backpropagation, from [6]
  • Further, the quantization followed by ROI pooling is not differentiable, and therefore the error w.r.t. the predicted BBOX coordinates cannot be propagated back to the RPN. To address this, ROI warping is applied, which uses bilinear interpolation to make the pooling operation differentiable, as sketched below.

Please see [5] and the paper for further detail on how ROI warping works.
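As a rough illustration of the bilinear-interpolation idea (a minimal sketch, assuming that sampling a regular grid of continuous points inside the box with `torch.nn.functional.grid_sample` is enough to show the point; it is not the exact ROI warping operator from the papers):

```python
import torch
import torch.nn.functional as F

def roi_warp(features, box, out_size=7):
    """Bilinearly sample an out_size x out_size grid of points inside `box`.

    `box` is (x1, y1, x2, y2) in normalized [-1, 1] image coordinates.
    The sampling locations are continuous, so no quantization takes place."""
    x1, y1, x2, y2 = box
    ys = torch.linspace(y1, y2, out_size)
    xs = torch.linspace(x1, x2, out_size)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0)  # [1, S, S, 2]
    return F.grid_sample(features, grid, align_corners=True)   # [1, C, S, S]

feat = torch.randn(1, 256, 50, 50)               # full-image feature map
pooled = roi_warp(feat, (-0.5, -0.5, 0.3, 0.4))  # one proposal
print(pooled.shape)                              # torch.Size([1, 256, 7, 7])
```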

6. Training Faster R-CNN

Figure 5. 4 training steps of Faster R-CNN, from [3]
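The 4-step alternating training summarized in Figure 5 proceeds roughly as follows (per [1]):

(1) Train the RPN, initialized from an ImageNet-pre-trained model, to produce region proposals.

(2) Train a separate Fast R-CNN detection network (also ImageNet-initialized) using the proposals from step (1); at this point the two networks do not yet share convolutional layers.

(3) Use the detector network to re-initialize the RPN, fix the shared convolutional layers, and fine-tune only the layers unique to the RPN.

(4) Keeping the shared convolutional layers fixed, fine-tune only the layers unique to Fast R-CNN, so that both networks share the same convolutional features and form a unified network.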

Summary

Figure 6. Difference between R-CNN series, from [4]

Reference

[1] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”

[2] Ross Girshick

[3] https://www.youtube.com/watch?v=nDPWywWRIRo

[4] https://www.youtube.com/watch?v=Jo32zrxr6l8

[5] https://towardsdatascience.com/understanding-region-of-interest-part-2-roi-align-and-roi-warp-f795196fc193

[6] Aridian Uman

Any corrections, suggestions, and comments are welcome.
