[Review] 6. YOLO ver 2

jun94 · jun-devpBlog · Apr 16, 2021

1. Contribution

  • proposed a method to jointly train YOLO2 on the COCO detection dataset and the ImageNet classification dataset, which enables the model to predict object classes that have no labeled detection data.
  • instead of simply scaling up or ensembling multiple models, YOLO2 focuses on simplifying the network while keeping its accuracy.
  • applied various ideas from previous work to improve YOLO.
  • introduced an anchor-box-based prediction mechanism into YOLO2, inspired by Faster R-CNN.
  • used the k-means algorithm to find optimal anchor shapes for a given number of clusters k, which corresponds to the number of anchor shapes.
  • introduced Darknet-19, a new backbone network.

2. Limitations of YOLO

  • produces a notable number of localization errors compared to Fast R-CNN.
  • has relatively low recall compared to two-stage (region-proposal-based) methods.
  • suffers from a resolution discrepancy between classifier training (224⨯224) and detection training (448⨯448).

3. Methods applied in YOLO2 to improve upon YOLO

To improve on YOLO1, the authors leveraged ideas from previous research.

3.1 Batch Normalization

By adding a batch normalization layer after each convolutional layer, YOLO2’s mAP improved by more than 2%, according to the authors. Batch normalization also significantly accelerated convergence and made it possible to remove other forms of regularization, such as dropout, without overfitting.
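As a concrete illustration, here is a minimal PyTorch sketch of a Darknet-style convolutional block with batch normalization (the conv → BN → LeakyReLU ordering and the 0.1 slope are assumptions based on common Darknet implementations, not spelled out in this post):

```python
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size=3):
    """A Darknet-style block: convolution immediately followed by batch norm.

    bias=False because BatchNorm2d's learnable shift makes the
    convolution's own bias redundant.
    """
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

# e.g. the first stage of a Darknet-19-like backbone
stem = conv_bn_leaky(3, 32)
```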

3.2 Training Classifier with High-resolution Input

The original classifier network of YOLO is trained on ImageNet with an image size of 224⨯224, while a resolution of 448⨯448 is used as input for the object detection task. As a consequence, YOLO has to learn to detect objects while simultaneously adjusting its parameters to the new image resolution.

To fix this discrepancy of image sizes, the YOLO2 classifier is fine-tuned at 448⨯448 (for 10 epochs on ImageNet) before detection training, yielding an mAP increase of almost 4%.
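A rough sketch of this two-phase classifier schedule is below; `train` and `imagenet_loader` are hypothetical helpers, and only the two resolutions and the 10-epoch fine-tuning come from the paper:

```python
from torchvision import transforms

low_res = transforms.Compose([           # standard ImageNet pretraining size
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])
high_res = transforms.Compose([          # detection-like resolution
    transforms.RandomResizedCrop(448),
    transforms.ToTensor(),
])

# A fully convolutional backbone accepts both sizes unchanged:
# train(classifier, imagenet_loader(low_res))              # long pretraining
# train(classifier, imagenet_loader(high_res), epochs=10)  # short high-res fine-tune
```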

3.3 Convolutional with Anchor Boxes

YOLO1 directly predicts the coordinates of bounding boxes using FC layers preceded by convolutional layers. However, as observed in Faster R-CNN, predicting offsets and confidences for anchor boxes, instead of predicting the coordinates directly, simplifies the task and stabilizes training.

Inspired by this, YOLO2 also adopts an anchor-box-based approach.

  • First, they removed the two FC layers on top of the convolutional layers of YOLO1.
  • Second, to maintain a higher resolution at the last feature layer, they eliminated one pooling layer near the end of the network.
  • Lastly, they shrank the network input to 416⨯416, which leads to the odd feature map size of 13⨯13.
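A quick sketch of the resulting fully convolutional head; the 1024 input channels and the VOC setting of 5 anchors ⨯ (5 + 20) outputs per cell are assumptions based on the paper's Darknet-19 configuration:

```python
import torch
import torch.nn as nn

# Darknet-19 downsamples by a factor of 32 (five stride-2 stages),
# so a 416x416 input yields an odd-sized grid: 416 / 32 = 13.
num_anchors, num_classes = 5, 20
head = nn.Conv2d(1024, num_anchors * (5 + num_classes), kernel_size=1)

features = torch.randn(1, 1024, 13, 13)  # stand-in for the backbone output
print(head(features).shape)              # torch.Size([1, 125, 13, 13])
```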

YOLO2 without anchor boxes reported an mAP of 69.5% with a recall of 81%; with anchor boxes, it showed 69.2% mAP and 88% recall. Although mAP decreases slightly with anchor boxes, recall increases remarkably, by 7 points.

3.3.1 Why an odd number?

Since YOLO2 removed the FC layers of YOLO1, the receptive field of a single position in the last feature layer no longer covers the entire image. Making the last feature map odd-sized ensures there is a single pixel at its center, which becomes responsible for (and capable of) detecting large objects, which are often located near the image center.

Furthermore, reducing the input size from 448⨯448 to 416⨯416 to obtain the odd feature map decreases the amount of computation and speeds up inference (with a total downsampling factor of 32, 416/32 = 13, whereas 448/32 = 14).

3.4 Anchor box selection with K-means

When using anchor boxes for object detection, one issue we often encounter is how to choose suitable anchor shapes (sizes and aspect ratios) and how many of them to use. In many cases the anchor shapes are designed by hand, and such a design frequently ends up with 9 shapes (3 sizes ⨯ 3 aspect ratios = 9).

Instead of choosing priors by hand, YOLO2 runs k-means clustering on the training set bounding boxes to automatically search for appropriate anchor shapes with varying k.

In order to run the k-means algorithm, a distance metric is required. Since the goal is to find anchor shapes that maximize the average IoU between a GT box and its nearest anchor, the authors defined their own metric: d(box, centroid) = 1 − IoU(box, centroid).
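Below is a minimal NumPy sketch of this clustering step. The paper only specifies the distance; representing each box by its (width, height), computing IoU as if all boxes shared a center, and using a mean-based centroid update are my assumptions based on common implementations:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, treating all boxes as co-centered."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None] +
             (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """k-means on GT box sizes with d = 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # minimizing 1 - IoU is the same as maximizing IoU
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        centroids = np.array([boxes[assign == i].mean(axis=0)
                              if np.any(assign == i) else centroids[i]
                              for i in range(k)])
    return centroids

# boxes: an (N, 2) array of GT box widths/heights
# anchors = kmeans_anchors(boxes, k=5)
```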

Fig 1 shows the average IoU of GT bounding boxes to their best-matching anchor as a function of k. Considering the tradeoff between recall and model complexity, k = 5 is chosen.

Figure 1. Clustering box dimensions on COCO and VOC, from [1]

Surprisingly, the 5 anchor shapes found by k-means showed a slightly better average IoU than the 9 hand-picked anchor shapes used in Faster R-CNN. Moreover, with k = 9, k-means outperformed them by about 7 points of average IoU, as shown in the following table.

Table 1: Average IOU of boxes to closest priors on VOC 2007, from [1]

3.5 Location Prediction

Instead of predicting offsets of the GT box relative to an anchor box (as the RPN of Faster R-CNN does), YOLO2 predicts box locations relative to the grid cell, inheriting the approach of YOLO1.

Mathematically, this can be formulated as below.

bˣ = 𝜎(tˣ) + cˣ
bʸ = 𝜎(tʸ) + cʸ
bʷ = pʷ · e^(tʷ)
bʰ = pʰ · e^(tʰ)
Pr(object) · IoU(b, object) = 𝜎(tº)

Note that YOLO2 predicts 5 values for each bounding box, where (tˣ, tʸ, tʷ, tʰ) determine the predicted box location and size, and tº is the objectness score. To prevent instability in the early training phase, which mostly arises from unbounded predicted coordinates, the sigmoid activation 𝜎 is applied so that the network outputs 𝜎(tˣ) and 𝜎(tʸ) fall between 0 and 1.

cˣ and cʸ are the offsets of the grid cell (in which the prediction is made) from the image’s top-left corner, and pʷ, pʰ are the anchor’s width and height, respectively.

Lastly, (bˣ, bʸ, bʷ, bʰ) represents the resulting predicted box, which is trained to match the GT box.

Figure 2. Bounding boxes with dimension priors and location prediction, from [1]
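To make the decoding concrete, here is a small sketch of these formulas applied to raw outputs on the 13⨯13 grid; the cell index and anchor size are arbitrary illustrative values:

```python
import torch

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Apply b_x = sigma(t_x) + c_x, ..., b_w = p_w * exp(t_w)."""
    bx = torch.sigmoid(tx) + cx
    by = torch.sigmoid(ty) + cy
    bw = pw * torch.exp(tw)
    bh = ph * torch.exp(th)
    return bx, by, bw, bh

# Raw outputs of 0 at cell (6, 6) with anchor (3.6, 5.5) give a box at
# the cell center (6.5, 6.5) with exactly the anchor's width and height.
print(decode_box(*map(torch.tensor, (0., 0., 0., 0., 6., 6., 3.6, 5.5))))
```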

3.5.1 Personal opinion on YOLO2’s location prediction

For predicting (bˣ, bʸ), an ideal starting prediction is 𝜎(tˣ) = 𝜎(tʸ) = 0.5, which places the box at the cell center (bˣ = cˣ + 0.5, and likewise for bʸ). In other words, tˣ and tʸ should ideally start at zero. This is relatively easy to achieve by giving the last convolutional layer before the sigmoid zero-mean weights and zero bias, and training stays stable since 𝜎(tˣ) and 𝜎(tʸ) are bounded between 0 and 1.

Similarly, the prediction for (bʷ, bʰ) matches the anchor exactly when tʷ and tʰ are zero (since e⁰ = 1). However, there is no sigmoid activation on (tʷ, tʰ), so the exponential term is unbounded, which might lead to instability in training. Applying a sigmoid (with a conv layer of zero-mean weights and negative bias) or a tanh activation (with zero-mean weights and zero bias) might therefore be helpful.
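A hypothetical sketch of these two suggestions follows; this reflects the opinion above, not anything in the paper:

```python
import torch
import torch.nn as nn

# Near-zero initialization of the last 1x1 conv makes every t start
# around 0: boxes begin at cell centers with (roughly) anchor sizes.
last_conv = nn.Conv2d(1024, 125, kernel_size=1)  # 5 anchors x (5 + 20)
nn.init.normal_(last_conv.weight, mean=0.0, std=1e-3)
nn.init.zeros_(last_conv.bias)

def bounded_wh(t, p):
    """Bounded alternative to p * exp(t): the scale factor stays
    within [e^-1, e^1] instead of being unbounded."""
    return p * torch.exp(torch.tanh(t))
```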

3.6 Finer-grained Features

As mentioned before, YOLO2 makes predictions on a 13⨯13 feature map, which is sufficient for detecting large objects but often not enough for localizing small ones.

While other networks, such as SSD, predict at multiple feature layers (e.g., 52⨯52, 26⨯26, and 13⨯13), which helps them detect small objects, YOLO2 takes a different approach: it adds a passthrough layer that carries feature information from an earlier, higher-resolution layer into the final one.

In other words, the 26⨯26 feature maps are concatenated with the subsequent 13⨯13 feature layer.

Of course, due to the size difference, they cannot be stacked directly. Therefore, the authors stack adjacent spatial positions of the 26⨯26 feature maps into separate channels (turning the 26⨯26⨯512 map into a 13⨯13⨯2048 one) and then concatenate the result with the feature maps of the next layer. The resulting YOLO2 architecture with a passthrough looks like the figure below.

Figure 3. YOLO2 structure with a passthrough, from [3]
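Here is a minimal sketch of such a passthrough (space-to-depth) layer; note that the exact ordering of the rearranged pixels varies across implementations:

```python
import torch

def passthrough(x, stride=2):
    """Reorganize (N, C, H, W) into (N, C*stride^2, H/stride, W/stride)
    by moving each 2x2 spatial neighborhood into the channel dimension."""
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

coarse = torch.randn(1, 1024, 13, 13)  # last feature map
fine = torch.randn(1, 512, 26, 26)     # earlier high-resolution map
fused = torch.cat([coarse, passthrough(fine)], dim=1)
print(fused.shape)                     # torch.Size([1, 3072, 13, 13])
```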

4. Conclusion

As shown in the results table of [1], YOLO2 achieves mAP comparable to (or even better than) other one-stage and two-stage detectors.

Moreover, it is considerably faster (roughly 2⨯–10⨯) than them.

5. Summary

6. References

[1] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” https://arxiv.org/abs/1612.08242

[2] https://m.blog.naver.com/sogangori/221011203855

[3] https://89douner.tistory.com/93
