RetinaNet model for object detection


Lately, the RetinaNet model for object detection has been a buzzword in the deep learning community.

And why should it not be? Object detection is a tremendously important field in computer vision. People use object detection methods for autonomous driving, video surveillance, medical applications, and many other business problems.

Table of Contents: –

  1. What is RetinaNet
  2. Why RetinaNet was needed
  3. Architecture of RetinaNet
    • Backbone Network
    • Subnetwork for object Classification
    • Subnetwork for object Regression
  4. Focal Loss
  5. Final Notes

In this article, I’ll introduce you to the architecture of the RetinaNet model and how it works. Cherry on top? In the next article, we’ll build a “Face mask detector” using RetinaNet to help us in this ongoing pandemic.

What is RetinaNet Model: –

The Facebook AI Research (FAIR) team introduced the RetinaNet model to tackle the problem of detecting dense and small objects.

For this reason, it has become a popular object detection model for aerial and satellite imagery as well.

The researchers introduced RetinaNet by making two improvements over existing single-stage object detection models –

  • Feature Pyramid Networks (FPN)
  • Focal Loss

Need of RetinaNet Model: –

Classic one-stage detection methods such as boosted detectors and DPM, as well as more recent methods such as SSD, evaluate roughly [latex] { 10 }^{ 4 } [/latex] to [latex] { 10 }^{ 5 } [/latex] candidate locations per image, but only a few of those locations contain objects (i.e. foreground); the rest are just background. This leads to a class imbalance problem.

And this turns out to be the central reason why one-stage detectors perform worse than two-stage detectors.

Hence, the researchers introduced the RetinaNet model with a concept called Focal Loss to compensate for the extreme foreground-background class imbalance that single-shot object detectors like YOLO and SSD run into.
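To get a feel for why this imbalance hurts, here is a rough numeric sketch. The counts and probabilities below are made up purely for illustration (100,000 candidate locations, 100 of them foreground); they are not figures from the RetinaNet paper. It sums the ordinary cross-entropy loss over many easy background locations and a few hard foreground ones.

```python
import math

def cross_entropy(p_t):
    """Cross-entropy loss for one example, given the predicted probability of its true class."""
    return -math.log(p_t)

# Hypothetical numbers, chosen only for illustration.
num_background = 99_900   # easy negatives, each classified correctly with p_t = 0.99
num_foreground = 100      # hard positives, each poorly classified with p_t = 0.30

background_loss = num_background * cross_entropy(0.99)
foreground_loss = num_foreground * cross_entropy(0.30)
background_share = background_loss / (background_loss + foreground_loss)

print(f"background loss: {background_loss:.1f}")
print(f"foreground loss: {foreground_loss:.1f}")
print(f"share of total loss from easy background: {background_share:.1%}")
```

Even though each background location contributes a tiny loss, under these assumed numbers the easy background ends up contributing most of the total, so the gradient is dominated by examples the model already handles well.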

Architecture of RetinaNet Model: –

In essence, we can break down the RetinaNet architecture into the following 3 components:

  1. Backbone Network (i.e. bottom-up pathway + top-down pathway with lateral connections, e.g. ResNet + FPN)
  2. Sub-network for object Classification
  3. Sub-network for object Regression
Figure 1: RetinaNet Model Architecture (Source)

For better understanding, let’s look at each component of the architecture separately –

The backbone Network: –

Bottom up pathway – The bottom-up pathway (e.g. ResNet) is used for feature extraction. It computes feature maps at different scales, irrespective of the input image size.

Top down pathway with lateral connections – The top-down pathway upsamples the spatially coarser feature maps from higher pyramid levels, and the lateral connections merge the top-down layers and the bottom-up layers that have the same spatial size.

Higher-level feature maps have lower resolution but are semantically stronger, and are therefore more suitable for detecting larger objects; in contrast, grid cells from lower-level feature maps have high resolution and hence are better at detecting smaller objects.

So, by combining the top-down pathway and its lateral connections with the bottom-up pathway, which does not require much extra computation, every level of the resulting feature maps can be both semantically and spatially strong.
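The merge step can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual FPN implementation: it assumes a single channel, nearest-neighbour 2x upsampling, and element-wise addition, and it leaves out the convolutions that a real FPN applies around the merge. The function names `upsample_2x` and `merge_top_down` are hypothetical.

```python
def upsample_2x(feature_map):
    """Nearest-neighbour 2x upsampling of a 2-D feature map given as a list of rows."""
    upsampled = []
    for row in feature_map:
        wide_row = [value for value in row for _ in range(2)]  # repeat each column twice
        upsampled.append(wide_row)
        upsampled.append(list(wide_row))                       # repeat each row twice
    return upsampled

def merge_top_down(coarse_map, lateral_map):
    """Upsample the coarse (higher-level) map and add it to the same-sized lateral map."""
    upsampled = upsample_2x(coarse_map)
    return [
        [up + lat for up, lat in zip(up_row, lat_row)]
        for up_row, lat_row in zip(upsampled, lateral_map)
    ]

# A 2x2 coarse map (semantically strong) and a 4x4 lateral map (spatially fine).
coarse = [[1.0, 2.0],
          [3.0, 4.0]]
lateral = [[0.1] * 4 for _ in range(4)]

merged = merge_top_down(coarse, lateral)
for row in merged:
    print(row)
```

Each cell of the merged 4x4 map carries the coarse value from the level above plus the fine-grained value from the bottom-up pathway, which is the sense in which the result is both semantically and spatially strong.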

Hence this architecture is scale-invariant and can provide better performance both in terms of speed and accuracy.

Sub-network for object Classification: –

A fully convolutional network (FCN) is attached to each FPN level for object classification. As shown in the diagram above, this subnetwork applies four 3*3 convolutional layers with 256 filters each, followed by another 3*3 convolutional layer with K*A filters. Hence the output feature map has size W*H*KA, where W and H match the width and height of the input feature map, and K and A are the number of object classes and anchor boxes respectively.

Finally, the researchers used a sigmoid layer (not softmax) for object classification.

The last convolutional layer has KA filters because, if there are “A” anchor box proposals for each position in the feature map, each anchor box can be classified into any of K classes. So the output feature map has KA channels.
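The shape arithmetic can be checked with a small sketch. The concrete values here are assumptions that do not come from this article: K = 80 classes, A = 9 anchors per position, a 640*640 input image, and five pyramid levels with strides 8 to 128. The helper `classification_output_shape` is a hypothetical name.

```python
def classification_output_shape(width, height, num_classes, num_anchors):
    """Shape (W, H, K*A) of the classification subnet output for one pyramid level."""
    return (width, height, num_classes * num_anchors)

# Assumed values, for illustration only.
K = 80                          # number of object classes
A = 9                           # anchor boxes per spatial position
input_size = 640                # square input image, in pixels
strides = [8, 16, 32, 64, 128]  # downsampling factor of each pyramid level

total_anchors = 0
for stride in strides:
    w = h = input_size // stride
    shape = classification_output_shape(w, h, K, A)
    total_anchors += w * h * A
    print(f"stride {stride:>3}: feature map {w}x{h}, output shape {shape}")

print("total anchor boxes per image:", total_anchors)
```

Under these assumptions the total lands inside the [latex] { 10 }^{ 4 } [/latex] to [latex] { 10 }^{ 5 } [/latex] range of candidate locations mentioned earlier, which is where the class imbalance comes from.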

Sub-network for object Regression: –

The regression subnetwork is attached to each feature map of the FPN in parallel to the classification subnetwork. The design of the regression subnetwork is identical to that of the classification subnet, except that the last convolutional layer is a 3*3 layer with 4A filters, resulting in an output feature map of size W*H*4A.

The last convolutional layer has 4A filters because, in order to localize the objects, the regression sub-network produces 4 numbers for each of the A anchor boxes; these predict the relative offset (in terms of center coordinates, width and height) between the anchor box and the ground truth box. Therefore, the output feature map of the regression sub-net has 4A channels.
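To make the "4 numbers per anchor" concrete, here is a minimal sketch of one way such offsets can be encoded and decoded. The article does not spell out the exact parameterization, so this assumes the one commonly used in the R-CNN family: center shifts scaled by the anchor size, and log ratios for width and height. Boxes are (center x, center y, width, height), and the function names are hypothetical.

```python
import math

def encode_box(anchor, ground_truth):
    """Return the 4 regression targets (dx, dy, dw, dh) for one anchor box."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = ground_truth
    return (
        (gx - ax) / aw,        # horizontal center shift, in anchor widths
        (gy - ay) / ah,        # vertical center shift, in anchor heights
        math.log(gw / aw),     # log of the width ratio
        math.log(gh / ah),     # log of the height ratio
    )

def decode_box(anchor, offsets):
    """Invert encode_box: turn 4 predicted offsets back into a box."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = offsets
    return (ax + dx * aw, ay + dy * ah, aw * math.exp(dw), ah * math.exp(dh))

anchor = (100.0, 100.0, 64.0, 64.0)
ground_truth = (108.0, 96.0, 80.0, 48.0)

offsets = encode_box(anchor, ground_truth)
recovered = decode_box(anchor, offsets)
print("offsets:  ", offsets)
print("recovered:", recovered)
```

Decoding the encoded offsets gives back the ground truth box (up to floating point error), so a network that predicts these 4 numbers well for an anchor has effectively localized the object.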

So by now we have some clarity on the architecture of the RetinaNet model for object detection. Now let’s understand the most discussed topic of RetinaNet, and that is Focal Loss.

Focal Loss : –

Focal Loss (FL) is an improved version of Cross-Entropy Loss (CE) that tries to handle the class imbalance problem by assigning more weight to hard or easily misclassified examples (e.g. background with noisy texture, a partial object, or the object of our interest) and down-weighting easy examples (e.g. plain background).

So Focal Loss reduces the loss contribution from easy examples and increases the importance of correcting misclassified examples.

Focal loss is just an extension of the cross-entropy loss function that down-weights easy examples and focuses training on hard negatives. To achieve this, the researchers proposed adding a modulating factor [latex] { (1-{ p }_{ t }) }^{ \gamma } [/latex] to the cross-entropy loss, with a tunable focusing parameter [latex] \gamma \ge 0 [/latex].

The RetinaNet object detection method uses an α-balanced variant of the focal loss, where α=0.25 and γ=2 work best.

Figure 1: Focal loss vs probability of ground truth class (Source)

So one can define focal loss as –

[latex] FL({ p }_{ t }) = -{ \alpha }_{ t }{ (1-{ p }_{ t }) }^{ \gamma }\ln { ({ p }_{ t }) } [/latex]

The focal loss is visualized for several values of [latex] \gamma \in \left[ 0,5 \right] [/latex]; refer to Figure 1 above.
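As a minimal sketch, the α-balanced focal loss for a single binary prediction can be written as below. The article does not define p_t and α_t, so the code assumes the usual convention: p_t is the predicted probability p when the label is 1 and 1 - p otherwise, and α_t is α for label 1 and 1 - α otherwise. The defaults α=0.25 and γ=2 are the values quoted above.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for one example.

    p: predicted probability of the positive class (0 < p < 1)
    y: ground truth label, 1 for foreground and 0 for background
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight (assumed convention)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_positive = focal_loss(0.9, 1)  # well classified: p_t = 0.9
hard_positive = focal_loss(0.1, 1)  # misclassified:   p_t = 0.1

print(f"easy positive loss: {easy_positive:.6f}")
print(f"hard positive loss: {hard_positive:.6f}")
```

With these inputs the misclassified example receives a loss more than a thousand times larger than the well-classified one, whereas plain cross entropy would separate them by a much smaller factor.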

Focal Loss characteristics:-

Note the following properties of the focal loss –

  1. When an example is misclassified and [latex] { p }_{ t } [/latex] is small, the modulating factor is near 1 and the loss is unaffected.
  2. As [latex] { p }_{ t }\rightarrow 1 [/latex], the factor goes to 0 and the loss for well-classified examples is down-weighted.
  3. The focusing parameter [latex] \gamma [/latex] smoothly adjusts the rate at which easy examples are down-weighted.
  4. As [latex] \gamma [/latex] is increased, the effect of the modulating factor is likewise increased. (After a lot of experiments and trials, the researchers found [latex] \gamma =2 [/latex] to work best.)

Note:- When [latex] \gamma =0 [/latex], FL is equivalent to CE (shown as the blue curve in Figure 1 above).
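These properties can be checked numerically by looking at the modulating factor on its own. The sample values of p_t and γ below are arbitrary choices for illustration, and `modulating_factor` is a hypothetical helper name.

```python
def modulating_factor(p_t, gamma):
    """The (1 - p_t)^gamma term that focal loss multiplies into cross entropy."""
    return (1.0 - p_t) ** gamma

sample_p_t = (0.1, 0.5, 0.9)  # misclassified, uncertain, well classified

for gamma in (0, 1, 2, 5):
    factors = [modulating_factor(p_t, gamma) for p_t in sample_p_t]
    formatted = ", ".join(f"p_t={p_t}: {f:.5f}" for p_t, f in zip(sample_p_t, factors))
    print(f"gamma={gamma} -> {formatted}")
```

For γ=0 every factor is 1, so the loss reduces to cross entropy. For larger γ the factor stays close to 1 when p_t is small but shrinks quickly as p_t approaches 1, which is the down-weighting of easy examples described in the list above.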

You can read about focal loss in detail in this article, where I’ve talked about the evolution of cross entropy into focal loss, the need for focal loss, and a comparison of focal loss with cross entropy.

And cherry on top, I’ve used a couple of examples to explain why focal loss is better than cross entropy.

End Points: –

RetinaNet is a powerful model that uses a Feature Pyramid Network and ResNet as its backbone.

In general, RetinaNet is a good choice to start an object detection project, in particular if you need to get good results quickly. In the next article, we’ll build a solution using the RetinaNet model.

If you’ve enjoyed this article, leave a few claps; it will encourage me to explore further machine learning opportunities 🙂
