Region Proposal Network (RPN) architecture explained


If you’re reading this post, I assume you have heard of the R-CNN family of object detectors and have come across the RPN, or Region Proposal Network. If you are not familiar with the R-CNN family, I highly recommend reading up on it before we dive deep into RPN.

As we know, the aim in object detection algorithms is to generate candidate boxes, i.e. boxes that could contain our objects. These boxes are then localized by bounding box regression and assigned to their respective classes by a classifier.

In earlier object detection algorithms, these candidate boxes were generated by traditional computer vision techniques. One such method was “Selective Search”, but its downside was that it ran as an offline step and was computationally expensive.

That’s where the RPN (Region Proposal Network) approach came to the rescue by generating the candidate boxes in very little time. As a cherry on top, this network can be plugged into other object detection networks, which makes it even more useful.

RPN (Region Proposal Network) :-

Just as a CNN learns classification from feature maps, an RPN learns to generate candidate boxes from feature maps. A typical Region Proposal Network is shown in the figure below.


                                           Fig: RPN in training

Let’s go through the block diagram above step by step.

Step 1.

In the very first step, the input image goes through a Convolutional Neural Network, whose last layer outputs the feature maps.

Step 2.

In this step, a sliding window is run over the feature maps obtained in the previous step. The sliding window has size n×n (here 3×3). At each sliding-window position, a set of anchors is generated using 3 different aspect ratios (1:1, 1:2, 2:1) and 3 different scales (128, 256 and 512).

With 3 aspect ratios and 3 scales, there are 9 anchors for each position of the feature map. For a feature map of size W×H with K anchors per position, the total number of anchor boxes is W*H*K.
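To make the counting concrete, here is a minimal sketch of how the 9 anchor shapes could be derived from the scales and aspect ratios. It assumes that a ratio means height divided by width and that each anchor keeps an area of roughly scale × scale; the function names and the 5×4 feature map are made up for illustration.

```python
import math

def anchor_shapes(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return a (width, height) pair for every scale/ratio combination.

    Assumption: ratio = height / width, and the anchor area stays
    close to scale * scale for every ratio.
    """
    shapes = []
    for scale in scales:
        area = scale * scale
        for ratio in ratios:
            width = math.sqrt(area / ratio)
            height = width * ratio
            shapes.append((round(width), round(height)))
    return shapes

def total_anchors(feat_w, feat_h, k):
    """Total anchors for a W x H feature map with K anchors per position."""
    return feat_w * feat_h * k

shapes = anchor_shapes()
print(len(shapes))                        # number of anchors per position (K)
print(shapes)                             # all (width, height) pairs
print(total_anchors(5, 4, len(shapes)))   # W*H*K for a toy 5x4 feature map
```
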

The following figure shows 9 anchors at position (450, 350) of an image of size (600, 900).

In the above figure, the three colors represent the three scales or sizes: 128×128, 256×256 and 512×512.

Let’s single out the brown boxes/anchors (the innermost boxes in the above figure). The three boxes have height-to-width ratios of 1:1, 1:2 and 2:1 respectively.
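The same picture can be sketched in code. The snippet below places 9 anchors around one center point and checks which of them stay inside the image. It assumes that (450, 350) is an (x, y) position and that (600, 900) means height 600 and width 900; the anchor shapes are rounded values for the three scales and three ratios, and the helper names are hypothetical.

```python
# (width, height) of the 9 anchors: 3 scales x 3 aspect ratios (rounded).
SHAPES = [
    (128, 128), (181, 91), (91, 181),
    (256, 256), (362, 181), (181, 362),
    (512, 512), (724, 362), (362, 724),
]

def anchors_at(cx, cy, shapes):
    """Return corner-format boxes (x1, y1, x2, y2) centred on (cx, cy)."""
    return [(cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2) for w, h in shapes]

def inside_image(box, img_w, img_h):
    """True if the box lies completely inside the image."""
    x1, y1, x2, y2 = box
    return x1 >= 0 and y1 >= 0 and x2 <= img_w and y2 <= img_h

# Assumption: image is 900 wide and 600 high, anchor centre at x=450, y=350.
boxes = anchors_at(450, 350, SHAPES)
for box in boxes:
    print(box, "inside" if inside_image(box, 900, 600) else "crosses boundary")
```

Under this reading of the coordinates, some of the largest anchors stick out of the image, which is why implementations usually clip such anchors or ignore them during training.
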

We now have 9 anchor boxes for each position of the feature map, but many of them may not contain any object. So the model needs to learn which anchor boxes are likely to contain an object. An anchor box containing an object is classified as foreground, and the rest as background. At the same time, the model needs to learn the offsets that adjust the foreground boxes to fit the objects. This brings us to the next step.

Step 3. 

Localizing and classifying the anchor boxes is done by the bounding box regressor layer and the bounding box classifier layer.

The bounding box classifier layer assigns each anchor box to either foreground or background with a certain probability, also known as the objectness score. During training, the target label for an anchor comes from its IoU (Intersection over Union) score with the ground truth boxes.
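Here is a minimal sketch of how IoU can be used to label anchors for training. The thresholds (foreground at IoU 0.7 or above, background below 0.3, everything in between ignored) follow the Faster R-CNN paper; the function names are hypothetical, and the paper's extra rule of also marking the best-matching anchor for each ground truth box as foreground is left out for brevity.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, fg_thresh=0.7, bg_thresh=0.3):
    """Return 1 (foreground), 0 (background) or -1 (ignored in training)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= fg_thresh:
        return 1
    if best < bg_thresh:
        return 0
    return -1

gt_boxes = [(0, 0, 100, 100)]
print(label_anchor((0, 0, 100, 100), gt_boxes))      # perfect overlap
print(label_anchor((200, 200, 300, 300), gt_boxes))  # no overlap
print(label_anchor((0, 0, 100, 50), gt_boxes))       # partial overlap
```
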

The bounding box regressor layer learns the offsets (differences) in x, y, w and h between an anchor box classified as foreground and the ground truth box, where (x, y) is the center of the box and w and h are its width and height.
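The offsets can be sketched as follows, using the parameterization from the Faster R-CNN paper: the center shift is divided by the anchor size, and the width and height are compared on a log scale. The function names `encode` and `decode` are made up for this sketch, and boxes are assumed to be in (x, y, w, h) center format.

```python
import math

def encode(anchor, gt):
    """Regression targets (tx, ty, tw, th) that move the anchor onto gt."""
    xa, ya, wa, ha = anchor
    x, y, w, h = gt
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(anchor, offsets):
    """Apply predicted offsets to an anchor to get the refined box."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = offsets
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (100.0, 100.0, 128.0, 128.0)    # hypothetical anchor
gt_box = (110.0, 90.0, 150.0, 100.0)     # hypothetical ground truth box
targets = encode(anchor, gt_box)
print(targets)
print(decode(anchor, targets))           # recovers gt_box up to rounding error
```

Normalizing by the anchor size keeps the targets in a similar numeric range for small and large anchors, which is one common argument for this design.
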

Since RPN is a model and every model needs a cost function to train, so does RPN. Its loss combines a classification term and a regression term, one for each of the two layers described above.
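As a reference point, the Faster R-CNN paper writes this loss in the form below. The symbols are the paper's notation, not something defined earlier in this post.

```latex
L(\{p_i\}, \{t_i\}) =
  \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
```

Here p_i is the predicted objectness probability of anchor i and p_i^* is its label (1 for foreground, 0 for background); t_i and t_i^* are the predicted and target offsets. L_cls is a log loss over the two classes, L_reg is a smooth L1 loss, N_cls and N_reg are normalization terms, and λ balances the two parts. Because L_reg is multiplied by p_i^*, only foreground anchors contribute to the regression term.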

Note:- RPN doesn’t care about the final class of the object (e.g. cat, dog, car or person).
It only cares whether it is a foreground object or background.


Let’s recap the whole concept of RPN using an example.

Say we have a 600×800 image that shrinks by a factor of 16 after passing through the Convolutional Neural Network (CNN) block, giving a feature map of roughly 38×50 (600/16 = 37.5 and 800/16 = 50; the exact size depends on the padding and rounding in the CNN). With 9 anchor boxes for each position of the feature map, we have 38*50*9 = 17,100 anchor boxes to consider. Every anchor box has two possible labels (foreground or background). If we make the depth of the feature map 18 (9 anchors × 2 labels), every anchor gets a vector with two values (normally called logits) representing foreground and background. Feeding the logits into a softmax/logistic regression activation function predicts the labels.

Now the training data is complete with features and labels, and the model can train on it.
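Here is a minimal sketch of how the 18 values at one feature-map position could be turned into 9 objectness scores. It assumes the values are stored as interleaved (background, foreground) pairs, one pair per anchor; real implementations may lay out the channels differently, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def objectness_scores(cell_logits, num_anchors=9):
    """Foreground probability for each anchor at one feature-map position.

    Assumption: cell_logits holds 2 * num_anchors values laid out as
    [bg_0, fg_0, bg_1, fg_1, ...].
    """
    if len(cell_logits) != 2 * num_anchors:
        raise ValueError("expected 2 logits per anchor")
    scores = []
    for k in range(num_anchors):
        bg, fg = cell_logits[2 * k], cell_logits[2 * k + 1]
        scores.append(softmax([bg, fg])[1])
    return scores

# Toy input: anchor 0 strongly foreground, every other anchor undecided.
cell = [0.0, 8.0] + [0.0, 0.0] * 8
scores = objectness_scores(cell)
print(scores)
```
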

Summary / Final Note:-

The output of a Region Proposal Network (RPN) is a set of boxes/proposals that are passed to a classifier and regressor to eventually check for the occurrence of objects. In a nutshell, RPN predicts whether an anchor is background or foreground, and refines the anchor.


Article Credit:-

Name:  Praveen Kumar Anwla
