We assume you have already read our article on R-CNN and know how it works. If not, read that article first.
The R-CNN architecture has two major drawbacks:
- Each of the ~2000 regions of interest (RoIs) from a single input image has to go through the CNN individually, which is computationally expensive.
- It has to train three different models separately:
  - The CNN that generates image features
  - The classifier that predicts the class
  - The regression model that tightens the bounding boxes
(This makes the pipeline extremely hard to train.)
In 2015, Ross Girshick, the author of R-CNN, solved both these problems, leading to the second algorithm – Fast R-CNN. Let’s now go over its main insights.
- The model takes the photograph and a set of region proposals as input, and passes them through a deep convolutional neural network.
- A pre-trained CNN, such as VGG-16 or AlexNet, is used for feature extraction.
- The end of the deep CNN is a custom layer called a Region of Interest Pooling layer, or RoI Pooling, that extracts the features specific to a given input candidate region.
- The output of the CNN is then interpreted by a fully connected layer, after which the model splits into two outputs: one for the class prediction via a softmax layer, and another with a linear output for the bounding box. This process is repeated for each region of interest in a given image.
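The two-output head described in the last bullet can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the weight names (`w_fc`, `w_cls`, `w_box`), the toy layer sizes, and the random weights are all hypothetical, and the choice of 4 box values per class is our assumption rather than something the list above spells out.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def detection_head(roi_features, params):
    """Map pooled RoI features to class probabilities and box outputs.

    roi_features: (num_rois, feature_dim) array, one row per region.
    params: dict of hypothetical weight matrices and bias vectors.
    """
    # Shared fully connected layer with ReLU.
    hidden = np.maximum(0.0, roi_features @ params["w_fc"] + params["b_fc"])
    # Branch 1: softmax over the classes.
    class_probs = softmax(hidden @ params["w_cls"] + params["b_cls"])
    # Branch 2: linear output, here 4 box values per class (assumption).
    box_outputs = hidden @ params["w_box"] + params["b_box"]
    return class_probs, box_outputs

# Toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
num_rois, feature_dim, hidden_dim, num_classes = 5, 32, 16, 4
params = {
    "w_fc": rng.normal(scale=0.1, size=(feature_dim, hidden_dim)),
    "b_fc": np.zeros(hidden_dim),
    "w_cls": rng.normal(scale=0.1, size=(hidden_dim, num_classes)),
    "b_cls": np.zeros(num_classes),
    "w_box": rng.normal(scale=0.1, size=(hidden_dim, 4 * num_classes)),
    "b_box": np.zeros(4 * num_classes),
}
roi_features = rng.normal(size=(num_rois, feature_dim))
class_probs, box_outputs = detection_head(roi_features, params)
print(class_probs.shape, box_outputs.shape)  # (5, 4) (5, 16)
```

Note that all regions go through the head as one batch: each row of `roi_features` is one region of interest, so both branches produce one row of output per region.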
The insight was simple: why not run the CNN just once per image and then find a way to share that computation across the ~2000 proposals?
This is exactly what Fast R-CNN does using a technique known as RoIPool (Region of Interest Pooling).
As you can observe, RoIPool shares the forward pass of a CNN for a whole image across its subregions. In the image above, notice how the CNN features for each region are obtained by selecting the corresponding region from the CNN’s feature map. Then, the features in each region are pooled (usually using max pooling). So all it takes is one pass over the original image as opposed to ~2000!
The RoI pooling layer also makes sure that the features extracted for every region of interest have the same size.
(This is because we are going to feed these features to a fully connected layer, whose input size is fixed.)
And this way we get rid of the first issue!
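To make the pooling step concrete, here is a rough NumPy sketch of RoI max pooling. It is an illustration under assumptions, not the reference implementation: the function name `roi_max_pool`, the way bins are split, and the convention that RoI coordinates are already given in feature-map cells are all our own choices.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one region of a feature map to a fixed output size.

    feature_map: (channels, height, width) array from the single CNN pass.
    roi: (x1, y1, x2, y2) in feature-map cells, end-exclusive (assumption).
    """
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2, x1:x2]
    out_h, out_w = output_size
    # Split the region into an out_h x out_w grid of roughly equal bins.
    y_edges = np.linspace(0, region.shape[1], out_h + 1).astype(int)
    x_edges = np.linspace(0, region.shape[2], out_w + 1).astype(int)
    pooled = np.zeros((feature_map.shape[0], out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Make sure every bin covers at least one cell.
            y_lo, y_hi = y_edges[i], max(y_edges[i + 1], y_edges[i] + 1)
            x_lo, x_hi = x_edges[j], max(x_edges[j + 1], x_edges[j] + 1)
            pooled[:, i, j] = region[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return pooled

# One "CNN pass" gives one feature map; both regions reuse it.
feature_map = np.arange(36).reshape(1, 6, 6)
pooled_a = roi_max_pool(feature_map, (0, 0, 4, 4))  # 4x4 region
pooled_b = roi_max_pool(feature_map, (1, 2, 6, 5))  # 3x5 region
print(pooled_a[0])  # [[ 7  9] [19 21]]
print(pooled_b[0])  # [[14 17] [26 29]]
```

The two regions have different shapes (4x4 and 3x5 cells), yet both come out as 2x2 grids, which is what lets a fully connected layer with a fixed input size consume them.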
The second advancement of Fast R-CNN is to jointly train the CNN, classifier, and bounding box regressor in a single model. Where earlier we had different models to extract image features (CNN), classify (SVM), and tighten bounding boxes (regressor), Fast R-CNN instead used a single network to compute all three.
Fast R-CNN replaced the SVM classifier with a softmax layer on top of the CNN to output a classification. It also added a linear regression layer parallel to the softmax layer to output bounding box coordinates. In this way, all the outputs needed came from one single network!
Now we have also gotten rid of the second issue!
And this is how Fast R-CNN got an edge over R-CNN.
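Joint training means a single loss has to cover both the classification and the box outputs. The sketch below shows one way to combine them, following our understanding of the Fast R-CNN formulation (a log loss for the class plus a smooth L1 loss for the box). The function names, the `box_weight` parameter, and the convention that class 0 is background are assumptions made for this example.

```python
import numpy as np

def smooth_l1(diff):
    # Quadratic near zero, linear once |diff| reaches 1.
    abs_diff = np.abs(diff)
    return np.where(abs_diff < 1.0, 0.5 * diff ** 2, abs_diff - 0.5)

def multitask_loss(class_probs, true_classes, box_preds, box_targets, box_weight=1.0):
    """Average joint loss over a batch of regions of interest.

    class_probs: (n, num_classes) softmax outputs; class 0 is background (assumption).
    true_classes: (n,) integer labels.
    box_preds, box_targets: (n, 4) box values for each region's true class.
    """
    n = len(true_classes)
    # Classification term: negative log-probability of the true class.
    cls_loss = -np.log(class_probs[np.arange(n), true_classes])
    # Localization term: smooth L1 summed over the 4 box values.
    loc_loss = smooth_l1(box_preds - box_targets).sum(axis=1)
    # Background regions have no box to regress, so they skip the box term.
    is_object = (true_classes > 0).astype(float)
    return float(np.mean(cls_loss + box_weight * is_object * loc_loss))

# A perfectly classified object whose box is off by 0.5 and 2.0 in two coordinates.
object_loss = multitask_loss(
    np.array([[0.0, 1.0]]), np.array([1]),
    np.array([[0.5, 2.0, 0.0, 0.0]]), np.zeros((1, 4)),
)
print(object_loss)  # 1.625
```

Because both terms are added into one number, a single backward pass can update the CNN, the classifier, and the box regressor together.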
Both R-CNN and Fast R-CNN use Selective Search to find region proposals. Selective Search is a slow process that hurts the speed of the whole network.
So even with all the advancements from R-CNN to Fast R-CNN, one bottleneck remained in the Fast R-CNN process: the region proposer.
In R-CNN, the very first step is detecting the locations of objects by generating a bunch of potential bounding boxes, or regions of interest (RoIs), to test.
In Fast R-CNN, these proposals were still created using Selective Search, which runs on the input image separately from the CNN. It is a fairly slow process and was found to be the bottleneck of the overall pipeline.
In mid-2015, a team at Microsoft Research composed of Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun found a way to make the region proposal step almost cost-free through an architecture they (creatively) named Faster R-CNN.
The insight of Faster R-CNN was that region proposals depended on features of the image that were already calculated with the forward pass of the CNN (first step of classification). So why not reuse those same CNN results for region proposals instead of running a separate selective search algorithm?
The authors write:
“Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals [thus enabling nearly cost-free region proposals].”
How the Regions are Generated
Let’s look at how Faster R-CNN generates these region proposals from CNN features. Faster R-CNN adds a fully convolutional network on top of the features of the CNN, creating what’s known as the Region Proposal Network.
The Region Proposal Network slides a window over the features of the CNN. At each window location, the network outputs a score and a bounding box per anchor (hence 4k box coordinates where k is the number of anchors). Source: https://arxiv.org/abs/1506.01497
The Region Proposal Network works by passing a sliding window over the CNN feature map and at each window, outputting k potential bounding boxes and scores for how good each of those boxes is expected to be.
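To see what "k boxes and scores per window" means in terms of array shapes, here is a small NumPy sketch of such a head. Everything in it is an assumption made for illustration: the function name `rpn_head`, the 3x3 window size, the hidden width, the random weights, and the use of a single sigmoid score per anchor.

```python
import numpy as np

def rpn_head(feature_map, w_window, w_score, w_box):
    """Slide a 3x3 window over the feature map and emit per-anchor outputs.

    feature_map: (channels, height, width) CNN features.
    w_window: (3, 3, channels, hidden) weights of the sliding window.
    w_score: (hidden, k) and w_box: (hidden, 4 * k) sibling output layers.
    """
    channels, height, width = feature_map.shape
    # Zero-pad so the window is defined at every feature-map location.
    padded = np.pad(feature_map, ((0, 0), (1, 1), (1, 1)))
    hidden = np.zeros((height, width, w_window.shape[3]))
    # A 3x3 convolution written as nine shifted matrix products.
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + height, dx:dx + width]
            hidden += np.einsum("chw,cd->hwd", patch, w_window[dy, dx])
    hidden = np.maximum(0.0, hidden)
    # One score per anchor, squashed to (0, 1) with a sigmoid.
    scores = 1.0 / (1.0 + np.exp(-(hidden @ w_score)))
    # Four box values per anchor.
    box_deltas = hidden @ w_box
    return scores, box_deltas

# Toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
channels, height, width, hidden_dim, k = 8, 4, 5, 16, 9
feature_map = rng.normal(size=(channels, height, width))
scores, box_deltas = rpn_head(
    feature_map,
    rng.normal(scale=0.1, size=(3, 3, channels, hidden_dim)),
    rng.normal(scale=0.1, size=(hidden_dim, k)),
    rng.normal(scale=0.1, size=(hidden_dim, 4 * k)),
)
print(scores.shape, box_deltas.shape)  # (4, 5, 9) (4, 5, 36)
```

The key point is the output shape: for every one of the 4 x 5 window positions there are k = 9 scores and 4k = 36 box values, all computed from the feature map the CNN already produced.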
What do these k boxes represent?
We know that the bounding boxes for people tend to be rectangular and vertical. We can use this intuition to guide our Region Proposal Network by creating an anchor with such dimensions.
Intuitively, we know that objects in an image should fit certain common aspect ratios and sizes. For instance, we know that we want some rectangular boxes that resemble the shapes of humans. Likewise, we know we won’t see many boxes that are very, very thin. Following this idea, we create k boxes with such common aspect ratios and sizes, which we call anchor boxes. For each such anchor box, we output one bounding box and score per position in the image.
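Anchor boxes are easy to generate in code. The sketch below builds k anchors from a few sizes and aspect ratios and then places a copy at every feature-map position. The helper names, the particular scales and ratios (3 x 3 = 9 anchors), and the stride of 16 pixels per feature-map cell are illustrative assumptions.

```python
import numpy as np

def make_anchors(scales, aspect_ratios):
    """Return (k, 4) anchors as (x1, y1, x2, y2), centered on the origin."""
    anchors = []
    for scale in scales:
        for ratio in aspect_ratios:  # ratio = height / width
            # Keep the area at scale**2 while changing the shape.
            width = scale / np.sqrt(ratio)
            height = scale * np.sqrt(ratio)
            anchors.append([-width / 2, -height / 2, width / 2, height / 2])
    return np.array(anchors)

def place_anchors(base_anchors, feature_height, feature_width, stride):
    """Copy the base anchors to the center of every feature-map cell."""
    # Cell centers, converted from feature-map cells to image pixels.
    xs = (np.arange(feature_width) + 0.5) * stride
    ys = (np.arange(feature_height) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1)  # (H, W, 4)
    # Broadcast to (H, W, k, 4): every position gets all k anchors.
    return shifts[:, :, None, :] + base_anchors[None, None, :, :]

base_anchors = make_anchors(scales=(128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0))
all_anchors = place_anchors(base_anchors, feature_height=4, feature_width=5, stride=16)
print(base_anchors.shape, all_anchors.shape)  # (9, 4) (4, 5, 9, 4)
```

An aspect ratio of 2.0 here gives the tall, person-like box mentioned above, while 0.5 gives a wide one. The network only has to predict a score and a small correction for each anchor.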
We then pass each such bounding box that is likely to be an object into Fast R-CNN to generate a classification and tightened bounding boxes.
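Picking out the boxes that are "likely to be an object" could look roughly like the sketch below. Note the assumptions: the text above does not say how boxes are filtered, so the score threshold, the removal of near-duplicate boxes by overlap (greedy non-maximum suppression), the cap on the number of proposals, and all default values are our own illustrative choices.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """Intersection over union between one box and an (n, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def select_proposals(boxes, scores, score_threshold=0.5,
                     iou_threshold=0.7, max_proposals=300):
    """Keep confident, non-duplicate boxes to hand over to Fast R-CNN."""
    # Drop boxes that are unlikely to contain an object.
    confident = scores >= score_threshold
    boxes, scores = boxes[confident], scores[confident]
    order = np.argsort(-scores)  # best score first
    kept = []
    while order.size > 0 and len(kept) < max_proposals:
        best, rest = order[0], order[1:]
        kept.append(best)
        # Discard remaining boxes that overlap the kept box too much.
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    kept = np.array(kept, dtype=int)
    return boxes[kept], scores[kept]

boxes = np.array([
    [0.0, 0.0, 10.0, 10.0],    # strong box
    [0.0, 0.0, 10.0, 9.0],     # near-duplicate of the first box
    [20.0, 20.0, 30.0, 30.0],  # separate object
    [50.0, 50.0, 60.0, 60.0],  # low score
])
scores = np.array([0.9, 0.8, 0.7, 0.1])
kept_boxes, kept_scores = select_proposals(boxes, scores)
print(kept_boxes)
print(kept_scores)
```

In this toy example the near-duplicate and the low-scoring box are dropped, so only the first and third boxes would be passed on for classification and box tightening.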
The goal of image instance segmentation is to identify, at a pixel level, what the different objects in a scene are.
Can we extend such techniques to go one step further and locate exact pixels of each object instead of just bounding boxes?
This problem, known as image instance segmentation, is what Kaiming He and a team of researchers, including Girshick, explored at Facebook AI using an architecture known as Mask R-CNN.