Why did we need R-CNN?
Before R-CNN, object detection was done with the sliding-window approach: slide windows of many different sizes across the whole image and classify each one, in the hope that some window contains the target object. This is, of course, very computationally expensive and time consuming.
So the key problem was: working on a very large number of regions.
In 2014, Ross Girshick et al. from UC Berkeley proposed a method that uses selective search to extract just 2000 regions from the image; they called these region proposals.
As the name suggests, R-CNN is a region-based object detection algorithm. To bypass the problem of selecting a huge number of regions, R-CNN, instead of examining every possible region of the image, proposes a bunch of boxes and checks whether any of them actually corresponds to an object.
Basically, R-CNN can be summed up as three tasks:
- Region proposal
- Feature extraction for each proposed region
- Classification and bounding-box refinement
Region Proposal (ROI) task-
In order to extract these boxes from the image, R-CNN uses the selective search algorithm. Given an input image, selective search produces about 2000 different regions, of varying sizes and aspect ratios. These boxes are called Regions of Interest (ROIs); in effect, the algorithm draws multiple candidate bounding boxes on the given image.
How does it do that?
- Generate an initial sub-segmentation of the image, so that we have many small regions based on color, texture, size and shape.
- Combine similar regions to form larger ones (based on color similarity, texture similarity, size similarity, and shape compatibility).
- The authors use the selective search algorithm to generate 2000 category-independent region proposals (usually represented as rectangular 'bounding boxes') for each individual image.
- Finally, these merged regions yield the candidate object locations (the Regions of Interest).
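The merging loop above can be sketched in a toy form. This is not the real selective search (which also uses texture, size and shape-compatibility terms over a graph of adjacent regions); it is a minimal illustration, with made-up region records, of how greedy pairwise merging turns small segments into a hierarchy of proposals:

```python
# Toy sketch of selective-search-style greedy merging (not the full
# algorithm): each region carries a colour histogram and a size, the
# most similar pair is merged, and every intermediate region is kept
# as a candidate proposal. All names here are illustrative.

def hist_similarity(a, b):
    # Histogram intersection: higher means more similar colours.
    return sum(min(x, y) for x, y in zip(a["hist"], b["hist"]))

def merge(a, b):
    # The merged histogram is the size-weighted average of the parts.
    size = a["size"] + b["size"]
    hist = [(x * a["size"] + y * b["size"]) / size
            for x, y in zip(a["hist"], b["hist"])]
    return {"size": size, "hist": hist}

def selective_search_sketch(regions):
    proposals = list(regions)              # initial sub-segmentations
    while len(regions) > 1:
        # Pick the most similar pair and merge it into a larger region.
        pairs = [(i, j) for i in range(len(regions))
                 for j in range(i + 1, len(regions))]
        i, j = max(pairs, key=lambda p: hist_similarity(regions[p[0]],
                                                        regions[p[1]]))
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged)           # every merge is a proposal
    return proposals
```

Note how every intermediate merge is kept: that is what lets selective search propose objects at many scales at once.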
Now, to extract meaningful underlying features, a CNN is used. Since the regions of interest will have varying heights and widths, they cannot be fed to the CNN directly; so each of the 2000 regions proposed for an image is converted to a fixed size of 227×227 with simple warping.
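A minimal sketch of that warping step, using plain nearest-neighbour resampling (the paper's actual warp also adds some context padding around the box, which is omitted here):

```python
import numpy as np

# Warp an arbitrary-size ROI crop to the fixed 227x227 input that
# AlexNet expects, via nearest-neighbour index mapping.

def warp_roi(roi, out_size=227):
    h, w = roi.shape[:2]
    rows = np.arange(out_size) * h // out_size   # source row per output row
    cols = np.arange(out_size) * w // out_size   # source col per output col
    return roi[rows[:, None], cols]

crop = np.zeros((64, 180, 3), dtype=np.uint8)    # an arbitrary ROI crop
warped = warp_roi(crop)
print(warped.shape)   # (227, 227, 3)
```

Because every proposal is warped to the same shape, a single fixed-input CNN can process all 2000 of them.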
The authors map each object proposal (ROI) to the ground-truth instance with which it has the maximum IoU overlap and label it as positive (for the matched ground-truth class) if the IoU is at least 0.5. All remaining boxes are treated as the background class (negative for all classes).
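The IoU overlap and that labelling rule can be written down directly. `label_proposal` and its ground-truth format are illustrative, not taken from the paper's code:

```python
# IoU between two boxes in (x1, y1, x2, y2) form, plus the labelling
# rule from the text: IoU >= 0.5 against the best-matching ground
# truth is positive for that class, everything else is background.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def label_proposal(roi, ground_truths, threshold=0.5):
    # ground_truths: list of (box, class_name) pairs (illustrative format).
    best_box, best_cls = max(ground_truths, key=lambda gt: iou(roi, gt[0]))
    return best_cls if iou(roi, best_box) >= threshold else "background"
```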
At this point we have CNN feature maps capturing the useful features of each ROI; what remains is to classify the ROI.
On the features from the final layer of the CNN, R-CNN trains class-specific Support Vector Machines (SVMs) that classify whether the region contains an object and, if so, which class it belongs to.
Now, having found the object in the box, we can tighten the box to fit the true dimensions of the object: R-CNN runs a simple linear regression on the region proposal's features to generate tighter bounding-box coordinates for the final result.
So in this stage the input is a sub-region of the image (the ROI) and the output is a new set of bounding-box coordinates for the object in that sub-region.
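The usual R-CNN parameterization of this regression relates a proposal P to its ground truth G, both written as (center-x, center-y, width, height): the model learns four offsets, and applying them to P recovers G. A small sketch of the target computation and its inverse:

```python
import math

# Bounding-box regression targets in the standard R-CNN form: centre
# shifts are normalised by the proposal's width/height, and scale
# changes are expressed in log space.

def regression_targets(p, g):
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def apply_targets(p, t):
    # Inverse transform: apply learned offsets to a proposal box.
    px, py, pw, ph = p
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph,
            pw * math.exp(tw), ph * math.exp(th))
```

Using log-space scale offsets keeps the targets well behaved whether the proposal is much smaller or much larger than the object.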
R-CNN is just the following steps:
- Generate a set of proposals for bounding boxes.
- Run the image inside each bounding box through a pre-trained AlexNet and, finally, an SVM to see what object the box contains.
- Once the object has been classified, run the box through a linear regression model to output tighter coordinates for the box.
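Putting these steps together, the inference loop looks roughly like the sketch below; all four callables are hypothetical placeholders standing in for the real components, not an actual API:

```python
# Skeleton of R-CNN inference. propose_regions, cnn_features,
# svm_scores and refine_box are stubbed-out placeholders for the
# selective search, AlexNet, per-class SVMs and box regressor.

def rcnn_detect(image, propose_regions, cnn_features, svm_scores, refine_box):
    detections = []
    for roi in propose_regions(image):      # ~2000 proposals per image
        feats = cnn_features(roi)           # one CNN forward pass each
        cls, score = svm_scores(feats)      # class-specific SVMs
        if cls != "background":
            detections.append((cls, score, refine_box(roi, feats)))
    return detections
```

The loop structure makes the cost problem discussed next visible: the expensive `cnn_features` call sits inside the per-proposal loop.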
However, R-CNN has serious drawbacks:
- It requires a forward pass of the CNN (AlexNet) for every single region proposal of every single image: that is around 2000 forward passes per image!
- It has to train three different models separately: the CNN that generates image features, the SVM classifier that predicts the class, and the regression model that tightens the bounding boxes. This makes the pipeline extremely hard to train.
In 2015, Ross Girshick, the first author of R-CNN, solved both of these problems, leading to a second algorithm known as Fast R-CNN.