Whenever a collection of objects enters your field of vision, your brain instinctively begins the processes of recognition and localization. One of the core challenges of computer vision is to replicate this intelligence in a computer, which is impossible without a proper understanding of human vision and visual perception. The study of biological vision has revealed an intricate interplay between the human eyes and brain: while the eyes receive the visual data, the brain performs recognition, interprets motion, and fills in any gaps in the information. The brain can even perceive depth with one eye blocked, relying on motion parallax and relative size comparisons. To add to the complexity, our mind processes all of this in real time. It can recognize a ball in various orientations, under different lighting conditions, and with numerous kinds of occlusion from other objects in different environments. A computer, in contrast to human perception, is not inherently designed to handle such open-ended scenarios.
To replicate this behavior in machines, we require increasingly accurate machine learning models capable of recognizing and locating multiple objects in an image. Historically, the cost of computational resources and the sheer amount of training data required made such models infeasible, especially on low-power, resource-constrained devices.
However, recent breakthroughs in deep learning and advances in hardware have significantly improved the performance of object detection models, and have strengthened their ability to draw meaning from images without relying on massive datasets. Consequently, it has become both easier and more worthwhile to develop object detection applications, which have numerous practical uses across diverse industries. Pre-trained models are becoming available that have been trained to see and understand objects in a variety of fields and environments, giving developers the ability to pick a model and simply start working with it. Developers can then use a technique called transfer learning to fine-tune these pre-trained models for their specific application needs.
What is Object Detection?
Object detection is a subset of computer vision that combines image classification and localization to provide a better understanding of the complete picture. Its main purpose is to determine which instances of a predefined set of classes appear in an image, and to describe their locations in the image or video by enclosing each one in a rectangular bounding box. Hence, our two main concerns with object detection are:
- Identify and filter out the object of interest with respect to the background
- Draw bounding boxes and label all objects to know where they are located in an image/scene
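Once boxes are drawn, we need a way to compare them, e.g. a prediction against a ground-truth annotation. A common measure is intersection-over-union (IoU): the area where two boxes overlap divided by the area they jointly cover. A minimal sketch, assuming boxes are given as `(x_min, y_min, x_max, y_max)` tuples (the function name `iou` is ours, for illustration):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Coordinates of the overlap rectangle (empty if the boxes are disjoint).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A perfect match scores 1.0; disjoint boxes score 0.0.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.1429
```

IoU shows up throughout object detection: evaluation metrics count a detection as correct when its IoU with the ground truth exceeds a threshold, and non-maximum suppression uses it to decide when two boxes cover the same object.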
Object Detection in the Real World
The applications of object detection are numerous, ranging from autonomous vehicles (such as smart drones and self-driving cars) to video surveillance, image retrieval systems, and facial recognition. For example, an application may use object detection to identify particular objects that don't belong in a certain area. When identifying vehicles or detecting foreign entities in restricted-access areas, surveillance sensors generate a large image data set. By scaling this information down to its geospatial contents and combining the results with other data, we can fine-tune our perception of the scenario. The process relies heavily on object detection to pick out both moving bodies, such as vehicles or people, and stationary suspicious items, such as unattended baggage, from the raw images.
For self-driving cars, determining when to apply the brakes or accelerate is critical, and it depends on the car being able to locate and recognize the objects nearby. To achieve this, we need to train the model on a labeled data set that classifies objects such as other vehicles, pedestrians, traffic lights, and road signs.
Face detection and facial recognition are two more applications of object detection. Face detection is an example of object-class detection: it locates and sizes all the objects in the raw image data that appear to be faces. Facial recognition goes beyond that, attempting to identify whose face it is. Facial recognition has many popular applications in biometric surveillance and security, such as the features used to unlock smartphones or grant access to protected areas.
The diversity of these computer vision problems essentially means that we require unique methods and models for yielding accurate results. For selecting an appropriate model, you need to first thoroughly understand how object detection fits into the bigger picture and how to overcome the common hurdles along the way.
Classical approaches to object detection typically rely on template matching techniques, which work by finding small fragments of an image that match a target image. Template matching algorithms branch off into two groups: feature-based methods and gray-level histogram methods. In cases where the sub-image varies from the target image, statistical classifiers such as SVMs, Bayes classifiers, and neural networks are used to improve recognition. Other popular classical approaches to object detection include:
- Multi-resolution models
- Viola-Jones Object Detection
- Scale-space search with classifier
- SVM (Support Vector Machine) Classification with HOG (Histogram of Oriented Gradient features)
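To make the last of these concrete, the core of a HOG descriptor is a histogram of gradient orientations weighted by gradient magnitude. Below is a deliberately simplified sketch of that idea in numpy, computing a single orientation histogram for one patch; a real HOG pipeline additionally tiles the image into cells, normalizes over blocks, and feeds the concatenated descriptor to an SVM (the function name `orientation_histogram` is ours, for illustration):

```python
import numpy as np

def orientation_histogram(patch, bins=9):
    """Simplified HOG-style descriptor: one gradient-orientation histogram
    over a single patch (real HOG tiles cells and block-normalizes)."""
    gy, gx = np.gradient(patch.astype(float))   # vertical, horizontal gradients
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees, as is conventional for HOG.
    angle = np.degrees(np.arctan2(gy, gx)) % 180
    hist, _ = np.histogram(angle, bins=bins, range=(0, 180), weights=magnitude)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# A patch whose intensity ramps horizontally has purely horizontal gradients,
# so all the histogram mass lands in the 0-degree bin.
ramp = np.tile(np.arange(8.0), (8, 1))
print(orientation_histogram(ramp))
```

Descriptors like this are what the "feature extraction" half of classical detection produces; the SVM then learns a decision boundary over them.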
These classical models are still in use owing to their low computational cost. However, they have drawbacks (especially for human detection) such as false negatives or duplicate detections, flickers in detection and inconsistent boundaries.
Deep Learning Object Detection
Most of the hurdles faced by early approaches have been overcome by deep learning object detection models, largely based on CNNs (Convolutional Neural Networks). A CNN does not require manual feature extraction, as it is capable of learning directly from the image data to perform classification tasks, which also makes it ideal for pattern recognition. The type of model depends on your choice of a single-stage network or a two-stage network.
Single-stage networks make predictions over a grid of regions spanning the target image, using anchor boxes. Single-stage networks like YOLO are faster than two-stage networks, but sometimes compromise accuracy in the process.
Workflow of YOLO:
- Divide the input image into an n × n grid of equal-sized cells.
- Each grid cell predicts a preset number b of bounding boxes with predefined rectangular shapes, called anchor boxes (typically obtained with the K-means algorithm). Each prediction carries a class probability and a confidence score, and objects of different shapes and sizes are assigned distinct anchor boxes.
- Keep the bounding boxes with high confidence and probability scores, which indicate a higher likelihood that an object is enclosed. Decoding the predictions produces the final bounding boxes for every object.
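The final filtering step can be sketched as follows. This is a toy decoder, not YOLO's actual decoding: we assume the network output is an array of shape `(S, S, B, 5 + C)` holding `(x, y, w, h, objectness, C class probabilities)` per anchor box, with coordinates already in image units, and we keep boxes whose objectness-times-class-probability score clears a threshold (the function name `decode_grid` and the layout are our assumptions):

```python
import numpy as np

def decode_grid(preds, conf_threshold=0.5):
    """Toy decoder for a YOLO-style grid output of shape (S, S, B, 5 + C):
    each box is (x, y, w, h, objectness, class probs)."""
    detections = []
    s1, s2, num_boxes, _ = preds.shape
    for i in range(s1):
        for j in range(s2):
            for b in range(num_boxes):
                x, y, w, h, obj = preds[i, j, b, :5]
                class_probs = preds[i, j, b, 5:]
                cls = int(np.argmax(class_probs))
                score = obj * class_probs[cls]  # confidence * class probability
                if score > conf_threshold:
                    detections.append((x, y, w, h, cls, float(score)))
    return detections

# One confident box on a 2x2 grid with 1 anchor and 2 classes.
preds = np.zeros((2, 2, 1, 7))
preds[0, 1, 0] = [5, 5, 3, 3, 0.9, 0.1, 0.8]
print(decode_grid(preds))  # one detection, class 1, score 0.72
```

In a real network the raw outputs would first be squashed through sigmoids/softmax and offset relative to the cell and anchor; this sketch only shows the thresholding logic of the last bullet.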
The first stage of a two-stage network finds sub-images that might contain objects by generating region proposals. The second stage refines these proposals and classifies the objects they enclose to produce the final output prediction. These state-of-the-art two-stage networks produce more accurate results, at the cost of detection speed. Examples of two-stage networks include R-CNN and its descendants.
Workflow of Fast R-CNN:
- Take the input image and pass it to the CNN to get feature maps.
- The derived feature maps are passed on to the Region Proposal Network, which produces numerous windows that may contain an object and outputs an objectness score for each.
- The RoI (Region of Interest) layer is applied to reshape the proposed regions to a fixed size.
- The feature maps inside each resized proposal window are passed to a fully connected layer, followed by softmax and regression layers that determine the class probabilities of the objects and refine the bounding boxes.
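The reshaping step in the RoI layer can be illustrated with a tiny numpy sketch: a rectangular region of a feature map is split into a fixed grid of bins and each bin is max-pooled, so proposals of any size come out at the same fixed resolution. This is a simplified single-channel toy, not the batched multi-channel implementation, and it assumes the region is at least as large as the output grid (the function name `roi_pool` is ours):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Toy RoI pooling: max-pool a region (r0, c0, r1, c1) of a 2-D feature
    map down to a fixed output_size, as in Fast R-CNN's RoI layer."""
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]
    out_h, out_w = output_size
    # Split the region into an out_h x out_w grid and take the max of each bin.
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            pooled[i, j] = region[row_edges[i]:row_edges[i + 1],
                                  col_edges[j]:col_edges[j + 1]].max()
    return pooled

fm = np.arange(16.0).reshape(4, 4)
print(roi_pool(fm, (0, 0, 4, 4)))  # [[5, 7], [13, 15]]
```

Because every proposal is pooled to the same shape, the fully connected classification head can accept them all without caring how large the original region was.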
Object Detection Vs. Image Recognition Vs. Image Classification & Localization
Although the terms, especially object detection and recognition, are sometimes used interchangeably, there is a clear difference among them as defined below:
- Image Recognition: Receive an image as input and output class label for the image from a predefined set of classes.
- Image Classification with Localization: Receive an input image and output a class label plus a bounding box to locate an object in an image.
- Object Detection: Image localization applied to all the objects in the image, yielding a class label and a bounding box for each one (i.e. object detection comprises multiple class labels and multiple bounding boxes per image).
General Workflow of Object Detection Frameworks
At the base level, there are three steps in most object detection frameworks:
- For Object Localization: An appropriate model is used to produce region proposals for an image. These region proposals include a substantial set of bounding boxes generated across the complete image.
- For Object Classification: Features are extracted from the image regions defined by the bounding boxes and used to determine which class of object is present in each box.
- For Non-Maximum Suppression: The final post-processing step discards overlapping bounding boxes that enclose the same object, keeping only the single highest-scoring box.
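The last step can be sketched as greedy non-maximum suppression: repeatedly keep the highest-scoring remaining box and discard every box that overlaps it beyond an IoU threshold. A minimal pure-Python version, assuming boxes are `(x_min, y_min, x_max, y_max)` tuples (the function names are ours, for illustration):

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it beyond
    iou_threshold, repeat. Returns the indices of the kept boxes."""
    def iou(a, b):
        # Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes.
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Two near-duplicate detections of one object, plus a separate object.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Production pipelines typically run a vectorized equivalent of this (and often per-class), but the greedy logic is the same.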
So, How Do I Get Started?
There are a couple of ways to go about creating your own object detection application. The first approach is starting from scratch: designing a network architecture that will extract the features of the objects in question. This requires a large, labeled training data set for the CNN to learn from. While building from scratch may produce accurate results, the pitfall is that it demands a large training data set and plenty of time, and you'll need to configure the layers and weights of the CNN yourself.
The second approach is to utilize a pre-trained model. Many deep learning object detection models support transfer learning, so you can start with a pre-trained network and modify it for your application. This method produces accurate results in less time, as the pre-trained model has already processed and learned from thousands, if not millions, of images.
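The idea behind transfer learning can be shown with a deliberately tiny numpy toy: a frozen layer stands in for the pre-trained backbone and is reused as-is, while only a small new classification head is trained on the target task. Real applications would fine-tune an actual CNN from a model zoo (e.g. via a deep learning framework); everything here, including the random "pre-trained" weights, is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained backbone: a frozen feature layer whose
# weights are reused, never retrained.
W_frozen = rng.normal(size=(4, 8))

def features(x):
    return np.maximum(0.0, x @ W_frozen)  # frozen ReLU features

# The new task: toy binary labels on fresh data.
X = rng.normal(size=(64, 4))
y = (X[:, 0] > 0).astype(float)

def log_loss(w):
    p = 1.0 / (1.0 + np.exp(-features(X) @ w))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

# Fine-tune ONLY the new head w by gradient descent; W_frozen never changes.
w = np.zeros(8)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-features(X) @ w))
    w -= 0.1 * features(X).T @ (p - y) / len(y)

print(log_loss(np.zeros(8)), "->", log_loss(w))  # training loss drops
```

Because only the small head is optimized, training is fast and needs little data, which is exactly the appeal of starting from a pre-trained network.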
Why Choose Our Platform?
We set out to bring everyday developers the resources needed to quickly build computer vision into their apps at scale. Our deep learning computer vision API platform integrates all the primary CV services, such as object detection, classification, and tracking. It also standardizes both Caffe and TensorFlow models, the most widely used frameworks today. The deep learning frameworks are compiled from source specifically for the ARM Cortex family of RISC processors, giving you peak performance.
Our pre-trained model libraries are easy to plug into your application. You’re also able to provide your own models to create a customized SDK for offline deployment on your ARM board, which eliminates online dependency. We develop our services for everyone, ranging from large scale enterprises to everyday developers. Our purpose is to allow computer vision applications to be easily developed and deployed, thus saving you time, minimizing resources and simplifying complexities through the entire process.