A Developer’s Intro to Object Detection
by Nichole Peterson | Aug 27, 2019 | Computer Vision | 8 minute read
Whenever a collection of objects enters your field of vision, your brain instinctively begins the processes of recognition and localization. One of the core challenges of computer vision (CV) is replicating this intelligence in a computer.
This is impossible without a proper understanding of human vision and visual perception. The study of biological vision has revealed that the human eyes and brain are intricately interconnected: while the eyes receive the visual data, the brain performs recognition, interprets motion, and fills in any gaps the data leaves.
The human brain can even perceive depth with vision blocked in one eye, by using the parallax effect and relative size comparisons. To add to the complexity, our minds process all visual perception in real time: we recognize a ball in various orientations, under different lighting conditions, and despite numerous kinds of obstruction from other objects in different environments. A computer, by contrast, cannot simply be programmed for an infinite variety of scenarios the way human perception handles them.
To replicate this behavior in machines, we need increasingly accurate machine learning models capable of recognizing and locating multiple objects in an image. Historically, the cost of computational resources and the sheer amount of training data required made such models infeasible, especially on low-power, resource-constrained devices.
However, recent breakthroughs in deep learning and advances in hardware have significantly improved the performance of AI object detection models, and have strengthened their ability to draw meaning from images without relying on massive datasets.
Consequently, it has become both easier and more worthwhile to develop AI object detection applications, which have practical uses across diverse industries. Pre-trained models, already trained to see and understand objects in a variety of fields and environments, are becoming widely available. This gives developers the ability to get the models they need and simply start working with them, using a technique called transfer learning to fine-tune a pre-trained model for their specific application.
Object detection is a subset of computer vision that combines image recognition, classification, and localization to provide a better understanding of the complete picture. The main purpose of AI object detection is to determine which instances of a predefined set of classes appear in an image or video, and to describe their location by enclosing each one in a rectangular bounding box. Hence, our two main concerns with object detection are:
- Classification: deciding which of the predefined classes each object belongs to.
- Localization: describing where each object is in the image, typically as a bounding box.
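A common way to express localization in code is the corner convention (x_min, y_min, x_max, y_max), and a common way to score how well a predicted box matches a ground-truth box is Intersection over Union (IoU). The source does not prescribe a box format, so the convention below is an illustrative assumption:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap at all.
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

IoU of 1.0 means a perfect match, 0.0 means no overlap; detection benchmarks typically count a prediction as correct when IoU exceeds a threshold such as 0.5.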
The applications of object detection are numerous, ranging from autonomous vehicles (such as smart drones and self-driving cars) to video surveillance, image retrieval systems, and facial recognition. For example, an application may use AI object detection to identify particular objects that don't belong in a certain area. When identifying vehicles or detecting foreign entities in restricted-access areas, the sensors used for surveillance generate a large image data set. By scaling this information down to its geospatial contents and combining the results with other data, we can fine-tune our perception of the scenario. The process depends heavily on object detection to pick out moving bodies, such as vehicles or people, and stationary items, such as unattended baggage, from the raw images.
For self-driving cars, determining when to apply the brakes or accelerate is critical, and this application of object detection relies on the car being able to locate and recognize nearby objects. To achieve those results, we need to train the model on a labeled data set that classifies objects such as other vehicles, pedestrians, traffic lights, and road signs.
Face detection and facial recognition are two further applications of AI object detection. Face detection is an example of object-class detection: it locates and finds the size of all the objects in the raw image data that appear to be faces. Facial recognition goes a step beyond that, attempting to identify whose face it is. Facial recognition has many popular applications in biometric surveillance and security, such as the features used to unlock smartphones or grant access to protected areas.
The diversity of these computer vision problems means that different methods and models are needed to yield accurate results. To select an appropriate model, you first need to thoroughly understand how object detection fits into the bigger picture and how to overcome the common hurdles along the way.
Classical approaches to object detection typically rely on template matching techniques, which work by finding small fragments of an image that match a target template. Template matching algorithms branch into two groups: feature-extraction methods and histogram-based methods. In cases where the sub-image varies from the target image, statistical classifiers such as SVMs, Bayes classifiers, and neural networks are used to improve recognition. Other popular classical approaches to object detection include the Viola–Jones framework and HOG (histogram of oriented gradients) descriptors paired with an SVM.
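To make the template matching idea concrete, here is a minimal brute-force sketch: slide the template over every position in the image and keep the window with the smallest sum of squared differences (SSD). Real implementations usually use normalized cross-correlation and optimized libraries; this naive version is for illustration only:

```python
import numpy as np

def match_template(image, template):
    """Brute-force template matching: slide the template over the image and
    return the (row, col) of the window with the smallest sum of squared
    differences (SSD) from the template."""
    ih, iw = image.shape
    th, tw = template.shape
    best_score, best_pos = None, None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            window = image[r:r + th, c:c + tw]
            score = np.sum((window - template) ** 2)
            if best_score is None or score < best_score:
                best_score, best_pos = score, (r, c)
    return best_pos
```

An exact copy of the template embedded in the image yields an SSD of zero at its location, so the function returns that position.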
These classical models are still in use owing to their low computational cost. However, they have drawbacks, especially for human detection, such as false negatives, duplicate detections, flickering detections, and inconsistent boundaries.
Most of the hurdles faced by early approaches have been successfully overcome by deep learning object detection models, largely based on convolutional neural networks (CNNs). A CNN does not require manual feature extraction, as it is capable of learning directly from the image data to perform classification tasks, which also makes it well suited to pattern recognition. The type of model depends on your choice between a single-stage network and a two-stage network.
Single-stage networks make predictions over a grid of regions spanning the target image, using anchor boxes. Single-stage networks like YOLO are faster than two-stage networks, but sometimes compromise accuracy in the process.
Workflow of YOLO:
- Divide the input image into a grid; each cell predicts a fixed number of bounding boxes, each with a confidence score and class probabilities.
- Combine confidence and class scores into per-box detections.
- Apply non-maximum suppression to discard low-confidence and duplicate boxes.
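Because each grid cell in a YOLO-style detector emits several candidate boxes, the same object is often predicted multiple times; detectors therefore finish with non-maximum suppression (NMS) to prune overlaps. A minimal greedy sketch in plain Python (the (x_min, y_min, x_max, y_max) box format and the 0.5 threshold are illustrative assumptions):

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop every remaining box
    whose IoU with it exceeds the threshold, and repeat.
    Boxes are (x_min, y_min, x_max, y_max); returns kept indices."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    # Process boxes from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

Two heavily overlapping predictions collapse to the single higher-scoring one, while distant boxes survive untouched.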
The first stage of a two-stage network finds sub-images that might contain objects by generating region proposals. The second stage refines these proposals and classifies the objects they enclose, producing the final output prediction. These state-of-the-art two-stage networks produce highly accurate results at the cost of detection speed. Examples of two-stage networks include R-CNN and the models based on it.
Workflow of Fast R-CNN:
- Pass the whole image through a CNN once to produce a feature map.
- Project each region proposal onto the feature map and apply RoI pooling to extract a fixed-size feature vector.
- Feed each vector through fully connected layers that output a class score and refined bounding-box coordinates.
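The key trick in Fast R-CNN is RoI pooling: regions of arbitrary size are max-pooled into a fixed grid so they can feed fixed-size fully connected layers. A simplified 2-D, single-channel sketch (real layers operate on multi-channel feature maps with fractional region coordinates):

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Crop `roi` = (r0, c0, r1, c1) from a 2-D feature map and max-pool it
    into a fixed `output_size` grid, in the spirit of RoI pooling."""
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]
    out_h, out_w = output_size
    # Split the region into an out_h x out_w grid of (roughly equal) cells.
    h_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            cell = region[h_edges[i]:h_edges[i + 1], w_edges[j]:w_edges[j + 1]]
            pooled[i, j] = cell.max()
    return pooled
```

Whatever the region's size, the output is always out_h × out_w, which is what lets one set of fully connected weights serve every proposal.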
Although the terms, especially AI object detection and recognition, are sometimes used interchangeably, there is a clear difference among them:
- Image classification assigns a single label to an entire image.
- Object localization finds where an object is, usually as a bounding box.
- Object detection combines the two, classifying and localizing every object instance in the image.
At the base level, there are three steps in most object detection frameworks:
- Generate region proposals: candidate bounding boxes that may contain objects.
- Extract features from each candidate region.
- Classify each region and refine its bounding box.
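These steps can be sketched as a tiny pipeline. The sliding-window proposal generator and the stand-in `score_fn` below are illustrative placeholders (a real framework would use learned proposals and a trained CNN classifier):

```python
def propose_regions(image_w, image_h, window=32, stride=16):
    """Step 1: generate candidate boxes with a naive sliding window."""
    boxes = []
    for y in range(0, image_h - window + 1, stride):
        for x in range(0, image_w - window + 1, stride):
            boxes.append((x, y, x + window, y + window))
    return boxes

def detect(image_w, image_h, score_fn, threshold=0.5):
    """Steps 2-3: score each candidate region and keep confident detections.
    `score_fn` stands in for feature extraction + classification."""
    detections = []
    for box in propose_regions(image_w, image_h):
        score = score_fn(box)
        if score >= threshold:
            detections.append((box, score))
    return detections
```

Swapping the proposal generator and scorer is exactly the axis along which classical, single-stage, and two-stage detectors differ.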
There are a couple of ways to go about creating your own object detection application. The first approach is to start from scratch by designing a network architecture that will extract the features of the objects in question. This requires a large, labeled training data set for the CNN to learn from. While building an object detection application this way may produce accurate results, the pitfall is that it demands a great deal of training data and time. You'll also need to initialize the weights and configure the layers of the CNN manually.
The second approach is to utilize a pre-trained model. Many deep learning object detection models support transfer learning, so you can start with a pre-trained network and modify it for your application. This method produces accurate results in less time, as the pre-trained model has already learned from thousands, if not millions, of images.
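The essence of transfer learning is to freeze the pre-trained backbone and train only a small task-specific head on your own labels. The toy below illustrates just that mechanic, with a fixed random projection standing in for a real pre-trained feature extractor (every name and number here is an illustrative assumption, not any particular library's API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained backbone: fixed weights that are never updated.
W_frozen = rng.normal(size=(4, 8))

def backbone(x):
    return np.tanh(x @ W_frozen)  # frozen feature extractor

# New task-specific head: the only weights we train.
w_head = np.zeros(8)
b_head = 0.0

def predict(x):
    z = backbone(x) @ w_head + b_head
    return 1 / (1 + np.exp(-z))  # sigmoid probability

# Tiny labeled set for the new task: label = sign of the first input feature.
X = rng.normal(size=(64, 4))
y = (X[:, 0] > 0).astype(float)

# Gradient descent on the logistic loss, updating the head only.
for _ in range(500):
    p = predict(X)
    grad_z = p - y
    feats = backbone(X)
    w_head -= 0.1 * feats.T @ grad_z / len(X)
    b_head -= 0.1 * grad_z.mean()

accuracy = ((predict(X) > 0.5) == (y > 0.5)).mean()
```

Because only the small head is optimized, far fewer labeled examples and iterations are needed than when training the whole network from scratch, which is exactly the economy that makes pre-trained detection models attractive.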
We set out to bring everyday developers the resources they need to quickly build computer vision into their apps at scale. Our deep learning computer vision API platform integrates all the primary computer vision services, such as object detection, classification, and tracking. It also standardizes both Caffe and TensorFlow models, the most widely used frameworks today. The deep learning frameworks are compiled from source specifically for the ARM Cortex family of RISC processors, delivering peak performance.
Our pre-trained model libraries are easy to plug into your application. You can also provide your own models to create a customized SDK for offline deployment on your ARM board, eliminating any online dependency. We develop our services for everyone, from large-scale enterprises to everyday developers. Our purpose is to make computer vision applications easy to develop and deploy, saving you time, minimizing resource use, and reducing complexity throughout the process.