Real-time human detection is emerging as a significant trend among data scientists and across industries ranging from smart cities to retail to surveillance. It no longer seems like science fiction to consider:
In fact, it may be easier than you think.
Successfully detecting a person in an image or video means you are building an application that will marry object detection and image classification. The tech that lets you detect objects in image data is a little different than the popular visual classification tools currently being used across many industries.
For one thing, there is now a framework in place to detect specific objects in video with varying levels of accuracy. Pairing the identified location of an object in an image with the understanding of the object’s class means your application can differentiate between a human in one region of the image and an object that might be mistaken for a person, such as a mannequin in a retail environment.
Exploring object detection means understanding:
1. What you could accomplish by detecting people in images and video.
2. How detecting a person is different from other tech, such as facial recognition.
3. The relationship between general object detection, such as vehicle detection, and detecting people.
4. Probability issues that define these tools.
5. Current and prospective real-world industry 4.0 applications of this tech.
What can you accomplish by applying object detection tools to your images and video? Using computer vision for human detection accomplishes three distinct tasks:
On a higher level, there are two elements to consider when approaching human detection in an image using computer vision applications. First, there's the technical side — how you can detect people in an image or a video. The second part is what you can do with the results, and has to do with the quality of the returns you get from your application.
Generally speaking, solving the problem of how to detect objects in photo or video data starts with a systematic division of the image. First, the tool would apply algorithms to input with the intent of identifying regions of interest. The machine would then come up with a range of object proposals based on your settings. The final steps of detection are to classify objects based on models, apply probability thresholds and return the classes and the locations in the frame of your final accepted proposals.
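The final steps described above (classify, apply a probability threshold, return classes and locations) can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's API: the `Proposal` type, the `accept_proposals` function, the 0.5 threshold, and the sample values are all hypothetical, and the proposals are assumed to have already been produced by a model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Proposal:
    """One candidate object returned by a (hypothetical) detection model."""
    label: str                       # proposed class, e.g. "person"
    score: float                     # model confidence in the range 0.0 to 1.0
    box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels


def accept_proposals(proposals: List[Proposal],
                     target_label: str = "person",
                     threshold: float = 0.5) -> List[Proposal]:
    """Keep only proposals of the target class that meet the threshold."""
    return [p for p in proposals
            if p.label == target_label and p.score >= threshold]


# Made-up proposals standing in for real model output.
proposals = [
    Proposal("person", 0.92, (40, 30, 120, 260)),
    Proposal("person", 0.35, (300, 50, 360, 200)),     # too uncertain
    Proposal("mannequin", 0.88, (500, 40, 580, 270)),  # wrong class
]

accepted = accept_proposals(proposals)
for p in accepted:
    print(p.label, p.score, p.box)
```

Raising or lowering `threshold` is the simplest way to trade missed detections against false alarms, which is why the right value depends on the use case.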
The class you would be looking for, in this case, is humans. The applications detect these human objects in the visual field via processing blocks pre-trained by crunching a huge number of images with a deep learning artificial intelligence system. These processing blocks are known as models, and they can be trained to recognize almost anything humans can see.
When implementing human detection in your computer vision application, you can use a pre-trained model or supply the model yourself. The more data you give your model, the better your device will be at recognizing the objects you want and learning how to improve for the future.
After you have an output, it would be up to you how to use it. Your use case would determine various specifics, such as the detection thresholds. The best object detection tool for one situation may not work for others.
Machines have become effective in the past few years at performing these tasks due to advances in model training and multilayer deep processing. For example, DeepFace and DeepID, two early vanguards of the feature extraction and comparison technique, have become so efficient that many consumer-facing examples now exist: people now use the technology to unlock their tablets and phones.
Cloud processing represented an important breakthrough in CV, putting powerful resources in the hands of developers everywhere. Newer CV platforms go further, providing easy access to open API platforms and giving developers even more flexibility. It's now possible to build deep learning applications on devices at the edge. You don't need to be a computer vision expert, and you don't need to rely on cloud connectivity, to use core computer vision services such as object detection to process and analyze your images. Enterprises of all sizes can now build and deploy advanced computer vision applications on resource-constrained, low-power devices.
There is a difference between detecting people and detecting other objects. For example, object detection for manufacturing is, in most cases, quite different from people detection.
Manufacturing object detection applications might involve pipeline tracking or analysis of robotic behavior, or using computer vision to analyze microscopic defects. The human detection goals in the same industry might be more geared towards other aspects of operations. For example, AI analysis may be able to leverage existing security camera feeds to improve worker safety in plants, complementing existing practices and safeguards. The ability of the AI CV tools to detect types of objects — not simply their presence or movement — allows this level of versatility.
One of the central concepts in object detection is classification. The model takes an image input and, based on how it has been trained, returns a proposed object class. For example, you might build a simple human detection application to look through a series of images and look for people. It would compare the image to the model's understanding of what a human is (and is not), and then return a probability value for each input you give it. This is typically a percentage value from zero to one hundred that indicates how confident the program is of a human presence in the photo.
The probability that a detected object is a member of the proposed class is a central concern of almost every object detection application. Every region that an object detection algorithm defines usually has a probability score assigned to it. The average of these scores, secured by various methods, would return the total probability that a proposed object is a person. These figures can be more or less accurate based on a number of factors:
1. The amount of data in and level of training of the model.
2. The type of analysis performed on the image.
3. The quality of the data input.
4. The size of the object relative to the total image.
For example, a high-accuracy, low-speed analysis might start with a massive set of boxes and apply rules that sort through them, looking for possible hits. Lower-accuracy rapid analysis might take a grid and analyze each section for the presence of each class in the model. Regardless of the processes you intend to employ, your goal for your application will determine the probability you require at the output.
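To make the trade-off concrete, here is a rough sketch that counts how many regions each style of analysis would have to examine. It assumes a sliding-window scheme as one way of generating the "massive set of boxes"; the function names, the 640x480 frame, the 7x7 grid, and the window and stride sizes are illustrative choices, not values taken from any specific detector.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def grid_cells(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split the frame into rows x cols cells (fast, coarse analysis).

    Integer division is used, so a few pixels at the right and bottom
    edges may be left uncovered when the size does not divide evenly.
    """
    cell_w, cell_h = width // cols, height // rows
    return [(c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
            for r in range(rows) for c in range(cols)]


def sliding_windows(width: int, height: int,
                    win_w: int, win_h: int, stride: int) -> List[Box]:
    """Generate overlapping candidate boxes (slower, finer analysis)."""
    return [(x, y, x + win_w, y + win_h)
            for y in range(0, height - win_h + 1, stride)
            for x in range(0, width - win_w + 1, stride)]


cells = grid_cells(640, 480, rows=7, cols=7)
windows = sliding_windows(640, 480, win_w=64, win_h=128, stride=16)

print(len(cells))    # 49 regions to analyze
print(len(windows))  # 851 regions to analyze, at a single window size
```

Even at one window size, the exhaustive approach examines many times more regions than the grid, which hints at why it tends to be slower.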
The bottom line is that understanding the hardware or type of embedded device, as well as your use case goals, is important for setting up your development cycle for success. Knowing your options is the first step toward finding the ideal implementation.
Facial recognition and facial detection have some similarities, especially on the technical side, but they're different in terms of utility. Make sure you know the difference when looking at object detection platforms or models:
Facial recognition: Identifies people based on their characteristics.
Facial detection: Finds individual faces, without individual identification.
Both facial recognition and facial detection use object detection frameworks to classify and locate objects in a visual field. Facial detection would take image input of some kind, check for the person or face class of objects and locate them in the frame. Additionally, facial recognition would pick out eyes, mouths, and various other features to compare to a known dataset.
The generalized technical aspect of the process is more or less where the similarities end. The goals are typically quite different. Whereas facial recognition finds faces and analyzes their features, most facial detection tools only need to confirm the presence of people and identify their locations.
This gives you options. For example, at a store, a solution with facial detection could count the number of customers and the time spent in various parts of the store, while facial recognition could be used to identify and exclude the store employees from the customer data set.
As with every emerging tech, there are plenty of technical terms that might cause confusion or be thought of as synonyms when it comes to computer vision. There's classification, detection, tracking, counting, and more. However, one of the biggest confusion points involves object detection and image classification. At the most basic level, the difference between classification and detection is simple:
Image Classification: Applies a prediction to an image based on an analysis of the contents.
Object Detection: Locates objects within an image.
Classification is a central challenge in computer vision and is typically a prerequisite for object detection. It's where big data is often leveraged — it takes a lot of information to train a general model that's capable of recognizing a wide range of objects. Custom models also typically require many positive and negative examples. There are other factors that come into play too, such as object size in images, picture quality, and so on.
Putting objects in specific categories is, of course, a useful element of detection. In fact, once you get past the threshold of recognizing that objects exist, classifying them becomes the next major step in many applications.
The process is also useful in and of itself. It could help automate the annotation of database items, for example. However, because strict classification would only provide a probability table or set of tables describing the likelihood that a representative of a class of objects was present, it alone would not facilitate detection, tracking, localizing, or other, similar tasks.
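One way to see the difference is to compare the shape of the two outputs. The sketch below uses made-up values and hypothetical helper names (`top_class`, `locations_of`); real tools will structure their results differently. A classification result is a probability table for the whole image, while a detection result attaches a location to each object.

```python
# Classification: one probability table for the entire image.
classification_result = {"person": 0.91, "mannequin": 0.06, "background": 0.03}

# Detection: one entry per object, each with a class, a score, and a box
# given as (x_min, y_min, x_max, y_max).
detection_result = [
    {"label": "person", "score": 0.91, "box": (40, 30, 120, 260)},
    {"label": "person", "score": 0.77, "box": (310, 45, 380, 250)},
    {"label": "mannequin", "score": 0.64, "box": (500, 40, 580, 270)},
]


def top_class(table):
    """Return the most likely class from a classification table."""
    return max(table, key=table.get)


def locations_of(detections, label):
    """Return the boxes of every detected object with the given label."""
    return [d["box"] for d in detections if d["label"] == label]


print(top_class(classification_result))          # says *what* is in the image
print(locations_of(detection_result, "person"))  # says *where* each person is
```

The classification table can say that a person is probably present, but only the detection output could drive tracking or localization.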
The bottom line is that classifying objects is a necessary prerequisite to detecting people in your images and videos. After all, to CV tools, humans are a class of visual objects. Exceptions might be quick-scan tools, such as YOLO, that make predictions about the presence of objects regardless of class as their first step. In a single-class model in this type of system, classification and detection would essentially happen at the same time. In any case, even though the process of detecting objects almost always relies on classification, not all classification tasks are necessarily object detection tasks.
Tracking objects is a useful advanced application of object detection. After a person is detected, you may want to follow the subject through a shopping pipeline in a retail setting or track and collate behavior over multiple different detection input streams, for example.
Imagine that you were a retailer with a sales goal for a specific product. You could analyze the effectiveness of your associates in communicating your promotion by tracking post-interaction customer behavior. If a customer is greeted upon arrival with a message about the promotion, does that customer proceed to the merchandise immediately or later in their shopping experience? Object tracking across an array of store cameras can help you understand the effectiveness of the communication tactics.
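As a rough illustration of how tracking can build on detection, the sketch below links person boxes from one frame to the next by overlap (intersection over union, or IoU). This is a deliberately simple, greedy approach chosen for the example, not the method used by any particular product; the `IouTracker` class, the 0.3 overlap cutoff, and the sample boxes are assumptions.

```python
from itertools import count


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class IouTracker:
    """Greedy frame-to-frame tracker: same ID if boxes overlap enough."""

    def __init__(self, min_iou=0.3):
        self.min_iou = min_iou
        self.tracks = {}        # track_id -> box seen in the previous frame
        self._ids = count(1)    # source of new track IDs

    def update(self, boxes):
        """Assign a track ID to each box detected in the current frame."""
        assigned = {}
        unmatched = dict(self.tracks)
        for box in boxes:
            best_id, best_score = None, self.min_iou
            for track_id, previous in unmatched.items():
                score = iou(box, previous)
                if score >= best_score:
                    best_id, best_score = track_id, score
            if best_id is None:
                best_id = next(self._ids)   # nobody matched: new person
            else:
                del unmatched[best_id]      # each track is used at most once
            assigned[best_id] = box
        # Tracks with no match this frame are dropped immediately.
        self.tracks = assigned
        return assigned


tracker = IouTracker()
frame1 = tracker.update([(0, 0, 10, 10), (100, 100, 110, 110)])
frame2 = tracker.update([(1, 0, 11, 10), (200, 200, 210, 210)])
print(frame1)
print(frame2)
```

In the second frame, the first box keeps its ID because it barely moved, while the far-away box is treated as a new person. A production tracker would also need to handle occlusion and re-identification across cameras, which this sketch ignores.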
Another way you could refine your application to further handle detected objects would be by counting them. This becomes a complex issue if you have certain types of variables. For example, large crowds entering and exiting festival grounds via multiple access points could pose a challenge when deciding how to commit security resources. By using stationary CV cameras or drones with embedded counting functions, security companies could get accurate, updated population count and density data to inform personnel decisions.
Object counting typically requires that you classify the objects. However, you would not necessarily need to return their positions, except to cross-check that no two position values are identical. In that sense, object counting is more of a subset of detection.
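A counting function along these lines might look like the following sketch. It assumes detections arrive as `(label, score, box)` tuples, a format invented for this example, and it uses positions only to discard exact duplicates, as described above. The function name and threshold are hypothetical.

```python
def count_objects(detections, label="person", threshold=0.5):
    """Count confident detections of one class, ignoring duplicate boxes."""
    seen_boxes = set()
    for det_label, score, box in detections:
        if det_label == label and score >= threshold:
            # Identical positions are treated as the same object.
            seen_boxes.add(box)
    return len(seen_boxes)


# Made-up detections for a single frame.
detections = [
    ("person", 0.90, (10, 10, 50, 120)),
    ("person", 0.90, (10, 10, 50, 120)),    # exact duplicate of the first
    ("person", 0.80, (200, 20, 240, 130)),
    ("person", 0.30, (400, 25, 440, 140)),  # below the threshold
    ("cart", 0.95, (300, 90, 380, 160)),    # different class
]

print(count_objects(detections))  # 2
```

Real detectors often emit near-duplicate boxes rather than identical ones, so a practical version would probably compare boxes by overlap instead of exact equality.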
Of course, real-world use cases of bringing artificial intelligence and intelligent sight to embedded devices demand more than a single core computer vision service. You’ll likely want to use the various functions in such a way that they work together and make your application or device smarter. However, there are any number of things to consider when combining the various computer vision functions into your application on a device at the edge.
It all sounds good in theory, and it can work in practice, too. Here's a hypothetical example that uses all three of these computer vision techniques.
The problem: You want to detect humans as they enter a store and determine the likelihood of them interacting with a specific fixture as you move it from place to place.
Object Detection: You would write and train a human interaction model to detect your fixture and humans, and to differentiate customers from associates.
Object Tracking: You would implement the tracking function to follow unique customers around the store once they had been detected and record details of interactions they had with the display.
Object Counting: Object detection enables the counting of individual objects in a designated class, such as counting people or counting specific products.
With a combination of computer vision primitives, the retailer could analyze the effectiveness of store displays by counting the number of customers who pause in front of the display, the amount of time spent looking at the product or communications, and ultimately, whether that product is placed in the shopping cart and purchased.
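As a sketch of how those measurements might be derived, the code below takes tracked observations and estimates how long each customer lingered in front of a display. Everything here is an assumption made for illustration: the `(track_id, box)` observation format, the rectangular `display_zone`, the fixed half-second interval between frames, and the rule that a person counts as "at the display" when the center of their box falls inside the zone.

```python
from collections import defaultdict


def center(box):
    """Center point of an (x_min, y_min, x_max, y_max) box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def in_zone(point, zone):
    """True if the point lies inside the rectangular zone."""
    x, y = point
    return zone[0] <= x <= zone[2] and zone[1] <= y <= zone[3]


def dwell_seconds(observations, zone, frame_interval=0.5):
    """Total time each track spent in the zone, in seconds."""
    totals = defaultdict(float)
    for track_id, box in observations:
        if in_zone(center(box), zone):
            totals[track_id] += frame_interval
    return dict(totals)


display_zone = (100, 100, 300, 300)

# One (track_id, box) pair per person per analyzed frame (made-up data).
observations = [
    (1, (120, 120, 180, 280)),  # customer 1 at the display
    (1, (130, 120, 190, 280)),  # still there
    (1, (140, 120, 200, 280)),  # still there
    (2, (400, 100, 460, 260)),  # customer 2 elsewhere in the store
    (2, (220, 110, 280, 270)),  # customer 2 pauses at the display
]

dwell = dwell_seconds(observations, display_zone)
print(dwell)                # seconds at the display per customer
print(len(dwell))           # number of customers who paused there
```

Counting the keys gives the number of customers who paused at the display, and the values give the time each spent there. Linking that to a purchase would need data from outside the vision system.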
IoT embedded devices are already used as either primary or supplementary eyes for everything from driverless cars and traffic monitoring to smart cities and transportation centers. Object detection is useful for applications in various industries: Retail, Security, Agriculture, Healthcare, and Manufacturing.
With object detection catching up with humans in terms of accuracy and speed, now is the time for deep learning computer vision AI to be put in the hands of everyday developers. The tools are there, the processing is affordable, and the value is just waiting to be added to existing architectures.
You might be surprised at the extent to which some of your current cameras and devices can incorporate artificial intelligence, even if they're in resource-constrained, low-power environments without net connectivity or access to cloud computing.