How to Detect Pedestrians and Bicyclists in a Cityscape Video
by Eric VanBuhler | Feb 03, 2020 | Object Detection | 11 minute read
Detecting pedestrians and bicyclists in a cityscape scene is a crucial part of autonomous driving applications. Autonomous vehicles need to determine how far away pedestrians and bicyclists are, as well as what their intentions are. A simple way to detect people and bicycles is to use Object Detection. However, in this case we need much more detailed information about the exact locations of the pedestrians and bicyclists than Object Detection can provide, so we’ll use a technique called Semantic Segmentation, in which detections are done pixel-by-pixel, rather than with bounding boxes.
*Note: alwaysAI provides a set of open source pre-trained models in the Model Catalog. The following example uses one of the starter models with a simple algorithm to achieve its goal.
In this tutorial, we’ll use the enet computer vision model to segment pedestrians and bicyclists in each frame of a video, and then use the results to perform actions based on the locations of the pedestrians and bicyclists. To keep this application simple, we’ll use the detections to edit the output video, removing the pedestrians and bicyclists from the video. The alwaysAI semantic_segmentation_cityscape starter app runs the enet model on a series of cityscape images, so it will be a great starting point for us.
*Note: The source code for this guide can be found at https://github.com/alwaysai/pedestrian-segmentation
First, pick out your video clip and place it in the app directory. I have a short video clip of pedestrians and bicyclists on a crosswalk. In the app.py file, swap out the series of images for a video clip by removing the image load steps, adding the video stream to our `with` statement, and changing the Streamer parameters back to the defaults.
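Here's a sketch of what the updated setup might look like, assuming the clip is saved as pedestrians.mp4 in the app directory and that we keep the starter app's alwaysai/enet model (the file name and Engine choice are just illustrative):

```python
import edgeiq


def main():
    # Load the enet semantic segmentation model from the alwaysAI Model Catalog
    semantic_segmentation = edgeiq.SemanticSegmentation("alwaysai/enet")
    semantic_segmentation.load(engine=edgeiq.Engine.DNN)

    print("Loaded model:\n{}\n".format(semantic_segmentation.model_id))
    print("Labels:\n{}\n".format(semantic_segmentation.labels))

    # Read frames from a video file instead of a series of images, and use
    # the Streamer with its default parameters
    with edgeiq.FileVideoStream("pedestrians.mp4") as video_stream, \
            edgeiq.Streamer() as streamer:
        # frame-by-frame processing goes here (next steps)
        pass


if __name__ == "__main__":
    main()
```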
Inside our `with` statement, we need to change the `for` loop iterating over images to a `while` loop reading frames of the video. Since we're working with a video now, `frame` makes more sense than `image`, so change all instances of `image` to `frame`.
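Inside the `with` block, the loop could look something like this sketch:

```python
        while video_stream.more():
            frame = video_stream.read()

            # Segment the current frame; results.class_map holds a class
            # index for every pixel
            results = semantic_segmentation.segment_image(frame)

            # Build a color mask and overlay it on the frame
            mask = semantic_segmentation.build_image_mask(results.class_map)
            blended = edgeiq.blend_images(frame, mask, alpha=0.5)

            streamer.send_data(
                blended, "Inference time: {:1.3f} s".format(results.duration))
```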
In order to use the "stop" button on the Streamer, we need to swap `streamer.wait()` for a check of `streamer.check_exit()`.
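At the end of the loop body, a check like this replaces the old `streamer.wait()` call:

```python
            # Exit the loop when the "stop" button is pressed on the Streamer
            if streamer.check_exit():
                break
```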
Next, we can try out the app to see how well it classifies pedestrians and bicyclists. After running `aai app install` and `aai app start`, here's what I get on the Streamer.
We can see that the enet model does a reasonable job of detecting some of the people, but it also appears to incorrectly classify some people and bicycles as motorcycles, so we'll need to take additional steps to correct this issue later on.
Looking at the label list for the model, the labels we are interested in are “Person,” “Rider,” and “Bicycle.” Let’s make a list of the labels we’d like to mask at the beginning of our app.
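For example, near the top of `main()` (using the label names as they appear in the model's Cityscapes label list):

```python
    # Classes we want to keep in the segmentation mask
    labels_to_mask = ['Person', 'Rider', 'Bicycle']
```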
Now we need to select only the labels from that list for our filtering mask. The `class_map` provided in the results has the label index for each pixel. We can create a matrix of labels by mapping the labels to the class map. Next, we'll make a base class map of all zeros (the index for "unlabeled") and add the classes from the list on top of that. Then we'll replace `results.class_map` with the new `filtered_class_map` in the call to `semantic_segmentation.build_image_mask`.
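A sketch of that filtering step, assuming `results.class_map` is a NumPy array of class indices, that index 0 corresponds to "Unlabeled" in this model, and that `import numpy as np` has been added to the top of app.py:

```python
            # Map every pixel's class index to its label name
            label_map = np.array(semantic_segmentation.labels)[results.class_map]

            # Start from an all-zeros ("Unlabeled") class map and copy in only
            # the classes we want to mask
            filtered_class_map = np.zeros(
                results.class_map.shape, dtype=results.class_map.dtype)
            for label in labels_to_mask:
                filtered_class_map[label_map == label] = \
                    results.class_map[label_map == label]

            # Build the color mask from the filtered class map
            mask = semantic_segmentation.build_image_mask(filtered_class_map)
```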
Now when we run our app, we can see that only people, riders, and bicycles are masked.
In order to see the mask colors a little more clearly, we can separate the image and mask so that we can see them side-by-side (rather than seeing the mask superimposed on the image). To do that, we'll simply concatenate the frame and the mask, and send the combined images as one image to the Streamer. We'll also remove the `edgeiq.blend_images` call, as it's no longer needed.
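A sketch of the side-by-side view, assuming the mask comes back at the same resolution as the frame (if it doesn't, resize it with `cv2.resize` first):

```python
            # Put the frame and the mask next to each other in a single image
            side_by_side = np.concatenate((frame, mask), axis=1)
            streamer.send_data(side_by_side, "Frame | Filtered mask")
```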
To get a better idea of what is actually being detected in the video, we can mask out the background, which includes everything that is not a “Person,” “Rider,” or “Bicycle.” First, make a boolean matrix of the labeled part of the class map, and then use that to copy the labeled pixels from the frame to the masked frame.
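Continuing the sketch from the previous steps, that could look like:

```python
            # True wherever the pixel was classified as one of our labels
            detection_map = (filtered_class_map != 0)

            # Start from a black image and copy in only the detected pixels
            masked_frame = np.zeros_like(frame)
            masked_frame[detection_map] = frame[detection_map]
```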
In order to filter out our objects of interest, we'll store the last known pixel value from when the object wasn't detected, and apply that value to the same location when an object is detected. We'll create an array called `non_detection` to store the last known value of each pixel when it was not detected as a pedestrian or bicyclist. For each frame, we'll update `non_detection` with the latest non-detection pixels. Next, we'll generate the output frame starting with the latest frame, replacing the pixels where pedestrians and bicyclists are detected with the stored `non_detection` values.
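A sketch of that background replacement, assuming `non_detection` is initialized to `None` before the `while` loop:

```python
            # Seed the stored background with the first frame
            if non_detection is None:
                non_detection = frame.copy()

            # Remember the latest value of every pixel that is NOT currently
            # detected as a person, rider, or bicycle
            non_detection[~detection_map] = frame[~detection_map]

            # Output frame: current frame, with detected pixels replaced by the
            # last known background value at that location
            output_frame = frame.copy()
            output_frame[detection_map] = non_detection[detection_map]
```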
We noticed earlier that some of the people and bicyclists were incorrectly classified as “Motorcycle.” Let’s add “Motorcycle” to our list of labels to mask and see if the results are more accurate.
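Continuing with the `labels_to_mask` list from the sketch above, the updated list is simply:

```python
    labels_to_mask = ['Person', 'Rider', 'Bicycle', 'Motorcycle']
```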
We can see that now the people and bicyclists are filtered out much better!
Since we're doing batch processing of a video file, there isn't really a need to display everything on the Streamer. We can process each frame and create a new video file for the output. To save the video clip, we'll use the `VideoWriter` class. You can learn more about saving video clips in our documentation. Let's also create a flag so that we can easily enable and disable Streamer processing.
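Here's a sketch of the batch-processing version; the output path, frame rate, and flag name are illustrative values, not fixed requirements:

```python
    # Set to True to also watch the results live while the output file is written
    use_streamer = False

    with edgeiq.FileVideoStream("pedestrians.mp4") as video_stream, \
            edgeiq.VideoWriter(
                output_path="processed_video.avi", fps=30) as video_writer:
        streamer = None
        if use_streamer:
            streamer = edgeiq.Streamer()
            streamer.setup()

        while video_stream.more():
            frame = video_stream.read()

            # ... segmentation and background replacement from the steps above,
            # producing output_frame ...
            output_frame = frame  # placeholder for the processed frame

            # Save the processed frame to the output clip
            video_writer.write_frame(output_frame)

            if streamer is not None:
                streamer.send_data(output_frame, "Output")
                if streamer.check_exit():
                    break

        if streamer is not None:
            streamer.close()
```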
Watching our video, we see that there are times when a person or bicyclist gets captured by `non_detection`, and those pixels are used rather than the background. The issue here could be that, frame to frame, some pixels are alternating between being detected and being undetected. We could add a filter to the mask to smooth out these occasional misdetections between otherwise correct detections.
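One simple possibility (not part of the original app) is a temporal majority filter: keep the detection maps from the last few frames and only treat a pixel as detected when it was detected in most of them. A hypothetical helper might look like:

```python
from collections import deque

import numpy as np


def smooth_detection_map(detection_map, history):
    """Return True only for pixels detected in a majority of recent frames.

    `history` is a deque of previous boolean detection maps, for example
    created before the while loop with `history = deque(maxlen=5)`.
    """
    history.append(detection_map)
    counts = np.sum(np.stack(list(history), axis=0), axis=0)
    return counts > (len(history) // 2)
```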
Additionally, we could cut down on the code size and post-processing by combining our filtered class map and detection handling.
We’ve walked through a simple example of segmenting an image, masking only specific classes, and taking an action based on those masks. This process can be expanded to many use cases, including autonomous driving, defect detection, medical analysis, and many others.
The alwaysAI platform makes it easy to build, test, and deploy computer vision applications such as this pedestrian and bicyclist detector. We can’t wait to see what you build with alwaysAI!