Editor’s note: This is the first in a series of three posts outlining the findings of research our in-house computer vision team conducted on the accuracy of popular open-source object detection models for detecting vehicles, as measured by pixel-level accuracy.
Why we conducted this research
Autonomous driving technology requires accurate detection of traffic participants and objects in video images. In its most basic form, such a detection system takes an image of a traffic scene as input and returns the location and class of each traffic participant in the form of bounding boxes and estimated class probabilities.
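To make that concrete, here is a minimal, framework-agnostic sketch of what such an output might look like for a single image; the field names and values are our own illustration, not any particular framework’s API:

```python
# Hypothetical output of a vehicle detector for one image: each entry pairs
# a bounding box in pixel coordinates (x_min, y_min, x_max, y_max) with a
# class label and an estimated class probability.
detections = [
    {"box": (421, 310, 688, 502), "label": "car", "score": 0.97},
    {"box": (1034, 295, 1189, 401), "label": "truck", "score": 0.88},
]
```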
In recent years, computer vision research in object detection has been dominated by deep learning, which has led to the public release of numerous detection frameworks and network architectures. We decided to take a closer look at the accuracy and robustness of some of the more popular deep learning models when applied to a vehicle detection task.
What we set out to learn
In conducting this research, we wanted to learn about:
- The localization accuracy of some of the most popular off-the-shelf object detection systems.
- How image degradation, such as converting to grayscale and adding different types of image noise, affects the performance of the models.
In this post, we explain the scope of our research, for which we generated a highly accurate ground truth dataset for vehicle detection. In part two of this series, we share what we found when we compared the localization accuracy of five state-of-the-art deep learning models across a wide range of IoU values. In part three, we evaluate the models’ localization accuracy based on the pixel deviation between the detections and the ground truth boxes, and describe what we discovered when we tested the models’ robustness against image noise and conversion to grayscale.
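As a rough illustration of the degradations mentioned above, the sketch below converts an image to grayscale and adds Gaussian noise using OpenCV and NumPy; the helper names and noise level are our own choices, not the exact parameters used in the study.

```python
import cv2
import numpy as np

def to_grayscale_3ch(image_bgr: np.ndarray) -> np.ndarray:
    """Convert a BGR image to grayscale, replicated to three channels
    so it can still be fed to detectors expecting three-channel input."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)

def add_gaussian_noise(image: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    noisy = image.astype(np.float64) + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Example usage (file path is a placeholder):
# img = cv2.imread("frame_0001.png")
# degraded = add_gaussian_noise(to_grayscale_3ch(img), sigma=15.0)
```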
Getting started
The typical way to benchmark detection systems is on public datasets developed with automotive applications in mind. One of the earliest and most widely used datasets is KITTI, which contains 7,500 color images with a total of 80,000 annotated objects. Some more recent datasets include CityScapes, Berkeley DeepDrive, and ApolloScape.
KITTI follows the Pascal VOC benchmarking protocol, in which a true positive (TP) detection must have the correct class label and an Intersection over Union (IoU) with a corresponding ground truth box that exceeds a fixed threshold. Class-specific precision-recall (PR) curves and average precision (AP) values are then computed based on the detector’s real-valued class probabilities. Since the PR curves are computed for relatively low IoU thresholds (0.5 and 0.7), they contain little information about the detector’s ability to localize objects with high accuracy. And, as we all know, high accuracy is often critical in automotive applications.
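To make the evaluation criterion concrete, here is a minimal sketch of the IoU computation between two axis-aligned boxes given as (x_min, y_min, x_max, y_max) in pixel coordinates; under the Pascal VOC protocol, a detection counts as a true positive when this value exceeds the chosen threshold and the class label matches.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a detection shifted 10 pixels from a 100x100 ground truth box
# print(iou((100, 100, 200, 200), (110, 110, 210, 210)))  # ≈ 0.68
```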
For our test set, we selected 219 color images of size 1920×1280 from a set of dash cam recordings of urban and highway scenarios in diverse weather and lighting conditions. The images contained a total of 847 vehicles, which we categorized into 730 passenger cars, 107 trucks, and 10 buses. Sample images from the test set follow.
An in-house team at Mighty AI generated the ground truth annotations. At least two experienced team members reviewed every annotation to ensure that each bounding box was within a single pixel of the object’s true boundary. We excluded images that contained vehicles less than 20 pixels in width and height, as well as images containing vehicles with ambiguous boundaries due to motion blur or defocus.
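For illustration, the size criterion can be expressed as a simple check on the ground truth boxes; the helper name and coordinate convention below are our own, not part of our annotation tooling.

```python
def meets_min_size(box, min_side_px=20):
    """Return True if the box is at least `min_side_px` pixels in both width and height."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) >= min_side_px and (y_max - y_min) >= min_side_px
```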
Choosing the object detection models to evaluate
Mighty AI evaluated the following five off-the-shelf models for object detection:
- Faster R-CNN NASNet, (TensorFlow Model Zoo: faster_rcnn_nas_coco, 2018_01_28)
- Faster R-CNN ResNeXt 101, FPN (FAIR Detectron: X-101-64x4d-FPN, 2x)
- Faster R-CNN ResNet 101, FPN (FAIR Detectron: R-101-FPN, 1x)
- Mask R-CNN Inception ResNet V2 (TensorFlow Model Zoo: mask_rcnn_inception_resnet_v2_atrous_coco, 2018_01_28)
- SSD, Mobilenet V1 (TensorFlow Model Zoo: ssd_mobilenet_v1_coco, 2017_11_17)
The first three systems are Faster R-CNN detectors with different backbones, specifically the NASNet architecture, ResNeXt, and ResNet in a feature pyramid network. The fourth system is Mask R-CNN with an Inception ResNet V2 backbone, of which we only used the regressed bounding box coordinates. The fifth model, SSD Mobilenet V1, is designed to run in mobile applications.
The detectors were trained on the 80 object classes of the MS COCO dataset. The trained models can be found in TensorFlow’s Detection Model Zoo and in FAIR’s Detectron Model Zoo.
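For readers who want to reproduce the setup, the sketch below shows one common way of running a TensorFlow Model Zoo frozen graph (e.g. ssd_mobilenet_v1_coco) on a single image. The file paths are placeholders, and the tensor names follow the TensorFlow Object Detection API convention rather than anything specific to our pipeline.

```python
import cv2
import numpy as np
import tensorflow.compat.v1 as tf  # TF1-style graph execution

# Load a frozen inference graph downloaded from the TensorFlow Model Zoo
# (path is a placeholder).
graph_def = tf.GraphDef()
with tf.gfile.GFile("ssd_mobilenet_v1_coco_2017_11_17/frozen_inference_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name="")

with tf.Session(graph=graph) as sess:
    # The Object Detection API expects a uint8 RGB batch of shape [1, H, W, 3].
    image = cv2.cvtColor(cv2.imread("frame_0001.png"), cv2.COLOR_BGR2RGB)
    boxes, scores, classes = sess.run(
        ["detection_boxes:0", "detection_scores:0", "detection_classes:0"],
        feed_dict={"image_tensor:0": image[np.newaxis, ...]},
    )
    # Boxes are returned as normalized [y_min, x_min, y_max, x_max] coordinates.
```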
Table: Input resolution and inference times of the evaluated detection systems
Next up: Check out part two of this series to see what we discovered when we compared the localization accuracy of these five state-of-the-art deep learning models across a wide range of IoU values.