Editor’s note: This is the third in a series of three posts outlining the findings of research our in-house computer vision team conducted on the accuracy of popular open-source object detection models for detecting vehicles, as measured by pixel-level localization accuracy. Before diving in, be sure to check out part one and part two to understand the scope of our experiment.
Localization accuracy and object size
To determine whether localization accuracy is correlated with the size of an object in the image, our team plotted the box size (√(w*h)) of the true positives (TPs) against both Intersection over Union (IoU) and pixel deviation.
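For readers who want to reproduce this analysis, here is a minimal sketch of the quantities involved. It assumes boxes in (x1, y1, x2, y2) pixel coordinates, and it assumes pixel deviation means the mean absolute coordinate offset between a predicted box and its matched ground truth; see part two for our exact definitions.

```python
import numpy as np

def box_size(box):
    """Geometric-mean box size, sqrt(w * h), for a box in (x1, y1, x2, y2) format."""
    w, h = box[2] - box[0], box[3] - box[1]
    return float(np.sqrt(w * h))

def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def pixel_deviation(box_pred, box_true):
    """Assumed definition: mean absolute offset of the four box coordinates, in pixels."""
    return float(np.mean(np.abs(np.asarray(box_pred) - np.asarray(box_true))))
```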
Following is the curve for Mask R-CNN, which shows that larger objects tend to have higher IoU values than smaller objects. At a box size of 130 pixels, the average IoU is 0.9; at 75 pixels it is 0.8; and at 50 pixels it is 0.7.
We see the opposite correlation when using pixel deviation to measure localization accuracy: smaller objects tend to have a smaller pixel deviation than larger objects. At a box size of 80 pixels, the average pixel deviation was 3 pixels; at 110 pixels it was 5 pixels; and at 150 pixels it was 13 pixels. Intuitively this makes sense: Faster R-CNN and Mask R-CNN rescale the features inside every box generated by the region proposal network to a fixed size prior to classification and box refinement, so an error of a fixed fraction of that grid maps back to more pixels for a larger box.
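To make that intuition concrete, the sketch below illustrates the fixed-size rescaling step using torchvision's roi_align. This is an illustration of the mechanism, not our evaluation pipeline, and the feature map and boxes are made up.

```python
import torch
from torchvision.ops import roi_align

# A toy feature map: batch of 1, 256 channels, 64x64 spatial extent.
features = torch.randn(1, 256, 64, 64)

# Two boxes of very different sizes, in (batch_index, x1, y1, x2, y2) format.
boxes = torch.tensor([
    [0.0, 10.0, 10.0, 20.0, 20.0],  # small 10x10 box
    [0.0,  5.0,  5.0, 55.0, 55.0],  # large 50x50 box
])

# Both boxes are resampled to the same fixed 7x7 grid before the downstream
# heads run, so a one-cell error costs more pixels for the larger box.
pooled = roi_align(features, boxes, output_size=(7, 7))
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```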
Performance on gray images and noisy images
In our final set of experiments, we evaluated the models on degraded image data. We generated three new test datasets by converting the images to grayscale, adding salt-and-pepper noise to 5% of the pixels, and adding Gaussian noise with a variance of 0.05. Sample images from all four test datasets follow.
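For reference, here is a minimal sketch of how such degraded test sets can be generated with scikit-image. The file names are hypothetical, and this is an illustration rather than our exact pipeline.

```python
from skimage import io
from skimage.color import rgb2gray
from skimage.util import img_as_ubyte, random_noise

image = io.imread("vehicle.jpg")  # hypothetical input file

# Grayscale test set: drop the color information.
gray = rgb2gray(image)

# Salt-and-pepper test set: flip 5% of the pixels to black or white.
salt_pepper = random_noise(image, mode="s&p", amount=0.05)

# Gaussian test set: zero-mean additive noise with variance 0.05.
# random_noise works on images scaled to [0, 1] and clips the result by default.
gaussian = random_noise(image, mode="gaussian", var=0.05)

io.imsave("vehicle_gray.jpg", img_as_ubyte(gray))
io.imsave("vehicle_sp.jpg", img_as_ubyte(salt_pepper))
io.imsave("vehicle_gaussian.jpg", img_as_ubyte(gaussian))
```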
The precision-recall (PR) curves for the top two performing models follow. At 90% precision, Faster R-CNN NAS achieves 75% recall on the original image set, 72% recall on the gray images, 65% recall on the images with salt and pepper noise, and 55% recall on the images with added Gaussian noise (var = 0.05).
Mask R-CNN degrades more gracefully, achieving 75% recall on the original color images, 72% recall on the gray images, 67% recall on the images with salt-and-pepper noise, and 62% recall on the images with Gaussian noise.
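A simple way to read recall at a fixed precision off a PR curve is sketched below using scikit-learn, with made-up scores. Note that a full detection evaluation also counts unmatched ground-truth boxes as false negatives, which this simplified per-detection view ignores.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, scores, min_precision=0.90):
    """Best recall achievable while keeping precision at or above min_precision."""
    precision, recall, _ = precision_recall_curve(y_true, scores)
    feasible = precision >= min_precision
    return float(recall[feasible].max()) if feasible.any() else 0.0

# y_true: 1 for detections matched to a ground-truth vehicle, 0 for false positives;
# scores: the model's confidence for each detection (illustrative values only).
y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
scores = np.array([0.95, 0.9, 0.85, 0.8, 0.7, 0.65, 0.6, 0.5, 0.4, 0.3])
print(recall_at_precision(y_true, scores))
```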
Following are some examples of Mask R-CNN on images with added Gaussian noise.
Summary
Across our experiments, we uncovered several interesting findings.
The winners: The top-performing models were Mask R-CNN ResNet v2 and Faster R-CNN NAS. For Mask R-CNN, 50% of the TPs had an IoU > 0.9, 80% had an IoU > 0.8, and 90% had an IoU > 0.7. When using pixel deviation to measure localization accuracy, Faster R-CNN NAS performed best, with 25% of TPs having a deviation of < 3 pixels, 50% having a deviation of < 5 pixels, and 90% having a deviation of < 13 pixels.
Object size and localization accuracy: We found that the IoU of the detections increases with the size of the vehicles in the image, while the pixel error favors smaller objects: the pixel deviation grows with object size.
Image quality: Surprisingly, removing color from images only slightly decreased the models’ accuracy. Adding heavy Gaussian image noise, however, led to a significant drop in recall.
Humans are still superior: Human annotations were more accurate, by as much as a factor of three as measured by pixel error. For human annotators, 24% of TPs were within 1 pixel of deviation, 73% were within 3 pixels, 86% were within 5 pixels, and 94% were within 10 pixels.
We hope you enjoyed following along with our experiments. To learn more about how Mighty AI can help your team curate high-quality training datasets for your computer vision models, get in touch today.