Manual annotation of single frames in video sequences can be a daunting and time-consuming process due to the sheer amount of image data. For example, a 1080p HD video recording at 30 fps generates roughly 180 MB/s of uncompressed image data.
Our in-house computer vision team has been conducting research on technologies to speed up the annotation process in video sequences and to generate more accurate and consistent annotations. In this post, we focus on the labor-intensive task of semantic image segmentation, in which every pixel is assigned a class.
Below is an example of a semantic segmentation in which the classes are color-coded.
Original image
Semantic segmentation with color-coded classes
Optical Flow
Optical flow refers to the 2D vector field that describes the motion between two successive images in a sequence. State-of-the-art systems for estimating optical flow, such as FlowNet2, are fully convolutional deep neural networks that have superseded traditional correlation- and gradient-based methods. To apply optical flow methods successfully, the frame-to-frame image motion needs to be fairly small, within the range of tens of pixels. Image motion is affected by several factors: the motion of the camera sensor, the motion of objects in the scene, the scene geometry, the camera's focal length, and the frame rate. In typical AD/ADAS scenarios that include highway driving, the video frame rate should be at least 25 Hz to compute optical flow reliably; in urban scenarios, where traffic participants move at lower speeds, a frame rate of 15 Hz should be sufficient.
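To get a feel for these numbers, here is a back-of-the-envelope sketch in Python under a simple pinhole-camera model. The focal length, depth, and relative speed are illustrative assumptions, not measurements from our recordings.

```python
def pixel_displacement(focal_px, lateral_speed_mps, depth_m, fps):
    """Approximate frame-to-frame image motion (pixels) of a point under a pinhole model.

    focal_px          : focal length in pixels.
    lateral_speed_mps : speed of the point relative to the camera, perpendicular
                        to the optical axis (m/s).
    depth_m           : distance of the point from the camera (m).
    fps               : video frame rate (Hz).
    """
    return focal_px * lateral_speed_mps / (depth_m * fps)

# Illustrative highway example (assumed values): an oncoming vehicle 30 m away,
# 30 m/s relative speed, 1000 px focal length.
print(pixel_displacement(1000, 30.0, 30.0, 25))  # ~40 px per frame at 25 Hz
print(pixel_displacement(1000, 30.0, 30.0, 10))  # ~100 px per frame at 10 Hz, likely too large
```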
Relevant in the context of image annotation is that optical flow can be used to propagate pixel-level annotations from one frame to the next. This process is also referred to as warping. The propagated annotations can either be used to automatically annotate an image or to check annotations in consecutive images for inconsistencies.
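As an illustration of this warping step, below is a minimal Python/OpenCV sketch; the tooling is our choice for this post, not a description of our production pipeline. It assumes the backward flow from the new frame to the annotated frame is available, for example from FlowNet2, and uses nearest-neighbor sampling so that class labels stay discrete.

```python
import numpy as np
import cv2

def warp_labels(labels_t, flow_t1_to_t, ignore_label=255):
    """Propagate a label map from frame t to frame t+1 by backward warping.

    labels_t     : (H, W) integer class-label map of frame t.
    flow_t1_to_t : (H, W, 2) optical flow from frame t+1 back to frame t
                   (x displacement in channel 0, y displacement in channel 1).
    Returns an (H, W) label map aligned with frame t+1; pixels that sample
    outside the image are set to ignore_label.
    """
    h, w = labels_t.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # For every pixel of frame t+1, look up where it came from in frame t.
    map_x = (grid_x + flow_t1_to_t[..., 0]).astype(np.float32)
    map_y = (grid_y + flow_t1_to_t[..., 1]).astype(np.float32)
    # Nearest-neighbor interpolation avoids mixing class labels.
    return cv2.remap(labels_t, map_x, map_y, interpolation=cv2.INTER_NEAREST,
                     borderMode=cv2.BORDER_CONSTANT, borderValue=ignore_label)
```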
Below is an example of an optical flow field computed by FlowNet2.
Original image from a dense video sequence
Optical flow field generated by FlowNet2
Hue encodes the direction of a flow vector and saturation encodes its magnitude, as shown in the color square below. Note that optical flow captures the shapes of objects in great detail. However, unreliable flow estimates occur in image regions with large motion and in regions that become occluded or disoccluded.
Color coding of optical flow direction and magnitude
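For reference, this hue/saturation encoding can be reproduced with a few lines of OpenCV. This is a sketch of the mapping described above, not the exact visualization code used for the figure.

```python
import numpy as np
import cv2

def flow_to_color(flow):
    """Visualize a flow field: hue = direction, saturation = magnitude."""
    fx = flow[..., 0].astype(np.float32)
    fy = flow[..., 1].astype(np.float32)
    mag, ang = cv2.cartToPolar(fx, fy)            # magnitude and angle (radians)
    hsv = np.zeros(flow.shape[:2] + (3,), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2           # hue: direction (OpenCV hue range is 0-179)
    hsv[..., 1] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)  # saturation: magnitude
    hsv[..., 2] = 255
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```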
Semantic Segmentation
Semantic segmentation algorithms take a color image as input and generate a class label map as output. State-of-the-art methods for semantic segmentation such as DeepLab are fully convolutional networks that have been trained on large sets of annotated images. Below is a semantic segmentation generated by DeepLab. While the majority of the pixels have been correctly labeled, some of the details on the vehicles, traffic signs, and road markings were lost (see regions highlighted by circles).
Original image
Semantic segmentation by DeepLab; highlighted are areas with lack of details
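For readers who want to reproduce this step, a minimal inference sketch using torchvision's pretrained DeepLabV3 looks like the following. This publicly available model is a stand-in for the DeepLab model we used, it is trained on a different class set, and the file name is hypothetical.

```python
import torch
import torchvision
from torchvision import transforms
from PIL import Image

# Pretrained DeepLabV3 from torchvision (a stand-in, not our in-house model).
model = torchvision.models.segmentation.deeplabv3_resnet101(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("frame_0001.png").convert("RGB")    # hypothetical file name
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))["out"]  # (1, num_classes, H, W)
labels = logits.argmax(dim=1).squeeze(0).numpy()            # (H, W) class-label map
```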
Fusion of Optical Flow and Semantic Segmentation
Single image semantic segmentation and optical flow computation complement each other when it comes to segmentation in videos. While optical flow is class agnostic and extracts motion features, semantic segmentation is class specific and trained on static features.
In practice, we can apply fused segmentation to two use cases:
- To generate automated annotations in video sequences
- To check the quality of annotations in consecutive frames for inconsistencies
The following are examples of each use case.
Generating automated annotations in video sequences
The first step in the fusion process is to warp a given segmentation from one image to the next using optical flow. As an example, Mighty AI's community of annotators generated the segmentation of the first frame. The image below shows the result of warping this segmentation onto the second frame.
Result of warping the human segmentation of the first frame onto the second frame with optical flow
The details in the human annotation of vehicles and traffic signs from the first frame are well preserved in the warped result. However, small artifacts remain around the vehicles due to occlusion (see regions highlighted by circles).
In a second step, we fused the warped segmentation and the estimated occlusion map with DeepLab's segmentation of the second frame. In regions labeled as occluded, DeepLab's segmentation is chosen over the warped segmentation. We also implemented a heuristic that maintains the details in the annotations of foreground objects.
Final segmentation result for the second frame generated automatically by fusing the warped segmentation with DeepLab’s segmentation
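The fusion logic itself can be sketched in a few lines. In the sketch below, the occlusion map is estimated with a forward-backward flow consistency check, which is one common approach rather than necessarily the exact method we used, and the foreground-detail heuristic mentioned above is omitted for brevity.

```python
import numpy as np
import cv2

def occlusion_mask(flow_fwd, flow_bwd, tol=1.5):
    """Estimate unreliable pixels of frame t+1 via forward-backward consistency.

    flow_fwd : (H, W, 2) flow from frame t to frame t+1.
    flow_bwd : (H, W, 2) flow from frame t+1 back to frame t.
    Returns a boolean (H, W) mask; True marks pixels where the warped label
    should not be trusted (e.g., disoccluded regions).
    """
    h, w = flow_bwd.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Trace each pixel of frame t+1 back to frame t ...
    src_x = (grid_x + flow_bwd[..., 0]).astype(np.float32)
    src_y = (grid_y + flow_bwd[..., 1]).astype(np.float32)
    # ... and sample the forward flow at that location.
    fwd_at_src = cv2.remap(flow_fwd.astype(np.float32), src_x, src_y,
                           interpolation=cv2.INTER_LINEAR)
    # For consistent, non-occluded pixels the two flows should cancel out.
    residual = np.linalg.norm(flow_bwd + fwd_at_src, axis=-1)
    return residual > tol

def fuse_segmentations(warped_labels, model_labels, occluded, ignore_label=255):
    """Keep warped human labels where reliable; fall back to the model elsewhere."""
    invalid = occluded | (warped_labels == ignore_label)
    fused = warped_labels.copy()
    fused[invalid] = model_labels[invalid]
    return fused
```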
Checking the quality of annotations in video sequences
In this case, we used optical flow and semantic segmentation to check the consistency of human annotations in a video sequence.
Original image
Below is the fused segmentation for our experiment, computed using the human annotations from the previous frame as input. White regions indicate areas that are intentionally unlabeled, such as the ego-vehicle.
Segmentation prediction for the second frame, generated by warping the human annotations with FlowNet2 and fusing them with DeepLab's segmentation
Finally, this is the human annotation of the second frame. Comparing it to the fused segmentation reveals several inconsistencies that indicate possible inaccuracies in the human annotations, highlighted with boxes below. An annotator failed to label parts of the road close to the curb; another labeled the pole of the stop sign as vegetation; and a third neglected to label the stop sign in the background.
Human annotation of the second frame with boxes around possible annotation errors, i.e., areas where machine and human annotation differ
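A simple way to surface such regions automatically is to compute the per-pixel disagreement between the fused prediction and the human annotation and keep only connected regions above a minimum size. The sketch below follows this idea; the area threshold is an illustrative choice, not a value taken from our pipeline.

```python
import numpy as np
import cv2

def flag_inconsistencies(fused_labels, human_labels, ignore_label=255, min_area=200):
    """Return bounding boxes of regions where fused prediction and human annotation differ.

    Small disagreements (e.g., one-pixel boundary jitter) are filtered out by min_area.
    """
    disagree = (fused_labels != human_labels) & (human_labels != ignore_label)
    num, _, stats, _ = cv2.connectedComponentsWithStats(disagree.astype(np.uint8),
                                                        connectivity=8)
    boxes = []
    for i in range(1, num):                    # component 0 is the agreeing background
        x, y, w, h, area = stats[i]
        if area >= min_area:
            boxes.append((x, y, w, h))         # candidate region for manual review
    return boxes
```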
To sum up, we have found that fusing the results of state-of-the-art deep learning models for optical flow and semantic segmentation can greatly improve both the speed and the quality of manual annotation of image sequences. We applied the fusion to two use cases: generating automated annotations in consecutive frames and finding inconsistencies between human annotations and model predictions.
To learn more about how Mighty AI can help you annotate and validate data for your computer vision models, contact us today.