How to Process VOC2012 Dataset and Implement IoU, mAP for Object Detection

Sieun Park
8 min read · Oct 16, 2021

We will implement a series of two-stage object detection algorithms, from the initial work on R-CNN to state-of-the-art techniques such as DetectoRS, from scratch using PyTorch. This article is the first post in the series, and it covers the implementation of various components of the object detection pipeline. In particular, we will discuss:

  • How to load and read the VOC2012 object detection dataset annotations.
  • How to visualize bounding boxes in Python + OpenCV.
  • How to implement IoU and mAP to evaluate the performance of object detection methods.

The complete code for the whole series is provided in this Colab notebook. The code presented in this post may omit some details; refer to the notebook for the full working implementation.

VOC2012 Dataset

The Pascal VOC 2012 dataset contains 17,125 images annotated with bounding boxes for 20 object classes. The data is also annotated for semantic segmentation, but we will only consider its use for object detection. The training and validation sets are publicly available at the link below, while the test set is withheld for evaluation on the challenge's server.

http://host.robots.ox.ac.uk/pascal/VOC

While most object detection algorithms train their networks on larger datasets such as MS COCO, and on much larger datasets for self-supervised pretraining, the VOC 2012 dataset is still commonly used as a benchmark for evaluating performance, and its modest size suits this series given the computational limitations of Colab.

Since anyone can download the dataset from the internet, we use urllib to automatically download the .tar file containing the images and annotations of the VOC 2012 dataset, and then extract the archive.
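Below is a minimal sketch of this step. The archive name and URL are, to the best of my knowledge, what the official Pascal VOC server uses, but verify them against the notebook before relying on this:

```python
import urllib.request
import tarfile

# Train/val archive of VOC 2012 on the official Pascal VOC server (assumed URL;
# check http://host.robots.ox.ac.uk/pascal/VOC for the current link).
url = "http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar"

# Download the tar file containing both the images and the .xml annotations.
urllib.request.urlretrieve(url, "VOCtrainval_11-May-2012.tar")

# Extract it; the archive unpacks into VOCdevkit/VOC2012/{JPEGImages, Annotations, ...}.
with tarfile.open("VOCtrainval_11-May-2012.tar") as tar:
    tar.extractall(path=".")
```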

Previous VOC datasets (e.g. VOC 2007) provide separate tar files for training and validation, but the VOC 2012 dataset combines both sets in a single train-val archive, with the split defined by additional metadata provided in the dataset. However, we instead sample 5,000 pairs randomly as our validation set. Since we are not reproducing the results of previous work on the official validation set, there is no need to strictly preserve the original split, and I prefer to have more training data instead. We pick the validation images and move them into a separate folder.
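A rough sketch of such a random split (the destination folder name and the fixed random seed are my own choices, not from the notebook):

```python
import os
import random
import shutil

img_dir = "VOCdevkit/VOC2012/JPEGImages"
val_img_dir = "VOCdevkit/VOC2012/ValImages"   # assumed destination folder
os.makedirs(val_img_dir, exist_ok=True)

# Collect all image IDs and randomly sample 5,000 of them for validation.
image_ids = [f[:-4] for f in os.listdir(img_dir) if f.endswith(".jpg")]
random.seed(0)                                 # fixed seed so the split is reproducible
val_ids = random.sample(image_ids, 5000)

# Move the sampled images into the separate validation folder.
for image_id in val_ids:
    shutil.move(os.path.join(img_dir, image_id + ".jpg"),
                os.path.join(val_img_dir, image_id + ".jpg"))
```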

XML Annotation file

2008_000003.xml

Annotations for each image in the VOC 2012 dataset are stored in .xml files, whose structure is shown in abbreviated form below. Each file contains some metadata about the image and a list of object headers that hold the bounding box coordinates and class label of each object in the image.
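For reference, an abbreviated annotation looks roughly like this (the values below are illustrative, not copied from the actual file):

```xml
<annotation>
  <filename>2008_000003.jpg</filename>
  <size><width>500</width><height>333</height><depth>3</depth></size>
  <object>
    <name>train</name>
    <bndbox>
      <xmin>46</xmin><ymin>11</ymin><xmax>500</xmax><ymax>290</ymax>
    </bndbox>
  </object>
  <!-- one <object> header per annotated object -->
</annotation>
```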

We can easily read and process .xml files in Python using the ElementTree API from the built-in xml library.

We implement a function read_xml that, given the path of an .xml file, reads it and returns a list of bounding boxes. We find all the object headers and loop through them, saving the details of each object in object_list. Before appending, we convert the string class name to an integer key using a dictionary, self.convert_label.
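A sketch of what read_xml could look like, written here as a standalone function; the convert_label dictionary mapping class names to integer keys is assumed and only partially shown:

```python
import xml.etree.ElementTree as ET

# Assumed class-name-to-index mapping (only a few of the 20 classes shown).
convert_label = {"person": 0, "car": 1, "dog": 2, "train": 3}

def read_xml(xml_path):
    """Parse a VOC .xml annotation and return a list of
    [xmin, ymin, xmax, ymax, class_id] entries, one per object."""
    root = ET.parse(xml_path).getroot()
    object_list = []
    for obj in root.findall("object"):                # loop over every object header
        bbox = obj.find("bndbox")
        box = [int(float(bbox.find(tag).text))        # bounding box corners
               for tag in ("xmin", "ymin", "xmax", "ymax")]
        label = convert_label[obj.find("name").text]  # class name -> integer key
        object_list.append(box + [label])
    return object_list
```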

We define a simple torch Dataset for loading the raw images and annotations of the VOC 2012 dataset. We identify all the images in __init__, read images using cv2.imread, and read annotations using the read_xml function discussed above.
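A minimal version of such a Dataset, reusing the read_xml sketch above (directory names are assumptions):

```python
import os
import cv2
from torch.utils.data import Dataset

class VOCDataset(Dataset):
    """Naive dataset that returns the raw image and its list of boxes."""
    def __init__(self, img_dir, ann_dir):
        self.img_dir = img_dir
        self.ann_dir = ann_dir
        # Identify all the images in the folder and keep their IDs.
        self.ids = sorted(f[:-4] for f in os.listdir(img_dir) if f.endswith(".jpg"))

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        image_id = self.ids[idx]
        # Read the image with OpenCV (BGR, HxWx3 uint8 array).
        image = cv2.imread(os.path.join(self.img_dir, image_id + ".jpg"))
        # Read the bounding boxes and class labels from the matching .xml file.
        boxes = read_xml(os.path.join(self.ann_dir, image_id + ".xml"))
        return image, boxes
```

Since images have different sizes and different numbers of boxes, wrapping this dataset in a DataLoader requires a custom collate_fn, which is one more reason the method-specific pipelines in later posts replace this naive version.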

Since each object detection algorithm processes the dataset for training in its own way, we can't use this naive Dataset object for training directly. We will implement the data processing mechanisms of each method in future posts. However, we present this as an example of how to define Dataset and DataLoader objects in PyTorch to load the VOC 2012 dataset.

In case you aren’t used to Dataset and DataLoaders in PyTorch, refer to this tutorial.

Visualizing bounding boxes

Let's now look at how to visualize the object bounding boxes on the image using OpenCV. We first read an image and its annotations from the training set using the following code.
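Since the actual snippet is embedded in the notebook, here is a rough equivalent that reuses the read_xml sketch from above (the example image ID is arbitrary):

```python
import cv2

# Pick one training image and its annotation file (arbitrary example ID).
image_path = "VOCdevkit/VOC2012/JPEGImages/2008_000003.jpg"
xml_path = "VOCdevkit/VOC2012/Annotations/2008_000003.xml"

image = cv2.imread(image_path)    # the image as a BGR numpy array
objects = read_xml(xml_path)      # list of [xmin, ymin, xmax, ymax, class_id]
```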

Now let's draw the loaded bounding boxes and label information using OpenCV. We first configure the color and thickness of the boxes and text we would like to draw. Then, looping through every object described in the .xml annotation file, we use cv2.rectangle and cv2.putText to draw the bounding box and write the class name on the image.
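A sketch of the drawing loop, using the image and objects loaded above; label_names is an assumed list mapping integer keys back to class names:

```python
label_names = ["person", "car", "dog", "train"]   # assumed inverse of convert_label

color = (0, 255, 0)                # green boxes and text
thickness = 2
font = cv2.FONT_HERSHEY_SIMPLEX
font_scale = 0.6

for xmin, ymin, xmax, ymax, label in objects:
    # Draw the bounding box rectangle.
    cv2.rectangle(image, (xmin, ymin), (xmax, ymax), color, thickness)
    # Write the class name just above the top-left corner of the box.
    cv2.putText(image, label_names[label], (xmin, ymin - 5),
                font, font_scale, color, thickness)

cv2.imwrite("visualization.jpg", image)   # or cv2_imshow(image) inside Colab
```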

IoU (Intersection over Union)

IoU is a metric that measures the overlap between two regions, such as a predicted bounding box and its ground truth. It is computed as the area of the intersection of the regions divided by the area of their union. In the case of object detection, each region is the area enclosed by a bounding box. A commonly used figure explaining IoU is shown below.

Illustration of IoU(Intersection over Union)

As the two boxes move closer together, the intersection area grows toward the size of a single box while the union area shrinks toward the size of a single box. Thus, the IoU converges to 1 as the two boxes become closer, and a smaller IoU means the boxes are further apart. The IoU is 0 when the intersection area is 0, i.e. when the two boxes do not overlap at all.

Given the locations of the two bounding boxes, we first find the coordinates of the intersection rectangle (the red circles in the figure). From these we can easily calculate the area of intersection, and the union area follows from basic set theory: union = area1 + area2 - intersection.
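A sketch of the computation for two boxes in (xmin, ymin, xmax, ymax) format:

```python
def iou(box1, box2):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    # Clamp to zero so non-overlapping boxes give zero intersection.
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    # Union = area1 + area2 - intersection.
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / (union + 1e-6)   # small epsilon avoids division by zero
```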


Average Precision (AP)

Average precision is defined as the area under a precision-recall curve, and it is a commonly used metric for evaluating the performance of object detection algorithms. Let's dive into how to implement AP for object detection.

Precision & Recall

A short recap on precision and recall:

  • Precision: What proportion of positive predictions are actually correct?
  • Recall: What proportion of actual positives are detected correctly?
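In terms of true positives (TP), false positives (FP), and false negatives (FN), these are:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)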

Precision and recall can be traded off according to the confidence threshold of the classifier. We can visualize this trade-off as a precision-recall curve.

Average Precision for Object detection

Average precision (AP) is defined as the area under a smoothed precision-recall curve (AUC). More precisely, AP is computed as the weighted sum of the precisions at each threshold, where the weight is the increase in recall. This can be expressed as the formula below, following Jonathan Hui. We can interpret AP as a single value that summarizes the goodness of the precision-recall curve.
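In summation form, the definition reads

AP = Σ_n (R_n − R_{n−1}) · P_n

where P_n and R_n are the precision and recall at the n-th confidence threshold.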

https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173

Practically, we go through the predictions one by one in descending order of confidence, counting the numbers of "correct" and "incorrect" predictions so far. At each step, we calculate the

  • precision (the proportion of positive predictions that are actually correct) as "# correct predictions / (# correct predictions + # incorrect predictions)"
  • recall (the proportion of actual positive objects that have been detected) as "# correct predictions / # objects of that class"

of the current state. These intermediate results are points on the precision-recall curve, and we define the area under this curve as the AP.
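As a small, made-up example: suppose a class has 3 ground-truth objects and the detector outputs 4 predictions that are, in descending order of confidence, correct, incorrect, correct, correct. The running (precision, recall) pairs are then (1.00, 0.33), (0.50, 0.33), (0.67, 0.67), and (0.75, 1.00), and the AP is the area under the curve traced by these points.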

Because object detection involves many classes, we treat each class as a separate problem and evaluate a per-class AP using the precision and recall over every object of that class (excluding the background). mAP (mean Average Precision) is then computed as the average of the per-class APs.

When deciding whether an object has been detected, we check whether the IoU between the predicted and ground-truth bounding boxes exceeds a certain threshold. The IoU threshold commonly used when computing the AP differs between datasets. A fixed threshold of 0.5 is typically used when evaluating on VOC 2012. When evaluating on the COCO dataset, we commonly use the boxAP (AP@[.50 : .05 : .95]) metric, which averages the AP over 10 IoU thresholds from 0.5 to 0.95, spaced by 0.05.

A detailed explanation of how AP is measured for each dataset is provided in Jonathan Hui’s blog. I suggest checking it out if you are not familiar with AP.

Implementing mAP

We implement a function to compute the mAP given a list of predicted bounding boxes, the ground-truth bounding boxes, and an IoU threshold. To compute the mAP, we need the precision and recall of the network at each confidence threshold.

First, we list all the detected objects of a given class in descending order of confidence. We count the number of ground-truth objects of that class, which is needed for computing the recall, and initialize the empty arrays needed for counting.

We then traverse the detected objects and, for each one, find the closest ground-truth object. We count the detection as correct if the IoU between the predicted and ground-truth bounding boxes exceeds the IoU threshold and the corresponding ground-truth object has not already been matched to another, higher-confidence detection.

Precision and recall are calculated exactly as discussed above. Finally, we compute the per-class average precision by using the torch.trapz method to integrate the precision-recall curve.
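A condensed sketch of this function is shown below. It assumes predictions of the form [image_id, class_id, confidence, xmin, ymin, xmax, ymax], ground truths of the form [image_id, class_id, xmin, ymin, xmax, ymax], and the iou helper from earlier; the notebook version differs in its details:

```python
import torch

def mean_average_precision(pred_boxes, true_boxes, iou_threshold=0.5, num_classes=20):
    """pred_boxes: list of [image_id, class_id, confidence, xmin, ymin, xmax, ymax]
       true_boxes: list of [image_id, class_id, xmin, ymin, xmax, ymax]"""
    average_precisions = []

    for c in range(num_classes):
        # Gather all detections and ground truths of this class.
        detections = [d for d in pred_boxes if d[1] == c]
        ground_truths = [g for g in true_boxes if g[1] == c]
        total_true = len(ground_truths)
        if total_true == 0:
            continue

        # Sort detections by descending confidence.
        detections.sort(key=lambda d: d[2], reverse=True)

        # Track which ground-truth boxes have already been matched.
        matched = [False] * total_true
        TP = torch.zeros(len(detections))
        FP = torch.zeros(len(detections))

        for det_idx, det in enumerate(detections):
            best_iou, best_gt = 0.0, -1
            # Find the closest ground-truth box in the same image.
            for gt_idx, gt in enumerate(ground_truths):
                if gt[0] != det[0]:
                    continue
                overlap = iou(det[3:], gt[2:])
                if overlap > best_iou:
                    best_iou, best_gt = overlap, gt_idx

            # Correct only if the IoU is high enough and the ground truth has
            # not already been claimed by a higher-confidence detection.
            if best_iou > iou_threshold and not matched[best_gt]:
                TP[det_idx] = 1
                matched[best_gt] = True
            else:
                FP[det_idx] = 1

        # Cumulative counts give one (precision, recall) point per detection.
        TP_cum = torch.cumsum(TP, dim=0)
        FP_cum = torch.cumsum(FP, dim=0)
        precisions = TP_cum / (TP_cum + FP_cum)
        recalls = TP_cum / total_true

        # Prepend the point (recall 0, precision 1) and integrate the curve.
        precisions = torch.cat((torch.tensor([1.0]), precisions))
        recalls = torch.cat((torch.tensor([0.0]), recalls))
        average_precisions.append(torch.trapz(precisions, recalls))

    # mAP is the mean of the per-class APs.
    return sum(average_precisions) / len(average_precisions)
```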

This implementation is based on the one by Aladdin Persson.

In this post, we described how to handle object detection data by loading annotations of multiple bounding boxes from .xml files and visualizing them. We also implemented IoU and mAP, core metrics used to evaluate object detection algorithms.
