SEMI-AUTOMATED LABELING OF VIDEO USING ACTIVE LEARNING FOR OBJECT DETECTION

By Roberto Muntaner Whitley

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical and Computer Engineering – Master of Science

2023

ABSTRACT

Labeling video sequences is a critical task that is required for a wide range of supervised learning applications. In general, manually labeling videos is an extremely repetitive and time-consuming task. Often, the process is sped up by sharing the workload across multiple workers, but this can create other problems, such as varying quality and consistency of labels. Meanwhile, active learning has been proposed for assisting in the labeling of images for classification and object detection tasks. However, minimal prior work is centered around the utility of active learning for video labeling. In this thesis, we attempt to address the gap in prior efforts by proposing a Semi-Automated Labeling of Video (SALV) framework using active learning to support supervised object detection applications. Firstly, we propose a general architecture for the SALV framework that is built on intra-video training and testing. The proposed SALV architecture exploits the fact that labeling video provides a unique opportunity where training and testing can be performed on consecutive frames that contain highly correlated information. Secondly, we incorporate traditional active learning methods that utilize the confidence values produced by detections to select important frames for the next iteration. Thirdly, we propose two strategies for active learning of video labeling: minimal-Distance Iterative Active Learning (min-DIAL) and maximal-Distance Iterative Active Learning (max-DIAL). Lastly, we explore information theory to select frames with the most diversity, using the Jensen-Shannon divergence to calculate the difference between frames based on the location of detections. We analyze the performance of the proposed SALV architecture in terms of the time taken to complete the labeling of the video sequences and present our results using the popular KITTI Tracking dataset. We show that our proposed max-DIAL framework is the most efficient method and can reduce the time taken to label video by a factor of 10.

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Radha, for his belief, patience, and guidance. Without his belief, I would not have had the opportunity to join the Connected and Autonomous Networked Vehicles for Active Safety (CANVAS) group and conduct research in a field that I find intriguing. His patience allowed me to explore unfamiliar areas and generate the knowledge needed to complete my research without any added pressure. His knowledge and guidance were pivotal in helping me achieve my goals, which I will always remember and be extremely grateful for. Without him, I would not be where I am today.

I would also like to thank my committee members, Dr. Morris and Dr. Bopardikar, for their willingness to assist in the completion of my thesis. I value your time, knowledge, and opinions, and appreciate your flexibility during these busy times.

TABLE OF CONTENTS

INTRODUCTION
CHAPTER 1: CURRENT LITERATURE
1.1 Linear Interpolation
1.2 Weakly Supervised
1.3 Efficient Labeling
1.4 Object Detection and Tracking
1.5 Active Learning
CHAPTER 2: SPATIAL-TEMPORAL COHERENCE
2.1 Intra Training and Testing
2.2 Dataset
2.3 Evaluation Metric
2.4 Labeling Times
CHAPTER 3: ACTIVE LEARNING
3.1 Selecting Uncertain Frames
3.2 Comparing Uncertainty Selections
3.3 Comparing Times
CHAPTER 4: MIN-DIAL AND MAX-DIAL
4.1 Min-DIAL
4.2 Max-DIAL
4.3 Comparing Max-DIAL with Traditional Active Learning
CHAPTER 5: JENSEN-SHANNON DIVERGENCE
5.1 Creating Distributions
5.2 Calculating Difference Value
5.3 JSD Results and Comparison
CONCLUSION
BIBLIOGRAPHY

INTRODUCTION

The vast amount of data available has allowed the area of deep learning to advance dramatically over the past decade. In particular, an increase in accessible labeled data has enabled a wide range of supervised deep learning models to become state-of-the-art for many applications and emerging services.
In general, increasing the amount of data used during training improves the performance of the model, provided that the data is relevant and diverse. However, in a supervised learning environment, ground truth labels are required to train the models. Despite recent advancements in developing a variety of software annotation and labeling tools, ground truth labels are generally produced manually by humans. For tasks like object detection, where each object requires its own bounding box, this process is extremely time-consuming.

ImageNet [1] was created at a time when most researchers were heavily focused on designing new machine learning models, while the more pressing issue was the lack of large-scale datasets to train current models. However, after early calculations highlighted an unrealistic timeline to create the enormous dataset, crowdsourcing was explored instead. Amazon Mechanical Turk (AMT) [2], the crowdsourcing marketplace used for creating ImageNet, is a platform that allows businesses to utilize a vast pool of remote workers for a wide range of demanding tasks. Over the course of more than 2 years, using more than 25,000 AMT workers [3], ImageNet was created. However, while the time to create the dataset was reduced, the cost of creating the dataset was increased. ImageNet was only possible with support from several sponsors; therefore, some researchers and smaller businesses may not have sufficient funding to explore crowdsourcing options.

Other factors that need to be accounted for when planning to manually label data are the size and experience of the group of annotators. The larger the group, the larger the variation in the quality of the ground truth labels. The smaller the group, the larger the workload for each individual, which is likely to result in error-prone work due to tedious, repetitive actions. A lack of understanding of advanced areas, which is likely to be extremely common on crowdsourcing websites, can further decrease the quality and consistency of the labels.

In this thesis, we propose a Semi-Automated Labeling of Video (SALV) framework that minimizes the amount of human interaction required during the labeling process. By combining a small subset of manually labeled images with current deep learning applications, the amount of human input can be greatly reduced, while simultaneously improving the accuracy and consistency of the labels. Leveraging the fact that consecutive frames in a video sequence share many similarities, we manually label a small subset of the data that is used to train an object detector. The trained object detector is used to predict objects in the remaining frames of the video. As many of the objects used to train the model will also appear in the unlabeled frames, perhaps closer or at a slightly different angle, the detections are extremely accurate. After testing the object detector on the remaining unlabeled frames, human verification is required to fix any false positives (FPs) or false negatives (FNs). The highly accurate detections greatly reduce the time required for human verifiers to add or remove any bounding boxes.

The main contributions of this work include:

• A Semi-Automated Labeling of Video (SALV) architecture that exploits an intra-video training and testing strategy. The proposed SALV framework is built on the notion that video labeling provides a unique opportunity where training and testing can be accomplished over the same dataset with highly correlated information.
• Minimal-Distance Iterative Active Learning (min-DIAL) and maximal-Distance Iterative Active Learning (max-DIAL) strategies for the SALV framework. They are simple but effective approaches for iteratively selecting new, unlabeled video frames based on their respective locations in the video sequence.

• A Jensen-Shannon Divergence (JSD) metric used to calculate the distance between frames based on their distributions. Each frame is divided into 8 sections and the distribution is based on the number of detections that are present in each section.

• We analyze the performance of the proposed SALV architecture based on the time taken to label video sequences using traditional active learning methods, our proposed min-DIAL and max-DIAL approaches, and the JSD metric. Our analysis is presented using the popular KITTI Tracking dataset [17].

CHAPTER 1: CURRENT LITERATURE

It is well understood that supervised deep learning models require an enormous amount of labeled data for training. However, many researchers are still focusing their attention on building more accurate models using current datasets, rather than producing more efficient ways to label new data. Nevertheless, different directions have been explored to combat the bottleneck caused by high annotation costs.

1.1 Linear Interpolation

Utilizing the spatial-temporal characteristics of a video, [4] estimates object locations between manually labeled keyframes using linear interpolation and homography-preserving techniques. Although this technique could be beneficial for basic videos, for applications like autonomous driving, where vehicles are accelerating and decelerating at random and unpredictable rates, the majority of the predicted bounding boxes are likely to be incorrect.

1.2 Weakly Supervised

More recent work has focused on weakly supervised labeling techniques. Instead of drawing a bounding box around an object, researchers have attempted to create ways to label objects in a much less time-consuming manner. [5] proposes a center-click technique, requiring the human annotator to click where they imagine the center of the bounding box around the object would lie, reducing labeling time by more than 9x. [6] uses class labels to indicate which object categories belong to the image, without any type of localization of the objects present, due to the vast number of image-level annotations available on the internet. However, while these techniques greatly reduce the annotation time, the accuracy of the model is also significantly reduced. [7] only requires human annotators to verify bounding boxes produced automatically during an iterative learning process, reducing annotation time by more than a factor of 6 while performing better than weak supervision. Despite these improvements, fully supervised learning remains the most accurate method for training object detectors.

1.3 Efficient Labeling

The traditional way of drawing bounding boxes, where annotators click and drag a box to enclose an object, is often cognitively demanding and inefficient. Therefore, [8] proposes a more natural way for human annotators to label objects. Instead of drawing a box around the object of interest, annotators are asked to click the 4 extreme points of the object: the top, bottom, left, and rightmost points. This simple but effective difference allows annotators to label boxes 5x faster than regular bounding box annotations while maintaining the same quality.
1.4 Object Detection and Tracking

Although objects may vary in different environments, many features can be learned and transferred to different situations. [9,10] both use a pre-trained object detector and tracker to label all frames in a video. Afterward, the predictions are passed to a human for verification and correction. Although both papers reduce the time required to label their respective datasets, this technique is only possible if similar datasets are available for pretraining. The most similar work to ours is [11], which uses a manually labeled subset to train a model before testing on all remaining frames. However, the dataset used in this paper is a relatively simple indoor dataset, with the majority of the objects being fire extinguishers and chairs recorded on a hand-held device. Unlike our work, they use a single iteration, which restricts the learning of the model as it sees new data.

1.5 Active Learning

Active learning (AL) has become a widely used technique for selecting specific samples from a dataset. As different objects provide different amounts of information to the model, selecting images that contain the most information has proven to be a more efficient method for data labeling. Although the majority of active learning research is based on image classification, recent work has extended this method to the more challenging task of object detection [12,13,14].

The general idea of active learning is to randomly choose a subset of unlabeled data. A human annotator manually labels these samples and uses them to train an object detector, which is then tested on the remaining unlabeled samples. The detections produced are used to formulate an uncertainty measure for each image that can be used to extract the images that the model struggles with the most. These often correspond to the objects that provide the most information to the model, considering they are the most challenging.

As many of the images are likely to contain multiple object instances, there are many ways to derive an uncertainty value for a specific image. [12] experiments with the confidence values produced by the bounding box detections, a natural extension of the techniques used in image classification. Summing, averaging, and taking the minimum confidence value are just a few different ways that confidence scores can be used to measure the uncertainty of an image. However, as the images contain more objects from a range of different classes, these techniques become less effective. [13] and [14] propose more advanced approaches that use adversarial instance classifiers and mixture density networks, respectively, to learn the uncertainty of different images. [15] further extends active learning for object detection into video, using temporal coherence to detect where FPs and FNs may have occurred.

CHAPTER 2: SPATIAL-TEMPORAL COHERENCE

2.1 Intra Training and Testing

Modern cameras are capable of capturing an extremely large number of frames every second, providing a smooth visual experience. However, this produces a high level of similarity and redundancy when manually labeling object instances in consecutive frames. Even when a vehicle is driving at a relatively high speed, the surrounding scene barely changes on a frame-by-frame basis.

In traditional object detection models, the images used during training are required to be different from the images used for testing. This allows the performance of models to be compared fairly when they are tested on images they have never seen before.
However, for applications such as automated labeling, we can approach the problem from a different angle. If, for example, we looked at the 1st and 5th frames of a video sequence, it is highly likely that the majority of the objects in the 1st frame are also present in the 5th frame. Of course, these objects could move closer, further away, or become partially occluded, but the same objects are often present for several consecutive frames. Therefore, as an example, if we trained an object detector using the 1st and 5th frames of a video sequence, we would expect it to perform well when tested on the 2nd, 3rd, and 4th frames. A visual example of this methodology can be seen in Figure 1.

Using this logic, we create initial subsets using a range of different subsampling rates. These initial subsets are used to train an object detector which is then tested on the remaining frames. We decided to use YOLOv5 [16] for this task because of its speed, accuracy, and usability. It is important to note that 100% of the dataset is used each time. If a sampling rate of 10 is used for training, then 1 in every 10 frames (including the first frame) is uniformly added to the training set, with the other 9 being added to the test set. This means that for different sampling rates, the train and test sets will be different sizes. Although this is normally bad practice when comparing models, for this application we are more concerned about minimizing the size of the train set while maintaining a high enough accuracy on the test set. For fairer testing, the higher the sampling frequency, the more epochs were allowed during training to offset the smaller number of training samples.

Figure 1. Initially, all frames are unlabeled (white). We manually label every 5th frame (red), which are used to train our model. The model is then tested, verified, and corrected on all remaining frames (blue).

2.2 Dataset

Although the primary target of our proposed framework is an unlabeled dataset, to show the effectiveness of our method, we must exploit a publicly available dataset that provides ground truth annotations. Although there are many suitable datasets accepted by the broader research community, this thesis focuses on the KITTI Tracking [17] dataset due to its small size and ease of use. It is important to highlight that we used this dataset because it is a collection of video sequences and not a set of isolated images. Moreover, although the KITTI dataset is a well-established benchmark, some bounding boxes need to be added or removed to provide accurate results for our application. These consist of:

• Objects from preceding frames that are still labeled although no longer visible.
• Objects that have become fully occluded but are still labeled although no longer visible.
• Some objects of interest that are not labeled.

As many of the ’Van’, ’Car’, and ’Truck’ classes are very similar, and labeled inconsistently, we combined them into a single ’Vehicle’ class. The ’Pedestrian’ class is the only other class that has a reasonable number of instances for consideration. However, many of the labels contain only part of a pedestrian or multiple pedestrians within the same bounding box. The ’Tram’, ’Person Sitting’, ’Misc’, and ’Cyclist’ classes were also removed due to insufficient object instances; therefore, the ’Vehicle’ class is the only class considered in this thesis.
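To make the intra-video split described in Section 2.1 concrete, the following Python sketch partitions the frame indices of one sequence for a given subsampling rate. The function and variable names are illustrative assumptions, not code from our implementation.

def split_by_subsampling(num_frames: int, rate: int):
    """Split frame indices of one video sequence into anchor (train) and test sets.

    Every `rate`-th frame, starting at frame 0, is manually labeled and used for
    training; all remaining frames are labeled automatically and then verified.
    """
    anchors = list(range(0, num_frames, rate))
    test = [i for i in range(num_frames) if i % rate != 0]
    return anchors, test


# Example: a 100-frame sequence with a subsampling rate of 10
anchors, test = split_by_subsampling(100, 10)
print(len(anchors), len(test))  # 10 anchor frames, 90 frames left for the detector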
2.3 Evaluation Metric

Considering that our goal is to minimize the amount of human interaction required to label the dataset, we need to calculate an accurate estimation of the time taken to complete each experiment. The initial labeling consists of drawing bounding boxes from scratch around all the object instances in the uniformly subsampled frames. After training our model and then testing on a specific subset, a human is required to verify each frame to remove any FPs and add any FNs.

After testing our model on the frames of interest, we are provided with values for the precision and recall for that subset. This allows us to calculate the number of false positives and false negatives using Equation 1 and Equation 2, respectively. Our experiments use 6255 frames containing 29,080 vehicle instances, to which the train and test splits will always sum. We used an intersection-over-union (IOU) threshold of 0.6 to calculate the precision and recall values because many of the KITTI labels are inconsistent and insufficiently tight around many of the object instances, as seen in Figure 2. This leads to many extremely accurate detections, often even tighter around the object than the ground truth label, being counted as false positives at higher IOU thresholds.

Figure 2. The bounding boxes for many object instances in the KITTI dataset should be tighter, which would allow more accurate IOU calculations.

False Positives = Instances ∗ (1 − Precision) (1)

False Negatives = Instances ∗ (1 − Recall) (2)

We use the information provided in [18] as a good estimate for the time taken to complete the different tasks. Although they provide both median and mean values for each task, we opted to use the median values. This is because they show that there is a minority of workers who take an unreasonable amount of time, which may be caused by taking breaks or not performing the tasks properly. The median time for drawing a single bounding box is 34.5s, which includes a 9s quality verification check. They also provide a ’Coverage Verification’ time of 7.8s, measuring how long it took annotators to scan an image for all instances. This time will be used for the task of scanning each frame to find any FPs and FNs produced by our detector. We use a modest value of 8s to remove FPs, based on the fact that it is a much simpler task than verifying the quality of a box. Therefore, the total time consists of manually labeling the initial bounding boxes to train our model, scanning the frames that our model was tested on, removing any false positives, and finally drawing any false negatives that were missed, as shown in Equation 3. Note that we have not included the time taken to train the model as we are more interested in minimizing the amount of human interaction.

Total Time = Initial Label + Scan + Add FN + Remove FP (3)

2.4 Labeling Times

We calculate the time taken to label the full dataset by training our model using different subsampling frequencies. The model is then tested on all remaining frames before verifying and correcting detections. We sum the time it would take to draw initial boxes, scan test frames, and add/remove any boxes, as shown in Table 1. When manually labeling the whole dataset (frequency of 1), we are not verifying or removing any boxes. For any of the splits afterward, the more frames that we manually label for training, the less we have to verify and correct, and vice versa.
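As a worked sketch of Equations 1-3, the estimate below combines the timing constants from [18] (34.5 s per drawn box, 7.8 s per scanned frame) with our assumed 8 s per removed false positive. The function name and argument layout are illustrative rather than the exact code used to produce Table 1.

# Timing constants (seconds): 34.5 s to draw and quality-check one box [18],
# 7.8 s to scan a frame for missed or spurious detections [18], and an
# assumed 8 s to remove a false positive.
DRAW_S, SCAN_S, REMOVE_S = 34.5, 7.8, 8.0

def estimated_labeling_hours(train_boxes, test_frames, test_instances,
                             precision, recall):
    """Estimate the total human labeling time in hours, following Equations 1-3.

    train_boxes:      boxes drawn manually for the anchor frames
    test_frames:      frames the detector is tested on (each must be scanned)
    test_instances:   ground-truth object instances in those test frames
    precision/recall: detector performance on the test frames (IOU 0.6)
    """
    false_pos = test_instances * (1 - precision)   # Equation 1
    false_neg = test_instances * (1 - recall)      # Equation 2
    total_s = (train_boxes * DRAW_S                # initial manual labeling
               + test_frames * SCAN_S              # scanning every tested frame
               + false_neg * DRAW_S                # drawing boxes the model missed
               + false_pos * REMOVE_S)             # removing spurious boxes
    return total_s / 3600.0                        # Equation 3, in hours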
While varying the subsampling rate alters the number of frames that need to be manually labeled and verified, it also affects the accuracy of the model. Generally, the more frames used for training, the more accurate the model will be. However, as manual labeling is extremely expensive and time-consuming, we are looking for an optimal subsampling rate that uses the fewest frames while providing a reasonably high accuracy. The total times from Table 1 can be visualized more easily in Figure 3, where the optimal subsampling frequency is 20. Using a subsampling rate of 20 takes an estimated time of 44.99 hours, which is around 16% of the estimated time taken to manually label the whole dataset.

Frequency   Initial   Scan    Add     Remove   Total (hours)
1           278.68    0       0       0        278.68
2           139.41    6.78    1.95    0.65     148.79
5           55.72     10.84   5.58    1.19     73.33
10          28.09     12.2    8.52    2.56     51.37
20          14.26     12.87   14.55   3.31     44.99
30          9.11      13.1    18.59   4.37     45.17
40          7.08      13.21   22      4.41     46.7
50          5.76      13.28   22.38   5        46.42
60          4.61      13.33   30.7    5.02     53.66
70          4.07      13.36   25.82   5.67     48.92
80          3.51      13.38   31.1    5.17     53.16
90          2.88      13.4    36.13   5.82     58.23
100         2.94      13.42   31.16   5.63     53.15
160         1.61      13.47   39.62   6.04     60.74
320         0.72      13.51   40.31   7.41     61.95
640         0.31      13.53   54.84   6.39     75.07
1280        0.03      13.54   98.09   6.2      117.86

Table 1. Time taken to label the full dataset. Different subsampling rates vary the number of frames that need to be manually labeled for training the model.

Figure 3. Total time to label, verify, and correct using different subsampling frequencies.

CHAPTER 3: ACTIVE LEARNING

Although recent work using active learning has produced more complex algorithms for selecting uncertain images, due to the highly accurate detections produced by the initial stage of our model, we explore low-complexity active learning strategies to further improve our framework. After manually labeling the subsampled frames (anchor frames), we train our model and then test on all remaining frames. However, unlike our previous method, we use the confidence values from the outputs to select the more difficult frames. We know that the detections from our model are extremely accurate, allowing us to locate the frames that the model struggles with the most. These frames likely provide our model with the most information and allow us to perform an iterative process that updates our model as it sees more data.

After selecting the data to use for the next iteration, we verify, correct, and add those samples to our training set. We train our model with our updated train set before testing on all remaining frames, with an example for uniformly sampling every 10 frames shown in Figure 4. This iterative active learning process can be repeated multiple times, but it is important to note that each iteration requires retraining of the model.

Figure 4. The anchor frames (red) are manually labeled and used to train our model. All frames between the anchor frames are tested and the most informative frames (yellow) are selected. The selected frames are verified, corrected, and added to the train set. The model is retrained then all remaining frames (blue) are tested, verified, and corrected.
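The overall iterative procedure can be summarized by the following illustrative Python sketch, in which the detector training, detector inference, human verification step, and frame-selection strategy are passed in as placeholder callables. It is a simplified outline of the workflow under our stated assumptions, not our actual implementation.

def salv_label_video(frames, initial_rate, num_iterations,
                     train_detector, run_detector, human_verify, select_frames):
    """High-level SALV loop (sketch). The four callables are stand-ins for the
    detector, the human verification/correction step, and one of the
    frame-selection strategies (confidence-based, min-DIAL, max-DIAL, or JSD)."""
    anchors = list(range(0, len(frames), initial_rate))
    labels = {i: human_verify(frames[i], None) for i in anchors}  # manual labels

    for _ in range(num_iterations):
        model = train_detector([(frames[i], labels[i]) for i in labels])
        unlabeled = [i for i in range(len(frames)) if i not in labels]
        detections = {i: run_detector(model, frames[i]) for i in unlabeled}
        for i in select_frames(sorted(labels), detections):      # pick informative frames
            labels[i] = human_verify(frames[i], detections[i])   # fix FPs/FNs only

    # Final pass: label every remaining frame with the latest model, then verify.
    model = train_detector([(frames[i], labels[i]) for i in labels])
    for i in [i for i in range(len(frames)) if i not in labels]:
        labels[i] = human_verify(frames[i], run_detector(model, frames[i]))
    return labels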
A high-level architecture for our proposed Semi-Automated Labeling of Video (SALV) framework can be seen in Figure 5, where the goal is to minimize the amount of human interaction required to label visual data. Starting with a completely unlabeled dataset, an initial subset is selected to be manually annotated. These annotations provide the building blocks for our model to label selected frames from our iterative active learning process.

3.1 Selecting Uncertain Frames

There are many techniques used to select the most uncertain frames from a dataset. However, we use a simple, more traditional method that utilizes the confidence values of the detections provided by our model. Along with bounding box predictions, object detectors provide confidence scores that express how confident the model is that the bounding box encloses the correct object. In general, high confidence values are given to detections of relatively simple objects that the model has no trouble identifying, whereas lower confidence values are often given to objects that are smaller, further away, or partially occluded. These more challenging objects are what we are interested in, as they provide the model with more information than objects it already detects with ease.

With the majority of frames containing multiple object instances, there are many ways to calculate an overall confidence score for each frame. Averaging the individual confidence scores within a frame introduces no bias regarding the number of detected objects in that particular frame. Calculating the median value for each frame allows outliers to be overlooked and is more likely to select frames with multiple low-confidence detections. Using the lowest confidence value from each frame to represent the overall confidence value allows us to select frames with a challenging object in them but doesn’t consider any of the other objects that are present in that frame. Examples of detections and their confidence values can be seen in Figure 6 and Figure 7.

Figure 5. Architecture of our SALV framework. After initially training our model, uncertain data is selected based on several different metrics.

Figure 6. Frame containing 3 detections with 0.99 confidence. The average, median, and lowest values are all 0.99, so it’s unlikely this particular frame will be selected in any of our active learning algorithms.

Figure 7. Frame containing 5 detections with varying confidence values. The average is 0.742, the median is 0.97, and the lowest value is 0.34. This frame is likely to be selected when using the average or single lowest value as an uncertainty measure, but unlikely when using the median.

3.2 Comparing Uncertainty Selections

To understand the distribution of which frames have the lowest uncertainty at each iteration, it is important to generate histograms that help visualize any patterns during the selection process. A measurement is given to selected frames based on their distance from the anchor frame to its left. An example of the distance calculation can be seen in Figure 8, where the yellow frames would be selected based on their uncertainty value. After selecting the most uncertain frames and calculating their respective distances, we can sum up the number of occurrences of each distance. The examples shown in Figure 9 are for a model that was trained on every 20th frame, meaning 19 possible frames can be selected between each anchor frame. Apart from the average, median, and single lowest confidence values, we also randomly selected frames between each anchor frame for comparison. It is interesting to note that for all of the average, median, and single lowest confidence values, more frames are being selected from around the middle.
This is expected because the anchor frames are used to train the model, so the frames close to the anchor frames are the most similar and contain many of the same object instances.

Figure 8. Each selected frame’s (yellow) distance is measured from the anchor frame (red) to its left. The anchor frames are used to train the model, and the selected frames are chosen based on their average, median, or single lowest confidence values.

Figure 9. Top Left: Uniformly randomly selected frames show the flattest distribution, as expected. Top Right: Lowest single confidence value. Bottom Left: Average confidence value. Bottom Right: Median confidence value.

3.3 Comparing Times

When comparing times, it is important to note that the higher the initial subsampling rate, the more active learning iterations can be performed. When initially training the model with a subsampling rate of 10, we only perform one active learning iteration. This is based on the fact that the detections at this point are extremely accurate and performing another iteration barely improves the model. However, when we initialize with a subsampling rate of 80, we perform 4 iterations. The 4th and final iteration when using a subsampling rate of 80 produces the same number of labeled images as the 1st and final iteration when using a subsampling rate of 10, for a fair comparison.

Regardless of the number of iterations performed for each subsampling rate, we analyze the time taken to label the full dataset at each iteration. Each uncertainty measure is compared against the randomly selected frames. A positive time difference shows that the chosen uncertainty measure performed worse than if we were to randomly select frames instead. The 1st and only iteration for a subsampling rate of 10 can be seen in Table 2, with the same setup outlined in Figure 4. Out of the 3 uncertainty measures applied, selecting the frame with the lowest average confidence value worked the best. However, all 3 of these measures performed worse than randomly selecting frames, although the difference is relatively small.

When training the initial model with a subsampling rate of 20, 1 iteration and 2 iterations of active learning were explored. The higher subsampling rate means we are training the initial model with fewer frames, allowing us to increase the number of iterations. When performing only 1 iteration, the setup is similar to that detailed in Figure 4 and the results can be found in Table 3. However, when performing 2 iterations, the setup is exactly the one shown in Figure 10, with the results expressed in Table 4.

Uncertainty   Initial   Scan    Add     Remove   Total (hours)   Difference
Random        28.09     12.2    6.99    1.72     49              0
Average       28.09     12.2    7.45    1.63     49.37           0.37
Median        28.09     12.2    7.48    1.77     49.54           0.54
Lowest        28.09     12.2    7.56    1.71     49.56           0.56

Table 2. Comparing different uncertainty selections with randomly selected frames for the 1st and only active learning iteration for an initial subsampling rate of 10.

Figure 10. Performing 2 active learning iterations. The arrows beneath the frames signify the groups that the lowest confidence values are considered from. After both iterations, the selected frames are verified, corrected, and then added to the training set for the next iteration. Lastly, all remaining frames (blue) are tested, verified, and corrected.
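For illustration, the sketch below computes the three per-frame uncertainty measures from Section 3.1 and selects the least confident frame between each pair of anchor frames. The data layout (a mapping from frame index to detection confidences) and the function names are assumptions made for the example.

import statistics

def frame_uncertainty(confidences, measure="average"):
    """Lower score = less certain. `confidences` holds the detector's
    per-box confidence values for one frame."""
    if not confidences:
        return 1.0  # no detections: treat as confident so the frame is not selected
    if measure == "average":
        return sum(confidences) / len(confidences)
    if measure == "median":
        return statistics.median(confidences)
    if measure == "lowest":
        return min(confidences)
    raise ValueError(measure)

def select_between_anchors(frame_conf, anchors, measure="average"):
    """Pick the least confident frame between each pair of consecutive anchors.

    frame_conf: dict mapping frame index -> list of detection confidences.
    anchors:    sorted indices of manually labeled frames.
    """
    selected = []
    for lo, hi in zip(anchors, anchors[1:]):
        candidates = [f for f in range(lo + 1, hi) if f in frame_conf]
        if candidates:
            selected.append(min(candidates,
                                key=lambda f: frame_uncertainty(frame_conf[f], measure)))
    return selected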
Uncertainty   Initial   Scan    Add     Remove   Total (hours)   Difference
Random        14.26     12.87   10.72   2.98     40.83           0
Average       14.26     12.87   11.09   2.53     40.75           -0.08
Median        14.26     12.87   10.96   2.64     40.73           -0.1
Lowest        14.26     12.87   11.43   2.59     41.15           0.32

Table 3. Comparing different uncertainty selections with randomly selected frames for the 1st of 2 active learning iterations for an initial subsampling rate of 20.

It is interesting to note that the 1st iteration for a subsampling rate of 20 produces smaller differences when compared to the random selection than the 2nd iteration. This theme continues when we perform 3 iterations with an initial subsampling rate of 40, as shown in Table 5, Table 6, and Table 7. The 1st iteration for all uncertainty measures improves greatly over the random selection. However, as we perform more iterations, we can see the difference from the random selections becoming minimal.

Uncertainty   Initial   Scan    Add     Remove   Total (hours)   Difference
Random        14.26     12.87   8.11    2.19     37.43           0
Average       14.26     12.87   8.86    1.83     37.82           0.39
Median        14.26     12.87   8.48    1.9      37.51           0.08
Lowest        14.26     12.87   8.79    1.9      37.82           0.39

Table 4. Comparing different uncertainty selections with randomly selected frames for the 2nd and final active learning iteration for an initial subsampling rate of 20.

Uncertainty   Initial   Scan    Add     Remove   Total (hours)   Difference
Random        7.08      13.21   18.3    3.27     41.86           0
Average       7.08      13.21   16.3    3.56     40.15           -1.71
Median        7.08      13.21   15.99   3.41     39.69           -2.17
Lowest        7.08      13.21   15.75   3.63     39.67           -2.19

Table 5. Comparing different uncertainty selections with randomly selected frames for the 1st of 3 active learning iterations for an initial subsampling rate of 40.

As the subsampling rate increases, selecting frames based on their confidence values significantly decreases the time taken to label the dataset. However, selecting frames based on their confidence values becomes redundant as more iterations are performed. There is little difference compared to selecting random frames because, as we perform more iterations, the remaining frames become close enough together that they are all similar and provide largely the same information, so selecting specific frames does not benefit the model much.

Uncertainty   Initial   Scan    Add     Remove   Total (hours)   Difference
Random        7.08      13.21   12.4    3.12     35.81           0
Average       7.08      13.21   11.87   2.83     34.99           -0.82
Median        7.08      13.21   11.6    2.74     34.63           -1.18
Lowest        7.08      13.21   12.28   2.63     35.2            -0.61

Table 6. Comparing different uncertainty selections with randomly selected frames for the 2nd of 3 active learning iterations for an initial subsampling rate of 40.

Uncertainty   Initial   Scan    Add     Remove   Total (hours)   Difference
Random        7.08      13.21   9.48    2.12     31.89           0
Average       7.08      13.21   9.64    2.06     31.99           0.1
Median        7.08      13.21   9.3     2.2      31.79           -0.1
Lowest        7.08      13.21   9.88    2.12     32.29           0.4

Table 7. Comparing different uncertainty selections with randomly selected frames for the 3rd and final active learning iteration for an initial subsampling rate of 40.

CHAPTER 4: MIN-DIAL AND MAX-DIAL

Since we are exploiting intra-video training and testing, we developed Minimal-Distance Iterative Active Learning (min-DIAL) and Maximal-Distance Iterative Active Learning (max-DIAL) approaches. These approaches do not utilize any confidence values from detections but merely select frames based on their relative location to the anchor frames used to train our model. These methods are much simpler than calculating the average, median, or lowest confidence value for each frame.
4.1 Min-DIAL

Under min-DIAL, after manually labeling an initial subset using a specific subsampling rate, we train our model before testing on the frames closest to the anchor frames. For example, if we trained our model using the 20th and 40th frames, we would test our model on the frames on either side of the 20th and the 40th frames. This method was chosen because the frames on either side of the anchor frames are the most similar. Therefore, the outputs will be the most accurate, and fewer corrections are likely to be required. At each iteration, we are updating the model as we propagate inwards. The visualization of the min-DIAL method can be seen in Figure 11.

4.2 Max-DIAL

Under max-DIAL, after manually labeling an initial subset using a specific subsampling rate, we train our model before testing on the next subsampling rate down. For example, if we trained our model using every 80th frame (including the 0th frame), we would then test on every 40th frame that wasn’t included in the training set. This method was chosen based on the fact that the unlabeled frames that sit centrally between two anchor frames are likely to be the most uncertain from that particular group, because they are the furthest distance away from any frames used to train the model. After manually verifying and correcting any incorrect detections, we are left with the ground truth labels for every 40th frame. The selection of frames using the max-DIAL approach can be visualized in Figure 12.

Figure 11. Overview of the min-DIAL method. The trained model is tested on the frames closest to either side of the anchor frames. After each iteration, the frames tested are verified, corrected, and added to the training set for the next iteration.

Figure 12. Overview of the max-DIAL method. The trained model is tested on the frames that sit centrally between the anchor frames. After each iteration, the frames that are tested are verified, corrected, and added to the training set for the next iteration.

To see which method works best, we compare the time taken to label the full dataset using different subsampling frequencies. Table 8 shows the different times using min-DIAL and max-DIAL, and Figure 13 provides a graphical view of the time taken at each subsampling frequency. From Table 8 and Figure 13, we can see that max-DIAL performs the best out of the two proposed methods. In fact, as the subsampling frequency increases, the gap between the times for the two methods increases. A potential reason for this is that, as we increase the initial subsampling frequency, min-DIAL requires more neighboring frames to be tested at each iteration to compensate for the larger number of unlabeled frames between anchor frames.

Frequency          Initial   Scan    Add     Remove   Total (hours)
5                  55.72     10.84   5.58    1.19     73.33
10 + 1xMin-DIAL    28.09     12.19   7.03    1.84     49.15
10 + 1xMax-DIAL    28.09     12.2    6.52    1.6      48.41
20 + 2xMin-DIAL    14.26     12.87   9.03    1.85     38.01
20 + 2xMax-DIAL    14.26     12.87   7.37    1.86     36.36
40 + 3xMin-DIAL    7.08      13.21   11.5    2.3      34.09
40 + 3xMax-DIAL    7.08      13.21   8.3     1.95     30.54
80 + 4xMin-DIAL    3.51      13.38   15.73   2.8      35.42
80 + 4xMax-DIAL    3.51      13.38   8.8     2.01     27.7

Table 8. Time taken to completely label the dataset using different initial subsampling frequencies and different numbers of min-DIAL and max-DIAL iterations.
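The two selection rules can be expressed compactly. The sketch below is an illustrative Python version in which min-DIAL returns the still-unlabeled neighbors of each anchor frame and max-DIAL returns the midpoint of each gap between consecutive anchors; the function names are our own.

def min_dial_selection(anchors, num_frames):
    """min-DIAL: the still-unlabeled frames immediately on either side of each anchor."""
    labeled = set(anchors)
    selected = set()
    for a in anchors:
        for neighbor in (a - 1, a + 1):
            if 0 <= neighbor < num_frames and neighbor not in labeled:
                selected.add(neighbor)
    return sorted(selected)

def max_dial_selection(anchors):
    """max-DIAL: the frame midway between each pair of consecutive anchors,
    i.e. the unlabeled frame furthest from any training frame."""
    return [(lo + hi) // 2 for lo, hi in zip(anchors, anchors[1:]) if hi - lo > 1]

# Example: anchors every 80th frame; max-DIAL yields every 40th frame in between.
anchors = list(range(0, 321, 80))
print(max_dial_selection(anchors))          # [40, 120, 200, 280]
print(min_dial_selection(anchors, 321)[:4]) # [1, 79, 81, 159]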
Moving forward, we will only consider max-DIAL due to its superiority. Table 9 outlines the full range of subsampling frequencies for our max-DIAL method. Note that we also report training times because, as we increase the initial subsampling frequency, which subsequently increases the number of possible active learning iterations, there comes a point where very little human time is saved. By including the training times, we can show that this minimal time saving is outweighed by the extra time taken to train the model for another iteration. The effect of adding the training time can be better seen in Figure 14.

Figure 13. Visual representation of the total time taken to label the full dataset using min-DIAL and max-DIAL with different subsampling frequencies.

Frequency     Initial   Scan    Add    Remove   Total (hours)   Train   Complete (hours)
5             55.72     10.84   5.58   1.19     73.33           3       76.33
10 + 1xAL     28.09     12.2    6.52   1.6      48.41           6       54.41
20 + 2xAL     14.26     12.87   7.37   1.86     36.36           9       45.36
40 + 3xAL     7.08      13.21   8.3    1.95     30.54           12      42.54
80 + 4xAL     3.51      13.38   8.8    2.01     27.7            15      42.7
160 + 5xAL    1.61      13.47   9.09   2.06     26.23           18      44.23
320 + 6xAL    0.72      13.51   9.22   2.07     25.52           21      46.52
640 + 7xAL    0.31      13.53   9.31   2.07     25.22           24      49.22
1280 + 8xAL   0.03      13.54   9.4    2.08     25.05           27      52.05

Table 9. Different initial subsampling frequencies with our max-DIAL method. The total time refers to the amount of human time required to label the dataset. The complete time is the total human time but also accounts for the model's training time.

Figure 14. Max-DIAL method using different initial subsampling frequencies. Total time also considers training time. Although increasing the initial subsampling rate decreases the time taken, after a subsampling rate of 80, the time begins to level out.

4.3 Comparing Max-DIAL with Traditional Active Learning

The results for an initial subsampling rate of 10, with one iteration of active learning, can be seen in Table 10. The max-DIAL results are compared to the randomly selected frames and the most efficient uncertainty measure from the traditional active learning methods. Even when performing 1 iteration, our max-DIAL method decreases the time taken to label the dataset. When applying our max-DIAL method to a subsampled frequency of 20, we continue to see promising improvements over traditional active learning methods. Table 11 and Table 12 outline the times for performing 1 and 2 iterations, respectively, using our max-DIAL method. The 3 iterations for an initial subsampling frequency of 40 can be seen in Tables 13, 14, and 15.

At every iteration, across all subsampling frequencies, max-DIAL outperforms the random selection of frames. It also outperforms or performs equivalently to all other uncertainty measures. Following previous patterns, the initial iterations produce a larger gap in performance over the randomly selected sample. As we increase the number of iterations, the difference becomes smaller as we incorporate more and more frames.

Uncertainty   Initial   Scan   Add    Remove   Total (hours)   Difference
Random        28.09     12.2   6.99   1.72     49              0
Average       28.09     12.2   7.45   1.63     49.37           0.37
Max-DIAL      28.09     12.2   6.52   1.7      48.51           -0.49

Table 10. Comparing random selection and the most efficient traditional active learning technique previously computed (average) with our max-DIAL method for the 1st and only iteration for an initial subsampling rate of 10.
Uncertainty   Initial   Scan    Add     Remove   Total (hours)   Difference
Random        14.26     12.87   10.72   2.98     40.83           0
Median        14.26     12.87   10.96   2.64     40.73           -0.1
Max-DIAL      14.26     12.87   10.13   2.64     39.9            -0.93

Table 11. Comparing random selection and the most efficient traditional active learning technique previously computed (median) with our max-DIAL method for the 1st of 2 iterations for an initial subsampling rate of 20.

Uncertainty   Initial   Scan    Add    Remove   Total (hours)   Difference
Random        14.26     12.87   8.11   2.19     37.43           0
Median        14.26     12.87   8.48   1.9      37.51           0.08
Max-DIAL      14.26     12.87   7.9    1.87     36.9            -0.53

Table 12. Comparing random selection and the most efficient traditional active learning technique previously computed (median) with our max-DIAL method for the 2nd and final iteration for an initial subsampling rate of 20.

Uncertainty   Initial   Scan    Add     Remove   Total (hours)   Difference
Random        7.08      13.21   18.3    3.27     41.86           0
Single        7.08      13.21   15.75   3.63     39.67           -2.19
Max-DIAL      7.08      13.21   16      3.59     39.88           -1.98

Table 13. Comparing random selection and the most efficient traditional active learning technique previously computed (single) with our max-DIAL method for the 1st of 3 iterations for an initial subsampling rate of 40.

Uncertainty   Initial   Scan    Add     Remove   Total (hours)   Difference
Random        7.08      13.21   12.4    3.12     35.81           0
Median        7.08      13.21   11.6    2.74     34.63           -1.18
Max-DIAL      7.08      13.21   11.16   2.68     34.13           -1.68

Table 14. Comparing random selection and the most efficient traditional active learning technique previously computed (median) with our max-DIAL method for the 2nd of 3 iterations for an initial subsampling rate of 40.

Uncertainty   Initial   Scan    Add    Remove   Total (hours)   Difference
Random        7.08      13.21   9.48   2.12     31.89           0
Median        7.08      13.21   9.3    2.2      31.79           -0.1
Max-DIAL      7.08      13.21   8.93   1.96     31.18           -0.71

Table 15. Comparing random selection and the most efficient traditional active learning technique previously computed (median) with our max-DIAL method for the 3rd and final iteration for an initial subsampling rate of 40.

CHAPTER 5: JENSEN-SHANNON DIVERGENCE

After training a model with uniformly sampled anchor frames, more advanced methods can be explored to select uncertain frames to be used in the next iteration. The Jensen-Shannon Divergence (JSD) calculates the distance between two distributions, P and Q, as shown in Equation 4. The JSD distance metric is based on the Kullback-Leibler divergence between 2 distributions, KL(P || Q), but is symmetric, meaning that the order of the distributions is irrelevant. The output is between 0 and 1, with 0 showing no difference between the two distributions.

JSD(P || Q) = 1/2 KL(P || M) + 1/2 KL(Q || M), where M = 1/2 (P + Q) (4)

If we create a distribution for each frame, we can select the frames of interest based on their difference from the anchor frames. The frames with a larger difference from the anchor frames have a higher probability of providing the model with more information. As these frames will contain different objects, or the same objects located in different parts of the frame, adding these frames to the training set will likely improve the model the most.

5.1 Creating Distributions

There are many ways to create a distribution for an image. Creating histograms based on the number of different colored pixel values is one option. However, many of the frames in our video sequences may contain completely different backgrounds but include the same objects. As we are more interested in the objects in the frame, this technique is not beneficial for our application.
Our method for creating histograms is to divide each frame into smaller sections and count the number of detections in each subsection, as shown in Figure 15. Given that the detections are highly accurate using the intra-video training and testing method, the majority of the detections are correct and can be considered actual objects, with the resulting distribution shown in Figure 16. If we divide each frame into only 2 or 3 sections, there is not much difference between the distributions because the locations of the detections have to change significantly. Conversely, if we divide the frame into a large number of sections, almost every pair of frames that is compared, even neighboring frames, receives a high difference value. This is because an object only has to move a small distance to enter another section, meaning the distribution will change almost every frame. For this reason, we used 8 sections, as they are reasonably wide. We use ground truth labels to calculate the distributions for the anchor frames.

Figure 15. A frame divided into 8 sections containing multiple detections. The center point of the bounding box for each detection is marked as a red cross. Each center location is placed in one of the 8 bins based on which section it falls in.

5.2 Calculating Difference Value

When calculating a difference value for a frame, it is important to incorporate the anchor frames on either side. For example, if we wanted to select a frame from between the 10th and 20th frames of a video sequence, we need to calculate each frame’s difference from both the 10th and the 20th frames. It is clear that the 11th frame will share the most similarities with the lower anchor frame (the 10th frame) and be most different from the upper anchor frame (the 20th frame). So, calculating the difference between a frame and only one of the anchor frames doesn’t provide much information. Therefore, we compute the average of the differences between each frame and the lower and upper anchor frames. By averaging the two differences, we are more likely to select frames from around the middle, where the frames are different from both anchor frames. If an anchor frame has no ground truth labels, or a frame between the two anchor frames we are comparing with has no detections, the difference score for that pair of frames is given a value of zero. If all frames between two anchor frames have the same difference value, we select the middle frame. This is based on the success shown using our max-DIAL method.
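To make Sections 5.1 and 5.2 concrete, the following sketch bins detection centers into 8 equal-width sections (one plausible reading of Figure 15), evaluates Equation 4 with base-2 logarithms so the output lies in [0, 1], and averages a candidate frame's divergence from its two surrounding anchor frames. The helper names and the exact binning geometry are assumptions made for illustration.

import math

NUM_BINS = 8  # each frame is split into 8 equal-width sections (assumed vertical strips)

def frame_distribution(box_centers_x, frame_width):
    """Histogram of detection centers over the 8 sections, normalized to sum to 1."""
    counts = [0] * NUM_BINS
    for x in box_centers_x:
        counts[min(int(x / frame_width * NUM_BINS), NUM_BINS - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else None  # None = no detections

def kl(p, q):
    """Kullback-Leibler divergence; only called with q = M, so q has no zero bins where p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Equation 4; log base 2 keeps the result in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def difference_score(candidate, lower_anchor, upper_anchor):
    """Average divergence from the two surrounding anchor frames (Section 5.2).
    A score of zero is returned when any distribution is empty."""
    if None in (candidate, lower_anchor, upper_anchor):
        return 0.0
    return 0.5 * (jsd(candidate, lower_anchor) + jsd(candidate, upper_anchor))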
5.3 JSD Results and Comparison

The results for selecting the frames to use for the next iteration based on our JSD uncertainty measure, compared with all previous active learning methods for a subsampled rate of 10, can be seen in Table 16. Although our JSD approach takes slightly less time than the traditional methods, it doesn’t perform as well as our max-DIAL approach.

Figure 16. The distribution of the center points for each detection from Figure 15.

Extending our JSD method to a subsampling rate of 20, we can see the results for 1 and 2 iterations in Tables 17 and 18. Our JSD method now performs slightly worse than the traditional active learning methods. However, our max-DIAL approach still performs the best out of all approaches. Lastly, we use our JSD uncertainty measure to select frames from the 3 iterations performed using a subsampling rate of 40. The results are shown in Tables 19, 20, and 21, respectively. The JSD method appears to perform similarly to traditional active learning methods. The difference is often minimal, showing that the JSD method isn’t a very effective method for selecting uncertain frames. Again, however, our max-DIAL method performs substantially better than all other active learning methods.

Uncertainty   Initial   Scan   Add    Remove   Total (hours)   Difference
Random        28.09     12.2   6.99   1.72     49              0
Average       28.09     12.2   7.45   1.63     49.37           0.37
Median        28.09     12.2   7.48   1.77     49.54           0.54
Single        28.09     12.2   7.56   1.71     49.56           0.56
Max-DIAL      28.09     12.2   6.52   1.7      48.51           -0.49
JSD           28.09     12.2   6.97   1.79     49.05           0.05

Table 16. Comparison of JSD with all other active learning methods for the 1st and only iteration for an initial subsampling rate of 10.

Uncertainty   Initial   Scan    Add     Remove   Total (hours)   Difference
Random        14.26     12.87   10.72   2.98     40.83           0
Average       14.26     12.87   11.09   2.53     40.75           -0.08
Median        14.26     12.87   10.96   2.64     40.73           -0.1
Single        14.26     12.87   11.43   2.59     41.15           0.32
Max-DIAL      14.26     12.87   10.13   2.64     39.9            -0.93
JSD           14.26     12.87   11.77   2.55     41.45           0.62

Table 17. Comparison of JSD with all other active learning methods for the 1st of 2 iterations for an initial subsampling rate of 20.

Uncertainty   Initial   Scan    Add    Remove   Total (hours)   Difference
Random        14.26     12.87   8.11   2.19     37.43           0
Average       14.26     12.87   8.86   1.83     37.82           0.39
Median        14.26     12.87   8.48   1.9      37.51           0.08
Single        14.26     12.87   8.79   1.9      37.82           0.39
Max-DIAL      14.26     12.87   7.9    1.87     36.9            -0.53
JSD           14.26     12.87   8.85   1.94     37.92           0.49

Table 18. Comparison of JSD with all other active learning methods for the 2nd and final iteration for an initial subsampling rate of 20.

Uncertainty   Initial   Scan    Add     Remove   Total (hours)   Difference
Random        7.08      13.21   18.3    3.27     41.86           0
Average       7.08      13.21   16.3    3.56     40.15           -1.71
Median        7.08      13.21   15.99   3.41     39.69           -2.17
Single        7.08      13.21   15.75   3.63     39.67           -2.19
Max-DIAL      7.08      13.21   16      3.59     39.88           -1.98
JSD           7.08      13.21   16.25   3.67     40.21           -1.65

Table 19. Comparison of JSD with all other active learning methods for the 1st of 3 iterations for an initial subsampling rate of 40.

Uncertainty   Initial   Scan    Add     Remove   Total (hours)   Difference
Random        7.08      13.21   12.4    3.12     35.81           0
Average       7.08      13.21   11.87   2.83     34.99           -0.82
Median        7.08      13.21   11.6    2.74     34.63           -1.18
Single        7.08      13.21   12.28   2.63     35.2            -0.61
Max-DIAL      7.08      13.21   11.16   2.68     34.13           -1.68
JSD           7.08      13.21   12.77   2.8      35.86           0.05

Table 20. Comparison of JSD with all other active learning methods for the 2nd of 3 iterations for an initial subsampling rate of 40.

Uncertainty   Initial   Scan    Add    Remove   Total (hours)   Difference
Random        7.08      13.21   9.48   2.12     31.89           0
Average       7.08      13.21   9.64   2.06     31.99           0.1
Median        7.08      13.21   9.3    2.2      31.79           -0.1
Single        7.08      13.21   9.88   2.12     32.29           0.4
Max-DIAL      7.08      13.21   8.93   1.96     31.18           -0.71
JSD           7.08      13.21   9.32   2.12     31.73           -0.16

Table 21. Comparison of JSD with all other active learning methods for the 3rd and final iteration for an initial subsampling rate of 40.

CONCLUSION

We have introduced a semi-automated video labeling framework that attempts to minimize human interaction time while applying active learning strategies to maximize the accuracy of our automated labeling process. We have shown that applying our proposed SALV framework, which exploits training, testing, and active learning on frames from the same video sequences, produces highly accurate results. This is due to the similarity of the environment between images captured within a small timeframe.
Combining intra-video sequence training and testing with our max-DIAL approach for active learning, we further improved the accuracy of the detections and reduced the time taken to label the full dataset. Our max-DIAL approach outperformed all the traditional active learning methods explored, as well as our proposed JSD approach, allowing us to reduce the labeling time by more than 90% compared to manual labeling.

It is important to note that we did not include the training times of the model in our calculations. This is based on the fact that training does not require any assistance from a human. While the training is running, other tasks for semi-automated labeling can be completed. We are also more interested in reducing the workload on the human; therefore, we are only interested in how much time the human needs to spend on the overall labeling process.

Future work could look at an ablation study that varies the confidence threshold of the detections. In this thesis, we allowed all detections to be present. However, only allowing high-confidence detections would improve the precision, meaning fewer detections need to be removed. Conversely, this may reduce the recall, meaning that more bounding boxes have to be added, which is more time-consuming than removing them. An optimum confidence threshold could potentially reduce the time even further.

BIBLIOGRAPHY

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 248-255, doi: 10.1109/CVPR.2009.5206848.

[2] K. Crowston, "Amazon Mechanical Turk: A Research Tool for Organizations and Information Systems Scholars," IFIP Advances in Information and Communication Technology, vol. 389, 2012. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35142-6_14.

[3] L. Fei-Fei. (2010). ImageNet: Crowdsourcing, benchmarking & other cool things [Online]. Available: https://www.image-net.org/static_files/papers/ImageNet2010.pdf.

[4] T. A. Biresaw, T. Nawaz, J. Ferryman, and A. I. Dell, "ViTBAT: Video tracking and behavior annotation tool," 2016 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Colorado Springs, CO, USA, 2016, pp. 295-301, doi: 10.1109/AVSS.2016.7738055.

[5] D. P. Papadopoulos, J. R. R. Uijlings, F. Keller, and V. Ferrari, "Training Object Class Detectors with Click Supervision," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 180-189, doi: 10.1109/CVPR.2017.27.

[6] D. Li, J.-B. Huang, Y. Li, S. Wang, and M.-H. Yang, "Weakly Supervised Object Localization with Progressive Domain Adaptation," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 3512-3520, doi: 10.1109/CVPR.2016.382.

[7] D. P. Papadopoulos, J. R. R. Uijlings, F. Keller, and V. Ferrari, "We Don't Need No Bounding-Boxes: Training Object Class Detectors Using Only Human Verification," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 854-863, doi: 10.1109/CVPR.2016.99.

[8] D. P. Papadopoulos, J. R. R. Uijlings, F. Keller, and V. Ferrari, "Extreme Clicking for Efficient Object Annotation," 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017, pp. 4940-4949, doi: 10.1109/ICCV.2017.528.
[9] D. Schorkhuber, F. Groh, and M. Gelautz, "Bounding Box Propagation for Semi-automatic Video Annotation of Nighttime Driving Scenes," 2021 12th International Symposium on Image and Signal Processing and Analysis (ISPA), Zagreb, Croatia, 2021, pp. 131-137, doi: 10.1109/ISPA52656.2021.9552141.

[10] B.-L. Wang, C.-T. King, and H.-K. Chu, "A Semi-Automatic Video Labeling Tool for Autonomous Driving Based on Multi-Object Detector and Tracker," 2018 Sixth International Symposium on Computing and Networking (CANDAR), Takayama, Japan, 2018, pp. 201-206, doi: 10.1109/CANDAR.2018.00035.

[11] B. Adhikari, J. Peltomaki, J. Puura, and H. Huttunen, "Faster Bounding Box Annotation for Object Detection in Indoor Scenes," 2018 7th European Workshop on Visual Information Processing (EUVIP), Tampere, Finland, 2018, pp. 1-6, doi: 10.1109/EUVIP.2018.8611732.

[12] C.-A. Brust, C. Käding, and J. Denzler, "Active learning for deep object detection," arXiv preprint arXiv:1809.09875, 2018.

[13] T. Yuan et al., "Multiple Instance Active Learning for Object Detection," 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 5326-5335, doi: 10.1109/CVPR46437.2021.00529.

[14] J. Choi, I. Elezi, H.-J. Lee, C. Farabet, and J. M. Alvarez, "Active Learning for Deep Object Detection via Probabilistic Modeling," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 2021, pp. 10244-10253, doi: 10.1109/ICCV48922.2021.01010.

[15] J. Zolfaghari Bengar et al., "Temporal Coherence for Active Learning in Videos," 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea (South), 2019, pp. 914-923, doi: 10.1109/ICCVW.2019.00120.

[16] G. Jocher et al., YOLOv5, v7.0, doi: 10.5281/zenodo.3908559 [Online]. Available: https://github.com/ultralytics/yolov5.

[17] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 3354-3361, doi: 10.1109/CVPR.2012.6248074.

[18] H. Su, J. Deng, and L. Fei-Fei, "Crowdsourcing Annotations for Visual Object Detection." [Online]. Available: http://vision.stanford.edu/pdf/bbox_submission.pdf.