Active Learning#

If you are wondering why images seem to appear in a random order for you to annotate (compared to the original dataset), that is active learning’s doing. Active learning intelligently chooses the next image to annotate, reducing annotation time and effort while improving the model’s accuracy.

What is it?#

Active learning is an algorithm that intelligently selects images from the pool of unlabeled data for you to annotate. The goal is to reach a high level of accuracy with a small amount of labeled data. This is achieved by annotating the most informative examples, for instance those for which the model’s predictions have low confidence or that are distinct in some respect from other images. After each training phase, the algorithm dynamically picks new examples for you to annotate. By repeating rounds of annotation, training, and active learning, the platform incrementally increases the model’s accuracy while limiting annotation effort.
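The selection idea above can be sketched with a minimal uncertainty-sampling example. This is an illustrative assumption, not the platform’s actual algorithm: it simply ranks unlabeled images by how unconfident the model is about them and suggests the least confident ones first.

```python
def uncertainty_score(confidence: float) -> float:
    # Lower model confidence -> higher score (more informative to annotate).
    return 1.0 - confidence

def select_active_set(predictions: dict[str, float], size: int) -> list[str]:
    """Rank unlabeled images by uncertainty and return the top candidates.

    `predictions` maps an image id to the model's top-class confidence
    (a simplification; real systems may combine several informativeness
    measures, e.g. diversity from already-seen images).
    """
    ranked = sorted(
        predictions,
        key=lambda img: uncertainty_score(predictions[img]),
        reverse=True,
    )
    return ranked[:size]

preds = {"img_a": 0.95, "img_b": 0.40, "img_c": 0.62}
print(select_active_set(preds, 2))  # least-confident images come first
```

With these example confidences, `img_b` (0.40) and `img_c` (0.62) are suggested before the already well-understood `img_a` (0.95).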

If you start annotating a large dataset without active learning enabled, you risk introducing bias into the labeled data. Sequential (non-active) labeling of your dataset may lead to overrepresentation of image categories that have many examples, compared to other categories. That, in turn, makes it harder to train a model for categories with (comparatively) few examples. In other words, your model’s accuracy will be low for these “minority” categories, and it will not solve real-world problems.

What happens behind the scenes?#

The Intel® Geti™ platform’s backend implements two complementary use cases for active learning:

  • ActiveSetRetrieval: invoked synchronously by an HTTP GET request, it lets the user fetch an active set for a given project or task.

  • ActiveScoresUpdate: triggered asynchronously by a Kafka event, it recomputes the active scores of unannotated media based on the artifacts of a batch inference (predictions and metadata).
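The two use cases can be sketched as operations on a shared score store. All names here are hypothetical stand-ins, not the platform’s actual API: in the real backend, ActiveScoresUpdate consumes richer prediction metadata from a Kafka event, while ActiveSetRetrieval serves an HTTP GET.

```python
from dataclasses import dataclass, field

@dataclass
class ActiveScoreStore:
    # media_id -> active score; higher means more informative to annotate.
    scores: dict[str, float] = field(default_factory=dict)

    def update_from_inference(self, artifacts: dict[str, float]) -> None:
        """ActiveScoresUpdate (sketch): recompute scores after batch inference.

        `artifacts` maps media_id -> top prediction confidence; here the
        active score is simply the model's uncertainty (an assumption).
        """
        for media_id, confidence in artifacts.items():
            self.scores[media_id] = 1.0 - confidence

    def get_active_set(self, size: int) -> list[str]:
        """ActiveSetRetrieval (sketch): return the highest-scoring media."""
        return sorted(self.scores, key=self.scores.get, reverse=True)[:size]

store = ActiveScoreStore()
store.update_from_inference({"m1": 0.9, "m2": 0.3, "m3": 0.7})
print(store.get_active_set(2))  # the two most uncertain media ids
```

Separating the asynchronous score update from the synchronous retrieval means the GET request only has to sort precomputed scores, keeping it fast regardless of how expensive inference is.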

Together they form the core of the active learning workflow, which is illustrated in the following picture:

(Figure: the active learning workflow)
  1. The user annotates part of the dataset.

  2. The annotated data is used to train a model.

  3. The model is used to produce inference artifacts on the training (seen) dataset.

  4. The model is used to produce inference artifacts on the unannotated (unseen) dataset.

  5. The active learning algorithm processes these artifacts and ranks the unannotated media.

  6. The items are sorted by active score.

  7. An active set containing the highest-ranked items is suggested to the user.
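The steps above form a loop, which can be sketched end to end. Every callable here is a hypothetical stand-in for a platform service (training, inference, and the user’s annotation), and the ranking rule is the same illustrative uncertainty assumption as before.

```python
def active_learning_loop(annotate, train, infer, rounds: int, batch: int, pool: set):
    """Sketch of the annotate -> train -> infer -> rank -> suggest cycle.

    Hypothetical interfaces (assumptions, not the platform's API):
      annotate(ids)      -> {media_id: label} for the suggested items
      train(labeled)     -> a trained model object
      infer(model, ids)  -> {media_id: top prediction confidence}
    """
    labeled: dict = {}
    for _ in range(rounds):
        # Steps 1-2: train on whatever has been annotated so far.
        model = train(labeled) if labeled else None
        # Steps 3-4: produce inference artifacts on the unseen pool;
        # with no model yet, fall back to a neutral score.
        confidences = infer(model, pool) if model else {i: 0.5 for i in pool}
        # Steps 5-6: rank unannotated media by active score (uncertainty).
        suggested = sorted(pool, key=lambda i: 1 - confidences[i], reverse=True)[:batch]
        # Step 7: the user annotates the suggested active set.
        labeled.update(annotate(suggested))
        pool -= set(suggested)
    return labeled
```

Each round shrinks the unannotated pool by one active set, so the model’s view of the data improves fastest where it was least certain.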