Building a Good Dataset#

Once you define what sort of problems you want to solve with AI, you will need to create a dataset. That is one of the first steps in AI development. Below, you will read the basics of building a good dataset, but let us first define what a dataset is.

What is a Dataset#

A dataset is a collection of data. In the context of the Intel® Geti™ platform, it is a collection of images or videos with additional information in the form of annotations. An annotation is either a marked region in an image together with a name (label) characterizing that region, or one or more tags describing the image as a whole.

In other disciplines of artificial intelligence, datasets can consist of tabular data (think of how spreadsheets collate data) or of language corpora for natural language processing. Here, however, our focus is on datasets comprised of media items for computer vision tasks.

Machine learning (ML), in many cases, leverages data from which ML algorithms learn. Hence, building a good dataset is a crucial part of solving your problem. So, remember to take your time while collecting your data.

Dataset size#

You may wonder how many examples (images or video frames) you need to annotate to get an accurate model. Determining the right size is an important part of building your dataset. However, the answer to this question is not clear-cut: there is no one-size-fits-all formula for how big or small your dataset should be, as it depends on the complexity of the problem you are trying to solve.

Several factors determine how many examples you need to get an accurate machine learning model. One of the most decisive is how hard your problem is: how (dis)similar the categories (classes) you are trying to differentiate, detect, or segment are. For example, in a classification task you need more examples when your classes are similar than when they are very distinct. In detection and segmentation tasks, you need more examples when the objects you are trying to detect or segment look similar to the background or to other objects in the images.

Another factor is the total number of classes you are trying to recognize. In general, having a high number of classes requires a (disproportionately) larger number of examples. Another factor is the quality of your images: for example, are your images clean, or do they contain a lot of noise? Image resolution (the number of pixels in the image) can also play a role.

The Intel® Geti™ platform leverages transfer learning, a method for retraining pre-trained neural networks (trained on very large amounts of data) with your dataset so that you do not have to start training a neural network from scratch. However, you can also start from scratch if you want to.

Transfer learning removes the need to collect thousands of images before you can start building a model. Because the Intel® Geti™ platform starts from neural networks pre-trained on big public datasets (such as ImageNet, COCO, and more), you can see the first results of your model with as few as 30 annotated images. Subsequently, you can build on your model by providing new images.
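The idea behind transfer learning can be illustrated with a deliberately simplified sketch. In the toy example below, a fixed function stands in for a frozen, pre-trained backbone (in practice this would be a deep network trained on a dataset like ImageNet), and only a small classification head is trained on a handful of labeled examples. All names and the synthetic "images" are hypothetical, for illustration only.

```python
import math
import random

random.seed(0)

# Stand-in for a frozen, pre-trained feature extractor. In practice this
# would be a CNN backbone; here it maps a raw "image" (a list of numbers)
# to a two-dimensional feature vector: its mean and its value spread.
def pretrained_features(x):
    return [sum(x) / len(x), max(x) - min(x)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Train only a small linear head on top of the frozen features,
# using plain stochastic gradient descent on the logistic loss.
def train_head(samples, labels, lr=0.5, epochs=200):
    w, b = [0.0, 0.0], 0.0
    feats = [pretrained_features(x) for x in samples]
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(w[0] * f[0] + w[1] * f[1] + b)
            err = p - y  # gradient of the log-loss w.r.t. the logit
            w[0] -= lr * err * f[0]
            w[1] -= lr * err * f[1]
            b -= lr * err
    return w, b

def predict(w, b, x):
    f = pretrained_features(x)
    return sigmoid(w[0] * f[0] + w[1] * f[1] + b)

# Only 30 "annotated" examples: low-contrast vs. high-contrast inputs.
class_a = [[random.uniform(0.4, 0.6) for _ in range(16)] for _ in range(15)]
class_b = [[random.uniform(0.0, 1.0) for _ in range(16)] for _ in range(15)]
w, b = train_head(class_a + class_b, [0] * 15 + [1] * 15)
```

Because the feature extractor already does most of the work, the tiny head reaches a usable decision boundary from just 30 examples, which mirrors why the platform can show first results at that dataset size.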

In the Intel® Geti™ platform, you are informed about the quality of the model after every round of annotation and model training. This means that you can decide to stop annotating (certain classes) when the performance of your whole model or of those classes is sufficient. Thanks to that, you will not annotate more data than is necessary. You can always check the balance of annotated classes in the Annotations Statistics tab.
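The class-balance check mentioned above can also be sketched outside the platform. The snippet below counts annotations per label from a hypothetical list of annotation records (the record format is an assumption for illustration, not the platform's export format):

```python
from collections import Counter

# Hypothetical annotation records: one entry per annotated region or tag.
annotations = [
    {"image": "img_001.jpg", "label": "cat"},
    {"image": "img_001.jpg", "label": "dog"},
    {"image": "img_002.jpg", "label": "cat"},
    {"image": "img_003.jpg", "label": "cat"},
]

# Count how many annotations each class has and report its share.
counts = Counter(a["label"] for a in annotations)
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.0%})")
```

A strongly skewed distribution in such a report is a signal to collect and annotate more examples of the under-represented classes.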

Hint

The more overlap (i.e. similar features) between objects, the more data you will need. For example, if you would like to classify cat breeds, you will need a bigger dataset than in the case of classifying cats versus dogs.

Dataset quality#

Since a dataset is the first building block and constitutes the basis of machine learning for teaching models, you need to pay heed to the quality of your dataset. In computer science, there is a famous saying: “garbage in, garbage out”, meaning that poor inputs produce poor results. If you feed your model inaccurate samples, then no matter how good your AI development platform, your AI team, or your data scientists are, the results will be far from desirable.

First off, make sure your dataset contains relevant samples. If you want to build a model discriminating between several cat breeds, make sure you collect images and videos of cats and not dogs. Subsequently, ensure you collect a diverse dataset with unique representations.

Continuing our cat example, you need to gather enough unique representations of the cat breeds you would like to discriminate. When you think of a cat, you may be thinking about your own cat or a neighbor’s cat. However, cats (Felidae) are diverse in terms of their size, fur color, and species. Lions are different from cheetahs, not to mention breeds of domestic cats. Also, be aware that machine learning models can only learn from the data; they will only recognize classes that are represented in the dataset.

When you build your dataset, you also need to think about the background against which the objects of interest are presented. For example, you need images of cats in various settings: in a garden, in a building, on a sofa, in the grass, etc. The lighting and the angles at which the images are taken also play an important role, so provide images of cats in broad daylight, at night, at dusk, and at dawn, from various perspectives. In general, the dataset should reflect the image quality and content that you will encounter when applying the model in practice.

Hint

If you provide images of a cat only against a garden background and train your model on that dataset, the model may classify an empty garden as a cat; in that case, you have taught the model how to recognize a garden, not a cat. In general, make sure that there are no confounding factors in your data through which your machine learning model can “inappropriately” learn how to discriminate classes.

The camera with which the photos were taken also matters when building a dataset. The saturation, color temperature, resolution, and other properties of the images influence the predictive capability of a model. It is recommended to use images of the same quality as those on which the model will make predictions. If you record a video with your smartphone and train your model on images from that smartphone, the model may manifest some unexpected behavior when you run predictions on images from a professional studio camera.
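A quick consistency check on image properties can catch such mismatches early. The sketch below works on hypothetical per-image metadata (in practice you might read width and height with a library such as Pillow) and flags images whose resolution deviates from the dominant one:

```python
from collections import Counter

# Hypothetical per-image metadata; in a real workflow you would read
# these properties from the image files themselves.
images = [
    {"file": "a.jpg", "width": 1920, "height": 1080},
    {"file": "b.jpg", "width": 1920, "height": 1080},
    {"file": "c.jpg", "width": 640, "height": 480},  # a different camera?
]

# Find the most common resolution and list the images that deviate from it.
resolutions = Counter((m["width"], m["height"]) for m in images)
dominant, _ = resolutions.most_common(1)[0]
outliers = [m["file"] for m in images
            if (m["width"], m["height"]) != dominant]
```

The same pattern extends to other properties mentioned above, such as color statistics, by swapping in the relevant metadata fields.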

Dataset Annotation#

Once you have collected your dataset, it is time to annotate it. You can do this by assigning a label to an image, or to a region in the image, through the Intel® Geti™ graphical user interface with our annotation tools. Just as you would show an orange to a child to teach it what an orange is, a machine needs to be told what an object is before it can recognize it automatically.

Since annotation requires manual labor by one or many individuals, the dataset is prone to contain some samples mislabeled by human error. That is why it is good practice to double-check the annotations you or your colleagues created. If you accidentally misannotated a whole batch of oranges and labeled them as tangerines, the model would produce erroneous predictions.

When you work with others on annotating the same dataset, you should establish general guidelines for annotating objects. Without these guidelines, annotators will interpret the task on their own and may use suboptimal annotation methods or completely misunderstand the task. For example, giving a group of people the task of annotating circular bacteria in a Petri dish may lead to various interpretations:

  • using a bounding box to annotate a single bacterium

  • using a circle tool to annotate a single bacterium

  • using a bounding box to annotate a colony of bacteria

  • using a circle tool to annotate a colony of bacteria

There are more possible interpretations than the ones listed above, which are for illustration purposes only.

To maintain homogeneous annotations, you should avoid various individual interpretations of annotating the same dataset. Inconsistent annotations can cause significant problems in trained models such as poor performance, bias, and unexplainable behavior. As a result, a common understanding of the annotation process should be shared across all annotators.

In the case of annotating circular bacteria in the Petri dish, annotators should label each single bacterium they see in the images. They should also use the circle tool, since bacteria are circular, and annotate only the object of interest rather than the redundant pixel area a bounding box would include. A bacterial colony is not representative of a bacterium specimen, and the task is to annotate bacteria, not whole colonies.
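Guidelines like these can also be enforced mechanically. The sketch below validates a hypothetical list of annotation records against an agreed tool per label (both the record format and the field names are assumptions for illustration):

```python
# Agreed annotation guideline: which tool to use for which label.
GUIDELINES = {"bacterium": "circle"}

# Hypothetical annotation records produced by different annotators.
annotations = [
    {"image": "dish_01.png", "label": "bacterium", "tool": "circle"},
    {"image": "dish_02.png", "label": "bacterium", "tool": "bounding_box"},
]

def violations(records, guidelines):
    """Return the records whose tool deviates from the agreed guideline.

    Labels without a guideline entry are not flagged.
    """
    return [r for r in records
            if guidelines.get(r["label"]) not in (None, r["tool"])]

bad = violations(annotations, GUIDELINES)
```

Running such a check after each annotation round gives annotators fast feedback before inconsistencies spread through the dataset.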

When labelling, it is also worth splitting the annotation workload efficiently. Some datasets require specialized knowledge to annotate, and you can divide the dataset based on the level of expertise required to annotate the media items within it.

Make sure that you outline objects precisely in segmentation and detection tasks. If annotations are imprecise or inconsistent, the machine learning algorithm may struggle to learn an accurate model.

The annotation process consumes most of the dataset preparation time. However, the Intel® Geti™ platform reduces this time by actively suggesting for annotation the images that will yield the best predictive results.