Published at CVPR in 2009, the ImageNet: A Large-Scale Hierarchical Image Database paper was a major step forward in computer vision. Building on the structure of the related WordNet project, the ImageNet authors developed a database of labeled images.
The hierarchical structure comes from the relationship between labels, typically the “is-a” relation, which forms the branches between them. These is-a relationships are organized into a directed acyclic hierarchy. A primary example used in the paper is the path from “mammal” to “husky”:
mammal → placental → carnivore → canine → dog → working dog → husky
As can clearly be seen, each deeper layer in this path is more specific, yet each node is also an instance of every label above it: a husky is-a dog, and a dog is-a mammal.
All images are held at the leaf labels, but this relationship structure makes it easy to collect all images at any node, simply by gathering the images of all of its descendants. Collecting all “carnivore” images, for example, sweeps up every canine, feline, ursine, and so on, leaf beneath it, as the sketch below illustrates.
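As a minimal sketch of how this roll-up works (the `SynsetNode` structure and `collect_images` function here are hypothetical illustrations, not the actual ImageNet API), a recursive walk over the hierarchy gathers every descendant's images:

```python
from dataclasses import dataclass, field

@dataclass
class SynsetNode:
    """Hypothetical node in an is-a hierarchy; images live at the leaves."""
    name: str
    images: list[str] = field(default_factory=list)       # image paths/IDs
    children: list["SynsetNode"] = field(default_factory=list)

def collect_images(node: SynsetNode) -> list[str]:
    """Return every image held at this node or any of its descendants."""
    images = list(node.images)
    for child in node.children:
        images.extend(collect_images(child))
    return images

# Querying an internal node like "carnivore" transparently includes
# every breed-level leaf beneath "canine", "feline", etc.
husky = SynsetNode("husky", images=["husky_001.jpg", "husky_002.jpg"])
canine = SynsetNode("canine", children=[SynsetNode("dog", children=[husky])])
carnivore = SynsetNode("carnivore", children=[canine])
print(collect_images(carnivore))  # ['husky_001.jpg', 'husky_002.jpg']
```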
While some terms may have a large branching factor, others naturally may not. For example, the canine subtree may have a very large number of distinct leaf nodes, each of which contains 500 to 1000 distinct images, allowing for a good spread of diverse data. However, a less naturally diverse subtree like “Bovidae”, the scientific family name that includes cattle, may not have as many distinct nodes. While the ImageNet authors do provide some cherry-picked examples when comparing their hierarchy to other comparable image hierarchy databases, the branching factor is explicitly not provided. This imbalance may be problematic for higher-level labels, skewing a label’s data towards whichever sub-label happens to have many descendants. For example, the carnivore tag may contain dozens of dog leaf nodes, one for each breed, but comparatively few bears, so a “carnivore” collection would be dominated by dogs; a quick census like the one below would expose such skew.
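Reusing the hypothetical `SynsetNode` sketch above, one way to quantify this skew is to count how many images each immediate child contributes to a parent label:

```python
def child_image_counts(node: SynsetNode) -> dict[str, int]:
    """Count the images contributed by each immediate child of a node.
    A heavily skewed distribution (e.g. dogs dwarfing bears under
    "carnivore") signals that the parent label is dominated by one
    sub-label's examples."""
    return {child.name: len(collect_images(child)) for child in node.children}
```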
However, this potential imbalance in branching factor is countered by the diversity within each leaf node’s image collection. The authors target 500 to 1000 images of a disambiguated instance for each synset. To promote diversity of image content, they employ a simple measurement per synset: compute the pixel-level average of every image in the synset, save the result as a lossless JPEG file, and measure the file size. Because the file is compressed, a nearly uniform gray image yields a far smaller file than an image with many sharp details. A small file size therefore means the average image is close to uniform gray, which in turn means the individual images in the synset are visually distinct from one another: their details cancel out in the average.
In simple terms: the averaged image’s file size is inversely related to the synset’s diversity. A smaller file means the synset contains many images with distinct backgrounds, poses, orientations, and so on.
Image diversity is difficult to measure, but this is an effective and, importantly, easy metric to compute.
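A minimal sketch of this measurement follows, under a few stated assumptions: the function name is hypothetical, images are resized to a common shape so they can be averaged, and PNG stands in for the paper’s lossless JPEG since standard JPEG codecs are lossy:

```python
import io
import numpy as np
from PIL import Image

def diversity_proxy(image_paths: list[str],
                    size: tuple[int, int] = (256, 256)) -> int:
    """Approximate the ImageNet diversity measurement: average a synset's
    images pixel-wise, compress the average losslessly, and return the
    file size in bytes. Smaller sizes suggest a more diverse synset,
    since diverse images blur toward uniform gray when averaged."""
    # Resize so every image shares one shape, and accumulate in float64
    # to avoid uint8 overflow while summing.
    stack = np.stack([
        np.asarray(Image.open(p).convert("RGB").resize(size), dtype=np.float64)
        for p in image_paths
    ])
    average = Image.fromarray(stack.mean(axis=0).astype(np.uint8))

    # Compress with a lossless codec (PNG here) and measure the result;
    # near-uniform averages compress far better than detailed ones.
    buffer = io.BytesIO()
    average.save(buffer, format="PNG")
    return buffer.getbuffer().nbytes
```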
Perhaps one of the largest contributions ImageNet has made to modern ML development is the sheer size of the dataset. Data diversity is important, but so is scale. The original paper offers 5,247 synsets and 3.2 million images in total. As of 2024, the ImageNet website boasts 21,841 synsets and over 14 million images. The original 3.2 million images carry a 99.7% label precision, meaning that if an ImageNet image is labeled “husky”, “dog”, or “mammal”, there is a 99.7% chance that label is correct. This precision is impressive, especially because the majority of these labels were provided through Amazon Mechanical Turk: effectively crowdsourced labeling, as opposed to domain-expert labeling.
With a dataset of this size, quality, and diversity, a wide range of computer vision systems can be trained or benchmarked. Many different sub-problems can be explored, including image labeling, bounding-box object detection, image generation, and new problems that have not yet been considered. While this paper, and my literature review article, focus mainly on the development of the dataset itself, the authors and I are excited to see what research will grow out of this resource.