Visio-Linguistic Dataset Curation

How can we efficiently annotate a million-scale Internet image collection?

We start from the public YFCC100M collection, which contains Flickr images uploaded between 2004 and 2014. We downloaded an 8M subset to build CLEAR-10 and a 40M subset to build CLEAR-100.

We use the upload time to recreate the temporal stream and split the 8M/40M images into 11 buckets, each spanning roughly one year. The 0th bucket is reserved for unsupervised pre-training (e.g., MoCo), and we curate a small but high-quality labeled set (with the CLEAR-10 and CLEAR-100 ontologies defined here) for each of the 1st to 10th buckets.
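For concreteness, here is a minimal sketch of the bucketing step, assuming each image carries a Unix upload timestamp; the field layout, the 2004-2014 boundaries, and the equal-width bucketing are illustrative assumptions, not the released split.

```python
from collections import defaultdict
from datetime import datetime, timezone

def bucketize_by_upload_time(images, num_buckets=11,
                             start=datetime(2004, 1, 1, tzinfo=timezone.utc),
                             end=datetime(2014, 12, 31, tzinfo=timezone.utc)):
    """Split (image_id, upload_timestamp) pairs into `num_buckets`
    equal-width temporal buckets (roughly one year each).

    `upload_timestamp` is assumed to be a Unix epoch in seconds;
    the actual CLEAR split may use different boundaries.
    """
    span = (end - start).total_seconds()
    width = span / num_buckets
    buckets = defaultdict(list)
    for image_id, ts in images:
        offset = ts - start.timestamp()
        if offset < 0 or offset > span:
            continue  # outside the 2004-2014 range
        idx = min(int(offset // width), num_buckets - 1)
        buckets[idx].append(image_id)
    return buckets

# Bucket 0 is reserved for unsupervised pre-training;
# buckets 1-10 are later annotated.
# buckets = bucketize_by_upload_time(yfcc_subset)
```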

To avoid excessive human annotation cost on web-scale data, we use the visio-linguistic dataset curation approach proposed in our NeurIPS'21 paper. The key idea is to leverage OpenAI's recent CLIP model and prompt engineering techniques for efficient image retrieval. The top-scoring images retrieved by CLIP are later verified by human annotators to ensure 99% precision (crowdsourced MTurk workers for CLEAR-10 and a commercial labeling service for CLEAR-100).
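As a rough illustration of this retrieval step, the sketch below scores images against prompt-engineered class queries with the open-source CLIP package; the class names, the prompt template "a photo of {class}", and the top-k cutoff are placeholder assumptions rather than the exact settings used to build CLEAR.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical class names and prompt template; the real CLEAR ontology
# and prompts may differ.
class_names = ["laptop", "camera", "bus", "sweater", "racing"]
prompts = [f"a photo of {name}" for name in class_names]

with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)

def score_images(image_paths):
    """Return a (num_images, num_classes) cosine-similarity matrix."""
    images = torch.stack([preprocess(Image.open(p).convert("RGB"))
                          for p in image_paths]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(images)
        image_features /= image_features.norm(dim=-1, keepdim=True)
    return image_features @ text_features.T

def top_k_per_class(image_paths, k=100):
    """Keep the top-k candidates per class (k is a placeholder)
    to be sent to human annotators for verification."""
    sims = score_images(image_paths)
    topk = sims.topk(min(k, len(image_paths)), dim=0).indices  # (k, num_classes)
    return {name: [image_paths[i] for i in topk[:, c].tolist()]
            for c, name in enumerate(class_names)}
```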

The entire pipeline can be summarized as follows: download a web-scale YFCC100M subset, split it into temporal buckets by upload time, retrieve candidate images for each class with CLIP and engineered prompts, and have human annotators verify the top-ranked candidates.

Assets for Future Research

In addition to the high-quality labeled subset, we also release a wealth of assets per time bucket for future research on continual learning, including:

  • Abundant unlabeled data: For research on unsupervised continual learning. This includes ~0.8M unlabeled images per bucket for CLEAR-10, and ~3.6M unlabeled images per bucket for CLEAR-100.

  • Metadata: For research on continual multi-modal learning. This includes all metadata released with YFCC100M, such as upload and capture timestamps, capture location, social media hashtags, user descriptions, and image titles (see the parsing sketch after this list).

  • Instruction sets for human annotation: For improving dataset transparency. The CLEAR-10 instruction set used on the MTurk platform can be found in the supplementary material of our NeurIPS'21 paper. The CLEAR-100 instruction set can be found here (Chinese version).
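As a minimal example of consuming these assets, the sketch below iterates over a YFCC100M-style tab-separated metadata file; the column indices are assumptions about the dump layout and should be checked against the files released with each bucket.

```python
import csv

# Column indices are assumptions about the YFCC100M TSV layout; verify
# them against the metadata files shipped with each bucket.
DATE_TAKEN_COL = 3
DATE_UPLOADED_COL = 4
TITLE_COL = 6
DESCRIPTION_COL = 7
USER_TAGS_COL = 8

def iter_metadata(tsv_path):
    """Yield a small dict of fields per image from a YFCC100M-style TSV."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            yield {
                "date_taken": row[DATE_TAKEN_COL],
                "date_uploaded": row[DATE_UPLOADED_COL],
                "title": row[TITLE_COL],
                "description": row[DESCRIPTION_COL],
                "tags": row[USER_TAGS_COL].split(","),
            }
```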

Check out this repo if you want to build your own dataset from YFCC100M using a visio-linguistic approach.
