As AI researchers and companies race to train bigger and better machine learning models, curating suitable datasets is becoming a growing challenge.
To solve this problem, researchers from Meta AI, Google, INRIA, and Université Paris Saclay have introduced a new technique for automatically curating high-quality datasets for self-supervised learning (SSL).
Their method uses embedding models and clustering algorithms to curate large, diverse, and balanced datasets without the need for manual annotation.
Balanced datasets in self-supervised learning
Self-supervised learning has become a cornerstone of modern AI, powering large language models, visual encoders, and even domain-specific applications like medical imaging.
Unlike supervised learning, which requires every training example to be annotated, SSL trains models on unlabeled data, enabling the scaling of both models and datasets on raw data.
However, data quality is crucial for the performance of SSL models. Datasets assembled randomly from the internet are not evenly distributed across concepts.
This means that a few dominant concepts take up a large portion of the dataset while others appear less frequently. This skewed distribution can bias the model toward the frequent concepts and prevent it from generalizing to unseen examples.
“Datasets for self-supervised learning should be large, diverse, and balanced,” the researchers write. “Data curation for SSL thus involves building datasets with all these properties. We propose to build such datasets by selecting balanced subsets of large online data repositories.”
Currently, much manual effort goes into curating balanced datasets for SSL. While not as time-consuming as labeling every training example, manual curation is still a bottleneck that hinders training models at scale.
Automatic dataset curation
To address this challenge, the researchers propose an automatic curation technique that creates balanced training datasets from raw data.
Their approach leverages embedding models and clustering-based algorithms to rebalance the data, making rarer concepts more prominent relative to prevalent ones.
First, a feature-extraction model computes the embeddings of all data points. Embeddings are numerical representations of the semantic and conceptual features of different data such as images, audio, and text.
Next, the researchers use k-means, a popular clustering algorithm that starts from a set of initial cluster centers, assigns each data point to its nearest center, and then recalculates each center as the mean of the points assigned to it. Repeating these two steps yields groups, or clusters, of related examples.
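For readers who want to see the mechanics, here is a minimal sketch of plain k-means in pure Python. It is not the researchers' implementation. The function name `kmeans`, the fixed iteration count, the seed, and the toy 2-D points standing in for embeddings are all assumptions made for illustration.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k data points picked at random
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Toy "embeddings": two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
centroids, labels = kmeans(points, k=2)
```

On this toy input the first three points end up sharing one label and the last three the other. Real pipelines would run an optimized library implementation on high-dimensional embeddings rather than a loop like this.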
However, classic k-means clustering tends to create more groups for concepts that are overly represented in the dataset.
To overcome this issue and create balanced clusters, the researchers apply a multi-step hierarchical k-means approach, which builds a tree of data clusters in a bottom-up manner.
In this approach, each new stage applies k-means to the clusters obtained in the previous stage. The algorithm uses a sampling strategy to make sure concepts are well represented at each level of the clusters.
This is clever because the clustering works not only across the points at a given level but also across the levels of the tree. That keeps less represented examples from being dropped as the algorithm moves upward toward fewer, yet more descriptive, top-level clusters.
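The bottom-up idea can be sketched in a few lines: cluster the data points, then cluster the resulting centroids, and so on. The code below is a simplified illustration under stated assumptions, not the method from the paper. It omits the sampling strategy mentioned above, and the names `kmeans` and `hierarchical_kmeans`, the level sizes in `ks`, and the toy data are hypothetical.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Compact plain k-means on tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, labels

def hierarchical_kmeans(points, ks, seed=0):
    """Return one (centroids, labels) pair per level, bottom level first.

    Level 0 clusters the data points; every later level clusters the
    centroids produced by the level below it.
    """
    levels, current = [], points
    for k in ks:
        centroids, labels = kmeans(current, k, seed=seed)
        levels.append((centroids, labels))
        current = centroids  # the next level sees only this level's centroids
    return levels

rng = random.Random(1)
# Toy, imbalanced 2-D "embeddings": one dominant concept and two rare ones.
points = (
    [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    + [(rng.gauss(8, 0.3), rng.gauss(8, 0.3)) for _ in range(10)]
    + [(rng.gauss(-8, 0.3), rng.gauss(8, 0.3)) for _ in range(10)]
)
levels = hierarchical_kmeans(points, ks=[6, 3])
# Map every data point to its top-level cluster by following the tree upward.
top_labels = [levels[1][1][low] for low in levels[0][1]]
```

Because the second level sees each lower-level centroid exactly once, a concept that occupies many fine-grained clusters carries less weight there than it does in the raw data, which is the intuition behind the rebalancing.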
The researchers describe the technique as a “generic curation algorithm agnostic to downstream tasks” that “allows the possibility of inferring interesting properties from completely uncurated data sources, independently of the specificities of the applications at hand.”
In other words, given any raw dataset, hierarchical clustering can create a training dataset that is diverse and well-balanced.
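Once every example has a cluster label, a balanced subset can be drawn by taking roughly the same number of examples from each cluster. The sketch below shows a flat version of that idea. It is an assumption-laden simplification rather than the researchers' sampling procedure, and `balanced_sample`, `per_cluster`, and the toy labels are hypothetical names and data.

```python
import random
from collections import defaultdict

def balanced_sample(labels, per_cluster, seed=0):
    """Pick up to `per_cluster` example indices from every cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, label in enumerate(labels):
        by_cluster[label].append(idx)
    picked = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        # Small clusters contribute everything they have; large ones are subsampled.
        picked.extend(rng.sample(members, min(per_cluster, len(members))))
    return picked

# Toy cluster labels: cluster 0 dominates the raw data.
labels = [0] * 90 + [1] * 7 + [2] * 3
subset = balanced_sample(labels, per_cluster=3)
counts = {c: sum(1 for i in subset if labels[i] == c) for c in (0, 1, 2)}
```

Here the raw data is 90% cluster 0, but the sampled subset holds three examples from each cluster.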
Evaluating auto-curated datasets
The researchers performed extensive experiments on computer vision models trained on datasets curated with hierarchical clustering. The images they used had no manual labels or text descriptions.
They found that features trained on their curated dataset led to better performance on image classification benchmarks, especially on out-of-distribution examples, which are images that differ substantially from the training data. These features also led to significantly better performance on retrieval benchmarks.
Notably, models trained on their automatically curated dataset performed nearly on par with those trained on manually curated datasets, which require significant human effort to create.
The researchers also applied their algorithm to text data for training large language models and satellite imagery for training a canopy height prediction model. In both cases, training on the curated datasets led to significant improvements across all benchmarks.
Interestingly, their experiments show that models trained on well-balanced datasets can compete with state-of-the-art models while being trained on fewer examples.
The automatic dataset curation technique introduced in this work can have important implications for applied machine learning projects, especially for industries where labeled and curated data is hard to come by.
The technique has the potential to greatly alleviate the costs related to annotation and manual curation of datasets for self-supervised learning. A well-trained SSL model can be fine-tuned for downstream supervised learning tasks with very few labeled examples. This method could pave the way for more scalable and efficient model training.
Another important use can be for big companies like Meta and Google, which are sitting on huge amounts of raw data that have not been prepared for model training. “We believe [automatic dataset curation] will be increasingly important in future training pipelines,” the researchers write.
The post Meta and Google researchers’ new data curation method could transform self-supervised learning appeared first on Venture Beat.