To optimize data curation for AI, Lightly uses self-supervised learning


All machine learning models are subject to one critical factor: the quality of the data on which the model is trained.

The challenge of data curation to improve the quality of machine learning and AI models is well understood. A 2021 MIT research study found systemic issues in the way training data was labeled, leading to inaccurate results in AI systems. A study in the journal Quantitative Science Studies that analyzed 141 previous studies on data labeling found that 41% of the models used datasets that had been labeled by humans.

Among the vendors attempting to meet the challenge of optimizing data curation for AI is a Swiss startup, Lightly. Founded in 2019, the company announced this week that it has raised $3 million in a seed round of funding. Lightly does not seek to be a data labeling vendor, though. Instead, the company wants to help curate the data using a self-supervised machine learning model that could one day reduce the need for data labeling operations altogether.

“I continue to be amazed at how much of the work in machine learning is manual, very tedious, and not at all automated,” Lightly co-founder Matthias Heller told VentureBeat. “People always think that with machine learning everything is very advanced, but machine learning and deep learning, in particular, is a very young technology and a lot of the tools and infrastructure are only now available.”

A growing market for data curation and labeling

There is no shortage of money or vendors in the market to help optimize data for machine learning, whether through data curation or data labeling.

For example, the company that was known as DefinedCrowd before rebranding in 2021 has raised $78 million to date to help advance its data curation vision.

Grand View Research has forecast that the data labeling market will reach $8.2 billion by 2028, with a projected CAGR of 24.6% between 2021 and 2028. VentureBeat’s own list of the top providers of data labeling software includes Figure Eight from Appen, Amazon SageMaker Ground Truth, SuperAnnotate, Dataloop, and Darwin from V7.

Other popular providers include the open source Labelbox and Label Studio, both of which can be integrated with Lightly’s technology. Overall, Lightly plans an open approach, so that users can use the company’s technology with any labeling provider.

How the self-supervised model works

Three years ago, Heller and his co-founder Igor Susmelj were working on a machine learning project that required them to label their data.

“We always wondered if the data we were labeling really helped improve the model,” Heller said.

That led to Lightly, which includes a number of open source projects. The main project is the Lightly library, which provides a self-supervised approach to machine learning on images.

There are multiple approaches to training data for machine learning, Heller explained. In a supervised approach, such as with computer vision, there is an image and an associated label used in combination to teach a model, with a human being doing the labeling.

Unsupervised learning, on the other hand, is the opposite: there is no need for human interaction. The self-supervised model that Lightly enables falls somewhere in the middle, requiring minimal human interaction.

“You can use the self-supervised model to select data because the model learns certain information, certain similarities, what belongs together and what is different,” Heller said.
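To make the idea of selecting data by similarity concrete, here is a minimal sketch. It is not Lightly's actual method; it assumes each image has already been turned into an embedding vector by some self-supervised model, and it uses greedy farthest-point sampling as one plausible way to pick a small, diverse subset to send for labeling. The function names `cosine_distance` and `select_diverse` are hypothetical and chosen only for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as dissimilar to everything
    return 1.0 - dot / (norm_a * norm_b)


def select_diverse(embeddings, budget, start=0):
    """Greedy farthest-point sampling over embeddings (hypothetical helper).

    Returns `budget` indices, each chosen to be as far as possible from
    the samples already selected, so near-duplicates are picked last.
    """
    if not 0 < budget <= len(embeddings):
        raise ValueError("budget must be between 1 and len(embeddings)")
    selected = [start]
    # Distance from every sample to its nearest already-selected sample.
    nearest = [cosine_distance(e, embeddings[start]) for e in embeddings]
    nearest[start] = -1.0  # mark as taken so it is never picked again
    while len(selected) < budget:
        idx = max(range(len(embeddings)), key=nearest.__getitem__)
        selected.append(idx)
        for i, e in enumerate(embeddings):
            nearest[i] = min(nearest[i], cosine_distance(e, embeddings[idx]))
        nearest[idx] = -1.0
    return selected


# Toy embeddings: two near-duplicate pairs and one sample unlike the rest.
toy = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99], [-1.0, 0.0]]
print(select_diverse(toy, budget=3))
```

With a budget of three, the sketch skips the near-duplicates and returns one sample from each distinct region of the toy data. In practice the budget could be set to a small fraction of the dataset, for example `max(1, round(0.01 * len(embeddings)))`.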

From open source to commercial solution

While Lightly is free to use as an open source technology, it still requires users to do a lot of the work to set up the proper environment and manage settings.

Lightly’s commercial service provides a managed offering with the infrastructure, tuned algorithms, and learning framework all set up for users.

“Our main competition today is in-house tools,” Heller said. “We use self-supervised learning to tell you which 1% of the data to label and use for model training.”

Looking ahead, Heller provocatively forecasts that the day may come when data labeling will no longer be needed, as unsupervised machine learning continues to improve.

“I think the need for labels will drop significantly in the next few years,” Heller said. “Maybe in the future, we won’t need labels anymore.”
