How MIT is training AI language models in an era of quality data scarcity

December 6, 2022

Improving the robustness of machine learning (ML) models for natural language tasks has become a major artificial intelligence (AI) topic in recent years. Large language models (LLMs) are among the most active areas of AI research, buoyed by the rise of generative AI and by companies racing to release architectures that can produce impressively readable content, even computer code.

Language models have traditionally been trained using online texts from sources such as Wikipedia, news stories, scientific papers and novels. However, in recent years, the tendency has been to train these models on increasing amounts of data in order to improve their accuracy and versatility.

But, according to a team of AI forecasters, there is a concern on the horizon: we may run out of data to train them on. Researchers from Epoch emphasize in a study that high-quality data generally used for training language models may be depleted as early as 2026. As developers create more sophisticated models with superior capabilities, they must gather more texts to train them on, and LLM researchers are now increasingly concerned about running out of quality data.

Kalyan Veeramachaneni, a principal research scientist in the MIT Laboratory for Information and Decision Systems and leader of the lab’s Data-to-AI group, may have found the solution. In a paper on Rewrite and Rollback (“R&R: Metric-Guided Adversarial Sentence Generation”), recently published in the Findings of AACL-IJCNLP 2022, he and his team propose a framework that can tweak low-quality data (from sources such as Twitter and 4chan) into high-quality data (text from sources with editorial filters, such as Wikipedia and industry websites), increasing the amount of the right type of data for training and testing language models.

Data scarcity looming large

Language AI researchers generally divide the data they use to train models into high-quality and low-quality data. High-quality data is generally defined as coming from sources that “have passed usefulness or quality filters,” as noted by the Epoch study. In other words, it has either been reviewed for editorial quality, whether professionally or through peer review (in the case of scientific papers, published novels, Wikipedia, etc.), or it has drawn positive engagement from many users (as with filtered web content).

Data in the low-quality category consists of unfiltered, user-generated text, such as social media postings or comments on websites like 4chan, and these instances far outnumber those rated high quality.

Training LLMs with flawed, low-quality datasets can lead to many issues:

  • Mislabeled examples in the dataset introduce noise into the training, which can confuse the model and decrease the model quality.
  • Spurious correlations (e.g., sentences with certain words always getting one particular label) encourage the model to pick up incorrect shortcuts and lead it to make mistakes in real scenarios (see the toy example after this list).
  • Data bias (e.g., a dataset containing text only from a specific group of people) makes the model perform poorly on particular inputs.

High-quality datasets can alleviate these issues.
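
A toy illustration of the spurious-correlation problem, invented here purely for demonstration and not taken from the paper, shows how a classifier trained on a skewed dataset can latch onto a single word as a shortcut:

```python
# Toy example (not from the paper): every positive training sentence contains
# the word "great", so a bag-of-words classifier learns that word as a
# shortcut and then mislabels a negative sentence that happens to contain it.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "great acting and a great script",   # positive
    "a great soundtrack, loved it",      # positive
    "dull plot and wooden dialogue",     # negative
    "boring from start to finish",       # negative
]
train_labels = [1, 1, 0, 0]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

# A clearly negative review that contains the shortcut word.
print(clf.predict(["not great, a complete waste of time"]))  # likely predicts 1
```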

Since ML models rely on training data to learn how to make predictions, data quality dramatically impacts the quality of the model. As a result, researchers often train models only on high-quality data, since they want their models to reproduce that superior fluency. Training LLMs on high-quality text samples enables a model to pick up the intricacies and complexity inherent in language. This approach has yielded outstanding results for complex language models like GPT-3.

Veeramachaneni says that aiming for more intelligent and articulate text generation can also be helpful when training LLMs on real-life human discourse.

“Text from your average social media post, blog, etc., may not achieve this high quality, which brings down the overall quality of the training set,” Veeramachaneni told VentureBeat. “We thought, could we use existing high-quality data to train LLMs (which we now already have access to LLMs trained on high-quality data) and use those LLMs to raise the quality of the other data?” 

MIT addresses current challenges in LLM development

Veeramachaneni explained that training LLMs requires massive amounts of training data and computing resources, which are only available to tech giants. This means most individual researchers must depend on the LLMs generated and released by tech giants rather than making their own.

He said that despite LLMs becoming larger and requiring more training data, the bottleneck is still computational power most of the time. 

“Annotated high-quality data for downstream tasks [is] hard to obtain. Even if we design a method to create higher-quality sentences from lower-quality ones, how would we know the method did the job correctly? Asking humans to annotate data is expensive and not scalable.” 

“So, R&R provides a method to use LLMs reliably to improve the quality of sentences,” he said. 

Veeramachaneni believes that, in terms of model quality, current LLMs need to improve their ability to generate long documents.

“Current models can answer questions with a few sentences but cannot write a fictional story with a theme and a logical plot. Architecture improvement is necessary for LMs to handle longer text,” said Veeramachaneni. “There are also more and more concerns about the potential negative impacts of LLMs. For example, LLMs may remember personal information from the training data and leak it when generating text. This issue is hard to detect, as most LLMs are black boxes.”

Veeramachaneni and the research team in MIT’s Data-to-AI group aim to solve such issues through their Rewrite and Rollback framework. 

A new method of adversarial generation from the MIT team

In the paper “R&R: Metric-Guided Adversarial Sentence Generation,” the research team proposes an adversarial framework that can generate high-quality text data by optimizing a critique score that combines fluency, similarity and misclassification metrics. R&R generates high-quality adversarial examples by capturing text data from different sources and rephrasing it, for example by tweaking a sentence in various ways to develop a set of alternative sentences.

“Given 30K words in its vocabulary, it can produce an arbitrary number of sentences. Then it winnows these down to the highest-quality sentences in terms of grammatical quality, fluency and semantic similarity to the original sentence,” Veeramachaneni told VentureBeat.
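
In rough outline, that generate-and-winnow loop can be sketched as follows. This is a simplified illustration rather than the authors’ implementation; the fluency, similarity and attack_confidence helpers are hypothetical placeholders for the fluency, similarity and misclassification metrics that the critique score combines.

```python
# Simplified sketch (not the R&R codebase) of scoring candidate rewrites with
# a combined critique score and keeping only the best ones. The three scoring
# functions are hypothetical placeholders for the paper's fluency, semantic
# similarity and misclassification metrics.

def critique_score(candidate, original, fluency, similarity, attack_confidence,
                   weights=(1.0, 1.0, 1.0)):
    """Weighted combination of fluency, similarity to the original sentence,
    and the target classifier's confidence in the wrong label."""
    w_flu, w_sim, w_atk = weights
    return (w_flu * fluency(candidate)
            + w_sim * similarity(candidate, original)
            + w_atk * attack_confidence(candidate))

def winnow(candidates, original, fluency, similarity, attack_confidence, keep=10):
    """Keep only the highest-scoring candidate sentences."""
    ranked = sorted(
        candidates,
        key=lambda c: critique_score(c, original, fluency,
                                     similarity, attack_confidence),
        reverse=True,
    )
    return ranked[:keep]
```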

To perform this winnowing, R&R uses an LLM trained on high-quality sentences to filter out candidate sentences that are not grammatically correct or fluent. First, it attempts to rewrite the whole sentence, with no limit on how many words are changed; then it tries to roll back some of the edits to arrive at a minimal set of modifications.
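
The rollback step itself can be pictured roughly as below. This is a minimal sketch that assumes, for simplicity, a token-aligned rewrite of the same length as the original; still_succeeds is a hypothetical stand-in for re-checking the critique score after each reverted edit.

```python
# Minimal sketch of the rollback idea (not the authors' code): start from a
# fully rewritten sentence and try to undo each edit, keeping a change only if
# undoing it breaks the desired effect. Assumes the rewrite is token-aligned
# with the original; still_succeeds is a hypothetical critique re-check.

def rollback(original_tokens, rewritten_tokens, still_succeeds):
    """Revert unnecessary edits so the final sentence differs minimally."""
    tokens = list(rewritten_tokens)
    for i, (old, new) in enumerate(zip(original_tokens, rewritten_tokens)):
        if old == new:
            continue
        tokens[i] = old                  # tentatively undo this edit
        if not still_succeeds(tokens):   # the edit was needed after all
            tokens[i] = new              # restore it
    return tokens
```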

“Because text classifiers generally need to be trained on human-labeled data, they are often trained with small datasets, meaning they can easily be fooled and misclassify sentences. We used R&R to generate many of these sentences that could fool a text classifier and therefore could be used to train and improve it,” explained Veeramachaneni.
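
In outline, that train-and-improve loop might look like the sketch below, where generate_adversarial stands in for R&R and the classifier is any model with a scikit-learn-style fit/predict interface; both are assumptions made for illustration rather than the authors’ actual pipeline.

```python
# Sketch of augmenting a text classifier's training set with adversarial
# rewrites that currently fool it; generate_adversarial and the classifier
# interface are hypothetical stand-ins, not the R&R pipeline itself.

def augment_with_adversarial(train_texts, train_labels, classifier,
                             generate_adversarial):
    """Add misclassified rewrites (with their original labels) and retrain."""
    new_texts, new_labels = list(train_texts), list(train_labels)
    for text, label in zip(train_texts, train_labels):
        for candidate in generate_adversarial(text):
            # Keep rewrites the current classifier gets wrong: these are
            # exactly the examples it still needs to learn from.
            if classifier.predict([candidate])[0] != label:
                new_texts.append(candidate)
                new_labels.append(label)
    classifier.fit(new_texts, new_labels)   # retrain on the expanded set
    return classifier
```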

It’s also possible to use R&R to transform a low-quality or poorly written sentence into a better-quality sentence. Such a method can have several applications, from editing assistance for human writing to creating more data for LLMs. 

The stochastic rewrite feature allows the tool to explore a larger text space, while the rollback feature lets it make meaningful changes with minimal edits. This combination is powerful because it explores many options and can find multiple different adversarial examples for the same sentence. As a result, R&R can generate fluent sentences that are semantically similar to a target sentence without human intervention.

“The primary use case of R&R is to conduct adversarial attacks on text classifiers,” said Veeramachaneni. “Given a sentence, it can find similar sentences where the classifier misclassified. R&R-generated sentences can help expand these training sets, thus improving text classifiers’ quality, which may also increase their potential applications.”

Discussing the challenges faced while developing R&R, Veeramachaneni told VentureBeat that traditional methods for finding alternative sentences change only one word at a time. When designing the rewrite step, the team likewise initially built the technique to mask only one word, that is, to change one word at a time, but found that this led to a change in meaning from the original sentence.

“Such a design led to the model getting stuck because there are not many options for a single masked position,” he said. “We overcome this by masking multiple words in each step. This new design also enabled the model to change the length of the text. Hence we introduced the rollback step, which eliminates unnecessary perturbations/changes.”
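
A generic version of that multi-word masking step, using an off-the-shelf BERT-style masked language model rather than the R&R system itself, might look like the sketch below; the sentence and the masked positions are arbitrary examples.

```python
# Illustrative sketch (not the R&R code): mask several words at once and let a
# masked language model propose replacements, yielding one candidate rewrite.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

sentence = "the movie was pretty bad and way too long"
inputs = tokenizer(sentence, return_tensors="pt")

# Mask two token positions in the same step (indices chosen for illustration).
masked = inputs["input_ids"].clone()
positions = [4, 7]
masked[0, positions] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(input_ids=masked,
                   attention_mask=inputs["attention_mask"]).logits

# Sample a replacement for every masked position to form a candidate sentence.
for pos in positions:
    probs = torch.softmax(logits[0, pos], dim=-1)
    masked[0, pos] = torch.multinomial(probs, num_samples=1).item()

print(tokenizer.decode(masked[0], skip_special_tokens=True))
```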

The research team says that R&R can also help people change their writing in pursuit of a specific goal: for instance, it can be used to make a sentence more persuasive, more concise, etc. Both automatic and human evaluation of the R&R framework showed that the proposed method succeeds in optimizing the automatic similarity and fluency metrics to generate adversarial examples of higher quality than previous methods.

The future of LLMs and generative AI 

Veeramachaneni believes that LLMs will push the boundaries for human discourse in the near future and hopes to see more applications of LLMs in 2023. 

“LLMs will be able to quickly and easily summarize and provide existing information. As a result, what we write and our interactions with each other will have to be more meaningful and insightful. It is progress,” he said. 

Veeramachaneni further explained that LLMs are currently only being used to summarize text or answer questions, but there are many more possible applications.

“As the potential of these tools is continually realized, we expect a usage boom. The recent release of ChatGPT by OpenAI has demonstrated good text-generation capability. We can expect tech giants to compete on larger models and release larger models with better performance,” said Veeramachaneni. 

“At the same time, we expect serious evaluations of LLMs’ limitations and vulnerabilities. It is clear that LLMs can produce meaningful, readable sentences. Now, we expect people to begin focusing on evaluating the factual information contained in the generated text.”

The post How MIT is training AI language models in an era of quality data scarcity appeared first on Venture Beat.
