
Why computer-made data is being used to train AI models

By News Room
Last updated: 2023/07/19 at 2:47 AM

Artificial intelligence companies are exploring a new avenue to obtain the massive amounts of data needed to develop powerful generative models: creating the information from scratch.

Microsoft, OpenAI and Cohere are among the groups testing the use of so-called “synthetic data” — computer-generated information — to train their AI systems, known as large language models (LLMs), as they reach the limits of human-made data that can further improve the cutting-edge technology.

The launch of Microsoft-backed OpenAI’s ChatGPT last November has led to a flood of products, rolled out publicly this year by companies including Google and Anthropic, that can produce plausible text, images or code in response to simple prompts.

The technology, known as generative AI, has driven a surge of investor and consumer interest, with the world’s biggest technology companies including Google, Microsoft and Meta racing to dominate the space.

Currently, LLMs that power chatbots like OpenAI’s ChatGPT and Google’s Bard are trained primarily by scraping the internet. Data used to train these systems includes digitised books, news articles, blogs, search queries, Twitter and Reddit posts, YouTube videos and Flickr images, among other content.

Humans are then used to provide feedback and fill gaps in the information in a process known as reinforcement learning from human feedback (RLHF).

But as generative AI software becomes more sophisticated, even deep-pocketed AI companies are running out of easily accessible and high-quality data to train on. Meanwhile, they are under fire from regulators, artists and media organisations around the world over the volume and provenance of personal data consumed by the technology.

At an event in London in May, OpenAI’s chief executive Sam Altman was asked whether he was worried about regulatory probes into ChatGPT’s potential privacy violations. Altman brushed it off, saying he was “pretty confident that soon all data will be synthetic data”. 

Generic data from the web is no longer good enough to push the performance of AI models, according to developers.

“If you could get all the data that you needed off the web, that would be fantastic,” said Aidan Gomez, chief executive of $2bn LLM start-up Cohere. “In reality, the web is so noisy and messy that it’s not really representative of the data that you want. The web just doesn’t do everything we need.”

Currently, the most cutting-edge models, such as OpenAI’s GPT-4, are approaching human-level performance in areas such as writing and coding, and are able to pass benchmarks such as the US bar exam.

To dramatically improve their performance and be able to address challenges in science, medicine or business, AI models will require unique and sophisticated data sets. These will either have to be created by world experts such as scientists, doctors, authors, actors or engineers, or acquired as proprietary data from large corporations such as pharmaceutical companies, banks and retailers. However, “human-created data . . . is extremely expensive”, Gomez said.

The new trend of using synthetic data sidesteps this costly requirement. Instead, companies can use AI models to produce text, code or more complex information related to healthcare or financial fraud. This synthetic data is then used to train advanced LLMs to become ever more capable.

According to Gomez, Cohere, as well as several of its competitors, already uses synthetic data that is then fine-tuned and tweaked by humans. “[Synthetic data] is already huge . . . even if it’s not broadcast widely,” he said.

For example, to train a model on advanced mathematics, Cohere might use two AI models talking to each other, where one acts as a maths tutor and the other as the student.

“They’re having a conversation about trigonometry . . . and it’s all synthetic,” Gomez said. “It’s all just imagined by the model. And then the human looks at this conversation and goes in and corrects it if the model said something wrong. That’s the status quo today.”
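
To make the mechanics concrete, the exchange Gomez describes could be scripted along the lines of the sketch below, which assumes an OpenAI-style chat-completion client; the model name, prompts and number of turns are illustrative, not Cohere’s actual pipeline.

# Hypothetical sketch: two LLMs play "tutor" and "student" to generate a
# synthetic trigonometry dialogue, which a human would later review and
# correct. Model name, prompts and turn count are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TUTOR_ROLE = "You are a patient maths tutor explaining trigonometry."
STUDENT_ROLE = "You are a student asking short questions about trigonometry."

def reply(system_prompt: str, transcript: list[str]) -> str:
    """Ask one model for its next turn, given the conversation so far."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": "\n".join(transcript) or "Begin."}]
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content

transcript: list[str] = []
for turn in range(4):  # alternate student question / tutor answer
    transcript.append("Student: " + reply(STUDENT_ROLE, transcript))
    transcript.append("Tutor: " + reply(TUTOR_ROLE, transcript))

# The transcript is entirely model-generated ("all synthetic"); a human
# annotator would then read it and fix any incorrect maths before it is
# added to a fine-tuning dataset.
print("\n".join(transcript))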

Two recent studies from Microsoft Research showed that synthetic data could be used to train models that were smaller and simpler than state-of-the-art software like OpenAI’s GPT-4 or Google’s PaLM-2.

One paper described a synthetic data set of short stories generated by GPT-4, which contained only words that a typical four-year-old might understand. This data set, known as TinyStories, was then used to train a simple LLM that was able to produce fluent and grammatically correct stories. The other paper showed that an AI model could be trained on synthetic Python code in the form of textbooks and exercises, and the researchers found it performed relatively well on coding tasks.
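
In outline, the TinyStories-style generation step could look something like the sketch below; the prompt wording, model name and output format are assumptions for illustration rather than the paper’s released code.

# Hypothetical sketch of TinyStories-style data generation: ask a large
# model for stories restricted to a small child's vocabulary, then store
# them as training examples for a much smaller LLM.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("Write a short story of three to four sentences using only words "
          "that a typical four-year-old would understand.")

with open("tiny_stories_synthetic.jsonl", "w") as f:
    for _ in range(100):  # a real dataset would contain millions of stories
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,  # encourage variety between stories
        )
        story = resp.choices[0].message.content
        f.write(json.dumps({"text": story}) + "\n")

# The resulting file can then be used to train a small language model from
# scratch on fluent, grammatically simple text.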

Start-ups such as Scale AI and Gretel.ai have sprung up to provide synthetic data as a service. Gretel, set up by former US intelligence analysts from the National Security Agency and the CIA, works with companies including Google, HSBC, Riot Games and Illumina to augment their existing data with synthetic versions that can help train better AI models. 

The key component of synthetic data, according to Gretel chief executive Ali Golshan, is that it preserves the privacy of all individuals in a data set, while still maintaining its statistical integrity.

Well-crafted synthetic data can also remove biases and imbalances in existing data, he added. “Hedge funds can look at black swan events and, say, create a hundred variations to see if our models crack,” Golshan said. For banks, where fraud typically constitutes less than a 100th of a per cent of total data, Gretel’s software can generate “thousands of edge case scenarios on fraud and train [AI] models with it.” 
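
Gretel’s tooling itself is proprietary, but the imbalance problem Golshan describes can be illustrated with a toy stand-in: fit a simple distribution to the handful of real fraud rows and sample thousands of synthetic ones from it. The column names and the Gaussian model below are assumptions for illustration only; production synthetic-data tools rely on far richer generative models.

# Toy illustration (not Gretel's method): oversample a rare fraud class by
# fitting a multivariate Gaussian to the few real fraud rows and sampling
# synthetic ones from it.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Imbalanced transactions: fraud is well under 1 per cent of the data.
n_legit, n_fraud = 100_000, 50
legit = pd.DataFrame({"amount": rng.lognormal(3.0, 1.0, n_legit),
                      "hour": rng.integers(0, 24, n_legit),
                      "is_fraud": 0})
fraud = pd.DataFrame({"amount": rng.lognormal(6.0, 0.5, n_fraud),
                      "hour": rng.integers(0, 6, n_fraud),
                      "is_fraud": 1})

# Fit a Gaussian to the numeric fraud features and draw synthetic fraud rows
# (crude for a toy: real tools would respect column types and constraints).
features = fraud[["amount", "hour"]].to_numpy(dtype=float)
mean, cov = features.mean(axis=0), np.cov(features, rowvar=False)
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=5_000),
                         columns=["amount", "hour"])
synthetic["is_fraud"] = 1

# The training set now contains thousands of (synthetic) fraud examples
# instead of 50, which makes the rare class far easier for a model to learn.
train = pd.concat([legit, fraud, synthetic], ignore_index=True)
print(train["is_fraud"].value_counts())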

Critics point out that not all synthetic data will be carefully curated to reflect or improve on real-world data. As AI-generated text and images start to fill the internet, it is likely that AI companies crawling the web for training data will inevitably end up using raw data produced by primitive versions of their own models — a phenomenon known as “dog-fooding”.

Research from universities including Oxford and Cambridge recently warned that training AI models on their own raw outputs, which may contain falsehoods or fabrications, could corrupt and degrade the technology over time, causing “irreversible defects”.

Golshan agrees that training on poor synthetic data could impede progress. “The content on the web is more and more AI-generated, and I do think that will lead to degradation over time [because] LLMs are producing regurgitated knowledge, without any new insights,” he said. 

Despite these risks, AI researchers like Cohere’s Gomez say that synthetic data has the potential to accelerate the path to superintelligent AI systems.

“What you really want is models to be able to teach themselves. You want them to be able to . . . ask their own questions, discover new truths and create their own knowledge,” he said. “That’s the dream.” 
