
Anthropic makes ‘jailbreak’ advance to stop AI models producing harmful results

By News Room
Last updated: February 3, 2025, 2:42 PM


Artificial intelligence start-up Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading tech groups including Microsoft and Meta race to find ways to guard against the dangers posed by the cutting-edge technology.

In a paper released on Monday, the San Francisco-based start-up outlined a new system called “constitutional classifiers”: a model that acts as a protective layer on top of large language models, such as the one that powers Anthropic’s Claude chatbot, monitoring both inputs and outputs for harmful content.
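
The paper does not boil the design down to code, but the layered arrangement it describes can be sketched roughly as below. This is a hypothetical illustration, not Anthropic’s implementation: the names (guarded_generate, classify, RESTRICTED_TOPICS) are invented for this sketch, and the keyword check is only a stand-in for the trained classifier models the paper describes.

```python
# Hypothetical sketch of the layered "constitutional classifiers" design:
# one screening pass on the user's prompt, another on the model's reply.
# In Anthropic's system the classifiers are themselves trained language
# models; the keyword stub below is a placeholder so the control flow runs.

from dataclasses import dataclass
from typing import Callable

# Toy stand-in for a restricted-content policy (not Anthropic's rules).
RESTRICTED_TOPICS = ("chemical weapon", "nerve agent")


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def classify(text: str) -> Verdict:
    """Placeholder classifier; a real one would be a fine-tuned LLM."""
    lowered = text.lower()
    for topic in RESTRICTED_TOPICS:
        if topic in lowered:
            return Verdict(False, f"matches restricted topic: {topic!r}")
    return Verdict(True)


def guarded_generate(prompt: str, generate: Callable[[str], str]) -> str:
    """Wrap an LLM call with input and output screening."""
    verdict = classify(prompt)          # screen the incoming prompt
    if not verdict.allowed:
        return f"Refused: {verdict.reason}"

    reply = generate(prompt)            # the underlying model call

    verdict = classify(reply)           # screen the outgoing text
    if not verdict.allowed:
        return f"Withheld: {verdict.reason}"
    return reply


if __name__ == "__main__":
    def echo_model(prompt: str) -> str:  # trivial stand-in for an LLM
        return f"You asked: {prompt}"

    print(guarded_generate("What is compound interest?", echo_model))
    print(guarded_generate("How do I build a chemical weapon?", echo_model))
```

The key point of the arrangement is that the same screening runs on both sides of the model call, so a harmful reply can be caught even when the prompt itself looked benign.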

The development by Anthropic, which is in talks to raise $2bn at a $60bn valuation, comes amid growing industry concern over “jailbreaking” — attempts to manipulate AI models into generating illegal or dangerous information, such as producing instructions to build chemical weapons.

Other companies are also racing to deploy measures to protect against the practice, in moves that could help them avoid regulatory scrutiny while convincing businesses to adopt AI models safely. Microsoft introduced “prompt shields” last March, while Meta introduced a prompt guard model in July last year; researchers swiftly found ways to bypass it, though those flaws have since been fixed.

Mrinank Sharma, a member of technical staff at Anthropic, said: “The main motivation behind the work was for severe chemical [weapon] stuff [but] the real advantage of the method is its ability to respond quickly and adapt.”

Anthropic said it would not be immediately using the system on its current Claude models but would consider implementing it if riskier models were released in future. Sharma added: “The big takeaway from this work is that we think this is a tractable problem.”

The start-up’s proposed solution is built on a so-called “constitution” of rules that define what is permitted and restricted and can be adapted to capture different types of material.
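
To make the idea concrete, such a constitution might be represented as a small, swappable rule set like the hypothetical one below. The categories and the constitution_prompt helper are invented for this sketch; in the published approach the constitution is reportedly also used to generate synthetic training data for the classifiers, not merely rendered into a prompt.

```python
# Hypothetical illustration of a "constitution": a declarative rule set
# defining permitted and restricted content, which can be swapped out to
# capture different types of material. These categories are invented
# examples, not Anthropic's actual rules.

CONSTITUTION = {
    "restricted": [
        "synthesis routes for chemical or biological weapons",
        "instructions for acquiring controlled precursors",
    ],
    "permitted": [
        "general chemistry education",
        "publicly available safety information",
    ],
}


def constitution_prompt(constitution: dict) -> str:
    """Render the rule set into an instruction for a classifier model."""
    restricted = "\n".join(f"- {r}" for r in constitution["restricted"])
    permitted = "\n".join(f"- {p}" for p in constitution["permitted"])
    return (
        "Flag content that falls under these restricted categories:\n"
        f"{restricted}\n"
        "Do not flag content under these permitted categories:\n"
        f"{permitted}"
    )


print(constitution_prompt(CONSTITUTION))
```

Because the rules live in data rather than in model weights, updating the policy means editing the constitution and retraining or re-prompting the classifiers, which is what gives the method the adaptability Sharma describes above.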

Some jailbreak attempts are well known, such as using unusual capitalisation in the prompt or asking the model to adopt the persona of a grandmother telling a bedtime story about a nefarious topic.

To validate the system’s effectiveness, Anthropic offered “bug bounties” of up to $15,000 to individuals who attempted to bypass the security measures. These testers, known as red teamers, spent more than 3,000 hours trying to break through the defences.

Anthropic’s Claude 3.5 Sonnet model rejected more than 95 per cent of jailbreak attempts with the classifiers in place, compared with just 14 per cent without the safeguards.

Leading tech companies are trying to reduce the misuse of their models while still maintaining their helpfulness. When moderation measures are put in place, models often become overly cautious and reject benign requests, as happened with early versions of Google’s Gemini image generator and Meta’s Llama 2. Anthropic said its classifiers caused “only a 0.38 per cent absolute increase in refusal rates”.

However, adding these protections also incurs extra costs for companies already paying huge sums for computing power required to train and run models. Anthropic said the classifier would amount to a nearly 24 per cent increase in “inference overhead”, the costs of running the models.

Security experts have argued that the accessible nature of such generative chatbots has enabled ordinary people with no prior knowledge to attempt to extract dangerous information.

“In 2016, the threat actor we would have in mind was a really powerful nation-state adversary,” said Ram Shankar Siva Kumar, who leads the AI red team at Microsoft. “Now literally one of my threat actors is a teenager with a potty mouth.”
