
AI can fix bugs, but can’t find them: OpenAI’s study highlights limits of LLMs in software engineering

Pulse Reporter
Last updated: February 19, 2025 6:42 am



Large language models (LLMs) may have changed software development, but enterprises will need to think twice about entirely replacing human software engineers with LLMs, despite OpenAI CEO Sam Altman’s claim that models can replace “low-level” engineers.

In a new paper, OpenAI researchers detail how they developed an LLM benchmark called SWE-Lancer to test how much foundation models can earn from real-life freelance software engineering tasks. The test found that, while the models can solve bugs, they can’t see why the bug exists and continue to make more mistakes.

The researchers tasked three LLMs (OpenAI’s GPT-4o and o1 and Anthropic’s Claude 3.5 Sonnet) with 1,488 freelance software engineering tasks from the freelance platform Upwork, amounting to $1 million in payouts. They divided the tasks into two categories: individual contributor tasks (resolving bugs or implementing features) and management tasks (where the model roleplays as a manager who chooses the best proposal to resolve an issue).

“Results indicate that the real-world freelance work in our benchmark remains challenging for frontier language models,” the researchers write.

The test shows that foundation models cannot fully replace human engineers. While they can help solve bugs, they’re not quite at the level where they can start earning freelancing cash by themselves.

Benchmarking freelancing models

The researchers and 100 other professional software engineers identified potential tasks on Upwork and, without changing any words, fed these to a Docker container to create the SWE-Lancer dataset. The container does not have internet access and cannot access GitHub “to avoid the [possibility] of models scraping code diffs or pull request details,” they explained.

The team identified 764 individual contributor tasks, totaling about $414,775, ranging from 15-minute bug fixes to weeklong feature requests. The remaining 724 tasks were management tasks, which included reviewing freelancer proposals and job postings, and would pay out $585,225.
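The figures quoted above can be cross-checked with a few lines of arithmetic. This is only an illustrative sketch: it treats the $585,225 figure as the payout for the tasks that are not individual contributor tasks, which is a reading of the article's numbers and not something taken from the benchmark's code. All variable names are made up for the example.

```python
# Figures as quoted in the article.
TOTAL_TASKS = 1_488
IC_TASKS = 764            # individual contributor tasks
IC_PAYOUT = 414_775       # dollars
OTHER_PAYOUT = 585_225    # dollars; assumed here to be the management tasks

# Tasks that are not individual contributor tasks.
other_tasks = TOTAL_TASKS - IC_TASKS

# The two payout pools should add up to the $1 million headline figure.
total_payout = IC_PAYOUT + OTHER_PAYOUT

# Share of the total payout attached to individual contributor work.
ic_share = IC_PAYOUT / total_payout

print(other_tasks)          # 724
print(total_payout)         # 1000000
print(f"{ic_share:.1%}")    # 41.5%
```

On these numbers, individual contributor tasks make up just over half of the task count but well under half of the money.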

The tasks were added to the expensing platform Expensify.

The researchers generated prompts based on the task title and description and a snapshot of the codebase. If there were additional proposals to resolve the issue, “we also generated a management task using the issue description and list of proposals,” they explained.
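A rough sketch of what that prompt-generation step might look like is shown below. Everything in it is hypothetical: the `Issue` class, the field names, the prompt wording, and the rule that a management task is created only when at least two proposals exist are assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Issue:
    """Hypothetical container for one freelance issue."""
    title: str
    description: str
    snapshot_ref: str  # assumed identifier for the codebase snapshot
    proposals: List[str] = field(default_factory=list)


def build_ic_prompt(issue: Issue) -> str:
    """Build an individual contributor prompt from title, description and snapshot."""
    return (
        f"Task: {issue.title}\n"
        f"Description: {issue.description}\n"
        f"Codebase snapshot: {issue.snapshot_ref}\n"
        "Write a patch that resolves the issue."
    )


def build_management_prompt(issue: Issue) -> Optional[str]:
    """Build a management prompt, or return None when there is nothing to choose between."""
    if len(issue.proposals) < 2:  # assumed threshold
        return None
    numbered = "\n".join(
        f"{i}. {text}" for i, text in enumerate(issue.proposals, start=1)
    )
    return (
        f"Issue: {issue.description}\n"
        f"Proposals:\n{numbered}\n"
        "Choose the best proposal."
    )
```

In this sketch every issue yields an individual contributor prompt, while the management prompt is optional and depends on the list of proposals.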

From here, the researchers moved to end-to-end test development. They wrote Playwright tests for each task that apply the generated patches; the tests were then “triple-verified” by professional software engineers.

“Tests simulate real-world user flows, such as logging into the application, performing complex actions (making financial transactions) and verifying that the model’s solution works as expected,” the paper explains.
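One way to picture how end-to-end test outcomes could translate into earnings is the sketch below. It assumes an all-or-nothing rule, where a task pays out only if every one of its tests passes; the article does not spell out the grading rule, so treat that as an assumption. The `TaskResult` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Hypothetical record of one attempted task."""
    task_id: str
    payout: float              # dollar value attached to the task
    test_outcomes: List[bool]  # one entry per end-to-end test


def earned(result: TaskResult) -> float:
    """Return the task's payout only if at least one test ran and all passed (assumed rule)."""
    # all([]) is True in Python, so an empty test list is rejected explicitly.
    passed = bool(result.test_outcomes) and all(result.test_outcomes)
    return result.payout if passed else 0.0


def total_earned(results: List[TaskResult]) -> float:
    """Sum the earnings over every attempted task."""
    return sum(earned(r) for r in results)


results = [
    TaskResult("bug-fix", 250.0, [True, True]),
    TaskResult("feature", 1000.0, [True, False]),  # one failing test, earns nothing
    TaskResult("untested", 500.0, []),             # no tests ran, earns nothing
]
print(total_earned(results))  # 250.0
```

Under a rule like this, a partially correct patch earns the same as no patch at all.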

Test results

After running the test, the researchers found that none of the models earned the full $1 million value of the tasks. Claude 3.5 Sonnet, the best-performing model, earned only $208,050 and resolved 26.2% of the individual contributor issues. However, the researchers point out, “the majority of its solutions are incorrect, and higher reliability is needed for trustworthy deployment.”

The models performed well across most individual contributor tasks, with Claude 3.5 Sonnet performing best, followed by o1 and GPT-4o.

“Agents excel at localizing, but fail to root cause, resulting in partial or flawed solutions,” the report explains. “Agents pinpoint the source of an issue remarkably quickly, using keyword searches across the whole repository to quickly locate the relevant file and functions, often far faster than a human would. However, they often exhibit a limited understanding of how the issue spans multiple components or files, and fail to address the root cause, leading to solutions that are incorrect or insufficiently comprehensive. We rarely find cases where the agent aims to reproduce the issue or fails due to not finding the right file or location to edit.”

Interestingly, the models all performed better on manager tasks that required reasoning to evaluate technical understanding.

These benchmark tests showed that AI models can solve some “low-level” coding problems but can’t replace “low-level” software engineers yet. The models still took time, often made mistakes, and couldn’t chase a bug around to find the root cause of coding problems. Many “low-level” engineers work better, but the researchers said this may not be the case for very long.


