Tuesday, March 3, 2026

A new AI coding challenge just published its first results, and they aren't pretty



A new AI coding challenge has revealed its first winner, and set a new bar for AI-powered software engineers.

On Wednesday at 5 p.m. PT, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win was his final score: he won with correct answers to just 7.5% of the questions on the test.

“We’re glad we built a benchmark that’s actually hard,” said Konwinski. “Benchmarks should be hard if they’re going to matter,” he continued, adding: “Scores would be different if the big labs had entered with their largest models. But that’s kind of the point. K Prize runs offline with limited compute, so it favors smaller and open models. I love that. It levels the playing field.”

Konwinski has pledged $1 million to the first open-source model that can score higher than 90% on the test.

Similar to the well-known SWE-Bench system, the K Prize tests models against flagged issues from GitHub to gauge how well they can deal with real-world programming problems. But while SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed as a “contamination-free version of SWE-Bench,” using a timed entry system to guard against any benchmark-specific training. For round one, models were due by March 12. The K Prize organizers then built the test using only GitHub issues flagged after that date.

The 7.5% top score stands in marked contrast to SWE-Bench itself, which currently shows a 75% top score on its easier “Verified” test and 34% on its harder “Full” test. Konwinski still isn’t sure whether the disparity is due to contamination on SWE-Bench or simply the difficulty of collecting new issues from GitHub, but he expects the K Prize project to answer the question soon.

“As we get more runs of the thing, we’ll have a better sense,” he told TechCrunch, “because we expect people to adapt to the dynamics of competing on this every few months.”


It might seem like an odd place to fall short, given the wide range of AI coding tools already publicly available. But with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving AI’s growing evaluation problem.

“I’m quite bullish about building new tests for existing benchmarks,” says Princeton researcher Sayash Kapoor, who put forward a similar idea in a recent paper. “Without such experiments, we can’t actually tell if the issue is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop.”

For Konwinski, it’s not just a better benchmark, but an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” he says. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.”


