OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied

A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices.

When OpenAI unveiled o3 in December, the company claimed the model could answer just over a fourth of questions on FrontierMath, a challenging set of math problems. That score blew the competition away; the next-best model managed to answer only around 2% of FrontierMath problems correctly.

“Today, all offerings out there have less than 2% [on FrontierMath],” said Mark Chen, chief research officer at OpenAI, during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”

As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing behind it than the model OpenAI publicly launched last week.

Epoch AI, the research institute behind FrontierMath, released the results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.

That doesn’t mean OpenAI lied, per se. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private),” wrote Epoch.

According to a post on X from the ARC Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” corroborating Epoch’s report.

“All released o3 compute tiers are smaller than the version we [benchmarked],” wrote ARC Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.

OpenAI’s own Wenda Zhou, a member of the technical staff, said during a livestream last week that the o3 in production is “more optimized for real-world use cases” and speed compared to the version of o3 demoed in December. As a result, it may exhibit benchmark “disparities,” he added.

“[W]e’ve done [optimizations] to make the [model] more cost efficient [and] more useful in general,” Zhou said. “We still hope that — we still think that — it’s a much better model […] You won’t have to wait as long when you’re asking for an answer, which is a real thing with these [types of] models.”

Granted, the fact that the public release of o3 falls short of OpenAI’s testing promises is a bit of a moot point, since the company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.

It is, however, another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.

Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.

In January, Epoch was criticized for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren’t informed of OpenAI’s involvement until it was made public.

More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.

Updated 4:21 p.m. Pacific: Added comments from Wenda Zhou, a member of the OpenAI technical staff, from a livestream last week.
