Benchmark Level Meaning

News

10h

Anthropic’s Claude Opus 4 and Sonnet 4 Set a New Benchmark in AI Coding

Anthropic's new Claude Opus 4 and Sonnet 4 AI models deliver state-of-the-art performance in coding and agentic workflows.

10h

AI benchmarking platform is helping top companies rig their model performances, study claims

LMArena, a popular benchmark for large language models, has been accused of giving preferential treatment to AIs made by big ...

VentureBeat, 12h

After GPT-4o backlash, researchers benchmark models on moral endorsement—Find sycophancy persists across the board

They called the benchmark Elephant, for Evaluation of LLMs as Excessive SycoPHANTs, and found that every large language model (LLM) has a certain level of sycophancy. By understanding how sycophantic ...
