Claude Opus 4.6 crushes benchmarks with 1M-token beta window

Anthropic has introduced Claude Opus 4.6, an upgraded version of its most advanced AI model.

The company says the new model improves significantly in coding, reasoning, long-context understanding, and real-world knowledge work.

Claude Opus 4.6

Claude Opus 4.6 is designed to plan tasks more carefully, work for longer periods without losing focus, and operate more reliably in large codebases.

It also improves code review and debugging, helping it detect and correct its own mistakes. For the first time in the Opus series, the model includes a 1-million-token context window in beta.

Beyond software development, the model is built to handle everyday professional tasks. It can run financial analysis, conduct research, and create or edit documents, spreadsheets, and presentations.

The company says that, inside its Cowork environment, Claude can manage multiple tasks autonomously.

Performance across benchmarks

Anthropic says that Claude Opus 4.6 delivers state-of-the-art results across several industry benchmarks. It achieved the highest score on Terminal-Bench 2.0, an evaluation focused on agentic coding.

It also leads on Humanity’s Last Exam, a complex reasoning benchmark covering multiple disciplines.

On GDPval-AA, which measures performance on economically valuable knowledge tasks in finance and legal domains, Opus 4.6 outperforms OpenAI’s GPT-5.2 by about 144 Elo points and improves on its own predecessor, Claude Opus 4.5, by 190 points. The model also tops BrowseComp, a benchmark that tests its ability to find difficult information online.
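To put the Elo gaps in perspective, here is a minimal sketch that converts a rating difference into an expected head-to-head score. It assumes the conventional logistic Elo formula with a 400-point scale; the article does not state how GDPval-AA computes its ratings, so treat the numbers as a rough illustration. The function name `expected_score` is ours.

```python
def expected_score(elo_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side under the standard Elo model.

    Assumption: GDPval-AA ratings follow the usual 400-point logistic scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


# Gaps quoted in the article: about 144 points over GPT-5.2, 190 over Opus 4.5.
for label, gap in [("vs GPT-5.2", 144), ("vs Claude Opus 4.5", 190)]:
    print(f"{label}: expected score of roughly {expected_score(gap):.2f}")
```

Under that assumption, a 144-point lead corresponds to an expected score of about 0.70 and a 190-point lead to about 0.75, where 0.50 would mean the two models are evenly matched.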

Anthropic reports that the model performs better at retrieving relevant information from very large document sets. In long-context testing, Opus 4.6 maintains accuracy across hundreds of thousands of tokens with less “context rot,” a common issue in which models lose track of earlier information.

On the 8-needle 1M variant of MRCR v2, a benchmark that hides key facts in massive text volumes, Opus 4.6 scored 76%, compared to 18.5% for Sonnet 4.5. The model also shows strong results in software engineering, multilingual coding, cybersecurity, long-term coherence, root cause analysis, and life sciences knowledge.

Safety and alignment

Anthropic says the performance gains do not reduce safety. On its automated behavioural audit, Opus 4.6 showed low levels of problematic behaviours such as deception, sycophancy, or encouraging harmful misuse. The company states that the model is as aligned as Claude Opus 4.5 and has the lowest rate of unnecessary refusals among recent Claude models.

For this release, Anthropic conducted expanded safety evaluations, including new tests focused on user well-being, refusal behaviour, and hidden harmful actions. The company also used new interpretability methods to better understand how the model reaches its decisions.

Because the model demonstrates stronger cybersecurity skills, Anthropic added six new cybersecurity probes to monitor potential misuse. At the same time, the company is promoting defensive uses of the model, including identifying and fixing vulnerabilities in open-source software.

Developer and product updates

Anthropic introduced several upgrades across Claude’s developer platform and tools to support Opus 4.6. Key updates include adaptive thinking, where the model decides when deeper reasoning is needed.

Developers can now choose from four effort levels: low, medium, high (default), and max, to balance intelligence, speed, and cost.

The API also includes context compaction in beta, which automatically summarises older conversation content to allow longer-running tasks.
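The general idea behind compaction can be sketched in a few lines. This is not Anthropic's implementation, which the article does not describe: the function `compact_history`, its parameters, and the placeholder summariser are all hypothetical, and a real system would call a model to write the summary.

```python
from typing import Callable, List


def compact_history(
    messages: List[str],
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace all but the last `keep_recent` messages with one summary entry."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(messages) <= keep_recent:
        return list(messages)  # nothing old enough to compact
    cut = len(messages) - keep_recent
    older, recent = messages[:cut], messages[cut:]
    return [summarize(older)] + recent


def naive_summary(older: List[str]) -> str:
    # Placeholder only: a real summariser would condense the content itself.
    return f"[summary of {len(older)} earlier messages]"


history = ["plan", "step 1", "step 2", "step 3"]
print(compact_history(history, keep_recent=2, summarize=naive_summary))
```

The trade-off is that older detail is lost to the summary in exchange for room to keep working, which is why the sketch leaves the most recent messages untouched.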

Opus 4.6 supports up to 128,000 output tokens and offers a 1-million-token context window in beta, with premium pricing for prompts above 200,000 tokens. For workloads that require US-based processing, a US-only inference option is available at 1.1× token pricing.

Claude Opus 4.6 is available through claude.ai, Anthropic’s API, and major cloud platforms. Pricing remains unchanged at $5 per million input tokens and $25 per million output tokens, with higher rates applying for extended context usage.
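As a rough illustration of what those rates mean per request, the sketch below estimates cost from token counts. It uses only the figures quoted above ($5 and $25 per million tokens, and the 1.1× US-only multiplier). It assumes the multiplier applies to both input and output tokens, and it refuses prompts above 200,000 tokens because the article does not give the premium rate. The function name `estimate_cost_usd` is ours.

```python
INPUT_USD_PER_MILLION = 5.0      # standard input rate quoted in the article
OUTPUT_USD_PER_MILLION = 25.0    # standard output rate quoted in the article
US_ONLY_MULTIPLIER = 1.1         # assumed to apply to input and output alike
STANDARD_PROMPT_LIMIT = 200_000  # above this, premium pricing applies


def estimate_cost_usd(input_tokens: int, output_tokens: int, us_only: bool = False) -> float:
    """Estimate the cost of one request at standard rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if input_tokens > STANDARD_PROMPT_LIMIT:
        # The premium long-context rate is not stated in the article.
        raise ValueError("prompt exceeds 200,000 tokens; premium rate unknown")
    cost = (
        input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    )
    return cost * US_ONLY_MULTIPLIER if us_only else cost


print(f"${estimate_cost_usd(100_000, 10_000):.3f}")
print(f"${estimate_cost_usd(100_000, 10_000, us_only=True):.3f}")
```

For a 100,000-token prompt with a 10,000-token reply, this works out to about $0.75 at standard rates, or about $0.83 with US-only inference under the stated assumption.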