As AI labs push to build ever more powerful models, a fundamental challenge has emerged: we no longer have effective ways to measure what these systems can actually do. Traditional benchmarks are saturated, with many models scoring above 90% on popular tests like MMLU and GSM8K. For businesses, this creates a problem: they cannot reliably assess a model’s capabilities, communicate results to stakeholders, or ensure compliance with evolving regulations.
Generic public benchmarks often fail to reflect domain-specific requirements or proprietary data distributions. Teams are left without granular measurements for model selection or adaptive testing, all while racing against competitive pressures demanding faster development cycles.
A revolutionary approach: Psychometrics meets AI
Trismik, a Cambridge-based startup founded in 2025, may have found a solution. Its team, combining former Amazon and Salesforce engineers with Cambridge scientists, is applying Item Response Theory (IRT) and Computerised Adaptive Testing (CAT), methods long used in standardised human intelligence testing, to AI evaluation.
By dynamically adjusting test difficulty in real time, Trismik’s platform maps a model’s capabilities with unprecedented precision. Early results are promising: on four out of five datasets, adaptive tests deliver rankings nearly identical to full evaluations while requiring only 8.5% of the questions. This approach not only reduces evaluation costs by up to 95% but also offers far more nuanced insights than simple accuracy percentages.
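To see how IRT-based adaptive testing works in principle, the sketch below simulates a tiny CAT session. It is an illustrative toy, not Trismik’s actual implementation: the two-parameter logistic (2PL) model, the made-up item parameters, and the maximum-information selection rule are standard textbook choices assumed here for the example.

```python
import math
import random

def p_correct(theta, a, b):
    """2PL IRT model: probability a test-taker of ability theta
    answers an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information an item contributes at ability theta (2PL)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_theta(responses, grid):
    """Posterior-mean ability estimate on a grid, standard-normal prior."""
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # N(0, 1) prior
        for (a, b), correct in responses:
            p = p_correct(theta, a, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total

def adaptive_test(answer_fn, items, n_questions):
    """Run a CAT session: repeatedly ask the remaining item that is
    most informative at the current ability estimate, then re-estimate."""
    grid = [i / 10.0 for i in range(-40, 41)]  # abilities from -4 to 4
    responses, theta = [], 0.0
    pool = list(items)
    for _ in range(n_questions):
        a, b = max(pool, key=lambda it: item_information(theta, *it))
        pool.remove((a, b))
        responses.append(((a, b), answer_fn(a, b)))
        theta = estimate_theta(responses, grid)
    return theta

# Simulate a "model" with true ability 1.2 answering 20 adaptively
# chosen items out of a pool of 200 with random parameters.
random.seed(0)
items = [(random.uniform(0.8, 2.0), random.uniform(-3, 3)) for _ in range(200)]
true_theta = 1.2
answer = lambda a, b: random.random() < p_correct(true_theta, a, b)
estimate = adaptive_test(answer, items, 20)
print(f"estimated ability: {estimate:.2f}")
```

Because each question is chosen to be maximally informative at the current estimate, the ability score converges with far fewer items than a fixed-form test, which is the intuition behind the 8.5% figure above.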
The company has snapped up £2.2 million in pre-seed funding, led by Twinpath Ventures, with participation from Cambridge Enterprise Ventures, Parkwalk Advisors, Fund F, Vento Ventures, and angel investors via Ventures Together.
The funding will be used to scale Trismik’s platform, expand coverage of public datasets for reasoning, safety, alignment, and domain-specific knowledge, and accelerate user onboarding ahead of a planned enterprise launch in early 2026.
The team behind this AI innovation
Trismik’s mission is rooted in the vision of its founders. Nigel Collier, a prolific researcher with over 200 published papers, has long explored whether AI could be evaluated as efficiently and fairly as humans. Partnering with Rebekka Mikkola, a repeat founder passionate about AI and women in tech, Collier helped build Trismik’s first MVP in collaboration with a major UK telco. CTO Marco Basaldella, a former Amazon scientist and TEDx speaker, rounds out the team, blending commercial and scientific expertise. Together, they aim to transform AI evaluation from crude accuracy metrics to rich, interpretable ability distributions.
What’s next for Trismik?
Over the next year, Trismik plans to expand the platform into a full-fledged environment for designing, running, and analysing large language model experiments, including fine-tuning and prompt engineering. Enterprise clients will soon benefit from advanced experiment tracking, data visualisation, and compliance-ready infrastructure. By bridging psychometrics with adaptive testing, Trismik hopes to make AI not just smarter, but measurable in ways that are precise, transparent, and cost-effective.
“If we want to trust AI, our methods have to be as rigorous as our ideas. Benchmark saturation is creating problems in every domain, from general knowledge to reasoning, math, and coding. Scientists, researchers and technical teams face mounting pressure as evaluation is exploding in importance and has become essential for tying AI to trust. We need an evaluation framework that scales and can support this,” said Professor Nigel Collier, a Cambridge NLP researcher and Trismik’s Chief Scientific Officer.
“The AI evaluation market is at an inflection point. Every AI team we speak with is drowning in evaluation overhead; it has become the hidden bottleneck preventing teams from shipping faster and with confidence. Trismik’s approach is compelling because it applies proven scientific methods from a completely different domain to solve this problem. When you can reduce evaluation time by two orders of magnitude while actually increasing measurement precision, you fundamentally change what’s possible in AI development cycles,” said John Spindler of Twinpath Ventures.