"It feels better" is not a QA strategy.

Define what good looks like, then measure it. We build evaluation frameworks, regression suites, and quality gates so AI systems improve instead of drifting.

AI quality tends to be judged by vibes. It looks good in the demo, then production goes sideways and nobody can explain why. Without evaluation, you are guessing.

We build evaluation frameworks that test the right things. Regression suites that catch failures before users do. Acceptance criteria and quality gates that turn subjective quality into measurable standards.

You get confidence in every change, and a system that improves with each release instead of getting quietly worse.

Services offered

Evaluation framework design

Define what good looks like with metrics tied to real user outcomes, not just model scores.

Regression test suites

Build regression suites that catch quality drops before they reach production.

Acceptance criteria and quality gates

Clear thresholds that decide whether a change ships or goes back for rework.

Benchmarking and performance tracking

Track performance over time so improvements are real, not imagined.

Ready to discuss your needs?

Book a 30-minute call


Things That is a product engineering practice focused on building AI systems that help people make sense of complexity. For over 20 years, we've worked with teams at Google, IBM, Air New Zealand, Kpler, EE, News UK, Tesco, and The Economist, sitting in the space between product strategy and hands-on engineering. When specialist help is needed, we work with a network of senior consultants and product designers.

We're not consultants who hand off to developers. We're product engineers and designers who think strategically about what users need, then build it, from architecture to APIs to interfaces to production deployment. Our work is hands-on: writing code, reviewing pull requests, designing schemas, and testing edge cases, always with the human experience in mind.