Engineering at Harmonic: Lessons from our latest experiments in model tuning

Published on April 15, 2026

Building PII detection that works in production is not just an accuracy problem. Anyone can train a model that scores well on a held-out test set if they have enough compute and enough patience. The harder constraint is latency: our models run in-line, inspecting content as it moves toward an AI tool before it gets there, which means a classification decision that takes too long creates visible friction for the user. Accuracy and speed are in constant tension, and the usual casualty of chasing one is the other.

That's the context for something I built earlier this year: a Claude Code plugin we're calling autotune.

Why precision in PII detection is harder than it sounds

Harmonic's detection models sit in the path of real employee (or agent) activity. When someone pastes content into an AI tool, we're analyzing it and making a classification decision in milliseconds. A false negative means sensitive data slips through undetected. A false positive interrupts a legitimate workflow and erodes trust in the product.

The tolerance for error is low in both directions, and the inputs are genuinely hard. PII comes in many forms, is often abbreviated, truncated, or wrapped in context that changes its meaning, and varies significantly across industries and regions. A well-trained model has to handle the full distribution of that variation while still running fast enough to work inline.

Precision, specifically, is the metric that matters most for usability. A model with high recall but loose precision will flag too many false positives. Users learn to ignore the alerts. The security signal degrades. So when we received customer feedback pointing to precision as an area to sharpen on one component of our PII detection stack, we took it seriously.
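To make the precision/recall tension concrete, here is a toy illustration in plain Python. The labels and predictions are invented for illustration, not Harmonic data:

```python
# Toy illustration: precision vs. recall on a batch of binary
# classifier decisions (1 = flagged as PII). Invented data.

def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A high-recall, low-precision model: it catches every PII item but
# also flags a chunk of benign content, so users drown in alerts.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
# p = 3/5 = 0.6, r = 3/3 = 1.0
```

Perfect recall here still means two out of every five alerts are false alarms, which is exactly the failure mode that trains users to ignore the product.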

What is autotune?

Andrej Karpathy famously published autoresearch, a project that uses an AI agent to autonomously run a research loop: it reads papers, forms hypotheses, designs experiments, executes them, and iterates on the results without a human directing each step. I read it and immediately saw the same structure in our model tuning problem. You have a search space (hyperparameters, pipeline components, feature sets, thresholds), a clear objective function (precision, recall, F1), and a repeatable way to evaluate each configuration. The bottleneck isn't knowing what to try. It's having the time to try everything systematically.

So I built autotune as a Claude Code plugin, applying autoresearch's core approach to ML model optimization rather than literature review. Point an agent at your ML repo, give it a trial budget, and let it work through the search space.

The agent reads the codebase to understand the existing pipeline, forms hypotheses about where improvements might exist, runs experiments using DVC to track results cleanly, and applies changes that show genuine improvement. Critically, it's not just grid search over hyperparameters. It can propose architectural changes to the pipeline itself: adding preprocessing steps, engineering new feature groups, trying dimensionality reduction, adjusting decision thresholds. It learns from each trial and adjusts what it tries next based on what's working.

The setup is straightforward if you have a DVC pipeline: define a trial budget, point it at the repo, and run it.
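In spirit, the trial loop looks something like the sketch below. This is my schematic reconstruction, not the plugin's actual code, and the search space and objective here are hypothetical stand-ins:

```python
# Schematic of a budgeted trial loop: propose a configuration based on
# the history so far, evaluate it, and keep the best-scoring one.
import random

def run_trials(evaluate, propose, budget=30, seed=0):
    """Run `budget` trials; `propose` yields a config given the trial
    history, `evaluate` returns the objective (e.g. precision)."""
    rng = random.Random(seed)
    history = []                      # (config, score) pairs to learn from
    best = (None, float("-inf"))
    for _ in range(budget):
        config = propose(history, rng)
        score = evaluate(config)
        history.append((config, score))
        if score > best[1]:
            best = (config, score)
    return best

# Hypothetical one-dimensional search space: a decision threshold.
def propose(history, rng):
    return {"threshold": round(rng.uniform(0.3, 0.9), 2)}

def evaluate(config):
    # Stand-in objective that peaks at threshold = 0.7.
    return 1.0 - abs(config["threshold"] - 0.7)

best_config, best_score = run_trials(evaluate, propose)
```

The real plugin's `propose` step is an agent reading the codebase and prior results rather than a random sampler, and `evaluate` is a DVC-tracked experiment run rather than a function call, but the budget-constrained structure is the same.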

What 30 trials uncovered that we hadn't thought to look for

I ran autotune against the PII detection component in question. The goal was to improve precision. I gave it a budget of 30 trials and let it go.

The result was a 20% improvement in F1 score, with precision gains driving it. F1 is the harmonic mean of precision and recall, so a 20% gain there without trading one against the other means the model got meaningfully better at the actual job: catching more of what it should catch, with fewer false alarms.
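As a quick sanity check on what a 20% relative F1 gain implies, here is the arithmetic. The before/after numbers are illustrative only; the post does not publish the raw values:

```python
# F1 is the harmonic mean of precision and recall. Illustrative
# numbers chosen to show the shape of a ~20% relative F1 gain.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

before = f1(0.70, 0.80)      # ≈ 0.747
after = f1(0.88, 0.92)       # ≈ 0.900
gain = after / before - 1    # ≈ 0.20, i.e. ~20% relative improvement
```

Because the harmonic mean is dragged down by whichever of the two metrics is weaker, a gain of this size requires both to be healthy; it cannot come from inflating recall at precision's expense.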

This was not a model that had been neglected or was running on a years-old configuration. It had been tuned, it was performing well in production, and I genuinely did not think there was meaningful room left. That's part of what made the result surprising.

The agent did not find the improvement by doing anything exotic. Over 30 trials, it added dimensionality reduction and normalization components to the pipeline, engineered new feature groups, and tuned thresholds in ways that stacked rather than traded off against each other. The precision improvements were real and they held up when we validated them properly.
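For readers who want a picture of what "stacked" pipeline changes look like, here is a minimal sketch using scikit-learn. The components, shapes, and threshold are illustrative assumptions, not the production pipeline:

```python
# Sketch of the kinds of changes the agent proposed: a normalization
# step, dimensionality reduction, and an explicit decision threshold.
# All data and numbers here are synthetic stand-ins.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                   # stand-in feature matrix
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

pipe = Pipeline([
    ("scale", StandardScaler()),                 # normalization step
    ("reduce", PCA(n_components=10)),            # dimensionality reduction
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

# Tuning the decision threshold (rather than the default 0.5)
# trades recall for precision.
threshold = 0.7
probs = pipe.predict_proba(X)[:, 1]
preds = (probs >= threshold).astype(int)
```

Each of these changes is individually unremarkable; the point is that the agent tested them in combination and kept only the combinations whose gains stacked.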

What it did that a human tuning session would be unlikely to replicate was to methodically test combinations across the full pipeline, not just the obvious hyperparameters. Engineers naturally focus attention on the parts of a model they believe are most likely to move the needle. The agent has no such prior. It just runs experiments and follows the evidence.

The case for systematic over intuitive, even when the model is good

The interesting lesson here is not that the model had room to improve; almost every model does. The lesson is that the agent found improvements that experienced engineers, who knew the model well, had missed.

That's a different claim, and it's worth sitting with. Human tuning is constrained by intuition, by what we expect to work, by the time cost of running and evaluating experiments. Those constraints push toward local search: we try variations on things we already think are good. Systematic agents don't have that bias. They'll try the dimensionality reduction that seemed unnecessary, the normalization step that seemed redundant, the threshold configuration that seemed counterintuitive.

Sometimes those ideas don't pan out. But occasionally they do, and across 30 trials the improvements compound.

The flip side is that systematic search has its own constraints. A trial budget is finite, and the search space for a non-trivial ML pipeline is enormous. Autotune is not a replacement for knowing your model; it's better understood as a complement to it. You still need to validate results, catch regressions on edge cases, and make sure improvements on one metric haven't come at the expense of another. The agent finds candidates. You still have to judge them.

What this means for how we build models going forward

We're continually training new models and refining existing ones across Harmonic's detection stack. The use case above is one example; there are others in flight. Autotune is becoming a standard part of our model development process rather than a one-off experiment, particularly for cases where we want to systematically probe a well-performing model before shipping changes.

The inline latency constraint doesn't go away. Any pipeline change the agent proposes has to be tested for its runtime cost, not just its accuracy impact. Dimensionality reduction, for instance, can improve precision while actually helping latency by reducing the compute load at inference time. Normalization adds a step but often pays for itself. These trade-offs are worth evaluating carefully rather than assuming they cut against each other.
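The intuition that dimensionality reduction can pay for itself is easy to check on the back of an envelope. The shapes below are assumed for illustration, not measured production numbers, and the model is a case where the downstream step scales with feature count (here, a nearest-neighbour scan over reference vectors):

```python
# Back-of-envelope inference cost in multiply-adds, with assumed
# shapes: project once per query, then every downstream comparison
# runs at the reduced dimensionality.
N = 10_000                               # reference vectors scanned per query
d_raw, d_red = 50, 10                    # feature dims before/after reduction

cost_without = N * d_raw                 # scan at full dimensionality
cost_with = d_raw * d_red + N * d_red    # one projection, then reduced scan
speedup = cost_without / cost_with       # roughly 5x under these assumptions
```

The projection is a fixed per-query cost, so whether it nets out positive depends on how much work runs after it; that is exactly why each proposed change has to be measured for runtime cost rather than assumed.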

The broader point is that ML model development, at scale, benefits from tools that can search the space you don't have time to search manually. That's true even when the model is already good.

Catch us at AWS Summit London, 22 April

If you're attending AWS Summit London next week, our CTO Bryan Woolgar-O'Neil is speaking on 22 April alongside the AWS team on customizing AI models and accelerating time to production with Amazon SageMaker AI. The talk will cover how teams are actually moving models from experimentation into production environments, which connects directly to the kind of work described above. Worth attending if you're in London.

© 2026 Harmonic Security