In my previous post, I described building an AI agent for Harvest time tracking using Pydantic AI — driven partly by security concerns with the skill-based approach. The agent worked perfectly with Claude. Then I tried running it locally, and this particular local model struggled with things the cloud model handled effortlessly.
It miscalculated a date and added a time entry on Tuesday instead of Monday. When I asked it to fix that, it added the Monday entry but forgot to delete the Tuesday one. This wasn’t a damning verdict on local models — but it was a useful lesson about where to invest your time when building agents.
Going Local
Why run locally? Partly practical — no API costs, no data leaving my machine. But it’s also about ownership. When you depend on a hosted model, you’re trusting that the provider’s incentives stay aligned with yours. The announcement of ads coming to ChatGPT suggests a future that looks a lot like other ad-supported platforms — and that’s a future where your tools might not be working entirely for you. Running locally means you own your infrastructure, your data, and your model. Nobody’s optimizing your experience for engagement or ad revenue.
I’d already been experimenting with running models locally for coding tasks, which I wrote about previously on my own blog. Since then, llama.cpp has added built-in router mode — you run a single server process that auto-loads and unloads models on demand based on the model field in your API request. With --models-max 1 it evicts the current model when you request a different one, which works well for my local setup with an RTX 3060 12GB where only one large model fits in VRAM at a time.
My first test model was Qwen3-Coder-Next — 80B total parameters but only 3B active per token (it’s a Mixture-of-Experts model with 512 experts, 10 selected per token). I was using the Q4_K_M quantization, about 46GB on disk. It’s built for coding tasks, not calendar math and time-tracking. So it wasn’t exactly a fair fight from the start, and the quantization may have further degraded its reasoning on this kind of task.
Where Things Went Wrong
I want to be careful here — this isn’t a story about local models being bad. Local models can be very good, and they’re getting better fast. GLM-5.1, for example, is a 754B parameter model with reportedly near-Opus 4.6 capability under an MIT license. The fact that you can self-host something at that level at all is incredible, even if you need serious hardware to run it.
This was more of an exercise in seeing if I could make the agent robust to a poorly-performing model. The agent worked fine with Claude Sonnet 4.6 — I wanted to see what would break when I threw a much weaker model at it. Here’s what I ran into:
Date math: I asked it to log time for “next Monday.” It miscalculated Monday’s date and added the entry to Tuesday instead. When dates crossed month boundaries, it got worse.
Day-of-week hallucination: Given the ISO date “2026-04-07,” the model confidently identified it as Monday, instead of Tuesday.
Silent substitution: I asked it to log time to a project that didn’t exist. Instead of telling me, the over-eager model quietly logged my time to a completely different, but real project.
These aren’t exotic edge cases — dates and project names are the entire job of a time-tracking agent. Could it have been the model, the quantization, something in llama.cpp, or just that 3B active parameters isn’t enough for this? Hard to say. Probably some combination.
Pushing Logic Out of the LLM
Rather than guessing at fixes, I started by building evals using pydantic_evals — every time the model got something wrong, I’d have Claude Code add a test case for it. LLM evals aren’t new. OpenAI published an evals framework early on, and even a simple, hand-built suite goes a long way. I ended up with 28 cases covering tool selection, date parsing, shortcut resolution, hallucination detection, and project validation. Evals are valuable regardless of model quality. Model life cycles are short; you’re going to be swapping models regularly, and evals let you validate each swap quickly and catch regressions.
With the evals showing me exactly where the model was failing, I systematically moved deterministic work out of the LLM and into Python:
Date parsing: The model picked Tuesday when I said “next Monday.” Instead of hoping it would get date math right, the tool code now parses relative dates in Python. The system prompt also includes a three-week calendar table — last week through next week — so the model just reads dates off the table instead of doing arithmetic.
Project validation: The model silently substituted a real project when I asked for one that didn’t exist. Now the agent builds an index of real Harvest projects at startup. Every tool call validates the project name against this index before hitting the API. Close matches get fuzzy-matched with suggestions (“did you mean Deep Learning?”). The model is also explicitly told: never substitute a project the user didn’t ask for.
Shortcut resolution and hour rounding: Lookup tables and rounding logic in Python, not left to the model.
The pattern behind all of these: the LLM handles intent — understanding what the user wants. Code handles precision — getting the details exactly right. Anything deterministic belongs in code, not in the prompt.
After these changes, Qwen3-Coder-Next passed 100% of my eval cases, showing real improvement. But in real-world usage, it still had rough edges not covered by my (admittedly quick) test suite. In this case, a 100% pass rate meant the test suite wasn’t comprehensive enough yet — not that it’s production-ready.
A Better Model Changes Everything
I tried a few other local options. Gemma 4 E4B (roughly 4B parameters, fits entirely in VRAM) was fast but just wasn’t reliable enough for agent tasks.
Then I loaded Gemma 4 26B-A4B — a Mixture-of-Experts model with about 4B active parameters out of 26B total. It fits on my 12GB GPU by offloading the MoE expert layers to CPU. And it just… worked. The rough edges I’d been fighting with Qwen largely disappeared. Not a frontier model, not even a huge model — but a better fit for this task, still running entirely on my hardware.
I’d spent a lot of time engineering around Qwen’s weaknesses. A better model — still local, still on the same GPU — solved most of those problems without extra effort.
To be fair, the engineering work wasn’t wasted. Moving deterministic logic to code made the agent better for all models, including Claude. That kind of improvement is worth making regardless. But the hours spent debugging model-specific failures? Those were mostly absorbed by a better model.
Practical Takeaways
These are suggestions from my experience, not hard rules — your mileage will vary.
Default to the best model you can use, then scale back with measurement. Don’t start by trying to make a weak model work. Get the product right first, validate the concept, then optimize if needed. In my case, going from a coding-specialized 3B-active model to a general-purpose 4B-active model made a dramatic difference — and both ran locally on the same hardware.
Invest in evals early. Even if you’re using a frontier model, evals give you a regression safety net for when you swap models, update prompts, or change tool implementations. They’re not just for debugging weak models.
Think carefully before fine-tuning a smaller model. I’ve done fine-tuning work on a separate project — training a small model for GitHub issue classification — and the maintenance burden is real. Adding a new label means regenerating your training dataset and retraining. Any requirement change means re-doing that work. For most use cases, I suspect the maintenance cost exceeds the inference savings. It might make sense at industrial scale — a company running chatbots for hundreds of clients, where a stable fine-tuned model is amortized across huge volume. But for most teams building agents, a better base model plus good engineering is probably the more practical path.
Before fine-tuning, consider the alternatives: better prompts, moving more logic to code (as I did here), structured output constraints, or honestly just waiting — small models are getting better fast. By the time you finish a fine-tuning pipeline, the next generation of base models might have closed the gap.
The Bottom Line
The best agent architecture is one that doesn’t depend on the model being brilliant. Push precision into code, let the LLM handle intent, invest in evals, and start with the best model available to you. You can always optimize later — and by the time you need to, there might be a better small model anyway.


