Sophia Stein · Boulder, CO · online
AI Architect.
Local-first by design.
I design and ship agentic systems that run on hardware you own. Local inference, smaller models, deliberate architecture. The benchmark below is real.
Local-First AI
One email a week on local-first AI, agentic architecture, and what is actually working in 2026.
Includes n8n templates and cost teardowns. No drip sequences, no upsells. Reply to opt out.
Start here
I Run a Whole AI Stack on a Laptop for Three Cents an Hour
28.4 tokens per second on a laptop running GLM-4 9B, three cents of electricity per session, and the moment local inference stopped being a hobby.
The 40K Token Wall
Why bigger context windows are not the answer, and what engineers running production systems actually trust in 2026.
When to Run Locally and When to Pay Anthropic
Real numbers, real workloads, real break-even points. When local is the obvious answer, when cloud is, and the hybrid that wins for most teams.
Pre-order · The Sovereign Stack
An interactive e-book on production agents on hardware you own — built to update for free as the local-AI stack changes.
Covers hardware selection, ROCm + Metal setup, model picks across Qwen / Gemma / Kimi / DeepSeek with the benchmark numbers I actually ran, agent harness design with paste-able eval harnesses, and n8n + MCP orchestration patterns from production. Available as PDF, EPUB, and a browsable web version. ~220 pages, ships fall 2026. Pre-order locks in lifetime free updates.
What I do
Architecture
Hands-on projects: agentic system design, local LLM infrastructure, cost teardowns. Corporate trainings + custom course creation for orgs.
OSS Tools
Open-source libraries for AMD APU configuration, ROCm tuning, local inference benchmarking, and the n8n templates that run my own business.
Field Notes
Benchmarks, deep dives, and what is actually working in 2026. Written from the projects, not the marketing decks.
Recent posts
Cost-Quality Pareto for Coding Agents (May 2026)
Paying $5 per task no longer gets you the frontier. Qwen3.6-27B running locally hits 77.2% SWE-bench Verified at ~$0.04/task, within 10pp of Opus 4.7 for ~130x less money.
Building an Eval Harness That Catches Regressions
Anthropic shipped three concurrent regressions over six weeks, and their eval suite caught none of them. Even Anthropic ships blind. Here is the six-layer harness pattern that would have caught them.
MCP Server Audit: Which Ones Actually Work in 2026
Of the ~14,000 MCP servers in PulseMCP's hand-curated index, fewer than 30 are demonstrably production-ready. Here is the list, the criteria, and the failure modes.
Mixing and Matching Open-Weight Models: A Recipe Book
An 84% cost reduction on a real SaaS workload, a 97% reduction on agentic dev loops, and the three-tier mix that actually ships in May 2026.
The Multi-Agent Orchestration Frontier in May 2026
Nine frameworks, one durable-execution wedge, zero unbroken benchmarks. An honest map of the multi-agent ecosystem in May 2026, anchored in Anthropic's 90.2%/15× receipts and Berkeley's eight-benchmark exploit.