Local-First AI

Deep dive. Tags: coding-agents, benchmarks, cost, swe-bench, claude-code, qwen, deepseek

Paying $5 per task no longer gets you the frontier. Qwen3.6-27B running locally hits 77.2% SWE-bench Verified at ~$0.04/task, within 10pp of Opus 4.7 for ~130x less money.

Building an Eval Harness That Catches Regressions

Deep dive. Tags: evals, agents, production, testing, anthropic, shadow-testing

Anthropic shipped three concurrent regressions over six weeks, and their eval suite caught none of them. Even Anthropic ships blind. Here is the six-layer harness pattern that would have caught them.

MCP Server Audit: Which Ones Actually Work in 2026

Deep dive. Tags: mcp, agents, security, production, oauth, review

Of the ~14,000 MCP servers in PulseMCP's hand-curated index, fewer than 30 are demonstrably production-ready. Here is the list, the criteria, and the failure modes.

Mixing and Matching Open-Weight Models: A Recipe Book

Deep dive. Tags: local-models, cost, open-weights, agents, qwen, kimi, deepseek

An 84% cost reduction on a real SaaS workload, a 97% reduction on agentic dev loops, and the three-tier mix that actually ships in May 2026.

The Multi-Agent Orchestration Frontier in May 2026

Deep dive. Tags: agents, frameworks, production, langgraph, crewai

Nine frameworks, one durable-execution wedge, zero unbroken benchmarks. An honest map of the multi-agent ecosystem in May 2026, anchored in Anthropic's 90.2%/15× receipts and Berkeley's eight-benchmark exploit.

Self-Hosting the Full AI Stack on $5/mo Hetzner

Deep dive. Tags: self-hosting, hetzner, listmonk, newsletter, sovereign-compute, infrastructure

$9/mo of Hetzner replaces $80 to $150/mo of SaaS. Listmonk, Cal.com, Umami, and Uptime Kuma on a single CX22, with the deliverability playbook nobody writes down.

Field notes

AI Agent Conference NYC: 10 Takeaways

Tags: agents, production, conferences, hitl, evals

The 10 things from AI Agent Conference 2026 NYC (May 4-5, NY Hilton Midtown) that are actually load-bearing if you ship agents in 2026. The trust paradox, CrewAI's 42% AI-authored code, the iceberg under every project, AX as the new UX, and what the panels from Datadog, LanceDB, Carta, and the Codex/Linear/Graphite room actually said.

AI Dev SF: 10 Takeaways

Tags: agents, production, conferences, evals, security

The 10 things from AI Dev 26 SF (April 28-29, Pier 48) that are actually load-bearing if you build agentic systems in 2026. Marc Brooker on defects, Andrew Ng on PM bottlenecks, Bain's 8-subgraph payroll system, the 4-legged identity, hybrid doc OCR, and the simulation sandbox every action-taking agent needs.

Boulder's AI Frontier

Tags: boulder, community, ai-agents, startup-week

What is actually shipping out of the Boulder AI scene in May 2026, what is on the schedule for Boulder Startup Week, and why the Give First culture matters for agentic builders.

Three Ways to Run Open Weights for Pennies

Tags: local-models, open-weights, cloud, benchmarks, cost

Local hardware vs rented GPUs vs serverless OSS APIs. Real prices, real benchmarks, and the workload-shape question that decides which path is right.

Coding Agent Infrastructure in Production

Tags: coding-agents, infrastructure, codex, linear, graphite

Codex, Linear, and Graphite shared the stage at AI Agent Conference NYC on what scales coding agents past the demo. The infrastructure underneath is the actual work.

Shadow Testing: From 70% to 98% in Four Weeks

Tags: agents, production, testing, evals

The single highest-leverage decision when shipping mission-critical autonomous agents. Production is the only truth.

Swarm vs Monolith

Tags: agents, architecture, patterns, production

Why five specialized $0.01 agents beat one $0.50 god model, and what the multi-agent crowd gets wrong about it.

Notes on Human-in-the-Loop