blog

Local-First AI

Deep dives and field notes on local-first AI, agentic architecture, and what is actually working in 2026, with primary sources and reproducible benchmarks.


Showing 15 of 38 posts in Production

Deep dives

Long-form research articles with primary sources, benchmarks, and reference tables.

Building an Eval Harness That Catches Regressions

May 7, 2026

Anthropic shipped three concurrent regressions over six weeks and their eval suite caught none of them. Even Anthropic ships blind. Here is the six-layer harness pattern that would have caught them.

Deep dive, evals, agents, production, testing, anthropic, shadow-testing

MCP Server Audit: Which Ones Actually Work in 2026

May 7, 2026

Of the ~14,000 MCP servers in PulseMCP's hand-curated index, fewer than 30 are demonstrably production-ready. Here is the list, the criteria, and the failure modes.

Deep dive, mcp, agents, security, production, oauth, review

The Multi-Agent Orchestration Frontier in May 2026

May 7, 2026

Nine frameworks, one durable-execution wedge, zero unbroken benchmarks. An honest map of the multi-agent ecosystem in May 2026, anchored in Anthropic's 90.2%/15× receipts and Berkeley's eight-benchmark exploit.

Deep dive, agents, frameworks, production, langgraph, crewai

Self-Hosting the Full AI Stack on $5/mo Hetzner

May 7, 2026

$9/mo of Hetzner replaces $80 to $150/mo of SaaS. Listmonk, Cal.com, Umami, and Uptime Kuma on a single CX22, with the deliverability playbook nobody writes down.

Deep dive, self-hosting, hetzner, listmonk, newsletter, sovereign-compute, infrastructure

Field notes

AI Agent Conference NYC: 10 Takeaways

May 5, 2026

The 10 things from AI Agent Conference 2026 NYC (May 4-5, NY Hilton Midtown) that are actually load-bearing if you ship agents in 2026. The trust paradox, CrewAI's 42% AI-authored code, the iceberg under every project, AX as the new UX, and what the panels from Datadog, LanceDB, Carta, and the Codex/Linear/Graphite room actually said.

agents, production, conferences, hitl, evals

AI Dev SF: 10 Takeaways

May 5, 2026

The 10 things from AI Dev 26 SF (April 28-29, Pier 48) that are actually load-bearing if you build agentic systems in 2026. Marc Brooker on defects, Andrew Ng on PM bottlenecks, Bain's 8-subgraph payroll system, the 4-legged identity, hybrid doc OCR, and the simulation sandbox every action-taking agent needs.

agents, production, conferences, evals, security

Coding Agent Infrastructure in Production

May 4, 2026

Codex, Linear, and Graphite shared the stage at AI Agent Conference NYC on what scales coding agents past the demo. The infrastructure underneath is the actual work.

coding-agents, infrastructure, codex, linear, graphite

Shadow Testing: From 70% to 98% in Four Weeks

May 4, 2026

The single highest-leverage decision when shipping mission-critical autonomous agents. Production is the only truth.

agents, production, testing, evals

Swarm vs Monolith

May 4, 2026

Why five specialized $0.01 agents beat one $0.50 god model, and what the multi-agent crowd gets wrong about it.

agents, architecture, patterns, production

Notes on Human-in-the-Loop

May 4, 2026

Why human-in-the-loop is the only ethical and profitable way to scale agentic AI in a world of bot fatigue.

hitl, agents, ethics, production

Below the Waterline: What Decides Whether Your Agent Ships

May 3, 2026

The hidden engineering that decides whether your agent makes it to production. The 65/95 gap and the three foundations underneath it.

agents, production, infrastructure, evals

Document OCR for Agentic Workflows

May 3, 2026

90% of enterprise data is locked in PDFs. The 2026 pipeline that gets it out is not RAG, not vision-only, and not the OCR you remember from 2018.

ocr, agents, documents, vlm, llamaindex

The 40K Token Wall

Apr 30, 2026

Why bigger context windows are not the answer, and what production-tuned engineers actually trust in 2026.

context-engineering, agents, rag, evals

Simulation Sandboxes for Agents

Apr 30, 2026

The fastest 2026 teams are testing autonomous agents in synthetic enterprise environments before any customer is exposed. Includes the case for it and the open-source pieces to build one.

agents, testing, simulation, evals, production

Self-Healing CI/CD Patterns

Apr 23, 2026

Agentic loops that detect, diagnose, and fix deployment errors before you see the notification. With the workflow that actually works in 2026.

devops, ci-cd, agents, production, observability
