April 28, 2026
I Run a Whole AI Stack on a Laptop for Three Cents a Session
28.4 tokens per second on a laptop running GLM-4 9B, three cents of electricity per session, and the moment local inference stopped being a hobby.
TL;DR. Framework 16 with Ryzen AI 9 HX 370, RX 7700S, and 96GB DDR5 runs GLM-4 9B at Q8_0 at 28.4 tokens/sec under ROCm 7.3, with a 90GB GART carve-out feeding the iGPU directly out of system RAM. That number is not a benchmark trophy. It is the threshold where agentic loops stop being a budgeting problem.
The thesis
Multi-agent loops (planner, researcher, critic, executor, with retries) burn 5 to 20 million tokens a day per developer. At Sonnet 4.5 list rates, that is $15 to $60 a day. At GPT-5 reasoning rates, more. The dollar cost is real, but the behavioral cost is what kills the loop. You start trimming retries. You stop letting the critic argue. Quality drops.
96GB of unified memory, with most of it pinned as GART for an iGPU/dGPU pair that ROCm 7.3 finally treats as first-class, removes the budgeting layer. The question stops being "can I afford this retry" and becomes "is the loop converging." That is the only question that matters.
The hardware
Mainboard: Framework 16 (AMD)
SoC: AMD Ryzen AI 9 HX 370 (Strix Point)
dGPU: Radeon RX 7700S (Navi 33, 8GB GDDR6)
Memory: 96GB DDR5-5600 (2 × 48GB SO-DIMM)
Storage: 2TB NVMe (Gen 4)
OS: Ubuntu 26.04 LTS, kernel 6.x
Three things matter. The Strix Point iGPU (Radeon 890M, RDNA 3.5) is the workhorse for sustained inference under 12B params, tied to system RAM. The dGPU handles bursty workloads. 96GB hits the sweet spot: 64GB is too small once you keep a 9B model resident with full context plus VS Code plus a vector DB. 128GB only ships as 2 × 64GB DDR5-5200 on Framework 16 (slower, $400 more) and you will never spend the extra capacity on inference.
Ubuntu 26.04 LTS Resolute Raccoon ships AMDGPU support for Strix Point in the GA tree. No HWE stack, no DKMS gymnastics. ROCm 7.3 builds clean.
The 90GB GART trick
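GART (Graphics Address Remapping Table) is how the iGPU sees system RAM as VRAM. The amdgpu driver exposes that pool as GTT (Graphics Translation Tables), which is where the parameter name below comes from. Default Ubuntu hands the iGPU 4GB. To run a 9B model at Q8_0 (~9.5GB on disk, more with KV cache), you need it bigger.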
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.gttsize=92160"
# then
sudo update-grub && sudo reboot
gttsize is in MiB: 90 × 1024 = 92160, which is 90GB. After reboot, glxinfo -B shows the new GTT size. Leave 6GB unmapped for the OS. This single change does more work than any other line in the stack.
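The sizing arithmetic and a post-reboot check can be sketched in Python. This is a minimal sketch, assuming the amdgpu driver exposes the GTT pool size in bytes at `/sys/class/drm/card*/device/mem_info_gtt_total`; the card index varies by machine, and the function names are illustrative rather than part of any tool.

```python
from pathlib import Path

MIB_PER_GIB = 1024


def gttsize_mib(gib: int) -> int:
    """Convert a GTT size in GiB to the MiB value amdgpu.gttsize expects."""
    return gib * MIB_PER_GIB


def read_gtt_totals(drm_root: str = "/sys/class/drm") -> dict[str, float]:
    """Return {card name: GTT size in GiB} for cards exposing mem_info_gtt_total.

    Assumption: the sysfs file holds a byte count. Returns {} when no such
    file exists (for example on a machine without the amdgpu driver).
    """
    totals = {}
    for f in sorted(Path(drm_root).glob("card*/device/mem_info_gtt_total")):
        totals[f.parent.parent.name] = int(f.read_text().strip()) / 1024**3
    return totals


print(gttsize_mib(90))    # 92160, the value for the kernel command line
print(read_gtt_totals())  # per-card GTT size in GiB, or {} if none found
```

If the reported size does not match the requested one after reboot, the kernel command line is the first thing to recheck (`cat /proc/cmdline`).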
Installing ROCm 7.3
ROCm 7.3 is the first release where Strix Point + RDNA 3.5 is genuinely supported. Earlier 7.x crashed on gfx1150 under load.
sudo apt update
sudo apt install python3-setuptools python3-wheel
wget https://repo.radeon.com/amdgpu-install/7.3/ubuntu/noble/amdgpu-install_7.3.x_all.deb
sudo apt install ./amdgpu-install_7.3.x_all.deb
sudo amdgpu-install -y --usecase=rocm
sudo usermod -aG render,video $USER
Log out and back in so the new group membership applies, then verify:
rocminfo | grep gfx
# expected: gfx1150 (iGPU), gfx1102 (RX 7700S)
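The same check can be scripted so a setup script fails early when a target is missing. A minimal sketch, assuming only that `rocminfo` prints each agent's `gfx` name somewhere in its text output; the helper names are hypothetical.

```python
import re
import subprocess


def gfx_targets(rocminfo_text: str) -> list[str]:
    """Return unique gfx target names in order of first appearance."""
    seen: list[str] = []
    for name in re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_text):
        if name not in seen:
            seen.append(name)
    return seen


def detect_targets() -> list[str]:
    """Run rocminfo and extract targets; returns [] if it cannot be run."""
    try:
        result = subprocess.run(
            ["rocminfo"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return gfx_targets(result.stdout)


found = detect_targets()
missing = [t for t in ("gfx1150", "gfx1102") if t not in found]
print("found:", found, "missing:", missing)
```

An empty `found` list usually means ROCm is not installed or the user is not yet in the `render` and `video` groups.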
Building llama.cpp for HIP
Prebuilt binaries miss the gfx1150 target. Build from source:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1150;gfx1102" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
AMDGPU_TARGETS is the load-bearing line. The HIP compiler emits code for both GPUs in one binary. Switch between them with HIP_VISIBLE_DEVICES. Build takes ~6 minutes.
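Switching devices from a script comes down to setting one environment variable per process. A minimal sketch, assuming index 0 is the iGPU and index 1 is the dGPU on this machine (enumeration order can differ, so check `rocminfo` first); `bench_invocation` is a hypothetical helper, and the flags mirror the `llama-bench` command used in this post.

```python
import os

# Assumption: device 0 is the iGPU and device 1 is the dGPU.
DEVICES = {"igpu": "0", "dgpu": "1"}


def bench_invocation(device: str, model: str) -> tuple[list[str], dict[str, str]]:
    """Build the llama-bench command and environment for one GPU."""
    env = dict(os.environ, HIP_VISIBLE_DEVICES=DEVICES[device])
    cmd = [
        "./build/bin/llama-bench",
        "-m", model,
        "-p", "512", "-n", "256", "-t", "8", "-ngl", "99",
    ]
    return cmd, env


cmd, env = bench_invocation("igpu", "models/glm-4-9b-chat.Q8_0.gguf")
print("HIP_VISIBLE_DEVICES=" + env["HIP_VISIBLE_DEVICES"], " ".join(cmd))
```

Passing the pair to `subprocess.run(cmd, env=env, check=True)` runs the benchmark; because the variable is set per process, two runs can target different GPUs without rebuilding.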
The benchmark
GLM-4 9B Q8_0. The best open model under 10B for tool-calling and structured output at the time I ran this. Pinned to the iGPU:
HIP_VISIBLE_DEVICES=0 \
./build/bin/llama-bench \
-m models/glm-4-9b-chat.Q8_0.gguf \
-p 512 -n 256 -t 8 -ngl 99
Result: 28.4 t/s sustained (tg256, ±0.31). Prompt processing (pp512) hits ~340 t/s, which matters because most agentic turns are prompt-heavy and generation-light.
| Hardware | Token generation (t/s) | Cost |
|---|---|---|
| Strix Point iGPU (this rig) | 28.4 | $0 (owned) |
| RX 7700S dGPU (8GB) | ~35 | (offload only) |
| RTX 4070 (12GB) | ~52 | $600+ card |
| Mac Studio M3 Ultra (96GB) | ~42 | $4,800 |
| Cloud H100 (per hour) | 180+ | $2.49/hr |
The H100 is faster per second. Not faster per dollar. Not faster at 3am when the daily token budget is gone.
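The pp512 and tg256 rates translate into per-turn latency with simple division. A rough sketch, assuming both rates stay constant regardless of context length and that no prompt caching is in play (neither holds exactly in practice); the 4,000-token prompt and 300-token reply are hypothetical turn sizes.

```python
PP_TPS = 340.0  # prompt processing rate from the pp512 result
TG_TPS = 28.4   # token generation rate from the tg256 result


def turn_seconds(prompt_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (prefill seconds, decode seconds) for a single agent turn."""
    return prompt_tokens / PP_TPS, output_tokens / TG_TPS


prefill, decode = turn_seconds(4000, 300)  # hypothetical turn
print(f"prefill {prefill:.1f}s, decode {decode:.1f}s")  # prefill 11.8s, decode 10.6s
```

With these numbers, a prompt more than ten times longer than the reply costs about the same wall-clock time as the reply itself, which is why the prompt processing rate matters for prompt-heavy turns.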
The economic case
A real agentic dev loop from last week:
- Loop: planner, 3 parallel researchers, critic, executor, verifier
- Tokens per loop: 180K input, 12K output
- Loops per hour: 8 to 12 during active dev
- Session: 4 hours
Total at 9 loops per hour (36 loops): ~6.5M input, 432K output per session. On Sonnet 4.5: ~$26. On Opus: more. Per developer per day.
On this laptop, the same session is electricity. ~65W under load × 4 hours × $0.13/kWh = $0.034. Three cents. The agent runs all night because there is no reason it shouldn't.
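Both sides of that comparison can be reproduced in a few lines. This is a sketch under stated assumptions: 9 loops per hour (inside the 8 to 12 range, and the value that reproduces the session totals), and API prices of $3 per million input tokens and $15 per million output tokens, which are assumed here because the post does not state them.

```python
LOOPS_PER_HOUR = 9         # assumption: mid-range value that matches the totals
SESSION_HOURS = 4
INPUT_PER_LOOP = 180_000
OUTPUT_PER_LOOP = 12_000

# Assumed API prices in USD per million tokens (not stated in the post).
INPUT_USD_PER_M = 3.00
OUTPUT_USD_PER_M = 15.00


def api_cost() -> float:
    """Metered cost of one session at the assumed prices."""
    loops = LOOPS_PER_HOUR * SESSION_HOURS
    input_m = loops * INPUT_PER_LOOP / 1e6
    output_m = loops * OUTPUT_PER_LOOP / 1e6
    return input_m * INPUT_USD_PER_M + output_m * OUTPUT_USD_PER_M


def electricity_cost(watts: float = 65, hours: float = 4, usd_per_kwh: float = 0.13) -> float:
    """Energy cost of one session: kW x hours x price per kWh."""
    return watts / 1000 * hours * usd_per_kwh


print(f"API: ${api_cost():.2f}")                  # API: $25.92
print(f"Electricity: ${electricity_cost():.3f}")  # Electricity: $0.034
```

Changing `LOOPS_PER_HOUR` to 8 or 12 moves the API figure to roughly $23 or $35, while the electricity figure stays the same.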
What it actually unlocks
The interesting consequence is not saved tokens. It is the regime change in how loops get designed:
- Parallel fan-out becomes free. Five local agents finish in parallel. Five cloud agents finish when the rate limiter says so.
- The critic gets to be expensive. Three turns of the critic arguing with the executor is the largest quality unlock in agentic design. It is paywalled by token cost on metered APIs and free at home.
- Loops can run unattended. A research agent that wakes every 15 minutes is a $300/month subscription on cloud APIs and the same as leaving Slack open at home.
- Experiments stop being budget meetings. "What if the planner ran twice and we picked the better plan" becomes a one-line config change.
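The "run the planner twice and pick the better plan" experiment is a small best-of-N wrapper. A minimal sketch, assuming the planner and the scorer are plain callables; `toy_planner` and `toy_score` are stand-ins for real model calls, not anything from this stack.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def best_of_n(
    plan: Callable[[str, int], str],
    score: Callable[[str], float],
    task: str,
    n: int = 2,
) -> str:
    """Run the planner n times in parallel and keep the highest-scoring plan."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda seed: plan(task, seed), range(n)))
    return max(candidates, key=score)


# Hypothetical stand-ins for a local model call and a critic score.
def toy_planner(task: str, seed: int) -> str:
    return f"{task}: " + ", ".join(f"step {i + 1}" for i in range(seed + 1))


def toy_score(plan: str) -> float:
    return float(plan.count("step"))


print(best_of_n(toy_planner, toy_score, "refactor", n=2))
```

Here `n` is the one-line config change: on a metered API it multiplies the planner bill, while locally it only costs wall-clock time.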
The catch
96GB on Strix Point does not give you GPT-5. The model ceiling here is 7B to 14B at Q5 to Q8, comfortably. A 27B dense model (Gemma 4, Qwen 3.6) fits at Q4_K_M with effort. A 1T MoE does not fit, period.
The right pattern is hybrid: local for the 95% of agentic turns that route, summarize, draft, call tools, and critique; frontier API for the 5% that genuinely need the smartest model in the world. A well-designed loop calls the frontier once per session, not eighty times.
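The hybrid pattern can be expressed as a small router with a per-session frontier budget. A minimal sketch, assuming each turn carries a task-kind label; the kind names, the `Router` class, and the budget of one frontier call are illustrative choices, not a prescribed design.

```python
from dataclasses import dataclass

# Turn kinds handled locally, taken from the list in the paragraph above.
LOCAL_KINDS = {"route", "summarize", "draft", "tool_call", "critique"}


@dataclass
class Router:
    frontier_budget: int = 1  # frontier calls allowed per session
    frontier_used: int = 0

    def pick(self, kind: str) -> str:
        """Return 'local' or 'frontier' for a turn of the given kind."""
        if kind in LOCAL_KINDS:
            return "local"
        if self.frontier_used < self.frontier_budget:
            self.frontier_used += 1
            return "frontier"
        return "local"  # budget spent: fall back to the local model


router = Router()
for kind in ("summarize", "hard_reasoning", "critique", "hard_reasoning"):
    print(kind, "->", router.pick(kind))
```

Falling back to local when the budget runs out keeps the loop moving; a stricter design could raise an error instead so the overrun is visible.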
What's next
Strix Halo (Ryzen AI Max+ 395) ships with up to 128GB of unified memory in mini-PC form factors. ROCm 8.0 is rumored for Q3 2026 with first-class XDNA 2 NPU offload for attention matmul. The thesis only gets stronger as open models catch up to closed ones at sub-12B sizes.
The line between "local inference is a hobby" and "local inference is the production substrate for my agentic stack" gets crossed somewhere around 96GB of fast unified memory paired with ROCm 7.3-class drivers. That line was crossed in April 2026. If you are still architecting your agents around token budgets in 2026, you are paying a tax that the giants are charging because they can.