Run local LLMs with llama.cpp¶
llama.cpp runs GGUF models directly on your machine with no background daemon. On Apple Silicon its Metal backend gives you full GPU offload from a single binary. This page sets up two things:
- A local embedding server for Hadron's RAG vector index — a
free, offline dev backend for
mode:vector/mode:hybridretrieval, with no SageMaker endpoint to stand up or pay for. This is the Hadron-specific part, and it's a drop-in for the Ollama dev path. - Local chat models — any GGUF served over an OpenAI-compatible
API on
localhost, with one-line start/stop scripts.
If you only want the Hadron embedding backend, skip to Step 5. The chat steps establish the same start/stop pattern the embedding server uses, so they're worth a skim either way.
This page is the dev / self-hosted companion to Configure AWS SageMaker for vector embeddings (the production backend). Both feed the same feature — see RAG vector index for the operational reference and The RAG tech stack for why Hadron self-hosts the embedding model at all.
Why llama.cpp instead of Ollama
Both serve the same nomic-embed-text-v1.5 weights over the same
http backend, so retrieval quality is identical. The difference
is operational: llama.cpp is not a daemon. There's no
always-on background service, nothing autostarts at login, and RAM
is freed the instant the process exits — versus Ollama's
background service and OLLAMA_KEEP_ALIVE idle-unload. You also
get direct control over Metal flags (-ngl, context size, batch).
Ollama is still the lowest-friction zero-config path; reach for
llama.cpp when you'd rather not run a background service.
Scope¶
Written and verified on macOS Apple Silicon (Metal) — llama.cpp
b9700, M4 Pro / 48 GB. The Linux/CUDA install and the
wired-memory step
differ; the env wiring in Step 5
is identical on any platform.
Prerequisites¶
- macOS on Apple Silicon (M1–M4). Metal acceleration is automatic in the Homebrew build — no flags needed at install time.
- Homebrew.
- For the Hadron embedding path (Step 5):
write access to the Hadron server's environment (
.env, Doppler, or whatever the deploy uses) and a server you can restart. - Disk and RAM headroom. GGUF chat models run from a few GB to ~40 GB depending on parameter count and quantization. The model set in the scripts below is sized for ~48 GB unified memory — treat the names and quants as examples and pick a quant that fits your RAM from each model's GGUF repo.
Step 1 — Install llama.cpp and a model downloader¶
Install the Metal build of llama.cpp and the Hugging Face CLI (used to pull GGUF files):
The Homebrew formula links Metal, so -ngl 99 (offload all layers to
the GPU) works out of the box. The second command installs the hf
downloader via uv — brew install uv
first if you don't have it.
Verify both are on your PATH:
Linux / CUDA
On Linux you'll build or install llama.cpp against CUDA instead of
Metal, and the Metal wired-memory step
doesn't apply. Everything in Step 5
(the Hadron .env wiring) is platform-independent.
Step 2 — Download GGUF models into ~/models¶
The scripts below keep one directory per model under ~/models/.
The embedding model is required for the Hadron path — download it now:
hf download nomic-ai/nomic-embed-text-v1.5-GGUF \
nomic-embed-text-v1.5.f16.gguf \
--local-dir ~/models/nomic-embed
For chat models, pick a repo and a quant from the model's GGUF
card on the Hub and download the .gguf into its own directory:
Match the directory and filename to the cases in serve.sh
(Step 3), or edit the script's paths to
match what you downloaded.
Split GGUFs
Large quants are sometimes published as multiple
*-00001-of-0000N.gguf parts. Download all parts into the same
directory and point -m at the first part — llama.cpp loads the
rest automatically.
Step 3 — Run a chat model¶
Drop this serve.sh into ~/models/ and chmod +x ~/models/serve.sh.
It launches any model in the set over llama.cpp's built-in web UI
and an OpenAI-compatible API on the same port.
#!/usr/bin/env bash
# llama.cpp chat-model launcher for the GGUF set in ~/models.
# Usage: ~/models/serve.sh <model> [port]
# models: coder qwen3 llama70 gemma12 gemma26 mistral
# Web UI: http://127.0.0.1:<port> API: http://127.0.0.1:<port>/v1 Ctrl-C to stop.
set -euo pipefail
export PATH="/opt/homebrew/bin:$PATH"
M="${HOME}/models"
name="${1:-}"; port="${2:-8080}"
case "$name" in
coder) model="$M/qwen2.5-coder-32b/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf"; ctx=16384 ;;
qwen3) model="$M/qwen3-30b-a3b/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf"; ctx=32768 ;;
gemma12) model="$M/gemma-4-12b/gemma-4-12B-it-qat-UD-Q4_K_XL.gguf"; ctx=16384 ;;
gemma26) model="$M/gemma-4-26b/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf"; ctx=16384 ;;
mistral) model="$M/mistral-small-24b/Mistral-Small-Instruct-2409-Q4_K_M.gguf"; ctx=16384 ;;
llama70)
model="$M/llama-3.3-70b/Llama-3.3-70B-Instruct-IQ4_XS.gguf"; ctx=8192
# 70B (~38 GB) exceeds the default Metal wired-memory limit (~36 GB on 48 GB).
cur=$(sysctl -n iogpu.wired_limit_mb 2>/dev/null || echo 0)
if [ "${cur:-0}" -lt 44000 ]; then
echo ">> Raising Metal wired-memory limit to 44 GB (needs sudo; resets on reboot)…"
sudo sysctl -w iogpu.wired_limit_mb=44000
fi ;;
*) echo "Usage: $0 <coder|qwen3|llama70|gemma12|gemma26|mistral> [port]"; exit 1 ;;
esac
[ -f "$model" ] || { echo "Model not found (still downloading?): $model"; exit 1; }
echo ">> Serving '$name' ctx=$ctx -> http://127.0.0.1:$port (web UI + /v1 API)"
exec llama-server -m "$model" -ngl 99 -c "$ctx" --jinja --host 127.0.0.1 --port "$port"
Run it:
~/models/serve.sh coder # http://127.0.0.1:8080
~/models/serve.sh qwen3 8090 # second model on a different port
The flags that matter: -ngl 99 offloads all layers to Metal, -c
sets the context window, and --jinja makes llama-server apply the
model's own chat template. Open the web UI at the printed URL, or
point any OpenAI-compatible client at http://127.0.0.1:<port>/v1.
The model list is an example set sized for ~48 GB. Edit the case
paths to match the GGUFs you downloaded in
Step 2.
Local chat models don't wire into the agent's LLM provider
The OpenAI-compatible endpoint here is for your own dev / offline use (coding assistants, OpenAI-client testing). It is not a backend for a Hadron agent's chat — that provider list (Configure your LLM provider) is hosted APIs (Anthropic, OpenAI, GLM, Bedrock), with no custom base-URL field. Only the embedding path (Step 5) plugs into Hadron.
Large models and the Metal wired-memory limit¶
macOS caps how much unified memory the GPU may "wire" (hold
non-pageable) — roughly 36 GB on a 48 GB machine. A 70B model at
IQ4_XS (~38 GB) exceeds that and fails to load until you raise the
limit. The llama70 case in serve.sh does this for you; to do it
by hand:
This is macOS-only, needs sudo, and resets on reboot. Size
it below your total RAM with headroom for the OS — 44 GB on a 48 GB
machine. Models that fit under the default limit (everything smaller
than ~36 GB) need no change.
Step 4 — Stop a server and free RAM¶
Ctrl-C in the serving terminal stops a server. To kill any
llama.cpp server from anywhere (chat on :8080, embeddings on
:8081, or both), drop this stop.sh into ~/models/:
#!/usr/bin/env bash
# Stop all running llama.cpp servers (chat :8080 and/or embeddings :8081), freeing RAM.
if pkill -f llama-server; then
echo "stopped llama.cpp server(s) — RAM freed"
else
echo "no llama.cpp server was running"
fi
Because llama.cpp is not a daemon, the RAM is released the instant
the process exits — there's no background service holding weights in
memory, and nothing relaunches at login. This is the practical
difference from Ollama, which runs a persistent service and keeps
models resident until OLLAMA_KEEP_ALIVE elapses.
The no-daemon tradeoff for Hadron
The flip side: you must start the embedding server yourself
before mode:vector / mode:hybrid retrieval works. If
serve-embed.sh isn't running, Hadron can't reach the endpoint
and h-find-nodes returns
reason: "embedding_unavailable"
(on hybrid, the keyword half still runs). Start it again and
retrieval recovers on the next search.
Step 5 — Serve embeddings for Hadron's RAG layer¶
This is the Hadron-specific payoff: a local nomic-embed-text-v1.5
endpoint that Hadron's embedding worker calls instead of SageMaker.
Drop this serve-embed.sh into ~/models/, chmod +x it, and run
it. It serves the embedding model on :8081:
#!/usr/bin/env bash
# nomic-embed-text v1.5 embedding server for Hadron's RAG layer (768-dim).
# Point Hadron at it: EMBEDDING_API_URL=http://127.0.0.1:8081/v1/embeddings
set -euo pipefail
export PATH="/opt/homebrew/bin:$PATH"
exec llama-server \
-m "$HOME/models/nomic-embed/nomic-embed-text-v1.5.f16.gguf" \
--embedding --pooling mean -ngl 99 \
-c 8192 -b 8192 --ubatch-size 8192 \
--host 127.0.0.1 --port 8081
Why this is a drop-in for Hadron's default backend:
- Response shape. Hadron's HTTP embedding backend
(
EMBEDDING_BACKEND=http, the default) sends{ model, input }and already accepts the OpenAI{ data: [{ embedding }] }response, so llama-server's/v1/embeddingsneeds no adapter. - Task prefixes. nomic needs
search_document:/search_query:prefixes. Hadron's client applies these itself, so no endpoint-side templating is required. - Vector compatibility.
--pooling meanwith the default--embd-normalize 2(L2) matches Ollama's normalized output. Same model, same normalization — so existing dev-DB vectors created against Ollama don't need re-embedding when you switch to llama.cpp.
5a — Wire the Hadron .env¶
Point the Hadron server at the local endpoint and restart it so the vars take effect:
EMBEDDING_BACKEND=http
EMBEDDING_API_URL=http://127.0.0.1:8081/v1/embeddings # llama-server --embedding
EMBEDDING_MODEL=nomic-embed-text # sent verbatim; llama-server ignores it
EMBEDDING_DIM=768 # MUST equal the pgvector vector(N) column
# EMBEDDING_API_KEY= # optional
| Env var | Required | Notes |
|---|---|---|
EMBEDDING_BACKEND |
no | Defaults to http. The HTTP client is what speaks to llama-server. |
EMBEDDING_API_URL |
yes | The llama-server /v1/embeddings URL. Unset means every mode:vector query returns reason: "embedding_unavailable". |
EMBEDDING_MODEL |
no | Stored per-vector for lineage. Sent in the request body but llama-server ignores it (it serves whatever -m loaded). Keep it nomic-embed-text to match the Ollama lineage label. |
EMBEDDING_DIM |
no | Defaults to 768. Must equal the pgvector vector(N) column dimension, or embeds are rejected. nomic-embed-text-v1.5 is 768-dim. |
EMBEDDING_API_KEY |
no | Only if you front the endpoint with auth. A bare local llama-server needs none. |
See RAG vector index — Operator configuration for the full env table.
5b — Verify¶
1. The endpoint returns 768-dim vectors. With serve-embed.sh
running:
curl -s http://127.0.0.1:8081/v1/embeddings \
-H 'content-type: application/json' \
-d '{"input": "search_document: hello world"}' \
| python3 -c 'import sys, json; print(len(json.load(sys.stdin)["data"][0]["embedding"]), "dims")'
Expect 768 dims. (Have jq? … | jq '.data[0].embedding | length'.)
2. A node embeds. Enable the index on a test memory, write a node
with an abstract, and confirm the worker drains — the node's
embeddingPendingAt clears on success:
mutation {
updateMemory(id: "<memoryId>", vectorIndexEnabled: true, embeddingSource: abstract) {
id
vectorIndexEnabled
}
}
(For an encrypted memory, also pass acknowledgeVectorInversionRisk: true —
see Encrypted memories.)
3. Vector search returns hits. Call h-find-nodes with
mode: vector and a natural-language query. A non-empty result means
the worker reached llama-server, embedded successfully, and the
vectors landed in pgvector. An empty result with
reason: "embedding_unavailable" means the worker couldn't reach the
endpoint — confirm serve-embed.sh is running and the .env URL
matches.
This mirrors the SageMaker verification checklist — the Hadron-side checks are backend-agnostic.
Talking to Hadron from your local setup¶
This page wires one thing between llama.cpp and Hadron: the embedding backend. Two other things people reach for, and where each belongs:
-
Read/write memory from the terminal or scripts → the hadron CLI. It talks to Hadron directly over GraphQL with no model in the loop:
hadron node get/add,hadron memory ls,hadron spec find(hybrid search), or the rawhadron apiescape hatch — all with--jsonand stable exit codes. This is the right companion to a local/offline setup; see the CLI reference. -
Let a model use Hadron's memory tools → MCP, in a capable agent client (Claude Code, Claude Desktop, Cursor, OpenCode). Hadron's MCP surface (URN grammar, scoped read/write tools) needs a strong tool-caller — small local models drive it unreliably, so pointing a local chat model at Hadron's MCP isn't a setup we recommend.
Experimental: MCP in the llama.cpp web UI
llama-server's web UI has an experimental MCP proxy (--ui-mcp-proxy)
you could aim at Hadron's MCP endpoint. Upstream flags it "do not enable
in untrusted environments," and tool-calling quality is bounded by the
local model — treat it as tinkering, not a supported path. For real work
against Hadron, use the CLI above or a full agent client.
Reverting to Ollama or SageMaker¶
The backend is selected by env, so switching back is a config change,
not a code change. Stop the local server first
(~/models/stop.sh) to free its RAM, then update the Hadron .env:
- To Ollama —
EMBEDDING_BACKEND=http,EMBEDDING_API_URL=http://localhost:11434/api/embed,EMBEDDING_MODEL=nomic-embed-text. Ollama and llama.cpp both servenomic-embed-text-v1.5with L2-normalized output, so dev vectors interoperate — no re-embedding needed. - To SageMaker —
EMBEDDING_BACKEND=sagemakerplus the endpoint vars (see Configure AWS SageMaker for vector embeddings). If your SageMaker endpoint hosts a different model — that how-to deploysmodernbert-embed-base, not nomic-embed-text-v1.5 — then switching to or from it is a model change, and the memory needs a re-index (togglevectorIndexEnabledoff and on, or changeembeddingSource, to re-enqueue) before rankings are correct.
Restart the Hadron server after any change.
See also¶
- RAG vector index — operational
reference: every embedding env var, search modes, the
reason/degradedflags this page references. - The RAG tech stack — why Hadron
self-hosts
nomic-embed-text-v1.5, and thehttpbackend both Ollama and llama.cpp ride. - Configure AWS SageMaker for vector embeddings — the production backend this page is the dev companion to.
- Configure your LLM provider — the separate, agent-facing chat-provider config (hosted APIs; not the local chat server above).
- Install the hadron CLI — the terminal / scripting interface to Hadron, no model in the loop; the local companion this page recommends for reading and writing memory.
- Add the Hadron MCP server to Claude Code — connect a capable agent client to Hadron's memory tools over MCP.
- llama.cpp and the llama-server README — upstream project and the full server flag reference.
nomic-ai/nomic-embed-text-v1.5-GGUF— the embedding modelserve-embed.shloads.- Hugging Face CLI
— the
hf downloadtool from Step 1.