Skip to content

Run local LLMs with llama.cpp

llama.cpp runs GGUF models directly on your machine with no background daemon. On Apple Silicon its Metal backend gives you full GPU offload from a single binary. This page sets up two things:

  1. A local embedding server for Hadron's RAG vector index — a free, offline dev backend for mode:vector / mode:hybrid retrieval, with no SageMaker endpoint to stand up or pay for. This is the Hadron-specific part, and it's a drop-in for the Ollama dev path.
  2. Local chat models — any GGUF served over an OpenAI-compatible API on localhost, with one-line start/stop scripts.

If you only want the Hadron embedding backend, skip to Step 5. The chat steps establish the same start/stop pattern the embedding server uses, so they're worth a skim either way.

This page is the dev / self-hosted companion to Configure AWS SageMaker for vector embeddings (the production backend). Both feed the same feature — see RAG vector index for the operational reference and The RAG tech stack for why Hadron self-hosts the embedding model at all.

Why llama.cpp instead of Ollama

Both serve the same nomic-embed-text-v1.5 weights over the same http backend, so retrieval quality is identical. The difference is operational: llama.cpp is not a daemon. There's no always-on background service, nothing autostarts at login, and RAM is freed the instant the process exits — versus Ollama's background service and OLLAMA_KEEP_ALIVE idle-unload. You also get direct control over Metal flags (-ngl, context size, batch). Ollama is still the lowest-friction zero-config path; reach for llama.cpp when you'd rather not run a background service.

Scope

Written and verified on macOS Apple Silicon (Metal) — llama.cpp b9700, M4 Pro / 48 GB. The Linux/CUDA install and the wired-memory step differ; the env wiring in Step 5 is identical on any platform.

Prerequisites

  • macOS on Apple Silicon (M1–M4). Metal acceleration is automatic in the Homebrew build — no flags needed at install time.
  • Homebrew.
  • For the Hadron embedding path (Step 5): write access to the Hadron server's environment (.env, Doppler, or whatever the deploy uses) and a server you can restart.
  • Disk and RAM headroom. GGUF chat models run from a few GB to ~40 GB depending on parameter count and quantization. The model set in the scripts below is sized for ~48 GB unified memory — treat the names and quants as examples and pick a quant that fits your RAM from each model's GGUF repo.

Step 1 — Install llama.cpp and a model downloader

Install the Metal build of llama.cpp and the Hugging Face CLI (used to pull GGUF files):

brew install llama.cpp
uv tool install "huggingface_hub[cli]"

The Homebrew formula links Metal, so -ngl 99 (offload all layers to the GPU) works out of the box. The second command installs the hf downloader via uvbrew install uv first if you don't have it.

Verify both are on your PATH:

llama-server --version
hf version

Linux / CUDA

On Linux you'll build or install llama.cpp against CUDA instead of Metal, and the Metal wired-memory step doesn't apply. Everything in Step 5 (the Hadron .env wiring) is platform-independent.

Step 2 — Download GGUF models into ~/models

The scripts below keep one directory per model under ~/models/.

The embedding model is required for the Hadron path — download it now:

hf download nomic-ai/nomic-embed-text-v1.5-GGUF \
  nomic-embed-text-v1.5.f16.gguf \
  --local-dir ~/models/nomic-embed

For chat models, pick a repo and a quant from the model's GGUF card on the Hub and download the .gguf into its own directory:

hf download <org>/<model>-GGUF <file>.gguf --local-dir ~/models/<dir>

Match the directory and filename to the cases in serve.sh (Step 3), or edit the script's paths to match what you downloaded.

Split GGUFs

Large quants are sometimes published as multiple *-00001-of-0000N.gguf parts. Download all parts into the same directory and point -m at the first part — llama.cpp loads the rest automatically.

Step 3 — Run a chat model

Drop this serve.sh into ~/models/ and chmod +x ~/models/serve.sh. It launches any model in the set over llama.cpp's built-in web UI and an OpenAI-compatible API on the same port.

#!/usr/bin/env bash
# llama.cpp chat-model launcher for the GGUF set in ~/models.
#   Usage:  ~/models/serve.sh <model> [port]
#   models: coder qwen3 llama70 gemma12 gemma26 mistral
# Web UI: http://127.0.0.1:<port>   API: http://127.0.0.1:<port>/v1   Ctrl-C to stop.
set -euo pipefail
export PATH="/opt/homebrew/bin:$PATH"
M="${HOME}/models"
name="${1:-}"; port="${2:-8080}"

case "$name" in
  coder)    model="$M/qwen2.5-coder-32b/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf";  ctx=16384 ;;
  qwen3)    model="$M/qwen3-30b-a3b/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf";      ctx=32768 ;;
  gemma12)  model="$M/gemma-4-12b/gemma-4-12B-it-qat-UD-Q4_K_XL.gguf";             ctx=16384 ;;
  gemma26)  model="$M/gemma-4-26b/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf";         ctx=16384 ;;
  mistral)  model="$M/mistral-small-24b/Mistral-Small-Instruct-2409-Q4_K_M.gguf"; ctx=16384 ;;
  llama70)
    model="$M/llama-3.3-70b/Llama-3.3-70B-Instruct-IQ4_XS.gguf"; ctx=8192
    # 70B (~38 GB) exceeds the default Metal wired-memory limit (~36 GB on 48 GB).
    cur=$(sysctl -n iogpu.wired_limit_mb 2>/dev/null || echo 0)
    if [ "${cur:-0}" -lt 44000 ]; then
      echo ">> Raising Metal wired-memory limit to 44 GB (needs sudo; resets on reboot)…"
      sudo sysctl -w iogpu.wired_limit_mb=44000
    fi ;;
  *) echo "Usage: $0 <coder|qwen3|llama70|gemma12|gemma26|mistral> [port]"; exit 1 ;;
esac

[ -f "$model" ] || { echo "Model not found (still downloading?): $model"; exit 1; }
echo ">> Serving '$name' ctx=$ctx -> http://127.0.0.1:$port (web UI + /v1 API)"
exec llama-server -m "$model" -ngl 99 -c "$ctx" --jinja --host 127.0.0.1 --port "$port"

Run it:

~/models/serve.sh coder        # http://127.0.0.1:8080
~/models/serve.sh qwen3 8090   # second model on a different port

The flags that matter: -ngl 99 offloads all layers to Metal, -c sets the context window, and --jinja makes llama-server apply the model's own chat template. Open the web UI at the printed URL, or point any OpenAI-compatible client at http://127.0.0.1:<port>/v1.

The model list is an example set sized for ~48 GB. Edit the case paths to match the GGUFs you downloaded in Step 2.

Local chat models don't wire into the agent's LLM provider

The OpenAI-compatible endpoint here is for your own dev / offline use (coding assistants, OpenAI-client testing). It is not a backend for a Hadron agent's chat — that provider list (Configure your LLM provider) is hosted APIs (Anthropic, OpenAI, GLM, Bedrock), with no custom base-URL field. Only the embedding path (Step 5) plugs into Hadron.

Large models and the Metal wired-memory limit

macOS caps how much unified memory the GPU may "wire" (hold non-pageable) — roughly 36 GB on a 48 GB machine. A 70B model at IQ4_XS (~38 GB) exceeds that and fails to load until you raise the limit. The llama70 case in serve.sh does this for you; to do it by hand:

sudo sysctl -w iogpu.wired_limit_mb=44000

This is macOS-only, needs sudo, and resets on reboot. Size it below your total RAM with headroom for the OS — 44 GB on a 48 GB machine. Models that fit under the default limit (everything smaller than ~36 GB) need no change.

Step 4 — Stop a server and free RAM

Ctrl-C in the serving terminal stops a server. To kill any llama.cpp server from anywhere (chat on :8080, embeddings on :8081, or both), drop this stop.sh into ~/models/:

#!/usr/bin/env bash
# Stop all running llama.cpp servers (chat :8080 and/or embeddings :8081), freeing RAM.
if pkill -f llama-server; then
  echo "stopped llama.cpp server(s) — RAM freed"
else
  echo "no llama.cpp server was running"
fi

Because llama.cpp is not a daemon, the RAM is released the instant the process exits — there's no background service holding weights in memory, and nothing relaunches at login. This is the practical difference from Ollama, which runs a persistent service and keeps models resident until OLLAMA_KEEP_ALIVE elapses.

The no-daemon tradeoff for Hadron

The flip side: you must start the embedding server yourself before mode:vector / mode:hybrid retrieval works. If serve-embed.sh isn't running, Hadron can't reach the endpoint and h-find-nodes returns reason: "embedding_unavailable" (on hybrid, the keyword half still runs). Start it again and retrieval recovers on the next search.

Step 5 — Serve embeddings for Hadron's RAG layer

This is the Hadron-specific payoff: a local nomic-embed-text-v1.5 endpoint that Hadron's embedding worker calls instead of SageMaker.

Drop this serve-embed.sh into ~/models/, chmod +x it, and run it. It serves the embedding model on :8081:

#!/usr/bin/env bash
# nomic-embed-text v1.5 embedding server for Hadron's RAG layer (768-dim).
# Point Hadron at it: EMBEDDING_API_URL=http://127.0.0.1:8081/v1/embeddings
set -euo pipefail
export PATH="/opt/homebrew/bin:$PATH"
exec llama-server \
  -m "$HOME/models/nomic-embed/nomic-embed-text-v1.5.f16.gguf" \
  --embedding --pooling mean -ngl 99 \
  -c 8192 -b 8192 --ubatch-size 8192 \
  --host 127.0.0.1 --port 8081

Why this is a drop-in for Hadron's default backend:

  • Response shape. Hadron's HTTP embedding backend (EMBEDDING_BACKEND=http, the default) sends { model, input } and already accepts the OpenAI { data: [{ embedding }] } response, so llama-server's /v1/embeddings needs no adapter.
  • Task prefixes. nomic needs search_document: / search_query: prefixes. Hadron's client applies these itself, so no endpoint-side templating is required.
  • Vector compatibility. --pooling mean with the default --embd-normalize 2 (L2) matches Ollama's normalized output. Same model, same normalization — so existing dev-DB vectors created against Ollama don't need re-embedding when you switch to llama.cpp.

5a — Wire the Hadron .env

Point the Hadron server at the local endpoint and restart it so the vars take effect:

EMBEDDING_BACKEND=http
EMBEDDING_API_URL=http://127.0.0.1:8081/v1/embeddings   # llama-server --embedding
EMBEDDING_MODEL=nomic-embed-text                         # sent verbatim; llama-server ignores it
EMBEDDING_DIM=768                                        # MUST equal the pgvector vector(N) column
# EMBEDDING_API_KEY=                                     # optional
Env var Required Notes
EMBEDDING_BACKEND no Defaults to http. The HTTP client is what speaks to llama-server.
EMBEDDING_API_URL yes The llama-server /v1/embeddings URL. Unset means every mode:vector query returns reason: "embedding_unavailable".
EMBEDDING_MODEL no Stored per-vector for lineage. Sent in the request body but llama-server ignores it (it serves whatever -m loaded). Keep it nomic-embed-text to match the Ollama lineage label.
EMBEDDING_DIM no Defaults to 768. Must equal the pgvector vector(N) column dimension, or embeds are rejected. nomic-embed-text-v1.5 is 768-dim.
EMBEDDING_API_KEY no Only if you front the endpoint with auth. A bare local llama-server needs none.

See RAG vector index — Operator configuration for the full env table.

5b — Verify

1. The endpoint returns 768-dim vectors. With serve-embed.sh running:

curl -s http://127.0.0.1:8081/v1/embeddings \
  -H 'content-type: application/json' \
  -d '{"input": "search_document: hello world"}' \
  | python3 -c 'import sys, json; print(len(json.load(sys.stdin)["data"][0]["embedding"]), "dims")'

Expect 768 dims. (Have jq? … | jq '.data[0].embedding | length'.)

2. A node embeds. Enable the index on a test memory, write a node with an abstract, and confirm the worker drains — the node's embeddingPendingAt clears on success:

mutation {
  updateMemory(id: "<memoryId>", vectorIndexEnabled: true, embeddingSource: abstract) {
    id
    vectorIndexEnabled
  }
}

(For an encrypted memory, also pass acknowledgeVectorInversionRisk: true — see Encrypted memories.)

3. Vector search returns hits. Call h-find-nodes with mode: vector and a natural-language query. A non-empty result means the worker reached llama-server, embedded successfully, and the vectors landed in pgvector. An empty result with reason: "embedding_unavailable" means the worker couldn't reach the endpoint — confirm serve-embed.sh is running and the .env URL matches.

This mirrors the SageMaker verification checklist — the Hadron-side checks are backend-agnostic.

Talking to Hadron from your local setup

This page wires one thing between llama.cpp and Hadron: the embedding backend. Two other things people reach for, and where each belongs:

  • Read/write memory from the terminal or scripts → the hadron CLI. It talks to Hadron directly over GraphQL with no model in the loop: hadron node get/add, hadron memory ls, hadron spec find (hybrid search), or the raw hadron api escape hatch — all with --json and stable exit codes. This is the right companion to a local/offline setup; see the CLI reference.

  • Let a model use Hadron's memory tools → MCP, in a capable agent client (Claude Code, Claude Desktop, Cursor, OpenCode). Hadron's MCP surface (URN grammar, scoped read/write tools) needs a strong tool-caller — small local models drive it unreliably, so pointing a local chat model at Hadron's MCP isn't a setup we recommend.

Experimental: MCP in the llama.cpp web UI

llama-server's web UI has an experimental MCP proxy (--ui-mcp-proxy) you could aim at Hadron's MCP endpoint. Upstream flags it "do not enable in untrusted environments," and tool-calling quality is bounded by the local model — treat it as tinkering, not a supported path. For real work against Hadron, use the CLI above or a full agent client.

Reverting to Ollama or SageMaker

The backend is selected by env, so switching back is a config change, not a code change. Stop the local server first (~/models/stop.sh) to free its RAM, then update the Hadron .env:

  • To OllamaEMBEDDING_BACKEND=http, EMBEDDING_API_URL=http://localhost:11434/api/embed, EMBEDDING_MODEL=nomic-embed-text. Ollama and llama.cpp both serve nomic-embed-text-v1.5 with L2-normalized output, so dev vectors interoperate — no re-embedding needed.
  • To SageMakerEMBEDDING_BACKEND=sagemaker plus the endpoint vars (see Configure AWS SageMaker for vector embeddings). If your SageMaker endpoint hosts a different model — that how-to deploys modernbert-embed-base, not nomic-embed-text-v1.5 — then switching to or from it is a model change, and the memory needs a re-index (toggle vectorIndexEnabled off and on, or change embeddingSource, to re-enqueue) before rankings are correct.

Restart the Hadron server after any change.

See also