The RAG tech stack¶
When Hadron's vector index runs over your memory, four moving pieces are involved: an embedding model turns text into a vector, an inference server runs that model, a vector store holds the results, and a search index ranks them at query time. Every one of those pieces had hosted alternatives we could have shipped instead. This page is about why Hadron picked the self-hosted path for all four — and about the two supported inference-server backends an operator standing up a Hadron server can choose between.
If you want the operational reference — env vars, embedding-source options, how to enable it — read RAG vector index first. This page explains the choices that page documents.
The principle: no node content leaves Hadron infrastructure¶
The deciding factor for the whole stack was a single privacy posture: embedding happens on plaintext, and plaintext must not travel to a third-party inference vendor.
This matters because of two facts that compound:
- Embeddings are derived from plaintext content. A search query needs to live in the same vector space as the indexed content. To put encrypted-memory content into that space, you have to embed the plaintext — there's no way to embed ciphertext and recover meaning.
- Embedding-inversion research is real. The vec2text line of work shows that partial source reconstruction from stored embeddings is feasible, especially for short, semantically rich inputs like abstracts. Hadron discloses this honestly (the encrypted-memory disclosure), but the trust boundary the user is asked to extend is "anyone with database access." Sending plaintext to a hosted embedding API would silently widen that boundary to a third party, and there'd be no FR-026-style disclosure for it because plaintext-in-transit is a different threat model.
So the whole stack falls out of one rule: the embedding server runs on infrastructure the operator controls, and the embedding API call never reaches a hosted-inference vendor. Once that's fixed, every other choice gets easier — and it's worth pinning down what "controls" means, because cloud-hosted operator infrastructure qualifies:
- Hosted-inference vendors (OpenAI, Voyage, Cohere, AWS Bedrock) — your plaintext leaves your tenant boundary and is processed by a service that charges per token. Rejected by the principle.
- Cloud-tenant compute (AWS SageMaker hosting your own model, EC2 running your own embedding server, equivalents on GCP/Azure) — your plaintext stays inside your tenant; the cloud is treated as an infrastructure provider, the same way you treat the host where the Postgres database runs. Accepted.
- Self-hosted on a Linux server you own (a Linux host you operate; a developer laptop) — same trust shape as cloud-tenant compute. Accepted.
The distinction is "who processes the plaintext, and under what contract" — not "is it on a cloud." A SageMaker endpoint hosting your own model under your IAM role is materially different from a Bedrock API call that sends your bytes into AWS's shared inference infrastructure.
The embedding model: nomic-embed-text-v1.5¶
We picked nomic-embed-text-v1.5 (768 dimensions, Apache-2.0
licensed) as the platform-fixed model. Three things lined up:
- Fit. It's competitive on MTEB for its size class, has an 8192-token context window (so a 512-token chunk or a paragraph-length abstract both fit cleanly), and is trained with Matryoshka representation learning — meaning the 768d vectors can be truncated to 512d or 256d later without re-embedding, if storage pressure ever becomes a real concern.
- License. Apache-2.0 lets Hadron self-host without negotiating terms with anyone.
- Ops. Small enough to serve on CPU for the corpus sizes Hadron targets. No GPU required for the dev path; production can choose GPU or CPU based on indexing volume.
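The truncation mentioned under Fit can be sketched as follows. This is a minimal illustration, not Hadron's code: the function names are hypothetical, and the order of steps (layer-normalize, slice, then L2-renormalize) is an assumption based on the procedure described on the nomic-embed-text-v1.5 model card, so verify it there before relying on it.

```typescript
// Hypothetical helper names; not part of Hadron.

/** Scale a vector to unit L2 length (an all-zero vector is returned unchanged). */
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

/** Layer normalization without learned scale or bias: zero mean, unit variance. */
function layerNorm(v: number[], eps = 1e-5): number[] {
  const mean = v.reduce((sum, x) => sum + x, 0) / v.length;
  const variance = v.reduce((sum, x) => sum + (x - mean) ** 2, 0) / v.length;
  const denom = Math.sqrt(variance + eps);
  return v.map((x) => (x - mean) / denom);
}

/** Shrink a stored embedding to `dim` components without re-running the embedder. */
function truncateMatryoshka(full: number[], dim: number): number[] {
  if (!Number.isInteger(dim) || dim < 1 || dim > full.length) {
    throw new RangeError(`dim must be an integer in 1..${full.length}, got ${dim}`);
  }
  // Assumed order: layer norm over the full vector, keep the leading
  // `dim` components, then renormalize so cosine ranking still applies.
  return l2Normalize(layerNorm(full).slice(0, dim));
}
```

Calling `truncateMatryoshka(vector768, 256)` would return a unit-length 256-component vector. Truncated vectors are only comparable with other vectors truncated to the same dimension.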
A platform-fixed model is itself a choice. Every vector in Hadron v1
shares a single embedding model and a single dimension (768), so
all vectors are mutually comparable. That makes the storage layer
simple — one column, one index, one similarity metric — at the cost
of foreclosing per-memory model selection. We think that's the right
trade: cross-memory ranked search stays viable as a future feature
because of the single-model invariant. We stamp the model id on
every stored vector anyway, so a future platform-wide migration (new
model, same v1 storage shape) is observable rather than implicit.
Why not OpenAI's embeddings, or Voyage, or Cohere?¶
The hosted-API alternatives are slightly cheaper to operate (you don't run a model server) and frequently a notch higher quality, but they all share one disqualifying property for Hadron: embedding plaintext means sending plaintext to a third party. That's the privacy line in the principle above.
If you'd like to evaluate a hosted-API model against your own corpus
in a deployment you run, nothing prevents it: the model is configured
by env (EMBEDDING_MODEL), the URL is configured by env
(EMBEDDING_API_URL), and the worker speaks both the OpenAI
{data: [{embedding}]} shape and the Ollama batch
{embeddings: number[][]} shape. You'd be opting yourself out of the
platform's privacy invariant, deliberately. The plumbing supports it;
the defaults don't.
The inference server: two supported backends¶
Hadron's embedding worker speaks to two interchangeable backends.
Both serve the same nomic-embed-text-v1.5 weights with the same
nomic task-instruction prefixes and return the same 768-dimension
vectors — retrieval quality is identical between them. What changes
is where the server runs, how the worker reaches it, and the
shape of the operational surface you inherit.
Pick whichever matches the deployment you're standing up. The worker
dispatches on a single environment variable (EMBEDDING_BACKEND),
so the choice can change later without re-embedding the corpus.
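A rough sketch of what dispatching on EMBEDDING_BACKEND could look like. The type and function names are hypothetical, the environment is passed in as a plain object so the example stays self-contained, and which variables each backend treats as required is an assumption rather than documented behavior. The env var names and the two backend values (http, sagemaker) are the ones used on this page.

```typescript
// Hypothetical sketch; only the env var names come from this page.
type Env = Record<string, string | undefined>;

type BackendConfig =
  | { kind: "http"; url: string; model: string }
  | { kind: "sagemaker"; endpointName: string; region: string };

/** Read a variable and fail loudly if it is missing or empty. */
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`${name} must be set`);
  }
  return value;
}

/** Pick the backend from EMBEDDING_BACKEND and gather its settings. */
function resolveBackend(env: Env): BackendConfig {
  const backend = requireVar(env, "EMBEDDING_BACKEND");
  switch (backend) {
    case "http":
      return {
        kind: "http",
        url: requireVar(env, "EMBEDDING_API_URL"),
        model: requireVar(env, "EMBEDDING_MODEL"),
      };
    case "sagemaker":
      return {
        kind: "sagemaker",
        endpointName: requireVar(env, "EMBEDDING_SAGEMAKER_ENDPOINT_NAME"),
        region: requireVar(env, "EMBEDDING_SAGEMAKER_REGION"),
      };
    default:
      throw new Error(`Unsupported EMBEDDING_BACKEND: ${backend}`);
  }
}
```

Because both backends return the same 768-dimension vectors, changing the variable changes only which client gets built; the stored vectors stay valid.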
Option A — Ollama (HTTP backend)¶
Ollama is a single-binary model runner that
serves models over HTTP on localhost:11434. To stand up an
embedding server with Ollama you run two commands:
That's the whole setup. No Python virtualenv, no CUDA toolkit, no Docker image pull, no model-converter, no GPU. Ollama runs on a laptop, a small VM, or a Linux host of any size, and serves CPU inference at workable speed for low-to-moderate volume.
The nomic-embed-text tag it ships serves nomic-embed-text-v1.5 —
the exact model the SageMaker backend hosts too. Model parity
between backends is a load-bearing property of the stack:
embedding-model drift between environments shows up as silently
degraded retrieval (vectors created with one model don't rank
correctly against queries embedded with another), and Ollama's
prebuilt nomic tag eliminates a class of "works on my machine"
failures when developers move between this backend and SageMaker.
Three properties that recommend Ollama:
- Zero-config install. brew install ollama (or the single-binary download on Linux) and you have a working embedding server.
- Batch endpoint. The /api/embed endpoint accepts an array of inputs and returns an array of vectors, exactly the shape the Hadron embedding worker needs to drain the queue efficiently.
- API stability. The response shape ({embeddings: number[][]}) has been stable across Ollama releases, so the worker doesn't need version-aware parsing.
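To make the batch endpoint concrete, here is a minimal sketch of one batched call to /api/embed. It assumes the nomic task prefixes (search_document and search_query) are prepended client-side, which this page does not state, and it relies on a global fetch (Node 18 or later). The function names are hypothetical and not taken from Hadron's worker.

```typescript
// Hypothetical sketch; Hadron's worker may build requests differently.
type Task = "search_document" | "search_query";

interface OllamaEmbedRequest {
  model: string;
  input: string[];
}

/** Build the JSON body for one batched POST to Ollama's /api/embed. */
function buildEmbedRequest(
  texts: string[],
  task: Task,
  model = "nomic-embed-text",
): OllamaEmbedRequest {
  // Assumption: the task prefix is added here rather than by the server.
  return { model, input: texts.map((t) => `${task}: ${t}`) };
}

/** Send one batch and return one vector per input, in input order. */
async function embedBatch(baseUrl: string, texts: string[], task: Task): Promise<number[][]> {
  const res = await fetch(`${baseUrl}/api/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildEmbedRequest(texts, task)),
  });
  if (!res.ok) {
    throw new Error(`embed request failed: HTTP ${res.status}`);
  }
  const json = (await res.json()) as { embeddings: number[][] };
  if (json.embeddings.length !== texts.length) {
    throw new Error("embedding count does not match input count");
  }
  return json.embeddings;
}
```

With a local Ollama running, `await embedBatch("http://localhost:11434", chunks, "search_document")` would embed a queue batch in a single round trip.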
Configuration:
The http backend isn't limited to Ollama. Any endpoint that returns
the Ollama {embeddings} shape, the OpenAI {data: [{embedding}]}
shape, or the HuggingFace TEI raw 2D-array shape works — so an
operator who'd rather run TEI
or vLLM directly on a VM
they manage can do so under the same backend.
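The three response shapes the http backend accepts can be handled by one small parser. The sketch below is a hypothetical illustration of that idea, not the worker's actual code; only the shapes themselves come from this page.

```typescript
// Hypothetical parser; the three shapes are the ones named on this page.
function parseEmbeddings(body: unknown): number[][] {
  const isVector = (v: unknown): v is number[] =>
    Array.isArray(v) && v.every((x) => typeof x === "number");
  const isMatrix = (m: unknown): m is number[][] =>
    Array.isArray(m) && m.every(isVector);

  // TEI: the body itself is a 2D array of floats.
  if (isMatrix(body)) return body;

  if (typeof body === "object" && body !== null) {
    const obj = body as Record<string, unknown>;

    // Ollama batch: { embeddings: number[][] }
    const embeddings = obj["embeddings"];
    if (isMatrix(embeddings)) return embeddings;

    // OpenAI: { data: [{ embedding: number[] }, ...] }
    const data = obj["data"];
    if (Array.isArray(data)) {
      const vectors: unknown[] = data.map((item: unknown) =>
        typeof item === "object" && item !== null
          ? (item as Record<string, unknown>)["embedding"]
          : undefined,
      );
      if (isMatrix(vectors)) return vectors;
    }
  }
  throw new Error("unrecognized embedding response shape");
}
```

Normalizing every backend to `number[][]` at this boundary is what lets the rest of a pipeline ignore which server produced the vectors.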
Option B — AWS SageMaker¶
AWS SageMaker Real-Time
Inference
hosts nomic-embed-text-v1.5 on a managed endpoint, typically via
the HuggingFace Text-Embeddings-Inference
(TEI)
container. The worker invokes the endpoint through the AWS SageMaker
Runtime SDK; the response is the TEI native shape (a top-level 2D
array of floats) which the worker parses directly.
Four properties that recommend SageMaker:
- Privacy posture preserved. SageMaker endpoints run on AWS-tenant compute — the model lives in your AWS account, under your IAM role, behind your authentication. The request stays inside the tenant boundary; AWS is treated as an infrastructure provider, not a hosted-inference vendor. This is materially different from calling Amazon Bedrock or OpenAI, where plaintext leaves the tenant and is processed by a shared inference service. See "Why not Amazon Bedrock?" below for the contractual distinction.
- Operational surface. SageMaker endpoints carry the things a production embedding workload actually needs — autoscaling, health checks, instance-type sizing, CloudWatch logs and metrics, IAM-gated access. You get those by configuring the endpoint, not by building them.
- AWS-credit eligible. SageMaker is in the set of services covered by typical AWS credit programs (including the Amazon Imagine grant for nonprofits), which materially lowers the cost of running an embedding endpoint while you're still learning what its real volume is.
- Model parity with Option A. SageMaker is the runtime; the model it hosts is the same nomic-embed-text-v1.5 an Ollama install serves. Operators can switch between backends, or run both in parallel for failover, without re-embedding the corpus.
The worker reaches the endpoint via the AWS SDK rather than raw HTTPS, which gives you SigV4 signing, IAM-derived authentication, and regional endpoint resolution for free.
Configuration:
EMBEDDING_BACKEND=sagemaker
EMBEDDING_SAGEMAKER_ENDPOINT_NAME=hadron-embed-nomic-v1
EMBEDDING_SAGEMAKER_REGION=us-east-1
# Recommended hardening — dedicated SageMaker credentials.
# When both are set, the worker uses these instead of the SDK's
# default credential chain. A leaked key only buys `sagemaker:InvokeEndpoint`
# on the configured endpoint ARN — NOT whatever S3 / account-level
# surface the host's default AWS identity can reach.
EMBEDDING_SAGEMAKER_ACCESS_KEY_ID=AKIA...
EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY=...
# If the dedicated vars above are unset, the worker falls through to
# the AWS SDK's standard chain (IAM role on host, AWS_PROFILE,
# AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY). Either way, the IAM
# principal the worker ends up using needs `sagemaker:InvokeEndpoint`
# on the endpoint ARN and no other AWS permission.
Hardening: dedicated SageMaker credentials¶
A typical AWS host has one set of credentials that any AWS SDK call will pick up — convenient, but a leak of those credentials grants the attacker whatever IAM action the host's default identity can perform (S3, the rest of the account surface, occasionally cross-service escalation paths). The SageMaker backend supports an opt-in dedicated-credentials path that scopes the embedding worker's AWS identity to a single IAM action on a single endpoint ARN.
Set both EMBEDDING_SAGEMAKER_ACCESS_KEY_ID and
EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY (corresponding to an IAM user
or role whose policy is just the one statement below) and the worker
constructs its SageMaker client with those credentials explicitly:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "sagemaker:InvokeEndpoint",
    "Resource": "arn:aws:sagemaker:<region>:<account>:endpoint/<endpoint-name>"
  }]
}
A leaked credential now buys exactly one capability: invoking that endpoint. It cannot enumerate the account, cannot touch S3, cannot reach CloudWatch logs, cannot assume other roles. The blast radius collapses to the single IAM action you've explicitly granted.
The dedicated path is opt-in: leave both env vars unset and the worker falls through to the SDK's default credential chain — fine for dev or single-tenant deploys where a separate IAM principal isn't worth the secrets-manager overhead. The factory refuses to boot if only one of the two is set, so a half-configured pair errors loudly rather than silently falling back.
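The both-or-neither rule can be sketched as a small pure function. This is a hypothetical illustration of the behavior described above, not the actual factory; treating an empty string as unset is an assumption.

```typescript
// Hypothetical sketch of the both-or-neither check; not Hadron's factory.
type EnvVars = Record<string, string | undefined>;

interface StaticCredentials {
  accessKeyId: string;
  secretAccessKey: string;
}

/**
 * Returns explicit credentials when both dedicated vars are set, or
 * undefined to signal "use the SDK's default credential chain".
 * Throws when only one of the pair is present.
 */
function resolveSageMakerCredentials(env: EnvVars): StaticCredentials | undefined {
  // Assumption: an empty string counts as unset.
  const accessKeyId = env["EMBEDDING_SAGEMAKER_ACCESS_KEY_ID"] || undefined;
  const secretAccessKey = env["EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY"] || undefined;

  if (accessKeyId === undefined && secretAccessKey === undefined) {
    return undefined;
  }
  if (accessKeyId === undefined || secretAccessKey === undefined) {
    throw new Error(
      "Set both EMBEDDING_SAGEMAKER_ACCESS_KEY_ID and " +
        "EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY, or neither",
    );
  }
  return { accessKeyId, secretAccessKey };
}
```

The returned object has the shape that AWS SDK for JavaScript v3 clients accept as their `credentials` option, so a caller could pass it through when it is defined and omit the option otherwise.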
Choosing between them¶
| If your situation looks like… | Lean toward |
|---|---|
| Single-host deploy, you want minimal ops, throughput is modest, you'd rather not depend on a cloud provider | Ollama (Option A) |
| You're already on AWS, you want a managed endpoint with autoscaling and IAM-gated access, AWS credits are available, you don't want to hand-roll instance health checks | SageMaker (Option B) |
| You're on AWS but the credit window is short and the workload is small | Either works — the Ollama-on-EC2 variant of Option A is the cheapest steady-state path; SageMaker is the cheapest while credits last and the easiest to operate |
| You're piloting and want to ship something running today, then re-evaluate | Start with Ollama (Option A); the backend can be swapped later via the env var without re-embedding |
| You need to disclose to a third party what's running where | Both are honestly "self-hosted nomic-embed-text," because both are AWS-tenant or operator-tenant compute. Bedrock would not be |
Why not Amazon Bedrock?¶
Bedrock's Titan Text Embeddings v2 and Cohere Embed v3 are tempting on cost — pay-per-token rather than per-instance-hour, no autoscaling to configure, no model to manage. But Bedrock is a hosted-inference API: a request to it sends plaintext into AWS's managed inference service, materially the same trust shape as calling OpenAI. That crosses the privacy line in the principle above.
SageMaker-with-your-own-model and Bedrock-as-a-service look superficially similar (both are AWS, both bill against the same account) but the contractual trust boundary differs. The Bedrock terms govern a hosted inference service; the SageMaker terms govern AWS-tenant compute. That difference matters to anyone who has to defend the privacy posture of an encrypted memory's vector index downstream.
Why not "just always run TEI on a self-hosted VM"?¶
Running TEI directly on an EC2 or other VM you manage is also a
viable backend — it preserves the privacy posture, runs the same
model, and the worker speaks the TEI response shape natively via the
http backend. The trade is that you take on instance health,
autoscaling, IAM, and monitoring yourself. If you'd otherwise be
hand-rolling those, SageMaker is the shortcut; if you have an
existing platform that handles them already, TEI-on-VM is the leaner
choice.
The principle¶
Match the inference-server choice to the deployment's actual constraints, not to a "one backend everywhere" ideal. Ollama is the lowest-friction path; SageMaker is the managed-endpoint path with the AWS operational surface. The model on the wire is identical, the worker code routes through the right client without the rest of the platform caring, and you can change your mind later with a single env var.
The vector store: pgvector, in the same Postgres¶
Hadron stores vectors in pgvector columns in the same Postgres database that holds memories, nodes, and edges. The vector index lives next to the data it indexes.
The alternatives we did not pick:
- Sidecar vector DB (Qdrant, Weaviate, Pinecone, Milvus). Each adds a second persistence layer: a second backup story, a second failure mode, a second authentication path, a second data-residency boundary, and a transaction boundary between "node written" and "vector indexed." That last one is the big one: with a sidecar you end up either accepting eventual cross-store consistency (and building reconciliation tooling) or doing distributed transactions (and accepting the operational cost).
- App-level vector math (store vectors as float[], compute cosine in TypeScript). Works at toy scale, fails at any real corpus size: no index means O(n) per query and you can't get under a second on a memory with ten thousand nodes.
Choosing pgvector keeps everything in one DB. The interim
embedding-queue rides the same Postgres (durable markers on Node),
the access-control predicates that gate Memory reads also gate the
vector queries, and backups stay a single story. The trade is that
pgvector has a real but finite scale ceiling — somewhere in the
high tens of millions of vectors per index, depending on hardware
and parameters — and Hadron will need to think about a sidecar if a
single memory exceeds that ceiling. That's a problem for a future
spec.
HNSW, not IVFFlat¶
pgvector ships two index types: HNSW (Hierarchical Navigable
Small World) and IVFFlat (Inverted-File Flat). Hadron uses HNSW
with vector_cosine_ops, m = 16, ef_construction = 64.
Two reasons:
- No training step. HNSW can be built incrementally as vectors arrive. IVFFlat needs a representative sample of vectors to train its cluster centroids; building it incrementally as the embedding worker drains the queue would give you poorly-clustered centroids, and rebuilding the index periodically is operational overhead the async-write pipeline doesn't need.
- Better recall/latency at Hadron's scale. HNSW gives higher recall for the same latency budget in the low-millions-of-vectors range, which is where Hadron memories live in practice.
The cost of HNSW is slower index build times and higher index storage, neither of which has bitten us yet.
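For reference, the index definition described in this section could be generated like this. The table, column, and `id` names are placeholders we made up for the sketch; the operator class and the m / ef_construction values are the ones stated above, and the SQL follows pgvector's documented syntax.

```typescript
// Hypothetical table and column names; parameters are the ones stated above.
function hnswIndexSql(table: string, column: string, m = 16, efConstruction = 64): string {
  const ident = /^[a-z_][a-z0-9_]*$/;
  if (!ident.test(table) || !ident.test(column)) {
    throw new Error("unsafe identifier");
  }
  if (!Number.isInteger(m) || !Number.isInteger(efConstruction)) {
    throw new Error("m and ef_construction must be integers");
  }
  return (
    `CREATE INDEX IF NOT EXISTS ${table}_${column}_hnsw_idx ` +
    `ON ${table} USING hnsw (${column} vector_cosine_ops) ` +
    `WITH (m = ${m}, ef_construction = ${efConstruction});`
  );
}

/** Top-k query that the HNSW index above can serve. */
function nearestNeighborsSql(table: string, column: string): string {
  // `<=>` is pgvector's cosine-distance operator; $1 is the query vector
  // and $2 the result limit. Assumes the table has an `id` column.
  return `SELECT id, ${column} <=> $1 AS distance FROM ${table} ORDER BY ${column} <=> $1 LIMIT $2;`;
}
```

The ORDER BY must use the same operator the index was built for (`<=>` with vector_cosine_ops); otherwise Postgres falls back to a sequential scan.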
Cosine, not dot product or Euclidean¶
Sentence-embedding models — including nomic — are trained with
normalized vectors and cosine similarity in mind. Using cosine at
query time matches the training distribution and produces the
expected ranking behavior. We use vector_cosine_ops (the cosine
operator class) on the HNSW index.
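A small worked example of why normalization and cosine go together: for unit-length vectors, cosine similarity reduces to a plain dot product. The helper names below are ours, for illustration only.

```typescript
// Illustrative helpers; not part of Hadron.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("dimension mismatch");
  }
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

/** Cosine similarity: the dot product divided by the product of the norms. */
function cosineSimilarity(a: number[], b: number[]): number {
  const denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
  if (denom === 0) {
    throw new Error("cosine similarity is undefined for a zero vector");
  }
  return dot(a, b) / denom;
}

/** Scale a vector to unit length. */
function toUnit(v: number[]): number[] {
  const norm = Math.sqrt(dot(v, v));
  if (norm === 0) {
    throw new Error("cannot normalize a zero vector");
  }
  return v.map((x) => x / norm);
}
```

For any nonzero `a` and `b`, `dot(toUnit(a), toUnit(b))` equals `cosineSimilarity(a, b)` up to floating-point error. pgvector's cosine operator returns a distance, which is 1 minus this similarity, so smaller values rank first.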
What's deliberately not in the v1 stack¶
Choosing one stack means deferring others. The shapes here don't foreclose any of them:
- Per-memory model selection. Today every vector across the platform uses one model. A future spec could allow per-memory model picks, at the cost of losing cross-memory ranked search until a re-embed brings them onto a common model. We chose the simpler invariant first.
- Matryoshka truncation. nomic-embed-text-v1.5 supports truncation to 512d or 256d without re-embedding. Hadron stores 768d vectors and gets full quality. If storage pressure becomes real, the same vectors can be re-stored at a smaller dimension without re-running the embedder.
- Semantic chunking. v1 chunks at structural breaks (markdown sections, paragraphs) or by fixed-size token windows. A future "embed sentences, cut at similarity drops" strategy is cleaner for some corpora; the chunk-locator API (Passages) doesn't change.
- Hosted-API model fallback. Nothing about the worker prevents pointing it at an OpenAI, Voyage, or Bedrock endpoint — the env vars and the HTTP-backend response-shape parsers are there. The defaults keep you on self-hosted / tenant-managed inference; choosing otherwise is a deliberate operator decision that crosses the privacy-posture line, not a platform default.
See also¶
- RAG vector index: the operational reference (env vars, embedding sources, how to enable the index per memory, what mode: vector / mode: hybrid returns).
- Spec 033 (RAG vector retrieval): the source specification, including the research document where the tech-stack decisions are recorded with their alternatives and rejected paths.
- Ollama and nomic-embed-text on Ollama: the dev embedding-server install.
- Run local LLMs with llama.cpp: the no-daemon dev alternative to Ollama on macOS / Apple Silicon, with a local nomic-embed-text endpoint plus general chat models.
- AWS SageMaker Real-Time Inference: the AWS doc for the runtime Hadron's production embedding workload runs on.
- HuggingFace Text-Embeddings-Inference (TEI): the container image that hosts nomic on our SageMaker endpoint. Also a viable HTTP-backend target if you'd rather run TEI directly on a VM you manage.
- pgvector: the Postgres extension Hadron uses for vector storage.