
The RAG tech stack

When Hadron's vector index runs over your memory, four moving pieces are involved: an embedding model turns text into a vector, an inference server runs that model, a vector store holds the results, and a search index ranks them at query time. Every one of those pieces had hosted alternatives we could have shipped instead. This page is about why Hadron picked the self-hosted path for all four — and about the two supported inference-server backends an operator standing up a Hadron server can choose between.

If you want the operational reference — env vars, embedding-source options, how to enable it — read RAG vector index first. This page explains the choices that page documents.

The principle: no node content leaves Hadron infrastructure

The deciding factor for the whole stack was a single privacy posture: embedding happens on plaintext, and plaintext must not travel to a third-party inference vendor.

This matters because of two facts that compound:

  1. Embeddings are derived from plaintext content. A search query needs to live in the same vector space as the indexed content. To put encrypted-memory content into that space, you have to embed the plaintext — there's no way to embed ciphertext and recover meaning.
  2. Embedding-inversion research is real. The vec2text line of work shows partial source reconstruction from stored embeddings is feasible, especially for short, semantically-rich inputs like abstracts. Hadron discloses this honestly (the encrypted-memory disclosure), but the trust boundary the user is asked to extend is "anyone with database access." Sending plaintext to a hosted embedding API would silently widen that boundary to a third party — and there'd be no FR-026-style disclosure for it because plaintext-in-transit is a different threat model.

So the whole stack falls out of one rule: the embedding server runs on infrastructure the operator controls, and the embedding API call never reaches a hosted-inference vendor. Once that's fixed, every other choice gets easier — and it's worth pinning down what "controls" means, because cloud-hosted operator infrastructure qualifies:

  • Hosted-inference vendors (OpenAI, Voyage, Cohere, Amazon Bedrock): your plaintext leaves your tenant boundary and is processed by the vendor's shared, per-token-billed service. Rejected by the principle.
  • Cloud-tenant compute (AWS SageMaker hosting your own model, EC2 running your own embedding server, equivalents on GCP/Azure): your plaintext stays inside your tenant; the cloud is treated as an infrastructure provider, the same way you treat the host where the Postgres database runs. Accepted.
  • Self-hosted on hardware you operate (a Linux host; a developer laptop): same trust shape as cloud-tenant compute. Accepted.

The distinction is "who processes the plaintext, and under what contract" — not "is it on a cloud." A SageMaker endpoint hosting your own model under your IAM role is materially different from a Bedrock API call that sends your bytes into AWS's shared inference infrastructure.

The embedding model: nomic-embed-text-v1.5

We picked nomic-embed-text-v1.5 (768 dimensions, Apache-2.0 licensed) as the platform-fixed model. Three things lined up:

  • Fit. It's competitive on MTEB for its size class, has an 8192-token context window (so a 512-token chunk or a paragraph-length abstract both fit cleanly), and is trained with Matryoshka representation learning — meaning the 768d vectors can be truncated to 512d or 256d later without re-embedding, if storage pressure ever becomes a real concern.
  • License. Apache-2.0 lets Hadron self-host without negotiating terms with anyone.
  • Ops. Small enough to serve on CPU for the corpus sizes Hadron targets. No GPU required for the dev path; production can choose GPU or CPU based on indexing volume.
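As a rough illustration of the Matryoshka point above, the sketch below shortens a stored vector and re-normalizes it so cosine comparisons still behave. It assumes vectors are held as plain number arrays; the function name `truncateEmbedding` is hypothetical and not part of Hadron. Nomic's model card describes its own resizing recipe, which is worth checking before relying on a plain slice like this one.

```typescript
// Minimal sketch (hypothetical helper, not Hadron's code): keep the first
// `targetDim` components of an embedding and rescale to unit length.
function truncateEmbedding(vector: number[], targetDim: number): number[] {
  const isWholeNumber = Math.floor(targetDim) === targetDim;
  if (!isWholeNumber || targetDim <= 0 || targetDim > vector.length) {
    throw new RangeError(`targetDim must be an integer in 1..${vector.length}`);
  }
  // Matryoshka-trained models front-load information, so a prefix is usable.
  const head = vector.slice(0, targetDim);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) {
    throw new Error("cannot normalize a zero vector");
  }
  return head.map((x) => x / norm);
}
```

Because the operation only reads the stored vector, it could in principle run as a one-off migration over existing rows rather than through the embedding worker.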

A platform-fixed model is itself a choice. Every vector in Hadron v1 shares a single embedding model and a single dimension (768), so all vectors are mutually comparable. That makes the storage layer simple — one column, one index, one similarity metric — at the cost of foreclosing per-memory model selection. We think that's the right trade: cross-memory ranked search stays viable as a future feature because of the single-model invariant. We stamp the model id on every stored vector anyway, so a future platform-wide migration (new model, same v1 storage shape) is observable rather than implicit.

Why not OpenAI's embeddings, or Voyage, or Cohere?

The hosted-API alternatives are slightly cheaper to operate (you don't run a model server) and frequently a notch higher quality, but they all share one disqualifying property for Hadron: embedding plaintext means sending plaintext to a third party. That's the privacy line in the principle above.

If you'd like to evaluate a hosted model against your own corpus in your own deployment, nothing prevents it: the model is configured by env (EMBEDDING_MODEL), the URL is configured by env (EMBEDDING_API_URL), and the worker speaks both the OpenAI {data: [{embedding}]} shape and the Ollama batch {embeddings: number[][]} shape. You'd be deliberately opting yourself out of the platform's privacy invariant. The plumbing supports it; the defaults don't.

The inference server: two supported backends

Hadron's embedding worker speaks to two interchangeable backends. Both serve the same nomic-embed-text-v1.5 weights with the same nomic task-instruction prefixes and return the same 768-dimension vectors — retrieval quality is identical between them. What changes is where the server runs, how the worker reaches it, and the shape of the operational surface you inherit.
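For readers unfamiliar with the task-instruction prefixes mentioned above: Nomic's model card documents prefixes such as `search_document:` for indexed text and `search_query:` for queries. The sketch below shows the idea only. Whether Hadron applies the prefix in the worker or elsewhere is not stated on this page, and `withTaskPrefix` is a hypothetical name.

```typescript
// Minimal sketch (hypothetical helper): prepend a nomic task prefix so that
// indexed text and queries are embedded with the instruction each expects.
type EmbeddingTask = "search_document" | "search_query";

function withTaskPrefix(task: EmbeddingTask, text: string): string {
  return `${task}: ${text}`;
}

// Indexed content and queries use different prefixes but the same model.
const indexedInput = withTaskPrefix("search_document", "pgvector stores vectors in Postgres.");
const queryInput = withTaskPrefix("search_query", "where are vectors stored?");
```

Applying the same prefixes on both backends is part of what keeps the resulting vectors comparable.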

Pick whichever matches the deployment you're standing up. The worker dispatches on a single environment variable (EMBEDDING_BACKEND), so the choice can change later without re-embedding the corpus.

Option A — Ollama (HTTP backend)

Ollama is a single-binary model runner that serves models over HTTP on localhost:11434. To stand up an embedding server with Ollama you run two commands:

ollama serve
ollama pull nomic-embed-text

That's the whole setup. No Python virtualenv, no CUDA toolkit, no Docker image pull, no model-converter, no GPU. Ollama runs on a laptop, a small VM, or a Linux host of any size, and serves CPU inference at workable speed for low-to-moderate volume.

The nomic-embed-text tag it ships serves nomic-embed-text-v1.5 — the exact model the SageMaker backend hosts too. Model parity between backends is a load-bearing property of the stack: embedding-model drift between environments shows up as silently degraded retrieval (vectors created with one model don't rank correctly against queries embedded with another), and Ollama's prebuilt nomic tag eliminates a class of "works on my machine" failures when developers move between this backend and SageMaker.

Three properties that recommend Ollama:

  1. Zero-config install. brew install ollama (or the single-binary download on Linux) and you have a working embedding server.
  2. Batch endpoint. The /api/embed endpoint accepts an array of inputs and returns an array of vectors — exactly the shape the Hadron embedding worker needs to drain the queue efficiently.
  3. API stability. The response shape ({embeddings: number[][]}) has been stable across Ollama releases, so the worker doesn't need version-aware parsing.
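To make the batch endpoint concrete, here is a small sketch of how a queue of texts could be grouped into request bodies for Ollama's /api/embed, which takes a `model` name and an `input` that may be an array of strings. The batch size, the `OllamaEmbedRequest` type, and the `buildEmbedBatches` helper are assumptions for illustration, not Hadron's worker code.

```typescript
// Minimal sketch (hypothetical helper): split pending texts into JSON bodies
// for POST /api/embed, one body per batch.
interface OllamaEmbedRequest {
  model: string;
  input: string[];
}

function buildEmbedBatches(
  texts: string[],
  batchSize: number,
  model: string = "nomic-embed-text",
): OllamaEmbedRequest[] {
  if (Math.floor(batchSize) !== batchSize || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: OllamaEmbedRequest[] = [];
  for (let start = 0; start < texts.length; start += batchSize) {
    // slice() clamps at the end of the array, so the last batch may be short.
    batches.push({ model, input: texts.slice(start, start + batchSize) });
  }
  return batches;
}
```

Each element would be serialized and sent as the body of one POST to the configured endpoint; the response carries one vector per input, in order.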

Configuration:

EMBEDDING_BACKEND=http
EMBEDDING_API_URL=http://localhost:11434/api/embed

The http backend isn't limited to Ollama. Any endpoint that returns the Ollama {embeddings} shape, the OpenAI {data: [{embedding}]} shape, or the HuggingFace TEI raw 2D-array shape works — so an operator who'd rather run TEI or vLLM directly on a VM they manage can do so under the same backend.
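One way to handle the three response shapes named above is a single normalizing parser. The following is a sketch under that assumption; it is not Hadron's actual parser, and the function names are hypothetical.

```typescript
// Minimal sketch (hypothetical helpers): normalize the Ollama, OpenAI, and
// TEI response shapes into a plain list of vectors.
function isNumberArray(value: unknown): value is number[] {
  return Array.isArray(value) && value.every((x) => typeof x === "number");
}

function isVectorList(value: unknown): value is number[][] {
  return Array.isArray(value) && value.every(isNumberArray);
}

function parseEmbeddings(body: unknown): number[][] {
  // TEI: the body itself is a 2D array of floats.
  if (isVectorList(body)) {
    return body;
  }
  if (typeof body === "object" && body !== null) {
    const record = body as Record<string, unknown>;
    // Ollama: { embeddings: number[][] }
    const ollamaVectors = record["embeddings"];
    if (isVectorList(ollamaVectors)) {
      return ollamaVectors;
    }
    // OpenAI: { data: [{ embedding: number[] }, ...] }
    const openAiItems = record["data"];
    if (Array.isArray(openAiItems)) {
      const vectors: number[][] = [];
      for (const item of openAiItems) {
        const embedding =
          typeof item === "object" && item !== null
            ? (item as Record<string, unknown>)["embedding"]
            : undefined;
        if (!isNumberArray(embedding)) {
          throw new Error("OpenAI-shaped item is missing a numeric embedding");
        }
        vectors.push(embedding);
      }
      return vectors;
    }
  }
  throw new Error("unrecognized embedding response shape");
}
```

Failing loudly on an unknown shape, rather than returning an empty list, keeps a misconfigured endpoint from silently producing an empty index.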

Option B — AWS SageMaker

AWS SageMaker Real-Time Inference hosts nomic-embed-text-v1.5 on a managed endpoint, typically via the HuggingFace Text-Embeddings-Inference (TEI) container. The worker invokes the endpoint through the AWS SageMaker Runtime SDK; the response is the TEI native shape (a top-level 2D array of floats) which the worker parses directly.

Four properties that recommend SageMaker:

  1. Privacy posture preserved. SageMaker endpoints run on AWS-tenant compute — the model lives in your AWS account, under your IAM role, behind your authentication. The request stays inside the tenant boundary; AWS is treated as an infrastructure provider, not a hosted-inference vendor. This is materially different from calling Amazon Bedrock or OpenAI, where plaintext leaves the tenant and is processed by a shared inference service. See "Why not Amazon Bedrock?" below for the contractual distinction.
  2. Operational surface. SageMaker endpoints carry the things a production embedding workload actually needs — autoscaling, health checks, instance-type sizing, CloudWatch logs and metrics, IAM-gated access. You get those by configuring the endpoint, not by building them.
  3. AWS-credit eligible. SageMaker is among the services covered by typical AWS credit programs (including the AWS Imagine Grant for nonprofits), which materially lowers the cost of running an embedding endpoint while you're still learning what its real volume is.
  4. Model parity with Option A. SageMaker is the runtime; the model it hosts is the same nomic-embed-text-v1.5 an Ollama install serves. Operators can switch between backends — or run both in parallel for failover — without re-embedding the corpus.

The worker reaches the endpoint via the AWS SDK rather than raw HTTPS, which gives you SigV4 signing, IAM-derived authentication, and regional endpoint resolution for free.

Configuration:

EMBEDDING_BACKEND=sagemaker
EMBEDDING_SAGEMAKER_ENDPOINT_NAME=hadron-embed-nomic-v1
EMBEDDING_SAGEMAKER_REGION=us-east-1

# Recommended hardening — dedicated SageMaker credentials.
# When both are set, the worker uses these instead of the SDK's
# default credential chain. A leaked key only buys `sagemaker:InvokeEndpoint`
# on the configured endpoint ARN — NOT whatever S3 / account-level
# surface the host's default AWS identity can reach.
EMBEDDING_SAGEMAKER_ACCESS_KEY_ID=AKIA...
EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY=...

# If the dedicated vars above are unset, the worker falls through to
# the AWS SDK's standard chain (IAM role on host, AWS_PROFILE,
# AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY). Either way, the principal
# the worker resolves needs `sagemaker:InvokeEndpoint` on the endpoint
# ARN; a dedicated principal should hold no other AWS permission.

Hardening: dedicated SageMaker credentials

A typical AWS host has one set of credentials that any AWS SDK call will pick up — convenient, but a leak of those credentials grants the attacker whatever IAM action the host's default identity can perform (S3, the rest of the account surface, occasionally cross-service escalation paths). The SageMaker backend supports an opt-in dedicated-credentials path that scopes the embedding worker's AWS identity to a single IAM action on a single endpoint ARN.

Set both EMBEDDING_SAGEMAKER_ACCESS_KEY_ID and EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY (an access key for an IAM principal whose policy is just the one statement below; in practice an IAM user, since temporary role credentials also require a session token) and the worker constructs its SageMaker client with those credentials explicitly:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "sagemaker:InvokeEndpoint",
    "Resource": "arn:aws:sagemaker:<region>:<account>:endpoint/<endpoint-name>"
  }]
}

A leaked credential now buys exactly one capability: invoking that endpoint. It cannot enumerate the account, cannot touch S3, cannot reach CloudWatch logs, cannot assume other roles. The blast radius collapses to the single IAM action you've explicitly granted.

The dedicated path is opt-in: leave both env vars unset and the worker falls through to the SDK's default credential chain — fine for dev or single-tenant deploys where a separate IAM principal isn't worth the secrets-manager overhead. The factory refuses to boot if only one of the two is set, so a half-configured pair errors loudly rather than silently falling back.
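The both-or-neither rule can be sketched as a small config resolver. The environment variable names come from this page; the `resolveBackendConfig` function, its types, and the behavior for a missing or unknown EMBEDDING_BACKEND are assumptions made for illustration, not the worker's actual factory.

```typescript
// Minimal sketch (hypothetical factory): resolve backend config from env and
// refuse a half-configured dedicated-credentials pair.
type Env = Record<string, string | undefined>;

type BackendConfig =
  | { backend: "http"; url: string }
  | {
      backend: "sagemaker";
      endpointName: string;
      region: string;
      credentials?: { accessKeyId: string; secretAccessKey: string };
    };

function requiredVar(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value === "") {
    throw new Error(`${key} must be set`);
  }
  return value;
}

function resolveBackendConfig(env: Env): BackendConfig {
  const backend = requiredVar(env, "EMBEDDING_BACKEND");
  if (backend === "http") {
    return { backend: "http", url: requiredVar(env, "EMBEDDING_API_URL") };
  }
  if (backend !== "sagemaker") {
    throw new Error(`unsupported EMBEDDING_BACKEND: ${backend}`);
  }
  const endpointName = requiredVar(env, "EMBEDDING_SAGEMAKER_ENDPOINT_NAME");
  const region = requiredVar(env, "EMBEDDING_SAGEMAKER_REGION");
  const accessKeyId = env["EMBEDDING_SAGEMAKER_ACCESS_KEY_ID"];
  const secretAccessKey = env["EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY"];
  if (!accessKeyId && !secretAccessKey) {
    // Neither set: the caller falls through to the SDK's default chain.
    return { backend: "sagemaker", endpointName, region };
  }
  if (!accessKeyId || !secretAccessKey) {
    // Exactly one set: fail at boot instead of silently falling back.
    throw new Error("set both EMBEDDING_SAGEMAKER_ACCESS_KEY_ID and EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY, or neither");
  }
  return { backend: "sagemaker", endpointName, region, credentials: { accessKeyId, secretAccessKey } };
}
```

Passing the environment in as an argument, rather than reading a global, keeps the rule easy to unit-test.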

Choosing between them

| If your situation looks like… | Lean toward |
| --- | --- |
| Single-host deploy, you want minimal ops, throughput is modest, you'd rather not depend on a cloud provider | Ollama (Option A) |
| You're already on AWS, you want a managed endpoint with autoscaling and IAM-gated access, AWS credits are available, you don't want to hand-roll instance health checks | SageMaker (Option B) |
| You're on AWS but the credit window is short and the workload is small | Either works: the Ollama-on-EC2 variant of Option A is the cheapest steady-state path; SageMaker is the cheapest while credits last and the easiest to operate |
| You're piloting and want to ship something running today, then re-evaluate | Start with Ollama (Option A); the backend can be swapped later via the env var without re-embedding |
| You need to disclose to a third party what's running where | Both are honestly "self-hosted nomic-embed-text," because both are AWS-tenant or operator-tenant compute. Bedrock would not be |

Why not Amazon Bedrock?

Bedrock's Titan Text Embeddings v2 and Cohere Embed v3 are tempting on cost — pay-per-token rather than per-instance-hour, no autoscaling to configure, no model to manage. But Bedrock is a hosted-inference API: a request to it sends plaintext into AWS's managed inference service, materially the same trust shape as calling OpenAI. That crosses the privacy line in the principle above.

SageMaker-with-your-own-model and Bedrock-as-a-service look superficially similar (both are AWS, both bill against the same account) but the contractual trust boundary differs. The Bedrock terms govern a hosted inference service; the SageMaker terms govern AWS-tenant compute. That difference matters to anyone who has to defend the privacy posture of an encrypted memory's vector index downstream.

Why not "just always run TEI on a self-hosted VM"?

Running TEI directly on an EC2 or other VM you manage is also a viable backend — it preserves the privacy posture, runs the same model, and the worker speaks the TEI response shape natively via the http backend. The trade is that you take on instance health, autoscaling, IAM, and monitoring yourself. If you'd otherwise be hand-rolling those, SageMaker is the shortcut; if you have an existing platform that handles them already, TEI-on-VM is the leaner choice.

The principle

Match the inference-server choice to the deployment's actual constraints, not to a "one backend everywhere" ideal. Ollama is the lowest-friction path; SageMaker is the managed-endpoint path with the AWS operational surface. The model on the wire is identical, the worker code routes through the right client without the rest of the platform caring, and you can change your mind later with a single env var.

The vector store: pgvector, in the same Postgres

Hadron stores vectors in pgvector columns in the same Postgres database that holds memories, nodes, and edges. The vector index lives next to the data it indexes.

The alternatives we did not pick:

  • Sidecar vector DB (Qdrant, Weaviate, Pinecone, Milvus). Each adds a second persistence layer: a second backup story, a second failure mode, a second authentication path, a second data-residency boundary, and a transaction boundary between "node written" and "vector indexed." That last one is the big one — with a sidecar you end up either accepting eventual cross-store consistency (and building reconciliation tooling) or doing distributed transactions (and accepting the operational cost).

  • App-level vector math (store vectors as float[], compute cosine in TypeScript). Works at toy scale, degrades at any real corpus size: with no index, every query is an O(n) scan that first has to pull every candidate vector out of Postgres, so latency grows linearly with the number of nodes in the memory.
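For a sense of what the rejected app-level approach amounts to, here is a minimal sketch of a brute-force ranked search. The `StoredVector` shape and function names are hypothetical; the point is that every query touches every row.

```typescript
// Minimal sketch (hypothetical types): index-free ranked search in app code.
interface StoredVector {
  nodeId: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new RangeError("dimension mismatch");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function bruteForceTopK(
  query: number[],
  rows: StoredVector[],
  k: number,
): { nodeId: string; score: number }[] {
  // Scores all n rows, then sorts them: O(n * d) scoring plus O(n log n) sort.
  return rows
    .map((row) => ({ nodeId: row.nodeId, score: cosineSimilarity(query, row.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The scoring loop itself is cheap per row; the cost that scales badly is loading and scoring every candidate on every query, which is what an in-database index avoids.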

Choosing pgvector keeps everything in one DB. The interim embedding-queue rides the same Postgres (durable markers on Node), the access-control predicates that gate Memory reads also gate the vector queries, and backups stay a single story. The trade is that pgvector has a real but finite scale ceiling — somewhere in the high tens of millions of vectors per index, depending on hardware and parameters — and Hadron will need to think about a sidecar if a single memory exceeds that ceiling. That's a problem for a future spec.

HNSW, not IVFFlat

pgvector ships two index types: HNSW (Hierarchical Navigable Small World) and IVFFlat (Inverted-File Flat). Hadron uses HNSW with vector_cosine_ops, m = 16, ef_construction = 64.

Two reasons:

  • No training step. HNSW can be built incrementally as vectors arrive. IVFFlat needs a representative sample of vectors to train its cluster centroids; building it incrementally as the embedding worker drains the queue would give you poorly-clustered centroids, and rebuilding the index periodically is operational overhead the async-write pipeline doesn't need.
  • Better recall/latency at Hadron's scale. HNSW gives higher recall for the same latency budget in the low-millions-of-vectors range, which is where Hadron memories live in practice.

The cost of HNSW is slower index build times and higher index storage, neither of which has bitten us yet.
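To show what those index parameters look like in practice, the sketch below builds the pgvector statements as strings from TypeScript. The `CREATE INDEX ... USING hnsw` form, `vector_cosine_ops`, the `m` and `ef_construction` options, and the `<=>` cosine-distance operator are pgvector's own; the table name, the `id` and embedding column names, and the helper functions are hypothetical, since Hadron's schema is not shown on this page.

```typescript
// Minimal sketch (hypothetical schema names): SQL text for an HNSW cosine
// index and a ranked nearest-neighbor query in pgvector.
function assertIdentifier(identifier: string): string {
  // Identifiers are interpolated into SQL, so accept only a conservative set.
  if (!/^[a-z_][a-z0-9_]*$/.test(identifier)) {
    throw new Error(`unsafe SQL identifier: ${identifier}`);
  }
  return identifier;
}

function hnswIndexSql(table: string, column: string): string {
  const t = assertIdentifier(table);
  const c = assertIdentifier(column);
  return `CREATE INDEX ON ${t} USING hnsw (${c} vector_cosine_ops) WITH (m = 16, ef_construction = 64);`;
}

function nearestNeighborsSql(table: string, column: string): string {
  const t = assertIdentifier(table);
  const c = assertIdentifier(column);
  // <=> is cosine distance, so similarity is 1 minus the distance.
  // $1 is the query vector literal, $2 the result limit.
  return `SELECT id, 1 - (${c} <=> $1::vector) AS similarity FROM ${t} ORDER BY ${c} <=> $1::vector LIMIT $2;`;
}

function toVectorLiteral(vector: number[]): string {
  // pgvector accepts a bracketed, comma-separated text form such as [1,2,3].
  return `[${vector.join(",")}]`;
}
```

Ordering by the distance expression directly, with a LIMIT, is the query form that lets the planner use the HNSW index.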

Cosine, not dot product or Euclidean

Sentence-embedding models — including nomic — are trained with normalized vectors and cosine similarity in mind. Using cosine at query time matches the training distribution and produces the expected ranking behavior. We use vector_cosine_ops (the cosine operator class) on the HNSW index.
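As a short worked identity (standard linear algebra, not anything specific to Hadron), it helps to see how the three candidate metrics relate when vectors have unit length:

```latex
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert},
\qquad
\lVert a \rVert = \lVert b \rVert = 1
\;\Longrightarrow\;
\cos(a, b) = a \cdot b,
\quad
\lVert a - b \rVert^{2} = 2 - 2\,(a \cdot b).
```

For exactly normalized vectors the three metrics therefore produce the same ranking. Cosine divides the norms out explicitly, so the ranking stays correct even if a stored vector is not exactly unit length.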

What's deliberately not in the v1 stack

Choosing one stack means deferring others. The shapes here don't foreclose any of them:

  • Per-memory model selection. Today every vector across the platform uses one model. A future spec could allow per-memory model picks, at the cost of losing cross-memory ranked search until a re-embed brings them onto a common model. We chose the simpler invariant first.
  • Matryoshka truncation. nomic-embed-text-v1.5 supports truncation to 512d or 256d without re-embedding. Hadron stores 768d vectors and gets full quality. If storage pressure becomes real, the same vectors can be re-stored at a smaller dimension without re-running the embedder.
  • Semantic chunking. v1 chunks at structural breaks (markdown sections, paragraphs) or by fixed-size token windows. A future "embed sentences, cut at similarity drops" strategy is cleaner for some corpora; the chunk-locator API (Passages) doesn't change.
  • Hosted-API model fallback. Nothing about the worker prevents pointing it at an OpenAI, Voyage, or Bedrock endpoint — the env vars and the HTTP-backend response-shape parsers are there. The defaults keep you on self-hosted / tenant-managed inference; choosing otherwise is a deliberate operator decision that crosses the privacy-posture line, not a platform default.

See also