Configure AWS SageMaker for vector embeddings¶
Hadron's RAG vector index uses an embedding server to turn node content into vectors. The Hadron embedding worker supports two backends — an HTTP backend (Ollama, TEI-on-a-VM, vLLM, etc.) and an AWS SageMaker backend. This page walks through the SageMaker path end-to-end: deploy the endpoint via the AWS CLI, set the environment variables on the Hadron server, scope the worker's AWS identity to just the endpoint, and verify the wiring before turning the index on.
For the conceptual background on the two backends and when to pick which, see The RAG tech stack. For the operational reference (every embedding env var, embedding-source options, search modes), see RAG vector index.
Prerequisites¶
- AWS CLI v2 installed and configured with a profile that has
SageMaker, IAM, and ECR read permissions for the target account.
Verify with aws sts get-caller-identity.
- AWS account permissions to create SageMaker models / endpoints and an IAM user with a custom policy. Account-level quotas matter too: the default endpoint quota per region is plenty for a single Hadron deploy, but check Service Quotas if you're already running other SageMaker workloads.
- Write access to the Hadron server's environment (Doppler, your process manager's env file, or whatever the deploy uses). The env vars are read at process start.
Step 1: Deploy the SageMaker endpoint with the AWS CLI¶
If you already have a SageMaker endpoint hosting a 768-dimension embedding model from the Nomic family, skip to Step 2.
Otherwise, the AWS CLI is the cleanest path for a first-pass production deploy: three commands to register the model, define the endpoint shape, and create the endpoint itself. Once you have one working, you can move to CDK / Terraform / CloudFormation when you want the config checked into source.
1a. Create a SageMaker execution role (if you don't have one)¶
The execution role is what SageMaker assumes to pull the container image and read any model artifacts you stage in S3. This is a distinct IAM principal from the one Hadron uses to invoke the endpoint — that one is created in Step 3 below.
If your account already has a SageMaker execution role (often named
something like AmazonSageMaker-ExecutionRole-*), record its ARN
and skip to 1b. Otherwise, create one:
# Trust policy: lets the SageMaker service assume the role
cat > /tmp/sagemaker-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "sagemaker.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF
# Create the role with that trust policy
aws iam create-role \
  --role-name HadronSageMakerExecutionRole \
  --assume-role-policy-document file:///tmp/sagemaker-trust.json
# Attach the managed SageMaker policy to the new role
aws iam attach-role-policy \
  --role-name HadronSageMakerExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
The AmazonSageMakerFullAccess managed policy is broad — tighter
custom policies exist if you want to scope to specific S3 buckets,
ECR repos, and CloudWatch log groups, but the managed policy is the
boring starting point.
Record the role ARN (arn:aws:iam::<account-id>:role/HadronSageMakerExecutionRole)
for the next sub-step.
1b. Find the TEI container image URI for your region¶
The HuggingFace Text-Embeddings-Inference container is published as
a Hugging Face Deep Learning Container (DLC) in HuggingFace's ECR
repo (account ID 683313688378). The URI's tag — TEI version,
Python version, CUDA version — shifts as new TEI releases land.
Recommended discovery — the SageMaker Python SDK helper. The SDK resolves the current URI for the region you pass, so you don't reconstruct strings by hand:
from sagemaker.huggingface import get_huggingface_llm_image_uri
# CPU (e.g. ml.m5.xlarge; ml.t3.* is not accepted by SageMaker Hosting, see Step 1d)
print(get_huggingface_llm_image_uri("huggingface-tei-cpu", region="us-east-1"))
# GPU (e.g. ml.g4dn.xlarge)
print(get_huggingface_llm_image_uri("huggingface-tei", region="us-east-1"))
The SDK v3 release is recent and some helpers may not be fully
updated for it yet — pinning sagemaker<3.0.0 for this one lookup
is the safer path until that settles. The printed URI is the value
you'll substitute into the --primary-container argument in
Step 1c.
Why the explicit region=
Omitting region makes the SDK instantiate a Session() to
discover the default region from the AWS config, which in turn
forces the boto credential chain to load. On a host without AWS
credentials configured yet (or one missing the optional
botocore[crt] extra), that errors with
MissingDependencyException before the URI is ever resolved.
Passing region= keeps the lookup credential-free — you'll
configure credentials anyway before Step 1c
runs, but this one lookup doesn't need them.
Or look it up manually. The current URI table lives at the
Hugging Face SageMaker DLC catalog
— look for the "Text Embeddings Inference" table. Concrete current
examples for us-east-1 (verify before using — these strings shift
with each TEI release):
| Accelerator | Container URI |
|---|---|
| CPU | 683313688378.dkr.ecr.us-east-1.amazonaws.com/tei-cpu:2.0.1-tei1.8.2-cpu-py310-ubuntu22.04 |
| GPU | 683313688378.dkr.ecr.us-east-1.amazonaws.com/tei:2.0.1-tei1.8.2-gpu-py310-cu122-ubuntu22.04 |
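Before pasting a URI into Step 1c, a quick sanity check can catch a region or accelerator mix-up. The sketch below is illustrative only: `parse_ecr_uri` and `check_tei_image` are hypothetical helper names, the `-gpu-` tag test is inferred from the two example tags in the table above, and the same-region check assumes SageMaker expects the image to live in the endpoint's region.

```python
import re
from typing import NamedTuple


class EcrImage(NamedTuple):
    account_id: str
    region: str
    repository: str
    tag: str


# <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repo>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri: str) -> EcrImage:
    """Split an ECR image URI into its parts, or raise ValueError."""
    match = _ECR_URI.match(uri.strip())
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri!r}")
    return EcrImage(match["account"], match["region"], match["repo"], match["tag"])


def check_tei_image(uri: str, endpoint_region: str, want_gpu: bool) -> list[str]:
    """Return human-readable problems; an empty list means nothing obvious is wrong."""
    image = parse_ecr_uri(uri)
    problems = []
    if image.region != endpoint_region:
        problems.append(
            f"image is in {image.region} but the endpoint targets {endpoint_region}"
        )
    # Assumption: GPU tags contain "-gpu-", as in the example table above.
    is_gpu = "-gpu-" in image.tag
    if is_gpu != want_gpu:
        problems.append("image accelerator does not match the instance type you plan to use")
    return problems


cpu_uri = (
    "683313688378.dkr.ecr.us-east-1.amazonaws.com/"
    "tei-cpu:2.0.1-tei1.8.2-cpu-py310-ubuntu22.04"
)
print(check_tei_image(cpu_uri, "us-east-1", want_gpu=False))
```

An empty list only means the string looks plausible; it does not confirm the tag still exists in ECR.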
Limited region availability
The TEI DLC is published in a smaller set of regions than the
general HuggingFace inference DLC. us-east-1 is reliable; some
other regions don't have it. If your target region doesn't carry
the TEI image, the simplest path is to deploy the endpoint in
us-east-1 and set EMBEDDING_SAGEMAKER_REGION=us-east-1 on the
Hadron side. The Python SDK helper will surface an error if the
region you're targeting doesn't carry the image.
CPU is the right choice for low-to-moderate volume. Step up to GPU
(ml.g4dn.xlarge or similar in Step 1d)
when embedding throughput becomes the bottleneck — see The RAG
tech stack
for the choosing-between-them table.
1c. Create the model¶
This registers the model + container combination as a named SageMaker
resource. TEI loads the actual weights from HuggingFace Hub at
container startup via the HF_MODEL_ID environment variable — no S3
upload of model artifacts needed for the standard nomic case.
aws sagemaker create-model \
  --model-name hadron-embed-nomic-v1 \
  --execution-role-arn arn:aws:iam::<account-id>:role/HadronSageMakerExecutionRole \
  --primary-container '{
    "Image": "<TEI-container-URI-from-step-1b>",
    "Environment": {
      "HF_MODEL_ID": "nomic-ai/modernbert-embed-base",
      "MAX_BATCH_TOKENS": "8192"
    }
  }'
MAX_BATCH_TOKENS caps the total tokens TEI will pack into one
forward pass — and it also bounds the synthetic batch TEI uses
during warmup. 8192 is one max-length ModernBERT sequence per
batch and produces a ~3.2 GB attention allocation during warmup.
That allocation, plus model weights (~600 MB), plus ONNX Runtime
arena overhead (1–2 GB), plus container OS + runtime (~700 MB),
needs at minimum a 16 GB instance (ml.m5.xlarge) to run without
the BFCArena failing to find contiguous memory. ml.m5.large
(8 GB) is too tight for ModernBERT at any reasonable batch size
and will OOM during warmup. See the FusedMatMul row in Common
Errors below for the exact symptom.
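The memory figures above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch, not TEI's actual allocator logic: it assumes 12 attention heads and 4-byte (fp32) scores, which together reproduce the ~3.2 GB figure, and it takes the other components straight from the paragraph above.

```python
GB = 1e9


def warmup_attention_bytes(seq_len: int, heads: int = 12, bytes_per_value: int = 4) -> int:
    """One seq_len x seq_len attention-score matrix per head (assumed fp32)."""
    return heads * seq_len * seq_len * bytes_per_value


def warmup_budget_gb(seq_len: int = 8192) -> tuple[float, float]:
    """Low/high estimate of the total footprint in GB, using the figures quoted above."""
    attention = warmup_attention_bytes(seq_len) / GB
    weights = 0.6                        # model weights, ~600 MB
    arena_low, arena_high = 1.0, 2.0     # ONNX Runtime arena overhead, 1-2 GB
    container = 0.7                      # container OS + runtime, ~700 MB
    fixed = attention + weights + container
    return fixed + arena_low, fixed + arena_high


low, high = warmup_budget_gb()
print(f"attention: {warmup_attention_bytes(8192) / GB:.2f} GB")  # attention: 3.22 GB
print(f"total: {low:.1f}-{high:.1f} GB")                         # total: 5.5-6.5 GB
```

The total comes out below 8 GB on paper. The text above attributes the ml.m5.large failure to the arena not finding a contiguous block, so treat this sum as a lower bound rather than a sizing target.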
Why modernbert-embed-base and not nomic-embed-text-v1.5
Both are from Nomic, both are 768-dimension, both Matryoshka, both
8192-token context — but nomic-embed-text-v1.5's current
config.json on HuggingFace Hub has a duplicate field that TEI's
strict Rust JSON parser rejects (see the
"TEI rejected config.json" row in Common
Errors below). modernbert-embed-base is Nomic's newer
ModernBERT-architecture model and has a clean config. The Hadron
spec-033 properties (Matryoshka, 8192-token context, Apache 2.0
license) all carry over.
1d. Create the endpoint config¶
This defines the instance shape (count and type). For
ModernBERT (the model from Step 1c), ml.m5.xlarge is the smallest
viable CPU instance: ml.m5.large has 8 GB, which doesn't leave
enough contiguous memory for the warmup allocation plus runtime
overhead (see the explanation in Step 1c above):
aws sagemaker create-endpoint-config \
  --endpoint-config-name hadron-embed-nomic-v1-config \
  --production-variants '[{
    "VariantName": "default",
    "ModelName": "hadron-embed-nomic-v1",
    "InitialInstanceCount": 1,
    "InstanceType": "ml.m5.xlarge"
  }]'
For GPU, swap ml.m5.xlarge → ml.g4dn.xlarge (NVIDIA T4) and use
the -gpu container URI from 1b. SageMaker's autoscaling lives in
the aws application-autoscaling service — out of scope here; the
fixed-count config above is the right starting point.
SageMaker Hosting doesn't accept every EC2 instance type
The CreateEndpointConfig API has its own enum of allowed
InstanceType values that diverges from EC2's general list. Notably,
ml.t3.* is not in the enum — t3.medium works on EC2 but
fails CreateEndpointConfig with a ValidationException listing
the allowed types. Cheapest viable CPU options that ARE in the
enum:
| Instance | vCPU | RAM | ~$/hr | Notes |
|---|---|---|---|---|
| ml.t2.medium | 2 | 4 GB | ~$0.06 | Burstable; cheapest. Variable performance under load. Too small for ModernBERT; use only for non-ModernBERT smoke tests. |
| ml.m5.large | 2 | 8 GB | ~$0.13 | Non-burstable. Insufficient for modernbert-embed-base: TEI's warmup allocation plus runtime overhead doesn't fit in 8 GB. Works for smaller embedding models. |
| ml.m5.xlarge | 4 | 16 GB | ~$0.26 | Recommended for modernbert-embed-base. Comfortable headroom for warmup; runs MAX_BATCH_TOKENS=8192 cleanly with room to tune up. |
| ml.c6i.xlarge | 4 | 8 GB | ~$0.21 | Newer compute-optimized; same 8 GB ceiling as m5.large, so also insufficient for ModernBERT. Compute is good for smaller models that fit. |
| ml.c6i.2xlarge | 8 | 16 GB | ~$0.41 | Compute-optimized with 16 GB. Faster than m5.xlarge but more expensive. Worth it only when you've measured CPU as the bottleneck. |
The full allowed list is region-dependent; if the call rejects your choice it returns the exact enum, so you can pick from that error message.
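The table's selection logic (cheapest instance that clears the memory floor) can be written down directly. The sketch below simply re-encodes the five rows above; the prices are the table's approximate figures, not live pricing, and `cheapest_with_ram` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    vcpu: int
    ram_gb: int
    usd_per_hour: float  # approximate, copied from the table above


CANDIDATES = [
    Instance("ml.t2.medium", 2, 4, 0.06),
    Instance("ml.m5.large", 2, 8, 0.13),
    Instance("ml.m5.xlarge", 4, 16, 0.26),
    Instance("ml.c6i.xlarge", 4, 8, 0.21),
    Instance("ml.c6i.2xlarge", 8, 16, 0.41),
]


def cheapest_with_ram(min_ram_gb: int, candidates=CANDIDATES) -> Instance:
    """Pick the lowest hourly price among instances with at least min_ram_gb."""
    fitting = [inst for inst in candidates if inst.ram_gb >= min_ram_gb]
    if not fitting:
        raise ValueError(f"no candidate has {min_ram_gb} GB or more")
    return min(fitting, key=lambda inst: inst.usd_per_hour)


print(cheapest_with_ram(16).name)  # ml.m5.xlarge
```

With the 16 GB floor from Step 1c this lands on ml.m5.xlarge, matching the recommendation in the table.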
1e. Create the endpoint¶
This is the async step. The CLI returns immediately; SageMaker provisions the instance and pulls the container in the background.
aws sagemaker create-endpoint \
--endpoint-name hadron-embed-nomic-v1 \
--endpoint-config-name hadron-embed-nomic-v1-config
Poll for InService — typically 5-10 minutes for the first creation:
aws sagemaker describe-endpoint \
--endpoint-name hadron-embed-nomic-v1 \
--query 'EndpointStatus' --output text
The status walks through Creating → InService on success or
Failed on container startup error. Pricing starts the moment
status reaches InService, not when you call create-endpoint.
If it lands on Failed, check FailureReason in the
describe-endpoint JSON output and the CloudWatch log group at
/aws/sagemaker/Endpoints/hadron-embed-nomic-v1 for the container
startup logs.
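If you would rather script the wait than re-run describe-endpoint by hand, the polling loop is short. This is a minimal sketch: `wait_for_endpoint` and `describe_with_boto3` are hypothetical names, the 30-second interval and 30-minute timeout are arbitrary defaults, and boto3 is imported lazily so the loop itself can be exercised without AWS access.

```python
import time
from typing import Callable


def wait_for_endpoint(
    describe: Callable[[], dict],
    interval_s: float = 30.0,
    timeout_s: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until EndpointStatus is InService; raise on Failed or timeout."""
    waited = 0.0
    while True:
        info = describe()
        status = info["EndpointStatus"]
        if status == "InService":
            return info
        if status == "Failed":
            reason = info.get("FailureReason", "no FailureReason reported")
            raise RuntimeError(f"endpoint failed: {reason}")
        if waited >= timeout_s:
            raise TimeoutError(f"still {status} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


def describe_with_boto3(endpoint_name: str, region: str) -> Callable[[], dict]:
    """Bind describe_endpoint to one endpoint (needs boto3 and AWS credentials)."""
    import boto3  # lazy import: only required when actually talking to AWS

    client = boto3.client("sagemaker", region_name=region)
    return lambda: client.describe_endpoint(EndpointName=endpoint_name)


# Usage (requires credentials):
# wait_for_endpoint(describe_with_boto3("hadron-embed-nomic-v1", "us-east-1"))
```

boto3 also ships a built-in `endpoint_in_service` waiter on the SageMaker client; the explicit loop is shown here only because it surfaces FailureReason directly.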
Step 2: Set the embedding-backend environment variables¶
Three required variables select and locate the SageMaker backend; a fourth is optional:
| Env var | Required | Notes |
|---|---|---|
| EMBEDDING_BACKEND | yes | Set to sagemaker. Selects the SageMaker client over the default HTTP client. |
| EMBEDDING_SAGEMAKER_ENDPOINT_NAME | yes | The SageMaker endpoint's name (not a URL, not an ARN), e.g. hadron-embed-nomic-v1. |
| EMBEDDING_SAGEMAKER_REGION | yes | AWS region of the endpoint, e.g. us-east-1. |
| EMBEDDING_SAGEMAKER_INFERENCE_COMPONENT | no | Only set for multi-model endpoints. Names the inference component within the endpoint. |
Three more variables tune the model + dimension:
| Env var | Required | Default | Notes |
|---|---|---|---|
| EMBEDDING_MODEL | recommended | nomic-embed-text | Stored per-vector for FR-014 future-migration observability. Not sent to the SageMaker request body (the endpoint hosts a single model). Set this to match the HF_MODEL_ID you deployed in Step 1c (e.g. modernbert-embed-base) so the lineage label on stored vectors reflects what actually produced them. The default value matches the Ollama tag (Option A's HTTP-backend path), not your SageMaker deploy. |
| EMBEDDING_DIM | no | 768 | MUST equal the pgvector vector(N) column dimension on the Hadron database. Mismatched embeds are rejected. |
| EMBEDDING_PROVIDER | no | aws-sagemaker | Stored per-vector for lineage. Override to disambiguate multi-tenant deploys (e.g. mm-prod vs hadron-prod). |
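A pre-flight check of the environment can catch a missing or malformed value before the server restart. The sketch below is not Hadron's own loader; `load_embedding_config` is a hypothetical helper that only encodes the required/default columns from the two tables above.

```python
import os


def load_embedding_config(env: dict[str, str]) -> dict:
    """Validate the SageMaker embedding variables and apply the documented defaults."""
    if env.get("EMBEDDING_BACKEND") != "sagemaker":
        raise ValueError("EMBEDDING_BACKEND must be 'sagemaker' for this path")

    required = ("EMBEDDING_SAGEMAKER_ENDPOINT_NAME", "EMBEDDING_SAGEMAKER_REGION")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))

    endpoint_name = env["EMBEDDING_SAGEMAKER_ENDPOINT_NAME"]
    # The table asks for the bare endpoint name, not an ARN or a URL.
    if endpoint_name.startswith(("arn:", "http://", "https://")):
        raise ValueError("EMBEDDING_SAGEMAKER_ENDPOINT_NAME must be a name, not an ARN or URL")

    return {
        "endpoint_name": endpoint_name,
        "region": env["EMBEDDING_SAGEMAKER_REGION"],
        "inference_component": env.get("EMBEDDING_SAGEMAKER_INFERENCE_COMPONENT") or None,
        "model": env.get("EMBEDDING_MODEL", "nomic-embed-text"),
        "dim": int(env.get("EMBEDDING_DIM", "768")),
        "provider": env.get("EMBEDDING_PROVIDER", "aws-sagemaker"),
    }


if __name__ == "__main__":
    print(load_embedding_config(dict(os.environ)))
```

Run it in the same environment the Hadron process will start in (for example under your process manager's env file) so it sees the same values.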
Step 3: Create a dedicated IAM principal for Hadron (recommended)¶
This is a different IAM principal from the SageMaker execution
role you created (or reused) in Step 1a.
The execution role lets SageMaker pull the container; this
principal lets the Hadron embedding worker call InvokeEndpoint
against the deployed endpoint. Two separate identities, two separate
trust boundaries.
A typical AWS host has one set of credentials that any AWS SDK call will pick up — convenient, but a leak of those credentials grants the attacker whatever IAM action the host's default identity can perform (S3, the rest of the account surface, occasionally cross-service escalation paths). The SageMaker backend supports an opt-in dedicated-credentials path that scopes the embedding worker's AWS identity to a single IAM action on a single endpoint ARN.
Create the IAM user¶
In the AWS Console (or via the CLI), create a new IAM user — for
example hadron-embed-sagemaker — with programmatic access only
(no Console password). Generate an access key pair for the user and
record both halves.
Attach the minimal policy¶
Attach an inline policy with exactly one statement:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "sagemaker:InvokeEndpoint",
    "Resource": "arn:aws:sagemaker:<region>:<account-id>:endpoint/<endpoint-name>"
  }]
}
Replace <region>, <account-id>, and <endpoint-name> with
your values. For example, with the endpoint name
hadron-embed-nomic-v1 in region us-east-1 and account
123456789012, the Resource value is:
arn:aws:sagemaker:us-east-1:123456789012:endpoint/hadron-embed-nomic-v1
A leaked credential with this policy can do exactly one thing: invoke that endpoint. It cannot enumerate the account, cannot touch S3, cannot reach CloudWatch logs, cannot assume other roles. The blast radius collapses to the single IAM action.
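Generating the policy document from the three values avoids hand-editing the ARN. The sketch below just fills in the template shown above; `invoke_only_policy` is a hypothetical helper name, and the example values are the ones used in this step.

```python
import json


def invoke_only_policy(region: str, account_id: str, endpoint_name: str) -> dict:
    """Build the single-statement policy from Step 3 for one endpoint."""
    arn = f"arn:aws:sagemaker:{region}:{account_id}:endpoint/{endpoint_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": arn,
        }],
    }


policy = invoke_only_policy("us-east-1", "123456789012", "hadron-embed-nomic-v1")
print(json.dumps(policy, indent=2))
```

The printed JSON can be saved to a file and attached as the user's inline policy, for example with `aws iam put-user-policy` and a `file://` policy document.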
Pass the credentials to Hadron¶
Set the access key pair on the Hadron server's environment:
| Env var | Required | Notes |
|---|---|---|
| EMBEDDING_SAGEMAKER_ACCESS_KEY_ID | required (paired) | Access key from the dedicated IAM user. Must be set together with EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY. |
| EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY | required (paired) | Secret access key from the dedicated IAM user. |
| EMBEDDING_SAGEMAKER_SESSION_TOKEN | no | STS session token. Set only when using temporary credentials (e.g. from sts:AssumeRole). |
When both EMBEDDING_SAGEMAKER_ACCESS_KEY_ID and
EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY are present, the worker
constructs its SageMaker client using these credentials directly —
the AWS SDK's default credential chain is bypassed for this one
client. A half-configured pair (only one of the two set) refuses
to boot, so a mis-set Doppler value errors loudly rather than
silently falling back.
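The paired-credential rule described above can be illustrated with boto3. This is a sketch of the behavior, not the worker's actual code: `sagemaker_client_kwargs` and `make_runtime_client` are hypothetical names, and the error text mirrors the boot message listed under Common errors.

```python
def sagemaker_client_kwargs(env: dict[str, str]) -> dict:
    """Return boto3.client keyword arguments for the embedding client."""
    key_id = env.get("EMBEDDING_SAGEMAKER_ACCESS_KEY_ID")
    secret = env.get("EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY")

    # A half-configured pair is an error, never a silent fallback.
    if bool(key_id) != bool(secret):
        raise ValueError(
            "EMBEDDING_SAGEMAKER_ACCESS_KEY_ID and "
            "EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY must be set together"
        )

    kwargs = {"region_name": env["EMBEDDING_SAGEMAKER_REGION"]}
    if key_id and secret:
        # Explicit credentials bypass the default chain for this one client.
        kwargs["aws_access_key_id"] = key_id
        kwargs["aws_secret_access_key"] = secret
        token = env.get("EMBEDDING_SAGEMAKER_SESSION_TOKEN")
        if token:
            kwargs["aws_session_token"] = token
    return kwargs


def make_runtime_client(env: dict[str, str]):
    """Create the sagemaker-runtime client (needs boto3 installed)."""
    import boto3  # lazy import so the validation above runs without boto3

    return boto3.client("sagemaker-runtime", **sagemaker_client_kwargs(env))
```

When neither variable is set, the returned arguments carry only the region, so boto3 falls through to its standard credential chain.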
Or skip the dedicated user¶
Leave EMBEDDING_SAGEMAKER_ACCESS_KEY_ID and
EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY unset and the SageMaker
client falls through to the AWS SDK's standard credential chain
(IAM role on the host, AWS_PROFILE, AWS_ACCESS_KEY_ID /
AWS_SECRET_ACCESS_KEY, etc.). Use this path for dev environments
or single-tenant deploys where a separate IAM principal isn't worth
the secrets-manager overhead.
Step 4: Verify the wiring¶
Restart the Hadron server so the new env vars take effect, then confirm the configuration. Three checks, in order of cheapness:
Check 1: server identity (no AWS call)¶
Call the h-server-info MCP tool. The response confirms the server
is running but does not exercise the SageMaker path. Useful as a
sanity check that you're connected to the right Hadron server before
the rest of the verification.
Check 2: enable the index on a test memory¶
In the portal — or via the GraphQL updateMemory mutation — enable
the vector index on a single test memory:
mutation {
  updateMemory(
    id: "<memoryId>"
    vectorIndexEnabled: true
    embeddingSource: abstract
  ) {
    id
    vectorIndexEnabled
  }
}
If the memory is encrypted, also pass
acknowledgeVectorInversionRisk: true. See
RAG vector index — Encrypted memories
for the disclosure-text contract.
Write a node with an abstract to that memory and wait for the
embedding worker to drain (typically seconds). The node's
embeddingPendingAt field clears on success.
Check 3: search the memory¶
Call h-find-nodes with mode: vector and a natural-language
query. A non-empty result means the worker is reaching SageMaker,
embedding successfully, and the vectors are landing in pgvector.
If the result is empty AND reason: "embedding_unavailable" is on
the envelope, the worker can't reach the configured backend — check
the Hadron server's logs for the embedding-worker error
(typically an AWS SDK error name like ValidationException for a
wrong endpoint name or AccessDeniedException for an IAM issue).
The error reason also lands on each affected node's embeddingError
field, surfaced by h-validate as [embed-failed] <node-urn> —
<reason>.
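When Check 3 comes back empty, it can help to call the endpoint directly and take Hadron out of the picture. The sketch below assumes the TEI container accepts a JSON body of the form `{"inputs": [...]}` and returns a JSON array of vectors; that request shape and the helper names are assumptions, not something this page specifies. boto3 is imported lazily so the parsing logic can be checked offline.

```python
import json


def build_request(texts: list[str]) -> bytes:
    """Encode texts in the request shape assumed for the TEI container."""
    return json.dumps({"inputs": texts}).encode("utf-8")


def parse_embeddings(body: bytes, expected_dim: int = 768) -> list[list[float]]:
    """Decode the response and verify every vector has the expected dimension."""
    vectors = json.loads(body)
    if not isinstance(vectors, list) or not vectors:
        raise ValueError("expected a non-empty JSON array of vectors")
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {index} has dimension {len(vector)}, expected {expected_dim}"
            )
    return vectors


def smoke_test(endpoint_name: str, region: str, expected_dim: int = 768) -> int:
    """Invoke the endpoint once and return the number of vectors received."""
    import boto3  # lazy import: needs boto3 and credentials allowed to invoke

    runtime = boto3.client("sagemaker-runtime", region_name=region)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_request(["hello from Hadron"]),
    )
    return len(parse_embeddings(response["Body"].read(), expected_dim))


# Usage (requires credentials):
# smoke_test("hadron-embed-nomic-v1", "us-east-1")
```

A dimension error here points at the same mismatch that EMBEDDING_DIM guards against; an AccessDeniedException points back at the Step 3 policy.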
Common errors¶
Deployment (Step 1)¶
| Symptom | Likely cause |
|---|---|
| create-model: ValidationException: Could not find role with arn ... | The execution role ARN in --execution-role-arn doesn't exist, OR the role's trust policy doesn't include sagemaker.amazonaws.com. Re-check the ARN; redo Step 1a if needed. |
| create-endpoint-config: ResourceLimitExceeded | Account-level SageMaker quota hit. Check Service Quotas → SageMaker for the per-region endpoint and instance-type limits. |
| Endpoint stuck on Creating for > 15 minutes | Container image pull failure (wrong URI / region / private ECR perms), or the container is crashing on startup. Check FailureReason in describe-endpoint and the /aws/sagemaker/Endpoints/<name> CloudWatch log group. |
| Endpoint reaches Failed, FailureReason: "The primary container … did not pass the ping health check" + CloudWatch log shows Error: Failed to parse config.json and duplicate field | TEI rejected the model's HuggingFace config because of a strict JSON parser. Known with nomic-embed-text-v1.5's current Hub config (it has a duplicate max_position_embeddings). Swap to nomic-ai/modernbert-embed-base (same Nomic family, same 768 dim, same 8192-token context, clean config) and re-create the model. Or pin to an older revision via HF_MODEL_REVISION if you need that specific model. |
| Endpoint reaches Failed, CloudWatch log shows Error: Model backend is not healthy + FusedMatMul ... Failed to allocate memory for requested buffer of size <large-number> during "Warming up model" | TEI's warmup tries to allocate a full attention matrix at MAX_BATCH_TOKENS. For ModernBERT (12 heads, max-length 8192), a single sequence of N ≤ 8192 tokens allocates roughly (N/8192)² × 3.2 GB, and each further max-length sequence packed into the batch adds about another 3.2 GB. At MAX_BATCH_TOKENS=8192 that's ~3.2 GB; at 16384 (two max-length sequences) it's ~6.4 GB. Even 3.2 GB OOMs on ml.m5.large once you add model weights (~600 MB), ONNX Runtime arena overhead (1–2 GB), and container OS/runtime (~700 MB). Fix: step up to ml.m5.xlarge (16 GB). Dropping MAX_BATCH_TOKENS further (to 4096) lets you stay on ml.m5.large but constrains throughput; instance upgrade is the cleaner answer. |
| Endpoint reaches Failed, FailureReason mentions OOM (without the FusedMatMul detail above) | Instance type too small for the container. If MAX_BATCH_TOKENS is already at the safe value, the cause may be a GPU container running on a CPU instance, or model weights too large for the instance. Step up the instance type. |
Configuration (Steps 2–3)¶
| Symptom | Likely cause |
|---|---|
| Server refuses to boot: "EMBEDDING_SAGEMAKER_ACCESS_KEY_ID and EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY must be set together" | Only one of the two dedicated-credential vars is set. Either set both or unset both. |
| Server refuses to boot: "EMBEDDING_SAGEMAKER_ENDPOINT_NAME is not set" | EMBEDDING_BACKEND=sagemaker was set but the endpoint name wasn't. |
Runtime (Step 4 and beyond)¶
| Symptom | Likely cause |
|---|---|
| embeddingError: ValidationException ... | The endpoint name doesn't match an endpoint in the configured region, OR the region is wrong. Double-check both. |
| embeddingError: AccessDeniedException ... | The IAM principal's policy doesn't allow sagemaker:InvokeEndpoint on the endpoint ARN. Check the Step 3 policy + the ARN substitutions. |
| embeddingError: ModelError ... | The SageMaker endpoint received the request but the model container returned an error. Check CloudWatch logs for the endpoint. |
| embeddingError: EMBEDDING_DIMENSION_MISMATCH ... | The endpoint is returning vectors of a different dimension than EMBEDDING_DIM. Verify the deployed model's dimension matches the pgvector column (default 768). |
See also¶
- RAG vector index: operational reference for the index itself (when to enable it, what embeddingSource does, what h-find-nodes returns).
- The RAG tech stack: background on the two supported backends and the privacy posture behind both.
- AWS SageMaker Real-Time Inference: AWS's documentation on creating, sizing, and operating the endpoint itself.
- Hugging Face SageMaker DLC catalog (Available DLCs): the lookup table for the TEI container URI in your region (referenced from Step 1b). Replaces the older AWS DLC docs page, which dropped TEI from its catalog.
- HuggingFace Text-Embeddings-Inference (TEI): the upstream project. The DLC above wraps this for SageMaker.
- nomic-ai/modernbert-embed-base on HuggingFace Hub: the model TEI loads via the HF_MODEL_ID env var. ModernBERT architecture, 768 dimensions, Matryoshka, Apache-2.0.