Configure AWS SageMaker for vector embeddings

Hadron's RAG vector index uses an embedding server to turn node content into vectors. The Hadron embedding worker supports two backends — an HTTP backend (Ollama, TEI-on-a-VM, vLLM, etc.) and an AWS SageMaker backend. This page walks through the SageMaker path end-to-end: deploy the endpoint via the AWS CLI, set the environment variables on the Hadron server, scope the worker's AWS identity to just the endpoint, and verify the wiring before turning the index on.

For the conceptual background on the two backends and when to pick which, see The RAG tech stack. For the operational reference (every embedding env var, embedding-source options, search modes), see RAG vector index.

Prerequisites

  • AWS CLI v2 installed and configured with a profile that has SageMaker, IAM, and ECR read permissions for the target account. Verify with aws sts get-caller-identity.
  • AWS account permissions to create SageMaker models / endpoints and an IAM user with a custom policy. Account-level quotas matter too — the default endpoint quota per region is plenty for a single Hadron deploy, but check Service Quotas if you're already running other SageMaker workloads.
  • Write access to the Hadron server's environment (Doppler, your process manager's env file, or whatever the deploy uses). The env vars are read at process start.

Step 1: Deploy the SageMaker endpoint with the AWS CLI

If you already have a SageMaker endpoint hosting a 768-dimension embedding model from the Nomic family, skip to Step 2.

Otherwise, the AWS CLI is the cleanest path for a first-pass production deploy: three commands to register the model, define the endpoint shape, and create the endpoint itself. Once you have one working, you can move to CDK / Terraform / CloudFormation when you want the config checked into source.

1a. Create a SageMaker execution role (if you don't have one)

The execution role is what SageMaker assumes to pull the container image and read any model artifacts you stage in S3. This is a distinct IAM principal from the one Hadron uses to invoke the endpoint — that one is created in Step 3 below.

If your account already has a SageMaker execution role (often named something like AmazonSageMaker-ExecutionRole-*), record its ARN and skip to 1b. Otherwise, create one:

# Trust policy — lets the SageMaker service assume the role
cat > /tmp/sagemaker-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "sagemaker.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role \
  --role-name HadronSageMakerExecutionRole \
  --assume-role-policy-document file:///tmp/sagemaker-trust.json

aws iam attach-role-policy \
  --role-name HadronSageMakerExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

The AmazonSageMakerFullAccess managed policy is broad — tighter custom policies exist if you want to scope to specific S3 buckets, ECR repos, and CloudWatch log groups, but the managed policy is the boring starting point.

Record the role ARN (arn:aws:iam::<account-id>:role/HadronSageMakerExecutionRole) for the next sub-step.
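
The role ARN follows a fixed pattern, so it can be assembled rather than copied by hand. A minimal sketch, assuming a standard 12-digit account ID; the helper name `execution_role_arn` is hypothetical and not part of Hadron or any AWS SDK:

```python
import re


def execution_role_arn(account_id: str,
                       role_name: str = "HadronSageMakerExecutionRole") -> str:
    """Build the IAM role ARN used as --execution-role-arn in Step 1c."""
    # AWS account IDs are 12 decimal digits; reject anything else early.
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError(f"account ID must be 12 digits, got {account_id!r}")
    # IAM is a global service, so the region field of the ARN stays empty.
    return f"arn:aws:iam::{account_id}:role/{role_name}"


print(execution_role_arn("123456789012"))
```

The account ID can be read with `aws sts get-caller-identity --query Account --output text`.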

1b. Find the TEI container image URI for your region

The HuggingFace Text-Embeddings-Inference container is published as a Hugging Face Deep Learning Container (DLC) in HuggingFace's ECR repo (account ID 683313688378). The URI's tag — TEI version, Python version, CUDA version — shifts as new TEI releases land.

Recommended discovery — the SageMaker Python SDK helper. The SDK resolves the current URI for the region you pass, so you don't reconstruct strings by hand:

# Shell: install the SDK, pinned below v3
pip install "sagemaker<3.0.0"

# Python: resolve the current image URI for a region
from sagemaker.huggingface import get_huggingface_llm_image_uri

# CPU (e.g. ml.m5.xlarge)
print(get_huggingface_llm_image_uri("huggingface-tei-cpu", region="us-east-1"))

# GPU (e.g. ml.g4dn.xlarge)
print(get_huggingface_llm_image_uri("huggingface-tei", region="us-east-1"))

The SDK v3 release is recent and some helpers may not be fully updated for it yet — pinning sagemaker<3.0.0 for this one lookup is the safer path until that settles. The printed URI is the value you'll substitute into the --primary-container argument in Step 1c.

Why the explicit region=

Omitting region makes the SDK instantiate a Session() to discover the default region from the AWS config, which in turn forces the boto credential chain to load. On a host without AWS credentials configured yet (or one missing the optional botocore[crt] extra), that errors with MissingDependencyException before the URI is ever resolved. Passing region= keeps the lookup credential-free — you'll configure credentials anyway before Step 1c runs, but this one lookup doesn't need them.

Or look it up manually. The current URI table lives at the Hugging Face SageMaker DLC catalog — look for the "Text Embeddings Inference" table. Concrete current examples for us-east-1 (verify before using — these strings shift with each TEI release):

| Accelerator | Container URI |
|---|---|
| CPU | 683313688378.dkr.ecr.us-east-1.amazonaws.com/tei-cpu:2.0.1-tei1.8.2-cpu-py310-ubuntu22.04 |
| GPU | 683313688378.dkr.ecr.us-east-1.amazonaws.com/tei:2.0.1-tei1.8.2-gpu-py310-cu122-ubuntu22.04 |
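
Because these strings are easy to mistype, it can help to split a URI into its parts before pasting it into Step 1c. The sketch below assumes the standard `<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>` layout and assumes the image should come from the same region as the endpoint; `parse_ecr_uri` and `check_tei_uri` are hypothetical helper names.

```python
import re

# Standard commercial-partition ECR image URI layout (an assumption of this sketch).
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repo>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri: str) -> dict:
    """Split an ECR image URI into account, region, repo, and tag."""
    match = ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri!r}")
    return match.groupdict()


def check_tei_uri(uri: str, endpoint_region: str) -> dict:
    """Fail early if the image region differs from the endpoint region."""
    parts = parse_ecr_uri(uri)
    if parts["region"] != endpoint_region:
        raise ValueError(
            f"image is in {parts['region']}, endpoint is in {endpoint_region}"
        )
    return parts


cpu_uri = ("683313688378.dkr.ecr.us-east-1.amazonaws.com/"
           "tei-cpu:2.0.1-tei1.8.2-cpu-py310-ubuntu22.04")
print(check_tei_uri(cpu_uri, "us-east-1"))
```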

Limited region availability

The TEI DLC is published in a smaller set of regions than the general HuggingFace inference DLC. us-east-1 is reliable; some other regions don't have it. If your target region doesn't carry the TEI image, the simplest path is to deploy the endpoint in us-east-1 and set EMBEDDING_SAGEMAKER_REGION=us-east-1 on the Hadron side. The Python SDK helper will surface an error if the region you're targeting doesn't carry the image.

CPU is the right choice for low-to-moderate volume. Step up to GPU (ml.g4dn.xlarge or similar in Step 1d) when embedding throughput becomes the bottleneck — see The RAG tech stack for the choosing-between-them table.

1c. Create the model

This registers the model + container combination as a named SageMaker resource. TEI loads the actual weights from HuggingFace Hub at container startup via the HF_MODEL_ID environment variable, so the standard Nomic case needs no S3 upload of model artifacts.

aws sagemaker create-model \
  --model-name hadron-embed-nomic-v1 \
  --execution-role-arn arn:aws:iam::<account-id>:role/HadronSageMakerExecutionRole \
  --primary-container '{
    "Image": "<TEI-container-URI-from-step-1b>",
    "Environment": {
      "HF_MODEL_ID": "nomic-ai/modernbert-embed-base",
      "MAX_BATCH_TOKENS": "8192"
    }
  }'

MAX_BATCH_TOKENS caps the total tokens TEI will pack into one forward pass — and it also bounds the synthetic batch TEI uses during warmup. 8192 is one max-length ModernBERT sequence per batch and produces a ~3.2 GB attention allocation during warmup. That allocation, plus model weights (~600 MB), plus ONNX Runtime arena overhead (1–2 GB), plus container OS + runtime (~700 MB), needs at minimum a 16 GB instance (ml.m5.xlarge) to run without the BFCArena failing to find contiguous memory. ml.m5.large (8 GB) is too tight for ModernBERT at any reasonable batch size and will OOM during warmup. See the FusedMatMul row in Common Errors below for the exact symptom.
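
The figures above can be reproduced with a little arithmetic. This is a rough sketch, assuming 12 attention heads and 4-byte (float32) attention scores; the weight, arena, and OS numbers are simply the estimates quoted in the paragraph above, and the function names are hypothetical.

```python
GB = 1e9  # decimal gigabytes, matching the figures in the text


def attention_bytes(seq_len: int, heads: int = 12, bytes_per_value: int = 4) -> int:
    """One score matrix per head: heads x seq_len x seq_len values."""
    return heads * seq_len * seq_len * bytes_per_value


def warmup_budget_gb(seq_len: int = 8192) -> tuple:
    """Return (low, high) estimates of total memory needed during warmup."""
    attention = attention_bytes(seq_len) / GB
    weights = 0.6             # model weights, from the text
    os_runtime = 0.7          # container OS + runtime, from the text
    arena_low, arena_high = 1.0, 2.0  # ONNX Runtime arena overhead, from the text
    base = attention + weights + os_runtime
    return base + arena_low, base + arena_high


low, high = warmup_budget_gb()
print(f"attention: {attention_bytes(8192) / GB:.1f} GB")
print(f"total: {low:.1f} to {high:.1f} GB")
```

The upper estimate leaves less than 2 GB of headroom on an 8 GB instance before the instance's own overhead is counted, which is in line with the warmup failures described above; a 16 GB instance leaves several gigabytes spare.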

Why modernbert-embed-base and not nomic-embed-text-v1.5

Both are from Nomic, both are 768-dimension, both Matryoshka, both 8192-token context — but nomic-embed-text-v1.5's current config.json on HuggingFace Hub has a duplicate field that TEI's strict Rust JSON parser rejects (see the "TEI rejected config.json" row in Common Errors below). modernbert-embed-base is Nomic's newer ModernBERT-architecture model and has a clean config. The Hadron spec-033 properties (Matryoshka, 8192-token context, Apache 2.0 license) all carry over.

1d. Create the endpoint config

This defines the instance shape (count and type). For ModernBERT (the model from Step 1c), ml.m5.xlarge is the smallest viable CPU instance: ml.m5.large has 8 GB, which is not enough contiguous memory for the warmup allocation plus runtime overhead (see the explanation in Step 1c above):

aws sagemaker create-endpoint-config \
  --endpoint-config-name hadron-embed-nomic-v1-config \
  --production-variants '[{
    "VariantName": "default",
    "ModelName": "hadron-embed-nomic-v1",
    "InitialInstanceCount": 1,
    "InstanceType": "ml.m5.xlarge"
  }]'

For GPU, swap ml.m5.xlarge → ml.g4dn.xlarge (NVIDIA T4) and use the GPU container URI from 1b. SageMaker's autoscaling lives in the aws application-autoscaling service, which is out of scope here; the fixed-count config above is the right starting point.
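
The commands in Steps 1c and 1d have to agree on the model name, and the JSON arguments are awkward to quote in a shell. As an alternative, the same payloads can be built in Python. This is a sketch only: `build_requests` is a hypothetical helper, and the dictionaries use the parameter names that boto3's `create_model` and `create_endpoint_config` accept.

```python
import json


def build_requests(image_uri: str, role_arn: str,
                   model_name: str = "hadron-embed-nomic-v1",
                   instance_type: str = "ml.m5.xlarge"):
    """Return (create_model kwargs, create_endpoint_config kwargs)."""
    create_model = {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": image_uri,
            "Environment": {
                "HF_MODEL_ID": "nomic-ai/modernbert-embed-base",
                "MAX_BATCH_TOKENS": "8192",
            },
        },
    }
    create_endpoint_config = {
        "EndpointConfigName": f"{model_name}-config",
        "ProductionVariants": [{
            "VariantName": "default",
            "ModelName": model_name,  # must match the model registered in 1c
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }],
    }
    return create_model, create_endpoint_config


model_req, config_req = build_requests(
    "<TEI-container-URI-from-step-1b>",
    "arn:aws:iam::123456789012:role/HadronSageMakerExecutionRole",
)
print(json.dumps(config_req, indent=2))

# With boto3 installed and credentials configured, the payloads could be sent as:
#   client = boto3.client("sagemaker", region_name="us-east-1")
#   client.create_model(**model_req)
#   client.create_endpoint_config(**config_req)
```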

SageMaker Hosting doesn't accept every EC2 instance type

The CreateEndpointConfig API has its own enum of allowed InstanceType values that diverges from EC2's general list. Notably, ml.t3.* is not in the enum — t3.medium works on EC2 but fails CreateEndpointConfig with a ValidationException listing the allowed types. Cheapest viable CPU options that ARE in the enum:

| Instance | vCPU | RAM | ~$/hr | Notes |
|---|---|---|---|---|
| ml.t2.medium | 2 | 4 GB | ~$0.06 | Burstable; cheapest. Variable performance under load. Too small for ModernBERT; use only for non-ModernBERT smoke tests. |
| ml.m5.large | 2 | 8 GB | ~$0.13 | Non-burstable. Insufficient for modernbert-embed-base: TEI's warmup allocation plus runtime overhead doesn't fit in 8 GB. Works for smaller embedding models. |
| ml.m5.xlarge | 4 | 16 GB | ~$0.26 | Recommended for modernbert-embed-base. Comfortable headroom for warmup; runs MAX_BATCH_TOKENS=8192 cleanly with room to tune up. |
| ml.c6i.xlarge | 4 | 8 GB | ~$0.21 | Newer compute-optimized; same 8 GB ceiling as m5.large, so also insufficient for ModernBERT. Compute is good for smaller models that fit. |
| ml.c6i.2xlarge | 8 | 16 GB | ~$0.41 | Compute-optimized with 16 GB. Faster than m5.xlarge but more expensive. Worth it only when you've measured CPU as the bottleneck. |

The full allowed list is region-dependent; if the call rejects your choice it returns the exact enum, so you can pick from that error message.
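
As an illustration of how the table above translates into a choice, the sketch below picks the cheapest listed instance that meets a RAM floor. The prices are the approximate hourly figures from the table, not live pricing, and `cheapest_with_ram` is a hypothetical helper.

```python
from typing import NamedTuple


class Instance(NamedTuple):
    name: str
    vcpu: int
    ram_gb: int
    usd_per_hour: float  # approximate, copied from the table above


INSTANCES = [
    Instance("ml.t2.medium", 2, 4, 0.06),
    Instance("ml.m5.large", 2, 8, 0.13),
    Instance("ml.m5.xlarge", 4, 16, 0.26),
    Instance("ml.c6i.xlarge", 4, 8, 0.21),
    Instance("ml.c6i.2xlarge", 8, 16, 0.41),
]


def cheapest_with_ram(min_ram_gb: int) -> Instance:
    """Return the lowest-priced listed instance with at least min_ram_gb of RAM."""
    candidates = [i for i in INSTANCES if i.ram_gb >= min_ram_gb]
    if not candidates:
        raise ValueError(f"no listed instance has {min_ram_gb} GB or more")
    return min(candidates, key=lambda i: i.usd_per_hour)


# modernbert-embed-base needs the 16 GB tier according to Step 1c.
print(cheapest_with_ram(16).name)
```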

1e. Create the endpoint

This is the async step. The CLI returns immediately; SageMaker provisions the instance and pulls the container in the background.

aws sagemaker create-endpoint \
  --endpoint-name hadron-embed-nomic-v1 \
  --endpoint-config-name hadron-embed-nomic-v1-config

Poll for InService — typically 5-10 minutes for the first creation:

aws sagemaker describe-endpoint \
  --endpoint-name hadron-embed-nomic-v1 \
  --query 'EndpointStatus' --output text

The status walks through Creating → InService on success, or ends at Failed on a container startup error. Pricing starts the moment status reaches InService, not when you call create-endpoint.

If it lands on Failed, check FailureReason in the describe-endpoint JSON output and the CloudWatch log group at /aws/sagemaker/Endpoints/hadron-embed-nomic-v1 for the container startup logs.
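
The polling and failure handling can be scripted instead of re-running the command by hand. The following is a sketch under stated assumptions: `wait_for_endpoint` is a hypothetical helper, and `describe` is any callable returning a dictionary shaped like the describe-endpoint output (for example a lambda wrapping boto3's `describe_endpoint`).

```python
import time
from typing import Callable


def wait_for_endpoint(
    describe: Callable[[], dict],
    timeout_s: float = 1200.0,
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the endpoint is InService; raise on Failed or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        info = describe()
        status = info["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            # FailureReason is the first place to look, then CloudWatch logs.
            reason = info.get("FailureReason", "no FailureReason reported")
            raise RuntimeError(f"endpoint failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint still {status} after {timeout_s:.0f}s")
        sleep(poll_s)


# Example wiring, assuming boto3 is installed and credentials are configured:
#   client = boto3.client("sagemaker", region_name="us-east-1")
#   wait_for_endpoint(
#       lambda: client.describe_endpoint(EndpointName="hadron-embed-nomic-v1"))
```

boto3 also ships a built-in waiter (`client.get_waiter("endpoint_in_service")`) that covers the success path; the explicit loop is shown here because it surfaces FailureReason directly.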

Step 2: Set the embedding-backend environment variables

Three required variables select and locate the SageMaker backend; a fourth is optional:

| Env var | Required | Notes |
|---|---|---|
| EMBEDDING_BACKEND | yes | Set to sagemaker. Selects the SageMaker client over the default HTTP client. |
| EMBEDDING_SAGEMAKER_ENDPOINT_NAME | yes | The SageMaker endpoint's name: not a URL, not an ARN. E.g. hadron-embed-nomic-v1. |
| EMBEDDING_SAGEMAKER_REGION | yes | AWS region of the endpoint, e.g. us-east-1. |
| EMBEDDING_SAGEMAKER_INFERENCE_COMPONENT | no | Only set for multi-model endpoints. Names the inference component within the endpoint. |

Three more variables tune the model + dimension:

| Env var | Required | Default | Notes |
|---|---|---|---|
| EMBEDDING_MODEL | recommended | nomic-embed-text | Stored per-vector for FR-014 future-migration observability. Not sent to the SageMaker request body (the endpoint hosts a single model). Set this to match the HF_MODEL_ID you deployed in Step 1c (e.g. modernbert-embed-base) so the lineage label on stored vectors reflects what actually produced them. The default value matches the Ollama tag (Option A's HTTP-backend path), not your SageMaker deploy. |
| EMBEDDING_DIM | no | 768 | MUST equal the pgvector vector(N) column dimension on the Hadron database. Mismatched embeds are rejected. |
| EMBEDDING_PROVIDER | no | aws-sagemaker | Stored per-vector for lineage. Override to disambiguate multi-tenant deploys (e.g. mm-prod vs hadron-prod). |
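
Before restarting the server, the variables in the two tables above can be sanity-checked offline. This is an illustrative sketch of the documented rules (required names, defaults), not Hadron's actual startup code; `resolve_sagemaker_config` is a hypothetical name.

```python
REQUIRED = ("EMBEDDING_SAGEMAKER_ENDPOINT_NAME", "EMBEDDING_SAGEMAKER_REGION")
DEFAULTS = {
    "EMBEDDING_MODEL": "nomic-embed-text",
    "EMBEDDING_DIM": "768",
    "EMBEDDING_PROVIDER": "aws-sagemaker",
}


def resolve_sagemaker_config(env: dict) -> dict:
    """Apply the documented required/default rules to an environment mapping."""
    if env.get("EMBEDDING_BACKEND") != "sagemaker":
        raise ValueError("EMBEDDING_BACKEND is not set to sagemaker")
    config = {}
    for name in REQUIRED:
        if not env.get(name):
            raise ValueError(f"{name} is not set")
        config[name] = env[name]
    for name, default in DEFAULTS.items():
        config[name] = env.get(name) or default
    # The dimension is compared against the pgvector column, so keep it numeric.
    config["EMBEDDING_DIM"] = int(config["EMBEDDING_DIM"])
    component = env.get("EMBEDDING_SAGEMAKER_INFERENCE_COMPONENT")
    if component:
        config["EMBEDDING_SAGEMAKER_INFERENCE_COMPONENT"] = component
    return config


example_env = {
    "EMBEDDING_BACKEND": "sagemaker",
    "EMBEDDING_SAGEMAKER_ENDPOINT_NAME": "hadron-embed-nomic-v1",
    "EMBEDDING_SAGEMAKER_REGION": "us-east-1",
    "EMBEDDING_MODEL": "modernbert-embed-base",
}
print(resolve_sagemaker_config(example_env))
```

To check a real shell, pass `dict(os.environ)` instead of `example_env`.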

Step 3: Scope the worker's AWS identity to the endpoint

The IAM principal configured in this step is different from the SageMaker execution role you created (or reused) in Step 1a. The execution role lets SageMaker pull the container; this principal lets the Hadron embedding worker call InvokeEndpoint against the deployed endpoint. Two separate identities, two separate trust boundaries.

A typical AWS host has one set of credentials that any AWS SDK call will pick up — convenient, but a leak of those credentials grants the attacker whatever IAM action the host's default identity can perform (S3, the rest of the account surface, occasionally cross-service escalation paths). The SageMaker backend supports an opt-in dedicated-credentials path that scopes the embedding worker's AWS identity to a single IAM action on a single endpoint ARN.

Create the IAM user

In the AWS Console (or via the CLI), create a new IAM user — for example hadron-embed-sagemaker — with programmatic access only (no Console password). Generate an access key pair for the user and record both halves.

Attach the minimal policy

Attach an inline policy with exactly one statement:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "sagemaker:InvokeEndpoint",
    "Resource": "arn:aws:sagemaker:<region>:<account-id>:endpoint/<endpoint-name>"
  }]
}

Substitute <region>, <account-id>, and <endpoint-name> for your values. For example, with the endpoint name hadron-embed-nomic-v1 in region us-east-1 and account 123456789012:

arn:aws:sagemaker:us-east-1:123456789012:endpoint/hadron-embed-nomic-v1

A leaked credential with this policy can do exactly one thing: invoke that endpoint. It cannot enumerate the account, cannot touch S3, cannot reach CloudWatch logs, cannot assume other roles. The blast radius collapses to the single IAM action.
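
To avoid hand-editing the ARN, the policy document can be generated from the three values. A minimal sketch; `invoke_only_policy` is a hypothetical helper name.

```python
import json


def invoke_only_policy(region: str, account_id: str, endpoint_name: str) -> dict:
    """Build the single-statement policy that allows only InvokeEndpoint."""
    arn = f"arn:aws:sagemaker:{region}:{account_id}:endpoint/{endpoint_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": arn,
        }],
    }


policy = invoke_only_policy("us-east-1", "123456789012", "hadron-embed-nomic-v1")
print(json.dumps(policy, indent=2))
```

The printed JSON can be saved to a file and attached with `aws iam put-user-policy --user-name hadron-embed-sagemaker --policy-name <policy-name> --policy-document file://<path>`; the policy name is yours to choose.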

Pass the credentials to Hadron

Set the access key pair on the Hadron server's environment:

| Env var | Required | Notes |
|---|---|---|
| EMBEDDING_SAGEMAKER_ACCESS_KEY_ID | required (paired) | Access key from the dedicated IAM user. Must be set together with EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY. |
| EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY | required (paired) | Secret access key from the dedicated IAM user. |
| EMBEDDING_SAGEMAKER_SESSION_TOKEN | no | STS session token. Set only when using temporary credentials (e.g. from sts:AssumeRole). |

When both EMBEDDING_SAGEMAKER_ACCESS_KEY_ID and EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY are present, the worker constructs its SageMaker client using these credentials directly — the AWS SDK's default credential chain is bypassed for this one client. A half-configured pair (only one of the two set) refuses to boot, so a mis-set Doppler value errors loudly rather than silently falling back.

Or skip the dedicated user

Leave EMBEDDING_SAGEMAKER_ACCESS_KEY_ID and EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY unset and the SageMaker client falls through to the AWS SDK's standard credential chain (IAM role on the host, AWS_PROFILE, AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, etc.). Use this path for dev environments or single-tenant deploys where a separate IAM principal isn't worth the secrets-manager overhead.
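
The three outcomes described in this step (dedicated credentials, fall-through, refuse to boot) can be summarized in a few lines. This is a Python illustration of the rule, not Hadron's implementation; `resolve_credentials` is a hypothetical name, and the returned keys are the keyword arguments that `boto3.client` accepts for explicit credentials.

```python
from typing import Optional


def resolve_credentials(env: dict) -> Optional[dict]:
    """Return explicit client credentials, or None to use the default chain."""
    key_id = env.get("EMBEDDING_SAGEMAKER_ACCESS_KEY_ID")
    secret = env.get("EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY")
    # A half-configured pair is an error, never a silent fallback.
    if bool(key_id) != bool(secret):
        raise ValueError(
            "EMBEDDING_SAGEMAKER_ACCESS_KEY_ID and "
            "EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY must be set together"
        )
    if not key_id:
        return None  # neither set: fall through to the SDK's default chain
    creds = {"aws_access_key_id": key_id, "aws_secret_access_key": secret}
    token = env.get("EMBEDDING_SAGEMAKER_SESSION_TOKEN")
    if token:
        creds["aws_session_token"] = token  # only for temporary credentials
    return creds


print(resolve_credentials({}))
```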

Step 4: Verify the wiring

Restart the Hadron server so the new env vars take effect, then confirm the configuration. Three checks, in order of cheapness:

Check 1: server identity (no AWS call)

Call the h-server-info MCP tool. The response confirms the server is running but does not exercise the SageMaker path. Useful as a sanity check that you're connected to the right Hadron server before the rest of the verification.

Check 2: enable the index on a test memory

In the portal — or via the GraphQL updateMemory mutation — enable the vector index on a single test memory:

mutation {
  updateMemory(
    id: "<memoryId>"
    vectorIndexEnabled: true
    embeddingSource: abstract
  ) {
    id
    vectorIndexEnabled
  }
}

If the memory is encrypted, also pass acknowledgeVectorInversionRisk: true. See RAG vector index — Encrypted memories for the disclosure-text contract.

Write a node with an abstract to that memory and wait for the embedding worker to drain (typically seconds). The node's embeddingPendingAt field clears on success.

Check 3: search the memory

Call h-find-nodes with mode: vector and a natural-language query. A non-empty result means the worker is reaching SageMaker, embedding successfully, and the vectors are landing in pgvector.

If the result is empty AND reason: "embedding_unavailable" is on the envelope, the worker can't reach the configured backend — check the Hadron server's logs for the embedding-worker error (typically an AWS SDK error name like ValidationException for a wrong endpoint name or AccessDeniedException for an IAM issue). The error reason also lands on each affected node's embeddingError field, surfaced by h-validate as [embed-failed] <node-urn> — <reason>.
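
When the logs point at the endpoint rather than at Hadron, it can help to call the endpoint directly with the same credentials. The sketch below makes two assumptions that should be verified against your deployed container: that TEI accepts a JSON body of the form {"inputs": [...]}, and that it responds with a JSON array of vectors. `parse_embeddings` and `smoke_test` are hypothetical helpers.

```python
import json


def parse_embeddings(body: bytes, expected_dim: int = 768) -> list:
    """Decode the response body and check every vector's dimension."""
    vectors = json.loads(body)
    if not isinstance(vectors, list) or not vectors:
        raise ValueError("expected a non-empty JSON array of vectors")
    for vector in vectors:
        if len(vector) != expected_dim:
            raise ValueError(
                f"expected {expected_dim} dimensions, got {len(vector)}"
            )
    return vectors


def smoke_test(endpoint_name: str, region: str, text: str = "hello world") -> list:
    """Invoke the endpoint once; AWS SDK errors propagate with their names."""
    import boto3  # imported lazily so parse_embeddings works without boto3

    client = boto3.client("sagemaker-runtime", region_name=region)
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": [text]}),  # assumed TEI request shape
    )
    return parse_embeddings(response["Body"].read())


# Example: smoke_test("hadron-embed-nomic-v1", "us-east-1")
```

A ValidationException or AccessDeniedException raised here points at the endpoint name, region, or IAM policy; a dimension error points at the deployed model rather than at Hadron.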

Common errors

Deployment (Step 1)

| Symptom | Likely cause |
|---|---|
| create-model: ValidationException: Could not find role with arn ... | The execution role ARN in --execution-role-arn doesn't exist, OR the role's trust policy doesn't include sagemaker.amazonaws.com. Re-check the ARN; redo Step 1a if needed. |
| create-endpoint-config: ResourceLimitExceeded | Account-level SageMaker quota hit. Check Service Quotas → SageMaker for the per-region endpoint and instance-type limits. |
| Endpoint stuck on Creating for > 15 minutes | Container image pull failure (wrong URI / region / private ECR perms), or the container is crashing on startup. Check FailureReason in describe-endpoint and the /aws/sagemaker/Endpoints/<name> CloudWatch log group. |
| Endpoint reaches Failed, FailureReason: "The primary container … did not pass the ping health check" + CloudWatch log shows Error: Failed to parse config.json and duplicate field | TEI rejected the model's HuggingFace config because of a strict JSON parser. Known with nomic-embed-text-v1.5's current Hub config (it has a duplicate max_position_embeddings). Swap to nomic-ai/modernbert-embed-base (same Nomic family, same 768 dim, same 8192-token context, clean config) and re-create the model. Or pin to an older revision via HF_MODEL_REVISION if you need that specific model. |
| Endpoint reaches Failed, CloudWatch log shows Error: Model backend is not healthy + FusedMatMul ... Failed to allocate memory for requested buffer of size <large-number> during "Warming up model" | TEI's warmup tries to allocate one full attention matrix at MAX_BATCH_TOKENS. For ModernBERT (12 heads, max length 8192), a single N-token warmup sequence allocates roughly (N/8192)² × 3.2 GB for N up to 8192; above 8192 the batch holds more than one max-length sequence, so the total grows roughly linearly. At MAX_BATCH_TOKENS=16384 that's ~6.4 GB; at 8192 that's ~3.2 GB. Even 3.2 GB OOMs on ml.m5.large once you add model weights (~600 MB), ONNX Runtime arena overhead (1–2 GB), and container OS/runtime (~700 MB). Fix: step up to ml.m5.xlarge (16 GB). Dropping MAX_BATCH_TOKENS further (to 4096) lets you stay on ml.m5.large but constrains throughput; instance upgrade is the cleaner answer. |
| Endpoint reaches Failed, FailureReason mentions OOM (without the FusedMatMul detail above) | Instance type too small for the container. If MAX_BATCH_TOKENS is already at the safe value, the cause may be a GPU container running on a CPU instance, or model weights too large for the instance. Step up the instance type. |

Configuration (Steps 2–3)

| Symptom | Likely cause |
|---|---|
| Server refuses to boot: "EMBEDDING_SAGEMAKER_ACCESS_KEY_ID and EMBEDDING_SAGEMAKER_SECRET_ACCESS_KEY must be set together" | Only one of the two dedicated-credential vars is set. Either set both or unset both. |
| Server refuses to boot: "EMBEDDING_SAGEMAKER_ENDPOINT_NAME is not set" | EMBEDDING_BACKEND=sagemaker was set but the endpoint name wasn't. |

Runtime (Step 4 and beyond)

| Symptom | Likely cause |
|---|---|
| embeddingError: ValidationException ... | The endpoint name doesn't match an endpoint in the configured region, OR the region is wrong. Double-check both. |
| embeddingError: AccessDeniedException ... | The IAM principal's policy doesn't allow sagemaker:InvokeEndpoint on the endpoint ARN. Check the Step 3 policy + the ARN substitutions. |
| embeddingError: ModelError ... | The SageMaker endpoint received the request but the model container returned an error. Check CloudWatch logs for the endpoint. |
| embeddingError: EMBEDDING_DIMENSION_MISMATCH ... | The endpoint is returning vectors of a different dimension than EMBEDDING_DIM. Verify the deployed model's dimension matches the pgvector column (default 768). |

See also