kagent on EKS

The marquee AWS-native deployment shape: three Kubernetes-native agents sharing one governed memory layer on Amazon EKS, talking to Aurora PostgreSQL for storage and Amazon Bedrock for embeddings + LLM-as-judge.

What this is

A reference multi-agent setup running on kagent — a Kubernetes-native AI agent framework — with memledger as the shared memory + trust layer. Three agents share the same vector store under explicit namespace RBAC and full attribution:

triage-agent — ingests incoming alerts, correlates with past incidents, escalates.
eks-ops-agent — troubleshoots EKS clusters, recalls remediations, learns from incidents.
compliance-agent — audits provenance, enforces retention, generates trust attestations.

Backend: Aurora PostgreSQL with IAM auth + Bedrock embeddings (amazon.titan-embed-text-v2:0, 1024 dims). OpenSearch is supported as an alternative storage backend for hybrid (BM25 + vector) search.

Three agents, one trust layer

`triage-agent` — `/triage/*`

Role: ingest alerts, correlate with past incidents, escalate.

Tools that touch memledger

ingest_alert(source, severity, description) — searches /ops/incidents and /triage/alerts for correlations, then writes a RecordType.EPISODIC record under /triage/alerts with severity-driven confidence (critical=0.9, high=0.85, medium=0.6, low=0.4) and triggered_by=<alert-id>.
correlate_incidents(query, namespace) — search_hybrid() against ops history.
escalate_to_ops(summary, alert_id, severity) — writes a record under /shared/escalations with confidence=0.85.
remember_knowledge(...) / recall_knowledge / recall_context.

Owns: /triage/alerts, /triage/incidents/..., /triage/findings, /shared/escalations.
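
The severity-driven confidence that ingest_alert assigns can be pictured as a plain lookup. The sketch below uses the four values listed above; the helper name confidence_for and the choice to reject unknown severities are assumptions for illustration, not the agent's actual code.

```python
# Severity-to-confidence values as listed for ingest_alert on this page.
SEVERITY_CONFIDENCE = {
    "critical": 0.9,
    "high": 0.85,
    "medium": 0.6,
    "low": 0.4,
}


def confidence_for(severity: str) -> float:
    """Return the confidence for an alert severity (hypothetical helper).

    Unknown severities raise instead of silently falling back to a default;
    whether the real tool behaves this way is an assumption.
    """
    key = severity.strip().lower()
    if key not in SEVERITY_CONFIDENCE:
        raise ValueError(f"unknown severity: {severity!r}")
    return SEVERITY_CONFIDENCE[key]
```

Keeping the mapping in one table makes it easy to audit which alert sources can produce high-confidence records.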

`eks-ops-agent` — `/eks-ops/*`

Role: troubleshoot EKS clusters, recall remediations, learn from incidents.

Tools

recall_knowledge(query, namespace) — searches /incidents/*, /runbooks, /learnings, and (cross-agent) /triage/incidents. The reply prints full record UUIDs so the agent can pass them as derived_from in a follow-up remember_knowledge.
remember_knowledge(...) — typically writes to /eks-ops/remediations/<service> with derived_from=[<triage_record_id>] for cross-agent provenance.
mark_memory_outcome(...) — wraps record_outcome(); flags whether a recalled memory led to a successful resolution.
memory_audit / memory_lineage — read-only views over the audit log and provenance chain.

Owns: /eks-ops/remediations/..., /incidents/..., /runbooks, /learnings, /users/<user_id>/defaults.

`compliance-agent` — `/compliance/*` + cross-namespace audit

Role: enforce retention, scan for staleness, generate trust audits.

Tools

memory_audit(record_id, last_n) — read the audit log of operations.
memory_lineage(...) — full provenance: created_by, derived_from, supersedes chain, accessed-by.
scan_staleness(days) — finds records not accessed for N days; writes a compliance report to /compliance/reports.
enforce_lifecycle(action, scope, namespace, days) — bulk expire-stale / archive-expired / deprecate-conflicting.
check_namespace_compliance(namespace) — RBAC-gated read; surfaces low-confidence/hedged records.
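
As a rough illustration of what scan_staleness(days) selects, the sketch below filters records by last-access time against a cutoff. The function name find_stale and the in-memory mapping are hypothetical; the real tool queries memledger and also writes a report to /compliance/reports, which is not modelled here.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def find_stale(
    last_accessed: Dict[str, datetime],
    days: int,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the IDs of records not accessed within the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    # A record is stale when its last access is strictly older than the cutoff.
    return sorted(rid for rid, ts in last_accessed.items() if ts < cutoff)
```

Passing now explicitly keeps the check deterministic in tests and lets a batch job use one consistent timestamp for a whole scan.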

Owns: /compliance/reports. Read access into all other namespaces granted under the RBAC policy.
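
The ownership and read rules described for the three agents can be sketched as prefix checks. The format of memledger's actual RBAC policy is not shown on this page, so the tables and function names below are assumptions: write prefixes are copied from the "Owns" lines (the per-user /users/<user_id>/defaults namespace is left out for brevity), and the extra read grants are taken from the tool descriptions.

```python
# Write prefixes copied from the "Owns" lines on this page.
OWNED = {
    "triage-agent": ["/triage/alerts", "/triage/incidents", "/triage/findings",
                     "/shared/escalations"],
    "eks-ops-agent": ["/eks-ops/remediations", "/incidents", "/runbooks", "/learnings"],
    "compliance-agent": ["/compliance/reports"],
}
# Cross-agent reads mentioned in the tool descriptions (illustrative, not exhaustive).
EXTRA_READS = {
    "eks-ops-agent": ["/triage/incidents"],
    "triage-agent": ["/ops/incidents"],
}
READ_ALL = {"compliance-agent"}  # granted read access into all other namespaces


def _under(namespace: str, prefix: str) -> bool:
    """True if namespace equals prefix or sits below it on a path boundary."""
    return namespace == prefix or namespace.startswith(prefix + "/")


def can_write(agent_id: str, namespace: str) -> bool:
    return any(_under(namespace, p) for p in OWNED.get(agent_id, []))


def can_read(agent_id: str, namespace: str) -> bool:
    if agent_id in READ_ALL:
        return True
    prefixes = OWNED.get(agent_id, []) + EXTRA_READS.get(agent_id, [])
    return any(_under(namespace, p) for p in prefixes)
```

Matching on a path boundary rather than a bare string prefix keeps a namespace such as /triage/alerts-archive from being treated as part of /triage/alerts.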

How they cooperate

A canonical multi-agent flow through memledger:

                                      (memledger pgvector + Bedrock Titan)
                                                   ▲
alert ── triage-agent ── ingest_alert ─────────────┼── /triage/alerts/<id>
                          │                        │      conf=0.9 (severity=critical)
                          │                        │      triggered_by=alert-...
                          ▼                        │
                      (correlation)                │
                          │                        │
                          ▼                        │
recall  ◄── eks-ops-agent ◄── recall_knowledge ────┤
                          │       (cross-agent     │
                          ▼        /triage/incidents)
                      remember_knowledge ──────────┼── /eks-ops/remediations/payment
                          │  derived_from=[triage] │      conf=0.85
                          ▼                        │      derived_from=[<triage_id>]
                      mark_memory_outcome ─────────┼── (success_count++)
                                                   │
audit ── compliance-agent ── memory_lineage ───────┤
                                                   │
                      scan_staleness ──────────────┴── /compliance/reports
                                                          conf=0.95

When compliance-agent calls memory_lineage on the eks-ops remediation, memledger walks the chain and returns:

chain_depth     = 2
min_confidence  = 0.85          # weakest-link across agent boundaries
agents_involved = ['eks-ops-agent', 'triage-agent']
hops:
  hop 0 origin   eks-ops-agent (conf 0.85) — remediation
  hop 1 derived  triage-agent  (conf 0.90) — root incident
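
The weakest-link summary above can be reproduced with a small walk over derived_from links. This is a sketch under stated assumptions, not memledger's implementation: the Record class and walk_lineage function are hypothetical, and chain_depth is taken to be the number of records visited, which equals the depth only for a linear chain like this one.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """Illustrative stand-in for a stored memory record."""
    record_id: str
    agent_id: str
    confidence: float
    derived_from: List[str] = field(default_factory=list)


def walk_lineage(store: Dict[str, Record], start_id: str) -> dict:
    """Follow derived_from links from start_id and summarise the chain."""
    if start_id not in store:
        raise KeyError(start_id)
    hops, seen, frontier = [], set(), [start_id]
    while frontier:
        record_id = frontier.pop(0)
        if record_id in seen or record_id not in store:
            continue  # skip cycles and dangling references
        seen.add(record_id)
        record = store[record_id]
        hops.append(record)
        frontier.extend(record.derived_from)
    return {
        "chain_depth": len(hops),
        # Weakest link: the chain is only as trustworthy as its least confident hop.
        "min_confidence": min(r.confidence for r in hops),
        "agents_involved": sorted({r.agent_id for r in hops}),
    }


store = {
    "triage-1": Record("triage-1", "triage-agent", 0.90),
    "remediation-1": Record("remediation-1", "eks-ops-agent", 0.85, ["triage-1"]),
}
summary = walk_lineage(store, "remediation-1")
```

With the two records from the flow above, summary carries chain_depth 2, min_confidence 0.85, and both agents, matching the output shown.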

How agents use memledger

Every agent in this reference deployment follows the same shape:

OpenTelemetry init — set up a real TracerProvider + BatchSpanProcessor + OTLPSpanExporter pointed at the cluster OTel collector on :4317. Without this, the global tracer provider stays the default ProxyTracerProvider, which hands out no-op tracers, so memledger spans are lost.
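
A minimal sketch of that initialisation with the OpenTelemetry Python SDK follows. The helper names, the plaintext (insecure) gRPC connection, and the service.name resource attribute are assumptions; the collector hostname is whatever your cluster exposes.

```python
def collector_endpoint(host: str, port: int = 4317) -> str:
    """Build the OTLP/gRPC endpoint URL for the cluster collector."""
    return f"http://{host}:{port}"


def init_tracing(service_name: str, endpoint: str) -> None:
    """Install a real TracerProvider that exports spans over OTLP/gRPC."""
    # Imported here so the module still loads where the OTel SDK is not installed.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    # insecure=True assumes plaintext gRPC to an in-cluster collector.
    exporter = OTLPSpanExporter(endpoint=endpoint, insecure=True)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
```

An agent would call init_tracing once at startup, before creating its Memledger instance, for example with collector_endpoint("otel-collector.observability"), where that hostname is a placeholder.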

Lazy memledger instance — create a Memledger per process with Bedrock embeddings, then opt into tracing:

from memledger import Memledger
from memledger.models import EmbeddingConfig
from memledger.telemetry import instrument

self._ml = await Memledger.create(
    backend_name="pgvector",
    connection_string=self._pg_dsn,
    embedding_config=EmbeddingConfig(
        provider="bedrock",
        model="amazon.titan-embed-text-v2:0",
        dimensions=1024,
    ),
)
instrument(self._ml)   # opt-in OTel wrapping

remember_knowledge — accept and forward the full trust kwargs (confidence, hedged, derived_from, supersedes, agent_id, created_by, workflow_id, triggered_by).
recall_knowledge — print full record UUIDs in the response so downstream agents can pass them as derived_from. Truncated 8-char prefixes break cross-agent linking.
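
The two conventions above can be sketched in a few lines: forward only the trust kwargs that were supplied, and print full UUIDs so they can be parsed back out of a reply. The helper names and the bracketed reply format are hypothetical; only the kwarg names come from this page.

```python
import re
import uuid
from typing import List, Tuple

# Trust kwargs named in the text above; anything else is not forwarded.
TRUST_KWARGS = (
    "confidence", "hedged", "derived_from", "supersedes",
    "agent_id", "created_by", "workflow_id", "triggered_by",
)

# Matches a full, lowercase, hyphenated UUID; an 8-char prefix never matches.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"
)


def pick_trust_kwargs(**kwargs) -> dict:
    """Keep only the trust fields that were actually supplied."""
    return {k: v for k, v in kwargs.items() if k in TRUST_KWARGS and v is not None}


def format_recall(results: List[Tuple[uuid.UUID, str]]) -> str:
    """Render recall results with the full UUID on every line."""
    return "\n".join(f"[{record_id}] {text}" for record_id, text in results)


def extract_record_ids(reply: str) -> List[str]:
    """Pull full UUIDs out of a recall reply for use as derived_from."""
    return UUID_RE.findall(reply)
```

A downstream agent can then pass extract_record_ids(reply) straight into derived_from, whereas a reply that shows only truncated prefixes yields nothing to link against.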

Deploy

Prerequisites

kubectl ≥ 1.30
helm ≥ 3.14
awscli v2 with credentials for the EKS account
docker with buildx
jq

export AWS_REGION=us-west-2
export AWS_PROFILE=<your-profile>
export CLUSTER_NAME=<your-cluster-name>

1. kagent infrastructure

Stand up an EKS cluster with kagent enabled. Minimal Terraform variables:

# kagent.tfvars
kagent_enabled               = true
kagent_database_type         = "sqlite"
kagent_enable_ui             = true
kagent_enable_bedrock_access = true

Apply against your EKS module:

terraform apply \
  -var="cluster_name=<your-cluster-name>" \
  -var="region=us-west-2" \
  -var-file="kagent.tfvars"

For high availability, set kagent_database_type=postgresql and kagent_controller_replicas=3. Reach the kagent UI with:

kubectl port-forward -n <your-namespace> svc/kagent-ui 8080:8080

2. LLM integration (Bedrock recommended)

Bedrock is the recommended LLM provider for AWS-native deployments. With kagent_enable_bedrock_access=true, Terraform configures IRSA on the kagent service account; you then declare a ModelConfig:

apiVersion: kagent.dev/v1alpha2
kind: ModelConfig
metadata:
  name: claude-sonnet
  namespace: <your-namespace>
spec:
  provider: Bedrock
  model: us.anthropic.claude-sonnet-4-6
  region: us-west-2

Self-hosted vLLM (OpenAI-compatible) and OpenAI APIs are also supported.

3. Helm-install memledger

export MEMLEDGER_CHART=/path/to/memledger-core/charts/memledger

helm upgrade --install memledger "$MEMLEDGER_CHART" \
  --namespace <your-namespace> \
  --set database.deploy=true \
  --set database.migration.enabled=true \
  --set database.migration.tableName=agent_memory \
  --set database.migration.vectorDimensions=1024 \
  --set embeddings.provider=bedrock \
  --set embeddings.model=amazon.titan-embed-text-v2:0 \
  --set embeddings.dimensions=1024 \
  --set memledger.defaultBackend=pgvector \
  --wait --timeout 5m
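
The command above sets the vector column width and the embedding size to the same value (1024). A small pre-flight check, sketched below, can catch a mismatch before installing. It assumes the migration creates a fixed-width vector column, in which case pgvector rejects vectors of a different length; check_dimensions is a hypothetical helper, not part of the chart.

```python
# Output sizes that Titan Text Embeddings V2 can produce.
TITAN_V2_DIMENSIONS = {256, 512, 1024}


def check_dimensions(values: dict) -> list:
    """Return a list of problems found in a flat dict of Helm --set values."""
    problems = []
    column = int(values["database.migration.vectorDimensions"])
    embedding = int(values["embeddings.dimensions"])
    if column != embedding:
        problems.append(f"vector column is {column} wide but embeddings are {embedding}")
    model = values.get("embeddings.model", "")
    if model.startswith("amazon.titan-embed-text-v2") and embedding not in TITAN_V2_DIMENSIONS:
        problems.append(f"{model} does not produce {embedding}-dimensional vectors")
    return problems
```

An empty list means the three settings agree; anything else is worth fixing before the migration runs, since changing a vector column's width later requires re-embedding the stored records.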

For Aurora PostgreSQL with IAM auth (recommended AWS-native path), set database.deploy=false and point at your Aurora cluster — see Aurora PostgreSQL. For OpenSearch hybrid search, see OpenSearch.

4. Build & deploy the three agent images

Each agent has a build-and-deploy.sh that builds a linux/amd64 image, pushes to ECR, and runs helm upgrade --install for the agent chart with an IRSA-annotated ServiceAccount.

export TF_DIR=/path/to/eks-cluster/terraform
export REPO_DIR=/path/to/this/repo
export MEMLEDGER_PG_DSN="postgresql://memledger:<DB_PASSWORD>@memledger-pgvector:5432/memledger"

# eks-ops-agent — pgvector mode with EKS MCP tools
cd "$REPO_DIR/examples/agentic/eks-ops-agent"
ENABLE_MCP_TOOLS=true ENABLE_MEMORY=true ./build-and-deploy.sh

# triage-agent
cd "$REPO_DIR/examples/agentic/triage-agent"
./build-and-deploy.sh

# compliance-agent
cd "$REPO_DIR/examples/agentic/compliance-agent"
./build-and-deploy.sh

Expected output: three "Push complete" lines and three "Deploy complete!" banners, after which the agent pods reach 1/1 Running in your namespace.
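
To check pod readiness in a script rather than by eye, the output of kubectl get pods -n <your-namespace> -o json can be parsed as sketched below. The function name not_ready_pods is hypothetical; the JSON fields it reads (status.phase, status.containerStatuses[].ready, metadata.name) are standard Kubernetes Pod fields.

```python
import json
from typing import List


def not_ready_pods(pod_list_json: str) -> List[str]:
    """Names of pods that are not Running with every container ready."""
    pending = []
    for pod in json.loads(pod_list_json).get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        ready = (
            status.get("phase") == "Running"
            and bool(containers)
            and all(c.get("ready") for c in containers)
        )
        if not ready:
            pending.append(pod["metadata"]["name"])
    return pending
```

An empty result corresponds to every pod showing all containers ready; any names returned are the pods to inspect with kubectl describe or kubectl logs.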

5. Apply agent CRDs

The build scripts already render the kagent Agent CRD via Helm. To re-apply CRDs after editing an agent prompt:

kubectl apply -f "$REPO_DIR/examples/agentic/triage-agent/triage-agent.yaml"
kubectl apply -f "$REPO_DIR/examples/agentic/eks-ops-agent/eks-ops-agent.yaml"
kubectl apply -f "$REPO_DIR/examples/agentic/compliance-agent/compliance-agent.yaml"

kubectl get agents -n <your-namespace>

Reference deployment

A working reference deployment lives in aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow, on the memory-integration branch under examples/agentic/. It contains the Terraform, Helm charts, and agent images that backed the v2.0.0 launch validation.

Next steps

Aurora PostgreSQL — IAM auth, token rotation, the AWS-native storage path
OpenSearch — hybrid (BM25 + vector) search
Bedrock — embeddings + LLM-as-judge
Phoenix observability — span inventory and dashboards
MAI evaluation — three-tier evaluator suite scored against the validation fixture set
