kagent on EKS
The marquee AWS-native deployment shape: three Kubernetes-native agents sharing one governed memory layer on Amazon EKS, talking to Aurora PostgreSQL for storage and Amazon Bedrock for embeddings + LLM-as-judge.
What this is
A reference multi-agent setup running on kagent — a Kubernetes-native AI agent framework — with memledger as the shared memory + trust layer. Three agents share the same vector store under explicit namespace RBAC and full attribution:
- triage-agent: ingests incoming alerts, correlates with past incidents, escalates.
- eks-ops-agent: troubleshoots EKS clusters, recalls remediations, learns from incidents.
- compliance-agent: audits provenance, enforces retention, generates trust attestations.
Backend: Aurora PostgreSQL with IAM auth + Bedrock embeddings (amazon.titan-embed-text-v2:0, 1024 dims). OpenSearch is supported as an alternative storage backend for hybrid (BM25 + vector) search.
Three agents, one trust layer
triage-agent — /triage/*
Role: ingest alerts, correlate with past incidents, escalate.
Tools that touch memledger
- ingest_alert(source, severity, description): searches /ops/incidents and /triage/alerts for correlations, then writes a RecordType.EPISODIC record under /triage/alerts with severity-driven confidence (critical=0.9, high=0.85, medium=0.6, low=0.4) and triggered_by=<alert-id>.
- correlate_incidents(query, namespace): search_hybrid() against ops history.
- escalate_to_ops(summary, alert_id, severity): writes a record under /shared/escalations with confidence=0.85.
- remember_knowledge(...) / recall_knowledge / recall_context.
Owns: /triage/alerts, /triage/incidents/..., /triage/findings, /shared/escalations.
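The severity-driven confidence that ingest_alert attaches can be sketched in plain Python. The four values come from the tool description above; the helper name `confidence_for_severity`, the case-insensitive lookup, and the choice to raise on unknown severities are illustrative assumptions, not part of the memledger or kagent API.

```python
# Severity-to-confidence mapping, using the values listed for ingest_alert.
# The helper name and its error handling are hypothetical.
SEVERITY_CONFIDENCE = {
    "critical": 0.9,
    "high": 0.85,
    "medium": 0.6,
    "low": 0.4,
}


def confidence_for_severity(severity: str) -> float:
    """Return the confidence to attach to an alert record."""
    key = severity.strip().lower()
    if key not in SEVERITY_CONFIDENCE:
        raise ValueError(f"unknown severity: {severity!r}")
    return SEVERITY_CONFIDENCE[key]
```

Rejecting unknown severities, rather than falling back to a default, keeps a mistyped severity from silently producing a record with misleading confidence.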
eks-ops-agent — /eks-ops/*
Role: troubleshoot EKS clusters, recall remediations, learn from incidents.
Tools
- recall_knowledge(query, namespace): searches /incidents/*, /runbooks, /learnings, and (cross-agent) /triage/incidents. The reply prints full record UUIDs so the agent can pass them as derived_from in a follow-up remember_knowledge.
- remember_knowledge(...): typically writes to /eks-ops/remediations/<service> with derived_from=[<triage_record_id>] for cross-agent provenance.
- mark_memory_outcome(...): wraps record_outcome(); flags whether a recalled memory led to a successful resolution.
- memory_audit / memory_lineage: read-only views over the audit log and provenance chain.
Owns: /eks-ops/remediations/..., /incidents/..., /runbooks, /learnings, /users/<user_id>/defaults.
compliance-agent — /compliance/* + cross-namespace audit
Role: enforce retention, scan for staleness, generate trust audits.
Tools
- memory_audit(record_id, last_n): reads the audit log of operations.
- memory_lineage(...): returns full provenance (created_by, derived_from, supersedes chain, accessed-by).
- scan_staleness(days): finds records not accessed for N days; writes a compliance report to /compliance/reports.
- enforce_lifecycle(action, scope, namespace, days): bulk expire-stale / archive-expired / deprecate-conflicting.
- check_namespace_compliance(namespace): RBAC-gated read; surfaces low-confidence/hedged records.
Owns: /compliance/reports. Read access into all other namespaces granted under the RBAC policy.
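As a rough illustration of what a staleness scan such as scan_staleness(days) has to decide, the sketch below filters records by last access time. The `MemoryRecord` dataclass, its field names, and `find_stale` are hypothetical stand-ins; the real tool works against the memledger store and also writes a report, which is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class MemoryRecord:
    # Hypothetical stand-in for a stored record; field names are assumptions.
    record_id: str
    namespace: str
    last_accessed: datetime


def find_stale(records, days, now=None):
    """Return records last accessed more than `days` days before `now`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [r for r in records if r.last_accessed < cutoff]


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
records = [
    MemoryRecord("a", "/runbooks", now - timedelta(days=10)),
    MemoryRecord("b", "/learnings", now - timedelta(days=40)),
]
stale_ids = [r.record_id for r in find_stale(records, days=30, now=now)]
```

Passing `now` explicitly keeps the check deterministic, which is convenient when the same logic runs in tests and in a scheduled scan.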
How they cooperate
A canonical multi-agent flow through memledger:
                                 (memledger pgvector + Bedrock Titan)
                                                   ▲
alert ── triage-agent ── ingest_alert ─────────────┼── /triage/alerts/<id>
                           │                       │   conf=0.9 (severity=critical)
                           │                       │   triggered_by=alert-...
                           ▼                       │
                     (correlation)                 │
                           │                       │
                           ▼                       │
recall ◄── eks-ops-agent ◄── recall_knowledge ─────┤
                           │ (cross-agent          │
                           ▼  /triage/incidents)   │
                      remember_knowledge ──────────┼── /eks-ops/remediations/payment
                           │ derived_from=[triage] │   conf=0.85
                           ▼                       │   derived_from=[<triage_id>]
                      mark_memory_outcome ─────────┼── (success_count++)
                                                   │
audit ── compliance-agent ── memory_lineage ───────┤
                                                   │
                      scan_staleness ──────────────┴── /compliance/reports
                                                       conf=0.95
When compliance-agent calls memory_lineage on the eks-ops remediation, memledger walks the chain and returns:
chain_depth = 2
min_confidence = 0.85 # weakest-link across agent boundaries
agents_involved = ['eks-ops-agent', 'triage-agent']
hops:
hop 0 origin eks-ops-agent (conf 0.85) — remediation
hop 1 derived triage-agent (conf 0.90) — root incident
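The weakest-link summary above can be reproduced with a small provenance walk. This is a minimal sketch under stated assumptions: the `Record` dataclass, the in-memory `store`, the record ids, and `summarize_lineage` are hypothetical, and chain_depth is taken to mean the number of levels reached by following derived_from links. It is not memledger's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    # Hypothetical in-memory record; real memledger records carry more fields.
    record_id: str
    agent_id: str
    confidence: float
    derived_from: list = field(default_factory=list)


def summarize_lineage(store, record_id):
    """Follow derived_from links level by level and summarize the chain."""
    hops, seen, depth = [], set(), 0
    frontier = [record_id]
    while frontier:
        level = []
        for rid in frontier:
            if rid not in seen:  # guard against cycles and shared ancestors
                seen.add(rid)
                level.append(store[rid])
        if not level:
            break
        depth += 1
        hops.extend(level)
        frontier = [parent for rec in level for parent in rec.derived_from]
    return {
        "chain_depth": depth,
        "min_confidence": min(r.confidence for r in hops),  # weakest link
        "agents_involved": sorted({r.agent_id for r in hops}),
    }


store = {
    "rem-1": Record("rem-1", "eks-ops-agent", 0.85, derived_from=["tri-1"]),
    "tri-1": Record("tri-1", "triage-agent", 0.90),
}
summary = summarize_lineage(store, "rem-1")
```

Taking the minimum confidence across hops means a remediation is never reported as more trustworthy than the least trusted record it was derived from.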
How agents use memledger
Every agent in this reference deployment follows the same shape:
- OpenTelemetry init: set up a real TracerProvider + BatchSpanProcessor + OTLPSpanExporter pointed at the cluster OTel collector on :4317. Without this the global tracer is a no-op ProxyTracerProvider and memledger spans are lost.
- Lazy memledger instance: create a Memledger per process with Bedrock embeddings, then opt into tracing:

  from memledger import Memledger
  from memledger.models import EmbeddingConfig
  from memledger.telemetry import instrument

  self._ml = await Memledger.create(
      backend_name="pgvector",
      connection_string=self._pg_dsn,
      embedding_config=EmbeddingConfig(
          provider="bedrock",
          model="amazon.titan-embed-text-v2:0",
          dimensions=1024,
      ),
  )
  instrument(self._ml)  # opt-in OTel wrapping

- remember_knowledge: accept and forward the full trust kwargs (confidence, hedged, derived_from, supersedes, agent_id, created_by, workflow_id, triggered_by).
- recall_knowledge: print full record UUIDs in the response so downstream agents can pass them as derived_from. Truncated 8-char prefixes break cross-agent linking.
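One way to catch truncated ids before they reach derived_from is to validate them with the standard library uuid module. The helper name `require_full_uuids` is hypothetical, and the sketch assumes record ids are standard UUID strings, as the text above implies.

```python
import uuid


def require_full_uuids(derived_from):
    """Return canonical UUID strings, rejecting truncated or malformed ids."""
    canonical = []
    for raw in derived_from:
        try:
            canonical.append(str(uuid.UUID(raw)))
        except ValueError as exc:
            raise ValueError(f"not a full record UUID: {raw!r}") from exc
    return canonical


full_id = str(uuid.uuid4())
checked = require_full_uuids([full_id])
```

An 8-character prefix such as `full_id[:8]` fails this check with a ValueError, so the problem surfaces when the record is written rather than later as a broken provenance link.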
Deploy
Prerequisites
- kubectl ≥ 1.30
- helm ≥ 3.14
- awscli v2 with credentials for the EKS account
- docker with buildx
- jq
export AWS_REGION=us-west-2
export AWS_PROFILE=<your-profile>
export CLUSTER_NAME=<your-cluster-name>
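Before starting, it can help to confirm that the prerequisite binaries and environment variables are present. The following is a minimal preflight sketch using only the Python standard library; the function names are hypothetical, and it checks presence only, not the minimum versions listed above.

```python
import os
import shutil

# Binaries from the prerequisites list; "aws" is the AWS CLI v2 executable.
REQUIRED_TOOLS = ["kubectl", "helm", "aws", "docker", "jq"]
REQUIRED_ENV = ["AWS_REGION", "AWS_PROFILE", "CLUSTER_NAME"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_env(names=REQUIRED_ENV, env=None):
    """Return the variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    problems = missing_tools() + missing_env()
    print("preflight ok" if not problems else f"missing: {problems}")
```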
1. kagent infrastructure
Stand up an EKS cluster with kagent enabled. Minimal Terraform variables:
# kagent.tfvars
kagent_enabled = true
kagent_database_type = "sqlite"
kagent_enable_ui = true
kagent_enable_bedrock_access = true
Apply against your EKS module:
terraform apply \
  -var="cluster_name=<your-cluster-name>" \
  -var="region=us-west-2" \
  -var-file="kagent.tfvars"
For high availability, set kagent_database_type=postgresql and kagent_controller_replicas=3. Reach the kagent UI with:
kubectl port-forward -n <your-namespace> svc/kagent-ui 8080:8080
2. LLM integration (Bedrock recommended)
Bedrock is the recommended LLM provider for AWS-native deployments. With kagent_enable_bedrock_access=true Terraform configures IRSA on the kagent service account; you then declare a ModelConfig:
apiVersion: kagent.dev/v1alpha2
kind: ModelConfig
metadata:
  name: claude-sonnet
  namespace: <your-namespace>
spec:
  provider: Bedrock
  model: us.anthropic.claude-sonnet-4-6
  region: us-west-2
Self-hosted vLLM (OpenAI-compatible) and OpenAI APIs are also supported.
3. Helm-install memledger
export MEMLEDGER_CHART=/path/to/memledger-core/charts/memledger
helm upgrade --install memledger "$MEMLEDGER_CHART" \
  --namespace <your-namespace> \
  --set database.deploy=true \
  --set database.migration.enabled=true \
  --set database.migration.tableName=agent_memory \
  --set database.migration.vectorDimensions=1024 \
  --set embeddings.provider=bedrock \
  --set embeddings.model=amazon.titan-embed-text-v2:0 \
  --set embeddings.dimensions=1024 \
  --set memledger.defaultBackend=pgvector \
  --wait --timeout 5m
For Aurora PostgreSQL with IAM auth (recommended AWS-native path), set database.deploy=false and point at your Aurora cluster — see Aurora PostgreSQL. For OpenSearch hybrid search, see OpenSearch.
4. Build & deploy the three agent images
Each agent has a build-and-deploy.sh that builds a linux/amd64 image, pushes to ECR, and runs helm upgrade --install for the agent chart with an IRSA-annotated ServiceAccount.
export TF_DIR=/path/to/eks-cluster/terraform
export REPO_DIR=/path/to/this/repo
export MEMLEDGER_PG_DSN="postgresql://memledger:<DB_PASSWORD>@memledger-pgvector:5432/memledger"
# eks-ops-agent — pgvector mode with EKS MCP tools
cd "$REPO_DIR/examples/agentic/eks-ops-agent"
ENABLE_MCP_TOOLS=true ENABLE_MEMORY=true ./build-and-deploy.sh
# triage-agent
cd "$REPO_DIR/examples/agentic/triage-agent"
./build-and-deploy.sh
# compliance-agent
cd "$REPO_DIR/examples/agentic/compliance-agent"
./build-and-deploy.sh
Expected: three Push complete lines and three Deploy complete! banners. Pods reach 1/1 Running in your namespace.
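The "1/1 Running" expectation can also be checked programmatically from the JSON that kubectl emits. The sketch below relies only on the standard Pod fields status.phase and status.containerStatuses[].ready; the function name and the sample pod names are hypothetical.

```python
import json


def pods_not_ready(pod_list):
    """Return names of pods that are not Running with every container ready."""
    not_ready = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        ok = (
            status.get("phase") == "Running"
            and bool(containers)
            and all(c.get("ready") for c in containers)
        )
        if not ok:
            not_ready.append(pod["metadata"]["name"])
    return not_ready


# Trimmed sample in the shape of `kubectl get pods -o json` (names are made up).
sample = json.loads("""
{"items": [
  {"metadata": {"name": "triage-agent-0"},
   "status": {"phase": "Running", "containerStatuses": [{"ready": true}]}},
  {"metadata": {"name": "compliance-agent-0"},
   "status": {"phase": "Pending"}}
]}
""")
pending = pods_not_ready(sample)
```

In practice the input would come from `kubectl get pods -n <your-namespace> -o json`, parsed with `json.load(sys.stdin)`; an empty result means every pod is running with all containers ready.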
5. Apply agent CRDs
The build scripts already render the kagent Agent custom resource for each agent via Helm. To re-apply the Agent manifests after editing an agent prompt:
kubectl apply -f "$REPO_DIR/examples/agentic/triage-agent/triage-agent.yaml"
kubectl apply -f "$REPO_DIR/examples/agentic/eks-ops-agent/eks-ops-agent.yaml"
kubectl apply -f "$REPO_DIR/examples/agentic/compliance-agent/compliance-agent.yaml"
kubectl get agents -n <your-namespace>
Reference deployment
A working reference deployment lives at aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow on the memory-integration branch under examples/agentic/. It contains the Terraform, Helm charts, and agent images that backed the v2.0.0 launch validation.
Next steps
- Aurora PostgreSQL — IAM auth, token rotation, the AWS-native storage path
- OpenSearch — hybrid (BM25 + vector) search
- Bedrock — embeddings + LLM-as-judge
- Phoenix observability — span inventory and dashboards
- MAI evaluation — three-tier evaluator suite scored against the validation fixture set