kagent on EKS

The flagship AWS-native deployment: three Kubernetes-native agents share one governed memory layer on Amazon EKS, backed by Aurora PostgreSQL for storage and Amazon Bedrock for embeddings and LLM-as-judge.

What this is

A reference multi-agent setup running on kagent — a Kubernetes-native AI agent framework — with memledger as the shared memory + trust layer. Three agents share the same vector store under explicit namespace RBAC and full attribution:

  • triage-agent — ingests incoming alerts, correlates with past incidents, escalates.
  • eks-ops-agent — troubleshoots EKS clusters, recalls remediations, learns from incidents.
  • compliance-agent — audits provenance, enforces retention, generates trust attestations.

Backend: Aurora PostgreSQL with IAM auth + Bedrock embeddings (amazon.titan-embed-text-v2:0, 1024 dims). OpenSearch is supported as an alternative storage backend for hybrid (BM25 + vector) search.

Three agents, one trust layer

triage-agent (/triage/*)

Role: ingest alerts, correlate with past incidents, escalate.

Tools that touch memledger

  • ingest_alert(source, severity, description) — searches /ops/incidents and /triage/alerts for correlations, then writes a RecordType.EPISODIC record under /triage/alerts with severity-driven confidence (critical=0.9, high=0.85, medium=0.6, low=0.4) and triggered_by=<alert-id>.
  • correlate_incidents(query, namespace): runs search_hybrid() against ops history.
  • escalate_to_ops(summary, alert_id, severity) — writes a record under /shared/escalations with confidence=0.85.
  • remember_knowledge(...) / recall_knowledge / recall_context.

Owns: /triage/alerts, /triage/incidents/..., /triage/findings, /shared/escalations.
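
The severity-driven confidence used by ingest_alert can be pictured as a small lookup table. The sketch below is illustrative only: the helper name confidence_for_severity and the fallback for unknown severities are assumptions, not part of memledger or the reference agent.

```python
# Severity-driven confidence, using the values listed for ingest_alert.
SEVERITY_CONFIDENCE = {
    "critical": 0.9,
    "high": 0.85,
    "medium": 0.6,
    "low": 0.4,
}

# Assumption: unknown severities fall back to the lowest listed confidence.
DEFAULT_CONFIDENCE = 0.4


def confidence_for_severity(severity: str) -> float:
    """Return the confidence to attach to an alert record (hypothetical helper)."""
    return SEVERITY_CONFIDENCE.get(severity.strip().lower(), DEFAULT_CONFIDENCE)


if __name__ == "__main__":
    for level in ("critical", "high", "medium", "low"):
        print(level, confidence_for_severity(level))
```

Keeping the mapping in one table makes it easy to audit which confidence an alert record was written with.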

eks-ops-agent (/eks-ops/*)

Role: troubleshoot EKS clusters, recall remediations, learn from incidents.

Tools

  • recall_knowledge(query, namespace) — searches /incidents/*, /runbooks, /learnings, and (cross-agent) /triage/incidents. The reply prints full record UUIDs so the agent can pass them as derived_from in a follow-up remember_knowledge.
  • remember_knowledge(...) — typically writes to /eks-ops/remediations/<service> with derived_from=[<triage_record_id>] for cross-agent provenance.
  • mark_memory_outcome(...) — wraps record_outcome(); flags whether a recalled memory led to a successful resolution.
  • memory_audit / memory_lineage — read-only views over the audit log and provenance chain.

Owns: /eks-ops/remediations/..., /incidents/..., /runbooks, /learnings, /users/<user_id>/defaults.

compliance-agent (/compliance/* + cross-namespace audit)

Role: enforce retention, scan for staleness, generate trust audits.

Tools

  • memory_audit(record_id, last_n) — read the audit log of operations.
  • memory_lineage(...) — full provenance: created_by, derived_from, supersedes chain, accessed-by.
  • scan_staleness(days) — finds records not accessed for N days; writes a compliance report to /compliance/reports.
  • enforce_lifecycle(action, scope, namespace, days) — bulk expire-stale / archive-expired / deprecate-conflicting.
  • check_namespace_compliance(namespace) — RBAC-gated read; surfaces low-confidence/hedged records.
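
The idea behind scan_staleness (find records not accessed for N days) can be sketched with plain Python. This is not the memledger implementation: the MemoryRecord class, its field names, and the find_stale helper are hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class MemoryRecord:
    # Hypothetical stand-in for a stored record; field names are assumptions.
    record_id: str
    namespace: str
    last_accessed: datetime


def find_stale(records, days, now=None):
    """Return records whose last access is more than `days` days before `now`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    # Strict comparison: a record accessed exactly `days` days ago is not stale.
    return [r for r in records if r.last_accessed < cutoff]


# Fixed reference time so the example is reproducible.
NOW = datetime(2025, 1, 31, tzinfo=timezone.utc)
RECORDS = [
    MemoryRecord("a", "/runbooks", NOW - timedelta(days=45)),
    MemoryRecord("b", "/learnings", NOW - timedelta(days=5)),
]
STALE = find_stale(RECORDS, days=30, now=NOW)
print([r.record_id for r in STALE])
```

Passing now explicitly keeps the scan deterministic in tests; a real scan would run against the storage backend instead of an in-memory list.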

Owns: /compliance/reports. Read access into all other namespaces granted under the RBAC policy.
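
The ownership lists above suggest a prefix-based access model. The following sketch encodes them as a toy policy to show the shape of such a check. The data structures and the can_write / can_read helpers are assumptions for illustration, the /users/<user_id>/defaults namespace is left out for brevity, and memledger's actual RBAC policy format may differ.

```python
# Write prefixes taken from the "Owns" lines; the structure itself is an assumption.
WRITE_PREFIXES = {
    "triage-agent": ["/triage/alerts", "/triage/incidents", "/triage/findings",
                     "/shared/escalations"],
    "eks-ops-agent": ["/eks-ops/remediations", "/incidents", "/runbooks", "/learnings"],
    "compliance-agent": ["/compliance/reports"],
}

# Cross-agent read grants mentioned in the text.
EXTRA_READ_PREFIXES = {"eks-ops-agent": ["/triage/incidents"]}

# compliance-agent may read every namespace.
READ_ALL = {"compliance-agent"}


def _under(namespace: str, prefix: str) -> bool:
    """True if namespace equals prefix or is nested below it."""
    return namespace == prefix or namespace.startswith(prefix + "/")


def can_write(agent: str, namespace: str) -> bool:
    return any(_under(namespace, p) for p in WRITE_PREFIXES.get(agent, []))


def can_read(agent: str, namespace: str) -> bool:
    if agent in READ_ALL or can_write(agent, namespace):
        return True
    return any(_under(namespace, p) for p in EXTRA_READ_PREFIXES.get(agent, []))


print(can_write("triage-agent", "/triage/alerts/abc"))
print(can_read("compliance-agent", "/eks-ops/remediations/payment"))
```

Matching on whole path segments avoids treating a sibling such as /triage/alertsX as part of /triage/alerts.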

How they cooperate

A canonical multi-agent flow through memledger:

(All steps go through memledger: pgvector + Bedrock Titan.)

  1. alert → triage-agent → ingest_alert writes /triage/alerts/<id> with conf=0.9 (severity=critical) and triggered_by=alert-...
  2. triage-agent performs correlation.
  3. eks-ops-agent → recall_knowledge reads /triage/incidents (cross-agent).
  4. eks-ops-agent → remember_knowledge writes /eks-ops/remediations/payment with conf=0.85 and derived_from=[<triage_id>].
  5. eks-ops-agent → mark_memory_outcome records the outcome (success_count++).
  6. audit → compliance-agent → memory_lineage inspects the provenance chain.
  7. compliance-agent → scan_staleness writes /compliance/reports with conf=0.95.

When compliance-agent calls memory_lineage on the eks-ops remediation, memledger walks the chain and returns:

chain_depth = 2
min_confidence = 0.85 # weakest-link across agent boundaries
agents_involved = ['eks-ops-agent', 'triage-agent']
hops:
hop 0 origin eks-ops-agent (conf 0.85) — remediation
hop 1 derived triage-agent (conf 0.90) — root incident
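
The weakest-link summary above can be reproduced with a short traversal over derived_from links. This is a sketch under stated assumptions, not memledger's implementation: the Record class, the in-memory store, and summarize_lineage are hypothetical, and chain_depth here counts the records visited, which equals the depth for a linear chain like this one.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Record:
    # Hypothetical record shape; only the fields needed for lineage.
    record_id: str
    agent_id: str
    confidence: float
    derived_from: list = field(default_factory=list)


def summarize_lineage(store, record_id):
    """Walk derived_from links breadth-first and summarize the chain."""
    if record_id not in store:
        raise KeyError(record_id)
    hops, seen, queue = [], set(), deque([record_id])
    while queue:
        current = queue.popleft()
        if current in seen or current not in store:
            continue  # skip cycles and dangling references
        seen.add(current)
        record = store[current]
        hops.append(record)
        queue.extend(record.derived_from)
    return {
        "chain_depth": len(hops),
        # Weakest link: the chain is only as trustworthy as its least confident hop.
        "min_confidence": min(r.confidence for r in hops),
        "agents_involved": sorted({r.agent_id for r in hops}),
    }


STORE = {
    "remediation-1": Record("remediation-1", "eks-ops-agent", 0.85, ["incident-1"]),
    "incident-1": Record("incident-1", "triage-agent", 0.90),
}
SUMMARY = summarize_lineage(STORE, "remediation-1")
print(SUMMARY)
```

Taking the minimum rather than an average means a single low-confidence ancestor lowers the trust of everything derived from it.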

How agents use memledger

Every agent in this reference deployment follows the same shape:

  1. OpenTelemetry init: set up a real TracerProvider with a BatchSpanProcessor and an OTLPSpanExporter pointed at the cluster OTel collector on :4317. Without this, the global tracer provider stays a no-op ProxyTracerProvider and memledger spans are lost.

  2. Lazy memledger instance — create a Memledger per process with Bedrock embeddings, then opt into tracing:

    from memledger import Memledger
    from memledger.models import EmbeddingConfig
    from memledger.telemetry import instrument

    self._ml = await Memledger.create(
        backend_name="pgvector",
        connection_string=self._pg_dsn,
        embedding_config=EmbeddingConfig(
            provider="bedrock",
            model="amazon.titan-embed-text-v2:0",
            dimensions=1024,
        ),
    )
    instrument(self._ml)  # opt-in OTel wrapping

  3. remember_knowledge — accept and forward the full trust kwargs (confidence, hedged, derived_from, supersedes, agent_id, created_by, workflow_id, triggered_by).

  4. recall_knowledge — print full record UUIDs in the response so downstream agents can pass them as derived_from. Truncated 8-char prefixes break cross-agent linking.
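
To illustrate why full UUIDs matter in step 4, the sketch below pulls record IDs out of a recall reply so they can be forwarded as derived_from. The reply format and the extract_record_ids helper are assumptions for illustration; the actual reply text produced by the reference agents may look different.

```python
import re
import uuid

# Matches a full 36-character UUID; an 8-character prefix will not match.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def extract_record_ids(reply: str) -> list[str]:
    """Return unique, canonical UUID strings in order of first appearance."""
    ids = []
    for match in UUID_RE.findall(reply):
        canonical = str(uuid.UUID(match))  # normalizes to lowercase
        if canonical not in ids:
            ids.append(canonical)
    return ids


# Hypothetical recall reply: one full ID and one truncated prefix.
REPLY = (
    "1. [3f2b8c1e-9a4d-4e6b-8f1a-2c7d5e9b0a11] payment-service OOMKilled\n"
    "2. [3f2b8c1e] truncated entry\n"
)
DERIVED_FROM = extract_record_ids(REPLY)
print(DERIVED_FROM)
```

The truncated entry is silently dropped, which is exactly the failure mode to avoid: a downstream agent has nothing usable to link against.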

Deploy

Prerequisites

  • kubectl ≥ 1.30
  • helm ≥ 3.14
  • awscli v2 with credentials for the EKS account
  • docker with buildx
  • jq

export AWS_REGION=us-west-2
export AWS_PROFILE=<your-profile>
export CLUSTER_NAME=<your-cluster-name>
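
Before running the steps below, it can help to confirm that the tools and environment variables above are present. This is an optional preflight sketch: the helper names are hypothetical, and it only checks for presence on PATH, not the minimum versions listed.

```python
import os
import shutil

# The AWS CLI v2 binary is named "aws".
REQUIRED_TOOLS = ("kubectl", "helm", "aws", "docker", "jq")
REQUIRED_ENV = ("AWS_REGION", "AWS_PROFILE", "CLUSTER_NAME")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def missing_env(names=REQUIRED_ENV, env=os.environ):
    """Return the variables that are unset or empty."""
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    print("missing tools:", missing_tools())
    print("missing env:", missing_env())
```

The which and env parameters exist so the checks can be exercised without touching the real machine state.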

1. kagent infrastructure

Stand up an EKS cluster with kagent enabled. Minimal Terraform variables:

# kagent.tfvars
kagent_enabled = true
kagent_database_type = "sqlite"
kagent_enable_ui = true
kagent_enable_bedrock_access = true

Apply against your EKS module:

terraform apply \
  -var="cluster_name=<your-cluster-name>" \
  -var="region=us-west-2" \
  -var-file="kagent.tfvars"

For high availability, set kagent_database_type=postgresql and kagent_controller_replicas=3. Reach the kagent UI with:

kubectl port-forward -n <your-namespace> svc/kagent-ui 8080:8080

2. LLM provider

Bedrock is the recommended LLM provider for AWS-native deployments. With kagent_enable_bedrock_access=true, Terraform configures IRSA on the kagent service account; you then declare a ModelConfig:

apiVersion: kagent.dev/v1alpha2
kind: ModelConfig
metadata:
  name: claude-sonnet
  namespace: <your-namespace>
spec:
  provider: Bedrock
  model: us.anthropic.claude-sonnet-4-6
  region: us-west-2

Self-hosted vLLM (OpenAI-compatible) and OpenAI APIs are also supported.

3. Helm-install memledger

export MEMLEDGER_CHART=/path/to/memledger-core/charts/memledger

helm upgrade --install memledger "$MEMLEDGER_CHART" \
  --namespace <your-namespace> \
  --set database.deploy=true \
  --set database.migration.enabled=true \
  --set database.migration.tableName=agent_memory \
  --set database.migration.vectorDimensions=1024 \
  --set embeddings.provider=bedrock \
  --set embeddings.model=amazon.titan-embed-text-v2:0 \
  --set embeddings.dimensions=1024 \
  --set memledger.defaultBackend=pgvector \
  --wait --timeout 5m

For Aurora PostgreSQL with IAM auth (recommended AWS-native path), set database.deploy=false and point at your Aurora cluster — see Aurora PostgreSQL. For OpenSearch hybrid search, see OpenSearch.
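
The Helm command sets the vector dimension twice: once for the migrated table and once for the embedding model. A pgvector column with a fixed dimension rejects vectors of any other length, so the two values need to agree. The sketch below checks that before installing; the helper names are hypothetical and the check is not part of the chart.

```python
def parse_set_flags(flags):
    """Turn ["a.b=1", "c=2"] style --set values into a dict of strings."""
    values = {}
    for flag in flags:
        key, _, value = flag.partition("=")
        values[key] = value
    return values


def check_dimensions(values):
    """Return the shared dimension, or raise if table and model disagree."""
    column = int(values["database.migration.vectorDimensions"])
    model = int(values["embeddings.dimensions"])
    if column != model:
        raise ValueError(
            f"vector column is {column} dims but embedding model emits {model}"
        )
    return column


# The --set values from the helm command above.
FLAGS = [
    "database.migration.vectorDimensions=1024",
    "embeddings.model=amazon.titan-embed-text-v2:0",
    "embeddings.dimensions=1024",
]
DIMS = check_dimensions(parse_set_flags(FLAGS))
print(DIMS)
```

The same check applies if you switch embedding models later: change both values together, along with the dimensions passed to EmbeddingConfig in the agents.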

4. Build & deploy the three agent images

Each agent has a build-and-deploy.sh that builds a linux/amd64 image, pushes to ECR, and runs helm upgrade --install for the agent chart with an IRSA-annotated ServiceAccount.

export TF_DIR=/path/to/eks-cluster/terraform
export REPO_DIR=/path/to/this/repo
export MEMLEDGER_PG_DSN="postgresql://memledger:<DB_PASSWORD>@memledger-pgvector:5432/memledger"

# eks-ops-agent — pgvector mode with EKS MCP tools
cd "$REPO_DIR/examples/agentic/eks-ops-agent"
ENABLE_MCP_TOOLS=true ENABLE_MEMORY=true ./build-and-deploy.sh

# triage-agent
cd "$REPO_DIR/examples/agentic/triage-agent"
./build-and-deploy.sh

# compliance-agent
cd "$REPO_DIR/examples/agentic/compliance-agent"
./build-and-deploy.sh

Expected: three Push complete lines and three Deploy complete! banners. Pods reach 1/1 Running in your namespace.
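
One practical note on MEMLEDGER_PG_DSN: because the password is embedded in a URL, characters such as @, / or : in it must be percent-encoded or the DSN will be parsed incorrectly. The sketch below builds the DSN with the standard library; build_pg_dsn is a hypothetical helper and the example password is made up.

```python
from urllib.parse import quote


def build_pg_dsn(user, password, host, port, database):
    """Build a postgresql:// URL with the credentials percent-encoded."""
    # safe="" forces encoding of "/", which quote() leaves alone by default.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


DSN = build_pg_dsn("memledger", "p@ss/w:rd", "memledger-pgvector", 5432, "memledger")
print(DSN)
```

The printed value can then be exported as MEMLEDGER_PG_DSN before running the build scripts.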

5. Apply agent custom resources

The build scripts already render each kagent Agent custom resource via Helm. To re-apply the manifests after editing an agent prompt:

kubectl apply -f "$REPO_DIR/examples/agentic/triage-agent/triage-agent.yaml"
kubectl apply -f "$REPO_DIR/examples/agentic/eks-ops-agent/eks-ops-agent.yaml"
kubectl apply -f "$REPO_DIR/examples/agentic/compliance-agent/compliance-agent.yaml"

kubectl get agents -n <your-namespace>

Reference deployment

A working reference deployment lives in aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow on the memory-integration branch, under examples/agentic/. It contains the Terraform, Helm charts, and agent images that backed the v2.0.0 launch validation.

Next steps