Role Overview
As the AI Systems Architect, you’ll own the end-to-end design and delivery of production-grade agentic and Generative AI systems. This is a highly hands-on role requiring deep architectural insight, coding proficiency, and an obsession with performance, scalability, and reliability. You’ll architect secure, cost-efficient AI platforms on AWS, guide developers through complex debugging and optimization, and ensure all systems are observable, governed, and production-ready.
Key Responsibilities
- Architect Production AI Systems: Design robust architectures for agentic systems (planning, reasoning, tool-calling), GenAI/RAG pipelines, and evaluation workflows. Create detailed design documents including flow/UML/sequence diagrams and AWS deployment topologies.
- Optimize for Cost & Performance: Model throughput, latency, concurrency, autoscaling, CPU/GPU sizing, and vector index performance to ensure scalable, efficient deployments.
- Lead Debugging & Stability Efforts: Conduct deep-dive debugging, fix critical defects, and resolve production incidents; pair-program with developers to improve code quality and performance.
- Standardize Agentic Frameworks: Build reference implementations using Semantic Kernel (preferred), LangGraph, AutoGen, or CrewAI with strong schema validation, grounding, and memory management.
- Engineer Retrieval & Search Systems: Architect hybrid retrieval solutions including ingestion, chunking, embeddings, ranking, caching, and freshness management while minimizing hallucination risk.
- Productionize on AWS: Deploy and manage systems using Amazon EKS, Bedrock, S3, SQS/SNS, RDS, and ElastiCache. Integrate IAM/Okta, Secrets Manager, and Datadog for observability, enforcing SLIs/SLOs and error budgets.
- Implement Observability & Monitoring: Set up distributed tracing, metrics, and logging via OpenTelemetry and Datadog (see the sketch after this list). Standardize dashboards, alerts, and incident response workflows.
- Govern Evaluation & Rollouts: Build test and evaluation frameworks (golden sets, A/B experiments, regression suites, and controlled rollouts) to ensure consistent quality across releases.
- Embed Security & Safety: Enforce least privilege, PII protection, and policy compliance through threat modeling, sandboxed execution, and prompt-injection defense.
- Establish Engineering Standards: Create reusable SDKs, connectors, CI/CD templates, and architecture review checklists to promote consistency across teams.
- Cross-Functional Leadership: Collaborate with product, data, and SRE teams for capacity planning, DR strategies, and post-incident RCA reviews. Mentor engineers to strengthen design and reliability practices.
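To make the observability expectation above concrete, here is a minimal, illustrative sketch of per-stage tracing for a RAG request using the OpenTelemetry Python SDK. The service name, span names, and pipeline stages are placeholders (not part of this posting), and a production setup would export spans to the Datadog Agent via OTLP rather than to the console.

```python
# Minimal OpenTelemetry tracing sketch (illustrative only).
# Requires: pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Tag spans with a service name so they group correctly in the backend;
# "agent-gateway" is a placeholder, not a real service named in this role.
provider = TracerProvider(resource=Resource.create({"service.name": "agent-gateway"}))
# ConsoleSpanExporter keeps the sketch self-contained; in production you would
# swap in an OTLP exporter pointed at the Datadog Agent.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("rag.pipeline")

def answer(question: str) -> str:
    # One span per pipeline stage gives per-request latency breakdowns.
    with tracer.start_as_current_span("retrieve") as span:
        span.set_attribute("rag.question_length", len(question))
        docs = ["placeholder context"]  # stand-in for a real vector search
    with tracer.start_as_current_span("generate"):
        return f"answer based on {len(docs)} documents"

if __name__ == "__main__":
    print(answer("What is our p99 latency SLO?"))
```

Per-stage spans like these are what make the dashboards, SLO alerts, and incident workflows described above possible downstream.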
Must-Have Qualifications
- 7–10 years in software/AI engineering, including 4+ years in GenAI application development and 2+ years architecting agentic AI systems.
- Expert in Python 3.11+ (asyncio, typing, packaging, profiling, pytest); a small sketch follows this list.
- Hands-on experience with Semantic Kernel, LangGraph, AutoGen, or CrewAI.
- Proven delivery of GenAI/RAG systems on AWS Bedrock or equivalent vector-based platforms (OpenSearch Serverless, Pinecone, Redis).
- Deep understanding of the AWS ecosystem: EKS, Bedrock, S3, SQS/SNS, RDS, ElastiCache, Secrets Manager, IAM/Okta, Kong API Gateway, Datadog.
- Expertise in observability and incident management using OpenTelemetry and Datadog.
- Strong focus on cost, performance, and security engineering: FinOps mindset, autoscaling, caching, and policy enforcement.
- Exceptional communication: clear diagrams, ADRs, and peer review practices.
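As a rough illustration of the Python 3.11+ expectation (asyncio plus typing), here is a hedged sketch of bounded-concurrency fan-out across hypothetical retrieval backends. The backend names, concurrency limit, and timeout value are assumptions for the example, not part of the role description.

```python
# Bounded-concurrency fan-out across retrieval backends.
# asyncio.TaskGroup and asyncio.timeout require Python 3.11+.
import asyncio

SEM = asyncio.Semaphore(4)  # cap concurrent backend calls

async def query_backend(name: str, query: str) -> tuple[str, list[str]]:
    async with SEM:
        await asyncio.sleep(0.1)  # stand-in for a real network call
        return name, [f"{name}: doc for '{query}'"]

async def hybrid_retrieve(query: str) -> dict[str, list[str]]:
    backends = ["keyword", "vector", "graph"]  # hypothetical backends
    results: dict[str, list[str]] = {}
    async with asyncio.timeout(2.0):           # overall latency budget
        async with asyncio.TaskGroup() as tg:
            tasks = [tg.create_task(query_backend(b, query)) for b in backends]
    for task in tasks:
        name, docs = task.result()
        results[name] = docs
    return results

if __name__ == "__main__":
    print(asyncio.run(hybrid_retrieve("error budgets")))
```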
Nice-to-Have Skills
- Multi-agent orchestration (task decomposition, coordinator-worker, graph-based planning).
- Expertise with vector databases (OpenSearch, Pinecone, pgvector, Redis).
- Experience with AI evaluation, guardrails, and rollout gating.
- Familiarity with frontend agent interfaces, secure APIs, and AuthN/Z best practices.
- Exposure to policy-as-code, multi-tenant architectures, and feature management (Kong, LaunchDarkly, Flipt).
- Experience with CI/CD via GitHub Actions and IaC (Terraform/AWS CloudFormation).