About the job
Role Overview
As the AI Systems Architect, you’ll own the end-to-end design and delivery of production-grade agentic and Generative AI systems. This is a highly hands-on role requiring deep architectural insight, coding proficiency, and an obsession with performance, scalability, and reliability. You’ll architect secure, cost-efficient AI platforms on AWS, guide developers through complex debugging and optimization, and ensure all systems are observable, governed, and production-ready.
Key Responsibilities
- Architect Production AI Systems: Design robust architectures for agentic systems (planning, reasoning, tool-calling), GenAI/RAG pipelines, and evaluation workflows. Create detailed design documents, including flow, UML, and sequence diagrams and AWS deployment topologies.
- Optimize for Cost & Performance: Model throughput, latency, concurrency, autoscaling, CPU/GPU sizing, and vector index performance to ensure scalable, efficient deployments.
- Lead Debugging & Stability Efforts: Conduct deep-dive debugging, fix critical defects, and resolve production incidents; pair-program with developers to improve code quality and performance.
- Standardize Agentic Frameworks: Build reference implementations using Semantic Kernel (preferred), LangGraph, AutoGen, or CrewAI with strong schema validation, grounding, and memory management.
- Engineer Retrieval & Search Systems: Architect hybrid retrieval solutions including ingestion, chunking, embeddings, ranking, caching, and freshness management while minimizing hallucination risk.
- Productionize on AWS: Deploy and manage systems using Amazon EKS, Bedrock, S3, SQS/SNS, RDS, and ElastiCache. Integrate IAM/Okta and Secrets Manager for access and secrets management, and Datadog for observability, enforcing SLIs/SLOs and error budgets.
- Implement Observability & Monitoring: Set up distributed tracing, metrics, and logging via OpenTelemetry and Datadog. Standardize dashboards, alerts, and incident response workflows.
- Govern Evaluation & Rollouts: Build test and evaluation frameworks (golden sets, A/B experiments, regression suites, and controlled rollouts) to ensure consistent quality across releases.
- Embed Security & Safety: Enforce least privilege, PII protection, and policy compliance through threat modeling, sandboxed execution, and prompt-injection defense.
- Establish Engineering Standards: Create reusable SDKs, connectors, CI/CD templates, and architecture review checklists to promote consistency across teams.
- Cross-Functional Leadership: Collaborate with product, data, and SRE teams for capacity planning, DR strategies, and post-incident RCA reviews. Mentor engineers to strengthen design and reliability practices.
Must-Have Qualifications
- 7–10 years in software/AI engineering, including 4+ years in GenAI application development and 2+ years architecting agentic AI systems.
- Expert in Python 3.11+ (asyncio, typing, packaging, profiling, pytest).
- Hands-on experience with Semantic Kernel, LangGraph, AutoGen, or CrewAI.
- Proven delivery of GenAI/RAG systems on AWS Bedrock or equivalent vector-based platforms (OpenSearch Serverless, Pinecone, Redis).
- Deep understanding of the AWS ecosystem: EKS, Bedrock, S3, SQS/SNS, RDS, ElastiCache, Secrets Manager, IAM/Okta, Kong API Gateway, Datadog.
- Expertise in observability and incident management using OpenTelemetry and Datadog.
- Strong focus on cost, performance, and security engineering: FinOps mindset, autoscaling, caching, and policy enforcement.
- Exceptional communication: clear diagrams, ADRs, and peer review practices.
Nice-to-Have Skills
- Multi-agent orchestration (task decomposition, coordinator-worker, graph-based planning).
- Expertise with vector databases (OpenSearch, Pinecone, pgvector, Redis).
- Experience with AI evaluation, guardrails, and rollout gating.
- Familiarity with frontend agent interfaces, secure APIs, and AuthN/AuthZ best practices.
- Exposure to policy-as-code, multi-tenant architectures, and feature management (Kong, LaunchDarkly, Flipt).
- Experience with CI/CD via GitHub Actions and IaC (Terraform / AWS CloudFormation).