About the Role
We are looking for a Systems or Solutions Architect with deep expertise in networking, infrastructure-as-a-service (IaaS), and cloud-scale system design to help architect and optimize AI / ML infrastructure.
The ideal candidate combines strong fundamentals in cloud architecture (AWS or equivalent), networking, compute, and storage, with hands-on experience in Kubernetes, observability, and automation.
You’ll design scalable systems that support large AI workloads, enabling efficient training, inference, and data pipelines across distributed environments.
Key Responsibilities
- Architect and scale AI / ML infrastructure across public cloud (AWS / Azure / GCP) and hybrid environments.
- Design and optimize compute, storage, and network topologies for distributed training and inference clusters.
- Build and manage containerized environments using Kubernetes, Docker, and Helm.
- Develop automation frameworks for provisioning, scaling, and monitoring infrastructure using Python, Go, and IaC (Terraform / CloudFormation).
- Partner with data science and MLOps teams to align on AI infrastructure requirements (GPU / CPU scaling, caching, throughput, latency).
- Implement observability, logging, and tracing using Prometheus, Grafana, CloudWatch, or OpenTelemetry.
- Drive networking automation (BGP, routing, load balancing, VPNs, service meshes) using software-defined networking (SDN) and modern APIs.
- Lead performance, reliability, and cost-optimization efforts for AI training and inference pipelines.
- Collaborate cross-functionally with product, platform, and operations teams to ensure secure, performant, and resilient infrastructure.
Required Qualifications
- Knowledge of AI / ML infrastructure patterns, including distributed training, inference pipelines, and GPU orchestration.
- Bachelor’s or Master’s degree in Computer Science, Information Technology, or related field.
- 10+ years of experience in systems, infrastructure, or solutions architecture roles.
- Deep understanding of:
  - Cloud architecture: AWS (preferred), Azure, or GCP
  - Networking: VPC, Transit Gateway, DNS, routing, peering, load balancing, VPN
  - Compute and storage: EC2, ECS / EKS, S3, EBS, EFS, FSx, caching systems
  - Core infrastructure: virtualization, containers, distributed systems, and OS-level tuning
- Proficiency in Linux systems engineering and scripting with Python and Bash.
- Experience with Kubernetes (EKS / GKE / AKS) for large-scale workload orchestration.
- Experience with Go (Golang) for infrastructure or network automation.
- Familiarity with Infrastructure-as-Code (IaC) tools like Terraform, Ansible, or CloudFormation.
- Experience implementing monitoring and observability systems (Prometheus, Grafana, ELK, Datadog, CloudWatch).
Preferred Qualifications
- Experience with DevOps and MLOps ecosystems (SageMaker, Kubeflow, MLflow, Airflow).
- AWS or cloud certifications such as Solutions Architect Professional or Advanced Networking Specialty.
- Experience in performance benchmarking, security hardening, and cost optimization for compute-intensive workloads.
- Strong collaboration skills and ability to communicate complex infrastructure concepts clearly.