About the Role
We are looking for a Systems or Solutions Architect with deep expertise in networking, infrastructure-as-a-service (IaaS), and cloud-scale system design to help architect and optimize AI / ML infrastructure.
The ideal candidate combines strong fundamentals in cloud architecture (AWS or equivalent), networking, compute, and storage with hands-on experience in Kubernetes, observability, and automation.
You’ll design scalable systems that support large AI workloads, enabling efficient training, inference, and data pipelines across distributed environments.
Key Responsibilities
Architect and scale AI / ML infrastructure across public cloud (AWS / Azure / GCP) and hybrid environments.
Design and optimize compute, storage, and network topologies for distributed training and inference clusters.
Build and manage containerized environments using Kubernetes, Docker, and Helm.
Develop automation frameworks for provisioning, scaling, and monitoring infrastructure using Python, Go, and IaC (Terraform / CloudFormation).
Partner with data science and MLOps teams to align on AI infrastructure requirements (GPU / CPU scaling, caching, throughput, latency).
Implement observability, logging, and tracing using Prometheus, Grafana, CloudWatch, or OpenTelemetry.
Drive networking automation (BGP, routing, load balancing, VPNs, service meshes) using software-defined networking (SDN) and modern APIs.
Lead performance, reliability, and cost-optimization efforts for AI training and inference pipelines.
Collaborate cross-functionally with product, platform, and operations teams to ensure secure, performant, and resilient infrastructure.
Required Qualifications
Knowledge of AI / ML infrastructure patterns, including distributed training, inference pipelines, and GPU orchestration.
Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
10+ years of experience in systems, infrastructure, or solutions architecture roles.
Deep understanding of:
Cloud architecture: AWS (preferred), Azure, or GCP
Networking: VPC, Transit Gateway, DNS, routing, peering, load balancing, VPN
Compute and storage: EC2, ECS / EKS, S3, EBS, EFS, FSx, caching systems
Core infrastructure: virtualization, containers, distributed systems, and OS-level tuning
Proficiency in Linux systems engineering and scripting with Python and Bash.
Experience with Kubernetes (EKS / GKE / AKS) for large-scale workload orchestration.
Experience with Go (Golang) for infrastructure or network automation.
Familiarity with Infrastructure-as-Code (IaC) tools like Terraform, Ansible, or CloudFormation.
Experience implementing monitoring and observability systems (Prometheus, Grafana, ELK, Datadog, CloudWatch).
Preferred Qualifications
Experience with DevOps and MLOps ecosystems (SageMaker, Kubeflow, MLflow, Airflow).
AWS or cloud certifications such as Solutions Architect Professional or Advanced Networking Specialty.
Experience in performance benchmarking, security hardening, and cost optimization for compute-intensive workloads.
Strong collaboration skills and ability to communicate complex infrastructure concepts clearly.
Solution Architect • Saint Thomas Mount, Tamil Nadu, India