Talent.com
Site Reliability Engineer

AION • Bengaluru, KA, IN
30+ days ago
Job description

About AION

AION is building a next-generation AI cloud platform, transforming the future of high-performance computing (HPC) through its decentralized AI cloud. Purpose-built for bare-metal performance, AION democratizes access to compute power for AI training, fine-tuning, inference, data labeling, and beyond.

By leveraging underutilized resources such as idle GPUs and data centers, AION provides a scalable, cost-effective, and sustainable solution tailored for developers, researchers, and enterprises. The platform's innovative Proof of Compute Contribution (PoCC) protocol rewards contributors based on performance, creating a transparent and efficient ecosystem.

Integrated with Tether (USD₮ & USD₮0) for stability and regulatory clarity, AION eliminates volatility, ensuring predictable costs and seamless transactions. With cutting-edge partnerships and a USD-backed economy, AION is pioneering the commoditization of high-performance compute, empowering global innovation and bridging the AI wealth gap.

Led by high-pedigree founders with previous exits, AION is well funded by major VCs and has strategic global partnerships. Headquartered in the US with a global presence, the company is building its initial core team in India.

Who you are

You are a reliability-focused engineer with deep expertise in cloud-native systems and infrastructure automation. You thrive on building robust monitoring solutions and creating self-healing infrastructure. You understand the challenges of maintaining high availability across distributed systems and have experience implementing SRE best practices. You're passionate about creating production-ready environments that can scale efficiently and recover automatically from failures.

Technical Skills & Experience

  • 3-8 years of experience in Site Reliability Engineering or DevOps (exceptional candidates with different experience profiles will be considered)
  • A Tier 1 college education or previous work experience at FAANG / top startups is preferred but not required
  • Cloud Platforms: Deep expertise with AWS, GCP, or Azure infrastructure services
  • Kubernetes: Advanced knowledge of Kubernetes operations, cluster management, and troubleshooting
  • Infrastructure as Code: Strong experience with Terraform, Pulumi, or similar IaC tools
  • Observability: Expertise implementing comprehensive monitoring using Prometheus, Grafana, and the ELK stack
  • Service Mesh: Experience with Istio, Linkerd, or similar service mesh technologies
  • Networking: Understanding of network architectures, DNS, load balancing, and security groups
  • CI/CD: Knowledge of automated deployment pipelines and GitOps workflows
  • Scripting: Proficiency in Bash, Python, or Go for automation scripts
  • Container Technologies: Deep understanding of Docker, containerd, and OCI specifications
  • Security: Knowledge of infrastructure security best practices and compliance requirements
  • Incident Management: Experience with incident response, post-mortems, and developing SOP documentation

Key Responsibilities

  • Design and implement comprehensive monitoring and alerting systems across all AION platforms.
  • Develop automation for infrastructure provisioning, scaling, and recovery using Terraform and Kubernetes.
  • Create and maintain runbooks and playbooks for handling common operational scenarios and incidents.
  • Implement service mesh solutions for observability, traffic management, and security.
  • Design and implement logging systems that provide visibility into complex distributed systems.
  • Own capacity planning and resource optimization across cloud environments.
  • Implement CI/CD pipelines for reliable and consistent deployments across all environments.
  • Design and build self-healing systems that automatically recover from common failure modes.
  • Develop infrastructure for both the compute platform and data annotation services with consistent reliability practices.
  • Design and implement disaster recovery strategies and testing procedures.
  • Create and maintain production, staging, and development environments with appropriate isolation.
  • Collaborate with security teams to implement infrastructure security best practices and compliance requirements.

Location

Individuals in this role are expected to relocate to Bangalore, though exceptions can be made. We offer a hybrid working setup with 3 days in-office. Employees have the flexibility to work from anywhere for a few months each year.

Why Join Us

  • Be part of a mission-driven team at the intersection of web3 and AI, tackling some of the most exciting challenges in the industry.
  • Join the ground floor of an AI startup, with the opportunity to make a significant impact on the company and the industry.
  • Collaborate with top-tier talent from the tech industry.
  • Competitive salary and benefits package.
  • Flexible work environment with opportunities for professional growth and development.

If you are a skilled and motivated Site Reliability Engineer with a passion for building reliable, scalable infrastructure for cutting-edge compute systems, we would love to hear from you.
