Site Reliability Engineer

Sails Software Inc • India

1 day ago

Job description

SRE - AWS

Job Summary

We are looking for an experienced and driven Senior Site Reliability Engineer (SRE) to architect, implement, and maintain robust cloud infrastructure. This role demands a deep understanding of AWS, Kubernetes, and ECS, along with the ability to build scalable, secure, and highly available infrastructure from scratch. The ideal candidate will be a strong advocate for DevOps principles, automation, and reliability, with the skills to support and optimize complex microservices-based architectures.

Key Responsibilities

Infrastructure Design & Implementation

Design and build highly scalable, fault-tolerant, and secure cloud infrastructure using AWS, Kubernetes, and ECS.

Lead efforts in infrastructure as code (IaC) using tools like Terraform or CloudFormation.

Develop and enforce best practices for infrastructure provisioning, security, and cost optimization.

System Reliability & Performance

Ensure availability, performance, scalability, and security of production systems.

Implement observability strategies including monitoring, logging, and alerting using tools such as Prometheus, Grafana, ELK, or Datadog.

Analyze system performance metrics and proactively identify potential issues and bottlenecks.

DevOps & Automation

Build and maintain CI/CD pipelines to streamline code deployments across environments.

Drive automation in infrastructure provisioning, configuration management, and operational tasks.

Ensure repeatable and reliable deployments using containers and orchestration tools like Kubernetes and ECS.

Service Management

Own the SRE lifecycle, including incident management, postmortems, root cause analysis, and runbook creation.

Collaborate closely with development and QA teams to ensure seamless microservices integration, deployment, and lifecycle management.

Maintain service-level objectives (SLOs), service-level agreements (SLAs), and error budgets.

Security & Compliance

Implement and enforce cloud security best practices for networking, identity and access management, and data protection.

Support audits, compliance assessments, and vulnerability remediation.

Monitor for security anomalies and work with security teams to respond to threats.

Technical Skills

6+ years of hands-on experience in Site Reliability Engineering, DevOps, or Cloud Engineering.

Expertise in AWS services such as EC2, S3, RDS, IAM, VPC, Lambda, and CloudWatch.

Strong knowledge of Kubernetes and container orchestration best practices.

Experience managing services on Amazon ECS (Fargate or EC2).

Proficient in infrastructure-as-code tools like Terraform, CloudFormation, or Pulumi.

Skilled in scripting languages such as Python, Bash, or Go.

Solid grasp of networking, load balancing, DNS, and firewall rules in cloud environments.

Deep understanding of microservices architectures, API gateways, and service meshes.

Soft Skills

Proven leadership and cross-functional collaboration skills.

Strong problem-solving and incident-resolution mindset.

Clear communication, documentation, and stakeholder reporting abilities.

Passion for continuous improvement and automation.

Preferred Qualifications

AWS certifications such as AWS Certified DevOps Engineer – Professional, AWS Certified Solutions Architect – Professional, or equivalent.

Familiarity with service meshes like Istio or Linkerd.

Experience with serverless architectures and event-driven systems.

Knowledge of regulatory compliance frameworks (SOC 2, ISO 27001, GDPR) in cloud environments.

Skills – AWS Cloud, CI/CD, EC2, Kubernetes, Grafana, Datadog, Python

Key Responsibilities:

Cloud Platform: GCP

Infrastructure Automation: Design, implement, and manage infrastructure as code using Terraform to provision and manage GCP resources.

Container Orchestration: Deploy and manage Kubernetes clusters, ensuring efficient operation of containerized applications.

Continuous Integration / Continuous Deployment (CI/CD): Develop and maintain CI/CD pipelines using Jenkins to automate application build, test, and deployment processes.

Containerization: Collaborate with development teams to containerize applications using Docker and manage deployments with Helm Charts.

Code Quality Assurance: Integrate and manage SonarQube to ensure code quality and security standards are met.

Monitoring and Logging: Implement and manage monitoring solutions using Datadog to ensure system health, performance, and security.

Collaboration: Work closely with cross-functional teams, including developers, QA, and operations, to streamline processes and improve productivity.

Requirements:

Experience: 5+ years in DevOps or cloud engineering roles, with at least 3 years of relevant experience in the specified technologies.

Technical Proficiency:

o Hands-on experience with GCP services and architecture.

o Proficiency in Terraform for infrastructure as code implementations.

o Strong understanding and experience with Kubernetes and Docker.

o Experience in setting up and managing CI/CD pipelines using Jenkins.

o Familiarity with Helm Charts for application deployment.

o Experience with SonarQube for code quality analysis.

o Proficiency in monitoring and logging tools, particularly Datadog.

Scripting Skills: Proficiency in scripting languages such as Bash or Python is an added advantage.

o Strong problem-solving abilities and analytical thinking.

o Excellent communication skills, both verbal and written.

o Ability to work collaboratively in a team environment.

o Strong organizational and time management skills.

Skills – Terraform, Kubernetes, Cluster, Docker, GCP, SonarQube
