Position Overview:
The Site Reliability Engineer (SRE) is responsible for ensuring the stability, scalability, performance, and reliability of production systems and services. This role bridges software development and operations, using automation, monitoring, and performance optimization to build resilient systems that can scale efficiently and recover quickly from failures.
Key Responsibilities:
Design, build, and maintain highly reliable and scalable systems and infrastructure.
Automate deployment, monitoring, and maintenance processes using DevOps tools and scripts.
Implement and manage CI/CD pipelines to support continuous delivery.
Monitor application performance, identify bottlenecks, and improve uptime and reliability.
Develop and maintain incident response procedures, including root cause analysis and postmortems.
Collaborate with development teams to design systems for fault tolerance, load balancing, and failover.
Manage and optimize cloud infrastructure (AWS, Azure, GCP).
Implement observability solutions: logging, metrics, tracing, and alerting.
Maintain strong security and compliance standards across infrastructure.
Participate in on-call rotations and ensure 24/7 system availability.
Document processes, configurations, and runbooks for operational consistency.
Required Skills & Qualifications:
Bachelor’s degree in Computer Science, Information Technology, or a related field.
Strong knowledge of Linux/Unix systems administration and shell scripting.
Proficiency with automation and configuration tools (Ansible, Terraform, Chef, Puppet).
Experience with cloud platforms (AWS, Azure, or Google Cloud).
Familiarity with containerization and orchestration tools (Docker, Kubernetes).
Solid understanding of CI/CD tools (Jenkins, GitLab CI, CircleCI).
Strong experience with monitoring and observability tools (Prometheus, Grafana, ELK Stack, Datadog).
Knowledge of networking fundamentals, load balancing, and DNS management.
Proficiency in at least one programming language (Python, Go, or Bash).
Excellent analytical, problem-solving, and communication skills.
Preferred Qualifications:
Experience with infrastructure-as-code (IaC) and serverless architectures.
Knowledge of reliability metrics such as SLOs, SLIs, and error budgets.
Exposure to database administration (MySQL, PostgreSQL, MongoDB, Redis).
Familiarity with security practices for cloud-native systems.
Certifications such as AWS Certified DevOps Engineer, Google SRE Certification, or CKA (Certified Kubernetes Administrator).
Site Reliability Engineer • Amravati, Maharashtra, India