Cloud Infrastructure Engineer

super.money • Bengaluru, Republic Of India, IN

1 day ago

Job description

Site Reliability Engineer (SRE) Level 3

Overview :

A Site Reliability Engineer (SRE) Level 3 is a senior technical leadership role focused on designing, implementing, and maintaining large-scale, complex, and highly reliable systems. This role emphasizes a blend of software and systems engineering to ensure the availability, latency, performance, and capacity of critical services. SREs at this level are passionate about quality, efficiency, and reliability, and they play a crucial role in accelerating innovation and driving continuous improvement.

Key Responsibilities :

Reliability and System Design :
Design and implement reliability patterns for client applications, communication protocols, and back-office services.
Build, deploy, tune, and own distributed and resilient systems.
Drive the entire lifecycle of a service, from inception and design through deployment, operation, and refinement.
Operations and Automation :
Operate and influence data collection, processing, and delivery systems that are scalable, resilient, and capable of operating at a global scale.
Leverage monitoring and observability tools (e.g., Prometheus, Grafana, Datadog) to ensure system health and reliability.
Lead automation efforts to reduce toil and maintain system efficiency.
Proactively identify opportunities to eliminate toil and automate issue triage to improve overall operational stability.
Incident Management and Improvement :
Lead incident response efforts and participate in on-call rotations.
Ensure root cause analysis is carried out after incidents and drive continuous improvement from its findings.
Drive postmortem processes, focusing on identifying and remediating systemic issues to prevent recurrence.
Collaboration and Leadership :
Collaborate with other software engineers, operations, product managers, and executives to design and implement deployment approaches built on highly scalable, automated continuous integration and continuous delivery pipelines.
Work closely with embedded vehicle teams, data engineers, infrastructure engineers, developer experience, and application teams.
Proactively promote the adoption of site reliability engineering best practices within the team and organization.
Lead technical decision-making, balancing reliability, performance, and cost.

Required Experience and Skills :

Typically 5-8+ years of combined experience in SRE, software development, or infrastructure engineering.

Strong experience in building and operating enterprise cloud applications.

Proficiency with cloud platforms such as AWS, Azure, or GCP, and container orchestration technologies like Kubernetes.

Familiarity with security practices such as DevSecOps.

Advanced programming skills in one or more languages (e.g., Python, Java, Go, Scala, C++).

Experience with Continuous Integration / Continuous Delivery (CI/CD) tools (e.g., Argo CD, Jenkins, GitLab CI/CD).

Familiarity with Infrastructure as Code frameworks such as Terraform.

Advanced knowledge of networking (firewalls, DNS, Load Balancing, Proxies) and Linux / Windows operating systems.

Excellent problem-solving, communication (written and verbal), and interpersonal skills.

Ability to learn complex systems, identify and mitigate incidents, and work cross-functionally.
