Senior SRE (Morning shift, 6 AM - 2 PM) : Oracle Cloud

Oracle, Udaipur, IN

28 days ago

Job description

Role : Senior Site Reliability Engineer

Team : OCI Reliability

Shift : 6am - 2pm

Skills required : Production Incident Management, Automation, Python.

Location : Remote

As a Senior Site Reliability Engineer, you will focus on detecting, triaging, and mitigating OCI service-impacting events quickly and efficiently. You will be responsible for minimising downtime by delivering exceptional major incident management and ensuring the reliability, scalability, performance, and security of the systems that prevent incidents from occurring. Your work will directly contribute to reducing event duration by leveraging your operational expertise, best practices, and the ability to develop tools that automate and improve incident management processes.

Oracle Cloud is cutting-edge and continuously evolving. When issues arise, your team will respond within minutes to mitigate customer impact and ensure service continuity. This role will give you deep insight into the inner workings of OCI’s systems and operations. You’ll collaborate with and influence leaders across Oracle, driving organisational initiatives aimed at continually improving OCI-wide service availability. As part of an agile, high-impact team, you will play a crucial role in shaping the future of Oracle Cloud. If you're excited to be part of a fast-moving team that’s pushing the boundaries of innovation, we’d love to connect with you!

We are looking for candidates who can work APAC shift hours (6 AM to 2 PM IST).

Career Level - IC3

Responsibilities :

Lead major incident recovery by orchestrating cross-functional collaboration, driving rapid escalation, clear communication, and seamless stakeholder alignment to ensure swift and effective resolution.
Identify opportunities to automate and streamline critical incident workflows, taking full ownership of developing and implementing innovative solutions to enhance efficiency and drive faster resolutions.
Leverage deep expertise in cloud computing design patterns and dependencies to proactively mitigate complex major incidents, quickly diagnose root causes, limit impact, and implement long-term fixes that optimise cloud-based solutions.
Troubleshoot cloud infrastructure issues using observability platforms to monitor, analyse, and resolve performance and reliability challenges.
Continuously improve operational processes, tools, and workflows to enhance the reliability and efficiency of the cloud infrastructure.

Minimum Qualifications

Bachelor's degree or higher in Computer Science or a related field, or equivalent work experience.

8+ years of experience in Site Reliability Engineering (SRE), DevOps, or Systems Engineering.

Extensive hands-on experience with public cloud operations (e.g., AWS, Azure, GCP, OCI).

Proven track record in Major Incident Management within cloud-based environments, with the ability to drive effective incident resolution.

Strong understanding of automation and orchestration principles, with a focus on improving system reliability and efficiency.

Proficiency in at least one modern object-oriented programming language (e.g., Python, Java, or Go).

Solid experience with software engineering best practices, including Agile methodologies, coding standards, code reviews, version control, build processes, testing, and operations.

Familiarity with infrastructure automation tools such as Chef, Ansible, Jenkins, and Terraform.

Expertise in several key technologies, including Infrastructure-as-a-Service (IaaS), CI/CD systems, Docker, RESTful APIs, log analysis, and debugging tools.

Experience with observability platforms such as Grafana and Prometheus, and with other monitoring, logging, and tracing tools, to optimise system visibility, performance, and issue resolution.