Role Overview:
We are seeking a Senior Cloud Site Reliability Engineer to enhance the reliability, scalability, and performance of our cloud-based products and solutions. You will work in a collaborative environment with operations engineers and software developers to ensure seamless system operations, automate processes, and proactively address challenges.
Key Responsibilities:
- Analyze existing cloud infrastructure and propose scalable, efficient solutions.
- Lead incident management, conducting root cause analysis and implementing preventative measures.
- Develop strategies to improve MTBF (Mean Time Between Failures) and reduce MTTR (Mean Time to Recovery).
- Optimize and automate operational procedures for enhanced system efficiency.
- Monitor and troubleshoot performance issues across infrastructure, software, and networks.
- Research and advocate for emerging cloud technologies and best practices.
- Collaborate with teams to enhance system reliability, architecture, and design.
- Design and execute automated tests to validate software and infrastructure reliability.
Required Qualifications:
- 3+ years of experience in Site Reliability Engineering or a related role.
- 2+ years of hands-on experience with AWS services (an AWS certification, either Solutions Architect or DevOps Engineer, is mandatory).
- Strong knowledge of AWS services (EC2, RDS, Lambda, CloudFront, ELB, API Gateway).
- Experience with Linux/Unix and Windows systems, networking, and firewall concepts.
- Proficiency with CI/CD tools (Jenkins, TeamCity) and version control systems (Bitbucket).
- Advanced scripting skills (Python preferred).
- Strong understanding of system reliability, performance tuning, and scalability.
- Experience with cloud-native services, network technologies, and fault-tolerant system design.
- Database expertise in RDBMS and cloud databases (PostgreSQL, MySQL).
- Familiarity with monitoring tools (Splunk, Datadog, or equivalent).
Preferred Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- Experience with big data technologies (Spark, Hadoop, Scala) is a plus.
- Strong problem-solving and analytical skills with a proactive mindset.
- Excellent communication skills and ability to work in global, cross-functional teams.
- Quick adaptability to new platforms, tools, and technologies.
Why Join Us
- Work on cutting-edge cloud solutions with a global impact.
- Be part of a collaborative, high-performing team in an innovative environment.
- Drive mission-critical projects that enhance system reliability and scalability.
- Stay at the forefront of cloud technology with continuous learning and growth opportunities.
Apply Now!
Skills Required
Cloud Site Reliability Engineer, Mean Time to Recovery, Mean Time Between Failures, Lead incident management