SRE lead with capability to execute SRE lifecycle and automation process.
- Discover, design, and implement changes to existing IT infrastructure with a focus on improved reliability, performance, and standardization.
- Collaborate with Engineering and business units to translate customer, business, and technical requirements into SRE architectural designs and enhancements.
- Develop and analyze various business and technical scenarios to drive the highest levels of executive decision-making around infrastructure resources. Drive consensus and decisions with stakeholders.
- Troubleshoot production issues providing root cause analysis and designing solutions to prevent future occurrences.
- Build automated, scalable, and rigorous solutions to infrastructure problems by leveraging or developing state-of-the-art automation, mathematical optimization, and / or AI models.
- Monitor services and create intelligent alarming for quicker incident detection and resolution.
- Identify opportunities to invent and simplify processes, identifying business risks and implementing resolutions and scalable mechanisms.
- Ensure efficient resource utilization and continuously improve processes leveraging automation and internal tools resulting in enhanced service delivery, maturity, and scalability.
- Mentor and coach other SRE team members.
The Impact You Will Have :
- Enhance the reliability and performance of Synopsys IT infrastructure.
- Standardize and automate processes to increase operational efficiency.
- Translate complex requirements into actionable SRE designs and solutions.
- Provide critical insights and drive decision-making for infrastructure improvements.
- Prevent future production issues through meticulous root cause analysis and proactive solutions.
- Contribute to the scalability and robustness of our infrastructure through innovative solutions.
- Enhance incident detection and resolution times, ensuring minimal disruption.
- Streamline processes to mitigate business risks and improve scalability.
- Optimize resource utilization, ensuring cost-effective and efficient operations.
- Develop the next generation of SRE talent through mentorship and coaching.
What You'll Need :
- Extensive experience with a wide range of infrastructure technologies, such as Linux, Windows, high-performance computing (HPC), storage platforms, networking, cloud computing, cloud services (IaaS, PaaS, SaaS), virtualization, OpenStack, and containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Expertise in HPC components such as NFS / shared file systems and grid schedulers (IBM Spectrum LSF / Univa Grid Engine / SLURM).
- Deep understanding of IT infrastructure-related services and their dependencies required to troubleshoot issues and define mitigations.
- Strong command of statistical concepts, models, and analysis, and of how they relate to product reliability and life-cycle analysis.
- Experience developing quantitative and qualitative analysis and metrics to solve business problems.
- Experience with developing service level indicators and objectives, instrumenting software, and building alerts.
- Hands-on experience with one or more of the following languages and frameworks: Java, Python, Go, AngularJS, Node.js.
- Implementation experience with infrastructure automation tools and frameworks such as GitHub, Maven / Gradle, Jenkins, Terraform (IaC), Ansible, and shell scripting.
Skills Required
Maven, Cloud Computing, Linux, Networking, Shell Scripting, Virtualization, Site Reliability Engineering