Partner with application developers and solution architects to ensure services are built for scale and performance.
Lead the setting of service-level objectives, agreements, and indicators (SLOs, SLAs, and SLIs) for the underlying service in collaboration with Application Development, Product, and Business Owners.
Design, develop, and maintain scripts, software, and tools that improve the reliability of production systems, including fixing issues, responding to incidents, and taking on-call responsibilities.
Improve the overall resilience of systems and provide visibility into the health and performance of services across all applications and infrastructure.
Improve service performance metrics such as latency, page load speed, and ETL, and help proactively identify performance issues across the system.
Implement monitoring solutions and create dashboards and alerts based on the four golden signals of SRE (latency, traffic, errors, and saturation), providing a single source for determining the overall performance and availability of the supported services.
Write, update, and use documentation, including runbooks and playbooks.
Automate work, including infrastructure needs, testing, failover solutions, failure mitigation, and more.
Use Chaos Engineering to test what you build under real-world conditions.
Spread information across DevOps and business teams, encouraging a blameless culture focused on workflow visibility and collaboration.
Perform root-cause analysis of complex problems involving multiple parties, networks, hardware, and software that relate to scaling and performance.
Serve as technical owner to ensure delivery of SRE initiatives.
Perform deliverable reviews and coach the team in SRE areas of expertise.
Provide continuous competitive and best-practices research, leverage industry resources and market trends, and liaise with internal stakeholders.

Skills Required
ensure services, SLOs, built for scale and performance