Detailed JD (Roles and Responsibilities)
The Site Reliability Engineer II is responsible for providing continuous feedback on site health, reliability, availability, and user experience to both engineering and product owners. Real-time measurements for production environments will be collected, aggregated, and analyzed using both infrastructure and APM tools, including but not limited to SolarWinds, Dynatrace, and log analytics. In addition to monitoring and insight, a heavy focus will be placed on identifying automation opportunities and automating operational processes to maintain 99.9% availability of AvidXchange core products.
Performs Production SaaS operational and administration duties to maintain the health and reliability of SaaS production systems
Performs Production SaaS support, incident management, problem management, and service restoration as needed to quickly respond to and resolve production issues
Implements and trains team members on tools for measuring core product health in production (with opportunities to extend those capabilities all the way back through the entire DevOps pipeline)
Implements and trains team members for calculating system availability SLAs across AvidXchange products
Implements and executes the tool consolidation strategy to optimize spend versus value for our end-to-end monitoring platform
Implements rapid and continuous development and application of automated solutions to address reliability issues and automate manual tasks
Works with the Software DevOps team to implement DevOps CI/CD continuous performance testing, monitoring, and reliability strategy using Visual Studio Team Services and other cloud-based tools
Implements the measurement capability of core product availability across Azure and AvidXchange Cloud using HTTP endpoint testing and synthetic user testing
Maintains the automated site availability reporting and data platform
Gathers data on usability, reliability, incidents, and user experience of AvidXchange products for consumption by executive leadership on a weekly basis
Influences product delivery teams to implement usability and reliability enhancements leading to improved user experience index scores and improved availability
Provides detailed analysis and troubleshooting for system outages, providing feedback to product / software engineering
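For illustration only, the availability measurement mentioned above (HTTP endpoint and synthetic checks rolled up against the 99.9% target) might be sketched in Python roughly as follows. The record shape `ProbeResult`, the pass criteria in `is_healthy`, the 2000 ms latency limit, and the 30-day reporting period are all assumptions made for this sketch and are not taken from AvidXchange tooling.

```python
from dataclasses import dataclass

TARGET_AVAILABILITY = 0.999  # the 99.9% target stated in this posting


@dataclass
class ProbeResult:
    """One HTTP endpoint or synthetic check (hypothetical record shape)."""
    status_code: int
    latency_ms: float


def is_healthy(result: ProbeResult, max_latency_ms: float = 2000.0) -> bool:
    # Assumption: a check passes on a 2xx/3xx status within the latency limit.
    return 200 <= result.status_code < 400 and result.latency_ms <= max_latency_ms


def availability(results: list[ProbeResult]) -> float:
    """Fraction of checks that passed, between 0.0 and 1.0."""
    if not results:
        raise ValueError("no probe results to evaluate")
    passed = sum(1 for r in results if is_healthy(r))
    return passed / len(results)


def allowed_downtime_minutes(target: float, period_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a period."""
    return (1.0 - target) * period_days * 24 * 60
```

Under these assumptions, a 99.9% target over a 30-day period leaves a downtime budget of roughly 43.2 minutes, which is the kind of figure a weekly availability report would be checked against.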
The candidate must also be willing to work in an on-call rotation. Each engineer's turn comes up every 2.5 months, and an on-call shift is 24x7 for 2 weeks.
Total Experience
5+ years total experience
Relevant Experience
3+ years relevant experience
Mandatory skills
Site Reliability Engineer
APM tools including but not limited to SolarWinds, Dynatrace, and log analytics
Site Reliability Engineer • Bengaluru, KA, India