Job Responsibilities :
We are seeking a skilled Site Reliability Engineer to manage the day-to-day operations and performance of multiple critical applications in a dynamic, high-demand environment.
The ideal candidate will have hands-on experience with SQL databases, cloud platforms, and a range of other enterprise applications, combined with problem-solving skills and the ability to troubleshoot and resolve issues with minimal [...]
[...] Description
Production Support Processes and SLAs :
- Document production support processes that encompass the full lifecycle of a delivery request through to the development team and a production release.
- Support defined SLAs based on severity and work with DevOps and Engineering to meet those SLAs.
System and Application Deployments :
- Plan and execute application and database deployments following established processes, in adherence to Corporate Change Management standards.
Incident Management :
- Participate in the troubleshooting and resolution of production issues in real time, with timely communication to affected parties.
- Ensure that incidents are logged, tracked, and escalated as [...]
[...] & Alerting :
- Implement and optimize monitoring tools to proactively detect issues and ensure the health and performance of production [...]
[...] Stability & Performance :
- Work closely with the development, infrastructure, and operations teams to ensure the stability and scalability of production systems.
- Recommend and implement improvements to increase system [...]
Root Cause Analysis (RCA) :
- Contribute to post-incident reviews, drive root cause analysis efforts, and ensure that lessons learned are shared across teams.
Continuous Improvement :
- Engage in continuous improvement efforts by identifying gaps in the support process and implementing best practices.
- Optimize incident response times and overall system [...]
[...] with Stakeholders :
- Engage with business stakeholders, product owners, and other cross-functional teams to ensure effective communication and [...]
[...] Management :
- Maintain and update documentation for support procedures, system configurations, and incident management.
- Create knowledge-base articles and ensure the team is well trained on new systems and procedures.
On-Call Rotation :
- Participate in on-call rotation for critical incidents, ensuring that production environments are supported 24 / 7 / 365.
Job Qualifications
Skills & Qualifications :
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 2+ years of experience in production support, system administration, or related technical roles with a focus on cloud-based systems management (GCP and Azure).
- Proven experience in a production support or IT operations team.
- Knowledge of incident management, system monitoring, and troubleshooting methodologies.
- Understanding of production systems, system architectures, and distributed systems.
- Hands-on experience with monitoring tools.
- Familiarity with scripting languages (e.g., Python, Shell) for automation and troubleshooting.
- Solid communication and interpersonal skills to engage with stakeholders.
- Ability to work under pressure and manage incidents in a fast-paced production environment.
- Proficiency in Windows / Linux / Unix environments and system administration.
- Familiarity with CI / CD pipelines and tools (e.g., Jenkins, GitHub).
- Hands-on experience with .NET Core, .NET Framework, Apache, IIS, PowerShell, and Python for application support.
- Ability to query SQL databases for application troubleshooting, reporting, and deployments.
- Additional technologies : JIRA, Confluence, PagerDuty, Uptrends, Teams, O365
(ref : hirist.tech)