Job Summary:
We are seeking a skilled and proactive Site Reliability Engineer (SRE) with a strong DevOps mindset and hands-on experience in application troubleshooting. The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of our applications and infrastructure. This role requires a blend of software engineering, system administration, and operational expertise, with a focus on automating processes and proactively resolving issues.
Key Responsibilities:
Site Reliability & Automation:
- Design and implement tools to automate infrastructure provisioning, application deployment, and operational tasks.
- Build and manage CI/CD pipelines using Jenkins to ensure seamless and efficient software delivery.
- Apply strong Linux knowledge to maintain and troubleshoot server environments, including certificate renewals.
Monitoring & Troubleshooting:
- Implement and manage monitoring solutions using tools like Splunk or Dynatrace to create dashboards, set up alerting, and execute log queries for proactive issue detection.
- Perform application troubleshooting, debugging, and root cause analysis to resolve complex incidents promptly.
- Leverage SQL (DML & SELECT queries) to analyze application data for performance and troubleshooting insights.
Process & Collaboration:
- Apply ITIL/ITSM principles for effective incident, problem, and change management.
- Collaborate closely with development, quality assurance, and product teams to improve system reliability.
- Manage and track code changes using Git or Bitbucket.
Required Skills:
Core Technical Skills:
- 5-8 years of experience in an SRE, DevOps, or similar role.
- Strong proficiency in at least one scripting language: Shell, Groovy, or YAML.
- Expertise in monitoring tools like Splunk or Dynatrace for alerting, dashboarding, and log analysis.
- Hands-on experience with CI/CD tools, specifically Jenkins.
System & Infrastructure:
- Strong understanding of Linux system administration.
- Basic exposure to cloud environments, with AWS being preferred.
Process & Data:
- Basic knowledge of ITIL/ITSM concepts (Incident, Problem, Change Management).
- Proficiency in SQL (DML and SELECT queries).
Preferred Skills:
- Experience with configuration management tools like Ansible or Chef.
- Hands-on experience with Docker and Kubernetes for container orchestration.
- Knowledge of other monitoring tools such as Prometheus or Grafana.
- Relevant certifications in Linux or cloud platforms.
- Strong problem-solving and analytical skills, with a proactive attitude toward identifying and resolving issues.
(ref: hirist.tech)