Job Description:
Responsibilities:
- Ensure the smooth operation of production systems through proactive monitoring and maintenance.
- Respond promptly to system alerts and incidents, minimizing downtime and impact.
- Implement and manage observability tools such as Prometheus, Grafana, and ELK (Elasticsearch, Logstash, Kibana).
- Develop and maintain dashboards and alerts to track key performance indicators (KPIs).
- Analyze monitoring data to identify trends, potential issues, and areas for optimization.
- Manage and analyze system logs to troubleshoot errors and identify root causes.
- Implement log aggregation and analysis solutions to improve error detection and resolution.
- Develop and maintain error handling procedures and documentation.
- Develop and maintain automation scripts using Python, Bash, or other scripting languages to streamline operations.
- Automate routine tasks, deployments, and infrastructure management to improve efficiency and reduce manual effort.
- Implement Infrastructure as Code (IaC) principles to manage infrastructure configurations.
- Investigate and resolve production incidents, performing root cause analysis and implementing effective solutions.
- Collaborate with development and operations teams to resolve complex issues promptly.
- Enhance and maintain UI automation and testing frameworks to improve testing efficiency and coverage.
- Participate in deployment processes, ensuring smooth and reliable releases.
- Implement and maintain CI/CD pipelines to automate software delivery.
- Identify and address performance bottlenecks in production systems.
- Implement performance tuning and optimization strategies to improve system efficiency and responsiveness.
- Monitor and analyze system performance metrics to identify areas for improvement.
- Create and maintain comprehensive documentation for system configurations, procedures, and troubleshooting guides.
- Share knowledge and best practices with team members through training sessions and knowledge base articles.
- Contribute to the development and improvement of internal tools and processes.
Requirements:
Essential:
- 2+ years of experience in Production Engineering, DevOps, or Testing Automation.
- Strong scripting skills in Python, Bash, or similar languages.
- Hands-on experience with observability tools like Prometheus, Grafana, ELK, or Datadog.
- Experience with log management and error handling.
- Familiarity with UI automation and testing frameworks.
- Strong problem-solving and analytical skills.
- Ability to work independently and as part of a team.
- Good communication and interpersonal skills.
(ref: hirist.tech)