About the job
Responsibilities :
- Design, implement, and maintain comprehensive observability solutions, including metrics, logs, and traces.
- Develop and implement effective monitoring and alerting strategies to proactively identify and address system issues.
- Conduct in-depth performance testing and analysis to identify bottlenecks and optimize system performance.
- Collaborate with development teams to ensure observability and performance are built into the application lifecycle.
- Automate routine tasks and processes to improve efficiency and reduce errors.
- Participate in on-call rotations and incident response to resolve critical issues.
- Analyze system performance metrics and identify opportunities for improvement.
- Stay up to date with the latest technologies and trends in observability, monitoring, and performance engineering.
Qualifications :
- 6+ years of experience in Site Reliability Engineering or a related role.
- Strong proficiency in observability tools and technologies (e.g. Dynatrace, Prometheus, Grafana).
- Expertise in performance testing tools and methodologies (e.g. LoadRunner, JMeter).
- Experience with scripting languages (Python, Bash, etc.) for automation and analysis.
- Strong understanding of cloud platforms (AWS or any equivalent).
- Excellent problem-solving, analytical, and troubleshooting skills.
- Ability to work independently and as part of a team.
- Strong communication and interpersonal skills.
- Experience with containerization technologies (Docker, Kubernetes, EKS).
- Knowledge of chaos engineering principles and practices.
- Experience with incident management and response.
- Ability to take initiative and persevere (ownership).
- Tech-savvy and open to learning new tools and technologies as needed.
Skills Required :
Site Reliability Engineer, Performance Testing, Scripting