Key Responsibilities:
- Design, build, and maintain scalable and reliable infrastructure to support high-availability applications and services.
- Develop automation tools, including Infrastructure as Code, to reduce manual operations and increase system efficiency.
- Monitor system performance, proactively identify issues, and implement solutions to ensure service uptime and resilience.
- Collaborate with development and operations teams to improve deployment pipelines, service observability, and incident response processes.
- Participate in on-call rotations and lead post-incident reviews to drive continuous improvement and learning.
- Implement and maintain robust monitoring, alerting, and logging systems.
- Ensure systems meet security and compliance requirements.
- Optimize system performance through tuning, capacity planning, and cost analysis.
- Advocate for SRE principles and best practices across engineering teams.
Required Skills and Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- Strong proficiency with development tools: Artifactory, GitHub, Jenkins, Jira, and SVN.
- Hands-on experience with cloud platforms (OCI, AWS, GCP, Azure) and with containerization and orchestration tools (Docker, Kubernetes).
- Solid understanding of networking, system internals, and Linux administration.
- Experience with CI/CD pipelines, monitoring tools (Zabbix, Grafana, New Relic, Splunk, Datadog, etc.), and version control systems (Git).
- Strong problem-solving skills and ability to thrive in high-pressure environments.
Skills Required:
- Java, Distributed Systems, Cloud Services, Python, Databases