Role Title : SRE
Mandatory Skills : SRE, AWS or GCP
Observability
Incident Management
New Relic, Grafana, Prometheus
Terraform
Python
Role Description / Skills :
Cloud Platforms :
5+ years of hands-on experience with AWS or GCP and with modern pipeline technologies and approaches.
Containerization and Orchestration :
3+ years of experience designing, deploying and monitoring containerization technologies such as Docker and container orchestration tools such as Kubernetes.
Systems / Infrastructure as Code (IaC) :
3+ years of hands-on experience with IaC tools, such as Terraform or CloudFormation.
Monitoring and Logging :
4+ years of expertise in implementing and managing observability platforms and monitoring tools (New Relic, Grafana, Prometheus) that feed into SLOs / SLIs, and logging solutions such as ELK (Elasticsearch, Logstash, Kibana) or Splunk.
Automation :
3+ years of hands-on experience with scripting languages such as Python or Bash and configuration management tools like Salt, Ansible or Chef.
CI / CD :
3+ years of hands-on experience with CI / CD pipelines and tools such as Jenkins.
Reliability and Performance :
5+ years of designing and implementing highly reliable, scalable and available systems, with a focus on system optimization, performance and resource utilization.
Incident Response :
3+ years of primary incident management and on-call support, using incident response procedures, tools such as PagerDuty and related best practices.
Collaboration and Communication :
You have a knack for fostering professional growth and knowledge-sharing, with a proven ability to contribute to a collaborative, skill-enhancing work environment.
Documentation :
Proficient in creating and maintaining clear and comprehensive documentation.
Problem-Solving :
You strive to understand the problem you are trying to solve before deciding on a solution, and you are thoughtful and methodical in implementing it rather than jumping to the next tool.
Ability to troubleshoot complex issues in distributed systems.
Site Reliability Engineer • Hyderabad, Telangana, India