The core premise of SRE is treating operations as a software problem. We code our way out of operational problems, addressing availability, scalability, latency, and efficiency challenges across our vast infrastructure.
Responsibilities:
- Design, develop, and implement software that improves the stability, scalability, availability, and latency of the products.
- Take ownership of one or more services and have the freedom to do what is best for our business and customers.
- Solve problems occurring with our highly available production systems and build solutions and automation to prevent them from happening again.
- Build effective monitoring to track the health of your systems, and jump in to handle outages.
- Build and run capacity tests to manage the growth of your systems.
- Plan for reliability by designing systems to work across our multinational data centers.
- Develop tools that help the product development teams successfully deploy thousands of change sets every day.
- Advocate for engineering best practices.
- Share the on-call rotation and be an escalation contact for incidents.
- Contribute to growth through interviewing, onboarding, or other tasks.
Requirements:
- 8 years of experience building, operating, and maintaining sophisticated, scalable systems, and with operations automation.
- Solid experience in at least one programming language. We use Java, Python, Go, Ruby, and Perl.
- Experience with Infrastructure as Code technologies.
- Knowledge of cloud computing fundamentals.
- Solid foundation in Linux administration and troubleshooting.
- Understanding of service-level agreements and objectives.
- Additional experience in OpenStack, Kubernetes, networking, security, or storage is desirable.
- Experience with monitoring / observability technologies like Prometheus, Graphite, Grafana, Kibana, and Elasticsearch is a plus.
- Good interpersonal skills.
- Proficient command of the English language, both written and spoken.
Here are some of the tools and technologies we use to achieve this: Python, Go, Puppet, Kubernetes, Elasticsearch, Prometheus, HAProxy, Cassandra, and Kafka.