Troubleshoot issues across the entire stack - hardware, software, application, and network
Work to improve the reliability and performance of the next generation of distributed systems
and containerized deployments
Diagnose and troubleshoot complex distributed systems handling millions of queries per second
Day-to-day work is heavily command-line driven, which requires a strong understanding of Linux.
Participate in the on-call rotation.
Design, build, and maintain core infrastructure that enables PhonePe to scale to hundreds of thousands of concurrent users.
Actively take part in the analysis and system improvement plan.
Drive performance testing, capacity planning and high availability practices.
Own implementations of new technologies while ensuring proper testing and documentation.
Proactively monitor, identify, and resolve issues that could impact our infrastructure.
Be a natural team player with a resourceful attitude.
Buddy new team members and get them production-ready.
Skills Required
7-13 years of strong hands-on experience in Linux / Unix system administration, including TCP / IP, DNS, and load balancers.
Expertise in managing and scaling proxy infrastructure, including configuring and optimizing proxies (e.g. Nginx, HAProxy).
Knowledge of database technologies, specifically MySQL and NoSQL. Exposure to Aerospike NoSQL is good to have.
In-depth knowledge of Python to automate tasks with minimal intervention.
Knowledge of Linux cloud services using KVM / QEMU / LVM.
Reliability Engineer • Pune, Republic Of India, IN