Key Responsibilities:
- Reliability Engineering: Design and implement solutions to improve system reliability, availability, performance, and scalability.
- Operational Excellence: Manage SLIs, SLOs, error budgets, monitoring, and alerting. Conduct blameless postmortems and drive continuous improvement.
- Monitoring & Alerting: Develop dashboards, monitoring, and alerting mechanisms to proactively identify and resolve issues.
- Capacity Planning: Collaborate with teams to forecast resource needs and ensure scalability.
- Performance Optimization: Identify and resolve performance bottlenecks through profiling and tuning.
- Automation: Automate repetitive tasks and processes to reduce manual intervention.
- Collaboration: Work with software engineers, performance engineers, and test engineers to influence system design for operability and reliability.
- Documentation: Maintain clear documentation, runbooks, and procedures.
- Incident Management: Participate in on-call rotations, lead incident response, and perform in-depth incident analysis.
- Troubleshooting: Diagnose and resolve complex system issues.
Skills Required:
Python, Java, System Design, Performance Tuning, Automation, Debugging