Our client is seeking a Site Reliability Engineer I to join their growing technology operations team. This role is ideal for someone passionate about system reliability, incident response, and cross-team collaboration in a large-scale cloud environment.
What You’ll Do
- Act as the first point of contact for all customer-affecting issues.
- Drive and manage the resolution of technical incidents.
- Ensure incident management processes are followed and post-mortems are completed.
- Provide consistent and clear communication to management.
- Respond to Zabbix alerts and perform regular monitoring, taking direct action or escalating as needed.
- Ensure smooth handoff of escalations.
- Maintain pod health across all sites, including defining pod alerts in Zabbix.
- Perform daily filesystem checks for pods.
- Troubleshoot advanced technical issues for DC Technicians (pods, deployments, migrations, Ansible playbooks).
- Identify and escalate potential network issues.
- Handle Vault pre-deployment configuration, testing, and migration monitoring.
- Document and automate daily operational tasks.
- Provide network IP documentation for upcoming deployments.
- Monitor server farm releases and updates, escalating issues when necessary.
- Participate in on-call rotation.
- Support TechOps team members with tasks as needed.
- Recommend improvements to enhance productivity.
- Work outside normal business hours as required (weekends, holidays, evenings).
Requirements
- Must be located in Bangalore.
- 2–4 years of relevant experience.
- Strong sysadmin and Linux skills.
- Willingness to learn and grow technical capabilities.
- Strong analytical and problem-solving skills.
- Excellent communication and teamwork skills.
- Knowledge of network cabling, network classification, and network topology.
Pro5 is a global platform helping thousands of vetted professionals get hired by top employers. See what others say on our public Google Reviews and learn how we keep your data safe in our Trust Center.