Job Title : SRE Data Engineer
Experience : 3 to 6 Years
Location : Pune
Background :
We are seeking a proactive and technically strong Site Reliability Engineer (SRE) to ensure the stability, performance, and scalability of our Data Engineering Platform. You will work on cutting-edge technologies including Cloudera Hadoop, Spark, Airflow, NiFi, and Kubernetes, ensuring high availability and driving automation to support massive-scale data workloads, especially in the telecom domain.
Key Responsibilities :
- Ensure platform uptime and application health as per SLOs / KPIs
- Monitor infrastructure and applications using ELK, Prometheus, Zabbix, etc.
- Debug and resolve complex production issues, performing root cause analysis
- Automate routine tasks and implement self-healing systems
- Design and maintain dashboards, alerts, and operational playbooks
- Participate in incident management, problem resolution, and RCA documentation
- Own and update SOPs for repeatable processes
- Collaborate with L3 and Product teams for deeper issue resolution
- Support and guide the L1 operations team
- Conduct periodic system maintenance and performance tuning
- Respond to user data requests and ensure timely resolution
- Address and mitigate security vulnerabilities and compliance issues
Technical Skillset :
- Hands-on with Spark, Hive, Cloudera Hadoop, Kafka, Ranger
- Strong Linux fundamentals and scripting (Python, Shell)
- Experience with Apache NiFi, Airflow, Yarn, and Zookeeper
- Proficient in monitoring and observability tools : ELK Stack, Prometheus, Loki
- Working knowledge of Kubernetes, Docker, Jenkins CI / CD pipelines
- Strong SQL skills (Oracle / Exadata preferred)
- Familiarity with DataHub, DataMesh, and security best practices is a plus
Working Arrangements :
- Rotating 24 / 7 Shifts, 100% from Pune Office
(ref : hirist.tech)