Site Reliability Engineer

Sonata Software, India
5 days ago
Job description

Role: Site Reliability Engineer (SRE) III – Data Engineering

Location: Hyderabad

Employment Type: Full Time

Experience: 7–12 years in site reliability, cloud-based data infrastructure, data pipeline observability, automation, and high-availability engineering within EdTech platforms (2U)

Primary Skills (Must-Have): AWS, CI/CD, Jenkins, IaC (Infrastructure as Code), Terraform, Kubernetes

Secondary Skills (Good-to-Have): AWS systems; Dataiku data platform updates and patching

Tools & Platforms

Data Warehousing & Processing: Snowflake, Redshift, Apache Airflow, dbt

CI/CD & Deployment: Jenkins, GitHub Actions, AWS CodePipeline, Terraform

Cloud & Event Processing: AWS Lambda, API Gateway, SNS/SQS, Kafka, Step Functions

Monitoring & Logging: DataDog, AWS CloudWatch, Prometheus, Splunk

Incident Management: PagerDuty, Opsgenie, AWS Health Dashboard

Collaboration & Code Review: GitHub, Jira, Confluence

Key Responsibilities

Data Pipeline Reliability & Observability:

  • Maintain and optimize highly available, fault-tolerant infrastructure for data pipelines, ETL jobs, and real-time data processing
  • Implement end-to-end monitoring of Airflow DAGs, Snowflake queries, and AWS-based data workflows
  • Automate data pipeline health checks, error handling, and auto-remediation strategies

Infrastructure & Cloud Automation:

  • Deploy and manage AWS-based data infrastructure using Terraform and CloudFormation
  • Optimize Kubernetes (EKS) clusters for processing large-scale datasets and real-time analytics
  • Ensure high availability and cost-efficient scaling for Redshift, Snowflake, and data storage solutions

Performance, Monitoring & Incident Response:

  • Implement real-time monitoring, logging, and alerting using DataDog, AWS CloudWatch, and Prometheus
  • Define and track SLOs, SLIs, and error budgets to improve data reliability and uptime
  • Conduct Root Cause Analysis (RCA), security audits, and post-mortems for incidents

Security & Compliance:

  • Ensure GDPR, CCPA, and SOC 2 compliance for data storage, access controls, and retention policies
  • Implement AWS security best practices (IAM, KMS, Shield, WAF) to secure data access and encryption
  • Secure API gateways, authentication mechanisms, and data lake permissions to prevent unauthorized access

Collaboration & Leadership:

  • Work closely with data engineers, analytics teams, and DevOps engineers to enhance data platform reliability
  • Participate in incident response drills, disaster recovery planning, and security compliance reviews
  • Advocate for best practices in automation, cost optimization, and cloud-native data solutions
