Site Reliability Engineer

Sonata Software
Hyderabad, Telangana, India
8 days ago
Job description

Role: Site Reliability Engineer (SRE) III – Data Engineering

Location: Hyderabad

Employment Type: Full Time

Experience: 7–12 years in site reliability, cloud-based data infrastructure, data pipeline observability, automation, and high-availability engineering within EdTech platforms (2U)

Primary Skills (Must-Have): AWS, CI/CD, Jenkins, IaC, Terraform, Kubernetes

Secondary Skills (Good-to-Have): AWS systems; Dataiku data, Platform updates and patching

Tools & Platforms

Data Warehousing & Processing: Snowflake, Redshift, Apache Airflow, dbt

CI/CD & Deployment: Jenkins, GitHub Actions, AWS CodePipeline, Terraform

Cloud & Event Processing: AWS Lambda, API Gateway, SNS/SQS, Kafka, Step Functions

Monitoring & Logging: DataDog, AWS CloudWatch, Prometheus, Splunk

Incident Management: PagerDuty, Opsgenie, AWS Health Dashboard

Collaboration & Code Review: GitHub, Jira, Confluence

Key Responsibilities

Data Pipeline Reliability & Observability:

  • Maintain and optimize highly available, fault-tolerant infrastructure for data pipelines, ETL jobs, and real-time data processing
  • Implement end-to-end monitoring of Airflow DAGs, Snowflake queries, and AWS-based data workflows
  • Automate data pipeline health checks, error handling, and auto-remediation strategies

Infrastructure & Cloud Automation:

  • Deploy and manage AWS-based data infrastructure using Terraform and CloudFormation
  • Optimize Kubernetes (EKS) clusters for processing large-scale datasets and real-time analytics
  • Ensure high availability and cost-efficient scaling for Redshift, Snowflake, and data storage solutions

Performance, Monitoring & Incident Response:

  • Implement real-time monitoring, logging, and alerting using DataDog, AWS CloudWatch, and Prometheus
  • Define and track SLOs, SLIs, and error budgets to improve data reliability and uptime
  • Conduct Root Cause Analysis (RCA), security audits, and post-mortems for incidents

Security & Compliance:

  • Ensure GDPR, CCPA, and SOC 2 compliance for data storage, access controls, and retention policies
  • Implement AWS security best practices (IAM, KMS, Shield, WAF) to secure data access and encryption
  • Secure API gateways, authentication mechanisms, and data lake permissions to prevent unauthorized access

Collaboration & Leadership:

  • Work closely with data engineers, analytics teams, and DevOps engineers to enhance data platform reliability
  • Participate in incident response drills, disaster recovery planning, and security compliance reviews
  • Advocate for best practices in automation, cost optimization, and cloud-native data solutions