About The Role
The Infrastructure Automation Site Reliability Engineer (SRE) bridges the gap between development and operations by applying software engineering principles to infrastructure and operational challenges. Responsibilities include creating support documentation, developing key metrics for tracking and reporting, managing monitoring services, using automation tools, and coordinating cross-team communications related to releases and maintenance.
Automation SREs support existing Infrastructure Developers by taking ownership of application support and process work required to manage these applications at scale in a 24×7 environment. This allows developers to focus on building new features and functionality.
Key Functions
Application / Tool Support
- Support existing applications and services hosted by the Infrastructure Automation (InfAuto) team
- Develop runbooks for application support and maintenance
- Create detailed alerts for incident management and monitoring tools
- Implement and manage an updated operations platform for the Technical Operations team
Service Introduction & Communications
- Develop communication plans for service and tool launches
- Improve messaging around service interruptions and maintenance
Infrastructure & Automation
- Expand use of cloud development pipelines for new observability capabilities
- Support cloud infrastructure integration
- Use scripts to perform maintenance tasks
Monitoring & Observability
- Define KPIs and SLAs for managed services
- Assist with dashboard development and management
- Integrate cloud infrastructure with monitoring and reporting tools
- Conduct capacity planning to support proactive scaling
Operational Excellence
- Design and execute high availability (HA) and disaster recovery (DR) infrastructure testing
- Partner with operations teams to expedite issue analysis
- Coordinate change management activities with application users
Required Skills And Tools Experience
Experience range: 3–6 years using tools in the following categories:
- Infrastructure as Code: Terraform, CloudFormation, or similar
- Configuration Management: Ansible, Puppet, or Chef
- Container Technologies: Docker, Podman, basic Kubernetes concepts
- Observability Platforms: Grafana, Elastic (ELK), Datadog, Splunk
- Issue / Project Tracking: JIRA, ServiceNow, Trello, or similar
- CI / CD Pipelines: Jenkins, GitLab CI, GitHub Actions
- Documentation Tools: SharePoint, Confluence (for user guides, runbooks, etc.)
- Linux Operating Systems: Red Hat Enterprise Linux or similar (CentOS, Rocky, Fedora)
- Database Operations: SQL, PostgreSQL
- IDEs: Visual Studio Code (VS Code), JetBrains IntelliJ IDEA
Desired Skills
- 2–4 years in an L1 SRE or DevOps role
- Experience as a Systems Engineer (infrastructure design and implementation)
- Platform Engineer (internal tooling and platform development)
- Cloud Engineer (multi-cloud experience and migration projects)
- Application Support (production troubleshooting)
- Release Engineer (software deployment and release management)
- Incident Response (on-call experience and production issue resolution)
Company Benefits & Perks
- Competitive salary package
- Performance-based annual bonus (cash and stocks)
- Hybrid working model (3 days office / week)
- Group Medical & Life Insurance
- Modern offices with free amenities & fully stocked cafeterias
- Monthly food card & company-paid snacks
- Hardship / shift allowance with company-provided pickup & drop facility
- Attractive employee referral bonus
- Frequent company-sponsored team-building events and outings
Depending upon the shifts.
The benefits package is subject to change at the management's discretion.
Skills Required
ServiceNow, Trello, Chef, PostgreSQL, Grafana, Jira, Red Hat Enterprise Linux, Confluence, Terraform, Docker, Visual Studio Code, CloudFormation, Fedora, SQL, Datadog, Jenkins, Ansible, SharePoint, CentOS, Splunk, Puppet, Kubernetes