Urgent Hiring!!
Location : Remote
Role : Staff Engineer - SRE
Experience : 10+
The Site Reliability Engineering (SRE) team is responsible for the reliability, scalability,
stability and performance of systems and services.
- They work with cross-functional teams to design, build and maintain systems and they
troubleshoot issues when they arise. They bridge the gap between development and
operations teams.
- They work closely with business teams to define Service Level Objectives (SLOs) and Service Level Agreements (SLAs) for critical systems. They also monitor and maintain the uptime of these systems in line with the defined SLOs and SLAs.
- They deploy and manage monitoring tools to gain insight into system health and performance.
- They analyze performance, identify bottlenecks and implement solutions to improve a system's scalability and latency.
- They develop scripts and implement tools and automation frameworks to reduce the manual effort involved in deployment, monitoring and scaling.
- They work with development teams to design and develop observability practices such as logging, metrics and tracing. They aim to diagnose and troubleshoot issues proactively.
- They create actionable alerts on monitoring systems to ensure rapid response to potential production incidents.
- They forecast resource needs and provision adequately for current and future demand.
- They design and execute 'chaos experiments' to test a system's resilience to failure.
- They own, define and implement the Disaster Recovery (DR) processes for systems.
- They also conduct planned and unplanned mock DR drills to test response preparedness for production incidents.
- They ensure that security best practices are followed and implemented during the design and operation of systems.
- They also own and maintain documentation of processes, playbooks and systems.
- They publish KPI reports and other system health updates to the business on a regular basis.
Requirements
Must-have - Bachelor's degree, preferably in CS or a related field, or equivalent experience
Must-have - 12+ years of overall IT experience
Must-have - 7+ years of proven work experience as a Senior Site Reliability Engineer or in a similar position.
Must-have - 5+ years of AWS Cloud experience, with an AWS certification such as DevOps Engineer, SysOps or Security.
Must-have - AWS experience - 3+ years' experience using a broad range of AWS technologies (e.g. EC2, RDS, ELB, S3, VPC, CloudWatch & Monitoring Tools) to develop and maintain an AWS-based cloud solution, with an emphasis on best-practice cloud security.
Must-have - 2+ years of experience with CDN and / or cache systems like Fastly, Akamai, CloudFront, etc.
Proven understanding of and strong experience with cloud deployments (AWS / Docker / Kubernetes)
Knowledge of IaC and provisioning tools and languages like Terraform, Chef, Ansible, Shell, Groovy, Python, etc.
Experience with monitoring systems such as CloudWatch, New Relic, Datadog / Splunk, ELK stack.
Experience managing cloud network resources (AWS preferred) such as CloudWatch, VPC, URL proxies, private link, DNS, ACLs, firewalls, and C2S access points.
Platform or application engineering and operational knowledge of any of the CI / CD tooling like GitHub Actions, Jenkins, etc.
Experience with other tooling like JIRA, Bitbucket, Jenkins, Fortify, SonarQube, Nexus, Nexus IQ
Experience with configuration automation tools like Puppet / Ansible / Chef / Salt
Scripting Skills : Strong scripting (e.g. Bash & Python) and automation skills.
Operating Systems : Windows and Linux system administration.
Problem Solving : Ability to analyze and resolve complex infrastructure resource and application deployment issues
Strong attention to detail. Excellent verbal and written communication skills. Strong documentation skills.
Good To Have
Experience with Terraform / Ansible / Chef / Puppet
Experience with GitHub Actions
Experience with CloudFront, Fastly
Oversees team members performing these functions
Anticipates problems and future technical needs and takes the necessary steps to address issues.
Works primarily in server-side technologies and is comfortable with client-side technologies whenever required
Enthusiastically follows technology trends, software engineering best practices and technologies
Perks
Day off on the 3rd Friday of every month (one long weekend each month)
Monthly Wellness Reimbursement Program to promote health and well-being
Paid paternity and maternity leave
Notice Period : Immediate - 30 Days
Email to : [HIDDEN TEXT]
Skills Required
New Relic, Chef, Fortify, ELK Stack, Bash, Datadog, Jira, Jenkins, CloudWatch, Docker, Bitbucket, Terraform, Ansible, SonarQube, Nexus, Splunk, Puppet, Python, Kubernetes, AWS