ROLE AND RESPONSIBILITIES
- Design and architect observability solutions to make business workflows efficient, adaptable, scalable, and reliable.
- Manage incidents: coordinate the response effort, communicate with stakeholders, and maintain control over incident response.
- Build software and systems to manage services and applications through automation.
- Deploy, support, and monitor existing and new services, platforms, and application stacks.
- Measure and optimize the performance of various workloads.
- Responsible for availability, performance, efficiency, change management, monitoring, emergency response, capacity planning and resiliency of the entire Progress Residential technology platform.
- Ensure availability of the platform with an emphasis on automating mundane and repetitive tasks.
- As part of capacity management responsibilities, perform organic demand forecasting, incorporate inorganic demand generation sources into demand forecasting, and conduct regular end-to-end load and failover testing of the system.
- Ensure strict change management is adhered to by following best practices such as implementing progressive rollouts, quickly and accurately detecting problems, and rolling back changes safely when problems do occur.
- Implement DevOps best practices such as Canary / Blue-Green deployment & continuous monitoring.
QUALIFICATIONS AND TECHNICAL SKILLSETS
- Bachelor's degree in Information Technology, Computer Science, or an equivalent engineering field, with 5+ years of experience in DevOps or SRE.
- Must have 1+ years of hands-on experience with Amazon Web Services in a live production environment.
- Must have a solid understanding of the principles of Site Reliability Engineering.
- Well-versed in DevOps fundamentals and toolchains.
- Experience with observability tools such as New Relic, Datadog, Splunk, Logz, etc.
- Good understanding of network infrastructure (firewalls, ACLs, etc.).
- Experience working with one of the IaC frameworks, such as AWS CDK, CloudFormation, Serverless, or Terraform.
- Proficiency with Python is a plus.
SKILLS REQUIRED
CloudFormation, Datadog, Firewalls, New Relic, DevOps, ACLs, Terraform, Site Reliability Engineering, Splunk, Python