Engage with our product teams to understand requirements, then design and implement resilient and scalable infrastructure solutions
Operate, monitor, and triage all aspects of our production and non-production environments
Collaborate with other engineers on code, infrastructure, design reviews, and process enhancements.
Evaluate and integrate new technologies to improve system reliability, security, and performance
Develop and implement automation to provision, configure, deploy, and monitor Apple services
Participate in an on-call rotation providing hands-on technical expertise during service-impacting events
Design, build, and maintain highly available and scalable infrastructure
Implement and improve monitoring, alerting, and incident response systems
Automate operations tasks and develop efficient workflows
Conduct system performance analysis and optimization
Collaborate with development teams to ensure smooth deployment and release processes
Implement and maintain security best practices and compliance standards
Troubleshoot and resolve system and application issues
Participate in capacity planning and scaling efforts
Stay up to date with the latest trends, technologies, and advancements in SRE practices
Contribute to capacity planning, scale testing, and disaster recovery exercises.
Approach operational problems with a software engineering mindset
BS degree in computer science or an equivalent field, with 5+ years of experience
5+ years in an Infrastructure Ops, Site Reliability Engineering, or DevOps-focused role.
Knowledge of Linux operating system principles, networking fundamentals, and systems management.
Demonstrable fluency in at least one of the following languages: Java, Python, or Go
Experience managing and scaling distributed systems in a public, private, or hybrid cloud environment
Develop and implement automation tools and apply best practices for system reliability.
Own the availability and scalability of our services, and manage disaster recovery and other operational tasks.
Collaborate with the development team to improve observability by instrumenting the application codebase with logging, metrics, and traces.
Collaborate with data science teams and other business units to design, build and maintain the infrastructure that runs machine learning and generative AI workloads.
Influence architectural decisions with a focus on security, scalability, and performance.
Find and fix problems in production, and work to prevent them from recurring
Preferred Qualifications:
Familiarity with microservices architecture and container orchestration with Kubernetes.
Awareness of key security principles, including encryption and cryptographic keys (key types and key exchange protocols).
Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and automation.
Strong sense of ownership, with a desire to communicate and collaborate with other engineers and teams.
Ability to identify and communicate technical and architectural problems while working with partners and their teams to iteratively find solutions.