Site Reliability Engineer (SRE)
Ford's Enterprise Technology team is seeking a talented Site Reliability Engineer to enhance customer experiences and redefine mobility. In this role, you will be instrumental in developing, enhancing, and expanding our global monitoring and observability platform, ensuring the uptime, scalability, and maintainability of critical cloud services.
Key Responsibilities:
- Design, develop, configure, and deploy code to improve service reliability, setting standards for code quality.
- Lead debugging, troubleshooting, and analysis of complex service architectures and designs.
- Implement and manage SRE monitoring application backends using Golang, Postgres, and OpenTelemetry.
- Develop tooling with Terraform and other Infrastructure as Code (IaC) tools for proactive issue detection.
- Optimize performance, manage costs, and scale resources within GCP infrastructure.
- Collaborate with development teams to enhance system reliability and performance, applying a platform engineering mindset.
- Develop and maintain automated solutions for on-call monitoring, performance tuning, and disaster recovery.
- Troubleshoot and resolve issues across development, test, and production environments.
- Participate in on-call rotation and postmortem analysis, implementing preventative measures.
- Implement and maintain security best practices, and participate in security audits and vulnerability assessments.
- Contribute to capacity planning and forecasting, and identify and address performance bottlenecks.
- Develop, maintain, and test disaster recovery plans and procedures.
- Create comprehensive documentation, including design documents, system analyses, runbooks, and playbooks.
Qualifications:
- Bachelor's degree in Computer Science, Engineering, Mathematics, or equivalent experience.
- 3+ years of experience as an SRE, DevOps Engineer, Software Engineer, or similar role.
- Strong experience with Golang development; familiarity with Terraform Provider development is desired.
- Proficient with monitoring and observability tools, particularly OpenTelemetry.
- Strong proficiency with cloud services, with a significant preference for Kubernetes and Google Cloud Platform (GCP) experience.
- Solid programming skills in Golang and scripting languages, with a good understanding of software development best practices.
- Experience with relational and document databases.
- Ability to debug, optimize code, and automate routine tasks.
- Strong problem-solving skills and ability to work effectively in a fast-paced environment.
- Excellent verbal and written communication skills.
Skills Required:
Golang, Site Reliability Engineering, Kubernetes, GCP, Terraform