Cloud Site Reliability Engineer

Confidential • Chennai, India

9 days ago

Job description

Be at the Forefront of Mobility's Future: Join Ford as a Site Reliability Engineer!

Enterprise Technology is the engine driving the future of transportation, and we're looking for a talented Site Reliability Engineer (SRE) to help us redefine mobility. In this role, you'll leverage cutting-edge technology to enhance customer experiences, improve lives, and create vehicles as smart as you are.

As an SRE at Ford, you'll be instrumental in developing, enhancing, and expanding our global monitoring and observability platform. You'll blend software and systems engineering to ensure the uptime, scalability, and maintainability of our critical cloud services. You'll be at the intersection of SRE and Software Development, building and driving the adoption of our global monitoring capabilities.

If you're passionate about using your IT expertise and analytical skills to shape the future of transportation, this is your opportunity to make a real impact. Join us and be part of a team that's building the future of mobility!

Responsibilities

Write, configure, and deploy code that improves service reliability for existing or new systems; set the standard for others with respect to code quality.
Provide helpful and actionable feedback and review for code or production changes.
Drive repair and optimization of complex systems, taking a wide range of contributing factors into account.
Lead debugging, troubleshooting, and analysis of service architecture and design.
Participate in on-call rotation.
Write documentation: designs, system analyses, runbooks, and playbooks. Provide design feedback and uplevel the design skills of others.
Implement and manage SRE monitoring application backends using Golang, Postgres, and OpenTelemetry. Develop tooling using Terraform and other IaC tools to ensure visibility and proactive issue detection across our platforms.
Work within GCP infrastructure, optimizing performance and cost and scaling resources to meet demand.
Collaborate with development teams to enhance system reliability and performance, applying a platform engineering mindset to system administration tasks.
Develop and maintain automated solutions for operational aspects such as on-call monitoring, performance tuning, and disaster recovery.
Troubleshoot and resolve issues in our dev, test, and production environments.
Participate in postmortem analysis and create preventative measures for future incidents.
Implement and maintain security best practices across our infrastructure, ensuring compliance with industry standards and internal policies. Participate in security audits and vulnerability assessments.
Participate in capacity planning and forecasting efforts to ensure our systems can handle future growth and demand. Analyze trends and make recommendations for resource allocation.
Identify and address performance bottlenecks through code profiling, system analysis, and configuration tuning. Implement and monitor performance metrics to proactively identify and resolve issues.
Develop, maintain, and test disaster recovery plans and procedures to ensure business continuity in the event of a major outage or disaster. Participate in regular disaster recovery exercises.
Contribute to internal knowledge bases and documentation.

Qualifications

Bachelor's degree in Computer Science, Engineering, Mathematics or equivalent experience.

3+ years of experience as an SRE, DevOps Engineer, Software Engineer or similar role.

Strong experience with cloud infrastructure.

Proficient with monitoring and observability tools, particularly OpenTelemetry or comparable tools.

Proficient with cloud services, with a strong preference for Kubernetes and Google Cloud Platform (GCP) experience.

Solid programming skills in Golang and scripting languages, with a good understanding of software development best practices.

Experience with relational and document databases.

Ability to debug and optimize code and to automate routine tasks.

Strong problem-solving skills and the ability to work under pressure in a fast-paced environment.

Excellent verbal and written communication skills.

Skills Required

Golang, Terraform, Postgres, Kubernetes
