Talent.com
Cloud Site Reliability Engineer

Confidential • Chennai, India
9 days ago
Job description

Be at the Forefront of Mobility's Future: Join Ford as a Site Reliability Engineer!

Enterprise Technology is the engine driving the future of transportation, and we're looking for a talented Site Reliability Engineer (SRE) to help us redefine mobility. In this role, you'll leverage cutting-edge technology to enhance customer experiences, improve lives, and create vehicles as smart as you are.

As an SRE at Ford, you'll be instrumental in developing, enhancing, and expanding our global monitoring and observability platform. You'll blend software and systems engineering to ensure the uptime, scalability, and maintainability of our critical cloud services. You'll be at the intersection of SRE and Software Development, building and driving the adoption of our global monitoring capabilities.

If you're passionate about using your IT expertise and analytical skills to shape the future of transportation, this is your opportunity to make a real impact. Join us and be part of a team that's building the future of mobility!

Responsibilities

  • Write, configure, and deploy code that improves service reliability for existing or new systems; set the standard for others with respect to code quality.
  • Provide helpful and actionable feedback and review for code or production changes.
  • Drive repair and optimization of complex systems, taking a wide range of contributing factors into account.
  • Lead debugging, troubleshooting, and analysis of service architecture and design.
  • Participate in on-call rotation.
  • Write documentation: designs, system analyses, runbooks, and playbooks. Provide design feedback and help others improve their design skills.
  • Implement and manage SRE monitoring application backends using Golang, Postgres, and OpenTelemetry. Develop tooling using Terraform and other IaC tools to ensure visibility and proactive issue detection across our platforms.
  • Work within GCP infrastructure, optimizing performance and cost and scaling resources to meet demand.
  • Collaborate with development teams to enhance system reliability and performance, applying a platform engineering mindset to system administration tasks.
  • Develop and maintain automated solutions for operational aspects such as on-call monitoring, performance tuning, and disaster recovery.
  • Troubleshoot and resolve issues in our dev, test, and production environments.
  • Participate in postmortem analysis and create preventative measures for future incidents.
  • Implement and maintain security best practices across our infrastructure, ensuring compliance with industry standards and internal policies. Participate in security audits and vulnerability assessments.
  • Participate in capacity planning and forecasting efforts to ensure our systems can handle future growth and demand. Analyze trends and make recommendations for resource allocation.
  • Identify and address performance bottlenecks through code profiling, system analysis, and configuration tuning. Implement and monitor performance metrics to proactively identify and resolve issues.
  • Develop, maintain, and test disaster recovery plans and procedures to ensure business continuity in the event of a major outage or disaster. Participate in regular disaster recovery exercises.
  • Contribute to internal knowledge bases and documentation.

Qualifications

  • Bachelor's degree in Computer Science, Engineering, Mathematics or equivalent experience.
  • 3+ years of experience as an SRE, DevOps Engineer, Software Engineer or similar role.
  • Strong experience with cloud infrastructure.
  • Proficient with monitoring and observability tools, particularly OpenTelemetry or comparable tools.
  • Proficient with cloud services, with a strong preference for Kubernetes and Google Cloud Platform (GCP) experience.
  • Solid programming skills in Golang and scripting languages, with a good understanding of software development best practices.
  • Experience with relational and document databases.
  • Ability to debug, optimize code, and automate routine tasks.
  • Strong problem-solving skills and the ability to work under pressure in a fast-paced environment.
  • Excellent verbal and written communication skills.

Skills Required

Golang, Terraform, Postgres, Kubernetes