Objectives
Act as the Site Reliability Engineer for global operations, ensuring system stability, scalability, and efficiency through advanced automation, observability, and proactive infrastructure management.
Provide expertise in Kubernetes, Linux, networking, and automation practices to support reliable deployments and resilient services.
Maintain a strong focus on reliability, with clear awareness of the risks and impact of infrastructure and application changes.
Principal duties
Has strong knowledge of Kubernetes (including Talos) for deploying, scaling, and maintaining containerized applications.
Provides Linux administration expertise and ensures secure, efficient system operations.
Implements and maintains GitOps workflows using Flux for consistent, automated deployments.
Designs and manages infrastructure automation using Puppet and Terraform.
Ensures reliable operation of databases such as MySQL/MariaDB, Yugabyte, and MongoDB, supporting data integrity and availability.
Operates and integrates streaming platforms (Confluent, Strimzi) for event-driven and real-time processing.
Develops automation scripts and tools using Python to improve operational efficiency.
Oversees edge device management, ensuring secure connectivity and smooth lifecycle operations.
Supports and integrates solutions with Azure and hybrid / multi-cloud environments.
Builds and operates monitoring and observability systems (Datadog, Prometheus, Grafana) to ensure system health and transparency.
Designs for scalability and high availability, including disaster recovery and failover strategies.
Applies security best practices across infrastructure, applications, and data.
Evaluates risks carefully before changes, ensuring reliable rollout strategies and minimizing downtime or service disruption.
Monitors system reliability, identifies risks, and implements proactive improvements.
Collaborates with global teams to share best practices and ensure consistency across environments.
Defines and standardizes developer tooling (e.g., IDEs, code quality tools, CI/CD integrations) to ensure consistent development environments and maintain high software quality.
Manages developer workstations and operating system standards (currently Ubuntu-based), ensuring performance, security, and compatibility across the engineering organization, with a focus on the Asia team.
Promotes a documentation culture, ensuring clear processes, runbooks, and troubleshooting guides.
Reports to the offshore Digital Manufacturing team based in Switzerland.
Site Reliability Engineer • Pune, Maharashtra, India