Overview
We are seeking a skilled Platform Engineer to drive the development, deployment, and supportability of our Kubernetes-based microservices platform, which customers deploy on-premises. You will build comprehensive observability, enable log and report extraction for service cases where real-time access to the customer environment is unavailable, and reduce our over-reliance on Kafka by integrating Redis and batch processing. This role requires expertise in Kubernetes, Azure DevOps, C++ support, deployment sizing, and designing for reliability, availability, and serviceability (RAS).
Responsibilities
- Build Comprehensive Observability: Implement centralized metrics, logging, and tracing (e.g., Prometheus, Fluentd, OpenTelemetry) for .NET, Python, Java, and C++ services, Kafka, and Redis, ensuring supportability in on-premises environments.
- Enable Log/Report Extraction: Design customer-facing tools (e.g., CLI scripts, Helm chart options) that collect and export logs and metrics from on-premises deployments for service cases where no real-time access is available.
- Optimize Kafka Usage: Audit and optimize Kafka configurations (e.g., topics, partitions, compression) to reduce metadata streaming overhead, and monitor the results with Prometheus or Azure Monitor.
- Implement Alternatives: Integrate Redis (e.g., Azure Cache for Redis) for metadata caching and pub/sub, and batch processing (e.g., Azure Data Factory, Kubernetes Jobs) for high-volume data, reducing Kafka dependency.
- Troubleshoot Customer Environments: Debug issues in on-premises customer deployments across services (C++, .NET, Python, Java), Kafka, and Redis, using exported logs and metrics.
- Enhance Product Supportability: Build Azure DevOps pipelines and installers (e.g., Helm charts) for consistent, supportable deployments, with documentation for customer support.
- Contribute to RAS: Own serviceability by building observability and diagnostic tools; support reliability and availability through Kubernetes optimization, autoscaling, and fault-tolerant designs.
- Enforce Standards: Implement and enforce structured logging (e.g., JSON with correlation IDs) and resource sizing standards via Azure DevOps pipelines.
- Optimize Deployment Sizing: Set Kubernetes resource requests/limits and autoscaling policies (e.g., HPA, VPA) for services, Kafka, Redis, and batch jobs, based on profiling.
- Evaluate Service Meshes: Assess service meshes (e.g., Linkerd) for improving observability and communication across microservices and the data platform.
- Support C++ Services: Assist developers in containerizing, deploying, and debugging C++ services, ensuring integration with observability, Kafka, Redis, and batch workflows.
- Automate with Azure DevOps: Build CI/CD pipelines in Azure DevOps for automated builds, tests, and deployments, integrating with AKS, Kafka, and Redis.
Qualifications
- Experience: 3–5 years with Kubernetes, Azure DevOps (AKS, pipelines), and Kafka administration.
- Technical Skills:
  - Expert in Kubernetes (CKA/CKAD preferred) and Azure DevOps (YAML pipelines, AKS integration).
  - Proficient in observability tools (e.g., Prometheus, Grafana, Fluentd, OpenTelemetry, Azure Monitor) for metrics, logs, and tracing.
  - Experience with on-premises Kubernetes deployments and log/report extraction for service cases.
  - Proficient in Kafka optimization (e.g., topic management, consumer groups) and monitoring.
  - Knowledge of Redis (e.g., Azure Cache for Redis, pub/sub) and batch processing (e.g., Azure Data Factory, Kubernetes Jobs).
  - Familiarity with C++ build systems (e.g., CMake) and debugging (e.g., gdb) in Kubernetes.
  - Proficiency in Kubernetes resource management and autoscaling (e.g., HPA, VPA).
  - Scripting skills (e.g., Python, Bash) for automation, diagnostics, and log extraction.
- Customer Focus: Proven ability to troubleshoot on-premises customer environments and build supportable deployment and observability tools.
- Standards Enforcement: Experience enforcing logging, sizing, and data platform standards via Azure DevOps pipelines.
- RAS Expertise: Ability to design for serviceability (observability, diagnostics) and contribute to reliability and availability through platform optimization.
Nice-to-Haves
- Experience with service meshes (e.g., Linkerd, Istio) and their integration with Azure.
- Familiarity with .NET, Python, or Java for developer collaboration.
- Knowledge of air-gapped Kubernetes deployments (e.g., Kubeadm, K3s).