Lead Operations Engineer
Experience: 8+ years
- Own operational oversight for services running on a Java-based microservices platform. Act as the primary escalation point for production incidents; lead incident response and communication.
- Drive post-incident reviews (blameless RCAs) and embed learnings through preventive actions. Maintain service dashboards, alerts, and incident tooling (e.g., PagerDuty, Datadog).
Technical Leadership
- Guide operational practices across services built using Java (Spring Boot), Kafka, MongoDB, and related technologies.
- Oversee monitoring, observability, and performance tuning using Datadog, ELK, Prometheus, or similar tooling.
Problem Management & Root Cause Elimination
- Lead proactive and reactive problem management efforts.
- Identify recurring production issues and collaborate with engineering to design permanent solutions.
- Track and reduce operational toil via automation and tooling improvements.
Change Enablement & Service Onboarding
- Partner with development teams to onboard new services with production readiness standards.
- Ensure all services meet requirements for monitoring, logging, documentation, support, and resilience before go-live.
- Support safe, rapid change practices including canary releases, feature flags, and progressive delivery.
Team Management & Leadership
- Lead and mentor a team of operations engineers and/or SREs.
- Manage performance reviews, career development, and day-to-day team workload.
- Foster a high-performance culture with strong accountability, collaboration, and a learning mindset.
Continuous Improvement & DevOps Practices
- Drive automation and self-service initiatives to reduce manual intervention and operational burden.
- Champion observability best practices (metrics, traces, logs) and error budget tracking.
- Promote DevOps culture and continuous feedback loops between engineering and operations.
Governance, Risk & Compliance
- Ensure operational processes comply with security, privacy, and regulatory requirements (e.g., SOC 2, ISO 27001).
- Manage operational risks, service continuity plans, and audit readiness.