We are looking for an experienced Staff Software Engineer 2 to design, develop, and optimize highly scalable and distributed systems.
The ideal candidate should have strong expertise in Java and Go, along with deep knowledge of distributed computing principles, cloud-native architectures, and scalable backend services.
This role requires strong operational excellence, ensuring system reliability, scalability, and observability while driving best practices in incident management, performance tuning, and automation.
You will work closely with cross-functional teams to build resilient, high-performance applications.
Experience with CI / CD pipelines, DevOps practices, and cloud platforms (AWS / GCP / Azure) will be a plus.
Key Responsibilities :
System Architecture & Scalability :
- Design & Build : Architect and develop scalable, fault-tolerant backend systems that handle millions of requests per second.
- Microservices Development : Implement microservices using Go, Java, or Python, ensuring high availability and resilience.
- Cloud & Kubernetes : Deploy and manage applications on AWS, GCP, or Azure with Kubernetes (EKS, GKE, AKS).
- Event-Driven Architectures : Work with Kafka, Pulsar, RabbitMQ for distributed messaging and streaming workloads.
- Develop observability, logging, and monitoring with Prometheus, Grafana, and OpenTelemetry.
- Troubleshoot complex issues in high-traffic, production-scale environments.
- Mentor junior engineers and help define best coding practices, architectural patterns, and system design principles.
Operational Excellence & Incident Management :
- Reliability & Resilience : Implement best practices for graceful degradation, retries, circuit breakers, and auto-scaling.
- Incident Response & On-Call Management : Define SLAs / SLIs / SLOs, set up robust alerting & escalation processes for incident handling.
- Postmortems & RCA (Root Cause Analysis) : Lead post-incident analysis, drive corrective actions, and improve system reliability.
- Observability & Monitoring : Define and implement logging, monitoring, and distributed tracing using Prometheus, OpenTelemetry, Grafana, Datadog.
Performance Optimization & Security :
- Performance Tuning : Diagnose and optimize latency, throughput, and memory utilization for large-scale distributed systems.
- Multithreading & Concurrency : Design and implement highly concurrent, multithreaded backend services for parallel processing.
- Database & Storage Optimization : Improve performance of SQL (PostgreSQL, MySQL) and NoSQL (Cassandra, DynamoDB, Redis, MongoDB) solutions.
- Security & Compliance : Implement API security, authentication, authorization, and ensure compliance with SOC2, ISO 27001, PCI DSS.
Leadership & Collaboration :
- Mentorship & Code Reviews : Guide engineers in best practices for platform engineering, microservices, and distributed systems.
- Cross-Team Collaboration : Work with cloud engineering, security, and product engineering teams to align platform capabilities with business needs.
Key Qualifications :
- 4-12 years of experience in backend platform engineering, distributed systems, and microservices.
- Strong programming expertise in Go, Java, or Python, with a focus on multithreading and concurrency.
- Expertise in Kubernetes, service meshes (Istio, Linkerd), and cloud infrastructure.
- Deep understanding of gRPC, REST APIs, GraphQL, and API performance tuning.
- Hands-on experience with CI / CD and infrastructure automation (Terraform, Pulumi).
- Hands-on experience with Kubernetes, Docker, and Helm in cloud environments.
- Solid understanding of networking, API gateways, and authentication mechanisms (OAuth, JWT) for gRPC, REST, and GraphQL APIs.
- Ability to debug and optimize high-concurrency, low-latency applications.
- Strong problem-solving and analytical skills in complex, distributed environments.
- Proven ability to manage production incidents and apply other operational excellence practices.
(ref : hirist.tech)