Description :
MoneyForward is seeking a Site Reliability Engineer (SRE) to lead the reliability, scalability, and performance of our products. This role involves making critical technical decisions, collaborating with development and platform engineering teams, and ensuring that our systems remain resilient and scalable to support stable business growth.
Responsibilities :
- Service Reliability and Scalability : Design, build, and maintain highly available production services; define and implement SLOs / SLIs; perform capacity planning and resolve performance bottlenecks.
- Incident Management : Lead incident response, conduct postmortems / root cause analysis, and improve on-call operations.
- Automation and Operational Efficiency : Automate tasks with Infrastructure as Code (Terraform); implement self-healing and auto-scaling systems; optimize CI / CD pipelines.
- Observability and Monitoring : Implement monitoring, logging, and tracing strategies using tools like Prometheus, OpenTelemetry, Grafana, and Datadog.
- Leadership : Drive SRE practices across teams, act as a technical advisor, and guide developers in adopting reliability best practices.
- Collaboration : Work closely with SREs, platform engineers, and developers to improve infrastructure, reliability, and operational efficiency.
Requirements :
- Experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
- Strong coding skills (e.g., Python, Go, Java, Rust, C++, Ruby); shell scripting alone is not sufficient.
- Experience operating Kubernetes in production environments.
- Hands-on experience with Infrastructure as Code (Terraform, Crossplane) and CI / CD automation tools (ArgoCD, CircleCI, GitHub Actions).
- Familiarity with cloud platforms (AWS or others) and cloud-native architectures.
- Strong knowledge of observability tools (Prometheus, OpenTelemetry, Grafana, Datadog).
- Experience in incident management, disaster recovery, and high-availability strategies.
- Proven technical leadership and project management skills.
Preferred Qualifications :
- Experience fostering SRE best practices within organizations.
- Deep understanding of microservice architectures.
- Proficiency in Go or Python for automation / tooling.
- Contributions to CNCF or open-source projects.
(ref : hirist.tech)