Description
GSPANN is hiring a Site Reliability Engineer with expertise in Java and Spark. The role involves ensuring service reliability, automating operations, and supporting Java-based big data applications using Spark. You'll work closely with cross-functional teams to enhance system performance, observability, and scalability.
Role and Responsibilities
- Gain a deep understanding of the business and map the full customer journey end-to-end.
- Apply software development principles to operations, leveraging broad experience in software engineering and Site Reliability Engineering (SRE) practices.
- Collaborate with stakeholders to enhance the design, observability, availability, scalability, and performance of critical services.
- Clearly communicate your availability to both the team and your manager.
- Automate manual workflows, investigate incidents thoroughly, and lead blameless post-mortems for continuous learning.
- Use standardized telemetry data to improve alert management, incident analysis, decision-making, and system optimization.
- Support planned changes by managing deployments, monitoring systems post-deployment, and creating or updating dashboards and alerts as needed.
- Develop new services, enhance existing ones, and deploy tools that automate the support of systems and services.
- Meet and uphold organizational Service Level Objectives (SLOs) consistently.
- Create value-focused deliverables including Standard Operating Procedures (SOPs), presentations, case studies, and accelerators.
Skills and Experience
- 5+ years of experience in software development, technical operations, and managing large-scale application environments.
- 5+ years in Service Engineering, IT Support, or Production Operations.
- 5+ years of hands-on experience with Java application development and support, including knowledge of the Spring and Hibernate frameworks.
- 4+ years of experience setting up and debugging Apache Spark jobs, with a solid understanding of data processing, cleansing, and integrity validation.
- 3+ years of experience writing and maintaining Unix shell scripts, with strong hands-on scripting capability.
- Working knowledge of Microsoft Azure, Azure Cosmos DB, Azure Synapse Analytics, and Apache Kafka is preferred.
- Creative problem-solving skills for resolving cross-functional technical challenges in dynamic, fast-changing environments.
- Effective communication, with the ability to take ownership of triage calls and drive critical incidents to logical closure.
- Openness to working in rotational shifts as required.