Job Description:
- Incident Response & On-Call Support
Serve as primary responder for production incidents; participate in on-call rotation for Java platform services.
Investigate and diagnose application-level problems (e.g., memory leaks, GC pauses, thread deadlocks, CPU bottlenecks).
Execute short-term fixes such as restarting services, modifying configurations, or managing deployment rollbacks.
Escalate critical issues to development teams or dependent stakeholders when needed.
- Operational Maintenance
Conduct recurring system maintenance: monthly framework upgrades, dependency patching, configuration validation.
Monitor and audit application health, performance, and availability using internal tools and dashboards.
Maintain and improve runbooks, response procedures, and documentation.
- Collaboration & Observability
Collaborate with engineering teams during production deployments or rollouts.
Analyze application metrics, logs, and traces to identify system issues or inefficiencies.
Partner with infrastructure, database, and observability teams to tune systems for performance and reliability.
Required Qualifications
5+ years of hands-on experience supporting Java-based applications in a production environment.
Solid understanding of the Java runtime (JVM, memory management, garbage collection, threading).
Proficient with tools for diagnosing and monitoring Java applications (e.g., jstack, jmap, Grafana, Prometheus).
Strong knowledge of Linux systems, shell scripting, and common command-line tools.
Ability to troubleshoot infrastructure issues related to CPU, memory, IO, and networking.
Strong verbal and written communication skills to coordinate across teams.
Experience with on-call support and managing high-severity incidents.
Preferred Qualifications
Experience working with Apache Flink, Apache Spark, or other distributed data processing frameworks in production.
Familiarity with operational patterns for diverse data systems, including:
  Oracle databases
  Key-value stores (e.g., Redis, RocksDB)
  Document databases (e.g., MongoDB or similar)
  Graph databases (e.g., Neo4j, JanusGraph)
Understanding of production concerns like data consistency, latency, availability, and failure handling.
Exposure to container-based environments (Docker, Kubernetes).