Site Reliability Engineer - DevOps

Whitefield Careers, Bangalore

30+ days ago

Job description

Key Responsibilities:

Troubleshoot complex issues in Linux environments and conduct application-level debugging.
Manage and provision infrastructure using Terraform and configuration management tools.
Orchestrate and manage containers using Kubernetes in a production-grade environment.
Design and maintain CI/CD pipelines to enable seamless deployments and continuous delivery.
Script automation tools and processes to enhance operational efficiency and reliability.
Monitor system health and performance using tools such as Grafana, Prometheus, and Loki.
Set up alerts and dashboards for proactive system monitoring and issue detection.
Collaborate with development, QA, and operations teams to improve application and system performance.
Lead incident response efforts, perform root cause analysis, and ensure timely resolution.
Perform API and load testing using Gatling and JMeter to validate system resilience.
Administer and support Finacle operations and its integration within the infrastructure.
Apply deep knowledge of TCP/IP, HTTP, DNS, and load balancing to maintain highly available services.
Document system configurations, processes, and troubleshooting guides for internal use.
Work across Linux and Windows systems, providing support and implementing improvements.

Key Skills & Qualifications:

4+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.

Proven expertise in Linux system administration and debugging complex application issues.

Strong experience with Terraform, Kubernetes, and container orchestration.

Hands-on experience managing CI/CD pipelines and version control systems.

Proficiency in scripting languages (e.g., Bash, Python, or similar).

Sound knowledge of Finacle operations is highly desirable.

Familiarity with system architecture, configuration management, and automation tools.

Deep understanding of networking, including TCP/IP, HTTP, DNS, and load balancing.

Experience with Grafana, Prometheus, Loki, and other observability tools.

Ability to define alerts and dashboards and to troubleshoot performance issues using system metrics.

Proficient in incident management, root cause analysis, and writing postmortems.

Skilled in API testing and load testing using Gatling and JMeter.

Strong interpersonal skills and ability to communicate complex technical topics clearly and concisely.

Strong documentation skills and ability to collaborate in cross-functional teams.

(ref: hirist.tech)
