Site Reliability Engineer - Observability Services

Futurestep Recruitment Services Private Limited, Bangalore

6 days ago

Job description

Description :

One of our leading clients is looking to hire a Site Reliability Engineer (SRE) with expertise in building and maintaining reliable, scalable, and highly available systems. The ideal candidate should have strong experience in observability practices, with hands-on skills in Grafana and related monitoring / alerting tools, as well as a proven ability to operate in multi-cloud environments.

Note : Only candidates with SRE + Observability + Grafana + Loki / Mimir / Tempo (any 2 are mandatory) and cloud experience should apply for the job.

Job Title : Engineer Site Reliability

Skills required : SRE + Observability + Grafana + Loki / Mimir / Tempo + Any cloud

Office location : Bangalore

Experience Range : 1-3 years

Job Title : Senior Engineer Site Reliability

Skills required : SRE + Observability + Grafana + Loki / Mimir / Tempo + Any cloud

Office location : Bangalore

Experience Range : 3-6 years

Job Title : Advance Engineer Site Reliability

Skills required : SRE + Observability + Grafana + Loki / Mimir / Tempo + Any cloud

Office location : Bangalore

Experience Range : 6-9 years

What you will be doing :

This role is an individual contributor position responsible for building and fine-tuning the platform components of the Observability product. The candidate will work closely with the lead engineer and with the performance, data ingestion, platform DevOps, and data visualization teams under the Observability product. As a member of the platform team, the candidate must be able to support and maintain the applications onboarded to Grafana Observability, including ingestion and visualization written in PromQL, log queries, etc., and the related monitoring technologies.

This position will preferably be based out of India GCC, Bangalore.

Key Responsibilities :

Lead technical support for applications and programs currently in production.
Analyze complex problems to determine solutions that can be implemented permanently in production.
Prepare for Production releases by ensuring appropriate alerts, dashboards, KB articles, Confluence pages and knowledge sharing are properly executed.
Ensure dashboards are monitored daily to detect anomalies, and that corrections are shared with the appropriate teams and team members.
Check that alerts are being responded to appropriately.
Ensure improvement agendas for services are maintained and acted on with Development Engineering and DevOps Engineering partners. Experience in Observability and Monitoring initiatives as a platform engineer.
Troubleshoot platform issues and restore service by resolving customer-facing incidents.
Develop and implement build and release pipelines, with accountability for managing deployment schedules, issues, and risks. Agile development experience is expected, with team member accountability for commitment and delivery each sprint.
Troubleshoot and implement corrections to problems associated with connectivity between the supported applications and the clients they serve.
Provide technical guidance in the diagnosis of issues as they arise in support of critical applications.
Drive collaboration sessions among IT and business groups to facilitate optimal support and operation of the relevant applications
Provide Site Reliability Engineering techniques such as observability, alerting and performance tuning
Contribute to the design, implementation, and enhancement of critical applications
Perform proactive analysis and troubleshooting to predict and prevent production incidents
Define and contribute to monitoring capabilities for critical applications
Collaborate with key vendors on functional, performance and capacity improvements
Design and build tools to automate support and monitoring functions
Ensure that all implementations of observability meet the requirements prescribed by IT Services through the effective implementation or use of approved processes, methodologies, and deliverables.
Provide expertise and build solutions for observability applications as well as system integration with internal systems and external vendors.
Provide coding and technical direction to less experienced staff, or develop highly complex original code.
Track infrastructure delivery and dependencies to implementation.

We are searching for someone with the following skills :

Experience with gathering and organizing large volumes of data to use for instrumentation into an Enterprise Observability solution.

Experience with recommending baseline monitoring thresholds, and performance monitoring KPIs and SLAs.

Experience with installing agents, forwarders, APIs, performance monitoring alerts, dashboards, and data trend analysis.

Good Knowledge and understanding of Azure foundation components e.g. App GW, APIM, Virtual Network, NSG, Load Balancer, Azure VM etc. is required.

Team-oriented, positively contributing to team morale and willing to help.

Learning-focused, finding ways to improve in their field and using positive, constructive feedback to grow personally and professionally.

Think strategically and proactively anticipate future problems, needs or changes in the work

Experience with Databases Azure SQL, PostgreSQL, MySQL, MongoDB, TSDB or similar databases.

Azure / GCP hands-on with details around pulling observability data from managed services

Golang / Python coding experience, or a solutioning background, with experience in SRE development and OpenTelemetry implementation

Deploying, managing and optimizing an enterprise-level observability platform for Grafana OSS products like Mimir, Loki, Tempo, Fluent Bit / Vector

Design and develop standard Grafana dashboards for critical metrics for various Azure / GCP services using the observability data

Experience must include Java (required); Python, Go, or Node.js is desired.

Knowledge of monitoring tools such as Log Analytics, AppDynamics, Grafana, Prometheus, Splunk, and SiteScope

Experience in working with ServiceNow or similar Service Management tools

Familiarity with Cloud technologies in Azure, AWS, and Google Cloud

Experience with PCF, Docker, and Kubernetes platforms is required.

Experience with DevOps and CI / CD tools and processes is required.

Experience in high-performance and high-frequency data streaming and health confirmation techniques (using Kafka etc.) and in handling large volumes of batch data is strongly preferred.

In-depth advanced knowledge of current monitoring tools

In-depth advanced knowledge of at least one major cloud platform and Service Container / Instance concepts

In-depth advanced knowledge of querying and inspection techniques for service and other types of logs

In-depth advanced knowledge of the full software development lifecycle and software development methodologies (Agile).

Strong ability to understand client expectations and to resolve issues that may affect service.

Strong ability to mentor, coach and train other application support engineers

Self-starter, with a demonstrated ability to learn beyond formal training with a strong aptitude for delivering quality products.

We believe the successful candidate has these qualifications and experience :

4-year degree (Computer Science, Information Systems, or related functional field) and / or equivalent combination of education or work experience.

2 to 9 years of experience in integration engineering related to Observability / Monitoring frameworks with open-source technologies such as Grafana, Mimir, Loki, Tempo, Fluent Bit, Vector, etc.

Hands-on experience with Tools and Technology is preferred.

2 to 9 years of experience as a Site Reliability Engineer is required.

Experience working with open-source platforms (e.g. Grafana) and OpenTelemetry libraries is preferred.

(ref : hirist.tech)
