Senior Manager - Site Reliability Engineering

Confidential, Chennai, India

11 days ago

Job description

Responsibilities

JOB SUMMARY

Lead the implementation of, and advocacy for, SRE (Site Reliability Engineering) principles to improve the reliability and availability of our applications
Drive work on setting and maintaining SLI / SLO / Error budgets for our applications
Responsible for developing and executing on the Chapter Vision together with the other Chapter Leads
Drive technology strategy, technology stack selection, and implementation for a future-ready technology stack, to achieve a highly scalable, robust, and resilient system.
Experienced former practitioner with leadership ability.
Oversees the execution of functional standards and best practices
Provide thought leadership on the craft, and inspire and retain talent by developing and nurturing an extensive internal and external network of practitioners.
This role is about capability building; it does not own applications or delivery.
Creates a strategy roadmap of technical work
Works to drive technology convergence and simplification across their chapter area

Technical Responsibilities

Service Reliability : Monitor and maintain the reliability, availability, and performance of production services and infrastructure.

Automation and Tooling : Develop and maintain automation tools and processes to streamline system provisioning, configuration management, deployment, and monitoring.

Incident Management : Respond to and troubleshoot incidents, outages, and performance issues in production environments, ensuring timely resolution and minimal impact on users.

Blameless Postmortems and Learning from Incidents : Participate in the wider root cause analysis, and support and drive collaborative actions.

Capacity Planning : Analyse system performance and capacity trends to forecast future resource requirements and optimize infrastructure utilization.

Performance Optimization : Identify and address performance bottlenecks and optimization opportunities across the software stack, from application code to underlying infrastructure.

Security and Compliance : Implement security best practices and ensure compliance with regulatory requirements, collaborating with security and compliance teams as needed.

Continuous Improvement : Continuously evaluate and improve system reliability, scalability, and performance through automation, process refinement, and technology upgrades.

Documentation and Knowledge Sharing : Document system designs, configurations, and procedures, and share knowledge with team members through documentation, training, and mentoring.

Strategy

Reliability Engineering Strategy – Develop and execute a comprehensive reliability engineering strategy to ensure high availability, fault tolerance and disaster recovery capabilities for critical systems and services

Scalability Planning – Design and implement scalable architecture solutions that can accommodate growth in user traffic and data volume over time

Monitoring and Alerting Strategy – Define and implement monitoring and alerting strategies to proactively identify and address issues before they reach end users

Capacity Planning Strategies – Develop capacity planning strategies to ensure that systems have sufficient resources to handle current and future workloads

Business

Experienced practitioner who makes hands-on contributions to squad delivery for their craft (e.g., SRE).

Responsible for balancing skills and capabilities across teams (squads) and hives in partnership with the Chief Product Owner & Hive Leadership, and in alignment with the fixed capacity model.

Responsible for evolving the craft towards greater automation, simplification, and innovative use of the latest market trends.

Trusted advisor to the business. Work hand in hand with the Business, taking product programs from investment decisions, into design, specification, and solution phases, all the way to operations on the ground and securing support services from other teams.

Provide leadership and technical expertise for the subdomain to achieve goals and outcomes

Support respective businesses in the commercialisation of capabilities, bid teams, monitoring of usage, improving client experience, and collecting defects for future improvements.

Manage business partner expectations. Ensure delivery to the business on time, within cost, and with high quality.

Processes

The Chapter Lead role may vary based on the specific chapter domain it is leading.

Define standards to ensure that applications are designed with scale, resilience, and performance in mind

Enforce and streamline sound development practices and establish and maintain effective governance processes including training, advice, and support, to assure the platforms are developed, implemented, and maintained aligning with the Group's standards

Responsible for overall governance of the subdomain that includes risk management, representation in steering committee reviews and engagement with business for strategy, change management and timely course correction as required

Ensure compliance with the highest standards of business conduct, regulatory requirements and practices defined by internal and external requirements. This includes compliance with local banking laws and anti-money laundering stipulations

People & Talent

Accountable for people management and capability development of their Chapter members.

Reviews metrics on capabilities and performance across their area, maintains an improvement backlog for their Chapter, and drives its continual improvement.

Focuses on the development of people and capabilities as the highest priority.

Ensure that the organisation works in a proactive way to upgrade capacity well in advance and predict future capacity needs

Responsible for building an engineering culture where application and infrastructure scalability is paramount for on-going capacity management with an aim to reduce the need for capacity reviews using monitoring and auto-scale properties

Empower the engineers so that they can provide economy of scale focused on delivering value, speed to market, availability, monitoring & system management

Foster a culture of innovation, transparency, and accountability end to end in the subdomain while promoting a 'business-first' mentality at all levels

Develop and maintain a plan that provides for succession and continuity in the most critical delivery and management positions

Risk Management

Responsible for effective capacity risk management across the Chapter with regards to attrition and leave plans.

Ensures the chapter follows the standards with respect to risk management as applicable to their chapter domain.

Adheres to common practices to mitigate risk in their respective domain.

Effectively and collaboratively identify, escalate, mitigate, and resolve risk, conduct and compliance matters.

Incident Response Planning – Develop incident response plans and procedures to effectively mitigate and manage risks when they materialize

Risk monitoring and alerting – Implement monitoring and alerting systems to detect early warning signs of potential risks

Root Cause analysis – Conduct thorough root cause analysis of incidents and outages to understand the underlying causes and contributing factors

Regulatory & Governance

Ensure all artefacts and assurance deliverables are as per the required standards and policies (e.g., SCB Governance Standards, ESDLC etc.).

Display exemplary conduct and live by the Group's Values and Code of Conduct.

Take personal responsibility for embedding the highest standards of ethics, including regulatory and business conduct, across Standard Chartered Bank. This includes understanding and ensuring compliance with, in letter and spirit, all applicable laws, regulations, guidelines and the Group Code of Conduct.

Key Stakeholders

WRB Application Teams

Chief Product Owner, Hive Lead, Product Owners, Engineering Leads

Other Responsibilities

Embed "Here for good" and the Group's brand and values in the digital sales / commerce team

Perform other responsibilities assigned under Group, Country, Business or Functional policies and procedures

Requirements & Skills

Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).

10+ years of experience as a Site Reliability Engineer or in a similar role, with a proven track record of leadership.

Strong understanding of SRE principles and practices.

Proficiency in troubleshooting complex issues and exceptional problem-solving skills.

Deep knowledge of a wide array of software applications and infrastructure.

Experience with monitoring and observability tools (e.g., Prometheus, Grafana, AppDynamics, Splunk, PagerDuty).

Proficiency in scripting and automation (e.g., Python, Bash, Ansible).

Familiarity with cloud platforms (e.g., AWS, Azure) and containerization technologies (e.g., Docker, Kubernetes).

Excellent communication and collaboration skills.

Ability to work in a fast-paced, dynamic environment.

Strong attention to detail and a commitment to delivering high-quality results.

Ability to debug and troubleshoot Java applications.

Proficiency in using Splunk for log management and analysis.

Familiarity with CI / CD tools and practices.

Experience in the banking or financial services industry.

Certification in relevant technologies (e.g., AWS Certified Solutions Architect, Google Cloud Professional DevOps Engineer).

Knowledge of security best practices and compliance requirements.

Ability to articulate the overall vision for the Chapters and ensure upskilling of the organisation holistically

Experience in identifying skill gaps and mitigating risks to deliverables

Ensure all solutions are as per Architecture Standards

Strong experience in software development, system administration, or a related technical field.

Proficiency in programming / scripting languages such as Python, Go, Java, or Shell scripting.

Experience with containerization and orchestration technologies such as Docker, Kubernetes, or similar.

Deep understanding of Linux / Unix systems and networking fundamentals.

Experience with cloud platforms such as AWS, GCP, or Azure.

Strong analytical and problem-solving skills, with a keen attention to detail.

Excellent communication and collaboration skills, with the ability to work effectively in a cross-functional team environment.

Prior experience with DevOps practices, continuous integration / continuous delivery (CI / CD) pipelines, and infrastructure as code (IaC) is a plus.

Role Specific Technical Competencies

Software Engineering

Systems Software Infrastructure

Platform Architecture

Programming & Scripting (Java / Python or Similar Programming Language)

Cloud (AWS, Azure, GCP)

Database Development

Service Excellence

Agile Application Delivery Process

Operating Systems

Network Fundamentals

Security Fundamentals

Core Banking Domain Knowledge

About Standard Chartered

We're an international bank, nimble enough to act, big enough for impact. For more than 170 years, we've worked to make a positive difference for our clients, communities, and each other. We question the status quo, love a challenge and enjoy finding new opportunities to grow and do better than before. If you're looking for a career with purpose and you want to work for a bank making a difference, we want to hear from you. You can count on us to celebrate your unique talents and we can't wait to see the talents you can bring us.

Our purpose, to drive commerce and prosperity through our unique diversity, together with our brand promise, to be here for good, are achieved by how we each live our valued behaviours. When you work with us, you'll see how we value difference and advocate inclusion.

Together We

Do the right thing and are assertive, challenge one another, and live with integrity, while putting the client at the heart of what we do

Never settle, continuously striving to improve and innovate, keeping things simple and learning from doing well, and not so well

Are better together, we can be ourselves, be inclusive, see more good in others, and work collectively to build for the long term

What We Offer

In line with our Fair Pay Charter, we offer a competitive salary and benefits to support your mental, physical, financial and social wellbeing.

Core bank funding for retirement savings, medical and life insurance, with flexible and voluntary benefits available in some locations.

Time off including annual leave, parental / maternity leave (20 weeks), sabbatical (12 months maximum) and volunteering leave (3 days), along with minimum global standards for annual leave and public holidays, which combine to a minimum of 30 days.

Flexible working options based around home and office locations, with flexible working patterns.

Proactive wellbeing support through Unmind, a market-leading digital wellbeing platform, development courses for resilience and other human skills, global Employee Assistance Programme, sick leave, mental health first-aiders and all sorts of self-help toolkits

A continuous learning culture to support your growth, with opportunities to reskill and upskill and access to physical, virtual and digital learning.

Being part of an inclusive and values driven organisation, one that embraces and celebrates our unique diversity, across our teams, business functions and geographies - everyone feels respected and can realise their full potential.

Skills Required

System Administration, Software Development
