Talent.com
Senior Manager - Site Reliability Engineering

Confidential, Chennai, India
11 days ago
Job description

Responsibilities

JOB SUMMARY

  • Lead the implementation and advocacy of SRE (Site Reliability Engineering) principles to improve the reliability and availability of our applications
  • Drive work on setting and maintaining SLIs, SLOs, and error budgets for our applications
  • Responsible for developing and executing on the Chapter Vision together with the other Chapter Leads
  • Drive technology strategy, technology stack selection, and implementation for a future-ready technology stack, to achieve highly scalable, robust, and resilient systems.
  • Experienced former practitioner with leadership ability.
  • Oversees the execution of functional standards and best practices
  • Provide thought leadership on the craft; inspire and retain talent by developing and nurturing an extensive internal and external network of practitioners.
  • This role is about capability building; it does not own applications or delivery
  • Creates a strategy roadmap of technical work
  • Works to drive technology convergence and simplification across their chapter area

Technical Responsibilities

  • Service Reliability: Monitor and maintain the reliability, availability, and performance of production services and infrastructure.
  • Automation and Tooling: Develop and maintain automation tools and processes to streamline system provisioning, configuration management, deployment, and monitoring.
  • Incident Management: Respond to and troubleshoot incidents, outages, and performance issues in production environments, ensuring timely resolution and minimal impact on users.
  • Blameless Postmortems and Learning from Incidents: Participate in the wider root cause analysis and support & drive collaborative actions.
  • Capacity Planning: Analyse system performance and capacity trends to forecast future resource requirements and optimize infrastructure utilization.
  • Performance Optimization: Identify and address performance bottlenecks and optimization opportunities across the software stack, from application code to underlying infrastructure.
  • Security and Compliance: Implement security best practices and ensure compliance with regulatory requirements, collaborating with security and compliance teams as needed.
  • Continuous Improvement: Continuously evaluate and improve system reliability, scalability, and performance through automation, process refinement, and technology upgrades.
  • Documentation and Knowledge Sharing: Document system designs, configurations, and procedures, and share knowledge with team members through documentation, training, and mentoring.

Strategy

  • Reliability Engineering Strategy – Develop and execute a comprehensive reliability engineering strategy to ensure high availability, fault tolerance and disaster recovery capabilities for critical systems and services
  • Scalability Planning – Design and implement scalable architecture solutions that can accommodate growth in user traffic and data volume over time
  • Monitoring and Alerting Strategy – Define and implement monitoring and alerting strategies to proactively identify and address issues before they reach end users
  • Capacity Planning Strategies – Develop capacity planning strategies to ensure that systems have sufficient resources to handle current and future workloads

Business

  • Experienced practitioner who contributes hands-on to squad delivery in their craft (e.g., SRE).
  • Responsible for balancing skills and capabilities across teams (squads) and hives in partnership with the Chief Product Owner & Hive Leadership, and in alignment with the fixed capacity model.
  • Responsible for evolving the craft towards greater automation, simplification, and innovative use of the latest market trends.
  • Trusted advisor to the business. Work hand in hand with the Business, taking product programs from investment decisions, into design, specification, and solution phases, all the way to operations on the ground and securing support services from other teams.
  • Provide leadership and technical expertise for the subdomain to achieve goals and outcomes
  • Support respective businesses in the commercialisation of capabilities, bid teams, monitoring of usage, improving client experience, and collecting defects for future improvements.
  • Manage business partner expectations. Ensure delivery to the business on time, within cost, and with high quality

Processes

  • The Chapter Lead's responsibilities may vary based on the specific chapter domain they are leading.
  • Define standards to ensure that applications are designed with scale, resilience, and performance in mind
  • Enforce and streamline sound development practices and establish and maintain effective governance processes including training, advice, and support, to assure the platforms are developed, implemented, and maintained aligning with the Group's standards
  • Responsible for overall governance of the subdomain that includes risk management, representation in steering committee reviews and engagement with business for strategy, change management and timely course correction as required
  • Ensure compliance with the highest standards of business conduct, regulatory requirements and practices defined by internal and external requirements. This includes compliance with local banking laws and anti-money laundering stipulations

People & Talent

  • Accountable for people management and capability development of their Chapter members.
  • Reviews metrics on capabilities and performance across their area, maintains an improvement backlog for their Chapter, and drives its continual improvement.
  • Focuses on the development of people and capabilities as the highest priority.
  • Ensure that the organisation works in a proactive way to upgrade capacity well in advance and predict future capacity needs
  • Responsible for building an engineering culture where application and infrastructure scalability is paramount for on-going capacity management with an aim to reduce the need for capacity reviews using monitoring and auto-scale properties
  • Empower the engineers so that they can provide economy of scale focused on delivering value, speed to market, availability, monitoring & system management
  • Foster a culture of innovation, transparency, and accountability end to end in the subdomain while promoting a 'business-first' mentality at all levels
  • Develop and maintain a plan that provides for succession and continuity in the most critical delivery and management positions

Risk Management

  • Responsible for effective capacity risk management across the Chapter with regards to attrition and leave plans.
  • Ensures the chapter follows the standards with respect to risk management as applicable to their chapter domain.
  • Adheres to common practices to mitigate risk in their respective domain.
  • Effectively and collaboratively identify, escalate, mitigate, and resolve risk, conduct and compliance matters.
  • Incident Response Planning – Develop incident response plans and procedures to effectively mitigate and manage risks when they materialize
  • Risk monitoring and alerting – Implement monitoring and alerting systems to detect early warning signs of potential risks
  • Root Cause analysis – Conduct thorough root cause analysis of incidents and outages to understand the underlying causes and contributing factors

Regulatory & Governance

  • Ensure all artefacts and assurance deliverables are as per the required standards and policies (e.g., SCB Governance Standards, ESDLC etc.).
  • Display exemplary conduct and live by the Group's Values and Code of Conduct.
  • Take personal responsibility for embedding the highest standards of ethics, including regulatory and business conduct, across Standard Chartered Bank. This includes understanding and ensuring compliance with, in letter and spirit, all applicable laws, regulations, guidelines and the Group Code of Conduct.

Key Stakeholders

  • WRB Application Teams
  • Chief Product Owner, Hive Lead, Product Owners, Engineering Leads

Other Responsibilities

  • Embed "Here for good" and the Group's brand and values in the digital sales / commerce team
  • Perform other responsibilities assigned under Group, Country, Business or Functional policies and procedures

Requirements & Skills

  • Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).
  • Proven experience (10+ years) as a Site Reliability Engineer or in a similar role, with a track record of leadership.
  • Strong understanding of SRE principles and practices.
  • Proficiency in troubleshooting complex issues and exceptional problem-solving skills.
  • Deep knowledge of a wide array of software applications and infrastructure.
  • Experience with monitoring and observability tools (e.g., Prometheus, Grafana, AppDynamics, Splunk, PagerDuty).
  • Proficiency in scripting and automation (e.g., Python, Bash, Ansible).
  • Familiarity with cloud platforms (e.g., AWS, Azure) and containerization technologies (e.g., Docker, Kubernetes).
  • Excellent communication and collaboration skills.
  • Ability to work in a fast-paced, dynamic environment.
  • Strong attention to detail and a commitment to delivering high-quality results.
  • Ability to debug and troubleshoot Java applications.
  • Proficiency in using Splunk for log management and analysis.
  • Familiarity with CI / CD tools and practices.
  • Experience in the banking or financial services industry.
  • Certification in relevant technologies (e.g., AWS Certified Solutions Architect, Google Cloud Professional DevOps Engineer).
  • Knowledge of security best practices and compliance requirements.
  • Ability to articulate the overall vision for the Chapters and ensure upskilling of the organisation holistically
  • Experience in identifying skill gaps and mitigating risks to deliverables
  • Ensure all solutions are as per Architecture Standards
  • Strong experience in software development, system administration, or a related technical field.
  • Proficiency in programming / scripting languages such as Python, Go, Java, or Shell scripting.
  • Experience with containerization and orchestration technologies such as Docker, Kubernetes, or similar.
  • Deep understanding of Linux / Unix systems and networking fundamentals.
  • Experience with cloud platforms such as AWS, GCP, or Azure.
  • Strong analytical and problem-solving skills, with a keen attention to detail.
  • Excellent communication and collaboration skills, with the ability to work effectively in a cross-functional team environment.
  • Prior experience with DevOps practices, continuous integration / continuous delivery (CI / CD) pipelines, and infrastructure as code (IaC) is a plus.

Role Specific Technical Competencies

  • Software Engineering
  • Systems Software Infrastructure
  • Platform Architecture
  • Programming & Scripting (Java / Python or Similar Programming Language)
  • Cloud (AWS, Azure, GCP)
  • Database Development
  • Service Excellence
  • Agile Application Delivery Process
  • Operating Systems
  • Network Fundamentals
  • Security Fundamentals
  • Core Banking Domain Knowledge

About Standard Chartered

    We're an international bank, nimble enough to act, big enough for impact. For more than 170 years, we've worked to make a positive difference for our clients, communities, and each other. We question the status quo, love a challenge and enjoy finding new opportunities to grow and do better than before. If you're looking for a career with purpose and you want to work for a bank making a difference, we want to hear from you. You can count on us to celebrate your unique talents and we can't wait to see the talents you can bring us.

    Our purpose, to drive commerce and prosperity through our unique diversity, together with our brand promise, to be here for good, are achieved by how we each live our valued behaviours. When you work with us, you'll see how we value difference and advocate inclusion.

    Together We

  • Do the right thing and are assertive, challenge one another, and live with integrity, while putting the client at the heart of what we do
  • Never settle, continuously striving to improve and innovate, keeping things simple and learning from doing well, and not so well
  • Are better together, we can be ourselves, be inclusive, see more good in others, and work collectively to build for the long term

What We Offer

    In line with our Fair Pay Charter, we offer a competitive salary and benefits to support your mental, physical, financial and social wellbeing.

  • Core bank funding for retirement savings, medical and life insurance, with flexible and voluntary benefits available in some locations.
  • Time off including annual leave, parental / maternity (20 weeks), sabbatical (12 months maximum) and volunteering leave (3 days), along with minimum global standards for annual leave and public holidays, which combine to a minimum of 30 days.
  • Flexible working options based around home and office locations, with flexible working patterns.
  • Proactive wellbeing support through Unmind, a market-leading digital wellbeing platform, development courses for resilience and other human skills, global Employee Assistance Programme, sick leave, mental health first-aiders and all sorts of self-help toolkits
  • A continuous learning culture to support your growth, with opportunities to reskill and upskill and access to physical, virtual and digital learning.
  • Being part of an inclusive and values driven organisation, one that embraces and celebrates our unique diversity, across our teams, business functions and geographies - everyone feels respected and can realise their full potential.

Skills Required

    System Administration, Software Development
