Reliability Architect - AIOps / MLOps

Growel Softech Pvt. Ltd., India
9 days ago
Job description

Location : Pan India, except Mumbai

About the Role :

We are looking for a highly experienced Reliability Architect with strong expertise in proactive monitoring, observability, automation, AIOps / MLOps, and large-scale infrastructure management.

The ideal candidate will drive system reliability, performance optimization, and cross-functional collaboration while leading incident response and mentoring support teams.

Key Responsibilities :

Monitoring & Automation :

  • Proactively monitor software systems to prevent incidents and reduce manual intervention.
  • Automate routine operational tasks to improve efficiency.

Effective Monitoring & Alerting :

  • Design intelligent monitoring systems that trigger symptom-based alerts for early issue detection.
  • Configure alert thresholds, anomaly detection rules, and escalation workflows.

Application Performance Monitoring (APM) :

  • Implement and manage APM tools such as New Relic, Dynatrace, AppDynamics, etc.
  • Track application performance, identify bottlenecks, and optimize resource utilization.

Log Analysis & Troubleshooting :

  • Leverage Splunk (or similar tools) for log analysis, anomaly detection, and incident debugging.
  • Improve system reliability through continuous log insights and root cause analysis.

Dashboards & Reporting :

  • Build intuitive dashboards visualizing system health, KPIs, and operational metrics.
  • Automate scheduled reports for performance trends, reliability metrics, and risk indicators.

Reliability Metrics & Observability :

  • Define and track SLOs, SLIs, error budgets, and other reliability benchmarks.
  • Apply full-stack observability practices including logs, metrics, distributed tracing, and event correlation.

AI-Driven Monitoring (AIOps / MLOps) :

  • Use AIOps to detect anomalies, automate incident response, and build self-healing workflows.
  • Integrate ML models with observability tools for predictive insights and performance optimization.

Cross-Team Collaboration :

  • Collaborate with development, DevOps, and support teams to enhance service reliability.
  • Strengthen release processes through rigorous testing, reviews, and monitoring integration.

Capacity Planning & Performance :

  • Participate in architecture and design reviews.
  • Ensure systems are scalable, resilient, and optimized for peak performance.

Debugging, Incident Response & Rollbacks :

  • Lead major incident response efforts with structured troubleshooting and RCA.
  • Manage controlled rollbacks of faulty deployments and ensure minimal service impact.

Mentoring & Knowledge Sharing :

  • Mentor L1 / L2 support teams, establishing best practices for monitoring and observability.
  • Promote a culture of reliability engineering and continuous improvement.

Infrastructure & Tooling :

  • Manage infrastructure using tools like Chef, Ansible, Terraform, Kubernetes, GitLab CI / CD, etc.
  • Support automation, configuration management, and infrastructure-as-code workflows.

Documentation :

  • Maintain detailed documentation of processes, architectures, SOPs, and troubleshooting guides.

Proactive Mindset :

  • Drive reliability initiatives with ownership, enthusiasm, and a forward-thinking approach.

Desired Skills & Tools :

  • AIOps / MLOps platforms
  • Splunk, Grafana, Kibana, Prometheus
  • New Relic, Dynatrace, AppDynamics
  • Terraform, Ansible, Chef
  • GitLab CI / CD, Jenkins
  • Kubernetes, Docker
  • Strong debugging and RCA skills
  • Excellent communication and cross-functional collaboration

(ref : hirist.tech)
