Reliability Architect - AIOps / MLOps

Growel Softech Pvt. Ltd., India

9 days ago

Job description

Location : Pan India, except Mumbai

About the Role :

We are looking for a highly experienced Reliability Architect with strong expertise in proactive monitoring, observability, automation, AIOps / MLOps, and large-scale infrastructure management.

The ideal candidate will drive system reliability, performance optimization, and cross-functional collaboration while leading incident response and mentoring support teams.

Key Responsibilities :

Monitoring & Automation :

Proactively monitor software systems to prevent incidents and reduce manual intervention.

Automate routine operational tasks to maximize operational efficiency.

Effective Monitoring & Alerting :

Design intelligent monitoring systems that trigger symptom-based alerts for early issue detection.

Configure alert thresholds, anomaly detection rules, and escalation workflows.

Application Performance Monitoring (APM) :

Implement and manage APM tools such as New Relic, Dynatrace, and AppDynamics.

Track application performance, identify bottlenecks, and optimize resource utilization.

Log Analysis & Troubleshooting :

Leverage Splunk (or similar tools) for log analysis, anomaly detection, and incident debugging.

Improve system reliability through continuous log insights and root cause analysis.

Dashboards & Reporting :

Build intuitive dashboards visualizing system health, KPIs, and operational metrics.

Automate scheduled reports for performance trends, reliability metrics, and risk indicators.

Reliability Metrics & Observability :

Define and track SLOs, SLIs, error budgets, and other reliability benchmarks.

Apply full-stack observability practices including logs, metrics, distributed tracing, and event correlation.

AI-Driven Monitoring (AIOps / MLOps) :

Use AIOps to detect anomalies, automate incident response, and build self-healing workflows.

Integrate ML models with observability tools for predictive insights and performance optimization.

Cross-Team Collaboration :

Collaborate with development, DevOps, and support teams to enhance service reliability.

Strengthen release processes through rigorous testing, reviews, and monitoring integration.

Capacity Planning & Performance :

Participate in architecture and design reviews.

Ensure systems are scalable, resilient, and optimized for peak performance.

Debugging, Incident Response & Rollbacks :

Lead major incident response efforts with structured troubleshooting and RCA.

Manage controlled rollbacks of faulty deployments and ensure minimal service impact.

Mentoring & Knowledge Sharing :

Mentor L1 / L2 support teams, establishing best practices for monitoring and observability.

Promote a culture of reliability engineering and continuous improvement.

Infrastructure & Tooling :

Manage infrastructure using tools such as Chef, Ansible, Terraform, Kubernetes, and GitLab CI / CD.

Support automation, configuration management, and infrastructure-as-code workflows.

Documentation :

Maintain detailed documentation of processes, architectures, SOPs, and troubleshooting guides.

Proactive Mindset :

Drive reliability initiatives with ownership, enthusiasm, and a forward-thinking approach.

Desired Skills & Tools :

AIOps / MLOps platforms

Splunk, Grafana, Kibana, Prometheus

New Relic, Dynatrace, AppDynamics

Terraform, Ansible, Chef

GitLab CI / CD, Jenkins

Kubernetes, Docker

Strong debugging and RCA skills

Excellent communication and cross-functional collaboration

(ref : hirist.tech)
