Title : Senior Manager - Cloud SME "Red Hat OpenShift" (Private Cloud)
Location : Pune
Experience : 12+ years
Role Summary :
We are seeking a highly skilled Hybrid Cloud Platform Engineer to design, implement, and manage our consolidated platform built on Red Hat OpenShift Container Platform (RHOCP). This role requires expertise in running both containerized workloads and traditional Virtual Machines (VMs) using OpenShift Virtualization. The ideal candidate will be a deep technical expert in cloud orchestration, VM lifecycle management, and deploying comprehensive observability and analytics solutions across the hybrid environment to ensure performance, reliability, and cost efficiency.
The Senior NOC / SOC Operations Engineer will manage and operate Telco Cloud platforms built on Red Hat OpenShift and Private Cloud Virtualization environments supporting VNF / CNF workloads. This role requires strong hands-on experience in VM lifecycle management, orchestration, cloud management, and AI / ML-driven analytics. The engineer will ensure 24x7 availability, proactive monitoring, fault management, and lifecycle support of cloud-native and virtualized network functions in a Telco-grade production setup.
Key Responsibilities :
Cloud & Virtualization Operations
- Manage daily operations of Red Hat OpenShift (Kubernetes-based container platform) and RHOSP-based virtualization environments.
- Perform VM lifecycle operations (provisioning, scaling, migration, snapshot, decommissioning).
- Monitor and troubleshoot compute, storage, and network resources within Red Hat Private Cloud.
- Maintain and optimize hypervisors (KVM / QEMU), ensuring performance and availability SLAs.
- Manage tenant configurations, quotas, and multi-tenant isolation within RHOSP.
Orchestration & Automation
- Operate and maintain Red Hat CloudForms / Ansible Automation Platform for orchestration workflows.
- Support Day-0 to Day-2 operations through policy-driven automation templates.
- Integrate orchestration with VNFM / NFVO components for VNF / CNF deployments and scaling.
- Ensure alignment of orchestration workflows with ITSM change management processes.
VNF / CNF & Telco Cloud Operations
- Perform lifecycle management of VNFs and CNFs (onboarding, instantiation, scaling, termination).
- Troubleshoot network function issues in coordination with Network Engineering and NFV Orchestration teams.
- Validate service chains, SDN underlay / overlay connectivity, and application availability.
- Coordinate with OEM vendors for updates, patches, and RCA of incidents.
AI / ML-based Analytics & Observability
- Utilize AI / ML analytics platforms for predictive fault detection, anomaly identification, and performance optimization.
- Support implementation of closed-loop automation through analytics-driven triggers.
- Participate in continuous improvement initiatives for automated RCA and alert correlation.
Monitoring, Incident & Change Management
- Monitor infrastructure KPIs: CPU / memory utilization, pod / container health, network throughput, and storage latency.
- Respond to alerts from monitoring tools (Zabbix, Prometheus, Grafana, or OpenShift Console).
- Manage incidents, problems, and change activities following ITIL guidelines.
- Maintain configuration documentation, CMDB updates, and operational dashboards.
Key Skills and Certifications :
- Platform / Containerization: Red Hat OpenShift Container Platform (RHOCP), Kubernetes, Operators, CRI-O, Pod Networking (SDN / CNI).
- Virtualization: OpenShift Virtualization (KubeVirt), VM Lifecycle Management (provisioning, migration, snapshots), KVM, virt-launcher pods.
- Orchestration / Automation: Ansible (Playbooks, Roles, Automation Platform), GitOps (ArgoCD or OpenShift GitOps), Infrastructure-as-Code (IaC), Tekton or Jenkins.
- Observability / Analytics: Prometheus (Metrics), Grafana (Visualization), Loki / Vector / Fluentd (Logging), Jaeger / OpenTelemetry (Tracing), Data analysis for capacity planning.
- Networking / Storage: SDN, CNI, Load Balancing, Ingress / Egress, Red Hat OpenShift Data Foundation (ODF) or Ceph / NFS / iSCSI, Persistent Volume Claims (PVCs).
Cloud Operations Requirements w.r.t. RHOCP :
1. Cluster Management and Reliability
- High Availability (HA): Implement and maintain HA for the control plane (3 or 5 master nodes) and worker nodes across availability zones / domains.
- Automated Lifecycle: Use OpenShift Operators and the Cluster Version Operator (CVO) for automated, non-disruptive upgrades, patches, and security fixes for the platform and its add-ons.
- Security: Proactively manage Red Hat Enterprise Linux CoreOS (RHCOS) for control plane nodes, using immutability to enhance security and simplify patching.
- Disaster Recovery (DR): Implement robust backup and restore strategies for cluster configuration, etcd data, and critical workloads using tools like OpenShift APIs for Data Protection (OADP) or similar solutions.
2. Virtualization Operations
- Unified Management: Manage the VM lifecycle (provisioning, scaling, decommissioning) using Kubernetes API objects (like VirtualMachine, VirtualMachineInstance) and the OpenShift Console, treating VMs as native cluster resources alongside containers.
- Workload Migration: Utilize the Migration Toolkit for Virtualization (MTV) to streamline the move of existing VMs from external virtualization platforms (like VMware or RHEV) to OpenShift Virtualization.
- Compute Consistency: Ensure consistent application of network and storage policies for both container pods and KubeVirt VMs.
3. Observability and Analytics
- Unified Monitoring: Consolidate metrics, logs, and traces from both container pods and virtual machines into a single platform (e.g., OpenShift's integrated Prometheus / Grafana stack).
- Proactive Alerting: Configure alerts based on predefined SLOs / SLIs for the health of the OpenShift cluster, underlying infrastructure, and key VM / application performance indicators (CPU, Memory, Disk IO).
- Capacity Planning: Regularly analyze historical usage data from OpenShift (including VM utilization) to predict future resource needs and optimize cost efficiency across the hybrid cloud footprint.
- Troubleshooting: Establish runbooks and procedures for using the observability data to quickly isolate the root cause of issues, whether they stem from the container layer, the virtualization layer, or the underlying cloud / hardware.