Job Summary :
We are seeking a highly skilled and experienced Cloud Engineer with a strong Site Reliability Engineering (SRE) mindset to join our team. This role will be critical in ensuring the availability, reliability, and performance of our platform services and applications, particularly those supporting our Radio Access Network (RAN) and Core Network functions deployed on cloud infrastructure. The ideal candidate will possess deep expertise in Kubernetes and cloud operations, along with a passion for optimizing complex distributed systems. You will be instrumental in running our production environment, responding to critical incidents, and driving continuous improvement in system reliability and efficiency across both RAN and Core cloud deployments.
Key Responsibilities :
Platform Reliability & Availability (SRE Focus) :
Run the production environment by proactively monitoring availability and taking a holistic view of system health for our cloud-based RAN and Core Network platforms.
Improve the reliability and quality of the system through automation, process refinement, and best practices for both RAN and Core cloud components.
Measure and optimize system performance to ensure efficient resource utilization and optimal user experience for network services.
Ensure services are available and the underlying infrastructure is functioning properly, and monitor critical applications and related services to guarantee system availability for RAN and Core functions.
Cloud Operations & Kubernetes Management :
Design, deploy, and manage Kubernetes clusters and related cloud infrastructure for both RAN and Core Network application deployments.
Implement and maintain containerization strategies and orchestration best practices for telecom workloads.
Manage and troubleshoot Robin storage solutions within the Kubernetes environment, supporting the unique storage needs of RAN and Core applications.
Implement and manage CI / CD pipelines for cloud-native RAN and Core applications.
Own cloud resource provisioning, scaling, and cost optimization for all deployed network functions.
Incident & Problem Management :
Collaborate on high-priority incident tickets (e.g., MIC Reported Incident, Serious / Medium / Small Network Incidents, RIUD Faults), ensuring rapid system recovery for both RAN and Core impacted services.
Be on standby to interface with developers when issues arise and are escalated, providing immediate technical insights and support for cloud-native network functions.
Lead Problem Management efforts, including Root Cause Analysis (RCA), for complex incidents affecting RAN and Core cloud deployments.
Identify bugs and work with development teams to prioritize and implement fixes for cloud-native network elements.
Monitoring & Alerting :
Implement and maintain robust monitoring, logging, and alerting solutions for cloud infrastructure and applications supporting RAN and Core services.
Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical RAN and Core services running in the cloud.
Automation & Tooling :
Develop and implement automation scripts and tools to streamline operational tasks, deployments, and incident response for cloud-native RAN and Core components.
Evaluate and integrate new tools and technologies to enhance operational efficiency.
Collaboration & Knowledge Sharing :
Support Governance Reports by providing technical data and insights on cloud platform performance for RAN and Core.
Handle customer queries with technical expertise and provide timely resolutions related to cloud-deployed network services.
Provide training and mentorship to junior team members on cloud technologies and SRE practices, specifically in the context of telecom networks.
Work closely with development, network, and security teams to ensure seamless service delivery across the entire network architecture.
Technical Requirements (Most Visible) :
Deep expertise in Kubernetes :
Cluster deployment, management, and troubleshooting for high-performance telecom workloads.
Container orchestration, Pod lifecycle, Deployments, Services, Ingress.
Helm charts, Kustomize.
Advanced networking within Kubernetes (CNI, CoreDNS, service mesh concepts).
Security best practices in Kubernetes, especially for critical network functions.
Proficiency in Cloud Platforms : Experience with at least one major cloud provider (e.g., AWS, Azure, GCP) with a focus on enterprise-grade infrastructure.
Containerization Technologies : Docker, containerd.
Robin Storage : Hands-on experience with Robin.io or similar distributed persistent storage solutions for Kubernetes, particularly for stateful RAN and Core applications.
Infrastructure as Code (IaC) : Terraform, Ansible, or similar tools for automating cloud and Kubernetes deployments.
Scripting & Automation : Strong proficiency in Python, Go, Bash, or similar for developing automation and tooling.
Monitoring & Logging Tools : Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, Datadog, or similar, with experience in large-scale data ingestion and analysis.
CI / CD Tools : Jenkins, GitLab CI / CD, Argo CD, or similar, for continuous deployment of network functions.
Operating Systems : Expert-level knowledge of Linux (e.g., CentOS, Ubuntu, RHEL).
Networking Fundamentals : Deep understanding of TCP / IP, DNS, Load Balancing, Firewalls, VPNs, and advanced network concepts relevant to telecom (e.g., SRv6, Segment Routing, GTP-U / C).
Telecommunications Network Knowledge :
Strong understanding of Radio Access Network (RAN) architecture, components, and interfaces (e.g., O-RAN, vRAN concepts).
Strong understanding of Core Network (EPC / 5GC) architecture, functions (e.g., AMF, SMF, UPF, MME, SGW, PGW), and protocols.
Familiarity with network function virtualization (NFV) and software-defined networking (SDN) principles.
Qualifications :
Education : Bachelor’s degree in Computer Science, Engineering, or a related field.
Experience : Minimum of 5-6 years of experience in a Cloud Engineering, DevOps, or SRE role, with a significant focus on Kubernetes and cloud operations, ideally within a telecommunications or high-availability environment.
Problem-Solving : Exceptional analytical and problem-solving skills, with a methodical approach to debugging complex distributed systems.
Communication : Excellent verbal and written communication skills, capable of effectively collaborating with technical and non-technical stakeholders.
Proactive Mindset : Ability to anticipate issues, identify risks, and propose preventative solutions.
Incident Response : Proven experience in responding to and resolving critical production incidents in a fast-paced environment.
Continuous Improvement : A strong desire to learn, adapt, and drive continuous improvement in processes and systems.
Cloud Support Engineer • Bengaluru, Karnataka, India