Role Overview
This role will be instrumental in ensuring our academic institution’s IT infrastructure is secure, scalable, cost-effective, and aligned with our mission to support education and research excellence. The ideal candidate will bring deep technical expertise in cloud platforms, infrastructure automation, and CI/CD practices, with a growing focus on enabling AI/ML workloads in production environments. They will also possess strong skills in vendor and budget management, along with a clear commitment to operational excellence and service delivery in a complex, mission-driven environment.
Principal Accountabilities & Responsibilities
- Design, deploy, and manage secure, scalable cloud infrastructure (primarily AWS; GCP exposure is a plus).
- Oversee core IT infrastructure, including networking, server provisioning, storage, and backup solutions.
- Ensure adherence to institutional compliance standards and security best practices; conduct root cause analysis for infrastructure incidents.
- Administer and optimize containerized environments using Docker and Kubernetes (EKS preferred).
- Design and implement automated backup and disaster recovery strategies across cloud and on-premises environments, ensuring data resilience and compliance with RTO/RPO objectives.
- Lead response to downtime events, developing proactive strategies to minimize system outages, optimize recovery time, and maintain high availability across critical services.
- Monitor and analyse cloud workloads to ensure high performance and cost efficiency.
- Manage cloud infrastructure budgets and pricing strategies, ensuring optimal resource allocation and cost control.
- Serve as the single point of contact (SPOC) for vendor and license management (e.g., Zoom, Microsoft 365).
- Guide and mentor junior team members.
- Expertise in site reliability engineering (SRE) and security may also be required.
- Lead digital transformation initiatives to modernize IT infrastructure in alignment with academic and operational goals.
Skill and Ability Requirements
- Minimum of 5 years of progressively responsible experience in cloud infrastructure and operations.
- Strong proficiency in Amazon Web Services (AWS), including EC2, S3, IAM, CloudWatch, and Lambda.
- In-depth understanding of networking concepts such as VPC, DNS, Load Balancing, and VPN.
- Expertise in containerization with Docker and orchestration using Kubernetes (EKS preferred).
- Experience with cloud security tools, including AWS Security Hub and CloudTrail.
- Hands-on experience with monitoring tools, such as Prometheus, Grafana, and Datadog.
- Strong experience in cloud cost optimization, pricing analysis, and budget management.
- Proficiency with CI/CD tools and pipelines (e.g., Jenkins, GitHub Actions).
- Experience integrating CI/CD for infrastructure-as-code (IaC) and AI/ML workflows.
- Experience deploying AI models and solutions in cloud environments (e.g., AWS SageMaker, Azure ML, GCP Vertex AI).
- Proven experience in Linux system administration and shell scripting.
- Demonstrated ability to manage vendor relationships and license agreements (Zoom, Microsoft 365).
- Familiarity with Google Cloud Platform (GCP) is a plus.
- Excellent problem-solving skills and the ability to lead incident response and root cause analysis.
Qualification & Experience
- Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
- AWS certifications are highly desirable.
- Minimum of 5 years of progressively responsible experience in cloud infrastructure and operations.