Job description

Role Overview:
We are seeking a highly skilled and motivated engineer to serve as a Team Lead in our Compute Chapter supporting Operating System Operations. This role is ideal for a seasoned professional with deep expertise in Linux and Windows operating systems supporting virtual servers across a large-scale enterprise (10K+ servers). The ideal candidate will also guide junior engineers and drive operational excellence.
Key Responsibilities:
System Administration
- Install, configure, and maintain Linux and Windows operating systems across Private and Public Cloud environments
- Manage system lifecycle activities: provisioning, patching, upgrades, and decommissioning
- Monitor system health, performance, and availability through automation and enhanced Observability capabilities
- Perform troubleshooting and root cause analysis for OS-level incidents
- Implement security hardening and compliance standards (e.g., CIS benchmarks)
- Manage user access, permissions, and authentication (SSH, sudo, LDAP/AD integration)
- Develop and maintain automation using scripting and configuration management tools

Automation & Orchestration
- Build and maintain automation scripts and frameworks (e.g., using Terraform/Ansible) to streamline infrastructure provisioning, patching, and monitoring
- Integrate Server environments with enterprise automation platforms, observability toolsets, and CI/CD pipelines

24x7 Operations & Incident Management
- Build and manage a team responsible for 24x7 operations including software patching, technology currency, and vulnerability management
- Oversee incident response, root cause analysis, and resolution processes to ensure infrastructure resilience and compliance

Mentorship & Collaboration
- Mentor junior engineers and foster a culture of continuous learning and technical excellence
- Collaborate with cross-functional teams including network, security, and application teams to ensure seamless infrastructure operations

Operational Excellence
- Ensure high availability, scalability, and security of server environments
- Participate in incident response and root cause analysis for infrastructure-related issues
Qualifications:
- Strong Linux and Windows administration experience
- Shell scripting (Bash); Python preferred
- Configuration management and automation (Ansible, Terraform)
- System services and internals (systemd, cron, filesystems, LVM)
- Networking fundamentals (TCP/IP, DNS, NTP, firewalls)
- Logging and monitoring tools (Splunk, rsyslog, Prometheus, etc.)
- Experience with virtualization and/or cloud platforms (VMware, AWS, GCP)