We are seeking a highly skilled and proactive Server Management Lead to oversee end-to-end operations of data center infrastructure. This role involves ensuring high availability, optimal performance, cost efficiency, and adherence to security standards across our server ecosystem. The candidate will also lead a team of six Level 1 and Level 2 engineers to achieve near-zero downtime while maintaining industry-leading operational excellence.
Key Responsibilities
Infrastructure Management
- Design, deploy, and maintain golden images for consistent and secure server provisioning.
- Ensure standardized builds and configurations across all environments.
- Oversee hardware and OS lifecycle management, including patching and upgrades.
Security & Compliance
- Conduct regular Vulnerability Assessment and Penetration Testing (VAPT).
- Remediate identified risks in line with security best practices and compliance requirements.
- Enforce access control, audit readiness, and adherence to organizational security policies.
Performance & Capacity Planning
- Develop and maintain a capacity planning framework to anticipate and scale resources proactively.
- Monitor system performance, troubleshoot bottlenecks, and optimize resource allocation.
- Partner with architecture teams to align capacity with business growth.
Monitoring & Uptime
- Implement and fine-tune end-to-end monitoring tools (infrastructure, application, and network layers).
- Establish escalation procedures and SLAs to maintain 99.99% uptime.
- Lead root cause analysis (RCA) for incidents and drive permanent corrective actions.
Cost Optimization
- Analyze server utilization trends to identify cost-saving opportunities (rightsizing, consolidation, cloud/hybrid strategies).
- Implement automation for provisioning, scaling, and decommissioning resources to reduce waste.
- Provide periodic reporting to leadership on cost-performance balance.
Leadership & Team Management
- Manage and mentor a team of 6 Level 1 & Level 2 engineers, fostering technical growth and operational discipline.
- Define KPIs for performance, ticket resolution, and uptime accountability.
- Promote a culture of continuous improvement, automation, and service excellence.
________________________________________
Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 8–12 years of experience in server management / data center operations, including at least 3 years in a leadership role.
- Strong expertise in virtualization, server operating systems (Linux / Windows), storage, and networking fundamentals.
- Hands-on experience with monitoring platforms (Site24x7, Patch Manager, etc.) and automation tools (Ansible, Puppet, or similar) is an added advantage.
- Proven track record of driving zero-downtime initiatives and cost optimization in enterprise environments.
________________________________________
Key Competencies
- Technical Excellence – deep understanding of server operations and best practices.
- Leadership – ability to lead and inspire a team, with strong decision-making skills.
- Analytical Thinking – capacity planning, problem-solving, and cost analysis.
- Resilience & Accountability – ensuring uptime and compliance under pressure.
- Communication – ability to work cross-functionally and present technical insights to leadership.
________________________________________
Success Metrics
- Consistent achievement of 99.99% uptime across server infrastructure.
- Successful closure of all VAPT findings within SLA.
- Demonstrated cost reduction in server operations through optimization initiatives.
- Improved incident resolution times and reduced recurring issues.
- High team engagement and skill growth within the engineering group.
Skills Required
Linux, Ansible, Server Management, Data Center Operations, Windows, Puppet, Virtualization