ThoughtSpot is an AI-powered analytics platform that enables users to explore and analyse data through natural language queries, making insights accessible to all. Our mission is to deliver reliable, high-performing applications that empower our customers. We are seeking a Systems Reliability Engineer who excels at providing technical support for our customers, incident management and resolution, and cloud operations within a customer-centric environment.
Role Overview :
We are seeking an experienced and highly motivated Senior Systems Reliability Engineer II (SRE) to join our growing team. This is a customer-facing leadership role that requires a unique combination of technical expertise, strong communication skills, and the ability to perform under pressure. The individual in this position will be responsible for ensuring the reliability, performance, and stability of mission-critical systems, while serving as a trusted technical partner to our customers. The ideal candidate is a self-driven professional who thrives in fast-paced environments, demonstrates maturity in handling complex situations, and is comfortable managing escalations with professionalism and composure.
Responsibilities :
- Act as the primary technical liaison for customers, handling escalations and fostering positive, professional relationships.
- Take ownership of end-to-end incident resolution, ensuring timely communication, troubleshooting, and root cause analysis for critical issues.
- Drive reliability and performance improvements across large-scale distributed systems, with a focus on Hadoop, HDFS, Zookeeper, and related technologies.
- Perform advanced Linux system administration and troubleshooting, including performance optimisation and debugging of system-level issues.
- Collaborate with cross-functional teams to design and maintain reliable, scalable infrastructure spanning on-premises, hybrid, and cloud environments (AWS, GCP, Kubernetes).
- Troubleshoot and resolve issues related to UI systems, private links, VPN configurations, and custom domain management.
- Partner with engineering and product teams to proactively identify risks, implement preventive measures, and drive long-term reliability strategies.
- Provide clear documentation, knowledge sharing, and guidance to both internal stakeholders and external customers.
- Uphold a balanced and professional demeanour, maintaining customer trust while delivering results during high-stakes and time-sensitive situations.
Requirements :
- B.E. / B.Tech / B.Sc in computer science or relevant industry experience.
- Extensive hands-on experience with Linux OS troubleshooting, performance tuning, and system debugging.
- Proven expertise with distributed systems, particularly the Hadoop ecosystem (HDFS, Zookeeper, etc.).
- Strong grounding in networking concepts, infrastructure design, and database systems.
- Demonstrated proficiency with cloud platforms (AWS, GCP) and container orchestration technologies such as Kubernetes.
- In-depth understanding of VPNs, private links, and custom domain configuration.
- Experience in UI application troubleshooting.
- Strong interpersonal, communication, and customer relationship management skills, with the ability to lead discussions and resolve conflicts in high-pressure scenarios.
- Track record of working effectively in fast-paced, high-demand environments as a self-learner and independent contributor.
Preferred Qualifications :
- Familiarity with data modelling and data management concepts.
- Experience with automation frameworks, CI/CD pipelines, and monitoring/observability tools (Prometheus, Grafana, ELK stack).
- Exposure to systems reliability engineering practices such as error budgets, SLAs/SLOs, and chaos engineering.
(ref: hirist.tech)