What You'll Work On
- Design and develop a next-generation scalable observability platform for modern cloud-native and hybrid infrastructures that works in tandem with AI agents.
- Create intelligent AI agents to analyze logs, traces, and metrics in real time, delivering automated insights and remediation.
- Build scalable and fault-tolerant AI agent frameworks.
- Engineer and optimize large-scale analytics pipelines to process high-velocity telemetry data.
- Build resilient distributed systems with high reliability, performance, and fault tolerance.
- Implement and fine-tune LLMs for natural language querying and automated troubleshooting.
- Partner with ML engineers to streamline AI model deployment and management.
What We're Looking For
- Strong programming skills in Python and Golang (experience with Rust is a plus)
- Track record of building distributed systems and large-scale analytics pipelines
- Hands-on experience with cloud infrastructure (AWS, GCP, or Azure) and Kubernetes
- Deep understanding of observability technologies (Prometheus, OpenTelemetry, Grafana, Elastic, etc.)
- Knowledge of LLMs, AI agents, and agent frameworks like LangChain and AutoGen is a plus
- Experience with stream processing and real-time data processing frameworks
- Proficiency in database technologies (SQL & NoSQL, Time-Series DBs)
- 5+ years of relevant experience
- Bachelor's degree in Computer Science, Engineering, or related field (Master's / PhD is a plus)
Skills Required
Docker, Terraform, Networking, Linux, Kubernetes, Python, AWS, Microservices, Monitoring, Rust, GCP, Ws