L2 Observability / AIOps:
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems.
SRE ensures internally critical and externally visible systems have reliability and uptime appropriate to users' needs and a fast rate of improvement while keeping an ever-watchful eye on capacity and performance.
SRE is a mindset and a set of engineering approaches focused on optimizing existing systems, building infrastructure, and eliminating work through automation.
As a Site Reliability Engineer with a focus on observability, you will build and operate next-generation observability platforms.
As an SRE with an observability focus, you will:
- Explore the complex IT estates of our clients to understand their observability / AIOps opportunities and identify areas for improvement.
- Collaborate to architect unified observability and AIOps strategies which employ leading AI technology.
- Implement enterprise observability / AIOps technology and processes.
- Amplify observability / AIOps outcomes by accelerating adoption across technology and business.
Responsibilities include:
- Architecting observability solutions that address gaps and reduce organizational MTTD and MTTR.
- Developing API-driven micro-services that combine into large and complex platforms.
- Planning and executing highly parallel distributed object storage transformations and migrations.
- Maintaining automated test suites using CI / CD tools.
- Participating in collaborative projects with small software engineering teams.
- Developing automation, processes, and tools designed to make our services simpler and more robust.
- Participating in troubleshooting, capacity planning, and performance analysis.
- Advising management on service onboarding strategies and execution.
What we are looking for:
- Entrepreneurs who seek challenging problems to solve.
- Creativity, initiative and acute attention to detail.
- Thirst for innovation and solving problems at lightning speed.
- Passion for automating everything repetitive.
- Obsession with software scalability and performance under high loads.
- Love for using and contributing to open-source software.
Please bring to the table:
- Experience in architecting complex IT solutions.
- Understanding of observability dimensions (metrics, logs, traces).
- Excellent communication and stakeholder management skills.
- Development experience, comfortable working in multiple languages (Python, Java, Go and Ruby a plus).
- Experience working in collaborative coding environments (peer review, continuous integration, etc.).
- 7+ years of application development.
- Experience working in distributed remote teams across multiple time zones.
- Experience in large-scale operations environments.
- 7+ years of experience with Linux / Unix development or systems administration.
- 3+ years of experience with networking systems and technologies.
- Deep understanding of network performance and security.
- Ability to identify tasks which require automation and implement the required automation.
- Experience with configuration management tools such as Puppet, Chef, SaltStack.
- Hands-on operational experience in a high-volume or critical production service environment: distributed systems, capacity planning, continuous deployment.
- BA / BS in Computer Science preferred, or equivalent experience (advanced degrees preferred).
We have opportunities to work with and learn:
- Object Storage: MinIO / S3 / etc.
- Data Collection: OpenTelemetry / Grafana Alloy / etc.
- Message Bus: Kafka / NSQ / etc.
- Scaling Databases: relational database technologies at large scale
- Scheduling & Orchestration
- Cloud Platforms: AWS / Azure
(ref: hirist.tech)