Overview
Objective: This position will assist in performing implementation, operation, monitoring, recovery, and performance tuning for infrastructure and application services at symplr.
The SRE team augments the symplr Development, IT and DevOps teams by focusing on operating production systems using a software engineering approach.
SRE goals include improving system performance, increasing operational observability, enhancing system stability, and reducing time for software delivery.
Duties & Responsibilities
- Be a champion for department initiatives and values by ensuring all actions promote the department’s mission statement.
- Participate in product release cycles by working closely with Engineering Managers, Architects, and Developers.
- Work with other Cloud Infrastructure Engineers and developers to ensure maximum performance, reliability, and automation of our deployments and infrastructure.
- Work with, consult and influence developers on new features and software architecture to ensure scalability.
- Work towards automating the product deployment to various environments by integrating with continuous integration (CI) and continuous delivery (CD) tools, monitoring, and change management practices.
- Create and maintain standard operating procedures (SOPs) for performing maintenance tasks, applying configuration changes, and remediating problems in the environment.
- Identify single points of failure and other high-risk architecture issues, and propose resilient solutions that mitigate the risk and improve system reliability.
- Identify opportunities for automation to reduce operational workload; build scripts and introduce new tools and practices as needed.
- Implement monitoring, alerting, notification, and metrics collection for infrastructure and application performance, system uptime, and error rate.
- Monitor and continually improve the capacity and reliability of our production environment infrastructure.
- Investigate and fix performance and scalability bottlenecks, proactively identify issues and create work items to improve stability and performance.
- Respond to alerts from production systems, and identify and resolve root causes in a timely fashion.
Skills Required
- 2-5 years of experience with any public cloud provider such as Microsoft Azure, Amazon Web Services (AWS), or Google Compute Engine (GCE)
- Solid understanding of standard TCP/IP networking, load balancing, and common protocols like DNS and HTTPS
- Monitoring and logging: experience with any application monitoring and logging tools (e.g. Datadog, New Relic, AppDynamics, Application Insights, ELK, Prometheus)
- Good understanding of web servers and databases
- Good understanding of Docker and Kubernetes
- Good scripting knowledge and understanding of software life cycle models
- Good understanding of DevOps practices
- (Optional) Experience working on high-traffic, highly scalable systems
- Knowledge of the fundamental aspects of release automation (packaging, dependencies, promotion, deployment, compliance)
- Excellent time management, resource organization, and priority establishment skills, and ability to multi-task in a fast-paced environment
- Ability to work quickly and efficiently with minimal supervision
- Excellent written and verbal communication skills
Qualifications
Have HEART. To work here, you must be:
- Humble – self-aware and respectful
- Effective – measurably move the needle and immeasurably add value
- Adaptable – innately curious and constantly changing
- Remarkable – stand out in some way
- Transparent – openly and honestly sharing knowledge
1-3 years of experience in any domain of Systems, DevOps, or SRE engineering, in the following areas:
- Cloud platforms (Azure, AWS)
- Windows and Linux servers
- Application monitoring tools (Datadog, New Relic, AppDynamics, Application Insights)
- Any scripting language (PowerShell, Bash, or Python)
- Any CI/CD tools (Azure Pipelines, Jenkins, Octopus, etc.)
- Any infrastructure management tools (Terraform, Ansible, etc.)
- Any application hosting (IIS, Apache, Tomcat, K8s)
Bachelor’s degree or equivalent experience