Site Reliability Engineer

Core Minds Tech Solutions • Pondiche
30+ days ago

Job Description :

  • Engage with our product teams to understand requirements, then design and implement resilient and scalable infrastructure solutions
  • Operate, monitor, and triage all aspects of our production and non-production environments
  • Collaborate with other engineers on code, infrastructure, design reviews, and process enhancements.
  • Evaluate and integrate new technologies to improve system reliability, security, and performance
  • Develop and implement automation to provision, configure, deploy, and monitor Apple services
  • Participate in an on-call rotation providing hands-on technical expertise during service-impacting events
  • Design, build, and maintain highly available and scalable infrastructure
  • Implement and improve monitoring, alerting, and incident response systems
  • Automate operations tasks and develop efficient workflows
  • Conduct system performance analysis and optimization
  • Collaborate with development teams to ensure smooth deployment and release processes
  • Implement and maintain security best practices and compliance standards
  • Troubleshoot and resolve system and application issues
  • Participate in capacity planning and scaling efforts
  • Stay up-to-date with the latest trends, technologies, and advancements in SRE practices
  • Contribute to capacity planning, scale testing, and disaster recovery exercises.
  • Approach operational problems with a software engineering mindset
  • BS degree in computer science or equivalent field with 5+ years of experience
  • 5+ years in an Infrastructure Ops, Site Reliability Engineering, or DevOps-focused role.
  • Knowledge of Linux operating system principles, networking fundamentals, and systems management.
  • Demonstrable fluency in at least one of the following languages : Java, Python, or Go
  • Experience managing and scaling distributed systems in a public, private, or hybrid cloud environment
  • Develop and implement automation tools and apply best practices for system reliability.
  • Take responsibility for the availability and scalability of our services, and manage disaster recovery and other operational tasks.
  • Collaborate with the development team to improve the application codebase's logging, metrics, and traces for observability.
  • Collaborate with data science teams and other business units to design, build and maintain the infrastructure that runs machine learning and generative AI workloads.
  • Influence architectural decisions with focus on security, scalability and performance.
  • Find and fix problems in production, and work to prevent them from happening again

Preferred Qualifications :

  • Familiarity with microservices architecture and container orchestration with Kubernetes.
  • Awareness of key security principles, including encryption and keys (types and exchange protocols).
  • Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and automation.
  • Strong sense of ownership, with a desire to communicate and collaborate with other engineers and teams.
  • Ability to identify and communicate technical and architectural problems, while working with partners and their team to iteratively find solutions.
  • (ref : hirist.tech)