We're seeking an experienced DevOps / Site Reliability Engineer to join our team and take ownership of testing, deployment, and infrastructure operations for Octopos, our multi-platform point-of-sale SaaS solution. You'll be responsible for building robust CI/CD pipelines, managing our database infrastructure, and ensuring high availability for our retail customers, who depend on us 24/7. This is a fully remote position.
CI/CD & Deployment Pipeline
Design and implement comprehensive CI/CD pipelines for our diverse tech stack (React, Laravel, Node.js, React Native)
Manage multi-platform deployments including web, Android (Capacitor), and Windows (Electron)
Manage Google Play Store releases including APK/AAB uploads, versioning, and staged rollouts
Handle App Store submissions and TestFlight distributions
Create and maintain staging environments that accurately mirror production
Implement automated testing strategies across all applications
Establish deployment rollback procedures and blue-green deployment strategies
Infrastructure & Monitoring
Implement and maintain comprehensive monitoring using Grafana dashboards and alerting
Set up centralized logging infrastructure (ELK stack or similar) for all applications
Monitor and maintain production servers to ensure 99.9% uptime for POS operations
Design custom metrics and KPIs specific to POS operations (transaction success rates, hardware connectivity)
Manage incident response and on-call rotations
Optimize application performance and resource utilization
Ensure infrastructure security and meet PCI compliance requirements
Database Management
Design and implement a multi-node MySQL cluster for high availability
Create and manage automated backup strategies with point-in-time recovery
Monitor database performance and implement optimization strategies
Plan and execute database migrations with zero downtime
Implement disaster recovery procedures
Testing & Quality Assurance
Build automated testing frameworks for React, Laravel, and Node.js applications
Implement E2E testing for critical POS workflows including payment processing
Create testing strategies for hardware integration (payment terminals, printers, scanners)
Establish code quality gates and coverage requirements
Documentation & Knowledge Transfer
Create and maintain comprehensive documentation for all infrastructure, deployment processes, and runbooks
Develop disaster recovery playbooks and incident response procedures
Document monitoring alerts, thresholds, and escalation procedures
Maintain architectural diagrams and system dependency documentation
Create video tutorials and guides for common operational tasks
Required Qualifications
Technical Skills
3+ years of DevOps/SRE experience with production systems
Strong experience with CI/CD tools (GitHub Actions, GitLab CI, Jenkins)
Hands-on experience with Grafana, Prometheus, and alerting systems
Experience with centralized logging solutions (ELK, Splunk, or similar)
Proficiency in containerization (Docker) and orchestration (Kubernetes/Docker Compose)
Expertise in MySQL administration including replication and clustering
Experience with Infrastructure as Code (Terraform, Ansible, or similar)
Solid understanding of Linux system administration
Proficiency in scripting (Bash, Python, or similar)
Application-Specific Experience
Experience deploying React/Node.js applications at scale
Familiarity with Laravel deployment and optimization
Experience managing mobile app releases and versioning strategies
Understanding of Electron app packaging and distribution
Knowledge of WebSocket implementations and real-time systems
Soft Skills
Excellent technical writing and documentation skills
Experience training and mentoring junior engineers
Strong communication skills for cross-functional collaboration
Ability to explain complex technical concepts to non-technical stakeholders
Work Schedule & On-Call Requirements
Core Hours
Must be available during US Pacific Time business hours (9 AM - 5 PM PST/PDT)
This is a fully remote position
On-Call Responsibilities
As our POS platform serves retail businesses operating 7 days a week, this role includes participation in an on-call rotation to ensure 24/7 system reliability.
On-Call Structure:
Participate in a rotating on-call schedule
Response time: 15-minute acknowledgment, 30-minute engagement during on-call periods
Average incident volume: 1 incident every 2 months
Severity-based response (P1: immediate, P2: 30 minutes, P3: next business day)
On-Call Compensation:
Standby Pay: Additional compensation for on-call availability (paid whether or not incidents occur)
Incident Response Pay: 1.5x hourly rate for incident response during nights/weekends
Compensatory Time: Time off provided after significant weekend incidents
Company-provided phone and laptop dedicated for on-call use
Post-incident review process to minimize repeat issues and alert fatigue
Support Structure:
Comprehensive runbooks and automated remediation for common issues
Clear escalation procedures to senior leadership and vendor support
Robust monitoring to minimize false positives
Regular rotation reviews to ensure fair distribution
What We Offer
Opportunity to architect infrastructure for a growing SaaS platform
Work with a diverse, modern technology stack
Direct impact on system reliability affecting thousands of daily transactions
Competitive on-call compensation package
Professional development budget for certifications and training
12 LPA+ salary
Requirements
Strong written and verbal communication skills
Demonstrated experience in creating technical documentation
Ability to work during US Pacific Time business hours
Willingness to participate in a compensated on-call rotation
Self-motivated with excellent troubleshooting skills
Experience working in fast-paced, agile environments
Commitment to knowledge sharing and team development
Engineer • Borivali, Maharashtra, India