We're seeking an experienced DevOps / Site Reliability Engineer to join our team and take ownership of testing, deployment, and infrastructure operations for Octopos, our multi-platform point-of-sale SaaS solution. You'll be responsible for building robust CI/CD pipelines, managing our database infrastructure, and ensuring high availability for our retail customers who depend on us 24/7. This is a fully remote position.
CI/CD & Deployment Pipeline
- Design and implement comprehensive CI/CD pipelines for our diverse tech stack (React, Laravel, Node.js, React Native)
- Manage multi-platform deployments including web, Android (Capacitor), and Windows (Electron)
- Manage Google Play Store releases including APK/AAB uploads, versioning, and staged rollouts
- Handle App Store submissions and TestFlight distributions
- Create and maintain staging environments that accurately mirror production
- Implement automated testing strategies across all applications
- Establish deployment rollback procedures and blue-green deployment strategies
Infrastructure & Monitoring
- Implement and maintain comprehensive monitoring using Grafana dashboards and alerting
- Set up centralized logging infrastructure (ELK stack or similar) for all applications
- Monitor and maintain production servers, ensuring 99.9% uptime for POS operations
- Design custom metrics and KPIs specific to POS operations (transaction success rates, hardware connectivity)
- Manage incident response and on-call rotations
- Optimize application performance and resource utilization
- Ensure infrastructure security and PCI compliance requirements
Database Management
- Design and implement a multi-node MySQL cluster for high availability
- Create and manage automated backup strategies with point-in-time recovery
- Monitor database performance and implement optimization strategies
- Plan and execute database migrations with zero downtime
- Implement disaster recovery procedures
Testing & Quality Assurance
- Build automated testing frameworks for React, Laravel, and Node.js applications
- Implement E2E testing for critical POS workflows including payment processing
- Create testing strategies for hardware integration (payment terminals, printers, scanners)
- Establish code quality gates and coverage requirements
Documentation & Knowledge Transfer
- Create and maintain comprehensive documentation for all infrastructure, deployment processes, and runbooks
- Develop disaster recovery playbooks and incident response procedures
- Document monitoring alerts, thresholds, and escalation procedures
- Maintain architectural diagrams and system dependencies documentation
- Create video tutorials and guides for common operational tasks
Required Qualifications
Technical Skills
- 3+ years of DevOps/SRE experience with production systems
- Strong experience with CI/CD tools (GitHub Actions, GitLab CI, Jenkins)
- Hands-on experience with Grafana, Prometheus, and alerting systems
- Experience with centralized logging solutions (ELK, Splunk, or similar)
- Proficiency in containerization (Docker) and orchestration (Kubernetes / Docker Compose)
- Expertise in MySQL administration including replication and clustering
- Experience with Infrastructure as Code (Terraform, Ansible, or similar)
- Solid understanding of Linux system administration
- Proficiency in scripting (Bash, Python, or similar)
Application-Specific Experience
- Experience deploying React/Node.js applications at scale
- Familiarity with Laravel deployment and optimization
- Experience managing mobile app releases and versioning strategies
- Understanding of Electron app packaging and distribution
- Knowledge of WebSocket implementations and real-time systems
Soft Skills
- Excellent technical writing and documentation skills
- Experience training and mentoring junior engineers
- Strong communication skills for cross-functional collaboration
- Ability to explain complex technical concepts to non-technical stakeholders
Work Schedule & On-Call Requirements
Core Hours
- Must be available during US Pacific Time business hours (9 AM - 5 PM PST/PDT)
- This is a fully remote position
On-Call Responsibilities
As our POS platform serves retail businesses operating 7 days a week, this role includes participation in an on-call rotation to ensure 24/7 system reliability.
On-Call Structure:
- Participate in a rotating on-call schedule
- Response time: 15-minute acknowledgment, 30-minute engagement during on-call periods
- Average incident volume: 1 incident every 2 months
- Severity-based response (P1: immediate, P2: 30 minutes, P3: next business day)
On-Call Compensation:
- Standby Pay: Additional compensation for on-call availability (paid whether or not incidents occur)
- Incident Response Pay: 1.5x hourly rate for incident response during nights/weekends
- Compensatory Time: Time off provided after significant weekend incidents
- Company-provided phone and laptop dedicated for on-call use
- Post-incident review process to minimize repeat issues and alert fatigue
Support Structure:
- Comprehensive runbooks and automated remediation for common issues
- Clear escalation procedures to senior leadership and vendor support
- Robust monitoring to minimize false positives
- Regular rotation reviews to ensure fair distribution
What We Offer
- Opportunity to architect infrastructure for a growing SaaS platform
- Work with a diverse, modern technology stack
- Direct impact on system reliability affecting thousands of daily transactions
- Competitive on-call compensation package
- Professional development budget for certifications and training
- 12 LPA plus salary
Requirements
- Strong written and verbal communication skills
- Demonstrated experience in creating technical documentation
- Ability to work during US Pacific Time business hours
- Willingness to participate in compensated on-call rotation
- Self-motivated with excellent troubleshooting skills
- Experience working in fast-paced, agile environments
- Commitment to knowledge sharing and team development