About PlayerZero
PlayerZero is building a self‑healing system for software—automating defect detection, diagnosis, and remediation so developers ship with confidence. Teams use PlayerZero to spot issues before customers do, pinpoint root causes fast, and close the loop from incident to fix.
Our platform includes capabilities like Agentic Debugging and Code Simulations that let engineers reproduce complex scenarios, reason about failures, and validate fixes safely and quickly.
About the role
You’ll be the heartbeat of our production stack, owning everything that happens after the merge button is clicked. From CI pipelines to load balancers, you’ll automate, monitor, and harden the infrastructure that supports billions of events and hundreds of distributed agents. If you love shaving seconds off deploys, sleeping soundly on-call because your graphs are flat, and treating infrastructure as a product—not just plumbing—this role is for you.
What you’ll do
- Design & maintain production environments across AWS/GCP, EC2/managed services, and occasional on-prem GPU nodes.
- Automate end-to-end delivery with hermetic CI/CD pipelines, golden images, and zero-downtime rollouts (blue/green, canary).
- Own observability: metrics, traces, and logs wired into actionable SLOs and pager rules—catch issues before customers do.
- Implement robust networking & security: VPC design, Layer-7 routing, WAFs, IAM least-privilege, and secrets management.
- Plan capacity & optimize costs: forecast growth, right-size compute/storage, and negotiate reserved instances.
- Run game-day drills: chaos testing, failover simulations, and disaster-recovery runbooks with RTO/RPO targets.
- Partner with platform & ML teams to tune data pipelines, search clusters, and GPU workloads for performance and reliability.
You might thrive in this role if you have:
- 3–7+ years of experience operating large-scale production systems (DevOps, SRE, or Infra Engineering).
- Deep experience with Linux systems, networking, and cloud infrastructure (AWS or GCP).
- Proven track record building repeatable CI/CD pipelines (GitHub Actions, Buildkite, Jenkins, or similar).
- Strong grasp of monitoring & incident response—Prometheus/Alertmanager, Datadog, Grafana, or Honeycomb.