About PlayerZero
PlayerZero is building a self-healing system for software that automates defect resolution and development. Engineering and support teams use PlayerZero to:
- autonomously debug problems in the software (technical support)
- fix issues directly in the code
- prevent these problems from recurring
PlayerZero is backed by leading investors such as Foundation Capital, WndrCo, and Green Bay Ventures, and by operators including Matei Zaharia, Drew Houston, Dylan Field, and Guillermo Rauch.
We believe that as software development speeds up, engineering and support teams face greater challenges maintaining software for their customers. We see this as an opportunity to reinvent how software is supported.
About the role
You’ll be the heartbeat of our production stack, owning everything that happens after the merge button is clicked. From CI pipelines to load balancers, you’ll automate, monitor, and harden the infrastructure that supports billions of events and hundreds of distributed agents. If you love shaving seconds off deploys, sleeping soundly while on call because your graphs are flat, and treating infrastructure as a product rather than just plumbing, this role is for you.
What you’ll do
- Design & maintain production environments across AWS/GCP, EC2/managed services, and occasional on-prem GPU nodes.
- Automate end-to-end delivery with hermetic CI/CD pipelines, golden images, and zero-downtime rollouts (blue/green, canary).
- Own observability: metrics, traces, and logs wired into actionable SLOs and pager rules—catch issues before customers do.
- Implement robust networking & security: VPC design, Layer-7 routing, WAFs, IAM least-privilege, and secrets management.
- Plan capacity & optimize cost: forecast growth, right-size compute/storage, and negotiate reserved instances.
- Run game-day drills: chaos testing, failover simulations, disaster-recovery runbooks with RTO/RPO targets.
- Partner with platform & ML teams to tune data pipelines, search clusters, and GPU workloads for performance and reliability.
You might thrive in this role if you have:
- 3–7+ years of experience operating large-scale production systems (DevOps, SRE, or Infra Engineering).
- Deep experience with Linux systems, networking, and cloud infrastructure (AWS or GCP).
- Proven track record building repeatable CI/CD pipelines (GitHub Actions, Buildkite, Jenkins, or similar).
- Strong grasp of monitoring & incident response—Prometheus/Alertmanager, Datadog, Grafana, or Honeycomb.