About the Role:
We’re looking for a CloudOps Engineer to join our fast-growing CloudOps team focused on Developer Experience, SRE, and FinOps. In this role, you’ll be responsible for the reliability, performance, and observability of CloudZero’s infrastructure — empowering engineering teams to ship features that help customers understand and optimize their cloud spend.
CloudZero processes billions of events daily across AWS, Azure, and GCP. Our customers rely on real-time, accurate cost data to make business-critical decisions — and any instability in our system impacts their planning. Built entirely on a unique serverless architecture (no EC2s or containers), our platform demands infrastructure that scales gracefully, fails predictably, and recovers automatically.
The problems are interesting: handling massive data volumes efficiently, ensuring sub-second query performance across terabytes of data, and scaling systems to support customers spending millions monthly — all in a modern, event-driven environment.
You Will:
Infrastructure as Code everything. Design and maintain Pulumi modules that provision reliable, cost-efficient cloud resources. No clicking through consoles.
Build observability into everything. Instrument systems so that failures surface quickly and debugging happens with data, not guesswork. You'll know about problems before customers do.
Automate the boring stuff. Deployments, scaling, backups, limit changes: if humans are doing it repeatedly, you'll build systems that do it instead.
Partner with product engineering. Help teams design resilient services, review architectures for operational complexity, and build deployment pipelines that enable safe and fast shipping.
Optimize for cost and performance. CloudZero's business is helping others optimize cloud costs. We should be exemplars of efficient cloud usage ourselves.
Requirements:
3–5+ years of experience building and operating distributed systems in AWS
Strong skills in Python, Infrastructure as Code (e.g., Pulumi or Terraform), and Kubernetes
Hands-on experience with monitoring tools such as Prometheus or Datadog
Proven ability to debug production issues under pressure
Values thoughtful, reliable system design over reactive “hero” efforts
Balances automation intelligently — builds solutions to real problems, not automation for its own sake
Able to clearly explain complex technical issues to non-technical stakeholders
Strong documentation habits to support long-term team clarity and system stability
Excited to take ownership of infrastructure and solve operational challenges at scale
Please note: CloudZero is unable to sponsor employment visas or provide immigration-related support now or in the future. All candidates must have current, unrestricted, and permanent authorization to work in the United States.