About Clay
Clay is a creative tool for growth. Our mission is to help businesses grow without huge investments in tooling or manual labor. We’re already helping over 100,000 people grow their businesses with Clay. From local pizza shops to enterprises like Anthropic and Notion, our tool lets you instantly turn any idea you have for growing your company into reality.
We believe that modern GTM teams win by finding GTM alpha—a unique competitive edge powered by data, experimentation, and automation. Clay is the platform they use to uncover hidden signals, build custom plays, and launch faster than their competitors. We’re looking for sharp, low-ego people to help teams find their GTM alpha.
Why is Clay the best place to work?
Customers love the product (100K+ users and growing)
We’re growing a lot (6x YoY last year, and 10x YoY the two years before that)
Incredible culture (our customers keep applying to work here)
Well-resourced (raised a Series B expansion in January 2025 from investors like Sequoia and Meritech)
Read more about why people love working at Clay here and explore our wall of love to learn more about the product.
Data Engineering, Search @ Clay
As a Senior Data Engineer on the Search team, you'll be responsible for building and maintaining the data pipelines that power Clay's comprehensive datasets of companies, people, and job postings. You'll tackle fundamental challenges in entity resolution (matching millions of records across datasets that share no common identifiers) while laying the foundation for next-generation natural language search. Our team is scaling from processing millions of records to billions, which calls for new approaches to data quality, validation, and infrastructure. Strong candidates will have experience building production data pipelines at scale and a deep understanding of search infrastructure.
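For a flavor of the core problem (a generic sketch, not Clay's matching logic): when records share no identifier, matching typically falls back to normalized fields and fuzzy similarity. Everything below, from field names to the threshold, is a hypothetical illustration in Python.

```python
# Minimal illustration of matching records that lack a shared identifier.
# All field names and thresholds are hypothetical, not Clay's actual logic.
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class CompanyRecord:
    provider: str
    name: str
    domain: str | None = None


def normalize_name(name: str) -> str:
    """Lowercase and strip common legal suffixes so 'Acme Inc.' ~ 'ACME'."""
    cleaned = name.lower().strip()
    for suffix in (" inc", " inc.", " llc", " ltd", " corp", " co."):
        cleaned = cleaned.removesuffix(suffix)
    return cleaned.strip(" .,")


def match_score(a: CompanyRecord, b: CompanyRecord) -> float:
    """Blend an exact domain match with fuzzy name similarity."""
    if a.domain and b.domain and a.domain.lower() == b.domain.lower():
        return 1.0
    return SequenceMatcher(None, normalize_name(a.name), normalize_name(b.name)).ratio()


def is_same_entity(a: CompanyRecord, b: CompanyRecord, threshold: float = 0.92) -> bool:
    return match_score(a, b) >= threshold


# Two providers describing the same company differently still resolve together.
r1 = CompanyRecord("provider_a", "Acme Inc.", "acme.com")
r2 = CompanyRecord("provider_b", "ACME", "acme.com")
assert is_same_entity(r1, r2)
```

At billions of records, the real work is avoiding pairwise comparison altogether (blocking, candidate generation, and learned matchers), which is exactly the scaling challenge this role owns.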
What You'll Do
Design and implement robust entity resolution systems that match and merge records from multiple providers using advanced matching algorithms, enabling large-scale enrichment of customer data
Build scalable data pipelines that process billions of profiles while maintaining data accuracy through sophisticated validation and quarantine frameworks
Implement modern data architecture patterns that enable point-in-time recovery, analytics at scale, and real-time data quality monitoring
Develop systems to normalize and standardize messy real-world data (like locations, company names, and job titles) across billions of records
Create intelligent data validation systems that prevent bad data from reaching customers while providing feedback loops for continuous improvement (see the validation sketch after this list)
Collaborate with ML engineers to build the data foundation for embedding-based search, enabling users to describe what they're looking for in natural language
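To make the validation bullet above concrete: one common shape for this kind of system is a validate-or-quarantine split. The sketch below uses Pydantic, which appears in the team's stack; the field names and rules are illustrative assumptions, not Clay's schema.

```python
# Hedged sketch of a validate-or-quarantine step using Pydantic.
# Field names and validation rules are assumptions for illustration only.
from pydantic import BaseModel, ValidationError, field_validator


class PersonProfile(BaseModel):
    full_name: str
    company_domain: str
    job_title: str | None = None

    @field_validator("company_domain")
    @classmethod
    def domain_must_look_valid(cls, v: str) -> str:
        if "." not in v or " " in v:
            raise ValueError(f"implausible domain: {v!r}")
        return v.lower()


def validate_batch(raw_records: list[dict]) -> tuple[list[PersonProfile], list[dict]]:
    """Split a batch into records safe to ship and records to quarantine for review."""
    valid, quarantined = [], []
    for raw in raw_records:
        try:
            valid.append(PersonProfile(**raw))
        except ValidationError as exc:
            quarantined.append({"record": raw, "errors": exc.errors()})
    return valid, quarantined


good, bad = validate_batch([
    {"full_name": "Ada Lovelace", "company_domain": "Example.com"},
    {"full_name": "Bad Row", "company_domain": "not a domain"},
])
```

In production, the quarantined records would feed a review and feedback loop rather than sitting in a list, which is where the "continuous improvement" part of the role comes in.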
What You'll Bring
Experience building and maintaining production data pipelines that process millions of records daily
Strong proficiency in Python and SQL, with experience in pipeline orchestration frameworks (Apache Airflow, Prefect, Dagster, or similar)
Hands-on experience with search engines (Elasticsearch, OpenSearch, Solr) including data modeling and indexing strategies
Understanding of entity resolution, record linkage, and deduplication techniques at scale
Experience with both batch and streaming data processing patterns
Familiarity with cloud data platforms (AWS, GCP, or Azure) and their data services
Strong problem-solving skills with the ability to debug complex data issues across distributed systems
Nice To Haves
Experience with workflow orchestration using Dagster or similar modern data orchestration tools
Knowledge of ML approaches to entity resolution and experience with embedding pipelines
Familiarity with Apache Iceberg or similar table formats for data versioning and time travel
Experience with geocoding and location normalization at scale
Background in building data platforms that dramatically scale processing capabilities
Exposure to our current tech stack:
Orchestration: Dagster
Search: OpenSearch
Databases: PostgreSQL (Aurora), Redis
Cloud: AWS (S3, Lambda, ECS)
Languages: Python, TypeScript
Infrastructure as Code: Terraform
Data Validation: Pydantic
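Purely as an illustration of how a couple of those pieces combine, here is a tiny, hypothetical pair of Dagster assets; the normalization and indexing logic is placeholder code, not Clay's pipeline.

```python
# Hypothetical sketch only: two Dagster assets wired together in the style of
# the stack above (Dagster for orchestration). All logic is placeholder.
from dagster import asset


@asset
def normalized_companies() -> list[dict]:
    """Pretend ingest step: normalize raw provider records (placeholder data)."""
    raw = [{"name": "Acme Inc.", "domain": "acme.com"}]
    return [
        {**record, "name": record["name"].lower().removesuffix(" inc.").strip()}
        for record in raw
    ]


@asset
def indexed_companies(normalized_companies: list[dict]) -> int:
    """Downstream asset; a real version might bulk-index into OpenSearch.
    Here it just reports how many records it would index."""
    return len(normalized_companies)
```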