Blog
Data insights, methodology deep dives, and industry analysis.
These posts are written by the same engineers who build the enrichment pipeline. Most go deeper than a product blog normally would: real numbers from production data, the trade-offs we considered, and the failure modes we hit along the way.
Topics cluster around three themes. Pipeline engineering covers how raw postings get cleaned, deduplicated, and merged into canonical records. Enrichment models covers the NLP work behind SOC classification, salary prediction, and skills extraction. Market analysis covers what the resulting data tells us about remote work, pay transparency, and hiring trends.
For accuracy benchmarks behind every number cited in these posts, see the data quality page. For the full pipeline write-up, see the methodology.
ATS vs Job Boards: Data Quality Across 200K Sources
Pipeline data comparing ATS feeds to job boards across 907M records: Indeed's dedup rate hits 89%, while ATS feeds stay under 2%.
How We Classify 900M+ Jobs: Inside Our NLP Enrichment System
Six ML models enrich 907M job records with SOC codes, skills from 37K+ terms, salary predictions under 15% MAPE, and more.
Parsing 400M Job Locations from Messy Hiring Data
907 million job records parsed to 95.1% state coverage and 92.5% city coverage. Here is how location parsing works at scale.
Job Data Deduplication: Why Hash Matching Fails at Scale
4.47 billion raw observations become 907 million unique jobs. A 79.7% dedup rate requires more than hashing.
Remote Work Data: What 907M Job Postings Actually Show
Remote postings fell from 3.5% to 2.3%, then rebounded, while hybrid doubled to 3.9M, across 907M job records from 2022 to 2025.
37,000 Skills: Extracting Signal from Job Description Noise
How a two-stage pipeline extracts 37,000+ skills from 907M job records with 84.6% coverage using high-speed dictionary matching and NLP filtering.
Building a Salary Prediction Model on 50M Observations
How we built a salary prediction model with MAPE under 15% using 50M observations, and why returning -1 beats a bad guess.
Salary Transparency Laws and Job Market Data Quality
Analysis of salary disclosure rates across 865M job observations shows transparency-law states reach 24.6% coverage vs 21.4% without mandates.
Anatomy of 400 Million Job Postings: A Dataset Guide
A field-by-field breakdown of 907M enriched job records across 22 sources, with 82 fields and real coverage rates by source and vintage.
Why Raw Job Data Is Worthless Without Enrichment
Raw job postings give you 5-7 fields. Enrichment produces 85+. Real coverage numbers from 907M records show a 3.8x seniority multiplier.