Research-Grade Data Quality
Every record in Canaria's dataset passes through a multi-stage enrichment and validation pipeline. This page documents the coverage benchmarks, accuracy metrics, and known limitations that quant funds, enterprise teams, and academic researchers need to evaluate before using the data.
At a Glance
| Metric | Value |
|---|---|
| Unique postings after deduplication | 1B+ |
| Average dedup rate across all sources | 56% |
| Data acquisition cadence | Near-real-time |
| Enriched fields per record | 100+ |
Enrichment Field Accuracy
Benchmarks are measured on production data. Coverage rates refer to 2023+ data unless otherwise noted. Pre-2020 coverage is lower for the structural reasons covered under Data Vintage and Coverage Notes.
| Field | Coverage | Accuracy | Notes |
|---|---|---|---|
| SOC code (6-digit) | >90% (2023+) | 94% top-5, 73% top-1 | Uses title plus description context. Measured on a held-out gold set across 867 BLS 2018 codes. |
| Predicted salary | >85% non-null (2023+) | 44,000+ cells, 95% CI | Trained on 50M+ salary observations with three-source fusion (posted + reported + BLS OES). Requires valid state, zip code, and SOC. |
| Skills extraction | >80% coverage | F1 75-87% | 40,000+ skills taxonomy. Two-step extraction: dictionary matching followed by NLP relevance filtering. |
| Seniority level | ~75% non-null in delivery | 100% complete | Model always returns a classification (Entry, Mid, Senior, Lead, Manager, Director, Executive). Delivery null rates reflect pre-2020 vintage data. |
| Work mode (remote/hybrid/on-site) | >85% (2022+) | 92-97% | NULL is expected and correct for pre-2020 data. Structured work mode fields did not exist before 2021. |
| Location (city/state) | >95% parse success | City: 85-93%, State: 92-97% | Geocoded to lat/lng. Parse failures are tracked and reported separately. |
| Title normalization | >90% mapped | 83% top-5, 66.5% top-1 | Canonicalizes abbreviations, Roman numerals, and level indicators across 99,938 canonical roles. Confidence scores are included for downstream filtering. |
See the classifier accuracy breakdown for top-1 and top-5 numbers per model, held-out gold set sizes, and class counts.
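As a rough illustration of how the top-1, top-5, and F1 figures in the table can be read, the sketch below computes top-k accuracy over ranked predictions and a set-based F1 for one posting. The function names, the toy gold set, and the per-posting F1 definition are assumptions made for illustration; the page does not specify how Canaria aggregates these metrics.

```python
from typing import Sequence, Set


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   gold_labels: Sequence[str], k: int) -> float:
    """Share of examples whose gold label is among the first k ranked predictions."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_labels:
        raise ValueError("gold set is empty")
    hits = sum(gold in preds[:k]
               for preds, gold in zip(ranked_predictions, gold_labels))
    return hits / len(gold_labels)


def set_f1(predicted: Set[str], gold: Set[str]) -> float:
    """F1 between a predicted and a gold set of skills for a single posting."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy gold set of three postings; the codes are illustrative only.
ranked = [
    ["15-1252", "15-1251", "15-1299"],
    ["29-1141", "29-2061", "31-1131"],
    ["13-2011", "13-2051", "11-3031"],
]
gold = ["15-1252", "29-2061", "41-2031"]

top1 = top_k_accuracy(ranked, gold, k=1)
top3 = top_k_accuracy(ranked, gold, k=3)
skills_f1 = set_f1({"python", "sql", "excel"}, {"python", "sql", "tableau"})
```

A gap between top-1 and top-5, such as the one reported for SOC codes, means the correct code is often among the model's leading candidates even when it is not ranked first.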
Deduplication Methodology
We ingest 1B+ raw postings and apply a three-stage deduplication pipeline: vector similarity matching on job descriptions, near-duplicate detection for company name variants, and graph-based transitive closure to merge job families across sources and reposting cycles. The result is 1B+ semantically unique postings.
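The three stages can be pictured with a toy sketch: cosine similarity on description embeddings, a crude company-name normalizer, and a union-find structure that provides the transitive closure. Everything here is a hypothetical stand-in (the record fields, the suffix list, the pairwise loop); the production pipeline is not documented on this page and would need an approximate nearest-neighbor index rather than all-pairs comparison at this scale.

```python
import math
import re
from itertools import combinations
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def normalize_company(name: str) -> str:
    """Lowercase, strip punctuation, and drop a few common legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    suffixes = {"inc", "llc", "corp", "ltd", "co"}
    return " ".join(t for t in cleaned.split() if t not in suffixes)


class UnionFind:
    """Disjoint-set structure; merging pairs yields the transitive closure."""

    def __init__(self, size: int) -> None:
        self.parent = list(range(size))

    def find(self, i: int) -> int:
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i: int, j: int) -> None:
        root_i, root_j = self.find(i), self.find(j)
        if root_i != root_j:
            self.parent[max(root_i, root_j)] = min(root_i, root_j)


def dedup_clusters(postings: List[dict], threshold: float = 0.90) -> List[List[str]]:
    """Group postings whose company matches and whose descriptions are similar."""
    uf = UnionFind(len(postings))
    for i, j in combinations(range(len(postings)), 2):
        same_company = (normalize_company(postings[i]["company"])
                        == normalize_company(postings[j]["company"]))
        if same_company and cosine(postings[i]["embedding"],
                                   postings[j]["embedding"]) >= threshold:
            uf.union(i, j)
    clusters: Dict[int, List[str]] = {}
    for idx, posting in enumerate(postings):
        clusters.setdefault(uf.find(idx), []).append(posting["id"])
    return list(clusters.values())


# p1~p2 and p2~p3 clear the threshold, p1~p3 does not; closure still merges all three.
postings = [
    {"id": "p1", "company": "Acme Inc.", "embedding": [1.0, 0.0]},
    {"id": "p2", "company": "ACME, Inc", "embedding": [0.94, 0.342]},
    {"id": "p3", "company": "Acme", "embedding": [0.766, 0.643]},
    {"id": "p4", "company": "Globex", "embedding": [1.0, 0.0]},
]
clusters = dedup_clusters(postings)
```

The example shows why the closure step matters: a repost that has drifted from the first version can still join the same job family through an intermediate version.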
Deduplication is configurable per delivery: clients can adjust similarity thresholds (0.90 to 0.95), posting windows (1 to 6 months), and geographic radius (10 to 50 miles). No duplicate job ID appears in any delivery file.
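One way to hold these per-delivery settings on the client side is a small validated config object, sketched below. The class name, field names, and default values are hypothetical; only the allowed ranges come from the text above, and this is not Canaria's delivery API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupConfig:
    """Hypothetical container for the per-delivery dedup settings."""

    similarity_threshold: float = 0.90   # assumed default
    posting_window_months: int = 3       # assumed default
    geo_radius_miles: float = 25.0       # assumed default

    def __post_init__(self) -> None:
        # Ranges below are the ones stated in the documentation.
        if not 0.90 <= self.similarity_threshold <= 0.95:
            raise ValueError("similarity_threshold must be between 0.90 and 0.95")
        if not 1 <= self.posting_window_months <= 6:
            raise ValueError("posting_window_months must be between 1 and 6")
        if not 10 <= self.geo_radius_miles <= 50:
            raise ValueError("geo_radius_miles must be between 10 and 50")


# A stricter setup: only very similar postings, within one month and ten miles.
strict = DedupConfig(similarity_threshold=0.95,
                     posting_window_months=1,
                     geo_radius_miles=10.0)
```

Validating at construction time means an out-of-range request fails before it is sent, rather than after a delivery has been built.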
| Source | Dedup Rate |
|---|---|
| Overall (all sources) | 40-60% |
| ATS portals (single-source) | <2% |
|  | <10% |
| Indeed | <15% |
| Aggregators (e.g., Google Jobs) | 60-70% |
| Duplicate job ID in delivery | 0% (enforced) |
High dedup rates on aggregator sources (60-70%) reflect legitimate cross-posting behavior. ATS portal feeds show near-zero internal duplication (<2%) because they are primary sources.
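Both claims in the table are easy to check on a sample file. The sketch below lists any job ID that appears more than once and computes a dedup rate, assuming the rate is defined as the share of raw postings removed as duplicates; that definition and the function names are assumptions, since the page does not state the formula.

```python
from collections import Counter
from typing import Iterable, List


def duplicate_job_ids(job_ids: Iterable[str]) -> List[str]:
    """Return every job ID that occurs more than once, sorted."""
    counts = Counter(job_ids)
    return sorted(job_id for job_id, n in counts.items() if n > 1)


def dedup_rate(raw_count: int, unique_count: int) -> float:
    """Assumed definition: fraction of raw postings removed as duplicates."""
    if raw_count <= 0 or not 0 <= unique_count <= raw_count:
        raise ValueError("counts must satisfy 0 <= unique_count <= raw_count, raw_count > 0")
    return 1 - unique_count / raw_count


# Toy delivery: a clean file and one with a repeated ID.
clean_ids = ["j-001", "j-002", "j-003"]
dirty_ids = ["j-001", "j-002", "j-001"]

clean_report = duplicate_job_ids(clean_ids)
dirty_report = duplicate_job_ids(dirty_ids)
rate = dedup_rate(raw_count=1000, unique_count=440)
```

Under this definition, 1,000 raw postings reduced to 440 unique ones correspond to a 56% dedup rate.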
Acquisition and Delivery Cadence
Sources are crawled continuously, with near-real-time acquisition across ATS portals and major job boards. ATS portal postings are captured within hours of going live on the employer career page. Customer deliveries are then packaged on a daily, weekly, or monthly cadence to match your pipeline. Snapshots used in the interactive explorer are refreshed weekly. Full-history flat file deliveries are available on request.
Data Vintage and Coverage Notes
Coverage improves significantly for 2022+ data. Many enrichment fields did not exist before 2021-2022:
Work mode (remote / hybrid / on-site)
NULL values in pre-2020 data are expected and correct. Fewer than 1% of postings mentioned remote work before 2020. Structured work mode fields did not emerge until 2021. Coverage reaches 85%+ for 2022+ data.
Stated salary
Salary transparency laws (Colorado 2021, NYC 2022, California and Washington 2023) drove a step-change improvement in stated salary coverage. Pre-2022 null rates of 70-95% are expected, not a data gap. Predicted salary is available for all records with valid state, zip code, and SOC code.
SOC code and skills
Enrichment model accuracy is highest for 2023+ data, where job descriptions are longer and more structured. Pre-2020 SOC coverage is 70-85% and skills coverage is 50-70%, due to shorter descriptions and format differences in older postings.
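Because predicted salary depends on a valid state, zip code, and SOC code, it can help to count how many records in a sample meet that precondition before reading the non-null rate. The check below is a format-only sketch: the field names and patterns are assumptions, and what Canaria treats as "valid" may be stricter.

```python
import re
from typing import Any, Mapping

# Format-only checks; field names and patterns are illustrative assumptions.
STATE_RE = re.compile(r"[A-Z]{2}")        # two-letter postal abbreviation
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")    # ZIP or ZIP+4
SOC_RE = re.compile(r"\d{2}-\d{4}")       # 6-digit SOC written as NN-NNNN


def has_salary_inputs(record: Mapping[str, Any]) -> bool:
    """True when state, zip code, and SOC code are all present and well-formed."""
    state = str(record.get("state") or "")
    zip_code = str(record.get("zip_code") or "")
    soc = str(record.get("soc_code") or "")
    return bool(
        STATE_RE.fullmatch(state)
        and ZIP_RE.fullmatch(zip_code)
        and SOC_RE.fullmatch(soc)
    )


complete = {"state": "CO", "zip_code": "80202", "soc_code": "15-1252"}
missing_soc = {"state": "CO", "zip_code": "80202", "soc_code": None}
```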
Researcher note: Null rates for pre-2020 data are expected and correct. Comparing null rates across vintages without accounting for structural breaks (remote work adoption, salary transparency laws) will produce misleading conclusions.
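A simple way to respect these structural breaks is to report null rates per vintage bucket instead of one pooled figure. The sketch below assumes each record carries a `posted_year` field and uses bucket boundaries at 2020 and 2022; both the field name and the boundaries are illustrative choices, not part of the delivery schema.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def vintage_bucket(year: int) -> str:
    """Assumed buckets aligned with the breaks discussed above."""
    if year < 2020:
        return "pre-2020"
    if year < 2022:
        return "2020-2021"
    return "2022+"


def null_rate_by_vintage(records: Iterable[Mapping[str, Any]],
                         field: str) -> Dict[str, float]:
    """Fraction of records with a missing value for `field`, per vintage bucket."""
    totals: Dict[str, int] = defaultdict(int)
    nulls: Dict[str, int] = defaultdict(int)
    for record in records:
        bucket = vintage_bucket(record["posted_year"])
        totals[bucket] += 1
        if record.get(field) is None:
            nulls[bucket] += 1
    return {bucket: nulls[bucket] / totals[bucket] for bucket in totals}


# Toy sample: work mode is absent in old postings and mostly present in recent ones.
sample = [
    {"posted_year": 2018, "work_mode": None},
    {"posted_year": 2019, "work_mode": None},
    {"posted_year": 2023, "work_mode": "remote"},
    {"posted_year": 2023, "work_mode": "on-site"},
    {"posted_year": 2024, "work_mode": None},
    {"posted_year": 2024, "work_mode": "hybrid"},
]
rates = null_rate_by_vintage(sample, "work_mode")
```

Pooling this toy sample would give a 50% null rate, which describes neither the pre-2020 postings (100% null) nor the 2022+ postings (25% null).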
Evaluate quality firsthand
Request a sample dataset to run your own validation before committing. We provide sample files with full enrichment fields so you can test against your own benchmarks.