Canaria

Research-Grade Data Quality

Every record in Canaria's dataset passes through a multi-stage enrichment and validation pipeline. This page documents the coverage benchmarks, accuracy metrics, and known limitations that quant funds, enterprise teams, and academic researchers need to evaluate before using the data.

At a Glance

  • 1B+ unique postings after deduplication

  • 56% average dedup rate across all sources

  • Near-real-time data acquisition cadence

  • 100+ enriched fields per record

Enrichment Field Accuracy

Benchmarks are measured on production data. Coverage rates are reported for 2023+ data unless otherwise noted. Pre-2020 coverage is lower by design.

| Field | Coverage | Accuracy | Notes |
| --- | --- | --- | --- |
| SOC code (6-digit) | >90% (2023+) | 94% top-5, 73% top-1 | Uses title plus description context. Measured on a held-out gold set across 867 BLS 2018 codes. |
| Predicted salary | >85% non-null (2023+) | 44,000+ cells, 95% CI | Trained on 50M+ salary observations with three-source fusion (posted + reported + BLS OES). Requires valid state, zip code, and SOC. |
| Skills extraction | >80% coverage | F1 75-87% | 40,000+ skills taxonomy. Two-step extraction: dictionary matching followed by NLP relevance filtering. |
| Seniority level | ~75% non-null in delivery | 100% complete | Model always returns a classification (Entry, Mid, Senior, Lead, Manager, Director, Executive). Delivery null rates reflect pre-2020 vintage data. |
| Work mode (remote/hybrid/on-site) | >85% (2022+) | 92-97% | NULL is expected and correct for pre-2020 data. Structured work mode fields did not exist before 2021. |
| Location (city/state) | >95% parse success | City: 85-93%, State: 92-97% | Geocoded to lat/lng. Parse failures are tracked and reported separately. |
| Title normalization | >90% mapped (coverage) | 83% top-5, 66.5% top-1 | Canonicalizes abbreviations, Roman numerals, and level indicators across 99,938 canonical roles. Confidence scores included for downstream filtering. |

See the classifier accuracy breakdown for top-1 and top-5 numbers per model, held-out gold set sizes, and class counts.
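
For readers reproducing the top-1 and top-5 figures on their own labeled sample, here is a minimal sketch of the usual top-k accuracy calculation. It assumes each record carries a ranked list of predicted codes. The function name, the sample SOC codes, and the three-record gold set are illustrative assumptions, not Canaria's evaluation code.

```python
from typing import Sequence


def top_k_accuracy(
    gold: Sequence[str],
    ranked_predictions: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Share of records whose gold label appears among the first k predictions."""
    if len(gold) != len(ranked_predictions):
        raise ValueError("gold and ranked_predictions must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty gold set")
    hits = sum(
        1 for label, preds in zip(gold, ranked_predictions) if label in preds[:k]
    )
    return hits / len(gold)


# Hypothetical gold set: three postings with ranked SOC predictions.
gold = ["15-1252", "29-1141", "11-1021"]
ranked = [
    ["15-1252", "15-1251", "15-1299"],  # gold label at rank 1
    ["29-1171", "29-1141", "29-2061"],  # gold label at rank 2
    ["13-1111", "11-9199", "11-3031"],  # gold label not in the list
]

top1 = top_k_accuracy(gold, ranked, k=1)
top3 = top_k_accuracy(gold, ranked, k=3)
```

Top-k accuracy can only rise as k grows, which is why the top-5 figures sit above the top-1 figures for both SOC coding and title normalization.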

Deduplication Methodology

We ingest 1B+ raw postings and apply a three-stage deduplication pipeline: vector similarity matching on job descriptions, near-duplicate detection for company name variants, and graph-based transitive closure to merge job families across sources and reposting cycles. The result is 1B+ semantically unique postings.
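
The interaction between similarity matching and transitive closure can be sketched as follows. This is a toy illustration, not the production pipeline: it assumes postings already have embedding vectors, compares all pairs by brute force (a real system at this scale would presumably use an approximate nearest-neighbor index), and uses a union-find structure for the closure step. The function names, the 0.92 default threshold, and the sample vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find(parent, node):
    """Return the cluster root of node, compressing the path as we go."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def cluster_postings(embeddings, threshold=0.92):
    """Group posting IDs whose embeddings are linked by a chain of similar pairs."""
    ids = list(embeddings)
    parent = {i: i for i in ids}
    for a, b in combinations(ids, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            root_a, root_b = find(parent, a), find(parent, b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters
    clusters = {}
    for i in ids:
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


def unit(degrees):
    """2-D unit vector at the given angle, used as a stand-in embedding."""
    return (math.cos(math.radians(degrees)), math.sin(math.radians(degrees)))


# A~B and B~C exceed the threshold, A~C does not; closure still merges all three.
embeddings = {"A": unit(0), "B": unit(20), "C": unit(40), "D": unit(90)}
clusters = cluster_postings(embeddings)
```

In this example A and C are not similar enough to match directly, but both match B, so the closure step places all three in one cluster while D stays on its own.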

Deduplication is configurable per delivery: clients can adjust similarity thresholds (0.90 to 0.95), posting windows (1 to 6 months), and geographic radius (10 to 50 miles). No duplicate job ID appears in any delivery file.
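
As an illustration of how these per-delivery settings might be represented and range-checked on the client side, consider the sketch below. The class name, field names, and default values are assumptions made for the example. Only the allowed ranges come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupConfig:
    """Hypothetical per-delivery dedup settings; defaults are illustrative."""

    similarity_threshold: float = 0.92
    posting_window_months: int = 3
    geo_radius_miles: float = 25.0

    def __post_init__(self):
        # Ranges follow the documented limits: 0.90-0.95, 1-6 months, 10-50 miles.
        if not 0.90 <= self.similarity_threshold <= 0.95:
            raise ValueError("similarity_threshold must be between 0.90 and 0.95")
        if not 1 <= self.posting_window_months <= 6:
            raise ValueError("posting_window_months must be between 1 and 6")
        if not 10 <= self.geo_radius_miles <= 50:
            raise ValueError("geo_radius_miles must be between 10 and 50")


# The strictest similarity threshold with the widest window and radius.
config = DedupConfig(
    similarity_threshold=0.95,
    posting_window_months=6,
    geo_radius_miles=50,
)
```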

| Source | Dedup Rate |
| --- | --- |
| Overall (all sources) | 40-60% |
| ATS portals (single-source) | <2% |
| LinkedIn | <10% |
| Indeed | <15% |
| Aggregators (e.g., Google Jobs) | 60-70% |
| Duplicate job ID in delivery | 0% (enforced) |

High dedup rates on aggregator sources (60-70%) reflect legitimate cross-posting behavior. ATS portal feeds show near-zero internal duplication (<2%) because they are primary sources.
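
Two checks a client might run on a delivery are sketched below. The sketch assumes the dedup rate is defined as the share of raw postings removed (one minus unique over raw), which this page does not spell out, and the function names and sample values are hypothetical.

```python
from collections import Counter


def dedup_rate(raw_count: int, unique_count: int) -> float:
    """Share of raw postings removed as duplicates (assumed definition)."""
    if raw_count <= 0:
        raise ValueError("raw_count must be positive")
    if not 0 <= unique_count <= raw_count:
        raise ValueError("unique_count must be between 0 and raw_count")
    return 1 - unique_count / raw_count


def duplicate_job_ids(job_ids):
    """Return the job IDs that appear more than once, sorted."""
    counts = Counter(job_ids)
    return sorted(job_id for job_id, n in counts.items() if n > 1)


# Hypothetical aggregator feed: 1,000 raw postings reduce to 350 unique ones.
aggregator_rate = dedup_rate(raw_count=1000, unique_count=350)

# A delivery that honors the 0% guarantee yields an empty list here.
repeats = duplicate_job_ids(["j-001", "j-002", "j-001", "j-003"])
```

Under that definition the sample feed has a 65% dedup rate, inside the 60-70% aggregator range, and the second check flags `j-001` as a repeated ID.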

Acquisition and Delivery Cadence

Sources are crawled continuously, so data acquisition across ATS portals and major job boards is near-real-time. ATS portal postings are captured within hours of going live on the employer career page. Customer deliveries are then packaged on a daily, weekly, or monthly cadence to match your pipeline. Snapshots used in the interactive explorer are refreshed weekly. Full-history flat file deliveries are available on request.

Data Vintage and Coverage Notes

Coverage improves significantly for 2022+ data. Many enrichment fields did not exist before 2021-2022:

  • Work mode (remote / hybrid / on-site)

    NULL values in pre-2020 data are expected and correct. Fewer than 1% of postings mentioned remote work before 2020. Structured work mode fields did not emerge until 2021. Coverage reaches 85%+ for 2022+ data.

  • Stated salary

    Salary transparency laws (Colorado 2021, NYC 2022, California and Washington 2023) drove a step-change improvement in stated salary coverage. Pre-2022 null rates of 70-95% are expected, not a data gap. Predicted salary is available for all records with valid state, zip code, and SOC code.

  • SOC code and skills

    Enrichment model accuracy is highest for 2023+ data where job descriptions are longer and more structured. Pre-2020 SOC coverage is 70-85% and skills coverage 50-70% due to shorter descriptions and format differences in older postings.

Researcher note: Null rates for pre-2020 data are expected and correct. Comparing null rates across vintages without accounting for structural breaks (remote work adoption, salary transparency laws) will produce misleading conclusions.
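
One way to act on this note is to compute null rates per posting year rather than pooling all vintages, so that structural breaks show up as steps instead of being averaged away. The sketch below assumes records are dictionaries with a `posted_year` key and a nullable `work_mode` key. Both field names and the sample records are hypothetical and may not match the delivery schema.

```python
from collections import defaultdict


def null_rate_by_year(records, field):
    """Fraction of records per posting year where the field is missing or None."""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for record in records:
        year = record["posted_year"]
        totals[year] += 1
        if record.get(field) is None:
            nulls[year] += 1
    return {year: nulls[year] / totals[year] for year in sorted(totals)}


# Hypothetical sample: work mode is absent in 2019 and mostly present in 2023.
records = [
    {"posted_year": 2019, "work_mode": None},
    {"posted_year": 2019, "work_mode": None},
    {"posted_year": 2023, "work_mode": "remote"},
    {"posted_year": 2023, "work_mode": None},
]

rates = null_rate_by_year(records, "work_mode")
```

Pooling these four records would give a single 75% null rate, which hides the difference between the 2019 and 2023 vintages that the per-year view makes explicit.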

Evaluate quality firsthand

Request a sample dataset to run your own validation before committing. We provide sample files with full enrichment fields so you can test against your own benchmarks.