How many unique job postings does Canaria have after deduplication?

Canaria ingests 1B+ raw postings and applies a multi-stage semantic deduplication pipeline (vector similarity, near-duplicate detection for company-name variants, and graph-based transitive closure), delivering 1B+ semantically unique postings.

What is the average deduplication rate across sources?

The average deduplication rate across all sources is about 56%. Single-source ATS portal feeds show near-zero internal duplication (<2%), while aggregator sources such as Google Jobs run 60-70% because of legitimate cross-posting.

How accurate is Canaria's SOC occupation classification?

SOC codes are assigned using the job title plus description context, reaching 94% top-5 and 73% top-1 accuracy, measured on a held-out gold set across 867 BLS 2018 codes. Coverage is above 90% for 2023+ data.

Why do older postings have lower field coverage?

Coverage improves significantly for 2022+ data because many enrichment fields did not exist earlier. NULL values in pre-2020 work-mode data are expected and correct, and stated-salary coverage rose in step-changes as US pay-transparency laws took effect (Colorado 2021, NYC 2022, California and Washington 2023).

How often is the data updated?

Sources are crawled continuously with near-real-time acquisition; ATS portal postings are captured within hours of going live. Customer deliveries are packaged daily, weekly, or monthly, and the interactive explorer snapshots are refreshed weekly.

Research-Grade Data Quality

Every record in Canaria's dataset passes through a multi-stage enrichment and validation pipeline. This page documents the coverage benchmarks, accuracy metrics, and known limitations that quant funds, enterprise teams, and academic researchers need to evaluate before using the data.

At a Glance

1B+: Unique postings after deduplication
56%: Average dedup rate across all sources
Near-real-time: Data acquisition cadence
100+: Enriched fields per record

Enrichment Field Accuracy

Benchmarks are measured on production data. Coverage rates are reported for 2023+ data unless otherwise noted. Pre-2020 coverage is lower by design.

Field	Coverage	Accuracy	Notes
SOC code (6-digit)	>90% (2023+)	94% top-5, 73% top-1	Uses title plus description context. Measured on a held-out gold set across 867 BLS 2018 codes.
Predicted salary	>85% non-null (2023+)	44,000+ cells, 95% CI	Trained on 50M+ salary observations with three-source fusion (posted + reported + BLS OES). Requires valid state, zip code, and SOC.
Skills extraction	>80% coverage	F1 75-87%	40,000+ skills taxonomy. Two-step extraction: a high-recall dictionary scan, then a title-aware relevance filter that removes off-role mentions. Tuned conservatively to keep legitimate skills: an independent multi-judge audit confirms about 99% of removed tags are false positives, with fewer than 1 in 500 legitimate skills affected. The filter reduces the average from roughly 10 to roughly 8 skill tags per posting.
Seniority level	~75% non-null in delivery	100% complete	Model always returns a classification (Entry, Mid, Senior, Lead, Manager, Director, Executive). Delivery null rates reflect pre-2020 vintage data.
Work mode (remote/hybrid/on-site)	>85% (2022+)	92-97%	NULL is expected and correct for pre-2020 data. Structured work mode fields did not exist before 2021.
Location (city/state)	>95% parse success	City: 85-93%, State: 92-97%	Geocoded to lat/lng. Parse failures are tracked and reported separately.
Title normalization	>90% mapped (coverage)	83% top-5, 66.5% top-1	Canonicalizes abbreviations, Roman numerals, and level indicators across 99,938 canonical roles. Confidence scores included for downstream filtering.

See the classifier accuracy breakdown for top-1 and top-5 numbers per model, held-out gold set sizes, and class counts.
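
For readers who want to reproduce top-1 and top-5 figures on their own labeled sample, the sketch below shows how such a metric is commonly computed: a prediction counts as a hit when the gold label appears among the first k ranked candidates. The function name, the toy gold set, and the ranked candidate lists are illustrative assumptions, not Canaria's evaluation harness or data.

```python
def top_k_accuracy(gold, ranked_predictions, k):
    """Fraction of examples whose gold label is among the first k predictions."""
    if len(gold) != len(ranked_predictions):
        raise ValueError("gold and ranked_predictions must have the same length")
    if not gold:
        raise ValueError("evaluation set is empty")
    hits = sum(1 for g, preds in zip(gold, ranked_predictions) if g in preds[:k])
    return hits / len(gold)


# Toy gold set of SOC codes (illustrative only, not Canaria data).
gold = ["15-1252", "29-1141", "11-1021", "41-2031"]
ranked = [
    ["15-1252", "15-1251", "15-1299"],  # gold at rank 1
    ["29-1171", "29-1141", "29-2061"],  # gold at rank 2
    ["11-9199", "11-3031", "13-1111"],  # gold not returned
    ["41-2031", "41-2011", "41-1011"],  # gold at rank 1
]

top1 = top_k_accuracy(gold, ranked, k=1)
top5 = top_k_accuracy(gold, ranked, k=5)
print(top1, top5)  # 0.5 0.75
```

Top-5 is always at least as high as top-1, which is why the two numbers in the table should be read as a pair rather than compared across fields.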

Deduplication Methodology

We ingest 1B+ raw postings and apply a multi-stage semantic deduplication pipeline: vector similarity matching on job descriptions, near-duplicate detection for company name variants, and graph-based transitive closure to merge job families across sources and reposting cycles. The result is 1B+ semantically unique postings.
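
As a rough illustration of the graph stage, the sketch below links postings whose description embeddings meet a cosine-similarity threshold and merges the links with a union-find structure, which yields the transitive closure of the pairwise matches. The embeddings, the 0.92 threshold, the all-pairs comparison, and every function name are assumptions made for the example; this is not Canaria's production pipeline, and a system at this scale would presumably use an approximate nearest-neighbor index instead of comparing every pair.

```python
from itertools import combinations
from math import cos, radians, sin, sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def find(parent, x):
    """Return the cluster root of x, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def dedup_clusters(embeddings, threshold=0.92):
    """Group posting IDs into clusters connected by similarity >= threshold."""
    ids = list(embeddings)
    parent = {i: i for i in ids}
    for a, b in combinations(ids, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            ra, rb = find(parent, a), find(parent, b)
            if ra != rb:
                parent[rb] = ra  # merge the two clusters
    clusters = {}
    for i in ids:
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(sorted(c) for c in clusters.values())


def unit(angle_deg):
    """Toy 2-D unit vector standing in for a description embedding."""
    return (cos(radians(angle_deg)), sin(radians(angle_deg)))


# A~B and B~C exceed the threshold, A~C does not; D is unrelated.
embeddings = {"A": unit(0), "B": unit(20), "C": unit(40), "D": unit(90)}
clusters = dedup_clusters(embeddings, threshold=0.92)
print(clusters)  # [['A', 'B', 'C'], ['D']]
```

Postings A and C are not directly similar enough to match, yet they land in the same cluster because both match B. That is the effect transitive closure adds over pairwise matching alone.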

Deduplication is configurable per delivery: clients can adjust similarity thresholds (0.90 to 0.95), posting windows (1 to 6 months), and geographic radius (10 to 50 miles). No duplicate job ID appears in any delivery file.
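
One way to picture these per-delivery settings is as a small validated configuration object. The sketch below is hypothetical: the class name, field names, and default values are assumptions for illustration, and only the allowed ranges are taken from the paragraph above. It does not describe Canaria's actual configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupConfig:
    """Hypothetical per-delivery dedup settings; defaults are illustrative."""

    similarity_threshold: float = 0.92
    posting_window_months: int = 3
    geo_radius_miles: float = 25.0

    def __post_init__(self):
        # Ranges below mirror the documented limits.
        if not 0.90 <= self.similarity_threshold <= 0.95:
            raise ValueError("similarity_threshold must be between 0.90 and 0.95")
        if not 1 <= self.posting_window_months <= 6:
            raise ValueError("posting_window_months must be between 1 and 6")
        if not 10 <= self.geo_radius_miles <= 50:
            raise ValueError("geo_radius_miles must be between 10 and 50")


# A stricter configuration: higher threshold, shorter window, tighter radius.
strict = DedupConfig(similarity_threshold=0.95, posting_window_months=1, geo_radius_miles=10)
print(strict)
```

Raising the threshold or shrinking the window and radius merges fewer postings, so the same raw feed yields more records in the delivery.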

Source	Dedup Rate
Overall (all sources)	40-60%
ATS portals (single-source)	<2%
LinkedIn	<10%
Indeed	<15%
Aggregators (e.g., Google Jobs)	60-70%
Duplicate job ID in delivery	0% (enforced)

High dedup rates on aggregator sources (60-70%) reflect legitimate cross-posting behavior. ATS portal feeds show near-zero internal duplication (<2%) because they are primary sources.
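
When validating a sample, two quick checks are useful: the dedup rate per source and the absence of repeated job IDs. The sketch below assumes the dedup rate means the share of raw postings removed as duplicates (1 minus unique over raw); that definition, the function names, and the counts are assumptions for illustration, not figures from Canaria.

```python
from collections import Counter


def dedup_rate(raw_count, unique_count):
    """Share of raw postings removed as duplicates (assumed definition)."""
    if raw_count <= 0:
        raise ValueError("raw_count must be positive")
    if not 0 <= unique_count <= raw_count:
        raise ValueError("unique_count must be between 0 and raw_count")
    return 1 - unique_count / raw_count


def duplicate_job_ids(job_ids):
    """Return the sorted job IDs that appear more than once in a delivery file."""
    counts = Counter(job_ids)
    return sorted(job_id for job_id, n in counts.items() if n > 1)


# Made-up (raw, unique) counts per source, chosen to fall in the documented bands.
sample_counts = {"ats_portal": (1000, 990), "aggregator": (1000, 350)}
rates = {
    source: round(dedup_rate(raw, unique), 3)
    for source, (raw, unique) in sample_counts.items()
}
print(rates)  # {'ats_portal': 0.01, 'aggregator': 0.65}

# An empty result is what the 0% duplicate-ID guarantee implies.
print(duplicate_job_ids(["j1", "j2", "j3"]))  # []
```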

Acquisition and Delivery Cadence

Sources are crawled continuously, with near-real-time data acquisition across ATS portals and major job boards. ATS portal postings are captured within hours of going live on the employer career page. Customer deliveries are then packaged on a daily, weekly, or monthly cadence to match your pipeline. Snapshots used in the interactive explorer are refreshed weekly. Full-history flat file deliveries are available on request.

Data Vintage and Coverage Notes

Coverage improves significantly for 2022+ data. Many enrichment fields did not exist before 2021-2022:

Work mode (remote / hybrid / on-site): NULL values in pre-2020 data are expected and correct. Fewer than 1% of postings mentioned remote work before 2020. Structured work mode fields did not emerge until 2021. Coverage reaches 85%+ for 2022+ data.

Stated salary: Salary transparency laws (Colorado 2021, NYC 2022, California and Washington 2023) drove a step-change improvement in stated salary coverage. Pre-2022 null rates of 70-95% are expected, not a data gap. Predicted salary is available for all records with valid state, zip code, and SOC code.

SOC code and skills: Enrichment model accuracy is highest for 2023+ data where job descriptions are longer and more structured. Pre-2020 SOC coverage is 70-85% and skills coverage 50-70% due to shorter descriptions and format differences in older postings.

Researcher note: Null rates for pre-2020 data are expected and correct. Comparing null rates across vintages without accounting for structural breaks (remote work adoption, salary transparency laws) will produce misleading conclusions.
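
A simple way to respect these structural breaks is to report null rates per posting year rather than as one pooled figure. The sketch below does that for a single field; the record layout and the field names `posted_year` and `work_mode` are hypothetical, and the sample records are invented for illustration.

```python
from collections import defaultdict


def null_rate_by_year(records, field, year_key="posted_year"):
    """Return {year: share of records where `field` is missing or None}."""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for rec in records:
        year = rec[year_key]
        totals[year] += 1
        if rec.get(field) is None:
            nulls[year] += 1
    return {year: nulls[year] / totals[year] for year in sorted(totals)}


# Invented sample: older postings lack work mode, newer ones mostly have it.
records = [
    {"posted_year": 2019, "work_mode": None},
    {"posted_year": 2019, "work_mode": None},
    {"posted_year": 2023, "work_mode": "remote"},
    {"posted_year": 2023, "work_mode": "hybrid"},
    {"posted_year": 2023, "work_mode": "on-site"},
    {"posted_year": 2023, "work_mode": None},
]

rates_by_year = null_rate_by_year(records, "work_mode")
print(rates_by_year)  # {2019: 1.0, 2023: 0.25}
```

A pooled null rate over this sample would be 50%, which describes neither vintage. Splitting by year keeps the pre-2020 and 2023+ populations separate before any comparison is made.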

Evaluate quality firsthand

Request a sample dataset to run your own validation before committing. We provide sample files with full enrichment fields so you can test against your own benchmarks.
