The Raw Data Problem
A typical raw job posting, scraped from a job board or employer career page, gives you something like this:
- A job title (free text, unstandardized: "Sr. SW Eng II - Platform")
- A company name (maybe "AMZN," maybe nothing)
- A location string (free text: "Greater NYC Area," "Remote - US," "Springfield")
- A posting date (in whatever format the source prefers)
- A description (HTML blob, 100 to 2,000+ words)
- Sometimes a stated salary. Usually not.
That is roughly five to seven usable fields, most of them unstructured text. If you need to answer "how many remote senior software engineering roles were posted in Q3 2024 with salaries above $150K," raw data cannot get you there. You lack standardized titles, reliable seniority classification, structured work mode fields, and normalized salary figures.
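That query becomes a simple filter once the fields exist. A minimal sketch, with illustrative field names and SOC code conventions (not the actual schema):

```python
# Hypothetical enriched records; field names are illustrative only.
postings = [
    {"seniority": "senior", "soc_code": "15-1252", "work_mode": "remote",
     "posted": "2024-08-14", "salary_avg_usd": 172_000},
    {"seniority": "mid", "soc_code": "15-1252", "work_mode": "onsite",
     "posted": "2024-07-02", "salary_avg_usd": 128_000},
    {"seniority": "senior", "soc_code": "29-1141", "work_mode": "remote",
     "posted": "2024-09-30", "salary_avg_usd": 95_000},
]

def q3_2024_remote_senior_swe(records, min_salary=150_000):
    """Count remote senior software-engineering roles (SOC 15-1252,
    Software Developers) posted in Q3 2024 above a salary floor."""
    return sum(
        1 for r in records
        if r["seniority"] == "senior"
        and r["soc_code"] == "15-1252"
        and r["work_mode"] == "remote"
        and "2024-07-01" <= r["posted"] <= "2024-09-30"
        and r["salary_avg_usd"] > min_salary
    )

print(q3_2024_remote_senior_swe(postings))  # → 1
```

Every clause in that filter depends on a field that raw postings do not carry. That is the gap enrichment closes.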
This is not a niche problem. It is the central problem of job market data. To understand what structured enrichment actually produces, take a look at how raw and enriched records compare side by side.
What Enrichment Actually Produces
Every record that enters the Canaria pipeline exits with up to 82 structured fields. The enrichment layer applies a sequence of NLP models, classification algorithms, parsing logic, and entity resolution systems to transform raw text into structured, queryable data. For the full field list, see the data schema.
Here is what that looks like across 907.5 million records:
Classification Fields
| Field | Records Covered | Coverage Rate |
|---|---|---|
| Normalized title | 907.5M | 99.97% |
| SOC code (6-digit) | 722.7M | 79.6% |
| Seniority level | 646.1M | 71.2% |
| Employment type | 637.6M | 70.2% |
| Remote/work mode | 722.7M | 79.6% |
Extraction Fields
| Field | Records Covered | Coverage Rate |
|---|---|---|
| Skills (non-empty array) | 767.7M | 84.6% |
| Soft skills | 671.1M | 73.9% |
| Certifications | 82.3M | 9.1% |
| Benefits | 466.3M | 51.4% |
Location Parsing
| Field | Records Covered | Coverage Rate |
|---|---|---|
| City | 839.8M | 92.5% |
| State | 862.9M | 95.1% |
| Country | 839.3M | 92.5% |
| Zipcode | 831.5M | 91.6% |
Salary
- Stated salary (greater than zero): 203.9M records (22.5%)
- Predicted salary: MAPE under 15%, trained on 50M+ compensation observations, available where state, zip, and SOC code are present. More on how the prediction model works in Salary Prediction from 50 Million Observations.
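MAPE (mean absolute percentage error) is the average relative miss between predicted and stated values. A quick sketch of the metric itself, with made-up numbers:

```python
def mape(actual, predicted):
    """Mean absolute percentage error: mean of |actual - predicted| / actual."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

stated  = [100_000, 150_000, 80_000]   # illustrative stated salaries
modeled = [110_000, 140_000, 76_000]   # illustrative model predictions

print(f"{mape(stated, modeled):.1%}")  # → 7.2%
```

"Under 15%" means that on held-out stated salaries, the model's predictions miss by less than 15% on average.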
These are measured coverage rates from the production database as of early 2026.
The Multiplier Effect: Raw vs. Enriched
The most striking way to understand enrichment is to compare what sources provide natively with what the NLP pipeline produces.
Seniority coverage jumps from 18.9% to 71.2% after enrichment, a 3.8x multiplier. Across 907.5 million records, only 171.3 million (18.9%) arrive with any seniority information from the source. After NLP classification, 646.1 million records (71.2%) have a seniority label. That takes you from roughly one in five records to nearly three in four.
Employment type follows a similar pattern: 26.2% raw to 70.2% enriched, a 2.7x improvement. Sources provide it on 238.0 million records. After enrichment: 637.6 million records.
If you are buying raw scraped data, you are working with 19% seniority coverage. With enriched data, you have 71%. For compensation benchmarking, talent supply modeling, or career progression analysis, that gap is the difference between a viable product and a broken one.
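The multipliers above are straightforward to verify from the record counts:

```python
TOTAL = 907.5  # million records in the dataset

def coverage_lift(raw_m, enriched_m, total_m=TOTAL):
    """Return (raw coverage, enriched coverage, multiplier) for a field,
    given record counts in millions."""
    return raw_m / total_m, enriched_m / total_m, enriched_m / raw_m

# Seniority: 171.3M records with source-provided labels, 646.1M after NLP
raw_pct, enr_pct, mult = coverage_lift(171.3, 646.1)
print(f"{raw_pct:.1%} -> {enr_pct:.1%}, {mult:.1f}x")  # → 18.9% -> 71.2%, 3.8x
```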
Skills extraction tells an even sharper story. Raw postings carry skills as unstructured text buried in job descriptions. The enrichment pipeline extracts structured, taxonomy-mapped skill arrays from 767.7 million records (84.6% coverage) using a 37,000+ skill taxonomy. The difference between "this description mentions Python somewhere" and "this role requires Python, listed with a relevance score and taxonomy ID" is the difference between keyword search and structured analysis. For a deeper look at how extraction works at that scale, see 37,000 Skills: Extracting Signal from Noise.
Walking Through the Enrichment Stack
Here is how a single record moves through the pipeline.
Ingestion and deduplication. A raw posting arrives from one of 22 sources. It receives a content hash and a dedup key. Multi-signal entity resolution (vector similarity on descriptions, MinHash/Jaccard on company names, title similarity modeling, geo-clustering, graph-based transitive matching) identifies duplicates across sources. A single Fortune 500 posting can appear on five to ten channels simultaneously, and without dedup, your counts are inflated by 40-60%. The pipeline processes over 900 million unique records from billions of raw observations. The details of how dedup works at this scale are covered in Deduplication Is Harder Than You Think.
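The Jaccard measure underneath the MinHash step is easy to show directly. At production scale MinHash approximates it without pairwise comparison; this sketch computes it exactly on character shingles:

```python
def shingles(text, k=3):
    """Character k-shingles, lowercased, whitespace collapsed."""
    s = " ".join(text.lower().split())
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

# Same employer, slightly different strings across job boards:
print(jaccard("Amazon Web Services", "Amazon Web Services Inc"))  # ≈ 0.81
```

A pair scoring above a tuned threshold on this signal, combined with the vector, title, and geo signals, becomes a dedup candidate.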
NLP classification. Title normalization maps the raw title to a canonical form with a confidence score. SOC classification uses both title and description context, aligned with the BLS taxonomy. Seniority, employment type, and remote work status are classified from title patterns and description text. More detail on the models behind this is available on the methodology page.
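Title-pattern classification at its simplest looks like rule matching, though the production models also weigh description context. A minimal, purely illustrative sketch:

```python
import re

# Illustrative patterns; the real classifier covers far more variants
SENIORITY_PATTERNS = [
    (r"\b(sr\.?|senior|staff|principal)\b", "senior"),
    (r"\b(jr\.?|junior|entry[- ]level|intern)\b", "junior"),
]

def classify_seniority(title):
    """Rule-based seniority fallback from the raw title string."""
    t = title.lower()
    for pattern, label in SENIORITY_PATTERNS:
        if re.search(pattern, t):
            return label
    return None  # ambiguous titles need description context

print(classify_seniority("Sr. SW Eng II - Platform"))  # → senior
```

Titles like "Manager" with no qualifying pattern fall through to `None`, which is exactly where the description-based models earn the coverage lift.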
Location parsing. Raw location strings like "Greater New York Area" or "Springfield, IL 62701" are parsed into structured city, state, country, and zipcode fields. This involves address parsing, country inference, state normalization, zipcode lookup, and special handling for "Remote" as a non-geographic designation. Location parsing across 907M records reveals its own set of interesting patterns, covered in The Hidden Geography of Hiring.
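A toy version of that parsing logic, handling the "City, ST zipcode" shape and the non-geographic "Remote" case (the production parser covers far more formats):

```python
import re

def parse_location(raw):
    """Parse a raw location string into structured fields. Illustrative only."""
    if "remote" in raw.lower():
        # "Remote" is a work-mode designation, not a geography
        return {"city": None, "state": None, "zipcode": None, "remote": True}
    m = re.match(
        r"^(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})(?:\s+(?P<zip>\d{5}))?$",
        raw.strip(),
    )
    if not m:
        return {"city": None, "state": None, "zipcode": None, "remote": False}
    return {"city": m["city"].strip(), "state": m["state"],
            "zipcode": m["zip"], "remote": False}

print(parse_location("Springfield, IL 62701"))
# → {'city': 'Springfield', 'state': 'IL', 'zipcode': '62701', 'remote': False}
```

Strings like "Greater New York Area" fail this naive pattern, which is why the real pipeline layers address parsing, state normalization, and zipcode lookup on top.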
Salary normalization. Stated salaries are normalized to annualized USD with min/max/average values. When salary is not stated, a prediction model (MAPE under 15%) provides an estimate where sufficient context exists. Salary stated coverage has improved dramatically since 2022 due to US transparency laws, a trend explored in The Salary Transparency Data Shift.
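Annualization itself is simple arithmetic once the pay period is identified. A sketch assuming a 40-hour week and 260 working days (conventions vary):

```python
# Assumed conversion factors: 40 h/week × 52 weeks, 260 working days
PERIODS_PER_YEAR = {"hour": 2080, "day": 260, "week": 52, "month": 12, "year": 1}

def annualize(amount, period):
    """Normalize a stated salary figure to annualized USD."""
    return amount * PERIODS_PER_YEAR[period]

# A stated range of $40-50/hour becomes min/max/average annual figures
lo, hi = annualize(40, "hour"), annualize(50, "hour")
print(lo, hi, (lo + hi) / 2)  # → 83200 104000 93600.0
```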
Skills extraction. An Aho-Corasick dictionary matches against 37,000+ technical skills, 3,000+ certifications, and 400+ soft skill categories. An NLP relevance filter removes false positives. "Java" mentioned in a barista job description gets filtered out; "Java" in a backend engineer description gets kept. The result is structured arrays, not keyword hits.
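The barista-vs-backend distinction comes from the relevance filter, not the dictionary match. This sketch stands in a crude context-word check for the NLP filter and a three-entry set for the 37,000-skill taxonomy:

```python
import re

SKILL_DICT = {"python", "java", "sql"}  # tiny stand-in for the taxonomy
CONTEXT = {"java": {"backend", "jvm", "spring", "software"}}  # relevance cues

def extract_skills(description):
    """Dictionary match plus a crude relevance filter (illustrative only;
    production uses Aho-Corasick matching and an NLP filter)."""
    tokens = set(re.findall(r"[a-z+#]+", description.lower()))
    hits = SKILL_DICT & tokens
    return sorted(
        s for s in hits
        if s not in CONTEXT or CONTEXT[s] & tokens  # keep if context supports it
    )

print(extract_skills("Backend engineer: Java, Spring, SQL"))  # → ['java', 'sql']
print(extract_skills("Barista. Loves java and espresso."))    # → []
```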
Company enrichment. Company names are matched to canonical entities using fuzzy matching (MinHash/Jaccard similarity), with profile data joined from a 28-million-record company master database covering size, industry, founding year, and headquarters.
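Conceptually, company enrichment is alias resolution plus fuzzy matching plus a profile join. This sketch uses the standard library's `difflib` in place of MinHash/Jaccard, with a made-up alias table and master rows:

```python
import difflib

# Illustrative alias table and company master rows, not the real database
ALIASES = {"amzn": "Amazon.com, Inc."}
MASTER = {
    "Amazon.com, Inc.":     {"industry": "E-commerce", "size": "10000+"},
    "Amazon Logistics LLC": {"industry": "Logistics",  "size": "10000+"},
}

def resolve_company(raw_name, cutoff=0.6):
    """Map a raw company string to a canonical master record, or (None, None)."""
    key = ALIASES.get(raw_name.lower(), raw_name)
    match = difflib.get_close_matches(key, MASTER, n=1, cutoff=cutoff)
    return (match[0], MASTER[match[0]]) if match else (None, None)

name, profile = resolve_company("AMZN")
print(name)  # → Amazon.com, Inc.
```

The payoff is the join: once "AMZN" resolves to a canonical entity, size, industry, founding year, and headquarters attach to every posting from that employer.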
Final merge. All enrichments join back to the base record with priority logic resolving conflicts (parsed location beats scraped location, CDP industry beats job posting metadata). The output: a single record with 82 structured fields.
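The merge step reduces to a dictionary join with explicit precedence rules. A minimal sketch of that priority logic, with illustrative field names:

```python
def merge_enrichments(base, *layers, priority_fields=None):
    """Join enrichment layers onto the base record. Later layers fill
    missing fields, and overwrite existing ones only for fields named in
    priority_fields (e.g. parsed location beats scraped location)."""
    priority_fields = priority_fields or set()
    out = dict(base)
    for layer in layers:
        for k, v in layer.items():
            if k not in out or (k in priority_fields and v is not None):
                out[k] = v
    return out

base   = {"location": "Greater NYC Area", "title": "Sr. SW Eng II"}
parsed = {"location": "New York, NY", "city": "New York"}
record = merge_enrichments(base, parsed, priority_fields={"location"})
print(record["location"])  # → New York, NY
```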
Why This Matters for Your Use Case
For ML training data. Raw data gives you labels on 19-26% of records for key fields. Enriched data gives you 70-85%. That is not a marginal improvement. That is the difference between a model that trains and one that does not have enough signal.
For analytics and dashboards. If your "salary insights" feature only has data on 22.5% of postings, you need predicted salary to fill the gap. If your "skills demand" feature depends on structured arrays, you need extraction covering 84.6% of records, not keyword grep over raw descriptions.
For investment signals. Raw data has structural breaks: salary transparency laws in 2022-2023, remote work fields emerging in 2020-2021, SOC code revisions in 2018. An enrichment layer that retroactively applies modern classification to historical data is what makes longitudinal analysis possible.
For deduplication. Without multi-signal entity resolution, your job counts are inflated by 40-60%. If you are reporting "500,000 new software engineering postings this quarter" but half are duplicates across sources, your number is meaningless.
What Enrichment Does Not Fix
Enrichment is not magic. Some limitations are structural:
- Certification coverage sits at 9.1%. This is not a pipeline failure. Most job postings simply do not require specific certifications. Among roles that do (nursing, IT security, finance, trades), extraction rates are substantially higher.
- Stated salary is only 22.5%. Transparency laws are pushing this number up year over year, but enrichment cannot invent salary data that employers did not disclose. Prediction fills some of the gap, but predicted and stated salary are different signals.
- SOC classification at 79.6% means 20.4% of records lack an occupational code. Ambiguous titles ("Manager," "Associate," "Specialist" without description context) and new-economy roles ("Growth Hacker," "Prompt Engineer") are the primary gaps.
Knowing where coverage drops, and why, is as important as knowing where it is high. The glossary covers the terminology behind these classifications.
The Numbers
99.97% title normalization. 79.6% SOC classification. 84.6% skills extraction. 92.5% location parsing. A 3.8x multiplier on seniority coverage. Across 907.5 million records and 82 fields.
Raw data is a commodity. Measured enrichment at scale is the hard part.
If you want to see what this looks like on actual records, request a sample.
Want to see the data for yourself?
Get a free sample of 5,000 enriched job records, delivered within 24 hours.