The One Job, Ten Listings Problem
When a Fortune 500 company posts a Software Engineer opening, that single job appears on LinkedIn within hours, shows up on Indeed the same day, populates Workday and Greenhouse career portals by the end of the week, and eventually surfaces on Google Jobs, SimplyHired, and a handful of regional aggregators. One job. Ten listings. Each with slightly different formatting, company name variations, and location strings.
At Canaria, we process 4.47 billion raw job observations drawn from Indeed, LinkedIn, 200,000+ employer ATS career portals, and dozens of additional sources. After deduplication, those observations resolve to 907 million unique job postings with 82 structured fields per record. That means 79.7% of everything we ingest is a duplicate. Getting that number right is one of the hardest problems in job market data, and simple hash-based matching cannot solve it.
The Naive Approach: Hash Everything
The most intuitive strategy is straightforward. Concatenate a few key fields, hash them, and treat identical hashes as duplicates. Pick job title, company name, and location. Run SHA-256. Done.
This works surprisingly well for a single source. Until you add a second one.
| Field | Indeed | ATS Feed | LinkedIn |
|---|---|---|---|
| Company | Amazon Web Services | AWS | Amazon.com, Inc. |
| Title | Software Development Engineer II | SDE II | Software Dev Engineer 2 |
| Location | Seattle, WA | Seattle, Washington, 98109 | Greater Seattle Area |
A naive hash of title + company + location produces three entirely different hashes. Three records in your dataset. One real job. If your analytics team reports that Amazon is hiring three software engineers when there is actually one, your demand signal is inflated by 200%.
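A minimal sketch of that naive key, applied to the three listings in the table above (the field choices and normalization are illustrative):

```python
import hashlib

def naive_job_hash(title: str, company: str, location: str) -> str:
    """Concatenate the key fields and hash them -- the naive dedup key."""
    key = "|".join(s.strip().lower() for s in (title, company, location))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# The same Amazon role as it appears across three sources.
listings = [
    ("Software Development Engineer II", "Amazon Web Services", "Seattle, WA"),
    ("SDE II", "AWS", "Seattle, Washington, 98109"),
    ("Software Dev Engineer 2", "Amazon.com, Inc.", "Greater Seattle Area"),
]

print(len({naive_job_hash(*fields) for fields in listings}))  # 3 -- one real job, three records
```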
This problem compounds at scale. With 20+ source families feeding into a single pipeline, even small per-source false negative rates (failing to match true duplicates) accumulate into millions of phantom records.
What We Actually See in Production
Our pipeline ingests observations from over 20 distinct source families. The deduplication rate varies dramatically by source type, and understanding why tells you a lot about the structure of the job posting ecosystem.
| Source | Raw Observations | Unique Jobs | Dedup Rate |
|---|---|---|---|
| indeed | 2,064M | 225.9M | 89.1% |
| linkedin | 987.8M | 176.4M | 82.1% |
| myworkdayjobs | 256.5M | 11.8M | 95.4% |
| simplyhired | 207.8M | 105.1M | 49.4% |
| careerbuilder | 105.9M | 58.1M | 45.1% |
| greenhouse | 11.3M | 278K | 97.5% |
| lever | 35.4M | 853K | 97.6% |
Three patterns emerge from this data.
High-volume aggregators show the highest duplication. Indeed, with over 2.06 billion observations, resolves to 225.9 million unique jobs. Each unique Indeed job has been observed an average of 9.1 times across our scraping history. LinkedIn follows a similar pattern at 5.6 observations per posting. These rates reflect continuous scraping against platforms that host listings for weeks or months. The ratio between raw volume and unique count is a useful proxy for how aggressively a platform recycles and re-surfaces listings.
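Both summary figures fall directly out of the raw and unique counts; a quick check against the Indeed and LinkedIn rows:

```python
def dedup_rate(raw_millions: float, unique_millions: float) -> float:
    """Share of raw observations that collapse into an already-seen job."""
    return (raw_millions - unique_millions) / raw_millions

def observations_per_posting(raw_millions: float, unique_millions: float) -> float:
    """Average number of times each unique job was observed."""
    return raw_millions / unique_millions

for source, raw, unique in [("indeed", 2064, 225.9), ("linkedin", 987.8, 176.4)]:
    print(f"{source}: {dedup_rate(raw, unique):.1%} dedup, "
          f"{observations_per_posting(raw, unique):.1f} observations per posting")
# indeed: 89.1% dedup, 9.1 observations per posting
# linkedin: 82.1% dedup, 5.6 observations per posting
```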
Direct ATS feeds sit at the opposite extreme. Greenhouse (97.5%), Lever (97.6%), and Workday (95.4%) all exceed 95% deduplication. The mechanism differs from aggregator duplication. These are small, stable inventories scraped at high frequency. Greenhouse's 11.3 million observations resolve to just 278,000 unique jobs, roughly 40 observations per posting. For data buyers evaluating ATS-sourced feeds, this pattern is worth understanding: high dedup rates in ATS data do not indicate poor sourcing. They indicate thorough longitudinal tracking of a stable inventory. The ATS and aggregator comparison page shows this distinction in more detail.
The middle ground reveals structural differences. SimplyHired (49.4%) and CareerBuilder (45.1%) show lower dedup rates than the major platforms, suggesting key instability where each scrape generates a new identifier for the same position. This makes within-source dedup via identifier impossible and forces reliance on content-level matching.
Why Hash Matching Breaks Down
The fundamental limitation of hash-based deduplication is that it treats the problem as exact match. In reality, it is an entity resolution problem. Four categories of variation make exact matching fail systematically.
Company name variations are pervasive. "Macy's" and "Macys Inc" and "Macy's LLC" are the same employer. Our company canonicalization pipeline processes 1.8 million company name variants using MinHash/Jaccard similarity. Without this step, every name variant creates a false split. The glossary defines the entity resolution terminology used throughout our documentation.
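For intuition, here is a plain-Python sketch of the pairwise comparison behind that step, using exact Jaccard over character 3-grams; the shingle size is illustrative, and the production pipeline approximates this with MinHash and locality-sensitive hashing so that 1.8 million variants never require all-pairs comparison:

```python
def shingles(name: str, n: int = 3) -> set:
    """Character n-grams of a lowercased, punctuation-stripped company name."""
    cleaned = " ".join("".join(ch for ch in name.lower() if ch.isalnum() or ch == " ").split())
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}

def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity between two names' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

print(jaccard("Macy's", "Macys Inc"))        # ~0.43: likely the same employer
print(jaccard("Macy's", "Mercy Hospital"))   # 0.0: unrelated entities
```

MinHash replaces the full shingle sets with short signatures whose collision probability tracks Jaccard similarity, which is what keeps the comparison tractable at this scale.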
Job titles are unstable across platforms. "Software Engineer II" on LinkedIn becomes "SDE II" on the company's career page. Title normalization requires NLP, not string comparison. We map raw titles to canonical forms so that variations like "SW Eng II" and "Software Engineer, Level 2" resolve to the same standardized representation.
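The production mapping is learned, but a rule-based stand-in shows the shape of the problem; the abbreviation and level rules below are illustrative, not the actual model:

```python
import re

# Illustrative expansion rules -- the real pipeline learns these mappings with NLP.
ABBREVIATIONS = {r"\bsw\b": "software", r"\bdev\b": "development",
                 r"\beng\b": "engineer", r"\bsde\b": "software development engineer"}

def normalize_title(raw: str) -> str:
    title = raw.lower().replace(",", " ")
    for pattern, expansion in ABBREVIATIONS.items():
        title = re.sub(pattern, expansion, title)
    title = re.sub(r"\blevel\s+2\b", "ii", title)   # unify level notation
    title = re.sub(r"\b2\b", "ii", title)
    return " ".join(title.split())

for raw in ["Software Engineer II", "SW Eng II", "Software Engineer, Level 2"]:
    print(normalize_title(raw))   # all three: software engineer ii
```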
Location representations are inconsistent. "New York, NY" and "NYC" and "Greater New York Area" all refer to the same metro area. Some platforms include zip codes, others use metro area names. Our geography of hiring post covers the location parsing challenge in depth, including how ambiguous city names like Portland (Oregon vs. Maine) and the 26 US Springfields complicate geographic analysis downstream of dedup.
Descriptions get reformatted. Indeed reformats employer-submitted descriptions. LinkedIn applies its own rendering. The same content can differ by hundreds of characters while conveying identical information.
Multi-Signal Entity Resolution
Our deduplication pipeline uses multi-signal entity resolution rather than relying on any single matching strategy. The methodology page covers the technical details.
- Semantic vector similarity for descriptions. Even when descriptions are reformatted, their semantic content remains stable. Dense vector embeddings and cosine similarity identify postings that describe the same role regardless of exact wording (a sketch of this step, together with the graph step, follows this list).
- MinHash and Jaccard similarity for company names. Locality-sensitive hashing efficiently identifies name pairs that are variations of the same entity.
- Title normalization through NLP. Our model maps raw titles to canonical forms, collapsing surface-level variation into standardized representations.
- Geographic clustering. Location strings are parsed into structured components and clustered by proximity. The full field definitions are in our data schema.
- Graph-based transitive matching. If posting A matches B, and B matches C, our graph processing unifies all three, even if A and C would not have matched directly. This transitive closure step is where multi-signal approaches pull furthest ahead of hash-based methods. Isolated pairwise matching misses chains of similarity that only become visible in a graph.
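A compact sketch of the first and last items above: cosine similarity over description embeddings (toy vectors stand in for whatever embedding model produces them), then union-find to take the transitive closure of pairwise matches. The threshold is illustrative; in practice it would be tuned against labeled pairs.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two description embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

class UnionFind:
    """Merges postings transitively: if A~B and B~C, all three share one root."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path compression
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Toy embeddings for three observations that may describe the same role.
embeddings = {
    "A": np.array([0.90, 0.10, 0.40]),
    "B": np.array([0.85, 0.20, 0.45]),
    "C": np.array([0.70, 0.35, 0.55]),
}

THRESHOLD = 0.97   # illustrative
uf = UnionFind()
ids = list(embeddings)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        if cosine(embeddings[a], embeddings[b]) >= THRESHOLD:
            uf.union(a, b)

print({x: uf.find(x) for x in ids})
# {'A': 'C', 'B': 'C', 'C': 'C'} -- A and C unify even though only A-B and B-C matched directly
```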
The Dedup Key Design
Our primary deduplication key, jtsId, is a SHA-256 hash of four components: the job key, the job title, the source platform, and the source region. Company name is deliberately excluded because company name inconsistency would cause false splits.
This is a design tradeoff worth explaining. Including company name in the hash would increase precision (fewer false merges) but dramatically reduce recall (more false splits). Given that company name is the single most inconsistent field across sources, excluding it from the dedup key and handling company resolution separately produces better results.
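As a sketch, the key construction looks something like the following; the delimiter, field normalization, ordering, and example values are assumptions for illustration, not the production implementation:

```python
import hashlib

def jts_id(job_key: str, title: str, source_platform: str, source_region: str) -> str:
    """SHA-256 over the four dedup-key components. Company name is deliberately absent,
    so employer-name variants cannot split the same posting into multiple records."""
    payload = "\x1f".join(p.strip().lower() for p in (job_key, title, source_platform, source_region))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

print(jts_id("5f2c91", "Software Development Engineer II", "indeed", "us"))
```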
Across all 907 million records in our delivery table, the count of duplicate jtsId values is exactly zero. This is enforced by a LIMIT 1 BY jtsId clause in the merge pipeline and verified by automated QA on every run. You can inspect the output schema, including jtsId, in the schema documentation.
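The uniqueness guarantee is enforced in the database, but the QA assertion itself is easy to express; a toy-scale sketch, assuming an iterable of jtsId values pulled from the delivery table:

```python
from collections import Counter

def assert_unique_jts_ids(jts_ids) -> None:
    """Fail the run if any jtsId appears more than once (production runs this check in-database)."""
    duplicates = {k: n for k, n in Counter(jts_ids).items() if n > 1}
    assert not duplicates, f"duplicate jtsId values found: {duplicates}"

assert_unique_jts_ids(["a1", "b2", "c3"])   # passes silently
```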
Why This Matters for Analytics
The consequences of poor deduplication cascade through every downstream analysis.
- Inflated demand signals. If the same job appears five times, your hiring demand estimates are 5x too high for that position. At 907 million unique records, even a 1% false negative rate means 9 million ghost postings distorting demand models.
- Distorted source comparisons. Comparing Indeed job counts to LinkedIn counts is meaningless without cross-source dedup. The table above shows raw Indeed volume at 2,064M vs. LinkedIn at 987.8M, but unique counts narrow to 225.9M vs. 176.4M. The raw ratio overstates the gap by roughly 60%.
- Incorrect time-series analysis. A job that persists for three months and gets re-scraped weekly appears as 12 new postings. Your time-series shows 12 hiring events. Reality shows one persistent opening.
- Skewed geographic analysis. Multi-location postings may appear as separate jobs in two cities, doubling apparent demand in both locations. This interacts with the location parsing challenge described in our geography of hiring post.
At the scale Canaria operates, with 907 million unique postings resolved from 4.47 billion raw observations, even small percentage errors translate to millions of ghost records. When clients build hiring models, compensation benchmarks, or labor market indices, the accuracy of the underlying dedup determines whether those products reflect reality or amplify noise.
If you want to see what deduplicated, structured job data looks like across 82 fields, request a sample or explore available datasets.
Want to see the data for yourself?
Get a free sample of 5,000 enriched job records, delivered within 24 hours.