Five Strings, One Place
Consider these five location strings, all referring to the same place:
- New York, NY 10001
- NYC
- Greater New York Area
- New York, New York
- New York, NY or San Francisco, CA
Now consider that your dataset contains 907 million job postings with 82 fields each, where every location field was entered by a different employer, scraped from a different platform, and formatted according to whatever conventions that platform happened to use at the time. Some include zip codes. Some use state abbreviations. Some reference metro areas that do not correspond to any single city. And some simply say "Remote."
Location parsing is not a preprocessing step. It is a core data engineering challenge that determines whether anyone downstream can accurately answer "Where is hiring happening?" and "How does compensation vary by market?"
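To make the problem concrete, here is a minimal sketch of what normalizing those five variants to a single place involves. The lookup tables and the `normalize_location` function are hypothetical stand-ins, not the production pipeline; real systems maintain far larger alias and metro tables.

```python
import re

# Hypothetical lookup tables -- real pipelines maintain far larger mappings.
METRO_ALIASES = {"greater new york area": ("New York", "NY")}
CITY_ALIASES = {"nyc": ("New York", "NY")}

def normalize_location(raw: str):
    """Sketch: map several raw location variants to one (city, state) pair."""
    s = raw.strip().lower()
    if s in METRO_ALIASES:
        return METRO_ALIASES[s]
    if s in CITY_ALIASES:
        return CITY_ALIASES[s]
    # Keep only the first alternative in multi-location strings like "A or B".
    first = re.split(r"\s+or\s+", s)[0]
    # "City, ST" optionally followed by a 5-digit zip code.
    m = re.match(r"([a-z .'-]+),\s*([a-z]{2})(?:\s+\d{5})?$", first)
    if m:
        return (m.group(1).title(), m.group(2).upper())
    # Spelled-out state name, e.g. "New York, New York".
    m = re.match(r"([a-z .'-]+),\s*new york$", first)
    if m and m.group(1) == "new york":
        return ("New York", "NY")
    return (None, None)

variants = [
    "New York, NY 10001",
    "NYC",
    "Greater New York Area",
    "New York, New York",
    "New York, NY or San Francisco, CA",
]
assert all(normalize_location(v) == ("New York", "NY") for v in variants)
```

Even this toy version needs three distinct mechanisms (alias tables, pattern matching, multi-location splitting), which is the point: each variant fails for a different reason.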
The Scale of the Problem
Our delivery table contains 907 million job records drawn from Indeed, LinkedIn, 200,000+ employer ATS career portals, and dozens of additional sources spanning 2022 to the present. Each record carries a raw scrapedLocation string. Our location parsing pipeline processes these strings into structured geographic components.
| Component | Records Parsed | Coverage Rate |
|---|---|---|
| Country | 839.3M | 92.5% |
| State | 862.9M | 95.1% |
| City | 839.8M | 92.5% |
| Zip Code | 831.5M | 91.6% |
State resolution leads at 95.1% because US states are relatively unambiguous (50 states, well-known abbreviations). City parsing trails at 92.5% because cities are the most ambiguous component. There are 26 Springfields in the United States alone, and "Portland" requires state context to resolve.
These numbers represent honest measurement against a real-world dataset, not cherry-picked accuracy on clean test data. For a deeper look at what our structured fields contain, explore the data schema.
The Priority Logic
Not all location data is created equal. A parsed location derived from structured address components is more reliable than a raw scraped string, which is more reliable than a calculated inference. Our merge pipeline implements a priority waterfall:
```
finalCity    = parsedCity    OR scrapedCity    OR calcCity
finalState   = parsedState   OR scrapedState   OR calcState
finalCountry = parsedCountry OR scrapedCountry OR calcCountry
```
The parsed values come from a dedicated location parsing pipeline that uses the usaddress library and supplementary parsers to decompose location strings. The calculated values are inferences: if we know the zip code is 10001, we can calculate that the city is New York. This priority logic favors precision over recall: a record with no parsed city falls back to the scraped string, and resorts to a calculated inference only when nothing else is available.
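The waterfall is essentially a coalesce over sources ordered by confidence. The sketch below is illustrative only; the record shape, field names, and the `ZIP_TO_CITY` table are assumptions for the example, not the production schema.

```python
from typing import Optional

def coalesce(*values: Optional[str]) -> Optional[str]:
    """Return the first non-empty value: parsed > scraped > calculated."""
    for v in values:
        if v:  # treat None and "" as missing
            return v
    return None

# Illustrative slice of a zip-to-city lookup used for calculated inference.
ZIP_TO_CITY = {"10001": "New York"}

# Hypothetical record: city parsing failed, but a zip code was extracted.
record = {"parsedCity": None, "scrapedCity": "NYC", "parsedZip": "10001"}
calc_city = ZIP_TO_CITY.get(record["parsedZip"])  # calculated inference
final_city = coalesce(record["parsedCity"], record["scrapedCity"], calc_city)
# The scraped string wins here: the waterfall prefers observed data
# over inference, even when the inference looks cleaner.
```

Note the design choice this encodes: `"NYC"` beats the inferred `"New York"` because a scraped value, however messy, was actually observed, while a calculated value is a guess.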
The methodology page covers how this priority waterfall interacts with the broader enrichment pipeline.
Where the Jobs Are
The geographic distribution across 907.5 million records reveals the structure of the labor market as captured by digital job postings.
The United States dominates with 810.6 million records, reflecting both the maturity of US job boards and the depth of source coverage in our pipeline. International presence is significant and growing. Canada follows at 8.6 million, with the United Kingdom, France, Germany, India, and Brazil each contributing over 1 million records.
Within the US, the geographic distribution of postings mirrors economic geography.
| State | Records | State | Records |
|---|---|---|---|
| California | 71.5M | Virginia | 26.3M |
| Texas | 61.3M | Georgia | 24.8M |
| Florida | 46.5M | Michigan | 23.7M |
| New York | 36.6M | Massachusetts | 22.0M |
| Pennsylvania | 33.6M | Washington | 20.6M |
California leads at 71.5 million records, roughly 8.8% of US postings, followed by Texas at 61.3 million. The top five states account for roughly 31% of all US job postings.
The City-Level Surprise
City-level data is where the real insights live. Houston narrowly edges out New York for the top spot, with both cities at roughly 5.4 million records. Texas places three cities in the top six (Houston, Austin, Dallas) and four in the top ten, including San Antonio. This concentration reflects the state's diversified economy: energy in Houston, tech in Austin, finance and logistics in Dallas, military and healthcare in San Antonio.
The presence of Charlotte (3.1M), Nashville (2.8M), and Indianapolis (3.0M) in the top 15 highlights the continued rise of mid-tier metros as hiring centers. These are not cities that dominated hiring a decade ago. Their prominence here reflects real geographic shifts in where companies are building teams. For data buyers doing site selection or talent market analysis, the gap between perceived and actual hiring geography is often wider than expected.
The Failure Modes
Our QA system continuously monitors parsing quality and has identified several systematic failure patterns. Understanding these is useful for anyone working with geographic job data, whether from our pipeline or another provider.
Ambiguous city names remain the hardest parsing problem. Portland, Oregon and Portland, Maine are two entirely different labor markets separated by 3,000 miles. Springfield exists in 26 US states. Without state or country context, these cities are genuinely ambiguous. Our pipeline resolves ambiguity through contextual clues: state when present, zip code lookup when available, source-level metadata as a fallback. The glossary defines terms like "geo-clustering" and "metro area normalization" that appear throughout our documentation.
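The contextual-clue fallback chain described above can be sketched as follows. The function and the tiny `ZIP_TO_STATE` table are hypothetical illustrations of the approach, not the actual resolver.

```python
# Illustrative zip-to-state entries; a real table covers all ~42,000 US zips.
ZIP_TO_STATE = {"97201": "OR", "04101": "ME"}

def resolve_city(city, state=None, zip_code=None, source_default_state=None):
    """Return (city, state): explicit state first, then zip lookup,
    then source-level metadata, else leave the city unresolved."""
    if state:
        return (city, state)
    if zip_code and zip_code in ZIP_TO_STATE:
        return (city, ZIP_TO_STATE[zip_code])
    if source_default_state:
        return (city, source_default_state)
    return (city, None)  # genuinely ambiguous; do not guess

resolve_city("Portland", zip_code="04101")  # resolves to Maine via the zip
resolve_city("Portland")                    # stays unresolved
```

Leaving `state` as `None` rather than guessing the most populous match is the same precision-over-recall posture as the priority waterfall.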
Metro area strings require a dedicated mapping layer. Job boards frequently use metro area descriptions instead of specific cities: "Greater New York Area", "San Francisco Bay Area", "Dallas-Fort Worth Metroplex." These strings do not correspond to any single city. Parsing them requires a mapping layer that translates informal metro names to canonical geographic identifiers. We maintain these mappings and update them as platforms introduce new metro area labels.
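A metro mapping layer can be as simple as a normalized alias table keyed to an anchor city. The table below is a hypothetical three-entry slice; mapping each metro to a single anchor city is one design choice among several (an alternative is mapping to a CBSA-style metro identifier).

```python
# Hypothetical metro-alias table; a real mapping layer is much larger
# and updated as platforms introduce new metro labels.
METRO_MAP = {
    "greater new york area": {"city": "New York", "state": "NY"},
    "san francisco bay area": {"city": "San Francisco", "state": "CA"},
    "dallas-fort worth metroplex": {"city": "Dallas", "state": "TX"},
}

def resolve_metro(raw: str):
    """Translate an informal metro label to a canonical anchor city, or None."""
    return METRO_MAP.get(raw.strip().lower())

resolve_metro("San Francisco Bay Area")  # maps to San Francisco, CA
resolve_metro("Boise, ID")               # not a metro alias: None
```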
Multi-location postings lose secondary locations. "New York, NY or San Francisco, CA" is a single posting for a role in either city. Our pipeline assigns the first parseable location as the primary, which means secondary locations are lost unless the downstream consumer examines the raw string. This is a known tradeoff: attempting to create two records from one posting introduces duplication, which interacts with the deduplication problem in ways that are difficult to control.
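The first-alternative-wins behavior looks roughly like this. The record layout and the `split_alternatives` helper are hypothetical; the point is that the raw string is the only place the secondary location survives.

```python
import re

def split_alternatives(raw: str):
    """Split a multi-location string like 'A or B' into its alternatives."""
    return [part.strip() for part in re.split(r"\s+or\s+", raw, flags=re.IGNORECASE)]

raw = "New York, NY or San Francisco, CA"
record = {
    "scrapedLocation": raw,        # the raw string survives untouched
    "finalCity": "New York",       # first parseable alternative wins
    "finalState": "NY",
    # Secondary alternatives are recoverable only from scrapedLocation:
    "alternatives": split_alternatives(raw)[1:],
}
```

Emitting one record per alternative instead would double-count the posting, which is why the tradeoff lands on the side of a single primary location.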
The "Remote" problem illustrates a category boundary. 4.9 million records have "Remote" as their country value. Remote is not a geographic location. It is a work modality. But it appears in the location field because "Where is this job?" followed by "Remote" is a legitimate answer from a job seeker's perspective. The root cause traces to a legacy mapping where "Remote" exists as a valid entry in the country inference module.
This illustrates a broader challenge: the boundary between "where" and "how" is not always clear in job posting data. "Remote - US" is both a work mode and a geographic constraint. The growth of remote and hybrid work since 2020 has made this overlap increasingly common. Our remote work trends data tracks how this category has evolved.
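One way to untangle the overlap is to treat work mode and geographic constraint as separate outputs of a single classifier. This is a sketch of the idea, not the production module; the pattern and return shape are assumptions for illustration.

```python
import re

# "Remote" optionally followed by "- <geo>", e.g. "Remote - US".
REMOTE_PATTERN = re.compile(r"^\s*remote\s*(?:-\s*(?P<geo>.+))?$", re.IGNORECASE)

def split_mode_and_geo(location: str):
    """Return (work_mode, geographic_constraint).
    'Remote - US' carries both; bare 'Remote' carries only the mode."""
    m = REMOTE_PATTERN.match(location)
    if m:
        return ("remote", m.group("geo"))  # geo is None for bare "Remote"
    return ("unspecified", location)

split_mode_and_geo("Remote - US")  # mode and a geographic constraint
split_mode_and_geo("Remote")       # mode only; no geography to parse
split_mode_and_geo("Austin, TX")   # ordinary location string
```

Routing `"Remote"` into a work-mode field rather than the country field is exactly the fix the legacy mapping described above would need.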
The International Parsing Gap
While US location parsing benefits from well-structured address conventions and mature libraries like usaddress, international location strings present additional challenges. Address formats vary by country (Japan uses block-level addressing, Germany uses street-then-number). Transliteration of non-Latin scripts introduces further variation. Our current pipeline achieves 92.5% country-level coverage, but city-level accuracy on international postings lags behind US accuracy by roughly 5 to 8 percentage points. This is an area of active development.
Why Accurate Location Data Matters
Every downstream application depends on parsing accuracy.
- Site selection and expansion planning. If 8% of software engineering jobs in Austin are misclassified, the talent supply estimate is systematically understated.
- Compensation benchmarking by market. A software engineer salary of $180,000 is below market in San Francisco and above market in Omaha. Wrong location means wrong benchmark. The salary transparency dataset depends directly on location accuracy for jurisdiction assignment.
- Labor market indices. Time-series analyses require consistent classification over time. If the parser changes behavior between 2023 and 2024, the apparent geographic shift is a parser artifact, not a real trend.
- Regulatory compliance. Salary transparency laws are jurisdiction-specific. Analyzing compliance rates requires knowing which jobs fall under which jurisdiction.
Location parsing improves incrementally rather than being solved definitively. Each new source introduces new formatting conventions. Each year brings new patterns. The approach is layered infrastructure: a primary parsing pipeline for the 92%+ case, supplementary parsers for edge cases, a priority waterfall selecting the highest-confidence resolution, and continuous QA monitoring parsing rates across every run.
For data buyers building labor market products, the quality of your location data is the quality of your geographic analysis. To see how raw and enriched records compare side by side, or to explore what is available, check our sample data or browse datasets by provider.
Want to see the data for yourself?
Get a free sample of 5,000 enriched job records, delivered within 24 hours.