Salary Forecasting Model: Lessons From 50M Observations

The Problem: 77% of Job Postings Have No Salary

Salary is the most commercially valuable field in a job posting, and it is missing from roughly three out of every four records.

Across Canaria's dataset of 865 million job observations, approximately 201.7 million records include a stated salary. That is a 23.3% disclosure rate. The remaining 663 million records have no employer-provided compensation information. Each record carries 82 structured fields, but the one field buyers ask about first is empty 77% of the time. For anyone building compensation benchmarks, labor cost models, or talent market intelligence, this gap is the central data quality problem.

We built a salary prediction model to address it, trained on over 50 million salary observations from Glassdoor and Indeed. This post covers what we learned about salary data quality, compensation distributions, and the design choices that separate reliable predictions from confident noise. For background on how transparency laws are changing disclosure rates, see our analysis of the regulatory landscape.

What 203.9 Million Salary Records Look Like

Before building a model, we needed to understand the stated salary data already in the pipeline. Our delivery table contains 203.9 million records with populated salary fields, distributed as follows:

Salary Range	Records	Share
Under $30,000	24.3M	11.9%
$30,000 - $50,000	94.3M	46.2%
$50,000 - $75,000	39.5M	19.4%
$75,000 - $100,000	19.1M	9.4%
$100,000 - $150,000	16.7M	8.2%
$150,000 - $200,000	6.3M	3.1%
Over $200,000	3.7M	1.8%

The distribution is heavily right-skewed, with the $30,000-$50,000 bucket containing nearly half of all salary-populated postings. Jobs most likely to include stated salaries are hourly and entry-level positions in retail, healthcare, food service, and logistics, where ranges are standardized. Knowledge worker roles are underrepresented in the stated-salary subset. This selection bias is the first thing any prediction model must account for. The full field structure is documented in our schema.

The Seniority Gradient

The compensation gradient across seniority levels shows expected patterns with one notable compression:

Seniority Level	Average Salary	Records
Executive	$120,835	2.5M
Director	$109,546	3.4M
Mid-Senior	$79,287	46.2M
Associate	$68,040	2.2M
Internship	$53,623	2.4M
Entry Level	$49,059	95.3M

The gap between Entry Level ($49,059) and Executive ($120,835) is only a 2.46x multiplier. That is notably compressed compared to actual total compensation. Executive compensation increasingly comes through equity, bonuses, and deferred comp that do not appear in posting salary fields. The Mid-Senior level ($79,287, 46.2M records) is where prediction is most commercially valuable: large enough for market analysis, variable enough across location and industry, and dense enough for reliable modeling.

Seniority classification itself comes from our NLP enrichment pipeline, which assigns levels based on title patterns and job description analysis. See the glossary for definitions of SOC codes, seniority levels, and other classification terms used throughout.

The Outlier Problem

Raw salary data from job postings is noisy. In our 203.9 million salary records:

65,429 records exceed $1 million per year. Most are parsing errors where an hourly rate was interpreted as millions annually.
283,505 records fall below $10,000 per year. For full-time roles, this typically means an annual salary was captured as hourly or monthly.
731 records have salary_min exceeding salary_max. This is always a bug, indicating a swap error or partial extraction.

Together these represent about 0.17% of records. That sounds small, but 349,665 bad records can meaningfully distort aggregates. A single $45 million parsing error among 1,000 Denver software engineering salaries shifts the average by $45,000.

Our pipeline enforces hard invariants: salary_min must not exceed salary_max, and all values are annualized to USD. Catching subtler errors (hourly vs. annual confusion, currency misattribution) requires statistical outlier detection beyond simple rules. You can compare raw vs enriched records to see how the pipeline transforms and validates salary fields.

Data Challenges That Shaped Model Design

Building a prediction model on job posting data is different from working with clean survey data. Four structural problems constrained our approach.

Hourly vs. annual confusion. "$25" could mean $25/hour ($52,000 annualized) or $25,000/year. The difference is a factor of 2,000. When the salary period is not stated, inference relies on job type, magnitude, and textual cues. Our pipeline annualizes everything, but the inference is imperfect.

Currency confusion. "50,000" in a UK posting means GBP. "3,000,000" in a Korean posting means KRW (roughly $2,200 USD). We have identified over 564,000 records where non-USD salaries are stored without conversion, a known issue under active remediation.

Range parsing failures. "$80,000 - $120,000" should produce min of $80K and max of $120K. Variations in formatting ("$80-120K", "80000-120000") each need different parsing logic. Truncation errors capture only the first number, understating ranges.

Selection bias in training data. The 50 million Glassdoor and Indeed observations overrepresent industries where disclosure is common. The model must generalize to underrepresented occupations and geographies using features that carry enough signal to extrapolate.

Model Design: Three Required Inputs

The model requires three inputs: state, zip code, and SOC code (Standard Occupational Classification). When any input is missing, it returns -1 rather than producing an estimate.

This is deliberate. We could build a model that estimates salary from job title alone, covering more records with less accuracy. We chose precision over coverage because downstream use cases (compensation benchmarking, labor cost modeling) are more damaged by confidently wrong predictions than by missing ones.

The rationale for each input:

State and zip code capture geographic compensation variation. A software engineer in San Francisco commands a different salary than the same role in Des Moines. Zip code adds granularity within states, distinguishing suburban from urban markets.
SOC code is the strongest single predictor. A registered nurse and a data scientist in the same zip code have fundamentally different compensation profiles. The 6-digit code distinguishes "Software Developers" (15-1252) from "Database Administrators" (15-1242).
Why not more features? We tested company size, NAICS code, seniority, and skills as additional inputs. Some improved training accuracy but increased missing-data rates in production. A model requiring six inputs that scores 30% of records is less useful than one requiring three inputs that scores 70%. Our methodology describes the full feature evaluation process and validation framework.

Performance: Where the Model Is Strong, and Where It Abstains

The model's error distribution is not uniform:

Performance is best on high-volume occupations in major metros (nurses in Houston, developers in Seattle) where training data is dense.
It is worst on niche occupations in small markets and on roles with highly variable compensation (sales with commission, executives with equity).
Retraining frequency matters: salary distributions shift with inflation, remote work normalization, and industry demand cycles. A model trained on 2023 data will drift by 2025 without updates.

The model's reliability reflects disciplined feature selection and a willingness to return "I don't know" rather than an unreliable estimate. That -1 return value is not an error. It is the model telling you that it lacks the inputs to make a prediction it would stand behind.

Filling the Historical Gap

The prediction model transforms our dataset from 23% salary coverage to majority coverage for records with valid state, zip code, and SOC inputs. This matters most for historical analysis. Salary transparency laws are recent, and pre-2022 stated rates were even lower. For any analysis requiring salary across time, whether tracking occupation-level trends, modeling labor costs by geography, or building inflation-adjusted indices, the prediction model makes the work feasible.

The combination of stated and predicted salary also enables continuous validation. In segments with high stated coverage, we compare stated to predicted values, producing an evolving accuracy benchmark that flags model drift before it becomes a problem.

Lessons After Three Years in Production

Currency detection should have been a first-class pipeline component from the start. The hourly vs. annual inference problem deserves its own dedicated model rather than rule-based heuristics. And the training data needs to be refreshed more frequently than we initially planned, because salary distributions are not stationary.

Salary prediction at scale is not a machine learning problem. It is a data quality problem that happens to require machine learning. Prediction quality is a function of training data quality, feature selection discipline, and that willingness to return -1. If the model cannot be confident, silence is better than noise.

Explore how stated and predicted salary fields look in practice at compare raw vs enriched, browse available datasets, or request a sample to see coverage for your target occupations and geographies.

Want to see the data for yourself?

Get a free sample of 5,000 enriched job records.

Request a Free Sample Back to Blog