Skills & Occupation Taxonomy
A comprehensive skills and occupation taxonomy extracted from 1B+ job postings. It includes 40,000+ technical skills, 3,400+ professional certifications, and 260+ soft skills, identified through a two-step process: a high-recall dictionary scan followed by a title-aware relevance filter that removes off-role mentions. The filter is tuned conservatively to keep legitimate skills: an independent multi-judge audit confirms that about 99% of removed tags are false positives, with fewer than 1 in 500 legitimate skills affected. Coverage exceeds 85% for descriptions over 200 characters, with 5-15 skills per posting on average (2023+). SOC classification uses title + description context (not title-only matching), achieving 94% top-5 accuracy (73% top-1) across 867 BLS 2018 codes. The taxonomy match rate is above 90%.
All taxonomy fields are derived from our NLP enrichment pipeline. Skills, SOC codes, seniority, and normalized titles are available on every job posting record.
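To make the two-step idea concrete, here is a minimal sketch of a high-recall dictionary scan followed by a title-aware relevance filter. It is an illustration only, not the production pipeline: the tiny `SKILL_DICTIONARY`, the `RELEVANT_BY_TITLE` lookup, and all function names are hypothetical, and the real filter is presumably far richer than a static per-title allow-list.

```python
import re

# Hypothetical toy dictionary; the real taxonomy holds 40,000+ skills.
SKILL_DICTIONARY = {"python", "sql", "kubernetes", "forklift operation"}

# Hypothetical stand-in for the title-aware filter: skills considered
# on-role for a given (lowercased) job title.
RELEVANT_BY_TITLE = {
    "data engineer": {"python", "sql", "kubernetes"},
}


def dictionary_scan(description):
    """Step 1 (high recall): tag every dictionary term found in the text."""
    text = description.lower()
    return {
        skill
        for skill in SKILL_DICTIONARY
        if re.search(r"\b" + re.escape(skill) + r"\b", text)
    }


def relevance_filter(title, candidates):
    """Step 2: drop candidates that are off-role for the title.

    Conservative default: if the title is unknown, keep every candidate
    rather than risk removing a legitimate skill.
    """
    allowed = RELEVANT_BY_TITLE.get(title.lower())
    if allowed is None:
        return set(candidates)
    return candidates & allowed


def extract_skills(title, description):
    """Run both steps and return the surviving skills in sorted order."""
    return sorted(relevance_filter(title, dictionary_scan(description)))


posting = (
    "Build Python and SQL pipelines. "
    "Our warehouse team handles forklift operation."
)
skills = extract_skills("Data Engineer", posting)
```

In this sketch the scan tags `forklift operation` because it appears in the description, and the filter then drops it as an off-role mention for a data engineer.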
Key Highlights
- 40,000+ technical skills, 3,400+ certifications, and 260+ soft skills: one of the largest commercial skills taxonomies available
- Two-step extraction: high-recall dictionary scan + title-aware relevance filter that removes off-role mentions. An independent audit confirms about 99% of removed tags are false positives, with fewer than 1 in 500 legitimate skills affected
- Separate hard skills, soft skills, and certification fields for clean downstream use
- SOC classification using title + description context: 94% top-5, 73% top-1 across 867 BLS 2018 codes
- F1 score: 85-92% on bulleted/structured sections, 65-78% on narrative text
- Taxonomy match rate >90%, continuously updated as new postings are processed
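For readers less familiar with the metrics quoted above, the sketch below shows how top-k accuracy and F1 are conventionally computed. It assumes the standard definitions; the page does not say how the reported figures were aggregated (for example, per posting or pooled), and the function names and sample values are hypothetical.

```python
def top_k_accuracy(ranked_predictions, gold_codes, k):
    """Share of postings whose gold SOC code is among the top k predictions."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_codes)
        if gold in ranked[:k]
    )
    return hits / len(gold_codes)


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall for one set of extracted skills."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ranked SOC predictions for two postings, best guess first.
ranked = [["15-1252", "15-1212"], ["15-2051", "15-1252"]]
gold = ["15-1252", "15-1252"]

top1 = top_k_accuracy(ranked, gold, k=1)  # only the first posting is a hit
top2 = top_k_accuracy(ranked, gold, k=2)  # both postings are hits

# Hypothetical extraction result: one extra skill, nothing missed.
f1 = f1_score({"python", "sql", "aws"}, {"python", "sql"})
```

Top-5 accuracy is always at least as high as top-1 because a correct code in any of the first five positions counts as a hit.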
Use Cases
- Skills gap analysis and workforce development
- Curriculum alignment for educational institutions
- Talent matching and job recommendation engines
- Emerging skills detection and trend forecasting
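As a rough illustration of the skills-demand use cases, the sketch below counts how often each skill appears per SOC code. The field names `nlpSocCode` and `nlpSkills` come from this dataset's sample fields; the records themselves are made up, and the assumption that `nlpSkills` arrives as a list of strings is ours, not something the page confirms.

```python
from collections import Counter, defaultdict

# Hypothetical records; real value shapes may differ from these lists.
records = [
    {"nlpSocCode": "15-1252", "nlpSkills": ["python", "sql"]},
    {"nlpSocCode": "15-1252", "nlpSkills": ["python", "kubernetes"]},
    {"nlpSocCode": "15-1212", "nlpSkills": ["siem", "python"]},
]


def skill_demand_by_soc(rows):
    """Count the number of postings mentioning each skill, per SOC code."""
    demand = defaultdict(Counter)
    for row in rows:
        # set() ensures a skill repeated within one posting counts once.
        demand[row["nlpSocCode"]].update(set(row["nlpSkills"]))
    return demand


demand = skill_demand_by_soc(records)
top_skill = demand["15-1252"].most_common(1)
```

Comparing these counts across time windows, or against a curriculum's skill list, is one simple starting point for gap and trend analysis.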
Sample Fields
- nlpSkills
- nlpSoftSkills
- nlpCertifications
- nlpQualifications
- nlpSocCode
- nlpSocTitle
- nlpNormalizedTitle
- nlpNormalizedTitleScore
- nlpSeniority
Delivery Formats
CSV or Parquet via S3, GCS, Snowflake, or SFTP.
Sample Records
A preview of the record layout for this dataset. Unlock all fields by requesting a free sample.
| Job Title | Company | City | State | Seniority | Work Mode | Min Salary | Max Salary | SOC Code | SOC Title | +6 more |
|---|---|---|---|---|---|---|---|---|---|---|
| Full Stack Developer | Shopify | Ottawa | ON | Mid | Remote | 120,000 | 165,000 | 15-1252 | Software Developers | … |
| Cybersecurity Analyst | CrowdStrike | Austin | TX | Mid | Hybrid | 105,000 | 145,000 | 15-1212 | Information Security Analysts | … |
| Data Engineer | Snowflake | San Mateo | CA | Senior | Hybrid | 165,000 | 225,000 | 15-1252 | Software Developers | … |
| Cloud Architect | Accenture | Atlanta | GA | Lead | Remote | 175,000 | 240,000 | 15-1241 | Computer Network Architects | … |
| AI Research Scientist | OpenAI | San Francisco | CA | Senior | On-site | 250,000 | 400,000 | 15-2051 | Data Scientists | … |
Illustrative sample records showing the delivered schema. Real records are delivered via sample request.
Relevant Solutions
Job Market Data for HR Tech Platforms
Add salary benchmarking and skills intelligence to your platform without building ML
Job Market Training Data for AI & ML Teams
Pre-enriched, deduplicated job market training data. Skip 6 months of pipeline building.
Job Market Data for Academic Research
Longitudinal dataset for labor economics: wage dynamics, skill demand, remote work adoption
Job Market Data for Consulting Firms
Project-ready labor market data with no annual contract required.
Job Market Data for Healthcare Workforce Planning
Track clinical hiring pipelines with degree-level granularity: 338M+ degree requirement records across nursing, allied health, and physician roles.
More Datasets
Canaria delivers five integrated datasets that join cleanly with each other.
Job Postings Data
1B+ deduplicated job postings from Indeed, LinkedIn, ATS, and 15+ sources
Company Profiles
28.5M company profiles with firmographics, hiring signals, and industry classification
Salary Data
AI-predicted salaries with three-source fusion (posted, reported, BLS OES) plus parsed posted salaries and ~11M Glassdoor reports
Google Maps Business Data
52M detailed + 193M basic business records with ratings, reviews, and geocoordinates
Frequently Asked Questions
Common questions about the Skills & Occupation Taxonomy dataset.
- How large is the skills taxonomy?
- 40,000+ technical skills, 3,400+ certifications, and 260+ soft skills, extracted from 1B+ job postings.
- How are skills extracted?
- A two-step process: a high-recall dictionary scan tags candidate skills, then a title-aware relevance filter removes off-role mentions. An independent multi-judge audit confirms about 99% of removed tags are false positives, with fewer than 1 in 500 legitimate skills affected.
- What is the extraction accuracy?
- F1 of 85-92% on bulleted or structured sections and 65-78% on narrative text, with a taxonomy match rate above 90%.
- Are SOC codes and titles included?
- Yes. SOC classification uses title plus description context (94% top-5, 73% top-1 across 867 BLS 2018 codes), alongside normalized titles.
- How is the taxonomy delivered?
- As CSV or Parquet via S3, GCS, Snowflake, or SFTP, with separate hard-skill, soft-skill, and certification fields for clean downstream use.