Overview

OpenAI introduced GDPval as a benchmark for evaluating model performance on realistic occupational tasks rather than narrowly scoped academic or synthetic problems.

The benchmark spans 44 occupations across the nine largest sectors of U.S. GDP, motivating its name. These occupations were selected to cover a broad range of economically relevant work, with tasks intended to reflect real professional workflows rather than benchmark-style abstractions.

A central motivation for GDPval is that many existing evaluations measure component capabilities, such as reasoning, coding, or retrieval, but do not directly evaluate end-to-end performance on work tasks. GDPval is designed to help bridge that gap by evaluating models on tasks that resemble practical assignments professionals perform.

Tasks often require combining several capabilities within a single evaluation instance, including planning, information synthesis, multimodal reasoning, computer use, and deliverable generation. Many tasks are also intentionally long horizon, with some corresponding to assignments estimated to take humans multiple hours to complete.

Unlike traditional benchmarks that rely primarily on exact-match scoring, GDPval uses a combination of expert judgments, structured grading criteria, and automated evaluators to assess output quality.

Figure: Example task categories from GDPval.

Benchmark Structure

GDPval contains 1,320 tasks in the full benchmark, with a 220-task gold subset publicly released through Hugging Face.

Rather than organizing evaluation around isolated skills, GDPval is organized around occupational tasks. Each task typically includes:

  • a task prompt describing the objective
  • supporting materials required to perform the task
  • explicit deliverable requirements
  • metadata describing occupation and sector
  • evaluation instructions and grading criteria

Supporting artifacts are often integral to task execution and may include:

  • spreadsheets
  • PDFs
  • images
  • datasets
  • forms
  • reference documents

This makes tasks closer to structured work assignments than benchmark questions.

According to the paper, tasks were sourced and validated with domain professionals, with contributors averaging roughly 14 years of experience.

Task Design

A defining characteristic of GDPval is that outputs are evaluated as work products.

Rather than returning a short answer, models may be asked to produce:

  • analyses
  • reports
  • recommendations
  • structured plans
  • spreadsheets
  • multimodal artifacts

Tasks frequently require multi-stage execution. A single task may involve interpreting materials, extracting information, reasoning over constraints, using tools, and assembling a final deliverable.

For example, a released task from the GDPval dataset includes an auditing task requiring spreadsheet-based sample selection and supporting deliverables:

{
    "task_id": "83d10b06-26d1-4636-a32c-23f92c57f30b",
    "occupation": "Accountants and Auditors",
    "sector": "Professional, Scientific, and Technical Services",
    "prompt": "...",
    "reference_files": [
        "Population.xlsx"
    ],
    "deliverable_files": [
        "Sample.xlsx",
        "sample_distribution.png"
    ]
}

This illustrates that GDPval tasks can include both input artifacts and expected output artifacts, not just text prompts.

Dataset Structure

The public release on Hugging Face exposes GDPval as structured task instances.

Examples of released fields include:

{
    "task_id": ...,
    "sector": ...,
    "occupation": ...,
    "prompt": ...,
    "reference_files": ...,
    "reference_file_hf_uris": ...,
    "deliverable_text": ...,
    "deliverable_files": ...
}

This schema supports both benchmarking and analysis across occupations, sectors, and task types.

Example occupational task types represented in the benchmark include:

examples = [
    "Audit sampling with spreadsheets",
    "Support transcript analysis",
    "Stage plot generation",
    "Operational planning tasks",
]

The benchmark is also mapped to occupational activities using U.S. Bureau of Labor Statistics work activity coverage, described in the paper, which is part of how task coverage was constructed.

Evaluation Methodology

Because many tasks do not have a single correct answer, GDPval does not rely solely on conventional benchmark accuracy.

Evaluation combines expert review with comparative judgments. A major component is pairwise preference evaluation, where judges compare candidate outputs and determine which better satisfies task requirements, as described in GDPval.

This is used because many outputs are open-ended deliverables where quality cannot be reduced to exact-match correctness.

Judgments may consider:

  • correctness
  • completeness
  • usefulness
  • instruction adherence
  • domain appropriateness
  • professional quality

Many tasks are also associated with structured grading criteria used to support more systematic evaluation.

Performance is therefore often expressed through comparative win rates rather than simple accuracy.

AutoRaters

A major technical contribution in GDPval is its use of automated evaluators, or autoraters, to scale assessment.

Because expert grading is expensive, the benchmark uses model-based evaluators calibrated against human judgments to score outputs.

These autoraters are used to assess open-ended deliverables through structured comparisons and rubric-guided evaluation, rather than simple correctness checks.

The paper emphasizes two central properties of these evaluators:

  1. Agreement with expert human judgments
  2. Scalability for evaluating large numbers of open-ended task outputs

This is important because evaluating realistic work products is itself a difficult problem.

Conceptually, the evaluation pipeline can be thought of as:

task -> model deliverable
     -> autorater assessment
     -> pairwise or rubric-based judgment
     -> benchmark score

In GDPval, evaluation infrastructure is treated as part of benchmark design, not just an auxiliary scoring mechanism.

What GDPval Measures

GDPval is designed to evaluate combinations of capabilities within realistic tasks, including:

  • long-horizon reasoning
  • planning
  • tool use
  • multimodal understanding
  • professional deliverable generation

Because these are measured jointly within task execution, benchmark performance reflects compound task performance rather than isolated skill scores.

Reported Results

The paper reports that frontier model performance on GDPval has improved over time and that performance is sensitive to additional reasoning effort and task context.

The benchmark is also used to study:

  • model progress over time
  • effects of inference scaling
  • effects of additional context
  • human-AI comparison on occupational tasks

These analyses are a substantial part of the benchmark’s contribution beyond the task set itself.

Scope and Limitations

As with any open-ended benchmark, results should be interpreted with several constraints in mind.

Performance depends partly on evaluator consistency, including autorater-human agreement. Subjective tasks also introduce dependence on rubric coverage and judgment criteria.

Occupational coverage is broad but not exhaustive, and benchmark tasks are sampled instances rather than full representations of occupations.

Prompting, scaffolding, and tool configurations can also materially affect performance.

These considerations are important when comparing results across models.

Relation to Other Benchmarks

GDPval differs from many commonly cited evaluations in the level at which it measures performance.

  • MMLU evaluates academic knowledge
  • SWE-bench evaluates software issue resolution
  • OSWorld evaluates computer-use interaction
  • GDPval evaluates broader occupational task performance

Its focus is on realistic work assignments that combine multiple capabilities in a single evaluation.