Evaluation Results: Full results for three LLMs across 2,262 parallel code translation tasks, including pass rates, direction analysis, failure taxonomy, and per-kernel breakdowns.
ParBench is a curated meta-benchmark for evaluating how well large language models translate parallel source code across different programming APIs (CUDA, HIP, SYCL, OpenMP, etc.). It aggregates kernels from multiple existing benchmark suites and wraps each one in a machine-readable specification that drives automated build, run, verification, and LLM evaluation workflows.
parbench/
├── README.md                       # This file
├── GUIDE.md                        # Complete guide: pipeline, commands, adding benchmarks
├── manifest.jsonl                  # Level 1: Master index (one JSON object per line)
│
├── schema/                         # JSON Schemas (draft-07)
│   ├── manifest_schema.json        # Schema for manifest.jsonl entries
│   ├── spec_schema.json            # Schema for Level 2 kernel spec files
│   └── reference_platform.json     # Shared hardware reference platform
│
├── specs/                          # Level 2 spec files (one per kernel variant)
│   └── <suite>-<kernel>-<api>.json
│
├── templates/
│   └── spec_template.json          # Blank spec with all fields & placeholder values
│
├── harness/                        # Python harness: build, run, verify automation
│   ├── cli.py                      # Command-line interface
│   ├── builder.py                  # Compilation logic
│   ├── runner.py                   # Execution logic
│   ├── verifier.py                 # Verification strategies
│   ├── spec_loader.py              # JSON loading & path resolution
│   ├── reporter.py                 # Output formatting
│   └── models.py                   # Result data classes
│
├── scripts/                        # Utility scripts
│   ├── validate_schema.py          # Schema + cross-cutting validator
│   ├── generate_paper_figures.py   # Generate publication figures
│   └── generate_viz_data.py        # Generate visualization data
│
├── examples/                       # Reference data
│   └── example_178_kernels.json    # 178 kernels in single-file format
│
├── analysis/                       # All analysis outputs
│   ├── visualizations/             # PNG charts and network graphs
│   ├── reports/                    # Markdown reports, presentations
│   └── data/                       # CSV matrices, Excel workbooks
│
└── config/                         # Machine-specific config (git-ignored)
    └── paths.json                  # Maps downloads_root to local path
See GUIDE.md for the complete pipeline walkthrough, all commands, and instructions for adding new benchmarks.
All schemas use JSON Schema draft-07.
Each line is a self-contained JSON object that indexes one kernel variant:
| Field | Type | Required | Description |
|---|---|---|---|
| `kernel_name` | string | ✓ | Logical kernel name (pairing key) |
| `parallel_api` | enum | ✓ | API: serial, cuda, hip, sycl, … |
| `source_suite` | string | ✓ | Origin benchmark suite (lowercase) |
| `spec_file` | string | ✓ | Relative path to Level 2 spec JSON |
| `source_dir` | string | ✓ | Relative path to kernel source files |
| `category` | enum | ✓ | Domain: ml, graph, physics, … |
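As a minimal sketch of how a consumer might read this index, the following parses JSONL text and checks that each entry carries the six fields from the table. The helper name `load_manifest` and the sample entry values are hypothetical, and treating every field as mandatory is an assumption made for illustration; this is not ParBench's own loader.

```python
import json

# Field names come from the manifest table; all are treated as mandatory here.
MANIFEST_FIELDS = (
    "kernel_name", "parallel_api", "source_suite",
    "spec_file", "source_dir", "category",
)


def load_manifest(text):
    """Parse JSONL text into a list of dicts, one per non-empty line."""
    entries = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between entries
        entry = json.loads(line)
        missing = [name for name in MANIFEST_FIELDS if name not in entry]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        entries.append(entry)
    return entries


# Hypothetical entry; the values are illustrative, not copied from manifest.jsonl.
sample = json.dumps({
    "kernel_name": "bfs",
    "parallel_api": "cuda",
    "source_suite": "rodinia",
    "spec_file": "specs/rodinia-bfs-cuda.json",
    "source_dir": "rodinia/bfs/cuda",
    "category": "graph",
})
entries = load_manifest(sample + "\n")
print(entries[0]["spec_file"])
```

Reading line by line keeps memory use flat and lets an error message point at the exact manifest line that is malformed.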
A comprehensive specification that drives the full evaluation pipeline. Key sections:
- identity: unique_id, kernel name, API, source suite
- provenance: repository URL, pinned commit, license
- files: `prompt_payload` (LLM sees), `support_files` (build only), `verification_only` (never shown to LLM)
- implementation: language, API details
- build: environment, build system, commands, outputs
- run: executable, arguments, input configurations, timeout
- verification: method, strategies, floating-point tolerance
- performance: optional metric extraction
- hardware: target device, requirements, reference platform
- baseline_results: populated after benchmarking (starts null)
- metadata: description, domain, tags
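The verification section mentions a floating-point tolerance. The sketch below shows one common way such a comparison can be done with the standard library's `math.isclose`. The function name `outputs_match` and the default tolerances are hypothetical and are not taken from the ParBench verifier.

```python
import math


def outputs_match(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """Return True when both sequences have equal length and agree element-wise.

    rel_tol handles values of ordinary magnitude; abs_tol handles values near zero,
    where a purely relative comparison would always fail.
    """
    if len(expected) != len(actual):
        return False
    return all(
        math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
        for e, a in zip(expected, actual)
    )


# Illustrative values only: a reference run and a translated run that differ slightly.
reference = [1.0, 2.5, 1e-12]
translated = [1.0000001, 2.5, 0.0]
print(outputs_match(reference, translated))
```

Combining a relative and an absolute tolerance is a design choice: parallel reductions often reorder floating-point additions, so small differences between a source kernel and its translation are expected.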
The files section is critical for LLM evaluation integrity:
- `prompt_payload`: Files the LLM receives for translation. ONLY these go in the prompt.
- `support_files`: Makefiles and shared headers needed for compilation but NOT sent to the LLM.
- `verification_only`: Reference implementations and test harnesses. NEVER shown to the LLM.
No file may appear in both prompt_payload and verification_only.
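The disjointness rule can be checked with a set intersection. This is a minimal sketch that assumes each category in the files section is a list of relative path strings; the actual layout is defined by spec_schema.json, and the helper name `find_leaked_files` and the sample paths are hypothetical.

```python
def find_leaked_files(files_section):
    """Return paths listed in both prompt_payload and verification_only."""
    payload = set(files_section.get("prompt_payload", []))
    hidden = set(files_section.get("verification_only", []))
    return sorted(payload & hidden)


# Hypothetical files section; reference.c is deliberately listed twice.
files_section = {
    "prompt_payload": ["bfs.cu", "reference.c"],
    "support_files": ["Makefile"],
    "verification_only": ["reference.c", "check_output.py"],
}
print(find_leaked_files(files_section))
```

An empty result means no verification material would reach the prompt.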
# Install dependency
python3 -m pip install jsonschema
# Validate the manifest
python3 scripts/validate_schema.py --manifest manifest.jsonl
# Validate a single spec
python3 scripts/validate_schema.py --spec specs/rodinia-bfs-cuda.json
# Validate everything (manifest + all specs)
python3 scripts/validate_schema.py --all

The validator checks:
- JSON Schema conformance (draft-07)
- `unique_id` matches the spec filename
- `unique_id` format matches `{source_suite}-{kernel_name}-{api}`
- `implementation.api` matches `identity.parallel_api`
- All files listed in `files.*` exist on disk
- No file appears in both `prompt_payload` and `verification_only`
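The identity-related checks in this list can be sketched as follows. The nesting of `source_suite` and `kernel_name` under `identity`, and the assumption that the `{api}` part of the ID equals the `parallel_api` value, are guesses made for illustration; scripts/validate_schema.py remains the authoritative implementation.

```python
from pathlib import Path


def check_identity(spec, spec_path):
    """Return a list of problems; an empty list means every check passed."""
    identity = spec["identity"]
    problems = []

    # Assumed ID layout: {source_suite}-{kernel_name}-{api}
    expected_id = "-".join(
        [identity["source_suite"], identity["kernel_name"], identity["parallel_api"]]
    )
    if identity["unique_id"] != expected_id:
        problems.append(f"unique_id {identity['unique_id']!r} != {expected_id!r}")

    # The spec filename (without .json) should equal the unique_id.
    if Path(spec_path).stem != identity["unique_id"]:
        problems.append("unique_id does not match the spec filename")

    if spec["implementation"]["api"] != identity["parallel_api"]:
        problems.append("implementation.api differs from identity.parallel_api")

    return problems


# Hypothetical spec fragment holding only the fields this sketch reads.
spec = {
    "identity": {
        "unique_id": "rodinia-bfs-cuda",
        "source_suite": "rodinia",
        "kernel_name": "bfs",
        "parallel_api": "cuda",
    },
    "implementation": {"api": "cuda"},
}
print(check_identity(spec, "specs/rodinia-bfs-cuda.json"))
```

Collecting problems in a list instead of raising on the first one lets a single run report every inconsistency in a spec.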
- All paths in specs and the manifest are relative to `parbench/` (the project root).
- `downloads_root` equals `project_root`: downloaded benchmark repos live inside the project directory itself.
- Machine-specific path configuration lives in `config/paths.json` (git-ignored).
- Python 3.12+
- `jsonschema` (`python3 -m pip install jsonschema`)
- For reproducing paper tables/figures (Tier 2): Any x86_64 machine, no GPU needed (~4 GB RAM)
- For running CUDA/OpenCL specs (Tier 3): NVIDIA GPU (compute capability ≥ 7.0, e.g., RTX 3060+)
- For OpenMP-only specs: Any multi-core x86_64 CPU (no GPU required)
- Tested platform: NVIDIA RTX 4070, AMD Ryzen 9 7900X, Ubuntu 24.04, NVIDIA HPC SDK 24.3
Python 3.12 or later is required. Clone the repository, create a virtual environment, and install dependencies:
git clone <repository-url>
cd parbench
python3 -m venv env_parbench
source env_parbench/bin/activate
# Core dependencies (harness, schema validation, augmentation)
python3 -m pip install -r requirements.txt
# Or for exact pinned versions (reproducible environment)
python3 -m pip install -r requirements-lock.txt

Optional dependency groups can be installed via pyproject.toml:
# LLM evaluation pipeline (anthropic, openai clients)
python3 -m pip install ".[eval]"
# Analysis and figure generation (matplotlib, numpy)
python3 -m pip install ".[analysis]"
# Development tools (pytest, ruff)
python3 -m pip install ".[dev]"
# Everything
python3 -m pip install ".[all]"

The build-run-verify harness also requires compilers for the target parallel APIs (e.g., nvcc for CUDA, g++ with -fopenmp for OpenMP, OpenCL headers and runtime libraries for OpenCL). See config/compiler_inventory.txt for the tested compiler versions.
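Before running the harness, it can help to confirm that the compilers are on `PATH`. The sketch below uses `shutil.which` for that. The API-to-executable mapping only covers the two tools named above, and both its keys and the helper name `missing_compilers` are hypothetical; it does not check compiler versions or OpenCL headers.

```python
import shutil

# Executable names come from the text; the dictionary keys are illustrative labels.
REQUIRED_TOOLS = {"cuda": "nvcc", "openmp": "g++"}


def missing_compilers(tools=None):
    """Return the subset of tools whose executable is not found on PATH."""
    tools = REQUIRED_TOOLS if tools is None else tools
    return {api: exe for api, exe in tools.items() if shutil.which(exe) is None}


# An empty dict means every listed compiler was found.
print(missing_compilers())
```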
- Activate the virtual environment:
  source env_parbench/bin/activate
- Validate the spec and manifest schemas:
  python3 scripts/validate_schema.py --all
- Run the full build-run-verify pipeline on a single kernel spec:
  python3 -m harness verify specs/rodinia-bfs-cuda.json
- List all available translation pairs (e.g., CUDA to OpenMP):
  python3 -m harness pairs
Results reproduction is tiered by what you want to verify:
Confirm the harness and analysis pipeline work on your machine:
source env_parbench/bin/activate
python3 scripts/validate_schema.py --all # Schema validation (expect ~15 known errors from phantom specs)
python3 -m harness info specs/rodinia-bfs-cuda.json # Inspect a spec

All 2,344 per-task result JSONs are included in results/evaluation/. Regenerate every table and figure in the paper from these raw results:
# Docker (recommended; exact environment):
cd artifact && docker build -t parbench . && docker run --rm -v "$(pwd)/../output:/app/output" parbench ./reproduce.sh
# Or without Docker:
bash artifact/reproduce.sh

Output lands in output/: 5 LaTeX tables (T1–T5) and 16 figures (F2–F7, C.3–C.4, per-model variants).
Deterministic table values can be diffed against expected_outputs/ for bit-exact verification.
See artifact/README.md for full details.
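A byte-for-byte comparison of regenerated files against the expected ones could look like the sketch below. It relies on `filecmp.cmp` with `shallow=False`, which compares file contents instead of only metadata. The helper name `mismatched_files` and the demo file names are hypothetical; the demo runs on temporary directories, not on the real output/ and expected_outputs/ trees.

```python
import filecmp
import tempfile
from pathlib import Path


def mismatched_files(output_dir, expected_dir):
    """List expected files that are missing from, or differ in, output_dir."""
    output_dir, expected_dir = Path(output_dir), Path(expected_dir)
    problems = []
    for expected in sorted(expected_dir.rglob("*")):
        if not expected.is_file():
            continue
        relative = expected.relative_to(expected_dir)
        produced = output_dir / relative
        # shallow=False forces a content comparison, not just size and mtime.
        if not produced.is_file() or not filecmp.cmp(expected, produced, shallow=False):
            problems.append(relative.as_posix())
    return problems


# Self-contained demo with made-up file names and contents.
with tempfile.TemporaryDirectory() as tmp:
    out, exp = Path(tmp, "output"), Path(tmp, "expected_outputs")
    out.mkdir()
    exp.mkdir()
    (exp / "t1.tex").write_text("pass rate: 0.50\n")
    (out / "t1.tex").write_text("pass rate: 0.51\n")  # differs on purpose
    (exp / "t2.tex").write_text("same\n")
    (out / "t2.tex").write_text("same\n")
    demo_result = mismatched_files(out, exp)

print(demo_result)
```

Against a real reproduction run, the call would be `mismatched_files("output", "expected_outputs")`, and an empty list would indicate that every expected file was reproduced exactly.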
Re-run all 2,262 LLM evaluations from scratch. Requires:
- NVIDIA GPU (CUDA 12.x) for build-run-verify of CUDA/OpenCL specs
- API keys: Together AI (Qwen), Azure OpenAI (GPT-5.4, GPT-5.3-Codex)
- Estimated cost: ~$150–200 in API credits (as of May 2026)
# Example: re-run Qwen on CUDA-to-OpenMP direction
python3 scripts/evaluation/run_eval_batch.py \
--suite rodinia --direction cuda-to-omp \
--models together-qwen-3.5-397b-a17b \
--project-root . --resume -v

See the paper's Appendix J for full campaign configuration and cost details.
Pre-computed evaluation results for all three models are in results/evaluation/:
results/evaluation/
├── together-qwen-3.5-397b-a17b/    # 708 result JSONs
├── azure-gpt-5.4/                  # 822 result JSONs
└── azure-gpt-5.3-codex/            # 814 result JSONs
Each JSON file represents one translation task and contains:
- `overall_status`: PASS, BUILD_FAIL, RUN_FAIL, VERIFY_FAIL, or EXTRACTION_FAIL
- `translation_code`: The LLM-generated translated source code
- `build_result`, `run_result`, `verification_result`: Per-stage outcomes with stdout/stderr
- `model`, `direction`, `source_spec`, `target_spec`: Task metadata
- `augmentation_level`, `sample_id`: Experiment design coordinates
After KNOWN_FAIL exclusion, 2,262 records are eval-eligible (the denominator for all paper statistics).
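A pass rate over a set of result records can be tallied as in the sketch below. It reads only the `overall_status` field described above. How KNOWN_FAIL records are flagged is not described here, so the sketch assumes the caller passes eval-eligible records only; the sample records are invented, and scripts/evaluation/analyze_eval.py is the tool that produces the paper's statistics.

```python
from collections import Counter

STATUSES = ("PASS", "BUILD_FAIL", "RUN_FAIL", "VERIFY_FAIL", "EXTRACTION_FAIL")


def summarize(records):
    """Count each overall_status and compute the fraction of PASS records."""
    counts = Counter(record["overall_status"] for record in records)
    total = sum(counts.values())
    return {
        "counts": {status: counts.get(status, 0) for status in STATUSES},
        "total": total,
        "pass_rate": counts["PASS"] / total if total else 0.0,
    }


# Invented records; real ones would be loaded from the per-task JSON files.
records = [
    {"overall_status": "PASS"},
    {"overall_status": "BUILD_FAIL"},
    {"overall_status": "PASS"},
    {"overall_status": "VERIFY_FAIL"},
]
summary = summarize(records)
print(summary["pass_rate"])
```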
Inspect a spec without building or running:
python3 -m harness info specs/rodinia-hotspot-omp.json

View the LLM prompt payload (the source files an LLM would receive for translation):
python3 -m harness prompt specs/rodinia-nw-cuda.json

Build and run a kernel with verbose output:
python3 -m harness -v verify specs/rodinia-bfs-cuda.json

Run an LLM evaluation batch (CUDA to OpenMP, with resume support):
python3 scripts/evaluation/run_eval_batch.py \
--suite rodinia \
--direction cuda-to-omp \
--models <model-name> \
--project-root /path/to/parbench \
--resume -v

Analyze evaluation results and generate a summary:
python3 scripts/evaluation/analyze_eval.py \
--project-root /path/to/parbench \
--results-dir results/evaluation

ParBench aggregates kernels from five HPC benchmark suites, covering 90 unique kernels across 206 spec files and four parallel APIs (CUDA, OpenMP, OpenCL, OpenMP target offload).
| Suite | Kernels | Spec Files | APIs | Source |
|---|---|---|---|---|
| Rodinia | 22 | 60 | CUDA, OpenMP, OpenCL | Git submodule (rodinia/rodinia-src/, commit 9c10d3ea) |
| HeCBench | 65 | 135 | CUDA, OpenMP, OpenMP target | Cloned locally (HeCBench-master/, gitignored) |
| XSBench | 1 | 4 | CUDA, OpenMP, OpenCL, OpenMP target | Git submodule (xsbench-src/) |
| RSBench | 1 | 4 | CUDA, OpenMP, OpenCL, OpenMP target | Git submodule (rsbench-src/) |
| mixbench | 1 | 3 | CUDA, OpenMP, OpenCL | Git submodule (mixbench-src/) |
Each kernel variant is fully described by a JSON spec file in specs/ that drives the
build-run-verify pipeline. The append-only manifest (manifest.jsonl, 211 entries) indexes
all spec files and enables automatic discovery of translation pairs across APIs.
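Pair discovery from manifest entries can be sketched by grouping entries on `kernel_name` and enumerating ordered API pairs. This is an illustration of the idea, not the logic behind `python3 -m harness pairs`; grouping on `kernel_name` alone is an assumption, and the sample entries are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def translation_pairs(entries):
    """Return (kernel, source_api, target_api) for every ordered pair of APIs."""
    apis_by_kernel = defaultdict(set)
    for entry in entries:
        apis_by_kernel[entry["kernel_name"]].add(entry["parallel_api"])

    pairs = []
    for kernel in sorted(apis_by_kernel):
        # permutations yields both directions, e.g. cuda->hip and hip->cuda.
        for source, target in permutations(sorted(apis_by_kernel[kernel]), 2):
            pairs.append((kernel, source, target))
    return pairs


# Hypothetical manifest entries reduced to the two fields this sketch reads.
entries = [
    {"kernel_name": "bfs", "parallel_api": "cuda"},
    {"kernel_name": "bfs", "parallel_api": "hip"},
    {"kernel_name": "bfs", "parallel_api": "sycl"},
    {"kernel_name": "nw", "parallel_api": "cuda"},
]
for pair in translation_pairs(entries):
    print(pair)
```

A kernel with n API variants yields n × (n − 1) directed pairs, so a kernel available in only one API contributes none.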
The c_augmentation/ package provides AST-driven, semantics-preserving code transforms
powered by libclang. These transforms create diverse LLM input variants to evaluate
translation robustness -- the transformed code compiles and runs identically to the original.
Transforms available:
| Transform | Description |
|---|---|
| `ArithmeticTransform` | Expands compound operators (e.g., `x += 1` to `x = x + 1`) |
| `SwapCondition` | Flips comparison operands (e.g., `x < y` to `y > x`) |
| `PointerArithmeticToArrayIndex` | Converts pointer arithmetic to array indexing (e.g., `*(arr + i)` to `arr[i]`) |
| `TypedefExpansion` | Inlines typedef aliases with their underlying types |
| `ChangeNames` | Renames local variables to neutral identifiers |
| `ChangeFunctionNames` | Renames non-entry-point functions |
Augmentation levels (L1--L4) control the fraction of eligible candidates each transform modifies, from a single candidate (L1) to all candidates (L4). Level L0 is the unaugmented original source.
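The sketch below illustrates how a level could map to a number of modified candidates. Only the endpoints come from the text (L0 none, L1 one, L4 all); the one-third and two-thirds fractions used for L2 and L3, the helper name `pick_candidates`, and the sample candidate list are assumptions for illustration and may not match the `c_augmentation/` package.

```python
import random

# Stated in the text: L1 touches one candidate, L4 touches all of them.
# Assumed for this sketch: L2 touches about 1/3 and L3 about 2/3.
ASSUMED_THIRDS = {2: 1, 3: 2}


def pick_candidates(candidates, level, seed=42):
    """Return the candidates a transform would modify at the given level."""
    if level == 0 or not candidates:
        return []  # L0 is the unaugmented original
    if level == 1:
        count = 1
    elif level == 4:
        count = len(candidates)
    elif level in ASSUMED_THIRDS:
        # Integer ceiling of len * k / 3, never fewer than one candidate.
        count = max(1, -(-len(candidates) * ASSUMED_THIRDS[level] // 3))
    else:
        raise ValueError(f"unknown augmentation level: {level}")
    rng = random.Random(seed)  # a fixed seed keeps the selection reproducible
    chosen = set(rng.sample(range(len(candidates)), count))
    return [c for i, c in enumerate(candidates) if i in chosen]


# Made-up candidate sites for a compound-operator transform.
sites = ["x += 1", "y -= 2", "z *= 3", "w /= 4", "v += 5", "u -= 6"]
for level in range(5):
    print(level, len(pick_candidates(sites, level)))
```

Seeding a private `random.Random` instance keeps the selection reproducible without touching the global random state.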
Run augmentation on a single spec:
python3 scripts/augmentation/augment_verify.py specs/rodinia-bfs-cuda.json \
--augment_level 2 --seed 42 --project-root /path/to/parbench

Run augmentation unit tests (15 tests, all must pass before commit):
python3 -m pytest c_augmentation/test_transforms.py -v

ParBench uses pytest for its test suite. The primary tests cover the code augmentation transforms:
# Run all augmentation transform tests
python3 -m pytest c_augmentation/test_transforms.py -v
# Run schema validation across all specs and the manifest
python3 scripts/validate_schema.py --all

Approximately 15 schema validation errors are expected from phantom spec entries in the append-only manifest -- these are not bugs. See GUIDE.md for details.
ParBench is under review at NeurIPS 2026 (Evaluations & Datasets Track). Citation guidance will be added upon publication.