Remove the internal TOMLChar wrapper by tfoutrein · Pull Request #492 · python-poetry/tomlkit

tfoutrein · 2026-06-06T21:10:06Z

Stacked on #489, #490 and #491 — the capstone of that series; best reviewed/merged after them.
This also supersedes #488 (interning TOMLChar): per @dimbleby's suggestion on that PR, removing the wrapper entirely is the better end-state, so I'd close #488 in favour of this.

What

After the bulk run-scans (#490/#491), the parser only constructs a TOMLChar (a str subclass) at run boundaries and uses a handful of its is_*() helpers. This removes the class entirely:

- Source yields plain str characters; inc() / advance_* read self[i] directly.
- End-of-input is detected positionally (_idx >= len / Source.end()) instead of an identity sentinel.
- The remaining character-class checks use module-level frozensets.

A real NUL byte is still rejected as an invalid control char and is never mistaken for end-of-input, since EOF is now positional rather than a value/identity comparison.
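To illustrate why positional EOF cannot confuse a real NUL with end-of-input, here is a minimal sketch. `MiniSource` and `classify` are hypothetical stand-ins written for this example and are not tomlkit's actual `Source`; only the `_idx >= len` check and the `end()` / `inc()` names are taken from the description above.

```python
class MiniSource:
    """Hypothetical stand-in for the parser's Source, for illustration only."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._idx = 0

    def end(self) -> bool:
        # EOF is decided by position alone, never by the character's value.
        return self._idx >= len(self._text)

    @property
    def current(self) -> str:
        # "" at EOF is a convenience of this sketch; callers should test end().
        return "" if self.end() else self._text[self._idx]

    def inc(self) -> bool:
        """Advance one character; return False once the input is exhausted."""
        if self.end():
            return False
        self._idx += 1
        return not self.end()


def classify(text: str) -> list:
    """Walk the input and record what the reader saw at each position."""
    src = MiniSource(text)
    events = []
    while not src.end():
        events.append("NUL" if src.current == "\x00" else "char")
        src.inc()
    events.append("EOF")
    return events
```

With this shape, `classify("a\x00b")` reports the NUL as an ordinary (rejectable) character in the middle of the input, and EOF only once the index has run past the end.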

Benchmarks

Median, interleaved A/B vs master (includes #489–#491):

| document | speedup |
| --- | --- |
| large flat, single-line strings (~90 KB) | 5.8× |
| poetry.lock-like (~64 KB) | 2.4× |
| pyproject.toml | 1.9× |
| typical mixed (~4 KB) | 1.6× |

The removal itself adds ~1.1–1.18× over #491. No regression on any shape.
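The harness behind these numbers is not shown in this PR. As one plausible reading of "median, interleaved A/B", the sketch below alternates the baseline and the candidate within each round, so that machine drift affects both sides roughly equally, then compares medians. The names `time_once` and `interleaved_speedup` are assumptions of this example.

```python
import statistics
import time


def time_once(fn, arg) -> float:
    """Time a single call with a monotonic high-resolution clock."""
    start = time.perf_counter()
    fn(arg)
    return time.perf_counter() - start


def interleaved_speedup(baseline, candidate, arg, rounds: int = 21) -> float:
    """Return median(baseline time) / median(candidate time).

    Each round runs baseline then candidate back to back, rather than
    running all baseline rounds first and all candidate rounds second.
    """
    base_times, cand_times = [], []
    for _ in range(rounds):
        base_times.append(time_once(baseline, arg))
        cand_times.append(time_once(candidate, arg))
    return statistics.median(base_times) / statistics.median(cand_times)
```

A value above 1.0 means the candidate was faster on that input; in practice `baseline` and `candidate` would be the parse functions from the two checkouts.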

Tests

Full suite passes (972, incl. the toml-test conformance submodule). On top of that, an 11.5k-input adversarial differential — EOF/truncation at every prefix length, real-NUL placement in every position, empty/whitespace/BOM, and structural fuzz — is byte-identical in output and exception type to master. No public API change (TOMLChar was not exported).
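The differential harness itself is not part of this description. A minimal sketch of the idea, assuming two parse callables (reference and candidate) and comparing either the result or the exception type, might look like this; `outcome`, `prefixes`, `nul_placements` and `differential` are hypothetical names.

```python
def outcome(parse, text):
    """Reduce a parse attempt to a comparable value: result or error type."""
    try:
        return ("ok", parse(text))
    except Exception as exc:
        return ("error", type(exc).__name__)


def prefixes(doc: str) -> list:
    """Every truncation of doc, from the empty string to the full document."""
    return [doc[:n] for n in range(len(doc) + 1)]


def nul_placements(doc: str) -> list:
    """doc with a real NUL character inserted at every possible position."""
    return [doc[:i] + "\x00" + doc[i:] for i in range(len(doc) + 1)]


def differential(reference, candidate, inputs) -> list:
    """Return the inputs on which the two implementations disagree."""
    return [t for t in inputs if outcome(reference, t) != outcome(candidate, t)]
```

An empty list from `differential` over the whole generated corpus corresponds to the "identical in output and exception type" claim above.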

`Source.__init__` built `iter([(i, TOMLChar(c)) for i, c in enumerate(self)])`, allocating one tuple and one TOMLChar per character of the whole input up front. Track an integer index into the underlying string instead: `inc()` bumps the index and reads `self[idx]`, and state save/restore snapshots the index rather than copying an iterator. Construction is O(1) and per-character work is deferred to the read. No behaviour change (full suite incl. the toml-test conformance submodule passes); ~1.07-1.14x faster parsing across document sizes.
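The index-based design described in this commit can be sketched roughly as follows. `IndexedSource`, `snapshot` and `restore` are illustrative names chosen for this example and do not reproduce tomlkit's real `Source`; the point is only that construction does no per-character work and that saving state is a cheap copy of an integer and one character.

```python
class IndexedSource(str):
    """Hypothetical str subclass that tracks a cursor instead of an iterator."""

    def __init__(self, _text: str) -> None:
        super().__init__()
        self._idx = 0
        # Only the first character is read here; nothing is pre-built.
        self._current = self[0] if self else ""

    @property
    def current(self) -> str:
        return self._current

    def inc(self) -> bool:
        """Move to the next character; return False when input is exhausted."""
        nxt = self._idx + 1
        if nxt >= len(self):
            self._idx = len(self)
            self._current = ""
            return False
        self._idx = nxt
        self._current = self[nxt]
        return True

    def snapshot(self):
        """Capture the cursor; no iterator needs to be copied."""
        return (self._idx, self._current)

    def restore(self, saved) -> None:
        """Rewind to a previously captured cursor."""
        self._idx, self._current = saved
```

A speculative parse can then call `snapshot()` before looking ahead and `restore()` if the attempt fails.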

The parser advanced one character at a time through runs of whitespace, bare-key and number characters, paying a `Source.inc()` call (attribute lookups + a `TOMLChar` build + bounds check) for every character. Add `Source.advance_while(charset)` / `advance_until(stopset)`, which scan the underlying string in a single pass and update the index and current character only once, and use them for the leading-whitespace, bare-key and number/date runs. Same value contract as the `while ... and self.inc()` loops they replace. No behaviour change (full suite incl. the toml-test conformance submodule passes; round-trip output byte-identical on a varied corpus). ~1.05-1.32x faster parsing depending on shape (e.g. ~1.26x on a poetry.lock-like file).
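As a rough illustration of the run-scanning idea, the sketch below writes the two helpers as free functions over a string and an index. The real methods live on `Source` and update its internal state; their exact signatures and return values are not shown in this commit message, so the shapes here are assumptions.

```python
import string

# Illustrative character sets; the parser's own sets may differ.
SPACES = frozenset(" \t")
BARE_KEY_CHARS = frozenset(string.ascii_letters + string.digits + "-_")


def advance_while(text: str, idx: int, charset) -> int:
    """Return the index of the first character at or after idx NOT in charset."""
    end = len(text)
    while idx < end and text[idx] in charset:
        idx += 1
    return idx


def advance_until(text: str, idx: int, stopset) -> int:
    """Return the index of the first character at or after idx that IS in stopset."""
    end = len(text)
    while idx < end and text[idx] not in stopset:
        idx += 1
    return idx
```

The loop body is a bare membership test with no method call per character, and the caller updates its cursor once with the returned index, for example `key_end = advance_while(line, key_start, BARE_KEY_CHARS)`.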

Parsing a single-line string appended its body one character at a time (`value += current; inc()`). For long string values this dominates. Scan the run of ordinary characters up to the next delimiter, backslash or control character in a single pass (`Source.advance_until`) and append the whole slice at once; the stop character is then handled by the existing branch on the next iteration. Multiline strings keep the per-character loop (CRLF handling). The stop-set is exactly the control characters the per-character loop rejects, so InvalidControlChar / escape / delimiter handling is unchanged. No behaviour change (972 tests incl. the toml-test conformance submodule; plus a 4135-input adversarial differential — output and error-type byte-identical to the per-char loop). Up to ~5x faster parsing on string-heavy single-line documents.
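A simplified sketch of the bulk approach for a basic (double-quoted) single-line string follows. It is not tomlkit's parser: `parse_basic_string` is a hypothetical function, it supports only a handful of escapes, and it raises plain `ValueError` where the real code raises its own exception types. The stop-set mirrors the control characters TOML forbids in basic strings (U+0000 to U+0008, U+000A to U+001F, U+007F), plus the quote and backslash.

```python
CONTROL_CHARS = frozenset(chr(c) for c in range(0x20) if c != 0x09) | {"\x7f"}
STOP_SET = CONTROL_CHARS | {'"', "\\"}
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def parse_basic_string(text: str, idx: int):
    """Parse a string body starting just after the opening quote.

    Returns (value, index just past the closing quote).
    """
    parts = []
    end = len(text)
    while True:
        # Scan the whole run of ordinary characters in one pass.
        start = idx
        while idx < end and text[idx] not in STOP_SET:
            idx += 1
        parts.append(text[start:idx])  # append the slice at once

        if idx >= end:
            raise ValueError("unexpected end of input inside string")
        ch = text[idx]
        if ch == '"':
            return "".join(parts), idx + 1
        if ch == "\\":
            if idx + 1 >= end or text[idx + 1] not in SIMPLE_ESCAPES:
                raise ValueError("unsupported escape in this sketch")
            parts.append(SIMPLE_ESCAPES[text[idx + 1]])
            idx += 2
            continue
        # Anything else in the stop-set is a forbidden control character.
        raise ValueError(f"invalid control character {ch!r}")
```

Because every rejected control character is in the stop-set, the scan can never skip over one; it always stops and hands the character to the same branch that would have handled it in a per-character loop.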

After the bulk run-scans, the parser only built a `TOMLChar` (a `str` subclass) at run boundaries and used a handful of its `is_*()` helpers. Drop the class entirely: `Source` now yields plain `str` characters and detects end-of-input positionally (`_idx >= len` / `Source.end()`) instead of an identity sentinel, and the remaining character-class checks use module-level frozensets. A real NUL byte is still rejected as an invalid control char and is never mistaken for end-of-input, since EOF is now positional rather than a sentinel comparison. No behaviour change (972 tests incl. the toml-test conformance submodule; plus an 11.5k-input adversarial differential over EOF/truncation, real-NUL placement, empty/whitespace and structural fuzz — output and error-type byte-identical to master). Removes the per-character object construction and method dispatch (~1.1-1.18x over the previous step).
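Replacing per-object `is_*()` helpers with module-level sets could look something like the sketch below. The set names and their contents are illustrative assumptions, not the identifiers introduced by this commit.

```python
import string

# Built once at import time and shared by every check.
SPACES = frozenset(" \t")
NEWLINES = frozenset("\n\r")
WHITESPACE = SPACES | NEWLINES
BARE_KEY = frozenset(string.ascii_letters + string.digits + "-_")


def classify(ch: str) -> str:
    """Classify a plain str character using set membership only."""
    if ch in SPACES:
        return "space"
    if ch in NEWLINES:
        return "newline"
    if ch in BARE_KEY:
        return "bare-key"
    return "other"
```

One detail worth noting about sets versus strings: `"" in "abc"` is `True` in Python (the empty string is a substring of everything), whereas `"" in frozenset("abc")` is `False`, so an empty "current character" never matches a class by accident when the classes are frozensets.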

tfoutrein added 4 commits June 5, 2026 16:19

tfoutrein force-pushed the perf/remove-tomlchar branch from b4509b9 to 1c43d4d Compare June 6, 2026 21:12

tfoutrein mentioned this pull request Jun 6, 2026

Speed up parsing by interning TOMLChar instances #488

Open

tfoutrein wants to merge 4 commits into python-poetry:master from AstekGroup:perf/remove-tomlchar
