Remove the internal TOMLChar wrapper #492

Open
tfoutrein wants to merge 4 commits into python-poetry:master from AstekGroup:perf/remove-tomlchar
Conversation

@tfoutrein

Contributor

Stacked on #489, #490 and #491 — the capstone of that series; best reviewed/merged after them.
This also supersedes #488 (interning TOMLChar): per @dimbleby's suggestion on that PR, removing the wrapper entirely is the better end-state, so I'd close #488 in favour of this.

What

After the bulk run-scans (#490/#491), the parser only constructs a TOMLChar (a str subclass) at run boundaries and uses a handful of its is_*() helpers. This removes the class entirely:

  • Source yields plain str characters; inc() / advance_* read self[i] directly.
  • End-of-input is detected positionally (_idx >= len / Source.end()) instead of an identity sentinel.
  • The remaining character-class checks use module-level frozensets.

A real NUL byte is still rejected as an invalid control char and is never mistaken for end-of-input, since EOF is now positional rather than a value/identity comparison.
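To illustrate why a positional check cannot confuse a real NUL with end-of-input, here is a minimal sketch. Both helper names and the `"\0"` sentinel are hypothetical and chosen for illustration only; the value-based variant is a deliberately naive strawman and does not represent how master behaves.

```python
EOF_SENTINEL = "\0"  # hypothetical value-based end marker, for contrast only


def at_end_by_value(text: str, idx: int) -> bool:
    """Naive check: compares the current character to a sentinel value."""
    ch = text[idx] if idx < len(text) else EOF_SENTINEL
    # Ambiguous: a genuine NUL in the input compares equal to the sentinel.
    return ch == EOF_SENTINEL


def at_end_by_position(text: str, idx: int) -> bool:
    """Positional check: only the index decides, never the character."""
    return idx >= len(text)


doc = "a\x00b"
print(at_end_by_value(doc, 1))     # True: the real NUL looks like EOF
print(at_end_by_position(doc, 1))  # False: index 1 is still inside the input
print(at_end_by_position(doc, 3))  # True: index 3 is past the last character
```

With the positional form, the NUL at index 1 remains an ordinary character that the control-character check can reject.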

Benchmarks

Median, interleaved A/B vs master (includes #489–#491):

| Document | Speedup |
| --- | --- |
| large flat, single-line strings (~90 KB) | 5.8× |
| poetry.lock-like (~64 KB) | 2.4× |
| pyproject.toml | 1.9× |
| typical mixed (~4 KB) | 1.6× |

The removal itself adds ~1.1–1.18× over #491. No regression on any shape.
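The benchmark harness is not part of this description. As a rough sketch of what a "median, interleaved A/B" measurement could look like, the hypothetical `interleaved_speedup` helper below alternates the two callables within each round, so that drift from caches, GC or thermal throttling affects both sides, and then compares medians.

```python
import statistics
import time
from typing import Callable, List


def interleaved_speedup(
    baseline: Callable[[], object],
    candidate: Callable[[], object],
    rounds: int = 25,
) -> float:
    """Return median(baseline time) / median(candidate time)."""
    base_times: List[float] = []
    cand_times: List[float] = []
    for _ in range(rounds):
        # Run A then B back to back in every round (interleaving).
        t0 = time.perf_counter()
        baseline()
        t1 = time.perf_counter()
        candidate()
        t2 = time.perf_counter()
        base_times.append(t1 - t0)
        cand_times.append(t2 - t1)
    return statistics.median(base_times) / statistics.median(cand_times)
```

In practice `baseline` and `candidate` would each parse the same document with the old and new parser respectively; a result above 1.0 means the candidate is faster.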

Tests

Full suite passes (972, incl. the toml-test conformance submodule). On top of that, an 11.5k-input adversarial differential — EOF/truncation at every prefix length, real-NUL placement in every position, empty/whitespace/BOM, and structural fuzz — is byte-identical in output and exception type to master. No public API change (TOMLChar was not exported).
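The differential itself is not included here. The sketch below shows one possible shape for the truncation and NUL-placement parts, under the assumption that each parser is exposed as a plain `str -> object` callable. All names (`outcome`, `adversarial_inputs`, `differential`) are hypothetical.

```python
from typing import Callable, Iterator, List, Tuple


def outcome(parse: Callable[[str], object], text: str) -> Tuple[str, object]:
    """Normalise a parse attempt to ("ok", result) or ("err", exception name)."""
    try:
        return ("ok", parse(text))
    except Exception as exc:
        return ("err", type(exc).__name__)


def adversarial_inputs(doc: str) -> Iterator[str]:
    # Truncation at every prefix length, including the empty input.
    for n in range(len(doc) + 1):
        yield doc[:n]
    # A real NUL inserted at every position, including both ends.
    for i in range(len(doc) + 1):
        yield doc[:i] + "\x00" + doc[i:]


def differential(
    old: Callable[[str], object], new: Callable[[str], object], doc: str
) -> List[str]:
    """Return the inputs on which the two parsers disagree."""
    return [t for t in adversarial_inputs(doc) if outcome(old, t) != outcome(new, t)]
```

An empty list from `differential` means both parsers produced the same result or the same exception type on every generated input.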

tfoutrein added 4 commits June 5, 2026 16:19
`Source.__init__` built `iter([(i, TOMLChar(c)) for i, c in enumerate(self)])`,
allocating one tuple and one TOMLChar per character of the whole input up
front. Track an integer index into the underlying string instead: `inc()`
bumps the index and reads `self[idx]`, and state save/restore snapshots the
index rather than copying an iterator. Construction is O(1) and per-character
work is deferred to the read.
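A minimal sketch of the index-based approach, assuming a simplified cursor class. `IndexedSource` and its `state()` helper are hypothetical names, and the real `Source` save/restore semantics may differ.

```python
from contextlib import contextmanager


class IndexedSource:
    """Hypothetical cursor over a string; illustrates the idea only."""

    def __init__(self, text: str) -> None:
        # O(1) construction: no per-character tuples or wrappers up front.
        self._text = text
        self._idx = 0

    @property
    def current(self) -> str:
        # The character is read lazily; "" once the index is past the end.
        return self._text[self._idx] if self._idx < len(self._text) else ""

    def inc(self) -> bool:
        # Bump the index; report whether a character is still available.
        if self._idx < len(self._text):
            self._idx += 1
        return self._idx < len(self._text)

    @contextmanager
    def state(self, restore: bool = False):
        # The snapshot is a single integer rather than a copied iterator.
        saved = self._idx
        try:
            yield self
        finally:
            if restore:
                self._idx = saved
```

Lookahead then amounts to entering `state(restore=True)`, advancing as needed, and letting the context manager reset the index on exit.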

No behaviour change (full suite incl. the toml-test conformance submodule
passes); ~1.07-1.14x faster parsing across document sizes.

The parser advanced one character at a time through runs of whitespace,
bare-key and number characters, paying a `Source.inc()` call (attribute
lookups + a `TOMLChar` build + bounds check) for every character.

Add `Source.advance_while(charset)` / `advance_until(stopset)`, which scan
the underlying string in a single pass and update the index and current
character only once, and use them for the leading-whitespace, bare-key and
number/date runs. Same value contract as the `while ... and self.inc()`
loops they replace.
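The bulk scans might look roughly like the sketch below. `RunScanner` is a hypothetical stand-in, and returning the consumed slice is an assumption made for illustration; the actual methods keep the contract of the loops they replace, which is not spelled out here.

```python
from typing import FrozenSet


class RunScanner:
    """Hypothetical cursor with bulk-advance helpers."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._idx = 0

    @property
    def current(self) -> str:
        return self._text[self._idx] if self._idx < len(self._text) else ""

    def advance_while(self, charset: FrozenSet[str]) -> str:
        """Consume the run of characters that are in `charset`."""
        text, i, n = self._text, self._idx, len(self._text)
        start = i
        while i < n and text[i] in charset:
            i += 1
        self._idx = i  # cursor state is written once, after the whole run
        return text[start:i]

    def advance_until(self, stopset: FrozenSet[str]) -> str:
        """Consume characters up to, but not including, one in `stopset`."""
        text, i, n = self._text, self._idx, len(self._text)
        start = i
        while i < n and text[i] not in stopset:
            i += 1
        self._idx = i
        return text[start:i]
```

The inner loops use only local variables, so the per-character cost is one index and one set lookup instead of a method call.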

No behaviour change (full suite incl. the toml-test conformance submodule
passes; round-trip output byte-identical on a varied corpus). ~1.05-1.32x
faster parsing depending on shape (e.g. ~1.26x on a poetry.lock-like file).

Parsing a single-line string appended its body one character at a time
(`value += current; inc()`). For long string values this dominates.

Scan the run of ordinary characters up to the next delimiter, backslash
or control character in a single pass (`Source.advance_until`) and append
the whole slice at once; the stop character is then handled by the
existing branch on the next iteration. Multiline strings keep the
per-character loop (CRLF handling). The stop-set is exactly the control
characters the per-character loop rejects, so InvalidControlChar / escape
/ delimiter handling is unchanged.
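As a simplified sketch of the run-scan for a basic single-line string: the function below is standalone and hypothetical, supports only a handful of escapes, and raises `ValueError` where the real parser raises its own exception types. The stop-set follows the TOML rule that control characters other than tab are not allowed in basic strings.

```python
from typing import Tuple

# Control characters TOML forbids in basic strings (tab is allowed).
CONTROL_CHARS = frozenset(chr(c) for c in range(0x20) if c != 0x09) | {"\x7f"}
STOP_CHARS = CONTROL_CHARS | {'"', "\\"}
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}  # subset only


def read_basic_string_body(text: str, start: int) -> Tuple[str, int]:
    """Read from just after the opening quote; return (value, next index)."""
    parts = []
    i, n = start, len(text)
    while True:
        run_start = i
        while i < n and text[i] not in STOP_CHARS:
            i += 1
        parts.append(text[run_start:i])  # the whole run is appended at once
        if i >= n:
            raise ValueError("unexpected end of input")
        ch = text[i]
        if ch == '"':  # closing delimiter
            return "".join(parts), i + 1
        if ch == "\\":  # escape sequence (simplified)
            if i + 1 >= n or text[i + 1] not in SIMPLE_ESCAPES:
                raise ValueError("invalid escape")
            parts.append(SIMPLE_ESCAPES[text[i + 1]])
            i += 2
            continue
        raise ValueError(f"invalid control char {ch!r}")
```

Only the stop character goes through the branching logic; everything between two stop characters is copied as a single slice.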

No behaviour change (972 tests incl. the toml-test conformance submodule;
plus a 4135-input adversarial differential — output and error-type
byte-identical to the per-char loop). Up to ~5x faster parsing on
string-heavy single-line documents.

After the bulk run-scans, the parser only built a `TOMLChar` (a `str`
subclass) at run boundaries and used a handful of its `is_*()` helpers.
Drop the class entirely: `Source` now yields plain `str` characters and
detects end-of-input positionally (`_idx >= len` / `Source.end()`) instead
of an identity sentinel, and the remaining character-class checks use
module-level frozensets.

A real NUL byte is still rejected as an invalid control char and is never
mistaken for end-of-input, since EOF is now positional rather than a
sentinel comparison.
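A sketch of character-class checks built on module-level frozensets. The constant and function names are hypothetical; the bare-key alphabet is the one defined by the TOML specification.

```python
import string

# Module-level sets are built once at import time.
BARE_KEY_CHARS = frozenset(string.ascii_letters + string.digits + "-_")
SPACE_CHARS = frozenset(" \t")
DIGIT_CHARS = frozenset(string.digits)


def is_bare_key_char(ch: str) -> bool:
    # Membership in a frozenset is a hash lookup on a plain str.
    return ch in BARE_KEY_CHARS


def is_space(ch: str) -> bool:
    return ch in SPACE_CHARS


print(is_bare_key_char("k"))   # True
print(is_bare_key_char("\0"))  # False: NUL belongs to no class
print(is_bare_key_char(""))    # False: empty string is not a set member
```

One reason to prefer a set over a plain string such as `"abc"` is that `"" in "abc"` is `True` in Python, whereas the empty string is never a member of these frozensets.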

No behaviour change (972 tests incl. the toml-test conformance submodule;
plus an 11.5k-input adversarial differential over EOF/truncation, real-NUL
placement, empty/whitespace and structural fuzz — output and error-type
byte-identical to master). Removes the per-character object construction
and method dispatch (~1.1-1.18x over the previous step).