Enhance WebPageLoader for markdown content extraction by karthiknadig · Pull Request #320146 · microsoft/vscode

karthiknadig · 2026-06-05T17:18:48Z

Implement support for extracting markdown content from web pages by modifying the WebPageLoader to prefer markdown responses during content negotiation. Update tests to verify the new functionality and ensure proper handling of markdown content types.

Copilot

Pull request overview

This PR updates the WebPageLoader (Electron main process) to prefer text/markdown during HTTP content negotiation and to treat text/markdown main-frame responses as directly-extractable content, with accompanying unit test updates.

Changes:

- Add Accept header for mainFrame requests to prefer text/markdown over text/html.
- Detect text/markdown responses and switch extraction to a simplified “document text content” path.
- Extend/adjust unit tests to cover the new request/response behavior.
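
As a rough illustration of the first change, the sketch below rewrites the Accept header only for main-frame requests. The `RequestDetails` shape, the `withMarkdownPreference` helper, and the exact header value are assumptions made for this example; the PR text does not quote the string it sends, and the real loader may structure this differently.

```typescript
// Hypothetical request shape, loosely modeled on the details object that
// Electron passes to webRequest.onBeforeSendHeaders listeners.
interface RequestDetails {
	url: string;
	resourceType?: string;
	requestHeaders: Record<string, string>;
}

// Assumed value: markdown first, HTML and everything else as lower-priority fallbacks.
const MARKDOWN_PREFERRED_ACCEPT = 'text/markdown, text/html;q=0.9, */*;q=0.8';

function withMarkdownPreference(details: RequestDetails): Record<string, string> {
	// Copy so the caller's header object is never mutated.
	const headers: Record<string, string> = { ...details.requestHeaders };

	// Sub-resources (stylesheets, scripts, images, ...) keep their original headers.
	if (details.resourceType !== 'mainFrame') {
		return headers;
	}

	// Header names are case-insensitive, so drop any existing Accept variant first.
	for (const name of Object.keys(headers)) {
		if (name.toLowerCase() === 'accept') {
			delete headers[name];
		}
	}
	headers['Accept'] = MARKDOWN_PREFERRED_ACCEPT;
	return headers;
}
```

Keeping the rewrite in a pure function like this makes it straightforward to unit test with an explicit `resourceType` for both the main frame and sub-resources.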

Show a summary per file

File	Description
src/vs/platform/webContentExtractor/electron-main/webPageLoader.ts	Adds markdown-preferred negotiation + markdown response detection/extraction path.
src/vs/platform/webContentExtractor/test/electron-main/webPageLoader.test.ts	Updates header-modification tests and adds markdown negotiation/extraction tests.

Copilot's findings

Comments suppressed due to low confidence (1)

src/vs/platform/webContentExtractor/test/electron-main/webPageLoader.test.ts:1026

The simulated sub-resource request omits resourceType, but Electron provides a concrete non-main-frame value for these (e.g. stylesheet). Passing an explicit non-mainFrame type makes the test more realistic and avoids the test depending on resourceType being missing.

		// Simulate a sub-resource request (no resourceType)
		callback(
			{
				url: 'https://example.com/style.css',
				requestHeaders: {
					'TestHeader': 'TestValue'
				}
			},

Files reviewed: 2/2 changed files
Comments generated: 3

+			// Detect markdown responses for direct extraction without DOM rendering
+			if (details.resourceType === 'mainFrame' && contentType?.split(';')[0].trim() === 'text/markdown') {
+				this._receivedMarkdown = true;
+				this.trace(`Received text/markdown response, will extract raw content`);
+			}
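
The check above compares the media type that precedes the first `;` in the Content-Type value. A standalone sketch of that parsing step follows; the `isMarkdownContentType` name is hypothetical, and the lower-casing is an addition not present in the diff (HTTP media type names are case-insensitive, so a server could legitimately send `Text/Markdown`).

```typescript
// Hypothetical helper: returns true when a Content-Type header value names
// text/markdown, ignoring parameters such as charset and ignoring case.
function isMarkdownContentType(contentType: string | undefined): boolean {
	if (!contentType) {
		return false;
	}
	// Everything before the first ';' is the media type; the rest are parameters.
	const [mediaType = ''] = contentType.split(';');
	return mediaType.trim().toLowerCase() === 'text/markdown';
}
```

Whether the case normalization is worth adopting depends on how strictly the target servers format the header; the exact-match comparison in the diff is sufficient for servers that send the lower-case form.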


+					// If the server returned text/markdown, extract the raw body directly
+					// without DOM-based extraction — the content is already in the ideal format.
+					if (this._receivedMarkdown) {
+						this.trace(`Extracting raw markdown content from response body`);
+						result = await this._window.webContents.executeJavaScript('document.body?.innerText ?? document.documentElement?.textContent ?? ""') ?? '';
+						return;


+		const uri = URI.parse('https://learn.microsoft.com/en-us/docs');
+
+		const loader = createWebPageLoader(uri);
+		setupDebuggerMock();


jruales · 2026-06-05T18:02:56Z

Verified this is working well for the MS Learn site, and it's not affecting the browser, so it's working as expected.


karthiknadig added 2 commits June 4, 2026 09:27

feat: enhance WebPageLoader to support markdown content extraction

0aadb69

test: enhance WebPageLoader tests for markdown content negotiation

b17eef3

Copilot AI review requested due to automatic review settings June 5, 2026 17:18

karthiknadig self-assigned this Jun 5, 2026

Copilot started reviewing on behalf of karthiknadig June 5, 2026 17:19

karthiknadig requested review from TylerLeonhardt, jruales and kycutler June 5, 2026 17:19

Copilot AI reviewed Jun 5, 2026

jruales previously approved these changes Jun 5, 2026

fix: address PR review comments for markdown content negotiation

f693b71

- Reset _receivedMarkdown on every mainFrame response instead of latching
- Use textContent instead of innerText to avoid forced layout
- Strengthen test with long AX content to ensure markdown branch is exercised
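
To illustrate the first item of this commit, the sketch below contrasts with a latching flag: every main-frame response overwrites the state, so a markdown response followed by a redirect or navigation to an HTML page does not leave the loader on the markdown path. The `MarkdownResponseTracker` class and its method names are hypothetical and only mirror the shape of the `_receivedMarkdown` field; this is not the code from the commit.

```typescript
// Hypothetical tracker that mirrors the "reset, don't latch" behavior.
class MarkdownResponseTracker {
	private _receivedMarkdown = false;

	get receivedMarkdown(): boolean {
		return this._receivedMarkdown;
	}

	// Call once per response with the request's resource type and Content-Type value.
	onHeadersReceived(resourceType: string | undefined, contentType: string | undefined): void {
		// Sub-resource responses never change the flag.
		if (resourceType !== 'mainFrame') {
			return;
		}
		// Assign unconditionally so a later non-markdown main-frame response clears the flag.
		const [mediaType = ''] = (contentType ?? '').split(';');
		this._receivedMarkdown = mediaType.trim() === 'text/markdown';
	}
}
```

A latching variant would only ever set the flag to `true`, which is the behavior this commit message describes moving away from.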

karthiknadig dismissed jruales’s stale review via f693b71 June 6, 2026 03:31

karthiknadig wants to merge 3 commits into main from karthiknadig/markdown-mime

3 participants
