Enhance WebPageLoader for markdown content extraction #320146

Draft
karthiknadig wants to merge 3 commits into main from karthiknadig/markdown-mime

@karthiknadig
Member

Implement support for extracting markdown content from web pages by modifying the WebPageLoader to prefer markdown responses during content negotiation. Update tests to verify the new functionality and ensure proper handling of markdown content types.

Copilot AI review requested due to automatic review settings June 5, 2026 17:18
@karthiknadig karthiknadig self-assigned this Jun 5, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the WebPageLoader (Electron main process) to prefer text/markdown during HTTP content negotiation and to treat text/markdown main-frame responses as directly-extractable content, with accompanying unit test updates.

Changes:

  • Add Accept header for mainFrame requests to prefer text/markdown over text/html.
  • Detect text/markdown responses and switch extraction to a simplified “document text content” path.
  • Extend/adjust unit tests to cover the new request/response behavior.

| File | Description |
| --- | --- |
| src/vs/platform/webContentExtractor/electron-main/webPageLoader.ts | Adds markdown-preferred negotiation + markdown response detection/extraction path. |
| src/vs/platform/webContentExtractor/test/electron-main/webPageLoader.test.ts | Updates header-modification tests and adds markdown negotiation/extraction tests. |
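
The negotiation change summarized above can be sketched roughly as follows. This is an illustration only, not the PR's code: the helper name `preferMarkdown`, the `RequestDetails` shape, and the exact `Accept` value with its q-weights are assumptions. The review only states that `text/markdown` is preferred over `text/html` for `mainFrame` requests.

```typescript
// Hypothetical shape; it mirrors only the fields this sketch needs.
interface RequestDetails {
	resourceType?: string;
	requestHeaders: Record<string, string>;
}

// Assumed header value: markdown first, HTML as a weighted fallback.
const MARKDOWN_PREFERRED_ACCEPT = 'text/markdown, text/html;q=0.9, */*;q=0.8';

// Returns a copy of the request headers, overriding Accept only for
// main-frame navigations so sub-resources (CSS, images) are left untouched.
function preferMarkdown(details: RequestDetails): Record<string, string> {
	const headers = { ...details.requestHeaders };
	if (details.resourceType === 'mainFrame') {
		// Drop any existing Accept header regardless of casing before setting ours.
		for (const headerName of Object.keys(headers)) {
			if (headerName.toLowerCase() === 'accept') {
				delete headers[headerName];
			}
		}
		headers['Accept'] = MARKDOWN_PREFERRED_ACCEPT;
	}
	return headers;
}
```

Keeping the override conditional on `resourceType` is what lets a stylesheet or image request pass through with its original headers.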

Copilot's findings

Comments suppressed due to low confidence (1)

src/vs/platform/webContentExtractor/test/electron-main/webPageLoader.test.ts:1026

  • The simulated sub-resource request omits resourceType, but Electron provides a concrete non-main-frame value for these (e.g. stylesheet). Passing an explicit non-mainFrame type makes the test more realistic and avoids the test depending on resourceType being missing.
		// Simulate a sub-resource request (no resourceType)
		callback(
			{
				url: 'https://example.com/style.css',
				requestHeaders: {
					'TestHeader': 'TestValue'
				}
			},
  • Files reviewed: 2/2 changed files
  • Comments generated: 3

Comment on lines +197 to +201
// Detect markdown responses for direct extraction without DOM rendering
if (details.resourceType === 'mainFrame' && contentType?.split(';')[0].trim() === 'text/markdown') {
	this._receivedMarkdown = true;
	this.trace(`Received text/markdown response, will extract raw content`);
}
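
The check above compares the media type exactly as sent. A slightly more defensive variant is sketched below for illustration. `isMarkdownContentType` is a hypothetical helper, not part of the PR, and the lowercasing is an added assumption based on media type names being case-insensitive.

```typescript
// Hypothetical helper: true when a Content-Type header value denotes markdown.
// Parameters such as "; charset=utf-8" are ignored, and the comparison is
// case-insensitive because media type names are case-insensitive.
function isMarkdownContentType(contentType: string | undefined): boolean {
	if (!contentType) {
		return false;
	}
	const mediaType = contentType.split(';')[0].trim().toLowerCase();
	return mediaType === 'text/markdown';
}
```

For example, `isMarkdownContentType('text/markdown; charset=utf-8')` is true, while a missing header or `text/html` yields false.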
Comment on lines +446 to +451
// If the server returned text/markdown, extract the raw body directly
// without DOM-based extraction — the content is already in the ideal format.
if (this._receivedMarkdown) {
	this.trace(`Extracting raw markdown content from response body`);
	result = await this._window.webContents.executeJavaScript('document.body?.innerText ?? document.documentElement?.textContent ?? ""') ?? '';
	return;
const uri = URI.parse('https://learn.microsoft.com/en-us/docs');

const loader = createWebPageLoader(uri);
setupDebuggerMock();
@jruales
Contributor

jruales commented Jun 5, 2026

Verified this is working well for the MS Learn site:

image

and it's not affecting the browser, so it's working as expected:

image

jruales previously approved these changes Jun 5, 2026
- Reset _receivedMarkdown on every mainFrame response instead of latching
- Use textContent instead of innerText to avoid forced layout
- Strengthen test with long AX content to ensure markdown branch is exercised
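
The first two follow-up items can be illustrated with a small sketch. The names `MarkdownTracker`, `onMainFrameResponse`, and `EXTRACT_TEXT_EXPRESSION` are hypothetical and stand in for the loader's internal state. Only the reset-on-every-mainFrame-response behavior and the switch to `textContent` come from the notes above, and the exact expression used in the PR may differ.

```typescript
// Hypothetical stand-in for the loader's markdown-tracking state.
class MarkdownTracker {
	private _receivedMarkdown = false;

	// Called for each main-frame response. Assigning the comparison result
	// (rather than only ever setting the flag to true) means a later HTML
	// response clears a flag set by an earlier markdown response.
	onMainFrameResponse(contentType: string | undefined): void {
		this._receivedMarkdown = contentType?.split(';')[0].trim() === 'text/markdown';
	}

	get receivedMarkdown(): boolean {
		return this._receivedMarkdown;
	}
}

// Assumed page-side expression: textContent reads the text nodes directly,
// whereas innerText depends on computed styles and can force a layout pass.
const EXTRACT_TEXT_EXPRESSION = 'document.body?.textContent ?? document.documentElement?.textContent ?? ""';
```

With a latching flag, a markdown response followed by an HTML response on the same loader would still take the raw-text path. Resetting on each main-frame response avoids that.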