feat(baoyu-url-to-markdown): add browser fallback strategy, content cleaner, and data URI support

- Browser strategy: headless first with automatic retry in visible Chrome on failure
- New --browser auto|headless|headed flag with --headless/--headed shortcuts
- Content cleaner module for HTML preprocessing (remove ads, base64 images, scripts)
- Media localizer now handles base64 data URIs alongside remote URLs
- Capture finalUrl from browser to track redirects for output path
- Agent quality gate documentation for post-capture validation
- Upgrade defuddle ^0.12.0 → ^0.14.0
- Add unit tests for content-cleaner, html-to-markdown, legacy-converter, media-localizer
Author: Jim Liu 宝玉
Date: 2026-03-24 22:39:17 -05:00
Parent: 40f9f05c22
Commit: e99ce744cd
12 changed files with 909 additions and 67 deletions

View File

@@ -118,6 +118,7 @@ Full reference: [references/config/first-time-setup.md](references/config/first-
## Features
- Chrome CDP for full JavaScript rendering
- Browser strategy fallback: default headless first, then visible Chrome on technical failure
- URL-specific parser layer for sites that need custom HTML rules before generic extraction
- Two capture modes: auto or wait-for-user
- Save rendered HTML as a sibling `-captured.html` file
@@ -137,6 +138,12 @@ Full reference: [references/config/first-time-setup.md](references/config/first-
# Auto mode (default) - capture when page loads
${BUN_X} {baseDir}/scripts/main.ts <url>
# Force headless only
${BUN_X} {baseDir}/scripts/main.ts <url> --browser headless
# Force visible browser
${BUN_X} {baseDir}/scripts/main.ts <url> --browser headed
# Wait mode - wait for user signal before capture
${BUN_X} {baseDir}/scripts/main.ts <url> --wait
@@ -158,6 +165,9 @@ ${BUN_X} {baseDir}/scripts/main.ts <url> --download-media
| `-o <path>` | Output file path — must be a **file** path, not directory (default: auto-generated) |
| `--output-dir <dir>` | Base output directory — auto-generates `{dir}/{domain}/{slug}.md` (default: `./url-to-markdown/`) |
| `--wait` | Wait for user signal before capturing |
| `--browser <mode>` | Browser strategy: `auto` (default), `headless`, or `headed` |
| `--headless` | Shortcut for `--browser headless` |
| `--headed` | Shortcut for `--browser headed` |
| `--timeout <ms>` | Page load timeout (default: 30000) |
| `--download-media` | Download image/video assets to local `imgs/` and `videos/`, and rewrite markdown links to local relative paths |
@@ -165,7 +175,7 @@ ${BUN_X} {baseDir}/scripts/main.ts <url> --download-media
| Mode | Behavior | Use When |
|------|----------|----------|
| Auto (default) | Try headless first, then retry in visible Chrome if needed | Public pages, static content, unknown pages |
| Wait (`--wait`) | User signals when ready | Login-required, lazy loading, paywalls |
**Wait mode workflow**:
@@ -173,6 +183,43 @@ ${BUN_X} {baseDir}/scripts/main.ts <url> --download-media
2. Ask user to confirm page is ready
3. Send newline to stdin to trigger capture
**Default browser fallback**:
1. Auto mode starts with headless Chrome and captures on network idle
2. If headless capture fails technically, retry with visible Chrome
3. If a shared Chrome session for this profile already exists, reuse it instead of launching a new browser
4. The script does not hard-code login or paywall detection; the agent must inspect the captured markdown or HTML and decide whether to rerun with `--browser headed --wait`
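The fallback above can be sketched as a small wrapper; `captureHeadless` and `captureHeaded` are hypothetical stand-ins for the real capture calls in `scripts/main.ts`, and the only logic this sketch commits to is "headless first, one retry in visible Chrome".

```typescript
// Minimal sketch of the auto-mode browser fallback. The capture functions
// are hypothetical placeholders, not the skill's real API.
type Capture = () => Promise<string>;

async function captureWithFallback(
  captureHeadless: Capture,
  captureHeaded: Capture,
): Promise<{ html: string; usedHeaded: boolean }> {
  try {
    // Step 1: start with headless Chrome.
    return { html: await captureHeadless(), usedHeaded: false };
  } catch {
    // Step 2: on technical failure, retry once in visible Chrome.
    return { html: await captureHeaded(), usedHeaded: true };
  }
}
```

Note that this sketch retries on any thrown error; deciding whether a *successful* capture is actually low quality is deliberately left to the agent.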
## Agent Quality Gate
**CRITICAL**: The agent must treat headless capture as provisional. Some sites render differently in headless mode and can silently return an error shell, partially hydrated page, or low-quality extraction **without** causing the CLI to fail.
After every run that used `--browser auto` or `--browser headless`, the agent **MUST** inspect the saved markdown first, and inspect the saved `-captured.html` when the markdown looks suspicious.
### Quality checks the agent must perform
1. Confirm the markdown title matches the target page, not a generic site shell
2. Confirm the body contains the expected article or page content, not just navigation, footer, or a generic error
3. Watch for obvious failure signs such as:
- `Application error`
- `This page could not be found`
- login, signup, subscribe, or verification shells
- extremely short markdown for a page that should be long-form
- raw framework payloads or mostly boilerplate content
4. If the result is low quality, incomplete, or clearly wrong, do **not** accept the run as successful just because the CLI exited with code 0
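These checks could be approximated with a heuristic like the one below; the regex patterns and the 500-character threshold are illustrative assumptions, since the skill deliberately leaves this judgment to the agent rather than baking it into the CLI.

```typescript
// Illustrative quality heuristic only; the patterns and threshold are
// assumptions, not part of the CLI.
const FAILURE_SIGNS: RegExp[] = [
  /application error/i,
  /this page could not be found/i,
  /\b(log\s?in|sign\s?up|subscribe|verify)\b/i,
];

function looksSuspicious(markdown: string, expectLongForm = true): boolean {
  if (FAILURE_SIGNS.some((pattern) => pattern.test(markdown))) return true;
  // Extremely short output for a page that should be long-form is a red flag.
  if (expectLongForm && markdown.trim().length < 500) return true;
  return false;
}
```

A heuristic this blunt would flag any article that merely mentions "subscribe", which is exactly why the document frames this as agent judgment rather than a hard-coded gate.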
### Recovery workflow the agent must follow
1. First run with default `auto` unless there is already a clear reason to use wait mode
2. Review markdown quality immediately after the run
3. If the content is low quality, rerun locally with visible Chrome:
- `--browser headed` for ordinary rendering issues
- `--browser headed --wait` when the page may need login, anti-bot interaction, cookie acceptance, or extra hydration time
4. If `--wait` is used, tell the user exactly what to do:
- if login is required, ask them to sign in
- if the page needs time to hydrate, ask them to wait until the full content is visible
- once ready, ask them to press Enter so capture can continue
5. Only fall back to hosted `defuddle.md` after the local browser strategies have failed or are clearly lower fidelity
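The escalation order can be pictured as a tiny decision function; the step names mirror the CLI flags, while the inputs (`qualityOk`, `attempt`, `needsUserAction`) are hypothetical labels for what the agent knows after a run.

```typescript
// Hypothetical escalation helper; in practice this reasoning lives in the
// agent, not in the CLI.
type NextStep = "accept" | "rerun-headed" | "rerun-headed-wait" | "hosted-defuddle";

function nextStep(
  qualityOk: boolean,
  attempt: number, // 1 = the initial `auto` run
  needsUserAction: boolean, // login, anti-bot, cookies, slow hydration
): NextStep {
  if (qualityOk) return "accept";
  if (attempt === 1) {
    return needsUserAction ? "rerun-headed-wait" : "rerun-headed";
  }
  // Hosted defuddle.md is the last resort after local strategies fail.
  return "hosted-defuddle";
}
```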
## Output Format
Each run saves two files side by side:
@@ -211,8 +258,9 @@ Conversion order:
2. If no specialized parser matches, try Defuddle
3. For rich pages such as YouTube, prefer Defuddle's extractor-specific output (including transcripts when available) instead of replacing it with the legacy pipeline
4. If Defuddle throws, cannot load, returns obviously incomplete markdown, or captures lower-quality content than the legacy pipeline, automatically fall back to the pre-Defuddle extractor
5. If the agent determines the captured result is a login screen, verification screen, or paywall shell, rerun locally with `--browser headed --wait` and ask the user to complete access before capture
6. If the entire local browser capture flow still fails before markdown can be produced, try the hosted `https://defuddle.md/<url>` API and save its markdown output directly
7. The legacy fallback path uses the older Readability/selector/Next.js-data based HTML-to-Markdown implementation recovered from git history
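Step 6's hosted fallback amounts to a single GET against `https://defuddle.md/<url>`; only the URL shape comes from the text above, while the error handling and the injectable fetcher below are assumptions added for testability.

```typescript
// Sketch of the hosted defuddle.md fallback (step 6 above). Only the URL
// shape comes from the docs; the rest is an assumption.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  text(): Promise<string>;
}>;

async function fetchHostedMarkdown(
  targetUrl: string,
  fetcher: FetchLike = fetch,
): Promise<string> {
  const response = await fetcher(`https://defuddle.md/${targetUrl}`);
  if (!response.ok) {
    throw new Error(`defuddle.md returned HTTP ${response.status}`);
  }
  return await response.text();
}
```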
CLI output will show:

View File

@@ -6,7 +6,7 @@
"dependencies": {
"@mozilla/readability": "^0.6.0",
"baoyu-chrome-cdp": "file:./vendor/baoyu-chrome-cdp",
"defuddle": "^0.14.0",
"jsdom": "^24.1.3",
"linkedom": "^0.18.12",
"turndown": "^7.2.2",
@@ -61,7 +61,7 @@
"decimal.js": ["decimal.js@10.6.0", "", {}, "sha512-YpgQiITW3JXGntzdUmyUR1V812Hn8T1YVXhCu+wO3OpS4eU9l4YdD3qjyiKdV6mvV29zapkMeD390UVEf2lkUg=="],
"defuddle": ["defuddle@0.14.0", "", { "dependencies": { "commander": "^12.1.0" }, "optionalDependencies": { "linkedom": "^0.18.12", "mathml-to-latex": "^1.5.0", "temml": "^0.13.1", "turndown": "^7.2.0" }, "bin": { "defuddle": "dist/cli.js" } }, "sha512-btavZGd1WgiVqrVM62WGRXMUi/aU7ckTZiq0xXWLZMHvzIqNZjwIFQEDRx8MarD7fIgsB90NXZ9xHJkKtapt2Q=="],
"delayed-stream": ["delayed-stream@1.0.0", "", {}, "sha512-ZySD7Nf91aLB0RxL4KGrKHBXl7Eds1DAmEdcoVawXnLD7SDhpNgtuII2aAkg7a7QS41jxPSZ17p4VdGnMHk3MQ=="],

View File

@@ -0,0 +1,55 @@
import assert from "node:assert/strict";
import test from "node:test";
import { cleanContent } from "./content-cleaner.js";
const SAMPLE_HTML = `<!doctype html>
<html>
<head>
<title>Example Story</title>
<style>.cookie-banner { position: fixed; }</style>
<script>window.__noise = true;</script>
</head>
<body>
<!-- comment that should be removed -->
<header>
<nav>
<a href="/home">Home</a>
<a href="/topics">Topics</a>
</nav>
</header>
<div class="cookie-banner">Accept cookies</div>
<aside>Sidebar links</aside>
<main>
<article class="content">
<h1>Actual Story Title</h1>
<p>
This is the first paragraph of the real story body, and it is intentionally long enough
to survive the cleaner's main-content heuristics without being mistaken for navigation.
</p>
<p>
This is the second paragraph with more useful detail, a
<a href="/read-more">supporting link</a>, and a normal image.
</p>
<img src="/images/cover.jpg" alt="Cover">
<img src="data:image/png;base64,AAAA" alt="Inline data">
</article>
</main>
<footer>Footer boilerplate</footer>
</body>
</html>`;
test("cleanContent keeps the article body and removes obvious boilerplate", () => {
const cleaned = cleanContent(SAMPLE_HTML, "https://example.com/posts/story");
assert.match(cleaned, /Actual Story Title/);
assert.match(cleaned, /https:\/\/example\.com\/read-more/);
assert.match(cleaned, /https:\/\/example\.com\/images\/cover\.jpg/);
assert.doesNotMatch(cleaned, /Accept cookies/);
assert.doesNotMatch(cleaned, /Sidebar links/);
assert.doesNotMatch(cleaned, /Footer boilerplate/);
assert.doesNotMatch(cleaned, /window\.__noise/);
assert.doesNotMatch(cleaned, /comment that should be removed/);
assert.doesNotMatch(cleaned, /data:image\/png;base64/);
});

View File

@@ -0,0 +1,432 @@
import { parseHTML } from "linkedom";
export interface CleaningOptions {
removeAds?: boolean;
removeBase64Images?: boolean;
onlyMainContent?: boolean;
includeTags?: string[];
excludeTags?: string[];
}
const ALWAYS_REMOVE_SELECTORS = [
"script",
"style",
"noscript",
"link[rel='stylesheet']",
"[hidden]",
"[aria-hidden='true']",
"[style*='display: none']",
"[style*='display:none']",
"[style*='visibility: hidden']",
"[style*='visibility:hidden']",
"svg[aria-hidden='true']",
"svg.icon",
"svg[class*='icon']",
"template",
"meta",
"iframe",
"canvas",
"object",
"embed",
"form",
"input",
"select",
"textarea",
"button",
];
const OVERLAY_SELECTORS = [
"[class*='modal']",
"[class*='popup']",
"[class*='overlay']",
"[class*='dialog']",
"[role='dialog']",
"[role='alertdialog']",
"[class*='cookie']",
"[class*='consent']",
"[class*='gdpr']",
"[class*='privacy-banner']",
"[class*='notification-bar']",
"[id*='cookie']",
"[id*='consent']",
"[id*='gdpr']",
"[style*='position: fixed']",
"[style*='position:fixed']",
"[style*='position: sticky']",
"[style*='position:sticky']",
];
const NAVIGATION_SELECTORS = [
"header",
"footer",
"nav",
"aside",
".header",
".top",
".navbar",
"#header",
".footer",
".bottom",
"#footer",
".sidebar",
".side",
".aside",
"#sidebar",
".modal",
".popup",
"#modal",
".overlay",
".ad",
".ads",
".advert",
"#ad",
".lang-selector",
".language",
"#language-selector",
".social",
".social-media",
".social-links",
"#social",
".menu",
".navigation",
"#nav",
".breadcrumbs",
"#breadcrumbs",
".share",
"#share",
".widget",
"#widget",
".cookie",
"#cookie",
];
const FORCE_INCLUDE_SELECTORS = [
"#main",
"#content",
"#main-content",
"#article",
"#post",
"#page-content",
"main",
"article",
"[role='main']",
".main-content",
".content",
".post-content",
".article-content",
".entry-content",
".page-content",
".article-body",
".post-body",
".story-content",
".blog-content",
];
const AD_SELECTORS = [
"ins.adsbygoogle",
".google-ad",
".adsense",
"[data-ad]",
"[data-ads]",
"[data-ad-slot]",
"[data-ad-client]",
".ad-container",
".ad-wrapper",
".advertisement",
".sponsored-content",
"img[width='1'][height='1']",
"img[src*='pixel']",
"img[src*='tracking']",
"img[src*='analytics']",
];
function getLinkDensity(element: Element): number {
const text = element.textContent || "";
const textLength = text.trim().length;
if (textLength === 0) return 1;
let linkLength = 0;
element.querySelectorAll("a").forEach((link: Element) => {
linkLength += (link.textContent || "").trim().length;
});
return linkLength / textLength;
}
function getContentScore(element: Element): number {
let score = 0;
const text = element.textContent || "";
const textLength = text.trim().length;
score += Math.min(textLength / 100, 50);
score += element.querySelectorAll("p").length * 3;
score += element.querySelectorAll("h1, h2, h3, h4, h5, h6").length * 2;
score += element.querySelectorAll("img").length;
score -= element.querySelectorAll("a").length * 0.5;
score -= element.querySelectorAll("li").length * 0.2;
const linkDensity = getLinkDensity(element);
if (linkDensity > 0.5) score -= 30;
else if (linkDensity > 0.3) score -= 15;
const className = typeof element.className === "string" ? element.className : "";
const classAndId = `${className} ${element.id || ""}`;
if (/article|content|post|body|main|entry/i.test(classAndId)) score += 25;
if (/comment|sidebar|footer|nav|menu|header|widget|ad/i.test(classAndId)) score -= 25;
return score;
}
function looksLikeNavigation(element: Element): boolean {
const linkDensity = getLinkDensity(element);
if (linkDensity > 0.5) return true;
const listItems = element.querySelectorAll("li");
const links = element.querySelectorAll("a");
return listItems.length > 5 && links.length > listItems.length * 0.8;
}
function removeElements(document: Document, selectors: string[]): void {
for (const selector of selectors) {
try {
document.querySelectorAll(selector).forEach((element: Element) => element.remove());
} catch {
// Ignore unsupported selectors from linkedom/jsdom differences.
}
}
}
function removeWithProtection(
document: Document,
selectorsToRemove: string[],
protectedSelectors: string[]
): void {
for (const selector of selectorsToRemove) {
try {
document.querySelectorAll(selector).forEach((element: Element) => {
const isProtected = protectedSelectors.some((protectedSelector) => {
try {
return element.matches(protectedSelector);
} catch {
return false;
}
});
if (isProtected) return;
const containsProtected = protectedSelectors.some((protectedSelector) => {
try {
return element.querySelector(protectedSelector) !== null;
} catch {
return false;
}
});
if (containsProtected) return;
element.remove();
});
} catch {
// Ignore unsupported selectors from linkedom/jsdom differences.
}
}
}
function findMainContent(document: Document): Element | null {
const isValidContent = (element: Element | null): element is Element => {
if (!element) return false;
const text = element.textContent || "";
if (text.trim().length < 100) return false;
return !looksLikeNavigation(element);
};
const main = document.querySelector("main");
if (isValidContent(main) && getLinkDensity(main) < 0.4) return main;
const roleMain = document.querySelector('[role="main"]');
if (isValidContent(roleMain) && getLinkDensity(roleMain) < 0.4) return roleMain;
const articles = document.querySelectorAll("article");
if (articles.length === 1 && isValidContent(articles[0] ?? null)) {
return articles[0] ?? null;
}
const contentSelectors = [
"#content",
"#main-content",
"#main",
".content",
".main-content",
".post-content",
".article-content",
".entry-content",
".page-content",
".article-body",
".post-body",
".story-content",
".blog-content",
];
for (const selector of contentSelectors) {
try {
const element = document.querySelector(selector);
if (isValidContent(element) && getLinkDensity(element) < 0.4) {
return element;
}
} catch {
// Ignore invalid selectors.
}
}
const candidates: Array<{ element: Element; score: number }> = [];
const containers = document.querySelectorAll("div, section, article");
containers.forEach((element: Element) => {
const text = element.textContent || "";
if (text.trim().length < 200) return;
const score = getContentScore(element);
if (score > 0) {
candidates.push({ element, score });
}
});
candidates.sort((left, right) => right.score - left.score);
if ((candidates[0]?.score ?? 0) > 20) {
return candidates[0]?.element ?? null;
}
return null;
}
function removeBase64ImagesFromDocument(document: Document): void {
document.querySelectorAll("img[src^='data:']").forEach((element: Element) => {
element.remove();
});
document.querySelectorAll("[style*='data:image']").forEach((element: Element) => {
const style = element.getAttribute("style");
if (!style) return;
const cleanedStyle = style.replace(
/background(-image)?:\s*url\([^)]*data:image[^)]*\)[^;]*;?/gi,
""
);
if (cleanedStyle.trim()) {
element.setAttribute("style", cleanedStyle);
} else {
element.removeAttribute("style");
}
});
document.querySelectorAll("source[src^='data:'], source[srcset*='data:']").forEach((element: Element) => {
element.remove();
});
}
function makeAbsoluteUrl(value: string, baseUrl: string): string | null {
try {
return new URL(value, baseUrl).toString();
} catch {
return null;
}
}
function convertRelativeUrls(document: Document, baseUrl: string): void {
document.querySelectorAll("[src]").forEach((element: Element) => {
const src = element.getAttribute("src");
if (!src || src.startsWith("http") || src.startsWith("//") || src.startsWith("data:")) return;
const absolute = makeAbsoluteUrl(src, baseUrl);
if (absolute) element.setAttribute("src", absolute);
});
document.querySelectorAll("[href]").forEach((element: Element) => {
const href = element.getAttribute("href");
if (
!href ||
href.startsWith("http") ||
href.startsWith("//") ||
href.startsWith("#") ||
href.startsWith("mailto:") ||
href.startsWith("tel:") ||
href.startsWith("javascript:")
) {
return;
}
const absolute = makeAbsoluteUrl(href, baseUrl);
if (absolute) element.setAttribute("href", absolute);
});
}
export function cleanHtml(html: string, baseUrl: string, options: CleaningOptions = {}): string {
const {
removeAds = true,
removeBase64Images = true,
onlyMainContent = true,
includeTags,
excludeTags,
} = options;
const { document } = parseHTML(html);
removeElements(document, ALWAYS_REMOVE_SELECTORS);
removeElements(document, OVERLAY_SELECTORS);
if (removeAds) {
removeElements(document, AD_SELECTORS);
}
if (excludeTags?.length) {
removeElements(document, excludeTags);
}
if (onlyMainContent) {
removeWithProtection(document, NAVIGATION_SELECTORS, FORCE_INCLUDE_SELECTORS);
const mainContent = findMainContent(document);
if (mainContent && document.body) {
const clone = mainContent.cloneNode(true) as Element;
document.body.innerHTML = "";
document.body.appendChild(clone);
}
}
if (includeTags?.length && document.body) {
const matchedElements: Element[] = [];
for (const selector of includeTags) {
try {
document.querySelectorAll(selector).forEach((element: Element) => {
matchedElements.push(element.cloneNode(true) as Element);
});
} catch {
// Ignore invalid selectors.
}
}
if (matchedElements.length > 0) {
document.body.innerHTML = "";
matchedElements.forEach((element) => document.body?.appendChild(element));
}
}
if (removeBase64Images) {
removeBase64ImagesFromDocument(document);
}
const walker = document.createTreeWalker(document, 128 /* NodeFilter.SHOW_COMMENT */);
const comments: Node[] = [];
while (walker.nextNode()) {
comments.push(walker.currentNode);
}
comments.forEach((comment) => comment.parentNode?.removeChild(comment));
convertRelativeUrls(document, baseUrl);
return document.documentElement?.outerHTML || html;
}
export function cleanContent(html: string, baseUrl: string, options: CleaningOptions = {}): string {
return cleanHtml(html, baseUrl, options);
}

View File

@@ -0,0 +1,28 @@
import assert from "node:assert/strict";
import test from "node:test";
import { extractContent } from "./html-to-markdown.js";
const EMBEDDED_IMAGE_HTML = `<!doctype html>
<html>
<body>
<main>
<article>
<h1>Embedded Image Story</h1>
<p>
This paragraph is intentionally long enough to satisfy the extractor thresholds so the
resulting markdown keeps the main article body and the embedded image reference.
</p>
<img src="data:image/png;base64,AAAA" alt="inline">
</article>
</main>
</body>
</html>`;
test("extractContent preserves base64 images when requested for media download", async () => {
const result = await extractContent(EMBEDDED_IMAGE_HTML, "https://example.com/embedded", {
preserveBase64Images: true,
});
assert.match(result.markdown, /!\[inline\]\(data:image\/png;base64,AAAA\)/);
});

View File

@@ -13,10 +13,15 @@ import {
shouldCompareWithLegacy,
} from "./legacy-converter.js";
import { tryUrlRuleParsers } from "./parsers/index.js";
import { cleanContent } from "./content-cleaner.js";
export type { ConversionResult, PageMetadata };
export { createMarkdownDocument, formatMetadataYaml };
export interface ExtractContentOptions {
preserveBase64Images?: boolean;
}
export const absolutizeUrlsScript = String.raw`
(function() {
const baseUrl = document.baseURI || location.href;
@@ -85,7 +90,10 @@ export const absolutizeUrlsScript = String.raw`
absAttr(htmlClone, "video[poster]", "poster");
absSrcset(htmlClone, "img[srcset], source[srcset]");
return {
html: "<!doctype html>\n" + htmlClone.outerHTML,
finalUrl: location.href,
};
})()
`;
@@ -102,7 +110,11 @@ function shouldPreferDefuddle(result: ConversionResult): boolean {
return /^##?\s+transcript\b/im.test(result.markdown);
}
export async function extractContent(
html: string,
url: string,
options: ExtractContentOptions = {}
): Promise<ConversionResult> {
const capturedAt = new Date().toISOString();
const baseMetadata = extractMetadataFromHtml(html, url, capturedAt);
@@ -111,14 +123,23 @@ export async function extractContent(html: string, url: string): Promise<Convers
return specializedResult;
}
let cleanedHtml = html;
try {
cleanedHtml = cleanContent(html, url, {
removeBase64Images: !options.preserveBase64Images,
});
} catch {
cleanedHtml = html;
}
const defuddleResult = await tryDefuddleConversion(cleanedHtml, url, baseMetadata);
if (defuddleResult.ok) {
if (shouldPreferDefuddle(defuddleResult.result)) {
return { ...defuddleResult.result, rawHtml: html };
}
if (shouldCompareWithLegacy(defuddleResult.result.markdown)) {
const legacyResult = convertWithLegacyExtractor(html, baseMetadata, cleanedHtml);
const legacyScore = scoreMarkdownQuality(legacyResult.markdown);
const defuddleScore = scoreMarkdownQuality(defuddleResult.result.markdown);
@@ -130,10 +151,10 @@ export async function extractContent(html: string, url: string): Promise<Convers
}
}
return { ...defuddleResult.result, rawHtml: html };
}
const fallbackResult = convertWithLegacyExtractor(html, baseMetadata, cleanedHtml);
return {
...fallbackResult,
fallbackReason: defuddleResult.reason,

View File

@@ -0,0 +1,48 @@
import assert from "node:assert/strict";
import test from "node:test";
import { cleanContent } from "./content-cleaner.js";
import { convertWithLegacyExtractor } from "./legacy-converter.js";
import { extractMetadataFromHtml } from "./markdown-conversion-shared.js";
const CAPTURED_AT = "2026-03-24T03:00:00.000Z";
const NEXT_DATA_HTML = `<!doctype html>
<html>
<head>
<title>Hydrated Story</title>
</head>
<body>
<div class="cookie-banner">Accept cookies</div>
<main>
<p>Short teaser text that should not win over the structured article payload.</p>
</main>
<script id="__NEXT_DATA__" type="application/json">
{
"props": {
"pageProps": {
"article": {
"title": "Hydrated Story",
"description": "A structured article payload from Next.js",
"body": "<p>The full article lives in __NEXT_DATA__ and should still be extracted even when the cleaned HTML removes scripts before the selector and readability passes run.</p><p>A second paragraph keeps the content comfortably above the minimum extraction threshold and proves the legacy extractor still has access to the original structured payload.</p>"
}
}
}
}
</script>
</body>
</html>`;
test("legacy extractor still uses original __NEXT_DATA__ after HTML cleaning", () => {
const url = "https://example.com/posts/hydrated-story";
const baseMetadata = extractMetadataFromHtml(NEXT_DATA_HTML, url, CAPTURED_AT);
const cleanedHtml = cleanContent(NEXT_DATA_HTML, url);
const result = convertWithLegacyExtractor(NEXT_DATA_HTML, baseMetadata, cleanedHtml);
assert.equal(result.conversionMethod, "legacy:next-data");
assert.match(result.markdown, /The full article lives in .*NEXT.*DATA/);
assert.match(result.markdown, /A second paragraph keeps the content comfortably above the minimum extraction threshold/);
assert.doesNotMatch(result.markdown, /Short teaser text that should not win/);
assert.equal(result.rawHtml, NEXT_DATA_HTML);
});

View File

@@ -336,29 +336,32 @@ function tryNextDataExtraction(document: Document): ExtractionCandidate | null {
function buildReadabilityCandidate(
article: ReturnType<Readability["parse"]>,
referenceDocument: Document,
method: string
): ExtractionCandidate | null {
const textContent = article?.textContent?.trim() ?? "";
if (textContent.length < MIN_CONTENT_LENGTH) return null;
return {
title: pickString(article?.title, extractTitle(referenceDocument)),
byline: pickString((article as { byline?: string } | null)?.byline),
excerpt: pickString(article?.excerpt, generateExcerpt(null, textContent)),
published: pickString(
(article as { publishedTime?: string } | null)?.publishedTime,
extractPublishedTime(referenceDocument)
),
html: article?.content ? sanitizeHtml(article.content) : null,
textContent,
method,
};
}
function tryReadability(document: Document, referenceDocument: Document = document): ExtractionCandidate | null {
try {
const strictClone = document.cloneNode(true) as Document;
const strictResult = buildReadabilityCandidate(
new Readability(strictClone).parse(),
referenceDocument,
"readability"
);
if (strictResult) return strictResult;
@@ -366,7 +369,7 @@ function tryReadability(document: Document): ExtractionCandidate | null {
const relaxedClone = document.cloneNode(true) as Document;
return buildReadabilityCandidate(
new Readability(relaxedClone, { charThreshold: 120 }).parse(),
referenceDocument,
"readability-relaxed"
);
} catch {
@@ -471,14 +474,15 @@ function pickBestCandidate(candidates: ExtractionCandidate[]): ExtractionCandida
return ranked[0];
}
function extractFromHtml(html: string, cleanedHtml: string = html): ExtractionCandidate | null {
const originalDocument = parseDocument(html);
const cleanedDocument = parseDocument(cleanedHtml);
const readabilityCandidate = tryReadability(cleanedDocument, originalDocument);
const nextDataCandidate = tryNextDataExtraction(originalDocument);
const jsonLdCandidate = tryJsonLdExtraction(originalDocument);
const selectorCandidate = trySelectorExtraction(cleanedDocument);
const bodyCandidate = tryBodyExtraction(cleanedDocument);
const candidates = [
readabilityCandidate,
@@ -493,8 +497,8 @@ function extractFromHtml(html: string): ExtractionCandidate | null {
return {
...winner,
title: winner.title ?? extractTitle(originalDocument),
published: winner.published ?? extractPublishedTime(originalDocument),
excerpt: winner.excerpt ?? generateExcerpt(null, winner.textContent),
};
}
@@ -610,12 +614,16 @@ export function shouldCompareWithLegacy(markdown: string): boolean {
);
}
export function convertWithLegacyExtractor(
html: string,
baseMetadata: PageMetadata,
cleanedHtml: string = html
): ConversionResult {
const extracted = extractFromHtml(html, cleanedHtml);
let markdown = extracted?.html ? convertHtmlFragmentToMarkdown(extracted.html) : "";
if (!markdown.trim()) {
markdown = extracted?.textContent?.trim() || fallbackPlainText(cleanedHtml);
}
return {

View File

@@ -29,10 +29,33 @@ interface Args {
wait: boolean;
timeout: number;
downloadMedia: boolean;
browserMode: BrowserMode;
}
type BrowserMode = "auto" | "headless" | "headed";
interface CaptureAttemptOptions {
headless: boolean;
wait: boolean;
existingPort?: number;
waitPrompt?: string;
}
interface CaptureSnapshot {
html: string;
finalUrl: string;
}
const BROWSER_MODES = new Set<BrowserMode>(["auto", "headless", "headed"]);
function parseArgs(argv: string[]): Args {
const args: Args = {
url: "",
wait: false,
timeout: DEFAULT_TIMEOUT_MS,
downloadMedia: false,
browserMode: "auto",
};
for (let i = 2; i < argv.length; i++) {
const arg = argv[i];
if (arg === "--wait" || arg === "-w") {
@@ -45,6 +68,12 @@ function parseArgs(argv: string[]): Args {
args.outputDir = argv[++i];
} else if (arg === "--download-media") {
args.downloadMedia = true;
} else if (arg === "--browser") {
const value = argv[++i];
// Validate against BROWSER_MODES instead of trusting a raw cast; unknown values fall back to "auto".
args.browserMode = value !== undefined && BROWSER_MODES.has(value as BrowserMode) ? (value as BrowserMode) : "auto";
} else if (arg === "--headless") {
args.browserMode = "headless";
} else if (arg === "--headed" || arg === "--noheadless" || arg === "--no-headless") {
args.browserMode = "headed";
} else if (!arg.startsWith("-") && !args.url) {
args.url = arg;
}
@@ -194,21 +223,28 @@ async function generateOutputPath(url: string, title: string, outputDir?: string
return path.join(dataDir, domain, timestampSlug, `${timestampSlug}.md`);
}
function defaultWaitPrompt(): string {
return "A browser window has been opened. If the page requires login or verification, complete it first, then press Enter to capture.";
}
async function waitForUserSignal(prompt: string): Promise<void> {
console.log(prompt);
const rl = createInterface({ input: process.stdin, output: process.stdout });
await new Promise<void>((resolve) => {
rl.once("line", () => { rl.close(); resolve(); });
});
}
async function captureUrlOnce(args: Args, options: CaptureAttemptOptions): Promise<ConversionResult> {
const reusing = options.existingPort !== undefined;
const port = options.existingPort ?? await getFreePort();
const chrome = reusing ? null : await launchChrome(args.url, port, options.headless);
if (reusing) {
console.log(`Reusing existing Chrome on port ${port}`);
} else {
console.log(`Launching Chrome (${options.headless ? "headless" : "headed"})...`);
}
let cdp: CdpConnection | null = null;
let targetId: string | null = null;
@@ -235,8 +271,8 @@ async function captureUrl(args: Args): Promise<ConversionResult> {
await cdp.send("Page.enable", {}, { sessionId });
}
if (args.wait) {
await waitForUserSignal();
if (options.wait) {
await waitForUserSignal(options.waitPrompt ?? defaultWaitPrompt());
} else {
console.log("Waiting for page to load...");
await Promise.race([
@@ -251,11 +287,12 @@ async function captureUrl(args: Args): Promise<ConversionResult> {
}
console.log("Capturing page content...");
const { html } = await evaluateScript<{ html: string }>(
const snapshot = await evaluateScript<CaptureSnapshot>(
cdp, sessionId, absolutizeUrlsScript, args.timeout
);
return await extractContent(html, args.url);
return await extractContent(snapshot.html, snapshot.finalUrl || args.url, {
preserveBase64Images: args.downloadMedia,
});
} finally {
if (reusing) {
if (cdp && targetId) {
@@ -272,10 +309,67 @@ async function captureUrl(args: Args): Promise<ConversionResult> {
}
}
async function runHeadedFlow(
args: Args,
options: { existingPort?: number; wait: boolean; waitPrompt?: string }
): Promise<ConversionResult> {
return await captureUrlOnce(args, {
headless: false,
wait: options.wait,
existingPort: options.existingPort,
waitPrompt: options.waitPrompt,
});
}
async function captureUrl(args: Args): Promise<ConversionResult> {
const existingPort = await findExistingChromePort();
if (existingPort !== null) {
console.log("Found an existing Chrome session for this profile. Reusing it instead of launching a new browser.");
return await runHeadedFlow(args, {
existingPort,
wait: args.wait,
waitPrompt: args.wait ? defaultWaitPrompt() : undefined,
});
}
if (args.browserMode === "headless") {
return await captureUrlOnce(args, { headless: true, wait: false });
}
if (args.browserMode === "headed") {
return await runHeadedFlow(args, {
wait: args.wait,
waitPrompt: args.wait ? defaultWaitPrompt() : undefined,
});
}
if (args.wait) {
return await runHeadedFlow(args, {
wait: true,
waitPrompt: defaultWaitPrompt(),
});
}
try {
return await captureUrlOnce(args, { headless: true, wait: false });
} catch (error) {
const headlessMessage = error instanceof Error ? error.message : String(error);
console.warn(`Headless capture failed: ${headlessMessage}`);
console.log("Retrying with a visible browser window...");
try {
return await runHeadedFlow(args, { wait: false });
} catch (headedError) {
const headedMessage = headedError instanceof Error ? headedError.message : String(headedError);
throw new Error(`Headless capture failed (${headlessMessage}); headed retry failed (${headedMessage})`);
}
}
}
async function main(): Promise<void> {
const args = parseArgs(process.argv);
if (!args.url) {
console.error("Usage: bun main.ts <url> [-o output.md] [--output-dir dir] [--wait] [--timeout ms] [--download-media]");
console.error("Usage: bun main.ts <url> [-o output.md] [--output-dir dir] [--wait] [--browser auto|headless|headed] [--timeout ms] [--download-media]");
process.exit(1);
}
@@ -286,6 +380,16 @@ async function main(): Promise<void> {
process.exit(1);
}
if (!BROWSER_MODES.has(args.browserMode)) {
console.error(`Invalid --browser mode: ${args.browserMode}. Expected auto, headless, or headed.`);
process.exit(1);
}
if (args.wait && args.browserMode === "headless") {
console.error("Error: --wait requires a visible browser. Use --browser auto or --browser headed.");
process.exit(1);
}
if (args.output) {
const stat = await import("node:fs").then(fs => fs.statSync(args.output!, { throwIfNoEntry: false }));
if (stat?.isDirectory()) {
@@ -296,6 +400,7 @@ async function main(): Promise<void> {
console.log(`Fetching: ${args.url}`);
console.log(`Mode: ${args.wait ? "wait" : "auto"}`);
console.log(`Browser: ${args.browserMode}`);
let outputPath: string;
let htmlSnapshotPath: string | null = null;
@ -306,7 +411,7 @@ async function main(): Promise<void> {
try {
const result = await captureUrl(args);
document = createMarkdownDocument(result);
outputPath = args.output || await generateOutputPath(args.url, result.metadata.title, args.outputDir, document);
outputPath = args.output || await generateOutputPath(result.metadata.url || args.url, result.metadata.title, args.outputDir, document);
const outputDir = path.dirname(outputPath);
htmlSnapshotPath = deriveHtmlSnapshotPath(outputPath);
await mkdir(outputDir, { recursive: true });


@@ -0,0 +1,40 @@
import assert from "node:assert/strict";
import { mkdtemp, readFile, readdir } from "node:fs/promises";
import os from "node:os";
import path from "node:path";
import test from "node:test";
import { localizeMarkdownMedia } from "./media-localizer.js";
const PNG_1X1_BASE64 =
"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mP8/x8AAwMCAO7Z0ioAAAAASUVORK5CYII=";
test("localizeMarkdownMedia saves embedded base64 images into imgs directory", async () => {
const tempDir = await mkdtemp(path.join(os.tmpdir(), "url-to-markdown-media-"));
const dataUri = `data:image/png;base64,${PNG_1X1_BASE64}`;
const markdown = [
"---",
`coverImage: "${dataUri}"`,
"---",
"",
"# Embedded Image",
"",
`![inline](${dataUri})`,
"",
].join("\n");
const result = await localizeMarkdownMedia(markdown, {
markdownPath: path.join(tempDir, "post.md"),
});
assert.equal(result.downloadedImages, 1);
assert.equal(result.downloadedVideos, 0);
assert.match(result.markdown, /coverImage: "imgs\/img-001\.png"/);
assert.match(result.markdown, /!\[inline\]\(imgs\/img-001\.png\)/);
const files = await readdir(path.join(tempDir, "imgs"));
assert.deepEqual(files, ["img-001.png"]);
const bytes = await readFile(path.join(tempDir, "imgs", "img-001.png"));
assert.equal(bytes.length, Buffer.from(PNG_1X1_BASE64, "base64").length);
});


@@ -3,10 +3,12 @@ import { mkdir, writeFile } from "node:fs/promises";
type MediaKind = "image" | "video";
type MediaHint = "image" | "unknown";
type MediaSource = "remote" | "data";
type MarkdownLinkCandidate = {
url: string;
hint: MediaHint;
source: MediaSource;
};
export type LocalizeMarkdownMediaOptions = {
@@ -22,8 +24,9 @@ export type LocalizeMarkdownMediaResult = {
videoDir: string | null;
};
const MARKDOWN_LINK_RE = /(!?\[[^\]\n]*\])\((<)?(https?:\/\/[^)\s>]+)(>)?\)/g;
const FRONTMATTER_COVER_RE = /^(coverImage:\s*")(https?:\/\/[^"]+)(")/m;
const MARKDOWN_LINK_RE =
/(!?\[[^\]\n]*\])\((<)?((?:https?:\/\/[^)\s>]+)|(?:data:[^)>\s]+))(>)?\)/g;
const FRONTMATTER_COVER_RE = /^(coverImage:\s*")((?:https?:\/\/[^"]+)|(?:data:[^"]+))(")/m;
const IMAGE_EXTENSIONS = new Set([
"jpg",
@@ -86,6 +89,10 @@ function resolveExtensionFromUrl(rawUrl: string): string | undefined {
return undefined;
}
function resolveExtensionFromContentType(contentType: string): string | undefined {
return normalizeExtension(MIME_EXTENSION_MAP[contentType]);
}
function resolveKindFromContentType(contentType: string): MediaKind | undefined {
if (!contentType) return undefined;
if (contentType.startsWith("image/")) return "image";
@@ -124,7 +131,7 @@ function resolveOutputExtension(
extension: string | undefined,
kind: MediaKind
): string {
const extFromMime = normalizeExtension(MIME_EXTENSION_MAP[contentType]);
const extFromMime = resolveExtensionFromContentType(contentType);
if (extFromMime) return extFromMime;
const normalizedExt = normalizeExtension(extension);
@@ -150,6 +157,10 @@ function sanitizeFileSegment(input: string): string {
}
function resolveFileStem(rawUrl: string, extension: string): string {
if (isDataUri(rawUrl)) {
return "";
}
try {
const parsed = new URL(rawUrl);
const base = path.posix.basename(parsed.pathname);
@@ -172,6 +183,26 @@ function buildFileName(kind: MediaKind, index: number, sourceUrl: string, extens
return `${prefix}-${serial}${suffix}.${extension}`;
}
function isDataUri(value: string): boolean {
return value.startsWith("data:");
}
function parseBase64DataUri(rawUrl: string): { contentType: string; bytes: Buffer } | null {
const match = rawUrl.match(/^data:([^;,]+);base64,([A-Za-z0-9+/=\s]+)$/i);
if (!match?.[1] || !match[2]) return null;
const contentType = normalizeContentType(match[1]);
if (!contentType) return null;
try {
const bytes = Buffer.from(match[2].replace(/\s+/g, ""), "base64");
if (bytes.length === 0) return null;
return { contentType, bytes };
} catch {
return null;
}
}
function collectMarkdownLinkCandidates(markdown: string): MarkdownLinkCandidate[] {
const candidates: MarkdownLinkCandidate[] = [];
const seen = new Set<string>();
@@ -181,7 +212,11 @@ function collectMarkdownLinkCandidates(markdown: string): MarkdownLinkCandidate[
const coverMatch = fmMatch[1]?.match(FRONTMATTER_COVER_RE);
if (coverMatch?.[2] && !seen.has(coverMatch[2])) {
seen.add(coverMatch[2]);
candidates.push({ url: coverMatch[2], hint: "image" });
candidates.push({
url: coverMatch[2],
hint: "image",
source: isDataUri(coverMatch[2]) ? "data" : "remote",
});
}
}
@@ -195,6 +230,7 @@ function collectMarkdownLinkCandidates(markdown: string): MarkdownLinkCandidate[
candidates.push({
url: rawUrl,
hint: label.startsWith("![") ? "image" : "unknown",
source: isDataUri(rawUrl) ? "data" : "remote",
});
}
@@ -244,24 +280,45 @@ export async function localizeMarkdownMedia(
for (const candidate of candidates) {
try {
const response = await fetch(candidate.url, {
method: "GET",
redirect: "follow",
headers: {
"user-agent": DOWNLOAD_USER_AGENT,
},
});
let sourceUrl = candidate.url;
let contentType = "";
let extension: string | undefined;
let kind: MediaKind | undefined;
let bytes: Buffer | null = null;
if (!response.ok) {
log(`[url-to-markdown] Skip media (${response.status}): ${candidate.url}`);
continue;
if (candidate.source === "data") {
const parsed = parseBase64DataUri(candidate.url);
if (!parsed) {
log("[url-to-markdown] Skip embedded media: unsupported or invalid data URI");
continue;
}
contentType = parsed.contentType;
extension = resolveExtensionFromContentType(contentType);
kind = resolveMediaKind(sourceUrl, contentType, extension, candidate.hint);
bytes = parsed.bytes;
} else {
const response = await fetch(candidate.url, {
method: "GET",
redirect: "follow",
headers: {
"user-agent": DOWNLOAD_USER_AGENT,
},
});
if (!response.ok) {
log(`[url-to-markdown] Skip media (${response.status}): ${candidate.url}`);
continue;
}
sourceUrl = response.url || candidate.url;
contentType = normalizeContentType(response.headers.get("content-type"));
extension = resolveExtensionFromUrl(sourceUrl) ?? resolveExtensionFromUrl(candidate.url);
kind = resolveMediaKind(sourceUrl, contentType, extension, candidate.hint);
bytes = Buffer.from(await response.arrayBuffer());
}
const sourceUrl = response.url || candidate.url;
const contentType = normalizeContentType(response.headers.get("content-type"));
const extension = resolveExtensionFromUrl(sourceUrl) ?? resolveExtensionFromUrl(candidate.url);
const kind = resolveMediaKind(sourceUrl, contentType, extension, candidate.hint);
if (!kind) {
if (!kind || !bytes) {
continue;
}
@@ -274,7 +331,6 @@
const fileName = buildFileName(kind, nextIndex, sourceUrl, outputExtension);
const absolutePath = path.join(targetDir, fileName);
const relativePath = path.posix.join(dirName, fileName);
const bytes = Buffer.from(await response.arrayBuffer());
await writeFile(absolutePath, bytes);
replacements.set(candidate.url, relativePath);
@@ -305,6 +361,7 @@ export function countRemoteMedia(markdown: string): { images: number; videos: nu
let images = 0;
let videos = 0;
for (const c of candidates) {
if (c.source !== "remote") continue;
const ext = resolveExtensionFromUrl(c.url);
const kind = resolveKindFromExtension(ext);
if (kind === "video") {


@@ -5,7 +5,7 @@
"dependencies": {
"@mozilla/readability": "^0.6.0",
"baoyu-chrome-cdp": "file:./vendor/baoyu-chrome-cdp",
"defuddle": "^0.12.0",
"defuddle": "^0.14.0",
"jsdom": "^24.1.3",
"linkedom": "^0.18.12",
"turndown": "^7.2.2",