feat(baoyu-url-to-markdown): add browser fallback strategy, content cleaner, and data URI support
- Browser strategy: headless first with automatic retry in visible Chrome on failure
- New `--browser auto|headless|headed` flag with `--headless`/`--headed` shortcuts
- Content cleaner module for HTML preprocessing (remove ads, base64 images, scripts)
- Media localizer now handles base64 data URIs alongside remote URLs
- Capture `finalUrl` from browser to track redirects for output path
- Agent quality gate documentation for post-capture validation
- Upgrade defuddle ^0.12.0 → ^0.14.0
- Add unit tests for content-cleaner, html-to-markdown, legacy-converter, media-localizer
Parent: 40f9f05c22
Commit: e99ce744cd
@ -118,6 +118,7 @@ Full reference: [references/config/first-time-setup.md](references/config/first-

## Features

- Chrome CDP for full JavaScript rendering
- Browser strategy fallback: default headless first, then visible Chrome on technical failure
- URL-specific parser layer for sites that need custom HTML rules before generic extraction
- Two capture modes: auto or wait-for-user
- Save rendered HTML as a sibling `-captured.html` file
@ -137,6 +138,12 @@ Full reference: [references/config/first-time-setup.md](references/config/first-

```sh
# Auto mode (default) - capture when page loads
${BUN_X} {baseDir}/scripts/main.ts <url>

# Force headless only
${BUN_X} {baseDir}/scripts/main.ts <url> --browser headless

# Force visible browser
${BUN_X} {baseDir}/scripts/main.ts <url> --browser headed

# Wait mode - wait for user signal before capture
${BUN_X} {baseDir}/scripts/main.ts <url> --wait
```
@ -158,6 +165,9 @@ ${BUN_X} {baseDir}/scripts/main.ts <url> --download-media

| `-o <path>` | Output file path — must be a **file** path, not directory (default: auto-generated) |
| `--output-dir <dir>` | Base output directory — auto-generates `{dir}/{domain}/{slug}.md` (default: `./url-to-markdown/`) |
| `--wait` | Wait for user signal before capturing |
| `--browser <mode>` | Browser strategy: `auto` (default), `headless`, or `headed` |
| `--headless` | Shortcut for `--browser headless` |
| `--headed` | Shortcut for `--browser headed` |
| `--timeout <ms>` | Page load timeout (default: 30000) |
| `--download-media` | Download image/video assets to local `imgs/` and `videos/`, and rewrite markdown links to local relative paths |
@ -165,7 +175,7 @@ ${BUN_X} {baseDir}/scripts/main.ts <url> --download-media

| Mode | Behavior | Use When |
|------|----------|----------|
-| Auto (default) | Capture on network idle | Public pages, static content |
+| Auto (default) | Try headless first, then retry in visible Chrome if needed | Public pages, static content, unknown pages |
| Wait (`--wait`) | User signals when ready | Login-required, lazy loading, paywalls |

**Wait mode workflow**:
@ -173,6 +183,43 @@ ${BUN_X} {baseDir}/scripts/main.ts <url> --download-media

2. Ask user to confirm page is ready
3. Send newline to stdin to trigger capture

**Default browser fallback**:

1. Auto mode starts with headless Chrome and captures on network idle
2. If headless capture fails technically, retry with visible Chrome
3. If a shared Chrome session for this profile already exists, reuse it instead of launching a new browser
4. The script does not hard-code login or paywall detection; the agent must inspect the captured markdown or HTML and decide whether to rerun with `--browser headed --wait`
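The headless-first retry can be sketched as follows; `captureWithFallback` and the capture callback are illustrative names, not the script's actual API.

```typescript
// Sketch of the headless-first strategy: the capture callback stands in
// for the real CDP capture. Only a thrown error (a technical failure)
// triggers the headed retry; content-quality problems are left to the
// agent's quality gate.
type CaptureFn = (headless: boolean) => Promise<string>;

async function captureWithFallback(capture: CaptureFn): Promise<string> {
  try {
    return await capture(true); // headless attempt, capture on network idle
  } catch {
    return capture(false); // retry in visible Chrome
  }
}
```

In the real script the headed retry also reuses an existing shared Chrome session for the profile when one is available, instead of launching a new browser.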
## Agent Quality Gate

**CRITICAL**: The agent must treat headless capture as provisional. Some sites render differently in headless mode and can silently return an error shell, a partially hydrated page, or a low-quality extraction **without** causing the CLI to fail.

After every run that used `--browser auto` or `--browser headless`, the agent **MUST** inspect the saved markdown first, and inspect the saved `-captured.html` when the markdown looks suspicious.

### Quality checks the agent must perform

1. Confirm the markdown title matches the target page, not a generic site shell
2. Confirm the body contains the expected article or page content, not just navigation, footer, or a generic error
3. Watch for obvious failure signs such as:
   - `Application error`
   - `This page could not be found`
   - login, signup, subscribe, or verification shells
   - extremely short markdown for a page that should be long-form
   - raw framework payloads or mostly boilerplate content
4. If the result is low quality, incomplete, or clearly wrong, do **not** accept the run as successful just because the CLI exited with code 0
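A minimal sketch of those checks as code; the patterns and length threshold are illustrative, and the real gate is the agent's judgment rather than a fixed heuristic:

```typescript
// Illustrative post-capture heuristic: flag markdown that matches known
// failure signs or is suspiciously short for a long-form page.
const FAILURE_SIGNS: RegExp[] = [
  /application error/i,
  /this page could not be found/i,
  /\b(log ?in|sign ?up|subscribe|verify your)\b/i,
];

function looksSuspicious(markdown: string, expectLongForm = true): boolean {
  if (expectLongForm && markdown.trim().length < 500) return true;
  return FAILURE_SIGNS.some((pattern) => pattern.test(markdown));
}
```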
### Recovery workflow the agent must follow

1. First run with default `auto` unless there is already a clear reason to use wait mode
2. Review markdown quality immediately after the run
3. If the content is low quality, rerun locally with visible Chrome:
   - `--browser headed` for ordinary rendering issues
   - `--browser headed --wait` when the page may need login, anti-bot interaction, cookie acceptance, or extra hydration time
4. If `--wait` is used, tell the user exactly what to do:
   - if login is required, ask them to sign in
   - if the page needs time to hydrate, ask them to wait until the full content is visible
   - once ready, ask them to press Enter so capture can continue
5. Only fall back to hosted `defuddle.md` after the local browser strategies have failed or are clearly lower fidelity

## Output Format

Each run saves two files side by side:
@ -211,8 +258,9 @@ Conversion order:

2. If no specialized parser matches, try Defuddle
3. For rich pages such as YouTube, prefer Defuddle's extractor-specific output (including transcripts when available) instead of replacing it with the legacy pipeline
4. If Defuddle throws, cannot load, returns obviously incomplete markdown, or captures lower-quality content than the legacy pipeline, automatically fall back to the pre-Defuddle extractor
-5. If the entire local browser capture flow fails before markdown can be produced, try the hosted `https://defuddle.md/<url>` API and save its markdown output directly
-6. The legacy fallback path uses the older Readability/selector/Next.js-data based HTML-to-Markdown implementation recovered from git history
+5. If the agent determines the captured result is a login screen, verification screen, or paywall shell, rerun locally with `--browser headed --wait` and ask the user to complete access before capture
+6. If the entire local browser capture flow still fails before markdown can be produced, try the hosted `https://defuddle.md/<url>` API and save its markdown output directly
+7. The legacy fallback path uses the older Readability/selector/Next.js-data based HTML-to-Markdown implementation recovered from git history

CLI output will show:
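The numbered order above is essentially a chain of converters tried until one produces usable markdown; a generic sketch of that pattern (converter names are illustrative):

```typescript
// Generic fallback chain: try each converter in order. A converter can
// decline by returning null/empty markdown or by throwing; either way
// the next strategy runs.
type Converter = (html: string) => string | null;

function convertWithFallbacks(html: string, converters: Converter[]): string | null {
  for (const convert of converters) {
    try {
      const markdown = convert(html);
      if (markdown && markdown.trim().length > 0) return markdown;
    } catch {
      // Fall through to the next strategy.
    }
  }
  return null;
}
```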
@ -6,7 +6,7 @@

"dependencies": {
  "@mozilla/readability": "^0.6.0",
  "baoyu-chrome-cdp": "file:./vendor/baoyu-chrome-cdp",
- "defuddle": "^0.12.0",
+ "defuddle": "^0.14.0",
  "jsdom": "^24.1.3",
  "linkedom": "^0.18.12",
  "turndown": "^7.2.2",
@ -61,7 +61,7 @@

"decimal.js": ["decimal.js@10.6.0", "", {}, "sha512-YpgQiITW3JXGntzdUmyUR1V812Hn8T1YVXhCu+wO3OpS4eU9l4YdD3qjyiKdV6mvV29zapkMeD390UVEf2lkUg=="],

- "defuddle": ["defuddle@0.12.0", "", { "dependencies": { "commander": "^12.1.0" }, "optionalDependencies": { "mathml-to-latex": "^1.5.0", "temml": "^0.13.1", "turndown": "^7.2.0" }, "peerDependencies": { "jsdom": "^24.0.0" }, "bin": { "defuddle": "dist/cli.js" } }, "sha512-Y/WgyGKBxwxFir+hWNth4nmWDDDb8BzQi3qASS2NWYPXsKU42Ku49/3M5yFYefnRef9prynnmasfnXjk99EWgA=="],
+ "defuddle": ["defuddle@0.14.0", "", { "dependencies": { "commander": "^12.1.0" }, "optionalDependencies": { "linkedom": "^0.18.12", "mathml-to-latex": "^1.5.0", "temml": "^0.13.1", "turndown": "^7.2.0" }, "bin": { "defuddle": "dist/cli.js" } }, "sha512-btavZGd1WgiVqrVM62WGRXMUi/aU7ckTZiq0xXWLZMHvzIqNZjwIFQEDRx8MarD7fIgsB90NXZ9xHJkKtapt2Q=="],

"delayed-stream": ["delayed-stream@1.0.0", "", {}, "sha512-ZySD7Nf91aLB0RxL4KGrKHBXl7Eds1DAmEdcoVawXnLD7SDhpNgtuII2aAkg7a7QS41jxPSZ17p4VdGnMHk3MQ=="],
@ -0,0 +1,55 @@

```typescript
import assert from "node:assert/strict";
import test from "node:test";

import { cleanContent } from "./content-cleaner.js";

const SAMPLE_HTML = `<!doctype html>
<html>
  <head>
    <title>Example Story</title>
    <style>.cookie-banner { position: fixed; }</style>
    <script>window.__noise = true;</script>
  </head>
  <body>
    <!-- comment that should be removed -->
    <header>
      <nav>
        <a href="/home">Home</a>
        <a href="/topics">Topics</a>
      </nav>
    </header>
    <div class="cookie-banner">Accept cookies</div>
    <aside>Sidebar links</aside>
    <main>
      <article class="content">
        <h1>Actual Story Title</h1>
        <p>
          This is the first paragraph of the real story body, and it is intentionally long enough
          to survive the cleaner's main-content heuristics without being mistaken for navigation.
        </p>
        <p>
          This is the second paragraph with more useful detail, a
          <a href="/read-more">supporting link</a>, and a normal image.
        </p>
        <img src="/images/cover.jpg" alt="Cover">
        <img src="data:image/png;base64,AAAA" alt="Inline data">
      </article>
    </main>
    <footer>Footer boilerplate</footer>
  </body>
</html>`;

test("cleanContent keeps the article body and removes obvious boilerplate", () => {
  const cleaned = cleanContent(SAMPLE_HTML, "https://example.com/posts/story");

  assert.match(cleaned, /Actual Story Title/);
  assert.match(cleaned, /https:\/\/example\.com\/read-more/);
  assert.match(cleaned, /https:\/\/example\.com\/images\/cover\.jpg/);

  assert.doesNotMatch(cleaned, /Accept cookies/);
  assert.doesNotMatch(cleaned, /Sidebar links/);
  assert.doesNotMatch(cleaned, /Footer boilerplate/);
  assert.doesNotMatch(cleaned, /window\.__noise/);
  assert.doesNotMatch(cleaned, /comment that should be removed/);
  assert.doesNotMatch(cleaned, /data:image\/png;base64/);
});
```
@ -0,0 +1,432 @@

```typescript
import { parseHTML } from "linkedom";

export interface CleaningOptions {
  removeAds?: boolean;
  removeBase64Images?: boolean;
  onlyMainContent?: boolean;
  includeTags?: string[];
  excludeTags?: string[];
}

// Non-content machinery that is always stripped before extraction.
const ALWAYS_REMOVE_SELECTORS = [
  "script", "style", "noscript", "link[rel='stylesheet']",
  "[hidden]", "[aria-hidden='true']",
  "[style*='display: none']", "[style*='display:none']",
  "[style*='visibility: hidden']", "[style*='visibility:hidden']",
  "svg[aria-hidden='true']", "svg.icon", "svg[class*='icon']",
  "template", "meta", "iframe", "canvas", "object", "embed",
  "form", "input", "select", "textarea", "button",
];

// Modals, cookie banners, and other fixed/sticky overlays.
const OVERLAY_SELECTORS = [
  "[class*='modal']", "[class*='popup']", "[class*='overlay']", "[class*='dialog']",
  "[role='dialog']", "[role='alertdialog']",
  "[class*='cookie']", "[class*='consent']", "[class*='gdpr']",
  "[class*='privacy-banner']", "[class*='notification-bar']",
  "[id*='cookie']", "[id*='consent']", "[id*='gdpr']",
  "[style*='position: fixed']", "[style*='position:fixed']",
  "[style*='position: sticky']", "[style*='position:sticky']",
];

// Chrome around the article: navigation, sidebars, social widgets, ads.
const NAVIGATION_SELECTORS = [
  "header", "footer", "nav", "aside",
  ".header", ".top", ".navbar", "#header",
  ".footer", ".bottom", "#footer",
  ".sidebar", ".side", ".aside", "#sidebar",
  ".modal", ".popup", "#modal", ".overlay",
  ".ad", ".ads", ".advert", "#ad",
  ".lang-selector", ".language", "#language-selector",
  ".social", ".social-media", ".social-links", "#social",
  ".menu", ".navigation", "#nav",
  ".breadcrumbs", "#breadcrumbs",
  ".share", "#share",
  ".widget", "#widget",
  ".cookie", "#cookie",
];

// Containers that must survive navigation stripping even when they also
// match a removal selector.
const FORCE_INCLUDE_SELECTORS = [
  "#main", "#content", "#main-content", "#article", "#post", "#page-content",
  "main", "article", "[role='main']",
  ".main-content", ".content", ".post-content", ".article-content",
  ".entry-content", ".page-content", ".article-body", ".post-body",
  ".story-content", ".blog-content",
];

// Ad units and tracking pixels.
const AD_SELECTORS = [
  "ins.adsbygoogle", ".google-ad", ".adsense",
  "[data-ad]", "[data-ads]", "[data-ad-slot]", "[data-ad-client]",
  ".ad-container", ".ad-wrapper", ".advertisement", ".sponsored-content",
  "img[width='1'][height='1']",
  "img[src*='pixel']", "img[src*='tracking']", "img[src*='analytics']",
];

function getLinkDensity(element: Element): number {
  const text = element.textContent || "";
  const textLength = text.trim().length;
  if (textLength === 0) return 1;

  let linkLength = 0;
  element.querySelectorAll("a").forEach((link: Element) => {
    linkLength += (link.textContent || "").trim().length;
  });

  return linkLength / textLength;
}

function getContentScore(element: Element): number {
  let score = 0;
  const text = element.textContent || "";
  const textLength = text.trim().length;

  score += Math.min(textLength / 100, 50);
  score += element.querySelectorAll("p").length * 3;
  score += element.querySelectorAll("h1, h2, h3, h4, h5, h6").length * 2;
  score += element.querySelectorAll("img").length;

  score -= element.querySelectorAll("a").length * 0.5;
  score -= element.querySelectorAll("li").length * 0.2;

  const linkDensity = getLinkDensity(element);
  if (linkDensity > 0.5) score -= 30;
  else if (linkDensity > 0.3) score -= 15;

  const className = typeof element.className === "string" ? element.className : "";
  const classAndId = `${className} ${element.id || ""}`;
  if (/article|content|post|body|main|entry/i.test(classAndId)) score += 25;
  if (/comment|sidebar|footer|nav|menu|header|widget|ad/i.test(classAndId)) score -= 25;

  return score;
}

function looksLikeNavigation(element: Element): boolean {
  const linkDensity = getLinkDensity(element);
  if (linkDensity > 0.5) return true;

  const listItems = element.querySelectorAll("li");
  const links = element.querySelectorAll("a");
  return listItems.length > 5 && links.length > listItems.length * 0.8;
}

function removeElements(document: Document, selectors: string[]): void {
  for (const selector of selectors) {
    try {
      document.querySelectorAll(selector).forEach((element: Element) => element.remove());
    } catch {
      // Ignore unsupported selectors from linkedom/jsdom differences.
    }
  }
}

function removeWithProtection(
  document: Document,
  selectorsToRemove: string[],
  protectedSelectors: string[]
): void {
  for (const selector of selectorsToRemove) {
    try {
      document.querySelectorAll(selector).forEach((element: Element) => {
        const isProtected = protectedSelectors.some((protectedSelector) => {
          try {
            return element.matches(protectedSelector);
          } catch {
            return false;
          }
        });
        if (isProtected) return;

        const containsProtected = protectedSelectors.some((protectedSelector) => {
          try {
            return element.querySelector(protectedSelector) !== null;
          } catch {
            return false;
          }
        });
        if (containsProtected) return;

        element.remove();
      });
    } catch {
      // Ignore unsupported selectors from linkedom/jsdom differences.
    }
  }
}

function findMainContent(document: Document): Element | null {
  const isValidContent = (element: Element | null): element is Element => {
    if (!element) return false;
    const text = element.textContent || "";
    if (text.trim().length < 100) return false;
    return !looksLikeNavigation(element);
  };

  const main = document.querySelector("main");
  if (isValidContent(main) && getLinkDensity(main) < 0.4) return main;

  const roleMain = document.querySelector('[role="main"]');
  if (isValidContent(roleMain) && getLinkDensity(roleMain) < 0.4) return roleMain;

  const articles = document.querySelectorAll("article");
  if (articles.length === 1 && isValidContent(articles[0] ?? null)) {
    return articles[0] ?? null;
  }

  const contentSelectors = [
    "#content", "#main-content", "#main",
    ".content", ".main-content", ".post-content", ".article-content",
    ".entry-content", ".page-content", ".article-body", ".post-body",
    ".story-content", ".blog-content",
  ];

  for (const selector of contentSelectors) {
    try {
      const element = document.querySelector(selector);
      if (isValidContent(element) && getLinkDensity(element) < 0.4) {
        return element;
      }
    } catch {
      // Ignore invalid selectors.
    }
  }

  // Last resort: score generic containers and take a clear winner only.
  const candidates: Array<{ element: Element; score: number }> = [];
  const containers = document.querySelectorAll("div, section, article");
  containers.forEach((element: Element) => {
    const text = element.textContent || "";
    if (text.trim().length < 200) return;

    const score = getContentScore(element);
    if (score > 0) {
      candidates.push({ element, score });
    }
  });

  candidates.sort((left, right) => right.score - left.score);
  if ((candidates[0]?.score ?? 0) > 20) {
    return candidates[0]?.element ?? null;
  }

  return null;
}

function removeBase64ImagesFromDocument(document: Document): void {
  document.querySelectorAll("img[src^='data:']").forEach((element: Element) => {
    element.remove();
  });

  document.querySelectorAll("[style*='data:image']").forEach((element: Element) => {
    const style = element.getAttribute("style");
    if (!style) return;

    const cleanedStyle = style.replace(
      /background(-image)?:\s*url\([^)]*data:image[^)]*\)[^;]*;?/gi,
      ""
    );

    if (cleanedStyle.trim()) {
      element.setAttribute("style", cleanedStyle);
    } else {
      element.removeAttribute("style");
    }
  });

  document.querySelectorAll("source[src^='data:'], source[srcset*='data:']").forEach((element: Element) => {
    element.remove();
  });
}

function makeAbsoluteUrl(value: string, baseUrl: string): string | null {
  try {
    return new URL(value, baseUrl).toString();
  } catch {
    return null;
  }
}

function convertRelativeUrls(document: Document, baseUrl: string): void {
  document.querySelectorAll("[src]").forEach((element: Element) => {
    const src = element.getAttribute("src");
    if (!src || src.startsWith("http") || src.startsWith("//") || src.startsWith("data:")) return;

    const absolute = makeAbsoluteUrl(src, baseUrl);
    if (absolute) element.setAttribute("src", absolute);
  });

  document.querySelectorAll("[href]").forEach((element: Element) => {
    const href = element.getAttribute("href");
    if (
      !href ||
      href.startsWith("http") ||
      href.startsWith("//") ||
      href.startsWith("#") ||
      href.startsWith("mailto:") ||
      href.startsWith("tel:") ||
      href.startsWith("javascript:")
    ) {
      return;
    }

    const absolute = makeAbsoluteUrl(href, baseUrl);
    if (absolute) element.setAttribute("href", absolute);
  });
}

export function cleanHtml(html: string, baseUrl: string, options: CleaningOptions = {}): string {
  const {
    removeAds = true,
    removeBase64Images = true,
    onlyMainContent = true,
    includeTags,
    excludeTags,
  } = options;

  const { document } = parseHTML(html);

  removeElements(document, ALWAYS_REMOVE_SELECTORS);
  removeElements(document, OVERLAY_SELECTORS);

  if (removeAds) {
    removeElements(document, AD_SELECTORS);
  }

  if (excludeTags?.length) {
    removeElements(document, excludeTags);
  }

  if (onlyMainContent) {
    removeWithProtection(document, NAVIGATION_SELECTORS, FORCE_INCLUDE_SELECTORS);

    const mainContent = findMainContent(document);
    if (mainContent && document.body) {
      const clone = mainContent.cloneNode(true) as Element;
      document.body.innerHTML = "";
      document.body.appendChild(clone);
    }
  }

  if (includeTags?.length && document.body) {
    const matchedElements: Element[] = [];

    for (const selector of includeTags) {
      try {
        document.querySelectorAll(selector).forEach((element: Element) => {
          matchedElements.push(element.cloneNode(true) as Element);
        });
      } catch {
        // Ignore invalid selectors.
      }
    }

    if (matchedElements.length > 0) {
      document.body.innerHTML = "";
      matchedElements.forEach((element) => document.body?.appendChild(element));
    }
  }

  if (removeBase64Images) {
    removeBase64ImagesFromDocument(document);
  }

  // Strip comment nodes (128 === NodeFilter.SHOW_COMMENT).
  const walker = document.createTreeWalker(document, 128);
  const comments: Node[] = [];
  while (walker.nextNode()) {
    comments.push(walker.currentNode);
  }
  comments.forEach((comment) => comment.parentNode?.removeChild(comment));

  convertRelativeUrls(document, baseUrl);

  return document.documentElement?.outerHTML || html;
}

export function cleanContent(html: string, baseUrl: string, options: CleaningOptions = {}): string {
  return cleanHtml(html, baseUrl, options);
}
```
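The data-URI stripping that `removeBase64ImagesFromDocument` applies to `style` attributes can be exercised standalone on a plain style string (the sample values are illustrative):

```typescript
// Same pattern used by the cleaner to strip inline data-URI backgrounds
// from style attributes while keeping the remaining declarations.
const DATA_IMAGE_BACKGROUND = /background(-image)?:\s*url\([^)]*data:image[^)]*\)[^;]*;?/gi;

const style = "color: red; background-image: url(data:image/png;base64,AAAA); padding: 4px";
const cleaned = style.replace(DATA_IMAGE_BACKGROUND, "").trim();
// cleaned keeps "color: red" and "padding: 4px" but no data: URI
```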
@ -0,0 +1,28 @@

```typescript
import assert from "node:assert/strict";
import test from "node:test";

import { extractContent } from "./html-to-markdown.js";

const EMBEDDED_IMAGE_HTML = `<!doctype html>
<html>
  <body>
    <main>
      <article>
        <h1>Embedded Image Story</h1>
        <p>
          This paragraph is intentionally long enough to satisfy the extractor thresholds so the
          resulting markdown keeps the main article body and the embedded image reference.
        </p>
        <img src="data:image/png;base64,AAAA" alt="inline">
      </article>
    </main>
  </body>
</html>`;

test("extractContent preserves base64 images when requested for media download", async () => {
  const result = await extractContent(EMBEDDED_IMAGE_HTML, "https://example.com/embedded", {
    preserveBase64Images: true,
  });

  assert.match(result.markdown, /!\[inline\]\(data:image\/png;base64,AAAA\)/);
});
```
@ -13,10 +13,15 @@ import {
  shouldCompareWithLegacy,
} from "./legacy-converter.js";
import { tryUrlRuleParsers } from "./parsers/index.js";
+ import { cleanContent } from "./content-cleaner.js";

export type { ConversionResult, PageMetadata };
export { createMarkdownDocument, formatMetadataYaml };

+ export interface ExtractContentOptions {
+   preserveBase64Images?: boolean;
+ }

export const absolutizeUrlsScript = String.raw`
(function() {
  const baseUrl = document.baseURI || location.href;

@ -85,7 +90,10 @@ export const absolutizeUrlsScript = String.raw`
  absAttr(htmlClone, "video[poster]", "poster");
  absSrcset(htmlClone, "img[srcset], source[srcset]");

- return { html: "<!doctype html>\n" + htmlClone.outerHTML };
+ return {
+   html: "<!doctype html>\n" + htmlClone.outerHTML,
+   finalUrl: location.href,
+ };
})()
`;

@ -102,7 +110,11 @@ function shouldPreferDefuddle(result: ConversionResult): boolean {
  return /^##?\s+transcript\b/im.test(result.markdown);
}

- export async function extractContent(html: string, url: string): Promise<ConversionResult> {
+ export async function extractContent(
+   html: string,
+   url: string,
+   options: ExtractContentOptions = {}
+ ): Promise<ConversionResult> {
  const capturedAt = new Date().toISOString();
  const baseMetadata = extractMetadataFromHtml(html, url, capturedAt);

@ -111,14 +123,23 @@ export async function extractContent(html: string, url: string): Promise<Convers
    return specializedResult;
  }

- const defuddleResult = await tryDefuddleConversion(html, url, baseMetadata);
+ let cleanedHtml = html;
+ try {
+   cleanedHtml = cleanContent(html, url, {
+     removeBase64Images: !options.preserveBase64Images,
+   });
+ } catch {
+   cleanedHtml = html;
+ }
+
+ const defuddleResult = await tryDefuddleConversion(cleanedHtml, url, baseMetadata);
  if (defuddleResult.ok) {
    if (shouldPreferDefuddle(defuddleResult.result)) {
-     return defuddleResult.result;
+     return { ...defuddleResult.result, rawHtml: html };
    }

    if (shouldCompareWithLegacy(defuddleResult.result.markdown)) {
-     const legacyResult = convertWithLegacyExtractor(html, baseMetadata);
+     const legacyResult = convertWithLegacyExtractor(html, baseMetadata, cleanedHtml);
      const legacyScore = scoreMarkdownQuality(legacyResult.markdown);
      const defuddleScore = scoreMarkdownQuality(defuddleResult.result.markdown);

@ -130,10 +151,10 @@ export async function extractContent(html: string, url: string): Promise<Convers
      }
    }

-   return defuddleResult.result;
+   return { ...defuddleResult.result, rawHtml: html };
  }

- const fallbackResult = convertWithLegacyExtractor(html, baseMetadata);
+ const fallbackResult = convertWithLegacyExtractor(html, baseMetadata, cleanedHtml);
  return {
    ...fallbackResult,
    fallbackReason: defuddleResult.reason,
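With `finalUrl` captured from the browser, the auto-generated output path can follow the post-redirect domain; a sketch matching the documented `{dir}/{domain}/{slug}.md` shape (the helper name is illustrative, not the script's actual API):

```typescript
// Derive the output path from the browser's final URL so a redirected
// capture is filed under the domain the page actually resolved to.
function outputPathFor(finalUrl: string, outputDir = "./url-to-markdown"): string {
  const { hostname, pathname } = new URL(finalUrl);
  const slug = pathname.replace(/\/+$/, "").split("/").pop() || "index";
  return `${outputDir}/${hostname}/${slug}.md`;
}
```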
@ -0,0 +1,48 @@

```typescript
import assert from "node:assert/strict";
import test from "node:test";

import { cleanContent } from "./content-cleaner.js";
import { convertWithLegacyExtractor } from "./legacy-converter.js";
import { extractMetadataFromHtml } from "./markdown-conversion-shared.js";

const CAPTURED_AT = "2026-03-24T03:00:00.000Z";

const NEXT_DATA_HTML = `<!doctype html>
<html>
  <head>
    <title>Hydrated Story</title>
  </head>
  <body>
    <div class="cookie-banner">Accept cookies</div>
    <main>
      <p>Short teaser text that should not win over the structured article payload.</p>
    </main>
    <script id="__NEXT_DATA__" type="application/json">
    {
      "props": {
        "pageProps": {
          "article": {
            "title": "Hydrated Story",
            "description": "A structured article payload from Next.js",
            "body": "<p>The full article lives in __NEXT_DATA__ and should still be extracted even when the cleaned HTML removes scripts before the selector and readability passes run.</p><p>A second paragraph keeps the content comfortably above the minimum extraction threshold and proves the legacy extractor still has access to the original structured payload.</p>"
          }
        }
      }
    }
    </script>
  </body>
</html>`;

test("legacy extractor still uses original __NEXT_DATA__ after HTML cleaning", () => {
  const url = "https://example.com/posts/hydrated-story";
  const baseMetadata = extractMetadataFromHtml(NEXT_DATA_HTML, url, CAPTURED_AT);
  const cleanedHtml = cleanContent(NEXT_DATA_HTML, url);

  const result = convertWithLegacyExtractor(NEXT_DATA_HTML, baseMetadata, cleanedHtml);

  assert.equal(result.conversionMethod, "legacy:next-data");
  assert.match(result.markdown, /The full article lives in .*NEXT.*DATA/);
  assert.match(result.markdown, /A second paragraph keeps the content comfortably above the minimum extraction threshold/);
  assert.doesNotMatch(result.markdown, /Short teaser text that should not win/);
  assert.equal(result.rawHtml, NEXT_DATA_HTML);
});
```
@ -336,29 +336,32 @@ function tryNextDataExtraction(document: Document): ExtractionCandidate | null {
|
|||
|
||||
function buildReadabilityCandidate(
|
||||
article: ReturnType<Readability["parse"]>,
|
||||
document: Document,
|
||||
referenceDocument: Document,
|
||||
method: string
|
||||
): ExtractionCandidate | null {
|
||||
const textContent = article?.textContent?.trim() ?? "";
|
||||
if (textContent.length < MIN_CONTENT_LENGTH) return null;
|
||||
|
||||
return {
|
||||
title: pickString(article?.title, extractTitle(document)),
|
||||
title: pickString(article?.title, extractTitle(referenceDocument)),
|
||||
byline: pickString((article as { byline?: string } | null)?.byline),
|
||||
excerpt: pickString(article?.excerpt, generateExcerpt(null, textContent)),
|
||||
published: pickString((article as { publishedTime?: string } | null)?.publishedTime, extractPublishedTime(document)),
|
||||
published: pickString(
|
||||
(article as { publishedTime?: string } | null)?.publishedTime,
|
||||
extractPublishedTime(referenceDocument)
|
||||
),
|
||||
html: article?.content ? sanitizeHtml(article.content) : null,
|
||||
textContent,
|
||||
method,
|
||||
};
|
||||
}
|
||||
|
||||
function tryReadability(document: Document): ExtractionCandidate | null {
|
||||
function tryReadability(document: Document, referenceDocument: Document = document): ExtractionCandidate | null {
|
||||
try {
|
||||
const strictClone = document.cloneNode(true) as Document;
|
||||
const strictResult = buildReadabilityCandidate(
|
||||
new Readability(strictClone).parse(),
|
||||
document,
|
||||
referenceDocument,
|
||||
"readability"
|
||||
);
|
||||
if (strictResult) return strictResult;
|
||||
|
|
@@ -366,7 +369,7 @@ function tryReadability(document: Document): ExtractionCandidate | null {
     const relaxedClone = document.cloneNode(true) as Document;
     return buildReadabilityCandidate(
       new Readability(relaxedClone, { charThreshold: 120 }).parse(),
-      document,
+      referenceDocument,
       "readability-relaxed"
     );
   } catch {
@@ -471,14 +474,15 @@ function pickBestCandidate(candidates: ExtractionCandidate[]): ExtractionCandida
   return ranked[0];
 }
 
-function extractFromHtml(html: string): ExtractionCandidate | null {
-  const document = parseDocument(html);
+function extractFromHtml(html: string, cleanedHtml: string = html): ExtractionCandidate | null {
+  const originalDocument = parseDocument(html);
+  const cleanedDocument = parseDocument(cleanedHtml);
 
-  const readabilityCandidate = tryReadability(document);
-  const nextDataCandidate = tryNextDataExtraction(document);
-  const jsonLdCandidate = tryJsonLdExtraction(document);
-  const selectorCandidate = trySelectorExtraction(document);
-  const bodyCandidate = tryBodyExtraction(document);
+  const readabilityCandidate = tryReadability(cleanedDocument, originalDocument);
+  const nextDataCandidate = tryNextDataExtraction(originalDocument);
+  const jsonLdCandidate = tryJsonLdExtraction(originalDocument);
+  const selectorCandidate = trySelectorExtraction(cleanedDocument);
+  const bodyCandidate = tryBodyExtraction(cleanedDocument);
 
   const candidates = [
     readabilityCandidate,
@@ -493,8 +497,8 @@ function extractFromHtml(html: string): ExtractionCandidate | null {
 
   return {
     ...winner,
-    title: winner.title ?? extractTitle(document),
-    published: winner.published ?? extractPublishedTime(document),
+    title: winner.title ?? extractTitle(originalDocument),
+    published: winner.published ?? extractPublishedTime(originalDocument),
     excerpt: winner.excerpt ?? generateExcerpt(null, winner.textContent),
   };
 }
@@ -610,12 +614,16 @@ export function shouldCompareWithLegacy(markdown: string): boolean {
   );
 }
 
-export function convertWithLegacyExtractor(html: string, baseMetadata: PageMetadata): ConversionResult {
-  const extracted = extractFromHtml(html);
+export function convertWithLegacyExtractor(
+  html: string,
+  baseMetadata: PageMetadata,
+  cleanedHtml: string = html
+): ConversionResult {
+  const extracted = extractFromHtml(html, cleanedHtml);
 
   let markdown = extracted?.html ? convertHtmlFragmentToMarkdown(extracted.html) : "";
   if (!markdown.trim()) {
-    markdown = extracted?.textContent?.trim() || fallbackPlainText(html);
+    markdown = extracted?.textContent?.trim() || fallbackPlainText(cleanedHtml);
   }
 
   return {
@@ -29,10 +29,33 @@ interface Args {
   wait: boolean;
   timeout: number;
   downloadMedia: boolean;
+  browserMode: BrowserMode;
 }
 
+type BrowserMode = "auto" | "headless" | "headed";
+
+interface CaptureAttemptOptions {
+  headless: boolean;
+  wait: boolean;
+  existingPort?: number;
+  waitPrompt?: string;
+}
+
+interface CaptureSnapshot {
+  html: string;
+  finalUrl: string;
+}
+
+const BROWSER_MODES = new Set<BrowserMode>(["auto", "headless", "headed"]);
+
 function parseArgs(argv: string[]): Args {
-  const args: Args = { url: "", wait: false, timeout: DEFAULT_TIMEOUT_MS, downloadMedia: false };
+  const args: Args = {
+    url: "",
+    wait: false,
+    timeout: DEFAULT_TIMEOUT_MS,
+    downloadMedia: false,
+    browserMode: "auto",
+  };
   for (let i = 2; i < argv.length; i++) {
     const arg = argv[i];
     if (arg === "--wait" || arg === "-w") {
@@ -45,6 +68,12 @@ function parseArgs(argv: string[]): Args {
       args.outputDir = argv[++i];
     } else if (arg === "--download-media") {
       args.downloadMedia = true;
+    } else if (arg === "--browser") {
+      args.browserMode = (argv[++i] as BrowserMode | undefined) ?? "auto";
+    } else if (arg === "--headless") {
+      args.browserMode = "headless";
+    } else if (arg === "--headed" || arg === "--noheadless" || arg === "--no-headless") {
+      args.browserMode = "headed";
     } else if (!arg.startsWith("-") && !args.url) {
       args.url = arg;
     }
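The three browser flags above all funnel into a single `browserMode` value, with later flags overriding earlier ones and `--browser` consuming the following token. A minimal standalone sketch of that resolution (the `resolveBrowserMode` helper here is illustrative, not part of the diff):

```typescript
// Hypothetical reduction of the flag handling above: later flags win,
// and --browser consumes the next argv token as its value.
type BrowserMode = "auto" | "headless" | "headed";

function resolveBrowserMode(argv: string[]): BrowserMode {
  let mode: BrowserMode = "auto";
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--browser") {
      mode = (argv[++i] as BrowserMode | undefined) ?? "auto";
    } else if (arg === "--headless") {
      mode = "headless";
    } else if (arg === "--headed" || arg === "--no-headless") {
      mode = "headed";
    }
  }
  return mode;
}

console.log(resolveBrowserMode(["--headless", "--browser", "headed"])); // "headed"
```

Note that, as in the diff, an unrecognized value after `--browser` passes through here and is only rejected later by the `BROWSER_MODES` validation in `main`.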
@@ -194,21 +223,28 @@ async function generateOutputPath(url: string, title: string, outputDir?: string
   return path.join(dataDir, domain, timestampSlug, `${timestampSlug}.md`);
 }
 
-async function waitForUserSignal(): Promise<void> {
-  console.log("Page opened. Press Enter when ready to capture...");
+function defaultWaitPrompt(): string {
+  return "A browser window has been opened. If the page requires login or verification, complete it first, then press Enter to capture.";
+}
+
+async function waitForUserSignal(prompt: string): Promise<void> {
+  console.log(prompt);
   const rl = createInterface({ input: process.stdin, output: process.stdout });
   await new Promise<void>((resolve) => {
     rl.once("line", () => { rl.close(); resolve(); });
   });
 }
 
-async function captureUrl(args: Args): Promise<ConversionResult> {
-  const existingPort = await findExistingChromePort();
-  const reusing = existingPort !== null;
-  const port = existingPort ?? await getFreePort();
-  const chrome = reusing ? null : await launchChrome(args.url, port, false);
+async function captureUrlOnce(args: Args, options: CaptureAttemptOptions): Promise<ConversionResult> {
+  const reusing = options.existingPort !== undefined;
+  const port = options.existingPort ?? await getFreePort();
+  const chrome = reusing ? null : await launchChrome(args.url, port, options.headless);
 
-  if (reusing) console.log(`Reusing existing Chrome on port ${port}`);
+  if (reusing) {
+    console.log(`Reusing existing Chrome on port ${port}`);
+  } else {
+    console.log(`Launching Chrome (${options.headless ? "headless" : "headed"})...`);
+  }
 
   let cdp: CdpConnection | null = null;
   let targetId: string | null = null;
@@ -235,8 +271,8 @@ async function captureUrl(args: Args): Promise<ConversionResult> {
     await cdp.send("Page.enable", {}, { sessionId });
   }
 
-  if (args.wait) {
-    await waitForUserSignal();
+  if (options.wait) {
+    await waitForUserSignal(options.waitPrompt ?? defaultWaitPrompt());
   } else {
     console.log("Waiting for page to load...");
     await Promise.race([
@@ -251,11 +287,12 @@ async function captureUrl(args: Args): Promise<ConversionResult> {
   }
 
   console.log("Capturing page content...");
-  const { html } = await evaluateScript<{ html: string }>(
+  const snapshot = await evaluateScript<CaptureSnapshot>(
     cdp, sessionId, absolutizeUrlsScript, args.timeout
   );
 
-  return await extractContent(html, args.url);
+  return await extractContent(snapshot.html, snapshot.finalUrl || args.url, {
+    preserveBase64Images: args.downloadMedia,
+  });
   } finally {
     if (reusing) {
       if (cdp && targetId) {
@@ -272,10 +309,67 @@ async function captureUrl(args: Args): Promise<ConversionResult> {
   }
 }
 
+async function runHeadedFlow(
+  args: Args,
+  options: { existingPort?: number; wait: boolean; waitPrompt?: string }
+): Promise<ConversionResult> {
+  return await captureUrlOnce(args, {
+    headless: false,
+    wait: options.wait,
+    existingPort: options.existingPort,
+    waitPrompt: options.waitPrompt,
+  });
+}
+
+async function captureUrl(args: Args): Promise<ConversionResult> {
+  const existingPort = await findExistingChromePort();
+  if (existingPort !== null) {
+    console.log("Found an existing Chrome session for this profile. Reusing it instead of launching a new browser.");
+    return await runHeadedFlow(args, {
+      existingPort,
+      wait: args.wait,
+      waitPrompt: args.wait ? defaultWaitPrompt() : undefined,
+    });
+  }
+
+  if (args.browserMode === "headless") {
+    return await captureUrlOnce(args, { headless: true, wait: false });
+  }
+
+  if (args.browserMode === "headed") {
+    return await runHeadedFlow(args, {
+      wait: args.wait,
+      waitPrompt: args.wait ? defaultWaitPrompt() : undefined,
+    });
+  }
+
+  if (args.wait) {
+    return await runHeadedFlow(args, {
+      wait: true,
+      waitPrompt: defaultWaitPrompt(),
+    });
+  }
+
+  try {
+    return await captureUrlOnce(args, { headless: true, wait: false });
+  } catch (error) {
+    const headlessMessage = error instanceof Error ? error.message : String(error);
+    console.warn(`Headless capture failed: ${headlessMessage}`);
+    console.log("Retrying with a visible browser window...");
+
+    try {
+      return await runHeadedFlow(args, { wait: false });
+    } catch (headedError) {
+      const headedMessage = headedError instanceof Error ? headedError.message : String(headedError);
+      throw new Error(`Headless capture failed (${headlessMessage}); headed retry failed (${headedMessage})`);
+    }
+  }
+}
+
 async function main(): Promise<void> {
   const args = parseArgs(process.argv);
   if (!args.url) {
-    console.error("Usage: bun main.ts <url> [-o output.md] [--output-dir dir] [--wait] [--timeout ms] [--download-media]");
+    console.error("Usage: bun main.ts <url> [-o output.md] [--output-dir dir] [--wait] [--browser auto|headless|headed] [--timeout ms] [--download-media]");
     process.exit(1);
   }
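The auto-mode control flow above reduces to: run the headless attempt once, and only when it throws, retry headed, merging both error messages if the retry also fails. A generic sketch of that pattern (the function and attempt bodies here are illustrative, not the script's actual API):

```typescript
// Illustrative reduction of captureUrl's auto mode: attempt the headless
// path; on a thrown error, attempt the headed path; if that also throws,
// surface a combined error carrying both failure messages.
async function captureWithFallback<T>(
  headless: () => Promise<T>,
  headed: () => Promise<T>
): Promise<T> {
  try {
    return await headless();
  } catch (error) {
    const first = error instanceof Error ? error.message : String(error);
    try {
      return await headed();
    } catch (headedError) {
      const second = headedError instanceof Error ? headedError.message : String(headedError);
      throw new Error(`Headless capture failed (${first}); headed retry failed (${second})`);
    }
  }
}

// A failing headless attempt falls through to the headed attempt.
captureWithFallback(
  async () => { throw new Error("net::ERR_FAILED"); },
  async () => "<html>captured</html>"
).then((html) => console.log(html)); // "<html>captured</html>"
```

The design choice worth noting is that the fallback only triggers on a thrown (technical) failure; a capture that succeeds but returns poor content is left to the post-capture quality gate rather than retried here.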
@@ -286,6 +380,16 @@ async function main(): Promise<void> {
     process.exit(1);
   }
 
+  if (!BROWSER_MODES.has(args.browserMode)) {
+    console.error(`Invalid --browser mode: ${args.browserMode}. Expected auto, headless, or headed.`);
+    process.exit(1);
+  }
+
+  if (args.wait && args.browserMode === "headless") {
+    console.error("Error: --wait requires a visible browser. Use --browser auto or --browser headed.");
+    process.exit(1);
+  }
+
   if (args.output) {
     const stat = await import("node:fs").then(fs => fs.statSync(args.output!, { throwIfNoEntry: false }));
     if (stat?.isDirectory()) {
@@ -296,6 +400,7 @@ async function main(): Promise<void> {
 
   console.log(`Fetching: ${args.url}`);
   console.log(`Mode: ${args.wait ? "wait" : "auto"}`);
+  console.log(`Browser: ${args.browserMode}`);
 
   let outputPath: string;
   let htmlSnapshotPath: string | null = null;
@@ -306,7 +411,7 @@ async function main(): Promise<void> {
   try {
     const result = await captureUrl(args);
     document = createMarkdownDocument(result);
-    outputPath = args.output || await generateOutputPath(args.url, result.metadata.title, args.outputDir, document);
+    outputPath = args.output || await generateOutputPath(result.metadata.url || args.url, result.metadata.title, args.outputDir, document);
     const outputDir = path.dirname(outputPath);
     htmlSnapshotPath = deriveHtmlSnapshotPath(outputPath);
     await mkdir(outputDir, { recursive: true });
@@ -0,0 +1,40 @@
+import assert from "node:assert/strict";
+import { mkdtemp, readFile, readdir } from "node:fs/promises";
+import os from "node:os";
+import path from "node:path";
+import test from "node:test";
+
+import { localizeMarkdownMedia } from "./media-localizer.js";
+
+const PNG_1X1_BASE64 =
+  "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mP8/x8AAwMCAO7Z0ioAAAAASUVORK5CYII=";
+
+test("localizeMarkdownMedia saves embedded base64 images into imgs directory", async () => {
+  const tempDir = await mkdtemp(path.join(os.tmpdir(), "url-to-markdown-media-"));
+  const dataUri = `data:image/png;base64,${PNG_1X1_BASE64}`;
+  const markdown = [
+    "---",
+    `coverImage: "${dataUri}"`,
+    "---",
+    "",
+    "# Embedded Image",
+    "",
+    `![inline](${dataUri})`,
+    "",
+  ].join("\n");
+
+  const result = await localizeMarkdownMedia(markdown, {
+    markdownPath: path.join(tempDir, "post.md"),
+  });
+
+  assert.equal(result.downloadedImages, 1);
+  assert.equal(result.downloadedVideos, 0);
+  assert.match(result.markdown, /coverImage: "imgs\/img-001\.png"/);
+  assert.match(result.markdown, /!\[inline\]\(imgs\/img-001\.png\)/);
+
+  const files = await readdir(path.join(tempDir, "imgs"));
+  assert.deepEqual(files, ["img-001.png"]);
+
+  const bytes = await readFile(path.join(tempDir, "imgs", "img-001.png"));
+  assert.equal(bytes.length, Buffer.from(PNG_1X1_BASE64, "base64").length);
+});
@@ -3,10 +3,12 @@ import { mkdir, writeFile } from "node:fs/promises";
 
 type MediaKind = "image" | "video";
 type MediaHint = "image" | "unknown";
+type MediaSource = "remote" | "data";
 
 type MarkdownLinkCandidate = {
   url: string;
   hint: MediaHint;
+  source: MediaSource;
 };
 
 export type LocalizeMarkdownMediaOptions = {
@@ -22,8 +24,9 @@ export type LocalizeMarkdownMediaResult = {
   videoDir: string | null;
 };
 
-const MARKDOWN_LINK_RE = /(!?\[[^\]\n]*\])\((<)?(https?:\/\/[^)\s>]+)(>)?\)/g;
-const FRONTMATTER_COVER_RE = /^(coverImage:\s*")(https?:\/\/[^"]+)(")/m;
+const MARKDOWN_LINK_RE =
+  /(!?\[[^\]\n]*\])\((<)?((?:https?:\/\/[^)\s>]+)|(?:data:[^)>\s]+))(>)?\)/g;
+const FRONTMATTER_COVER_RE = /^(coverImage:\s*")((?:https?:\/\/[^"]+)|(?:data:[^"]+))(")/m;
 
 const IMAGE_EXTENSIONS = new Set([
   "jpg",
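The widened `MARKDOWN_LINK_RE` now admits `data:` URIs alongside `http(s)` URLs in the URL capture group, so embedded images survive candidate collection. Exercised on a small sample:

```typescript
// MARKDOWN_LINK_RE as introduced in the diff above: group 3 captures either
// a remote http(s) URL or a data: URI inside a markdown link/image.
const MARKDOWN_LINK_RE =
  /(!?\[[^\]\n]*\])\((<)?((?:https?:\/\/[^)\s>]+)|(?:data:[^)>\s]+))(>)?\)/g;

const sample =
  "![logo](https://example.com/logo.png) and ![dot](data:image/gif;base64,R0lGOD)";
const urls = [...sample.matchAll(MARKDOWN_LINK_RE)].map((m) => m[3]);
console.log(urls); // ["https://example.com/logo.png", "data:image/gif;base64,R0lGOD"]
```

Because `[^)\s>]+` excludes `)`, `>`, and whitespace, a data URI containing one of those characters would terminate the match early; the base64 alphabet used by the localizer avoids all three.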
@@ -86,6 +89,10 @@ function resolveExtensionFromUrl(rawUrl: string): string | undefined {
   return undefined;
 }
 
+function resolveExtensionFromContentType(contentType: string): string | undefined {
+  return normalizeExtension(MIME_EXTENSION_MAP[contentType]);
+}
+
 function resolveKindFromContentType(contentType: string): MediaKind | undefined {
   if (!contentType) return undefined;
   if (contentType.startsWith("image/")) return "image";
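The new `resolveExtensionFromContentType` helper is a thin wrapper over the module's MIME-to-extension map, so the data-URI branch can pick a file extension without a URL to inspect. A standalone sketch (the map entries below are assumptions for illustration; the real `MIME_EXTENSION_MAP` lives elsewhere in media-localizer.ts):

```typescript
// Hypothetical slice of the MIME lookup the helper above wraps:
// a known content type maps to an extension, anything else to undefined.
const MIME_EXTENSION_MAP: Record<string, string | undefined> = {
  "image/png": "png",
  "image/jpeg": "jpg",
  "video/mp4": "mp4",
};

function resolveExtensionFromContentType(contentType: string): string | undefined {
  return MIME_EXTENSION_MAP[contentType];
}

console.log(resolveExtensionFromContentType("image/png")); // "png"
console.log(resolveExtensionFromContentType("text/html")); // undefined
```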
@@ -124,7 +131,7 @@ function resolveOutputExtension(
   extension: string | undefined,
   kind: MediaKind
 ): string {
-  const extFromMime = normalizeExtension(MIME_EXTENSION_MAP[contentType]);
+  const extFromMime = resolveExtensionFromContentType(contentType);
   if (extFromMime) return extFromMime;
 
   const normalizedExt = normalizeExtension(extension);
@@ -150,6 +157,10 @@ function sanitizeFileSegment(input: string): string {
 }
 
 function resolveFileStem(rawUrl: string, extension: string): string {
+  if (isDataUri(rawUrl)) {
+    return "";
+  }
+
   try {
     const parsed = new URL(rawUrl);
     const base = path.posix.basename(parsed.pathname);
@@ -172,6 +183,26 @@ function buildFileName(kind: MediaKind, index: number, sourceUrl: string, extens
   return `${prefix}-${serial}${suffix}.${extension}`;
 }
 
+function isDataUri(value: string): boolean {
+  return value.startsWith("data:");
+}
+
+function parseBase64DataUri(rawUrl: string): { contentType: string; bytes: Buffer } | null {
+  const match = rawUrl.match(/^data:([^;,]+);base64,([A-Za-z0-9+/=\s]+)$/i);
+  if (!match?.[1] || !match[2]) return null;
+
+  const contentType = normalizeContentType(match[1]);
+  if (!contentType) return null;
+
+  try {
+    const bytes = Buffer.from(match[2].replace(/\s+/g, ""), "base64");
+    if (bytes.length === 0) return null;
+    return { contentType, bytes };
+  } catch {
+    return null;
+  }
+}
+
 function collectMarkdownLinkCandidates(markdown: string): MarkdownLinkCandidate[] {
   const candidates: MarkdownLinkCandidate[] = [];
   const seen = new Set<string>();
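`parseBase64DataUri` only accepts `data:<mime>;base64,<payload>` URIs and rejects everything else, including URL-encoded (non-base64) data URIs and empty payloads. A standalone sketch of that behavior, inlining a simplified stand-in for `normalizeContentType` (just lowercasing, an assumption — the real helper lives elsewhere in media-localizer.ts):

```typescript
// Sketch of the data-URI acceptance rule above: match the base64 form,
// decode the payload, and reject empty results.
function decodeDataUri(raw: string): { contentType: string; bytes: Buffer } | null {
  const match = raw.match(/^data:([^;,]+);base64,([A-Za-z0-9+/=\s]+)$/i);
  if (!match?.[1] || !match[2]) return null;
  const bytes = Buffer.from(match[2].replace(/\s+/g, ""), "base64");
  // Simplified stand-in for normalizeContentType: lowercase the MIME type.
  return bytes.length > 0 ? { contentType: match[1].toLowerCase(), bytes } : null;
}

console.log(decodeDataUri("data:image/png;base64,iVBORw0KGgo=")?.contentType); // "image/png"
console.log(decodeDataUri("data:text/plain,hello")); // null (not base64-encoded)
```

Stripping whitespace before decoding matters because some HTML serializers wrap long base64 attributes across lines.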
@@ -181,7 +212,11 @@ function collectMarkdownLinkCandidates(markdown: string): MarkdownLinkCandidate[
     const coverMatch = fmMatch[1]?.match(FRONTMATTER_COVER_RE);
     if (coverMatch?.[2] && !seen.has(coverMatch[2])) {
       seen.add(coverMatch[2]);
-      candidates.push({ url: coverMatch[2], hint: "image" });
+      candidates.push({
+        url: coverMatch[2],
+        hint: "image",
+        source: isDataUri(coverMatch[2]) ? "data" : "remote",
+      });
     }
   }
@@ -195,6 +230,7 @@ function collectMarkdownLinkCandidates(markdown: string): MarkdownLinkCandidate[
       candidates.push({
         url: rawUrl,
         hint: label.startsWith("![") ? "image" : "unknown",
+        source: isDataUri(rawUrl) ? "data" : "remote",
       });
     }
@@ -244,24 +280,45 @@ export async function localizeMarkdownMedia(
 
   for (const candidate of candidates) {
     try {
-      const response = await fetch(candidate.url, {
-        method: "GET",
-        redirect: "follow",
-        headers: {
-          "user-agent": DOWNLOAD_USER_AGENT,
-        },
-      });
+      let sourceUrl = candidate.url;
+      let contentType = "";
+      let extension: string | undefined;
+      let kind: MediaKind | undefined;
+      let bytes: Buffer | null = null;
 
-      if (!response.ok) {
-        log(`[url-to-markdown] Skip media (${response.status}): ${candidate.url}`);
-        continue;
-      }
-
-      const sourceUrl = response.url || candidate.url;
-      const contentType = normalizeContentType(response.headers.get("content-type"));
-      const extension = resolveExtensionFromUrl(sourceUrl) ?? resolveExtensionFromUrl(candidate.url);
-      const kind = resolveMediaKind(sourceUrl, contentType, extension, candidate.hint);
-      if (!kind) {
+      if (candidate.source === "data") {
+        const parsed = parseBase64DataUri(candidate.url);
+        if (!parsed) {
+          log("[url-to-markdown] Skip embedded media: unsupported or invalid data URI");
+          continue;
+        }
+
+        contentType = parsed.contentType;
+        extension = resolveExtensionFromContentType(contentType);
+        kind = resolveMediaKind(sourceUrl, contentType, extension, candidate.hint);
+        bytes = parsed.bytes;
+      } else {
+        const response = await fetch(candidate.url, {
+          method: "GET",
+          redirect: "follow",
+          headers: {
+            "user-agent": DOWNLOAD_USER_AGENT,
+          },
+        });
+
+        if (!response.ok) {
+          log(`[url-to-markdown] Skip media (${response.status}): ${candidate.url}`);
+          continue;
+        }
+
+        sourceUrl = response.url || candidate.url;
+        contentType = normalizeContentType(response.headers.get("content-type"));
+        extension = resolveExtensionFromUrl(sourceUrl) ?? resolveExtensionFromUrl(candidate.url);
+        kind = resolveMediaKind(sourceUrl, contentType, extension, candidate.hint);
+        bytes = Buffer.from(await response.arrayBuffer());
+      }
+
+      if (!kind || !bytes) {
         continue;
       }
@@ -274,7 +331,6 @@ export async function localizeMarkdownMedia(
       const fileName = buildFileName(kind, nextIndex, sourceUrl, outputExtension);
       const absolutePath = path.join(targetDir, fileName);
       const relativePath = path.posix.join(dirName, fileName);
-      const bytes = Buffer.from(await response.arrayBuffer());
       await writeFile(absolutePath, bytes);
       replacements.set(candidate.url, relativePath);
@@ -305,6 +361,7 @@ export function countRemoteMedia(markdown: string): { images: number; videos: nu
   let images = 0;
   let videos = 0;
   for (const c of candidates) {
+    if (c.source !== "remote") continue;
     const ext = resolveExtensionFromUrl(c.url);
     const kind = resolveKindFromExtension(ext);
     if (kind === "video") {
@@ -5,7 +5,7 @@
   "dependencies": {
     "@mozilla/readability": "^0.6.0",
     "baoyu-chrome-cdp": "file:./vendor/baoyu-chrome-cdp",
-    "defuddle": "^0.12.0",
+    "defuddle": "^0.14.0",
     "jsdom": "^24.1.3",
     "linkedom": "^0.18.12",
     "turndown": "^7.2.2",