feat(baoyu-url-to-markdown): add browser fallback strategy, content cleaner, and data URI support

- Browser strategy: headless first with automatic retry in visible Chrome on failure
- New --browser auto|headless|headed flag with --headless/--headed shortcuts
- Content cleaner module for HTML preprocessing (remove ads, base64 images, scripts)
- Media localizer now handles base64 data URIs alongside remote URLs
- Capture finalUrl from browser to track redirects for output path
- Agent quality gate documentation for post-capture validation
- Upgrade defuddle ^0.12.0 → ^0.14.0
- Add unit tests for content-cleaner, html-to-markdown, legacy-converter, media-localizer
Author: Jim Liu 宝玉
Date: 2026-03-24 22:39:17 -05:00
Parent: 40f9f05c22
Commit: e99ce744cd
12 changed files with 909 additions and 67 deletions

View File

@@ -118,6 +118,7 @@ Full reference: [references/config/first-time-setup.md](references/config/first-
## Features
- Chrome CDP for full JavaScript rendering
- Browser strategy fallback: default headless first, then visible Chrome on technical failure
- URL-specific parser layer for sites that need custom HTML rules before generic extraction
- Two capture modes: auto or wait-for-user
- Save rendered HTML as a sibling `-captured.html` file
@@ -137,6 +138,12 @@ Full reference: [references/config/first-time-setup.md](references/config/first-
# Auto mode (default) - capture when page loads
${BUN_X} {baseDir}/scripts/main.ts <url>
# Force headless only
${BUN_X} {baseDir}/scripts/main.ts <url> --browser headless
# Force visible browser
${BUN_X} {baseDir}/scripts/main.ts <url> --browser headed
# Wait mode - wait for user signal before capture
${BUN_X} {baseDir}/scripts/main.ts <url> --wait
@@ -158,6 +165,9 @@ ${BUN_X} {baseDir}/scripts/main.ts <url> --download-media
| `-o <path>` | Output file path — must be a **file** path, not directory (default: auto-generated) |
| `--output-dir <dir>` | Base output directory — auto-generates `{dir}/{domain}/{slug}.md` (default: `./url-to-markdown/`) |
| `--wait` | Wait for user signal before capturing |
| `--browser <mode>` | Browser strategy: `auto` (default), `headless`, or `headed` |
| `--headless` | Shortcut for `--browser headless` |
| `--headed` | Shortcut for `--browser headed` |
| `--timeout <ms>` | Page load timeout (default: 30000) |
| `--download-media` | Download image/video assets to local `imgs/` and `videos/`, and rewrite markdown links to local relative paths |
@@ -165,7 +175,7 @@ ${BUN_X} {baseDir}/scripts/main.ts <url> --download-media
| Mode | Behavior | Use When |
|------|----------|----------|
| Auto (default) | Try headless first, then retry in visible Chrome if needed | Public pages, static content, unknown pages |
| Wait (`--wait`) | User signals when ready | Login-required, lazy loading, paywalls |
**Wait mode workflow**:
@@ -173,6 +183,43 @@ ${BUN_X} {baseDir}/scripts/main.ts <url> --download-media
2. Ask user to confirm page is ready
3. Send newline to stdin to trigger capture
**Default browser fallback**:
1. Auto mode starts with headless Chrome and captures on network idle
2. If headless capture fails technically, retry with visible Chrome
3. If a shared Chrome session for this profile already exists, reuse it instead of launching a new browser
4. The script does not hard-code login or paywall detection; the agent must inspect the captured markdown or HTML and decide whether to rerun with `--browser headed --wait`
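The fallback above can be sketched as a small wrapper; `captureHeadless` and `captureHeaded` are hypothetical stand-ins for the real capture calls in `scripts/main.ts`, and the only logic this sketch commits to is "headless first, one retry in visible Chrome".

```typescript
// Minimal sketch of the auto-mode browser fallback. The capture functions
// are hypothetical placeholders, not the skill's real API.
type Capture = () => Promise<string>;

async function captureWithFallback(
  captureHeadless: Capture,
  captureHeaded: Capture,
): Promise<{ html: string; usedHeaded: boolean }> {
  try {
    // Step 1: start with headless Chrome.
    return { html: await captureHeadless(), usedHeaded: false };
  } catch {
    // Step 2: on technical failure, retry once in visible Chrome.
    return { html: await captureHeaded(), usedHeaded: true };
  }
}
```

Note that this sketch retries on any thrown error; deciding whether a *successful* capture is actually low quality is deliberately left to the agent.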
## Agent Quality Gate
**CRITICAL**: The agent must treat headless capture as provisional. Some sites render differently in headless mode and can silently return an error shell, partially hydrated page, or low-quality extraction **without** causing the CLI to fail.
After every run that used `--browser auto` or `--browser headless`, the agent **MUST** inspect the saved markdown first, and inspect the saved `-captured.html` when the markdown looks suspicious.
### Quality checks the agent must perform
1. Confirm the markdown title matches the target page, not a generic site shell
2. Confirm the body contains the expected article or page content, not just navigation, footer, or a generic error
3. Watch for obvious failure signs such as:
- `Application error`
- `This page could not be found`
- login, signup, subscribe, or verification shells
- extremely short markdown for a page that should be long-form
- raw framework payloads or mostly boilerplate content
4. If the result is low quality, incomplete, or clearly wrong, do **not** accept the run as successful just because the CLI exited with code 0
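These checks could be approximated with a heuristic like the one below; the regex patterns and the 500-character threshold are illustrative assumptions, since the skill deliberately leaves this judgment to the agent rather than baking it into the CLI.

```typescript
// Illustrative quality heuristic only; the patterns and threshold are
// assumptions, not part of the CLI.
const FAILURE_SIGNS: RegExp[] = [
  /application error/i,
  /this page could not be found/i,
  /\b(log\s?in|sign\s?up|subscribe|verify)\b/i,
];

function looksSuspicious(markdown: string, expectLongForm = true): boolean {
  if (FAILURE_SIGNS.some((pattern) => pattern.test(markdown))) return true;
  // Extremely short output for a page that should be long-form is a red flag.
  if (expectLongForm && markdown.trim().length < 500) return true;
  return false;
}
```

A heuristic this blunt would flag any article that merely mentions "subscribe", which is exactly why the document frames this as agent judgment rather than a hard-coded gate.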
### Recovery workflow the agent must follow
1. First run with default `auto` unless there is already a clear reason to use wait mode
2. Review markdown quality immediately after the run
3. If the content is low quality, rerun locally with visible Chrome:
- `--browser headed` for ordinary rendering issues
- `--browser headed --wait` when the page may need login, anti-bot interaction, cookie acceptance, or extra hydration time
4. If `--wait` is used, tell the user exactly what to do:
- if login is required, ask them to sign in
- if the page needs time to hydrate, ask them to wait until the full content is visible
- once ready, ask them to press Enter so capture can continue
5. Only fall back to hosted `defuddle.md` after the local browser strategies have failed or are clearly lower fidelity
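The escalation order can be pictured as a tiny decision function; the step names mirror the CLI flags, while the inputs (`qualityOk`, `attempt`, `needsUserAction`) are hypothetical labels for what the agent knows after a run.

```typescript
// Hypothetical escalation helper; in practice this reasoning lives in the
// agent, not in the CLI.
type NextStep = "accept" | "rerun-headed" | "rerun-headed-wait" | "hosted-defuddle";

function nextStep(
  qualityOk: boolean,
  attempt: number, // 1 = the initial `auto` run
  needsUserAction: boolean, // login, anti-bot, cookies, slow hydration
): NextStep {
  if (qualityOk) return "accept";
  if (attempt === 1) {
    return needsUserAction ? "rerun-headed-wait" : "rerun-headed";
  }
  // Hosted defuddle.md is the last resort after local strategies fail.
  return "hosted-defuddle";
}
```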
## Output Format
Each run saves two files side by side:
@@ -211,8 +258,9 @@ Conversion order:
2. If no specialized parser matches, try Defuddle
3. For rich pages such as YouTube, prefer Defuddle's extractor-specific output (including transcripts when available) instead of replacing it with the legacy pipeline
4. If Defuddle throws, cannot load, returns obviously incomplete markdown, or captures lower-quality content than the legacy pipeline, automatically fall back to the pre-Defuddle extractor
5. If the agent determines the captured result is a login screen, verification screen, or paywall shell, rerun locally with `--browser headed --wait` and ask the user to complete access before capture
6. If the entire local browser capture flow still fails before markdown can be produced, try the hosted `https://defuddle.md/<url>` API and save its markdown output directly
7. The legacy fallback path uses the older Readability/selector/Next.js-data based HTML-to-Markdown implementation recovered from git history
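Step 6's hosted fallback amounts to a single GET against `https://defuddle.md/<url>`; only the URL shape comes from the text above, while the error handling and the injectable fetcher below are assumptions added for testability.

```typescript
// Sketch of the hosted defuddle.md fallback (step 6 above). Only the URL
// shape comes from the docs; the rest is an assumption.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  text(): Promise<string>;
}>;

async function fetchHostedMarkdown(
  targetUrl: string,
  fetcher: FetchLike = fetch,
): Promise<string> {
  const response = await fetcher(`https://defuddle.md/${targetUrl}`);
  if (!response.ok) {
    throw new Error(`defuddle.md returned HTTP ${response.status}`);
  }
  return await response.text();
}
```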
CLI output will show:

View File

@@ -6,7 +6,7 @@
"dependencies": {
"@mozilla/readability": "^0.6.0",
"baoyu-chrome-cdp": "file:./vendor/baoyu-chrome-cdp",
"defuddle": "^0.14.0",
"jsdom": "^24.1.3",
"linkedom": "^0.18.12",
"turndown": "^7.2.2",
@@ -61,7 +61,7 @@
"decimal.js": ["decimal.js@10.6.0", "", {}, "sha512-YpgQiITW3JXGntzdUmyUR1V812Hn8T1YVXhCu+wO3OpS4eU9l4YdD3qjyiKdV6mvV29zapkMeD390UVEf2lkUg=="],
"defuddle": ["defuddle@0.14.0", "", { "dependencies": { "commander": "^12.1.0" }, "optionalDependencies": { "linkedom": "^0.18.12", "mathml-to-latex": "^1.5.0", "temml": "^0.13.1", "turndown": "^7.2.0" }, "bin": { "defuddle": "dist/cli.js" } }, "sha512-btavZGd1WgiVqrVM62WGRXMUi/aU7ckTZiq0xXWLZMHvzIqNZjwIFQEDRx8MarD7fIgsB90NXZ9xHJkKtapt2Q=="],
"delayed-stream": ["delayed-stream@1.0.0", "", {}, "sha512-ZySD7Nf91aLB0RxL4KGrKHBXl7Eds1DAmEdcoVawXnLD7SDhpNgtuII2aAkg7a7QS41jxPSZ17p4VdGnMHk3MQ=="],

View File

@@ -0,0 +1,55 @@
import assert from "node:assert/strict";
import test from "node:test";
import { cleanContent } from "./content-cleaner.js";
const SAMPLE_HTML = `<!doctype html>
<html>
<head>
<title>Example Story</title>
<style>.cookie-banner { position: fixed; }</style>
<script>window.__noise = true;</script>
</head>
<body>
<!-- comment that should be removed -->
<header>
<nav>
<a href="/home">Home</a>
<a href="/topics">Topics</a>
</nav>
</header>
<div class="cookie-banner">Accept cookies</div>
<aside>Sidebar links</aside>
<main>
<article class="content">
<h1>Actual Story Title</h1>
<p>
This is the first paragraph of the real story body, and it is intentionally long enough
to survive the cleaner's main-content heuristics without being mistaken for navigation.
</p>
<p>
This is the second paragraph with more useful detail, a
<a href="/read-more">supporting link</a>, and a normal image.
</p>
<img src="/images/cover.jpg" alt="Cover">
<img src="data:image/png;base64,AAAA" alt="Inline data">
</article>
</main>
<footer>Footer boilerplate</footer>
</body>
</html>`;
test("cleanContent keeps the article body and removes obvious boilerplate", () => {
const cleaned = cleanContent(SAMPLE_HTML, "https://example.com/posts/story");
assert.match(cleaned, /Actual Story Title/);
assert.match(cleaned, /https:\/\/example\.com\/read-more/);
assert.match(cleaned, /https:\/\/example\.com\/images\/cover\.jpg/);
assert.doesNotMatch(cleaned, /Accept cookies/);
assert.doesNotMatch(cleaned, /Sidebar links/);
assert.doesNotMatch(cleaned, /Footer boilerplate/);
assert.doesNotMatch(cleaned, /window\.__noise/);
assert.doesNotMatch(cleaned, /comment that should be removed/);
assert.doesNotMatch(cleaned, /data:image\/png;base64/);
});

View File

@@ -0,0 +1,432 @@
import { parseHTML } from "linkedom";
export interface CleaningOptions {
removeAds?: boolean;
removeBase64Images?: boolean;
onlyMainContent?: boolean;
includeTags?: string[];
excludeTags?: string[];
}
const ALWAYS_REMOVE_SELECTORS = [
"script",
"style",
"noscript",
"link[rel='stylesheet']",
"[hidden]",
"[aria-hidden='true']",
"[style*='display: none']",
"[style*='display:none']",
"[style*='visibility: hidden']",
"[style*='visibility:hidden']",
"svg[aria-hidden='true']",
"svg.icon",
"svg[class*='icon']",
"template",
"meta",
"iframe",
"canvas",
"object",
"embed",
"form",
"input",
"select",
"textarea",
"button",
];
const OVERLAY_SELECTORS = [
"[class*='modal']",
"[class*='popup']",
"[class*='overlay']",
"[class*='dialog']",
"[role='dialog']",
"[role='alertdialog']",
"[class*='cookie']",
"[class*='consent']",
"[class*='gdpr']",
"[class*='privacy-banner']",
"[class*='notification-bar']",
"[id*='cookie']",
"[id*='consent']",
"[id*='gdpr']",
"[style*='position: fixed']",
"[style*='position:fixed']",
"[style*='position: sticky']",
"[style*='position:sticky']",
];
const NAVIGATION_SELECTORS = [
"header",
"footer",
"nav",
"aside",
".header",
".top",
".navbar",
"#header",
".footer",
".bottom",
"#footer",
".sidebar",
".side",
".aside",
"#sidebar",
".modal",
".popup",
"#modal",
".overlay",
".ad",
".ads",
".advert",
"#ad",
".lang-selector",
".language",
"#language-selector",
".social",
".social-media",
".social-links",
"#social",
".menu",
".navigation",
"#nav",
".breadcrumbs",
"#breadcrumbs",
".share",
"#share",
".widget",
"#widget",
".cookie",
"#cookie",
];
const FORCE_INCLUDE_SELECTORS = [
"#main",
"#content",
"#main-content",
"#article",
"#post",
"#page-content",
"main",
"article",
"[role='main']",
".main-content",
".content",
".post-content",
".article-content",
".entry-content",
".page-content",
".article-body",
".post-body",
".story-content",
".blog-content",
];
const AD_SELECTORS = [
"ins.adsbygoogle",
".google-ad",
".adsense",
"[data-ad]",
"[data-ads]",
"[data-ad-slot]",
"[data-ad-client]",
".ad-container",
".ad-wrapper",
".advertisement",
".sponsored-content",
"img[width='1'][height='1']",
"img[src*='pixel']",
"img[src*='tracking']",
"img[src*='analytics']",
];
function getLinkDensity(element: Element): number {
const text = element.textContent || "";
const textLength = text.trim().length;
if (textLength === 0) return 1;
let linkLength = 0;
element.querySelectorAll("a").forEach((link: Element) => {
linkLength += (link.textContent || "").trim().length;
});
return linkLength / textLength;
}
function getContentScore(element: Element): number {
let score = 0;
const text = element.textContent || "";
const textLength = text.trim().length;
score += Math.min(textLength / 100, 50);
score += element.querySelectorAll("p").length * 3;
score += element.querySelectorAll("h1, h2, h3, h4, h5, h6").length * 2;
score += element.querySelectorAll("img").length;
score -= element.querySelectorAll("a").length * 0.5;
score -= element.querySelectorAll("li").length * 0.2;
const linkDensity = getLinkDensity(element);
if (linkDensity > 0.5) score -= 30;
else if (linkDensity > 0.3) score -= 15;
const className = typeof element.className === "string" ? element.className : "";
const classAndId = `${className} ${element.id || ""}`;
if (/article|content|post|body|main|entry/i.test(classAndId)) score += 25;
if (/comment|sidebar|footer|nav|menu|header|widget|ad/i.test(classAndId)) score -= 25;
return score;
}
function looksLikeNavigation(element: Element): boolean {
const linkDensity = getLinkDensity(element);
if (linkDensity > 0.5) return true;
const listItems = element.querySelectorAll("li");
const links = element.querySelectorAll("a");
return listItems.length > 5 && links.length > listItems.length * 0.8;
}
function removeElements(document: Document, selectors: string[]): void {
for (const selector of selectors) {
try {
document.querySelectorAll(selector).forEach((element: Element) => element.remove());
} catch {
// Ignore unsupported selectors from linkedom/jsdom differences.
}
}
}
function removeWithProtection(
document: Document,
selectorsToRemove: string[],
protectedSelectors: string[]
): void {
for (const selector of selectorsToRemove) {
try {
document.querySelectorAll(selector).forEach((element: Element) => {
const isProtected = protectedSelectors.some((protectedSelector) => {
try {
return element.matches(protectedSelector);
} catch {
return false;
}
});
if (isProtected) return;
const containsProtected = protectedSelectors.some((protectedSelector) => {
try {
return element.querySelector(protectedSelector) !== null;
} catch {
return false;
}
});
if (containsProtected) return;
element.remove();
});
} catch {
// Ignore unsupported selectors from linkedom/jsdom differences.
}
}
}
function findMainContent(document: Document): Element | null {
const isValidContent = (element: Element | null): element is Element => {
if (!element) return false;
const text = element.textContent || "";
if (text.trim().length < 100) return false;
return !looksLikeNavigation(element);
};
const main = document.querySelector("main");
if (isValidContent(main) && getLinkDensity(main) < 0.4) return main;
const roleMain = document.querySelector('[role="main"]');
if (isValidContent(roleMain) && getLinkDensity(roleMain) < 0.4) return roleMain;
const articles = document.querySelectorAll("article");
if (articles.length === 1 && isValidContent(articles[0] ?? null)) {
return articles[0] ?? null;
}
const contentSelectors = [
"#content",
"#main-content",
"#main",
".content",
".main-content",
".post-content",
".article-content",
".entry-content",
".page-content",
".article-body",
".post-body",
".story-content",
".blog-content",
];
for (const selector of contentSelectors) {
try {
const element = document.querySelector(selector);
if (isValidContent(element) && getLinkDensity(element) < 0.4) {
return element;
}
} catch {
// Ignore invalid selectors.
}
}
const candidates: Array<{ element: Element; score: number }> = [];
const containers = document.querySelectorAll("div, section, article");
containers.forEach((element: Element) => {
const text = element.textContent || "";
if (text.trim().length < 200) return;
const score = getContentScore(element);
if (score > 0) {
candidates.push({ element, score });
}
});
candidates.sort((left, right) => right.score - left.score);
if ((candidates[0]?.score ?? 0) > 20) {
return candidates[0]?.element ?? null;
}
return null;
}
function removeBase64ImagesFromDocument(document: Document): void {
document.querySelectorAll("img[src^='data:']").forEach((element: Element) => {
element.remove();
});
document.querySelectorAll("[style*='data:image']").forEach((element: Element) => {
const style = element.getAttribute("style");
if (!style) return;
const cleanedStyle = style.replace(
/background(-image)?:\s*url\([^)]*data:image[^)]*\)[^;]*;?/gi,
""
);
if (cleanedStyle.trim()) {
element.setAttribute("style", cleanedStyle);
} else {
element.removeAttribute("style");
}
});
document.querySelectorAll("source[src^='data:'], source[srcset*='data:']").forEach((element: Element) => {
element.remove();
});
}
function makeAbsoluteUrl(value: string, baseUrl: string): string | null {
try {
return new URL(value, baseUrl).toString();
} catch {
return null;
}
}
function convertRelativeUrls(document: Document, baseUrl: string): void {
document.querySelectorAll("[src]").forEach((element: Element) => {
const src = element.getAttribute("src");
if (!src || src.startsWith("http") || src.startsWith("//") || src.startsWith("data:")) return;
const absolute = makeAbsoluteUrl(src, baseUrl);
if (absolute) element.setAttribute("src", absolute);
});
document.querySelectorAll("[href]").forEach((element: Element) => {
const href = element.getAttribute("href");
if (
!href ||
href.startsWith("http") ||
href.startsWith("//") ||
href.startsWith("#") ||
href.startsWith("mailto:") ||
href.startsWith("tel:") ||
href.startsWith("javascript:")
) {
return;
}
const absolute = makeAbsoluteUrl(href, baseUrl);
if (absolute) element.setAttribute("href", absolute);
});
}
export function cleanHtml(html: string, baseUrl: string, options: CleaningOptions = {}): string {
const {
removeAds = true,
removeBase64Images = true,
onlyMainContent = true,
includeTags,
excludeTags,
} = options;
const { document } = parseHTML(html);
removeElements(document, ALWAYS_REMOVE_SELECTORS);
removeElements(document, OVERLAY_SELECTORS);
if (removeAds) {
removeElements(document, AD_SELECTORS);
}
if (excludeTags?.length) {
removeElements(document, excludeTags);
}
if (onlyMainContent) {
removeWithProtection(document, NAVIGATION_SELECTORS, FORCE_INCLUDE_SELECTORS);
const mainContent = findMainContent(document);
if (mainContent && document.body) {
const clone = mainContent.cloneNode(true) as Element;
document.body.innerHTML = "";
document.body.appendChild(clone);
}
}
if (includeTags?.length && document.body) {
const matchedElements: Element[] = [];
for (const selector of includeTags) {
try {
document.querySelectorAll(selector).forEach((element: Element) => {
matchedElements.push(element.cloneNode(true) as Element);
});
} catch {
// Ignore invalid selectors.
}
}
if (matchedElements.length > 0) {
document.body.innerHTML = "";
matchedElements.forEach((element) => document.body?.appendChild(element));
}
}
if (removeBase64Images) {
removeBase64ImagesFromDocument(document);
}
const walker = document.createTreeWalker(document, 128 /* NodeFilter.SHOW_COMMENT */);
const comments: Node[] = [];
while (walker.nextNode()) {
comments.push(walker.currentNode);
}
comments.forEach((comment) => comment.parentNode?.removeChild(comment));
convertRelativeUrls(document, baseUrl);
return document.documentElement?.outerHTML || html;
}
export function cleanContent(html: string, baseUrl: string, options: CleaningOptions = {}): string {
return cleanHtml(html, baseUrl, options);
}

View File

@@ -0,0 +1,28 @@
import assert from "node:assert/strict";
import test from "node:test";
import { extractContent } from "./html-to-markdown.js";
const EMBEDDED_IMAGE_HTML = `<!doctype html>
<html>
<body>
<main>
<article>
<h1>Embedded Image Story</h1>
<p>
This paragraph is intentionally long enough to satisfy the extractor thresholds so the
resulting markdown keeps the main article body and the embedded image reference.
</p>
<img src="data:image/png;base64,AAAA" alt="inline">
</article>
</main>
</body>
</html>`;
test("extractContent preserves base64 images when requested for media download", async () => {
const result = await extractContent(EMBEDDED_IMAGE_HTML, "https://example.com/embedded", {
preserveBase64Images: true,
});
assert.match(result.markdown, /!\[inline\]\(data:image\/png;base64,AAAA\)/);
});

View File

@@ -13,10 +13,15 @@ import {
shouldCompareWithLegacy,
} from "./legacy-converter.js";
import { tryUrlRuleParsers } from "./parsers/index.js";
import { cleanContent } from "./content-cleaner.js";
export type { ConversionResult, PageMetadata };
export { createMarkdownDocument, formatMetadataYaml };
export interface ExtractContentOptions {
preserveBase64Images?: boolean;
}
export const absolutizeUrlsScript = String.raw`
(function() {
const baseUrl = document.baseURI || location.href;
@@ -85,7 +90,10 @@ export const absolutizeUrlsScript = String.raw`
absAttr(htmlClone, "video[poster]", "poster");
absSrcset(htmlClone, "img[srcset], source[srcset]");
return {
html: "<!doctype html>\n" + htmlClone.outerHTML,
finalUrl: location.href,
};
})()
`;
@@ -102,7 +110,11 @@ function shouldPreferDefuddle(result: ConversionResult): boolean {
return /^##?\s+transcript\b/im.test(result.markdown);
}
export async function extractContent(
html: string,
url: string,
options: ExtractContentOptions = {}
): Promise<ConversionResult> {
const capturedAt = new Date().toISOString();
const baseMetadata = extractMetadataFromHtml(html, url, capturedAt);
@@ -111,14 +123,23 @@ export async function extractContent(html: string, url: string): Promise<Convers
return specializedResult;
}
let cleanedHtml = html;
try {
cleanedHtml = cleanContent(html, url, {
removeBase64Images: !options.preserveBase64Images,
});
} catch {
cleanedHtml = html;
}
const defuddleResult = await tryDefuddleConversion(cleanedHtml, url, baseMetadata);
if (defuddleResult.ok) {
if (shouldPreferDefuddle(defuddleResult.result)) {
return { ...defuddleResult.result, rawHtml: html };
}
if (shouldCompareWithLegacy(defuddleResult.result.markdown)) {
const legacyResult = convertWithLegacyExtractor(html, baseMetadata, cleanedHtml);
const legacyScore = scoreMarkdownQuality(legacyResult.markdown);
const defuddleScore = scoreMarkdownQuality(defuddleResult.result.markdown);
@@ -130,10 +151,10 @@ export async function extractContent(html: string, url: string): Promise<Convers
}
}
return { ...defuddleResult.result, rawHtml: html };
}
const fallbackResult = convertWithLegacyExtractor(html, baseMetadata, cleanedHtml);
return {
...fallbackResult,
fallbackReason: defuddleResult.reason,

View File

@@ -0,0 +1,48 @@
import assert from "node:assert/strict";
import test from "node:test";
import { cleanContent } from "./content-cleaner.js";
import { convertWithLegacyExtractor } from "./legacy-converter.js";
import { extractMetadataFromHtml } from "./markdown-conversion-shared.js";
const CAPTURED_AT = "2026-03-24T03:00:00.000Z";
const NEXT_DATA_HTML = `<!doctype html>
<html>
<head>
<title>Hydrated Story</title>
</head>
<body>
<div class="cookie-banner">Accept cookies</div>
<main>
<p>Short teaser text that should not win over the structured article payload.</p>
</main>
<script id="__NEXT_DATA__" type="application/json">
{
"props": {
"pageProps": {
"article": {
"title": "Hydrated Story",
"description": "A structured article payload from Next.js",
"body": "<p>The full article lives in __NEXT_DATA__ and should still be extracted even when the cleaned HTML removes scripts before the selector and readability passes run.</p><p>A second paragraph keeps the content comfortably above the minimum extraction threshold and proves the legacy extractor still has access to the original structured payload.</p>"
}
}
}
}
</script>
</body>
</html>`;
test("legacy extractor still uses original __NEXT_DATA__ after HTML cleaning", () => {
const url = "https://example.com/posts/hydrated-story";
const baseMetadata = extractMetadataFromHtml(NEXT_DATA_HTML, url, CAPTURED_AT);
const cleanedHtml = cleanContent(NEXT_DATA_HTML, url);
const result = convertWithLegacyExtractor(NEXT_DATA_HTML, baseMetadata, cleanedHtml);
assert.equal(result.conversionMethod, "legacy:next-data");
assert.match(result.markdown, /The full article lives in .*NEXT.*DATA/);
assert.match(result.markdown, /A second paragraph keeps the content comfortably above the minimum extraction threshold/);
assert.doesNotMatch(result.markdown, /Short teaser text that should not win/);
assert.equal(result.rawHtml, NEXT_DATA_HTML);
});

View File

@@ -336,29 +336,32 @@ function tryNextDataExtraction(document: Document): ExtractionCandidate | null {
function buildReadabilityCandidate(
article: ReturnType<Readability["parse"]>,
referenceDocument: Document,
method: string
): ExtractionCandidate | null {
const textContent = article?.textContent?.trim() ?? "";
if (textContent.length < MIN_CONTENT_LENGTH) return null;
return {
title: pickString(article?.title, extractTitle(referenceDocument)),
byline: pickString((article as { byline?: string } | null)?.byline),
excerpt: pickString(article?.excerpt, generateExcerpt(null, textContent)),
published: pickString(
(article as { publishedTime?: string } | null)?.publishedTime,
extractPublishedTime(referenceDocument)
),
html: article?.content ? sanitizeHtml(article.content) : null,
textContent,
method,
};
}
function tryReadability(document: Document, referenceDocument: Document = document): ExtractionCandidate | null {
try {
const strictClone = document.cloneNode(true) as Document;
const strictResult = buildReadabilityCandidate(
new Readability(strictClone).parse(),
referenceDocument,
"readability"
);
if (strictResult) return strictResult;
@@ -366,7 +369,7 @@ function tryReadability(document: Document): ExtractionCandidate | null {
const relaxedClone = document.cloneNode(true) as Document;
return buildReadabilityCandidate(
new Readability(relaxedClone, { charThreshold: 120 }).parse(),
referenceDocument,
"readability-relaxed"
);
} catch {
@@ -471,14 +474,15 @@ function pickBestCandidate(candidates: ExtractionCandidate[]): ExtractionCandida
return ranked[0];
}
function extractFromHtml(html: string, cleanedHtml: string = html): ExtractionCandidate | null {
const originalDocument = parseDocument(html);
const cleanedDocument = parseDocument(cleanedHtml);
const readabilityCandidate = tryReadability(cleanedDocument, originalDocument);
const nextDataCandidate = tryNextDataExtraction(originalDocument);
const jsonLdCandidate = tryJsonLdExtraction(originalDocument);
const selectorCandidate = trySelectorExtraction(cleanedDocument);
const bodyCandidate = tryBodyExtraction(cleanedDocument);
const candidates = [
readabilityCandidate,
@@ -493,8 +497,8 @@ function extractFromHtml(html: string): ExtractionCandidate | null {
return {
...winner,
title: winner.title ?? extractTitle(originalDocument),
published: winner.published ?? extractPublishedTime(originalDocument),
excerpt: winner.excerpt ?? generateExcerpt(null, winner.textContent),
};
}
@@ -610,12 +614,16 @@ export function shouldCompareWithLegacy(markdown: string): boolean {
);
}
export function convertWithLegacyExtractor(
html: string,
baseMetadata: PageMetadata,
cleanedHtml: string = html
): ConversionResult {
const extracted = extractFromHtml(html, cleanedHtml);
let markdown = extracted?.html ? convertHtmlFragmentToMarkdown(extracted.html) : "";
if (!markdown.trim()) {
markdown = extracted?.textContent?.trim() || fallbackPlainText(cleanedHtml);
}
return {

View File

@@ -29,10 +29,33 @@ interface Args {
wait: boolean;
timeout: number;
downloadMedia: boolean;
browserMode: BrowserMode;
}
type BrowserMode = "auto" | "headless" | "headed";
interface CaptureAttemptOptions {
headless: boolean;
wait: boolean;
existingPort?: number;
waitPrompt?: string;
}
interface CaptureSnapshot {
html: string;
finalUrl: string;
}
const BROWSER_MODES = new Set<BrowserMode>(["auto", "headless", "headed"]);
function parseArgs(argv: string[]): Args {
const args: Args = {
url: "",
wait: false,
timeout: DEFAULT_TIMEOUT_MS,
downloadMedia: false,
browserMode: "auto",
};
for (let i = 2; i < argv.length; i++) {
const arg = argv[i];
if (arg === "--wait" || arg === "-w") {
@@ -45,6 +68,12 @@ function parseArgs(argv: string[]): Args {
args.outputDir = argv[++i];
} else if (arg === "--download-media") {
args.downloadMedia = true;
} else if (arg === "--browser") {
const value = argv[++i];
// Validate against BROWSER_MODES instead of trusting a raw cast; unknown values fall back to "auto".
args.browserMode = value !== undefined && BROWSER_MODES.has(value as BrowserMode) ? (value as BrowserMode) : "auto";
} else if (arg === "--headless") {
args.browserMode = "headless";
} else if (arg === "--headed" || arg === "--noheadless" || arg === "--no-headless") {
args.browserMode = "headed";
} else if (!arg.startsWith("-") && !args.url) {
args.url = arg;
}
@@ -194,21 +223,28 @@ async function generateOutputPath(url: string, title: string, outputDir?: string
return path.join(dataDir, domain, timestampSlug, `${timestampSlug}.md`);
}
function defaultWaitPrompt(): string {
return "A browser window has been opened. If the page requires login or verification, complete it first, then press Enter to capture.";
}
async function waitForUserSignal(prompt: string): Promise<void> {
console.log(prompt);
const rl = createInterface({ input: process.stdin, output: process.stdout });
await new Promise<void>((resolve) => {
rl.once("line", () => { rl.close(); resolve(); });
});
}
async function captureUrlOnce(args: Args, options: CaptureAttemptOptions): Promise<ConversionResult> {
const reusing = options.existingPort !== undefined;
const port = options.existingPort ?? await getFreePort();
const chrome = reusing ? null : await launchChrome(args.url, port, options.headless);
if (reusing) {
console.log(`Reusing existing Chrome on port ${port}`);
} else {
console.log(`Launching Chrome (${options.headless ? "headless" : "headed"})...`);
}
let cdp: CdpConnection | null = null;
let targetId: string | null = null;
@@ -235,8 +271,8 @@ async function captureUrl(args: Args): Promise<ConversionResult> {
await cdp.send("Page.enable", {}, { sessionId });
}
if (args.wait) {
await waitForUserSignal();
if (options.wait) {
await waitForUserSignal(options.waitPrompt ?? defaultWaitPrompt());
} else {
console.log("Waiting for page to load...");
await Promise.race([
@@ -251,11 +287,12 @@ async function captureUrl(args: Args): Promise<ConversionResult> {
}
console.log("Capturing page content...");
const { html } = await evaluateScript<{ html: string }>(
const snapshot = await evaluateScript<CaptureSnapshot>(
cdp, sessionId, absolutizeUrlsScript, args.timeout
);
return await extractContent(html, args.url);
return await extractContent(snapshot.html, snapshot.finalUrl || args.url, {
preserveBase64Images: args.downloadMedia,
});
} finally {
if (reusing) {
if (cdp && targetId) {
@@ -272,10 +309,67 @@ async function captureUrl(args: Args): Promise<ConversionResult> {
}
}
async function runHeadedFlow(
args: Args,
options: { existingPort?: number; wait: boolean; waitPrompt?: string }
): Promise<ConversionResult> {
return await captureUrlOnce(args, {
headless: false,
wait: options.wait,
existingPort: options.existingPort,
waitPrompt: options.waitPrompt,
});
}
async function captureUrl(args: Args): Promise<ConversionResult> {
const existingPort = await findExistingChromePort();
if (existingPort !== null) {
console.log("Found an existing Chrome session for this profile. Reusing it instead of launching a new browser.");
return await runHeadedFlow(args, {
existingPort,
wait: args.wait,
waitPrompt: args.wait ? defaultWaitPrompt() : undefined,
});
}
if (args.browserMode === "headless") {
return await captureUrlOnce(args, { headless: true, wait: false });
}
if (args.browserMode === "headed") {
return await runHeadedFlow(args, {
wait: args.wait,
waitPrompt: args.wait ? defaultWaitPrompt() : undefined,
});
}
if (args.wait) {
return await runHeadedFlow(args, {
wait: true,
waitPrompt: defaultWaitPrompt(),
});
}
try {
return await captureUrlOnce(args, { headless: true, wait: false });
} catch (error) {
const headlessMessage = error instanceof Error ? error.message : String(error);
console.warn(`Headless capture failed: ${headlessMessage}`);
console.log("Retrying with a visible browser window...");
try {
return await runHeadedFlow(args, { wait: false });
} catch (headedError) {
const headedMessage = headedError instanceof Error ? headedError.message : String(headedError);
throw new Error(`Headless capture failed (${headlessMessage}); headed retry failed (${headedMessage})`);
}
}
}
async function main(): Promise<void> {
const args = parseArgs(process.argv);
if (!args.url) {
console.error("Usage: bun main.ts <url> [-o output.md] [--output-dir dir] [--wait] [--timeout ms] [--download-media]");
console.error("Usage: bun main.ts <url> [-o output.md] [--output-dir dir] [--wait] [--browser auto|headless|headed] [--timeout ms] [--download-media]");
process.exit(1);
}
@@ -286,6 +380,16 @@ async function main(): Promise<void> {
process.exit(1);
}
if (!BROWSER_MODES.has(args.browserMode)) {
console.error(`Invalid --browser mode: ${args.browserMode}. Expected auto, headless, or headed.`);
process.exit(1);
}
if (args.wait && args.browserMode === "headless") {
console.error("Error: --wait requires a visible browser. Use --browser auto or --browser headed.");
process.exit(1);
}
if (args.output) {
const stat = await import("node:fs").then(fs => fs.statSync(args.output!, { throwIfNoEntry: false }));
if (stat?.isDirectory()) {
@@ -296,6 +400,7 @@ async function main(): Promise<void> {
console.log(`Fetching: ${args.url}`);
console.log(`Mode: ${args.wait ? "wait" : "auto"}`);
console.log(`Browser: ${args.browserMode}`);
let outputPath: string;
let htmlSnapshotPath: string | null = null;
@ -306,7 +411,7 @@ async function main(): Promise<void> {
try {
const result = await captureUrl(args);
document = createMarkdownDocument(result);
outputPath = args.output || await generateOutputPath(args.url, result.metadata.title, args.outputDir, document);
outputPath = args.output || await generateOutputPath(result.metadata.url || args.url, result.metadata.title, args.outputDir, document);
const outputDir = path.dirname(outputPath);
htmlSnapshotPath = deriveHtmlSnapshotPath(outputPath);
await mkdir(outputDir, { recursive: true });


@@ -0,0 +1,40 @@
import assert from "node:assert/strict";
import { mkdtemp, readFile, readdir } from "node:fs/promises";
import os from "node:os";
import path from "node:path";
import test from "node:test";
import { localizeMarkdownMedia } from "./media-localizer.js";
const PNG_1X1_BASE64 =
"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mP8/x8AAwMCAO7Z0ioAAAAASUVORK5CYII=";
test("localizeMarkdownMedia saves embedded base64 images into imgs directory", async () => {
const tempDir = await mkdtemp(path.join(os.tmpdir(), "url-to-markdown-media-"));
const dataUri = `data:image/png;base64,${PNG_1X1_BASE64}`;
const markdown = [
"---",
`coverImage: "${dataUri}"`,
"---",
"",
"# Embedded Image",
"",
`![inline](${dataUri})`,
"",
].join("\n");
const result = await localizeMarkdownMedia(markdown, {
markdownPath: path.join(tempDir, "post.md"),
});
assert.equal(result.downloadedImages, 1);
assert.equal(result.downloadedVideos, 0);
assert.match(result.markdown, /coverImage: "imgs\/img-001\.png"/);
assert.match(result.markdown, /!\[inline\]\(imgs\/img-001\.png\)/);
const files = await readdir(path.join(tempDir, "imgs"));
assert.deepEqual(files, ["img-001.png"]);
const bytes = await readFile(path.join(tempDir, "imgs", "img-001.png"));
assert.equal(bytes.length, Buffer.from(PNG_1X1_BASE64, "base64").length);
});


@@ -3,10 +3,12 @@ import { mkdir, writeFile } from "node:fs/promises";
type MediaKind = "image" | "video";
type MediaHint = "image" | "unknown";
type MediaSource = "remote" | "data";
type MarkdownLinkCandidate = {
url: string;
hint: MediaHint;
source: MediaSource;
};
export type LocalizeMarkdownMediaOptions = {
@@ -22,8 +24,9 @@ export type LocalizeMarkdownMediaResult = {
videoDir: string | null;
};
const MARKDOWN_LINK_RE = /(!?\[[^\]\n]*\])\((<)?(https?:\/\/[^)\s>]+)(>)?\)/g;
const FRONTMATTER_COVER_RE = /^(coverImage:\s*")(https?:\/\/[^"]+)(")/m;
const MARKDOWN_LINK_RE =
/(!?\[[^\]\n]*\])\((<)?((?:https?:\/\/[^)\s>]+)|(?:data:[^)>\s]+))(>)?\)/g;
const FRONTMATTER_COVER_RE = /^(coverImage:\s*")((?:https?:\/\/[^"]+)|(?:data:[^"]+))(")/m;
const IMAGE_EXTENSIONS = new Set([
"jpg",
@@ -86,6 +89,10 @@ function resolveExtensionFromUrl(rawUrl: string): string | undefined {
return undefined;
}
function resolveExtensionFromContentType(contentType: string): string | undefined {
return normalizeExtension(MIME_EXTENSION_MAP[contentType]);
}
function resolveKindFromContentType(contentType: string): MediaKind | undefined {
if (!contentType) return undefined;
if (contentType.startsWith("image/")) return "image";
@@ -124,7 +131,7 @@ function resolveOutputExtension(
extension: string | undefined,
kind: MediaKind
): string {
const extFromMime = normalizeExtension(MIME_EXTENSION_MAP[contentType]);
const extFromMime = resolveExtensionFromContentType(contentType);
if (extFromMime) return extFromMime;
const normalizedExt = normalizeExtension(extension);
@@ -150,6 +157,10 @@ function sanitizeFileSegment(input: string): string {
}
function resolveFileStem(rawUrl: string, extension: string): string {
if (isDataUri(rawUrl)) {
return "";
}
try {
const parsed = new URL(rawUrl);
const base = path.posix.basename(parsed.pathname);
@@ -172,6 +183,26 @@ function buildFileName(kind: MediaKind, index: number, sourceUrl: string, extens
return `${prefix}-${serial}${suffix}.${extension}`;
}
function isDataUri(value: string): boolean {
return value.startsWith("data:");
}
function parseBase64DataUri(rawUrl: string): { contentType: string; bytes: Buffer } | null {
const match = rawUrl.match(/^data:([^;,]+);base64,([A-Za-z0-9+/=\s]+)$/i);
if (!match?.[1] || !match[2]) return null;
const contentType = normalizeContentType(match[1]);
if (!contentType) return null;
try {
const bytes = Buffer.from(match[2].replace(/\s+/g, ""), "base64");
if (bytes.length === 0) return null;
return { contentType, bytes };
} catch {
return null;
}
}
function collectMarkdownLinkCandidates(markdown: string): MarkdownLinkCandidate[] {
const candidates: MarkdownLinkCandidate[] = [];
const seen = new Set<string>();
@@ -181,7 +212,11 @@ function collectMarkdownLinkCandidates(markdown: string): MarkdownLinkCandidate[
const coverMatch = fmMatch[1]?.match(FRONTMATTER_COVER_RE);
if (coverMatch?.[2] && !seen.has(coverMatch[2])) {
seen.add(coverMatch[2]);
candidates.push({ url: coverMatch[2], hint: "image" });
candidates.push({
url: coverMatch[2],
hint: "image",
source: isDataUri(coverMatch[2]) ? "data" : "remote",
});
}
}
@@ -195,6 +230,7 @@ function collectMarkdownLinkCandidates(markdown: string): MarkdownLinkCandidate[
candidates.push({
url: rawUrl,
hint: label.startsWith("![") ? "image" : "unknown",
source: isDataUri(rawUrl) ? "data" : "remote",
});
}
@@ -244,24 +280,45 @@ export async function localizeMarkdownMedia(
for (const candidate of candidates) {
try {
const response = await fetch(candidate.url, {
method: "GET",
redirect: "follow",
headers: {
"user-agent": DOWNLOAD_USER_AGENT,
},
});
let sourceUrl = candidate.url;
let contentType = "";
let extension: string | undefined;
let kind: MediaKind | undefined;
let bytes: Buffer | null = null;
if (!response.ok) {
log(`[url-to-markdown] Skip media (${response.status}): ${candidate.url}`);
continue;
if (candidate.source === "data") {
const parsed = parseBase64DataUri(candidate.url);
if (!parsed) {
log("[url-to-markdown] Skip embedded media: unsupported or invalid data URI");
continue;
}
contentType = parsed.contentType;
extension = resolveExtensionFromContentType(contentType);
kind = resolveMediaKind(sourceUrl, contentType, extension, candidate.hint);
bytes = parsed.bytes;
} else {
const response = await fetch(candidate.url, {
method: "GET",
redirect: "follow",
headers: {
"user-agent": DOWNLOAD_USER_AGENT,
},
});
if (!response.ok) {
log(`[url-to-markdown] Skip media (${response.status}): ${candidate.url}`);
continue;
}
sourceUrl = response.url || candidate.url;
contentType = normalizeContentType(response.headers.get("content-type"));
extension = resolveExtensionFromUrl(sourceUrl) ?? resolveExtensionFromUrl(candidate.url);
kind = resolveMediaKind(sourceUrl, contentType, extension, candidate.hint);
bytes = Buffer.from(await response.arrayBuffer());
}
const sourceUrl = response.url || candidate.url;
const contentType = normalizeContentType(response.headers.get("content-type"));
const extension = resolveExtensionFromUrl(sourceUrl) ?? resolveExtensionFromUrl(candidate.url);
const kind = resolveMediaKind(sourceUrl, contentType, extension, candidate.hint);
if (!kind) {
if (!kind || !bytes) {
continue;
}
@@ -274,7 +331,6 @@
const fileName = buildFileName(kind, nextIndex, sourceUrl, outputExtension);
const absolutePath = path.join(targetDir, fileName);
const relativePath = path.posix.join(dirName, fileName);
const bytes = Buffer.from(await response.arrayBuffer());
await writeFile(absolutePath, bytes);
replacements.set(candidate.url, relativePath);
@@ -305,6 +361,7 @@ export function countRemoteMedia(markdown: string): { images: number; videos: nu
let images = 0;
let videos = 0;
for (const c of candidates) {
if (c.source !== "remote") continue;
const ext = resolveExtensionFromUrl(c.url);
const kind = resolveKindFromExtension(ext);
if (kind === "video") {


@@ -5,7 +5,7 @@
"dependencies": {
"@mozilla/readability": "^0.6.0",
"baoyu-chrome-cdp": "file:./vendor/baoyu-chrome-cdp",
"defuddle": "^0.12.0",
"defuddle": "^0.14.0",
"jsdom": "^24.1.3",
"linkedom": "^0.18.12",
"turndown": "^7.2.2",