chore: release v1.13.0

This commit is contained in:
Jim Liu 宝玉 2026-01-21 19:40:46 -06:00
parent 235868343c
commit 97da7ab4eb
18 changed files with 999 additions and 22 deletions


@@ -6,7 +6,7 @@
},
"metadata": {
"description": "Skills shared by Baoyu for improving daily work efficiency",
"version": "1.12.0"
"version": "1.13.0"
},
"plugins": [
{
@@ -42,7 +42,8 @@
"strict": false,
"skills": [
"./skills/baoyu-danger-x-to-markdown",
"./skills/baoyu-compress-image"
"./skills/baoyu-compress-image",
"./skills/baoyu-url-to-markdown"
]
}
]

.gitignore vendored

@@ -6,6 +6,7 @@ yarn-debug.log*
yarn-error.log*
lerna-debug.log*
# Diagnostic reports (https://nodejs.org/api/report.html)
report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json
@@ -145,3 +146,4 @@ tests-data/
.baoyu-skills/
x-to-markdown/
xhs-images/
url-to-markdown/


@@ -2,6 +2,14 @@
English | [中文](./CHANGELOG.zh.md)
## 1.13.0 - 2026-01-21
### Features
- `baoyu-url-to-markdown`: new utility skill for fetching any URL via Chrome CDP and converting it to clean markdown. Supports two capture modes: auto (immediate capture on page load) and wait (user-controlled capture for login-required pages).
### Improvements
- `baoyu-xhs-images`: updates style recommendations, replacing `tech` references with `notion` and `chalkboard` for technical and educational content.
## 1.12.0 - 2026-01-21
### Refactor


@@ -2,6 +2,14 @@
[English](./CHANGELOG.md) | 中文
## 1.13.0 - 2026-01-21
### Features
- `baoyu-url-to-markdown`: new URL-to-Markdown utility skill that fetches any web page via Chrome CDP and converts it to clean Markdown. Supports two capture modes: auto (capture immediately after page load) and wait (user-controlled capture timing, for pages that require login).
### Improvements
- `baoyu-xhs-images`: updates style recommendations, replacing `tech` style references with `notion` and `chalkboard` for technical and educational content.
## 1.12.0 - 2026-01-21
### Refactor


@@ -55,7 +55,7 @@ Simply tell Claude Code:
|--------|-------------|--------|
| **content-skills** | Content generation and publishing | [xhs-images](#baoyu-xhs-images), [infographic](#baoyu-infographic), [cover-image](#baoyu-cover-image), [slide-deck](#baoyu-slide-deck), [comic](#baoyu-comic), [article-illustrator](#baoyu-article-illustrator), [post-to-x](#baoyu-post-to-x), [post-to-wechat](#baoyu-post-to-wechat) |
| **ai-generation-skills** | AI-powered generation backends | [image-gen](#baoyu-image-gen), [danger-gemini-web](#baoyu-danger-gemini-web) |
| **utility-skills** | Utility tools for content processing | [danger-x-to-markdown](#baoyu-danger-x-to-markdown), [compress-image](#baoyu-compress-image) |
| **utility-skills** | Utility tools for content processing | [url-to-markdown](#baoyu-url-to-markdown), [danger-x-to-markdown](#baoyu-danger-x-to-markdown), [compress-image](#baoyu-compress-image) |
## Update Skills
@@ -586,6 +586,35 @@ Interacts with Gemini Web to generate text and images.
Utility tools for content processing.
#### baoyu-url-to-markdown
Fetch any URL via Chrome CDP and convert to clean markdown. Supports two capture modes for different scenarios.
```bash
# Auto mode (default) - capture when page loads
/baoyu-url-to-markdown https://example.com/article
# Wait mode - for login-required pages
/baoyu-url-to-markdown https://example.com/private --wait
# Save to specific file
/baoyu-url-to-markdown https://example.com/article -o output.md
```
**Capture Modes**:
| Mode | Description | Best For |
|------|-------------|----------|
| Auto (default) | Captures immediately after page load | Public pages, static content |
| Wait (`--wait`) | Waits for user signal before capture | Login-required, dynamic content |
**Options**:
| Option | Description |
|--------|-------------|
| `<url>` | URL to fetch |
| `-o <path>` | Output file path |
| `--wait` | Wait for user signal before capturing |
| `--timeout <ms>` | Page load timeout (default: 30000) |
#### baoyu-danger-x-to-markdown
Converts X (Twitter) content to markdown format. Supports tweet threads and X Articles.


@@ -55,7 +55,7 @@ npx skills add jimliu/baoyu-skills
|------|------|----------|
| **content-skills** | Content generation and publishing | [xhs-images](#baoyu-xhs-images), [infographic](#baoyu-infographic), [cover-image](#baoyu-cover-image), [slide-deck](#baoyu-slide-deck), [comic](#baoyu-comic), [article-illustrator](#baoyu-article-illustrator), [post-to-x](#baoyu-post-to-x), [post-to-wechat](#baoyu-post-to-wechat) |
| **ai-generation-skills** | AI-powered generation backends | [image-gen](#baoyu-image-gen), [danger-gemini-web](#baoyu-danger-gemini-web) |
| **utility-skills** | Utility tools for content processing | [danger-x-to-markdown](#baoyu-danger-x-to-markdown), [compress-image](#baoyu-compress-image) |
| **utility-skills** | Utility tools for content processing | [url-to-markdown](#baoyu-url-to-markdown), [danger-x-to-markdown](#baoyu-danger-x-to-markdown), [compress-image](#baoyu-compress-image) |
## Update Skills
@@ -586,6 +586,35 @@ AI-powered generation backends.
Utility tools for content processing.
#### baoyu-url-to-markdown
Fetches any URL via Chrome CDP and converts it to clean Markdown. Supports two capture modes for different scenarios.
```bash
# Auto mode (default) - capture immediately after page load
/baoyu-url-to-markdown https://example.com/article
# Wait mode - for login-required pages
/baoyu-url-to-markdown https://example.com/private --wait
# Save to a specific file
/baoyu-url-to-markdown https://example.com/article -o output.md
```
**抓取模式**
| 模式 | 说明 | 适用场景 |
|------|------|----------|
| 自动(默认) | 页面加载后立即抓取 | 公开页面、静态内容 |
| 等待(`--wait` | 等待用户信号后抓取 | 需登录页面、动态内容 |
**选项**
| 选项 | 说明 |
|------|------|
| `<url>` | 要抓取的 URL |
| `-o <path>` | 输出文件路径 |
| `--wait` | 等待用户信号后抓取 |
| `--timeout <ms>` | 页面加载超时默认30000 |
#### baoyu-danger-x-to-markdown
Converts X (Twitter) content to markdown format. Supports tweet threads and X Articles.


@@ -0,0 +1,169 @@
---
name: baoyu-url-to-markdown
description: Fetch any URL and convert to markdown using Chrome CDP. Supports two modes - auto-capture on page load, or wait for user signal (for pages requiring login). Use when user wants to save a webpage as markdown.
---
# URL to Markdown
Fetches any URL via Chrome CDP and converts HTML to clean markdown.
## Script Directory
**Important**: All scripts are located in the `scripts/` subdirectory of this skill.
**Agent Execution Instructions**:
1. Determine this SKILL.md file's directory path as `SKILL_DIR`
2. Script path = `${SKILL_DIR}/scripts/<script-name>.ts`
3. Replace all `${SKILL_DIR}` in this document with the actual path
**Script Reference**:
| Script | Purpose |
|--------|---------|
| `scripts/main.ts` | CLI entry point for URL fetching |
## Features
- Chrome CDP for full JavaScript rendering
- Two capture modes: auto or wait-for-user
- Clean markdown output with metadata
- Handles login-required pages via wait mode
## Usage
```bash
# Auto mode (default) - capture when page loads
npx -y bun ${SKILL_DIR}/scripts/main.ts <url>
# Wait mode - wait for user signal before capture
npx -y bun ${SKILL_DIR}/scripts/main.ts <url> --wait
# Save to specific file
npx -y bun ${SKILL_DIR}/scripts/main.ts <url> -o output.md
```
## Options
| Option | Description |
|--------|-------------|
| `<url>` | URL to fetch |
| `-o <path>` | Output file path (default: auto-generated) |
| `--wait` | Wait for user signal before capturing |
| `--timeout <ms>` | Page load timeout (default: 30000) |
## Capture Modes
### Auto Mode (default)
Page loads → waits for network idle → captures immediately.
Best for:
- Public pages
- Static content
- No login required
### Wait Mode (`--wait`)
Page opens → user can interact (login, scroll, etc.) → user signals ready → captures.
Best for:
- Login-required pages
- Dynamic content needing interaction
- Pages with lazy loading
**Agent workflow for wait mode**:
1. Run script with `--wait` flag
2. Script outputs: `Page opened. Press Enter when ready to capture...`
3. Use `AskUserQuestion` to ask user if page is ready
4. When user confirms, send newline to stdin to trigger capture
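The stdin handshake in step 4 can be sketched as follows. This is a minimal sketch, not the skill's actual code: the inline `-e` child here is a hypothetical stand-in for `scripts/main.ts --wait`, blocking until stdin receives data and then reporting capture.

```typescript
import { spawn } from "node:child_process";
import { once } from "node:events";

// Stand-in child: waits for any stdin data (the "user is ready" signal),
// then prints and exits -- mimicking wait mode's capture trigger.
async function signalCapture(): Promise<string> {
  const child = spawn(process.execPath, [
    "-e",
    'process.stdin.once("data", () => { console.log("captured"); process.exit(0); });',
  ]);
  let out = "";
  child.stdout!.on("data", (d) => { out += d.toString(); });
  child.stdin!.write("\n"); // step 4: send newline to trigger capture
  await once(child, "close"); // "close" guarantees stdout is fully drained
  return out.trim();
}
```

In the real workflow the parent is the agent session, and the newline is written only after the user confirms via `AskUserQuestion`.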
## Output Format
```markdown
---
url: https://example.com/page
title: "Page Title"
description: "Meta description if available"
author: "Author if available"
published: "2024-01-01"
captured_at: "2024-01-15T10:30:00Z"
---
# Page Title
Converted markdown content...
```
## Mode Selection Guide
When user requests URL capture, help select appropriate mode:
**Suggest Auto Mode when**:
- URL is public (no login wall visible)
- Content appears static
- User doesn't mention login requirements
**Suggest Wait Mode when**:
- User mentions needing to log in
- Site known to require authentication
- User wants to scroll/interact before capture
- Content is behind paywall
**Ask user when unclear**:
```
The page may require login or interaction before capturing.
Which mode should I use?
1. Auto - Capture immediately when loaded
2. Wait - Wait for you to interact first
```
## Output Directory
Each capture creates a file organized by domain:
```
url-to-markdown/
└── <domain>/
└── <slug>.md
```
**Path Components**:
- `<domain>`: Site domain (e.g., `example.com`, `github.com`)
- `<slug>`: Generated from page title or URL path (kebab-case)
**Slug Generation**:
1. Extract from page title (preferred) or URL path
2. Convert to kebab-case, 2-6 words
3. Example: "Getting Started with React" → `getting-started-with-react`
**Conflict Resolution**:
If `url-to-markdown/<domain>/<slug>.md` already exists:
- Append timestamp: `<slug>-YYYYMMDD-HHMMSS.md`
- Example: `getting-started.md` exists → `getting-started-20260118-143052.md`
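The slug rules above can be sketched in TypeScript (this mirrors the `generateSlug` helper in `scripts/main.ts`; the 50-character cap is from that implementation):

```typescript
// Kebab-case slug from the page title, falling back to the URL path,
// with "page" as a last resort.
function generateSlug(title: string, url: string): string {
  const text = title || new URL(url).pathname.replace(/\//g, "-");
  return text
    .toLowerCase()
    .replace(/[^\w\s-]/g, "") // drop punctuation
    .replace(/\s+/g, "-")     // spaces -> hyphens
    .replace(/-+/g, "-")      // collapse hyphen runs
    .replace(/^-|-$/g, "")    // trim edge hyphens
    .slice(0, 50) || "page";
}

console.log(generateSlug("Getting Started with React", "https://example.com"));
// getting-started-with-react
```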
## Error Handling
| Error | Resolution |
|-------|------------|
| Chrome not found | Install Chrome or set `URL_CHROME_PATH` env |
| Page timeout | Increase `--timeout` value |
| Capture failed | Try wait mode for complex pages |
| Empty content | Page may need JS rendering time |
## Environment Variables
| Variable | Description |
|----------|-------------|
| `URL_CHROME_PATH` | Custom Chrome executable path |
| `URL_DATA_DIR` | Custom data directory |
| `URL_CHROME_PROFILE_DIR` | Custom Chrome profile directory |
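How `URL_CHROME_PATH` is honored can be sketched as below, mirroring `findChromeExecutable` in `scripts/cdp.ts`: an explicit override wins only if it exists on disk, otherwise a platform candidate list is searched. The candidate list parameter here is an illustration, not part of the actual API.

```typescript
import fs from "node:fs";

// Resolve the Chrome binary: env override first (must exist on disk),
// then the first existing path from a caller-supplied candidate list.
function resolveChromePath(candidates: string[]): string | null {
  const override = process.env.URL_CHROME_PATH?.trim();
  if (override && fs.existsSync(override)) return override;
  return candidates.find((p) => fs.existsSync(p)) ?? null;
}
```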
## Extension Support
Custom configurations via EXTEND.md.
**Check paths** (priority order):
1. `.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` (project)
2. `~/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` (user)
If found, load before workflow. Extension content overrides defaults.


@@ -0,0 +1,290 @@
import { spawn, type ChildProcess } from "node:child_process";
import fs from "node:fs";
import { mkdir } from "node:fs/promises";
import net from "node:net";
import process from "node:process";
import { resolveUrlToMarkdownChromeProfileDir } from "./paths.js";
import { NETWORK_IDLE_TIMEOUT_MS } from "./constants.js";
type CdpSendOptions = { sessionId?: string; timeoutMs?: number };
function sleep(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
async function fetchWithTimeout(url: string, init: RequestInit & { timeoutMs?: number } = {}): Promise<Response> {
const { timeoutMs, ...rest } = init;
if (!timeoutMs || timeoutMs <= 0) return fetch(url, rest);
const ctl = new AbortController();
const t = setTimeout(() => ctl.abort(), timeoutMs);
try {
return await fetch(url, { ...rest, signal: ctl.signal });
} finally {
clearTimeout(t);
}
}
export class CdpConnection {
private ws: WebSocket;
private nextId = 0;
private pending = new Map<number, { resolve: (v: unknown) => void; reject: (e: Error) => void; timer: ReturnType<typeof setTimeout> | null }>();
private eventHandlers = new Map<string, Set<(params: unknown) => void>>();
private constructor(ws: WebSocket) {
this.ws = ws;
this.ws.addEventListener("message", (event) => {
try {
const data = typeof event.data === "string" ? event.data : new TextDecoder().decode(event.data as ArrayBuffer);
const msg = JSON.parse(data) as { id?: number; method?: string; params?: unknown; result?: unknown; error?: { message?: string } };
if (msg.id) {
const p = this.pending.get(msg.id);
if (p) {
this.pending.delete(msg.id);
if (p.timer) clearTimeout(p.timer);
if (msg.error?.message) p.reject(new Error(msg.error.message));
else p.resolve(msg.result);
}
} else if (msg.method) {
const handlers = this.eventHandlers.get(msg.method);
if (handlers) {
for (const h of handlers) h(msg.params);
}
}
} catch {}
});
this.ws.addEventListener("close", () => {
for (const [id, p] of this.pending.entries()) {
this.pending.delete(id);
if (p.timer) clearTimeout(p.timer);
p.reject(new Error("CDP connection closed."));
}
});
}
static async connect(url: string, timeoutMs: number): Promise<CdpConnection> {
const ws = new WebSocket(url);
await new Promise<void>((resolve, reject) => {
const t = setTimeout(() => reject(new Error("CDP connection timeout.")), timeoutMs);
ws.addEventListener("open", () => { clearTimeout(t); resolve(); });
ws.addEventListener("error", () => { clearTimeout(t); reject(new Error("CDP connection failed.")); });
});
return new CdpConnection(ws);
}
on(event: string, handler: (params: unknown) => void): void {
let handlers = this.eventHandlers.get(event);
if (!handlers) {
handlers = new Set();
this.eventHandlers.set(event, handlers);
}
handlers.add(handler);
}
off(event: string, handler: (params: unknown) => void): void {
this.eventHandlers.get(event)?.delete(handler);
}
async send<T = unknown>(method: string, params?: Record<string, unknown>, opts?: CdpSendOptions): Promise<T> {
const id = ++this.nextId;
const msg: Record<string, unknown> = { id, method };
if (params) msg.params = params;
if (opts?.sessionId) msg.sessionId = opts.sessionId;
const timeoutMs = opts?.timeoutMs ?? 15_000;
const out = await new Promise<unknown>((resolve, reject) => {
const t = timeoutMs > 0 ? setTimeout(() => { this.pending.delete(id); reject(new Error(`CDP timeout: ${method}`)); }, timeoutMs) : null;
this.pending.set(id, { resolve, reject, timer: t });
this.ws.send(JSON.stringify(msg));
});
return out as T;
}
close(): void {
try { this.ws.close(); } catch {}
}
}
export async function getFreePort(): Promise<number> {
return await new Promise((resolve, reject) => {
const srv = net.createServer();
srv.unref();
srv.on("error", reject);
srv.listen(0, "127.0.0.1", () => {
const addr = srv.address();
if (!addr || typeof addr === "string") {
srv.close(() => reject(new Error("Unable to allocate a free TCP port.")));
return;
}
const port = addr.port;
srv.close((err) => (err ? reject(err) : resolve(port)));
});
});
}
export function findChromeExecutable(): string | null {
const override = process.env.URL_CHROME_PATH?.trim();
if (override && fs.existsSync(override)) return override;
const candidates: string[] = [];
switch (process.platform) {
case "darwin":
candidates.push(
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
"/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary",
"/Applications/Google Chrome Beta.app/Contents/MacOS/Google Chrome Beta",
"/Applications/Chromium.app/Contents/MacOS/Chromium",
"/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge"
);
break;
case "win32":
candidates.push(
"C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
"C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe",
"C:\\Program Files\\Microsoft\\Edge\\Application\\msedge.exe",
"C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe"
);
break;
default:
candidates.push(
"/usr/bin/google-chrome",
"/usr/bin/google-chrome-stable",
"/usr/bin/chromium",
"/usr/bin/chromium-browser",
"/snap/bin/chromium",
"/usr/bin/microsoft-edge"
);
break;
}
for (const p of candidates) {
if (fs.existsSync(p)) return p;
}
return null;
}
export async function waitForChromeDebugPort(port: number, timeoutMs: number): Promise<string> {
const start = Date.now();
while (Date.now() - start < timeoutMs) {
try {
const res = await fetchWithTimeout(`http://127.0.0.1:${port}/json/version`, { timeoutMs: 5_000 });
if (!res.ok) throw new Error(`status=${res.status}`);
const j = (await res.json()) as { webSocketDebuggerUrl?: string };
if (j.webSocketDebuggerUrl) return j.webSocketDebuggerUrl;
} catch {}
await sleep(200);
}
throw new Error("Chrome debug port not ready");
}
export async function launchChrome(url: string, port: number, headless: boolean = false): Promise<ChildProcess> {
const chrome = findChromeExecutable();
if (!chrome) throw new Error("Chrome executable not found. Install Chrome or set URL_CHROME_PATH env.");
const profileDir = resolveUrlToMarkdownChromeProfileDir();
await mkdir(profileDir, { recursive: true });
const args = [
`--remote-debugging-port=${port}`,
`--user-data-dir=${profileDir}`,
"--no-first-run",
"--no-default-browser-check",
"--disable-popup-blocking",
];
if (headless) args.push("--headless=new");
args.push(url);
return spawn(chrome, args, { stdio: "ignore" });
}
export async function waitForNetworkIdle(cdp: CdpConnection, sessionId: string, timeoutMs: number = NETWORK_IDLE_TIMEOUT_MS): Promise<void> {
return new Promise((resolve) => {
let timer: ReturnType<typeof setTimeout> | null = null;
let pending = 0;
const cleanup = () => {
if (timer) clearTimeout(timer);
cdp.off("Network.requestWillBeSent", onRequest);
cdp.off("Network.loadingFinished", onFinish);
cdp.off("Network.loadingFailed", onFinish);
};
const done = () => { cleanup(); resolve(); };
const resetTimer = () => {
if (timer) clearTimeout(timer);
timer = setTimeout(done, timeoutMs);
};
const onRequest = () => { pending++; resetTimer(); };
const onFinish = () => { pending = Math.max(0, pending - 1); if (pending <= 2) resetTimer(); };
cdp.on("Network.requestWillBeSent", onRequest);
cdp.on("Network.loadingFinished", onFinish);
cdp.on("Network.loadingFailed", onFinish);
resetTimer();
});
}
export async function waitForPageLoad(cdp: CdpConnection, sessionId: string, timeoutMs: number = 30_000): Promise<void> {
// Resolves (rather than rejects) on timeout so callers can fall through to capture.
return new Promise((resolve) => {
const timer = setTimeout(() => {
cdp.off("Page.loadEventFired", handler);
resolve();
}, timeoutMs);
const handler = () => {
clearTimeout(timer);
cdp.off("Page.loadEventFired", handler);
resolve();
};
cdp.on("Page.loadEventFired", handler);
});
}
export async function createTargetAndAttach(cdp: CdpConnection, url: string): Promise<{ targetId: string; sessionId: string }> {
const { targetId } = await cdp.send<{ targetId: string }>("Target.createTarget", { url });
const { sessionId } = await cdp.send<{ sessionId: string }>("Target.attachToTarget", { targetId, flatten: true });
await cdp.send("Network.enable", {}, { sessionId });
await cdp.send("Page.enable", {}, { sessionId });
return { targetId, sessionId };
}
export async function navigateAndWait(cdp: CdpConnection, sessionId: string, url: string, timeoutMs: number): Promise<void> {
// Page.lifecycleEvent is only emitted after lifecycle events are enabled for the session.
await cdp.send("Page.setLifecycleEventsEnabled", { enabled: true }, { sessionId });
const loadPromise = new Promise<void>((resolve, reject) => {
const timer = setTimeout(() => reject(new Error("Page load timeout")), timeoutMs);
const handler = (params: unknown) => {
const p = params as { name?: string };
if (p.name === "load" || p.name === "DOMContentLoaded") {
clearTimeout(timer);
cdp.off("Page.lifecycleEvent", handler);
resolve();
}
};
cdp.on("Page.lifecycleEvent", handler);
});
await cdp.send("Page.navigate", { url }, { sessionId });
await loadPromise;
}
export async function evaluateScript<T>(cdp: CdpConnection, sessionId: string, expression: string, timeoutMs: number = 30_000): Promise<T> {
const result = await cdp.send<{ result: { value?: T; type?: string; description?: string } }>(
"Runtime.evaluate",
{ expression, returnByValue: true, awaitPromise: true },
{ sessionId, timeoutMs }
);
return result.result.value as T;
}
export async function autoScroll(cdp: CdpConnection, sessionId: string, steps: number = 8, waitMs: number = 600): Promise<void> {
let lastHeight = await evaluateScript<number>(cdp, sessionId, "document.body.scrollHeight");
for (let i = 0; i < steps; i++) {
await evaluateScript<void>(cdp, sessionId, "window.scrollTo(0, document.body.scrollHeight)");
await sleep(waitMs);
const newHeight = await evaluateScript<number>(cdp, sessionId, "document.body.scrollHeight");
if (newHeight === lastHeight) break;
lastHeight = newHeight;
}
await evaluateScript<void>(cdp, sessionId, "window.scrollTo(0, 0)");
}
export function killChrome(chrome: ChildProcess): void {
try { chrome.kill("SIGTERM"); } catch {}
setTimeout(() => {
if (!chrome.killed) {
try { chrome.kill("SIGKILL"); } catch {}
}
}, 2_000).unref?.();
}


@@ -0,0 +1,13 @@
import { resolveUrlToMarkdownChromeProfileDir } from "./paths.js";
export const DEFAULT_USER_AGENT =
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36";
export const USER_DATA_DIR = resolveUrlToMarkdownChromeProfileDir();
export const DEFAULT_TIMEOUT_MS = 30_000;
export const CDP_CONNECT_TIMEOUT_MS = 15_000;
export const NETWORK_IDLE_TIMEOUT_MS = 1_500;
export const POST_LOAD_DELAY_MS = 800;
export const SCROLL_STEP_WAIT_MS = 600;
export const SCROLL_MAX_STEPS = 8;


@@ -0,0 +1,223 @@
export interface PageMetadata {
url: string;
title: string;
description?: string;
author?: string;
published?: string;
captured_at: string;
}
export interface ConversionResult {
metadata: PageMetadata;
markdown: string;
}
export const cleanupAndExtractScript = `
(function() {
const removeSelectors = [
'script', 'style', 'noscript', 'iframe', 'svg', 'canvas',
'header nav', 'footer', '.sidebar', '.nav', '.navigation',
'.advertisement', '.ad', '.ads', '.cookie-banner', '.popup',
'[role="banner"]', '[role="navigation"]', '[role="complementary"]'
];
for (const sel of removeSelectors) {
try {
document.querySelectorAll(sel).forEach(el => el.remove());
} catch {}
}
document.querySelectorAll('*').forEach(el => {
el.removeAttribute('style');
el.removeAttribute('onclick');
el.removeAttribute('onload');
el.removeAttribute('onerror');
});
const baseUrl = document.baseURI || location.href;
document.querySelectorAll('a[href]').forEach(a => {
try {
const href = a.getAttribute('href');
if (href && !href.startsWith('#') && !href.startsWith('javascript:')) {
a.setAttribute('href', new URL(href, baseUrl).href);
}
} catch {}
});
document.querySelectorAll('img[src]').forEach(img => {
try {
const src = img.getAttribute('src');
if (src) img.setAttribute('src', new URL(src, baseUrl).href);
} catch {}
});
const getMeta = (names) => {
for (const name of names) {
const el = document.querySelector(\`meta[name="\${name}"], meta[property="\${name}"]\`);
if (el) {
const content = el.getAttribute('content');
if (content) return content.trim();
}
}
return undefined;
};
const getTitle = () => {
const ogTitle = getMeta(['og:title', 'twitter:title']);
if (ogTitle) return ogTitle;
const h1 = document.querySelector('h1');
if (h1) return h1.textContent?.trim();
return document.title?.trim() || '';
};
const getPublished = () => {
const timeEl = document.querySelector('time[datetime]');
if (timeEl) return timeEl.getAttribute('datetime');
return getMeta(['article:published_time', 'datePublished', 'date']);
};
const main = document.querySelector('main, article, [role="main"], .main-content, .post-content, .article-content, .content');
const html = main ? main.innerHTML : document.body.innerHTML;
return {
title: getTitle(),
description: getMeta(['description', 'og:description', 'twitter:description']),
author: getMeta(['author', 'article:author', 'twitter:creator']),
published: getPublished(),
html: html
};
})()
`;
function decodeHtmlEntities(text: string): string {
return text
.replace(/&nbsp;/g, ' ')
.replace(/&lt;/g, '<')
.replace(/&gt;/g, '>')
.replace(/&quot;/g, '"')
.replace(/&#39;/g, "'")
.replace(/&#x27;/g, "'")
.replace(/&#(\d+);/g, (_, n) => String.fromCodePoint(parseInt(n, 10)))
.replace(/&#x([0-9a-fA-F]+);/g, (_, n) => String.fromCodePoint(parseInt(n, 16)))
// Decode &amp; last so double-escaped text (e.g. &amp;lt;) is not over-decoded.
.replace(/&amp;/g, '&');
}
function stripTags(html: string): string {
return html.replace(/<[^>]+>/g, '');
}
function normalizeWhitespace(text: string): string {
return text.replace(/[ \t]+/g, ' ').replace(/\n{3,}/g, '\n\n').trim();
}
export function htmlToMarkdown(html: string): string {
let md = html;
md = md.replace(/<br\s*\/?>/gi, '\n');
md = md.replace(/<hr\s*\/?>/gi, '\n\n---\n\n');
md = md.replace(/<h1[^>]*>([\s\S]*?)<\/h1>/gi, (_, c) => `\n\n# ${stripTags(c).trim()}\n\n`);
md = md.replace(/<h2[^>]*>([\s\S]*?)<\/h2>/gi, (_, c) => `\n\n## ${stripTags(c).trim()}\n\n`);
md = md.replace(/<h3[^>]*>([\s\S]*?)<\/h3>/gi, (_, c) => `\n\n### ${stripTags(c).trim()}\n\n`);
md = md.replace(/<h4[^>]*>([\s\S]*?)<\/h4>/gi, (_, c) => `\n\n#### ${stripTags(c).trim()}\n\n`);
md = md.replace(/<h5[^>]*>([\s\S]*?)<\/h5>/gi, (_, c) => `\n\n##### ${stripTags(c).trim()}\n\n`);
md = md.replace(/<h6[^>]*>([\s\S]*?)<\/h6>/gi, (_, c) => `\n\n###### ${stripTags(c).trim()}\n\n`);
md = md.replace(/<strong[^>]*>([\s\S]*?)<\/strong>/gi, (_, c) => `**${stripTags(c).trim()}**`);
md = md.replace(/<b[^>]*>([\s\S]*?)<\/b>/gi, (_, c) => `**${stripTags(c).trim()}**`);
md = md.replace(/<em[^>]*>([\s\S]*?)<\/em>/gi, (_, c) => `*${stripTags(c).trim()}*`);
md = md.replace(/<i[^>]*>([\s\S]*?)<\/i>/gi, (_, c) => `*${stripTags(c).trim()}*`);
md = md.replace(/<del[^>]*>([\s\S]*?)<\/del>/gi, (_, c) => `~~${stripTags(c).trim()}~~`);
md = md.replace(/<s[^>]*>([\s\S]*?)<\/s>/gi, (_, c) => `~~${stripTags(c).trim()}~~`);
md = md.replace(/<mark[^>]*>([\s\S]*?)<\/mark>/gi, (_, c) => `==${stripTags(c).trim()}==`);
md = md.replace(/<a[^>]*href=["']([^"']+)["'][^>]*>([\s\S]*?)<\/a>/gi, (_, href, text) => {
const t = stripTags(text).trim();
if (!t || href.startsWith('javascript:')) return t;
return `[${t}](${href})`;
});
md = md.replace(/<img[^>]*src=["']([^"']+)["'][^>]*alt=["']([^"']*)["'][^>]*\/?>/gi, (_, src, alt) => `![${alt}](${src})`);
md = md.replace(/<img[^>]*alt=["']([^"']*)["'][^>]*src=["']([^"']+)["'][^>]*\/?>/gi, (_, alt, src) => `![${alt}](${src})`);
md = md.replace(/<img[^>]*src=["']([^"']+)["'][^>]*\/?>/gi, (_, src) => `![](${src})`);
md = md.replace(/<pre[^>]*><code[^>]*>([\s\S]*?)<\/code><\/pre>/gi, (_, code) => `\n\n\`\`\`\n${decodeHtmlEntities(stripTags(code)).trim()}\n\`\`\`\n\n`);
md = md.replace(/<pre[^>]*>([\s\S]*?)<\/pre>/gi, (_, code) => `\n\n\`\`\`\n${decodeHtmlEntities(stripTags(code)).trim()}\n\`\`\`\n\n`);
md = md.replace(/<code[^>]*>([\s\S]*?)<\/code>/gi, (_, code) => `\`${decodeHtmlEntities(stripTags(code)).trim()}\``);
md = md.replace(/<blockquote[^>]*>([\s\S]*?)<\/blockquote>/gi, (_, content) => {
const lines = stripTags(content).trim().split('\n');
return '\n\n' + lines.map(l => `> ${l.trim()}`).join('\n') + '\n\n';
});
md = md.replace(/<ul[^>]*>([\s\S]*?)<\/ul>/gi, (_, items) => {
const lis = items.match(/<li[^>]*>([\s\S]*?)<\/li>/gi) || [];
const lines = lis.map(li => {
const content = li.replace(/<li[^>]*>([\s\S]*?)<\/li>/i, '$1');
return `- ${stripTags(content).trim()}`;
});
return '\n\n' + lines.join('\n') + '\n\n';
});
md = md.replace(/<ol[^>]*>([\s\S]*?)<\/ol>/gi, (_, items) => {
const lis = items.match(/<li[^>]*>([\s\S]*?)<\/li>/gi) || [];
const lines = lis.map((li, i) => {
const content = li.replace(/<li[^>]*>([\s\S]*?)<\/li>/i, '$1');
return `${i + 1}. ${stripTags(content).trim()}`;
});
return '\n\n' + lines.join('\n') + '\n\n';
});
md = md.replace(/<table[^>]*>([\s\S]*?)<\/table>/gi, (_, table) => {
const rows: string[][] = [];
const trMatches = table.match(/<tr[^>]*>([\s\S]*?)<\/tr>/gi) || [];
for (const tr of trMatches) {
const cells: string[] = [];
const cellMatches = tr.match(/<t[hd][^>]*>([\s\S]*?)<\/t[hd]>/gi) || [];
for (const cell of cellMatches) {
const content = cell.replace(/<t[hd][^>]*>([\s\S]*?)<\/t[hd]>/i, '$1');
cells.push(stripTags(content).trim().replace(/\|/g, '\\|'));
}
if (cells.length > 0) rows.push(cells);
}
if (rows.length === 0) return '';
const colCount = Math.max(...rows.map(r => r.length));
const normalizedRows = rows.map(r => {
while (r.length < colCount) r.push('');
return r;
});
const header = `| ${normalizedRows[0].join(' | ')} |`;
const sep = `| ${normalizedRows[0].map(() => '---').join(' | ')} |`;
const body = normalizedRows.slice(1).map(r => `| ${r.join(' | ')} |`).join('\n');
return '\n\n' + header + '\n' + sep + (body ? '\n' + body : '') + '\n\n';
});
md = md.replace(/<p[^>]*>([\s\S]*?)<\/p>/gi, (_, c) => `\n\n${stripTags(c).trim()}\n\n`);
md = md.replace(/<div[^>]*>([\s\S]*?)<\/div>/gi, (_, c) => `\n${stripTags(c).trim()}\n`);
md = md.replace(/<span[^>]*>([\s\S]*?)<\/span>/gi, (_, c) => stripTags(c));
md = stripTags(md);
md = decodeHtmlEntities(md);
md = normalizeWhitespace(md);
return md;
}
export function formatMetadataYaml(meta: PageMetadata): string {
const lines = ['---'];
lines.push(`url: ${meta.url}`);
lines.push(`title: "${meta.title.replace(/"/g, '\\"')}"`);
if (meta.description) lines.push(`description: "${meta.description.replace(/"/g, '\\"')}"`);
if (meta.author) lines.push(`author: "${meta.author.replace(/"/g, '\\"')}"`);
if (meta.published) lines.push(`published: "${meta.published}"`);
lines.push(`captured_at: "${meta.captured_at}"`);
lines.push('---');
return lines.join('\n');
}
export function createMarkdownDocument(result: ConversionResult): string {
const yaml = formatMetadataYaml(result.metadata);
const titleRegex = new RegExp(`^#\\s+${result.metadata.title.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\s*\\n`, 'i');
const hasTitle = titleRegex.test(result.markdown);
const title = result.metadata.title && !hasTitle ? `\n\n# ${result.metadata.title}\n\n` : '\n\n';
return yaml + title + result.markdown;
}


@@ -0,0 +1,176 @@
import { createInterface } from "node:readline";
import { writeFile, mkdir, access } from "node:fs/promises";
import path from "node:path";
import process from "node:process";
import { CdpConnection, getFreePort, launchChrome, waitForChromeDebugPort, waitForNetworkIdle, waitForPageLoad, autoScroll, evaluateScript, killChrome } from "./cdp.js";
import { cleanupAndExtractScript, htmlToMarkdown, createMarkdownDocument, type PageMetadata, type ConversionResult } from "./html-to-markdown.js";
import { resolveUrlToMarkdownDataDir } from "./paths.js";
import { DEFAULT_TIMEOUT_MS, CDP_CONNECT_TIMEOUT_MS, NETWORK_IDLE_TIMEOUT_MS, POST_LOAD_DELAY_MS, SCROLL_STEP_WAIT_MS, SCROLL_MAX_STEPS } from "./constants.js";
function sleep(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
async function fileExists(filePath: string): Promise<boolean> {
try {
await access(filePath);
return true;
} catch {
return false;
}
}
interface Args {
url: string;
output?: string;
wait: boolean;
timeout: number;
}
function parseArgs(argv: string[]): Args {
const args: Args = { url: "", wait: false, timeout: DEFAULT_TIMEOUT_MS };
for (let i = 2; i < argv.length; i++) {
const arg = argv[i];
if (arg === "--wait" || arg === "-w") {
args.wait = true;
} else if (arg === "-o" || arg === "--output") {
args.output = argv[++i];
} else if (arg === "--timeout" || arg === "-t") {
args.timeout = parseInt(argv[++i], 10) || DEFAULT_TIMEOUT_MS;
} else if (!arg.startsWith("-") && !args.url) {
args.url = arg;
}
}
return args;
}
function generateSlug(title: string, url: string): string {
const text = title || new URL(url).pathname.replace(/\//g, "-");
return text
.toLowerCase()
.replace(/[^\w\s-]/g, "")
.replace(/\s+/g, "-")
.replace(/-+/g, "-")
.replace(/^-|-$/g, "")
.slice(0, 50) || "page";
}
function formatTimestamp(): string {
const now = new Date();
const pad = (n: number) => n.toString().padStart(2, "0");
return `${now.getFullYear()}${pad(now.getMonth() + 1)}${pad(now.getDate())}-${pad(now.getHours())}${pad(now.getMinutes())}${pad(now.getSeconds())}`;
}
async function generateOutputPath(url: string, title: string): Promise<string> {
const domain = new URL(url).hostname.replace(/^www\./, "");
const slug = generateSlug(title, url);
const dataDir = resolveUrlToMarkdownDataDir();
const basePath = path.join(dataDir, domain, `${slug}.md`);
if (!(await fileExists(basePath))) {
return basePath;
}
const timestampSlug = `${slug}-${formatTimestamp()}`;
return path.join(dataDir, domain, `${timestampSlug}.md`);
}
async function waitForUserSignal(): Promise<void> {
console.log("Page opened. Press Enter when ready to capture...");
const rl = createInterface({ input: process.stdin, output: process.stdout });
await new Promise<void>((resolve) => {
rl.once("line", () => { rl.close(); resolve(); });
});
}

async function captureUrl(args: Args): Promise<ConversionResult> {
  const port = await getFreePort();
  const chrome = await launchChrome(args.url, port, false);
  let cdp: CdpConnection | null = null;
  try {
    const wsUrl = await waitForChromeDebugPort(port, 30_000);
    cdp = await CdpConnection.connect(wsUrl, CDP_CONNECT_TIMEOUT_MS);
    const targets = await cdp.send<{ targetInfos: Array<{ targetId: string; type: string; url: string }> }>("Target.getTargets");
    const pageTarget = targets.targetInfos.find(t => t.type === "page" && t.url.startsWith("http"));
    if (!pageTarget) throw new Error("No page target found");
    const { sessionId } = await cdp.send<{ sessionId: string }>("Target.attachToTarget", { targetId: pageTarget.targetId, flatten: true });
    await cdp.send("Network.enable", {}, { sessionId });
    await cdp.send("Page.enable", {}, { sessionId });
    if (args.wait) {
      await waitForUserSignal();
    } else {
      console.log("Waiting for page to load...");
      await Promise.race([
        waitForPageLoad(cdp, sessionId, 15_000),
        sleep(8_000)
      ]);
      await waitForNetworkIdle(cdp, sessionId, NETWORK_IDLE_TIMEOUT_MS);
      await sleep(POST_LOAD_DELAY_MS);
      console.log("Scrolling to trigger lazy load...");
      await autoScroll(cdp, sessionId, SCROLL_MAX_STEPS, SCROLL_STEP_WAIT_MS);
      await sleep(POST_LOAD_DELAY_MS);
    }
    console.log("Capturing page content...");
    const extracted = await evaluateScript<{ title: string; description?: string; author?: string; published?: string; html: string }>(
      cdp, sessionId, cleanupAndExtractScript, args.timeout
    );
    const metadata: PageMetadata = {
      url: args.url,
      title: extracted.title || "",
      description: extracted.description,
      author: extracted.author,
      published: extracted.published,
      captured_at: new Date().toISOString()
    };
    const markdown = htmlToMarkdown(extracted.html);
    return { metadata, markdown };
  } finally {
    if (cdp) {
      try { await cdp.send("Browser.close", {}, { timeoutMs: 5_000 }); } catch {}
      cdp.close();
    }
    killChrome(chrome);
  }
}

async function main(): Promise<void> {
  const args = parseArgs(process.argv);
  if (!args.url) {
    console.error("Usage: bun main.ts <url> [-o output.md] [--wait] [--timeout ms]");
    process.exit(1);
  }
  try {
    new URL(args.url);
  } catch {
    console.error(`Invalid URL: ${args.url}`);
    process.exit(1);
  }
  console.log(`Fetching: ${args.url}`);
  console.log(`Mode: ${args.wait ? "wait" : "auto"}`);
  const result = await captureUrl(args);
  const outputPath = args.output || await generateOutputPath(args.url, result.metadata.title);
  const outputDir = path.dirname(outputPath);
  await mkdir(outputDir, { recursive: true });
  const document = createMarkdownDocument(result);
  await writeFile(outputPath, document, "utf-8");
  console.log(`Saved: ${outputPath}`);
  console.log(`Title: ${result.metadata.title || "(no title)"}`);
}

main().catch((err) => {
  console.error("Error:", err instanceof Error ? err.message : String(err));
  process.exit(1);
});
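The `generateSlug` helper in this file is a pure function, so its behavior is easy to sanity-check in isolation. A minimal sketch, re-declaring the function verbatim so the examples are self-contained:

```typescript
// Copy of generateSlug from main.ts, reproduced here for illustration.
function generateSlug(title: string, url: string): string {
  const text = title || new URL(url).pathname.replace(/\//g, "-");
  return text
    .toLowerCase()
    .replace(/[^\w\s-]/g, "") // drop punctuation; keep word chars, spaces, hyphens
    .replace(/\s+/g, "-")     // whitespace runs -> single hyphen
    .replace(/-+/g, "-")      // collapse hyphen runs
    .replace(/^-|-$/g, "")    // trim leading/trailing hyphens
    .slice(0, 50) || "page";  // cap length; fall back to "page" if empty
}

console.log(generateSlug("Hello, World! A Test", "https://example.com/"));
// hello-world-a-test
console.log(generateSlug("", "https://example.com/posts/my-post/"));
// falls back to the URL pathname: posts-my-post
```

Titles that contain only non-word characters (e.g. pure punctuation) reduce to the `"page"` fallback, which is what keeps `generateOutputPath` from producing an empty filename.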

View File

@ -0,0 +1,29 @@
import os from "node:os";
import path from "node:path";
import process from "node:process";
const APP_DATA_DIR = "baoyu-skills";
const URL_TO_MARKDOWN_DATA_DIR = "url-to-markdown";
const PROFILE_DIR_NAME = "chrome-profile";

export function resolveUserDataRoot(): string {
  if (process.platform === "win32") {
    return process.env.APPDATA ?? path.join(os.homedir(), "AppData", "Roaming");
  }
  if (process.platform === "darwin") {
    return path.join(os.homedir(), "Library", "Application Support");
  }
  return process.env.XDG_DATA_HOME ?? path.join(os.homedir(), ".local", "share");
}

export function resolveUrlToMarkdownDataDir(): string {
  const override = process.env.URL_DATA_DIR?.trim();
  if (override) return path.resolve(override);
  return path.join(process.cwd(), URL_TO_MARKDOWN_DATA_DIR);
}

export function resolveUrlToMarkdownChromeProfileDir(): string {
  const override = process.env.URL_CHROME_PROFILE_DIR?.trim();
  if (override) return path.resolve(override);
  return path.join(resolveUserDataRoot(), APP_DATA_DIR, URL_TO_MARKDOWN_DATA_DIR, PROFILE_DIR_NAME);
}
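The resolution order above (explicit env override first, then a platform default) can be sketched with a standalone variant that takes its inputs as parameters instead of reading `process` and `os` directly. The function name and parameter shape here are hypothetical, chosen only to make the logic testable in isolation:

```typescript
import path from "node:path";

// Hypothetical injectable version of resolveUserDataRoot: platform, env,
// and home directory are passed in rather than read from globals.
function userDataRoot(
  platform: string,
  env: Record<string, string | undefined>,
  home: string
): string {
  if (platform === "win32") {
    return env.APPDATA ?? path.join(home, "AppData", "Roaming");
  }
  if (platform === "darwin") {
    return path.join(home, "Library", "Application Support");
  }
  // Linux and other POSIX platforms honor XDG_DATA_HOME when set.
  return env.XDG_DATA_HOME ?? path.join(home, ".local", "share");
}

console.log(userDataRoot("linux", { XDG_DATA_HOME: "/data" }, "/home/alice"));
// /data  (override wins)
console.log(userDataRoot("linux", {}, "/home/alice"));
// "/home/alice/.local/share" on POSIX
```

Injecting the platform this way is only a testing convenience; the shipped `dirs.ts` reads `process.platform` directly, which is fine for a CLI that always runs on the host it was invoked on.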

View File

@ -20,7 +20,7 @@ Break down complex content into eye-catching infographic series for Xiaohongshu
/baoyu-xhs-images posts/ai-future/article.md --layout dense
# Combine style and layout
/baoyu-xhs-images posts/ai-future/article.md --style tech --layout list
/baoyu-xhs-images posts/ai-future/article.md --style notion --layout list
# Direct content input
/baoyu-xhs-images
@ -98,7 +98,7 @@ Each session creates an independent directory named by content slug:
xhs-images/{topic-slug}/
├── source-{slug}.{ext} # Source files (text, images, etc.)
├── analysis.md # Deep analysis results
├── outline-style-[slug].md # Variant A (e.g., outline-style-tech.md)
├── outline-style-[slug].md # Variant A (e.g., outline-style-notion.md)
├── outline-style-[slug].md # Variant B (e.g., outline-style-notion.md)
├── outline-style-[slug].md # Variant C (e.g., outline-style-minimal.md)
├── outline.md # Final selected
@ -162,7 +162,7 @@ Based on analysis, create three distinct style variants.
| Variant | Selection Logic | Example Filename |
|---------|-----------------|------------------|
| A | Primary recommendation | `outline-style-tech.md` |
| A | Primary recommendation | `outline-style-notion.md` |
| B | Alternative style | `outline-style-notion.md` |
| C | Different audience/mood | `outline-style-minimal.md` |
@ -188,7 +188,7 @@ Based on analysis, create three distinct style variants.
```
Question 1 (Style): Which style variant?
- A: tech + dense (Recommended) - 专业科技感,适合干货
- A: notion + dense (Recommended) - 知识卡片风格,适合干货
- B: notion + list - 清爽知识卡片
- C: minimal + balanced - 简约高端风格
- Custom: 自定义风格描述
@ -239,10 +239,10 @@ Location: [directory path]
Images: N total
✓ analysis.md
✓ outline-style-tech.md
✓ outline-style-notion.md
✓ outline-style-chalkboard.md
✓ outline-style-minimal.md
✓ outline.md (selected: tech + dense)
✓ outline.md (selected: notion + dense)
Files:
- 01-cover-[slug].png ✓ Cover (sparse)

View File

@ -25,9 +25,9 @@ Unlike other platforms, Xiaohongshu content must prioritize:
| Type | Characteristics | Best Style | Best Layout |
|------|----------------|------------|-------------|
| 种草/安利 | Product recommendation, benefits focus | cute/fresh | balanced/list |
| 干货分享 | Knowledge, tips, how-to | notion/tech | dense/list |
| 干货分享 | Knowledge, tips, how-to | notion | dense/list |
| 个人故事 | Personal experience, emotional | warm | balanced |
| 测评对比 | Review, comparison, pros/cons | tech/bold | comparison |
| 测评对比 | Review, comparison, pros/cons | bold/notion | comparison |
| 教程步骤 | Step-by-step guide | fresh/notion | flow/list |
| 避坑指南 | Warnings, mistakes to avoid | bold | list/comparison |
| 清单合集 | Collections, recommendations | cute/minimal | list/dense |
@ -55,10 +55,10 @@ Evaluate title/hook potential using these patterns:
| Audience | Interests | Preferred Style | Content Focus |
|----------|-----------|-----------------|---------------|
| 学生党 | 省钱、学习、校园 | cute/fresh | 平价、教程、学习方法 |
| 打工人 | 效率、职场、减压 | minimal/tech | 工具、技巧、摸鱼 |
| 打工人 | 效率、职场、减压 | minimal/notion | 工具、技巧、摸鱼 |
| 宝妈 | 育儿、家居、省心 | warm/fresh | 实用、安全、经验 |
| 精致女孩 | 美妆、穿搭、仪式感 | cute/retro | 好看、氛围、品质 |
| 技术宅 | 工具、效率、极客 | tech/notion | 深度、专业、新奇 |
| 技术宅 | 工具、效率、极客 | notion/chalkboard | 深度、专业、新奇 |
| 美食爱好者 | 探店、食谱、测评 | warm/pop | 好吃、简单、颜值 |
| 旅行达人 | 攻略、打卡、小众 | fresh/retro | 省钱、避坑、拍照 |
@ -165,7 +165,7 @@ recommended_image_count: 6
## Content Signals
- "AI工具" → tech + dense
- "AI工具" → notion + dense
- "效率" → notion + list
- "干货" → minimal + dense
@ -180,7 +180,7 @@ recommended_image_count: 6
## Recommended Approaches
1. **Tech + Dense** - 专业科技感,适合干货分享 (recommended)
1. **Notion + Dense** - 知识卡片风格,适合干货分享 (recommended)
2. **Notion + List** - 清爽知识卡片风格
3. **Minimal + Balanced** - 简约高端,适合职场人群
```

View File

@ -28,4 +28,4 @@ Comparisons, transformations, decision helpers, 对比图
## Best Style Pairings
bold (dramatic contrast), tech (data comparison), warm (before/after stories)
bold (dramatic contrast), notion (data comparison), warm (before/after stories)

View File

@ -28,4 +28,4 @@ Summary cards, cheat sheets, comprehensive guides, 干货总结
## Best Style Pairings
tech, notion, minimal (clean styles prevent visual overload)
notion, minimal, chalkboard (clean styles prevent visual overload)

View File

@ -27,4 +27,4 @@ Processes, timelines, cause-effect chains, workflows
## Best Style Pairings
tech (process diagrams), notion (simple flows), fresh (organic flows)
notion (process diagrams), chalkboard (educational flows), fresh (organic flows)

View File

@ -5,8 +5,8 @@ Template for generating infographic series outlines.
## File Naming
Outline files use style slug in the name:
- `outline-style-tech.md` - Tech style variant
- `outline-style-notion.md` - Notion style variant
- `outline-style-chalkboard.md` - Chalkboard style variant
- `outline-style-minimal.md` - Minimal style variant
- `outline.md` - Final selected (copied from chosen variant)
@ -43,7 +43,7 @@ NN-{type}-[slug].md (in prompts/)
# Xiaohongshu Infographic Series Outline
---
style: tech
style: notion
default_layout: dense
image_count: 6
generated: YYYY-MM-DD HH:mm
@ -223,6 +223,6 @@ Three variants should differ meaningfully:
| Audience | Primary target | Secondary target | Broader appeal |
**Example for "AI工具推荐"**:
- `outline-style-tech.md`: Tech + Dense - 专业极客
- `outline-style-notion.md`: Notion + Dense - 知识卡片
- `outline-style-notion.md`: Notion + List - 清爽知识卡片
- `outline-style-cute.md`: Cute + Balanced - 可爱易读风