chore: release v1.13.0
parent 235868343c
commit 97da7ab4eb
@@ -6,7 +6,7 @@
   },
   "metadata": {
     "description": "Skills shared by Baoyu for improving daily work efficiency",
-    "version": "1.12.0"
+    "version": "1.13.0"
   },
   "plugins": [
     {
@@ -42,7 +42,8 @@
     "strict": false,
     "skills": [
       "./skills/baoyu-danger-x-to-markdown",
-      "./skills/baoyu-compress-image"
+      "./skills/baoyu-compress-image",
+      "./skills/baoyu-url-to-markdown"
     ]
   }
 ]
@@ -6,6 +6,7 @@ yarn-debug.log*
 yarn-error.log*
 lerna-debug.log*

 # Diagnostic reports (https://nodejs.org/api/report.html)
 report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json
@@ -145,3 +146,4 @@ tests-data/
 .baoyu-skills/
 x-to-markdown/
 xhs-images/
+url-to-markdown/
CHANGELOG.md
@@ -2,6 +2,14 @@

 English | [中文](./CHANGELOG.zh.md)

+## 1.13.0 - 2026-01-21
+
+### Features
+
+- `baoyu-url-to-markdown`: new utility skill for fetching any URL via Chrome CDP and converting it to clean markdown. Supports two capture modes: auto (immediate capture on page load) and wait (user-controlled capture for login-required pages).
+
+### Improvements
+
+- `baoyu-xhs-images`: updates style recommendations, replacing `tech` references with `notion` and `chalkboard` for technical and educational content.
+
 ## 1.12.0 - 2026-01-21

 ### Refactor
CHANGELOG.zh.md
@@ -2,6 +2,14 @@

 [English](./CHANGELOG.md) | 中文

+## 1.13.0 - 2026-01-21
+
+### 新功能
+
+- `baoyu-url-to-markdown`:新增 URL 转 Markdown 工具技能,通过 Chrome CDP 抓取任意网页并转换为干净的 Markdown 格式。支持两种抓取模式:自动模式(页面加载后立即抓取)和等待模式(用户控制抓取时机,适用于需要登录的页面)。
+
+### 改进
+
+- `baoyu-xhs-images`:更新风格推荐,将 `tech` 风格引用替换为 `notion` 和 `chalkboard`,用于技术和教育内容。
+
 ## 1.12.0 - 2026-01-21

 ### 重构
README.md
@@ -55,7 +55,7 @@ Simply tell Claude Code:
 |--------|-------------|--------|
 | **content-skills** | Content generation and publishing | [xhs-images](#baoyu-xhs-images), [infographic](#baoyu-infographic), [cover-image](#baoyu-cover-image), [slide-deck](#baoyu-slide-deck), [comic](#baoyu-comic), [article-illustrator](#baoyu-article-illustrator), [post-to-x](#baoyu-post-to-x), [post-to-wechat](#baoyu-post-to-wechat) |
 | **ai-generation-skills** | AI-powered generation backends | [image-gen](#baoyu-image-gen), [danger-gemini-web](#baoyu-danger-gemini-web) |
-| **utility-skills** | Utility tools for content processing | [danger-x-to-markdown](#baoyu-danger-x-to-markdown), [compress-image](#baoyu-compress-image) |
+| **utility-skills** | Utility tools for content processing | [url-to-markdown](#baoyu-url-to-markdown), [danger-x-to-markdown](#baoyu-danger-x-to-markdown), [compress-image](#baoyu-compress-image) |

 ## Update Skills
@@ -586,6 +586,35 @@ Interacts with Gemini Web to generate text and images.

Utility tools for content processing.

#### baoyu-url-to-markdown

Fetch any URL via Chrome CDP and convert to clean markdown. Supports two capture modes for different scenarios.

```bash
# Auto mode (default) - capture when page loads
/baoyu-url-to-markdown https://example.com/article

# Wait mode - for login-required pages
/baoyu-url-to-markdown https://example.com/private --wait

# Save to specific file
/baoyu-url-to-markdown https://example.com/article -o output.md
```

**Capture Modes**:

| Mode | Description | Best For |
|------|-------------|----------|
| Auto (default) | Captures immediately after page load | Public pages, static content |
| Wait (`--wait`) | Waits for user signal before capture | Login-required, dynamic content |

**Options**:

| Option | Description |
|--------|-------------|
| `<url>` | URL to fetch |
| `-o <path>` | Output file path |
| `--wait` | Wait for user signal before capturing |
| `--timeout <ms>` | Page load timeout (default: 30000) |

#### baoyu-danger-x-to-markdown

Converts X (Twitter) content to markdown format. Supports tweet threads and X Articles.
README.zh.md
@@ -55,7 +55,7 @@ npx skills add jimliu/baoyu-skills
 |------|------|----------|
 | **content-skills** | 内容生成和发布 | [xhs-images](#baoyu-xhs-images), [infographic](#baoyu-infographic), [cover-image](#baoyu-cover-image), [slide-deck](#baoyu-slide-deck), [comic](#baoyu-comic), [article-illustrator](#baoyu-article-illustrator), [post-to-x](#baoyu-post-to-x), [post-to-wechat](#baoyu-post-to-wechat) |
 | **ai-generation-skills** | AI 生成后端 | [image-gen](#baoyu-image-gen), [danger-gemini-web](#baoyu-danger-gemini-web) |
-| **utility-skills** | 内容处理工具 | [danger-x-to-markdown](#baoyu-danger-x-to-markdown), [compress-image](#baoyu-compress-image) |
+| **utility-skills** | 内容处理工具 | [url-to-markdown](#baoyu-url-to-markdown), [danger-x-to-markdown](#baoyu-danger-x-to-markdown), [compress-image](#baoyu-compress-image) |

 ## 更新技能
@@ -586,6 +586,35 @@ AI 驱动的生成后端。

内容处理工具。

#### baoyu-url-to-markdown

通过 Chrome CDP 抓取任意 URL 并转换为干净的 Markdown。支持两种抓取模式,适应不同场景。

```bash
# 自动模式(默认)- 页面加载后立即抓取
/baoyu-url-to-markdown https://example.com/article

# 等待模式 - 适用于需要登录的页面
/baoyu-url-to-markdown https://example.com/private --wait

# 保存到指定文件
/baoyu-url-to-markdown https://example.com/article -o output.md
```

**抓取模式**:

| 模式 | 说明 | 适用场景 |
|------|------|----------|
| 自动(默认) | 页面加载后立即抓取 | 公开页面、静态内容 |
| 等待(`--wait`) | 等待用户信号后抓取 | 需登录页面、动态内容 |

**选项**:

| 选项 | 说明 |
|------|------|
| `<url>` | 要抓取的 URL |
| `-o <path>` | 输出文件路径 |
| `--wait` | 等待用户信号后抓取 |
| `--timeout <ms>` | 页面加载超时(默认:30000) |

#### baoyu-danger-x-to-markdown

将 X (Twitter) 内容转换为 markdown 格式。支持推文串和 X 文章。
skills/baoyu-url-to-markdown/SKILL.md
@@ -0,0 +1,169 @@
---
name: baoyu-url-to-markdown
description: Fetch any URL and convert to markdown using Chrome CDP. Supports two modes - auto-capture on page load, or wait for user signal (for pages requiring login). Use when user wants to save a webpage as markdown.
---

# URL to Markdown

Fetches any URL via Chrome CDP and converts HTML to clean markdown.

## Script Directory

**Important**: All scripts are located in the `scripts/` subdirectory of this skill.

**Agent Execution Instructions**:
1. Determine this SKILL.md file's directory path as `SKILL_DIR`
2. Script path = `${SKILL_DIR}/scripts/<script-name>.ts`
3. Replace all `${SKILL_DIR}` in this document with the actual path

**Script Reference**:

| Script | Purpose |
|--------|---------|
| `scripts/main.ts` | CLI entry point for URL fetching |

## Features

- Chrome CDP for full JavaScript rendering
- Two capture modes: auto or wait-for-user
- Clean markdown output with metadata
- Handles login-required pages via wait mode

## Usage

```bash
# Auto mode (default) - capture when page loads
npx -y bun ${SKILL_DIR}/scripts/main.ts <url>

# Wait mode - wait for user signal before capture
npx -y bun ${SKILL_DIR}/scripts/main.ts <url> --wait

# Save to specific file
npx -y bun ${SKILL_DIR}/scripts/main.ts <url> -o output.md
```

## Options

| Option | Description |
|--------|-------------|
| `<url>` | URL to fetch |
| `-o <path>` | Output file path (default: auto-generated) |
| `--wait` | Wait for user signal before capturing |
| `--timeout <ms>` | Page load timeout (default: 30000) |

## Capture Modes

### Auto Mode (default)

Page loads → waits for network idle → captures immediately.

Best for:
- Public pages
- Static content
- No login required

### Wait Mode (`--wait`)

Page opens → user can interact (login, scroll, etc.) → user signals ready → captures.

Best for:
- Login-required pages
- Dynamic content needing interaction
- Pages with lazy loading

**Agent workflow for wait mode**:
1. Run script with `--wait` flag
2. Script outputs: `Page opened. Press Enter when ready to capture...`
3. Use `AskUserQuestion` to ask user if page is ready
4. When user confirms, send newline to stdin to trigger capture
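The stdin handshake in step 4 can be sketched as follows. This is a minimal illustration with assumed names (`waitForSignal` and the `PassThrough` stand-in for `process.stdin` are this sketch's inventions, not the skill's actual implementation):

```typescript
import { once } from "node:events";
import { PassThrough } from "node:stream";
import type { Readable } from "node:stream";

// Minimal sketch of the wait-mode handshake: the script blocks until a
// newline (or any data) arrives on its input stream. The real script
// would read process.stdin; a PassThrough stream stands in for it here
// so the flow can be exercised without a terminal.
async function waitForSignal(input: Readable): Promise<void> {
  console.log("Page opened. Press Enter when ready to capture...");
  await once(input, "data"); // resolves on the first chunk, e.g. "\n"
}

// Demo: the agent-side confirmation is simulated by writing "\n".
const fakeStdin = new PassThrough();
const ready = waitForSignal(fakeStdin).then(() => "captured");
fakeStdin.write("\n");
ready.then((r) => console.log(r)); // → captured
```

The capture itself would only run after this promise resolves, which is what lets the user log in or scroll first.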

## Output Format

```markdown
---
url: https://example.com/page
title: "Page Title"
description: "Meta description if available"
author: "Author if available"
published: "2024-01-01"
captured_at: "2024-01-15T10:30:00Z"
---

# Page Title

Converted markdown content...
```

## Mode Selection Guide

When user requests URL capture, help select appropriate mode:

**Suggest Auto Mode when**:
- URL is public (no login wall visible)
- Content appears static
- User doesn't mention login requirements

**Suggest Wait Mode when**:
- User mentions needing to log in
- Site known to require authentication
- User wants to scroll/interact before capture
- Content is behind paywall

**Ask user when unclear**:
```
The page may require login or interaction before capturing.

Which mode should I use?
1. Auto - Capture immediately when loaded
2. Wait - Wait for you to interact first
```

## Output Directory

Each capture creates a file organized by domain:

```
url-to-markdown/
└── <domain>/
    └── <slug>.md
```

**Path Components**:
- `<domain>`: Site domain (e.g., `example.com`, `github.com`)
- `<slug>`: Generated from page title or URL path (kebab-case)

**Slug Generation**:
1. Extract from page title (preferred) or URL path
2. Convert to kebab-case, 2-6 words
3. Example: "Getting Started with React" → `getting-started-with-react`

**Conflict Resolution**:
If `url-to-markdown/<domain>/<slug>.md` already exists:
- Append timestamp: `<slug>-YYYYMMDD-HHMMSS.md`
- Example: `getting-started.md` exists → `getting-started-20260118-143052.md`
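The slug and conflict rules above can be sketched like this. The helper names are illustrative only, not the skill's actual code:

```typescript
// Illustrative sketch of the slug rules above: kebab-case, capped at
// 6 words, with a timestamp suffix on filename conflicts.
function toSlug(title: string, maxWords = 6): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, "") // drop punctuation
    .split(/\s+/)
    .filter(Boolean)
    .slice(0, maxWords)
    .join("-");
}

// Conflict resolution: append YYYYMMDD-HHMMSS when the target exists.
function resolveFileName(slug: string, exists: (name: string) => boolean, now = new Date()): string {
  if (!exists(`${slug}.md`)) return `${slug}.md`;
  const pad = (n: number) => String(n).padStart(2, "0");
  const ts =
    `${now.getFullYear()}${pad(now.getMonth() + 1)}${pad(now.getDate())}` +
    `-${pad(now.getHours())}${pad(now.getMinutes())}${pad(now.getSeconds())}`;
  return `${slug}-${ts}.md`;
}

console.log(toSlug("Getting Started with React")); // → getting-started-with-react
```

The `exists` callback keeps the sketch pure; the real implementation would check the `url-to-markdown/<domain>/` directory on disk.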
|
||||
|
||||
## Error Handling
|
||||
|
||||
| Error | Resolution |
|
||||
|-------|------------|
|
||||
| Chrome not found | Install Chrome or set `URL_CHROME_PATH` env |
|
||||
| Page timeout | Increase `--timeout` value |
|
||||
| Capture failed | Try wait mode for complex pages |
|
||||
| Empty content | Page may need JS rendering time |
|
||||
|
||||
## Environment Variables
|
||||
|
||||
| Variable | Description |
|
||||
|----------|-------------|
|
||||
| `URL_CHROME_PATH` | Custom Chrome executable path |
|
||||
| `URL_DATA_DIR` | Custom data directory |
|
||||
| `URL_CHROME_PROFILE_DIR` | Custom Chrome profile directory |
|
||||
|
||||
## Extension Support
|
||||
|
||||
Custom configurations via EXTEND.md.
|
||||
|
||||
**Check paths** (priority order):
|
||||
1. `.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` (project)
|
||||
2. `~/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` (user)
|
||||
|
||||
If found, load before workflow. Extension content overrides defaults.
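The lookup order above can be sketched as a simple first-match scan (the function name is an assumption for this example, not the skill's actual API):

```typescript
import fs from "node:fs";
import os from "node:os";
import path from "node:path";

// Sketch of the EXTEND.md lookup order: the project-level file wins
// over the user-level one; returns null when neither exists.
function findExtendFile(projectDir: string): string | null {
  const candidates = [
    path.join(projectDir, ".baoyu-skills", "baoyu-url-to-markdown", "EXTEND.md"),
    path.join(os.homedir(), ".baoyu-skills", "baoyu-url-to-markdown", "EXTEND.md"),
  ];
  for (const candidate of candidates) {
    if (fs.existsSync(candidate)) return candidate;
  }
  return null;
}
```

Checking the project path first is what allows a repository to override a user-wide default.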
|
||||
|
|
@ -0,0 +1,290 @@
|
|||
import { spawn, type ChildProcess } from "node:child_process";
|
||||
import fs from "node:fs";
|
||||
import { mkdir } from "node:fs/promises";
|
||||
import net from "node:net";
|
||||
import process from "node:process";
|
||||
|
||||
import { resolveUrlToMarkdownChromeProfileDir } from "./paths.js";
|
||||
import { CDP_CONNECT_TIMEOUT_MS, NETWORK_IDLE_TIMEOUT_MS } from "./constants.js";
|
||||
|
||||
type CdpSendOptions = { sessionId?: string; timeoutMs?: number };
|
||||
|
||||
function sleep(ms: number): Promise<void> {
|
||||
return new Promise((resolve) => setTimeout(resolve, ms));
|
||||
}
|
||||
|
||||
async function fetchWithTimeout(url: string, init: RequestInit & { timeoutMs?: number } = {}): Promise<Response> {
|
||||
const { timeoutMs, ...rest } = init;
|
||||
if (!timeoutMs || timeoutMs <= 0) return fetch(url, rest);
|
||||
const ctl = new AbortController();
|
||||
const t = setTimeout(() => ctl.abort(), timeoutMs);
|
||||
try {
|
||||
return await fetch(url, { ...rest, signal: ctl.signal });
|
||||
} finally {
|
||||
clearTimeout(t);
|
||||
}
|
||||
}
|
||||
|
||||
export class CdpConnection {
|
||||
private ws: WebSocket;
|
||||
private nextId = 0;
|
||||
private pending = new Map<number, { resolve: (v: unknown) => void; reject: (e: Error) => void; timer: ReturnType<typeof setTimeout> | null }>();
|
||||
private eventHandlers = new Map<string, Set<(params: unknown) => void>>();
|
||||
|
||||
private constructor(ws: WebSocket) {
|
||||
this.ws = ws;
|
||||
this.ws.addEventListener("message", (event) => {
|
||||
try {
|
||||
const data = typeof event.data === "string" ? event.data : new TextDecoder().decode(event.data as ArrayBuffer);
|
||||
const msg = JSON.parse(data) as { id?: number; method?: string; params?: unknown; result?: unknown; error?: { message?: string } };
|
||||
if (msg.id) {
|
||||
const p = this.pending.get(msg.id);
|
||||
if (p) {
|
||||
this.pending.delete(msg.id);
|
||||
if (p.timer) clearTimeout(p.timer);
|
||||
if (msg.error?.message) p.reject(new Error(msg.error.message));
|
||||
else p.resolve(msg.result);
|
||||
}
|
||||
} else if (msg.method) {
|
||||
const handlers = this.eventHandlers.get(msg.method);
|
||||
if (handlers) {
|
||||
for (const h of handlers) h(msg.params);
|
||||
}
|
||||
}
|
||||
} catch {}
|
||||
});
|
||||
this.ws.addEventListener("close", () => {
|
||||
for (const [id, p] of this.pending.entries()) {
|
||||
this.pending.delete(id);
|
||||
if (p.timer) clearTimeout(p.timer);
|
||||
p.reject(new Error("CDP connection closed."));
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
static async connect(url: string, timeoutMs: number): Promise<CdpConnection> {
|
||||
const ws = new WebSocket(url);
|
||||
await new Promise<void>((resolve, reject) => {
|
||||
const t = setTimeout(() => reject(new Error("CDP connection timeout.")), timeoutMs);
|
||||
ws.addEventListener("open", () => { clearTimeout(t); resolve(); });
|
||||
ws.addEventListener("error", () => { clearTimeout(t); reject(new Error("CDP connection failed.")); });
|
||||
});
|
||||
return new CdpConnection(ws);
|
||||
}
|
||||
|
||||
on(event: string, handler: (params: unknown) => void): void {
|
||||
let handlers = this.eventHandlers.get(event);
|
||||
if (!handlers) {
|
||||
handlers = new Set();
|
||||
this.eventHandlers.set(event, handlers);
|
||||
}
|
||||
handlers.add(handler);
|
||||
}
|
||||
|
||||
off(event: string, handler: (params: unknown) => void): void {
|
||||
this.eventHandlers.get(event)?.delete(handler);
|
||||
}
|
||||
|
||||
async send<T = unknown>(method: string, params?: Record<string, unknown>, opts?: CdpSendOptions): Promise<T> {
|
||||
const id = ++this.nextId;
|
||||
const msg: Record<string, unknown> = { id, method };
|
||||
if (params) msg.params = params;
|
||||
if (opts?.sessionId) msg.sessionId = opts.sessionId;
|
||||
const timeoutMs = opts?.timeoutMs ?? 15_000;
|
||||
const out = await new Promise<unknown>((resolve, reject) => {
|
||||
const t = timeoutMs > 0 ? setTimeout(() => { this.pending.delete(id); reject(new Error(`CDP timeout: ${method}`)); }, timeoutMs) : null;
|
||||
this.pending.set(id, { resolve, reject, timer: t });
|
||||
this.ws.send(JSON.stringify(msg));
|
||||
});
|
||||
return out as T;
|
||||
}
|
||||
|
||||
close(): void {
|
||||
try { this.ws.close(); } catch {}
|
||||
}
|
||||
}
|
||||
|
||||
export async function getFreePort(): Promise<number> {
|
||||
return await new Promise((resolve, reject) => {
|
||||
const srv = net.createServer();
|
||||
srv.unref();
|
||||
srv.on("error", reject);
|
||||
srv.listen(0, "127.0.0.1", () => {
|
||||
const addr = srv.address();
|
||||
if (!addr || typeof addr === "string") {
|
||||
srv.close(() => reject(new Error("Unable to allocate a free TCP port.")));
|
||||
return;
|
||||
}
|
||||
const port = addr.port;
|
||||
srv.close((err) => (err ? reject(err) : resolve(port)));
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
export function findChromeExecutable(): string | null {
|
||||
const override = process.env.URL_CHROME_PATH?.trim();
|
||||
if (override && fs.existsSync(override)) return override;
|
||||
|
||||
const candidates: string[] = [];
|
||||
switch (process.platform) {
|
||||
case "darwin":
|
||||
candidates.push(
|
||||
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
|
||||
"/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary",
|
||||
"/Applications/Google Chrome Beta.app/Contents/MacOS/Google Chrome Beta",
|
||||
"/Applications/Chromium.app/Contents/MacOS/Chromium",
|
||||
"/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge"
|
||||
);
|
||||
break;
|
||||
case "win32":
|
||||
candidates.push(
|
||||
"C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
|
||||
"C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe",
|
||||
"C:\\Program Files\\Microsoft\\Edge\\Application\\msedge.exe",
|
||||
"C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe"
|
||||
);
|
||||
break;
|
||||
default:
|
||||
candidates.push(
|
||||
"/usr/bin/google-chrome",
|
||||
"/usr/bin/google-chrome-stable",
|
||||
"/usr/bin/chromium",
|
||||
"/usr/bin/chromium-browser",
|
||||
"/snap/bin/chromium",
|
||||
"/usr/bin/microsoft-edge"
|
||||
);
|
||||
break;
|
||||
}
|
||||
|
||||
for (const p of candidates) {
|
||||
if (fs.existsSync(p)) return p;
|
||||
}
|
||||
return null;
|
||||
}
|
||||
|
||||
export async function waitForChromeDebugPort(port: number, timeoutMs: number): Promise<string> {
|
||||
const start = Date.now();
|
||||
while (Date.now() - start < timeoutMs) {
|
||||
try {
|
||||
const res = await fetchWithTimeout(`http://127.0.0.1:${port}/json/version`, { timeoutMs: 5_000 });
|
||||
if (!res.ok) throw new Error(`status=${res.status}`);
|
||||
const j = (await res.json()) as { webSocketDebuggerUrl?: string };
|
||||
if (j.webSocketDebuggerUrl) return j.webSocketDebuggerUrl;
|
||||
} catch {}
|
||||
await sleep(200);
|
||||
}
|
||||
throw new Error("Chrome debug port not ready");
|
||||
}
|
||||
|
||||
export async function launchChrome(url: string, port: number, headless: boolean = false): Promise<ChildProcess> {
|
||||
const chrome = findChromeExecutable();
|
||||
if (!chrome) throw new Error("Chrome executable not found. Install Chrome or set URL_CHROME_PATH env.");
|
||||
const profileDir = resolveUrlToMarkdownChromeProfileDir();
|
||||
await mkdir(profileDir, { recursive: true });
|
||||
|
||||
const args = [
|
||||
`--remote-debugging-port=${port}`,
|
||||
`--user-data-dir=${profileDir}`,
|
||||
"--no-first-run",
|
||||
"--no-default-browser-check",
|
||||
"--disable-popup-blocking",
|
||||
];
|
||||
if (headless) args.push("--headless=new");
|
||||
args.push(url);
|
||||
|
||||
return spawn(chrome, args, { stdio: "ignore" });
|
||||
}
|
||||
|
||||
export async function waitForNetworkIdle(cdp: CdpConnection, sessionId: string, timeoutMs: number = NETWORK_IDLE_TIMEOUT_MS): Promise<void> {
|
||||
return new Promise((resolve) => {
|
||||
let timer: ReturnType<typeof setTimeout> | null = null;
|
||||
let pending = 0;
|
||||
const cleanup = () => {
|
||||
if (timer) clearTimeout(timer);
|
||||
cdp.off("Network.requestWillBeSent", onRequest);
|
||||
cdp.off("Network.loadingFinished", onFinish);
|
||||
cdp.off("Network.loadingFailed", onFinish);
|
||||
};
|
||||
const done = () => { cleanup(); resolve(); };
|
||||
const resetTimer = () => {
|
||||
if (timer) clearTimeout(timer);
|
||||
timer = setTimeout(done, timeoutMs);
|
||||
};
|
||||
const onRequest = () => { pending++; resetTimer(); };
|
||||
const onFinish = () => { pending = Math.max(0, pending - 1); if (pending <= 2) resetTimer(); };
|
||||
cdp.on("Network.requestWillBeSent", onRequest);
|
||||
cdp.on("Network.loadingFinished", onFinish);
|
||||
cdp.on("Network.loadingFailed", onFinish);
|
||||
resetTimer();
|
||||
});
|
||||
}
|
||||
|
||||
export async function waitForPageLoad(cdp: CdpConnection, sessionId: string, timeoutMs: number = 30_000): Promise<void> {
|
||||
return new Promise((resolve, reject) => {
|
||||
const timer = setTimeout(() => {
|
||||
cdp.off("Page.loadEventFired", handler);
|
||||
resolve();
|
||||
}, timeoutMs);
|
||||
const handler = () => {
|
||||
clearTimeout(timer);
|
||||
cdp.off("Page.loadEventFired", handler);
|
||||
resolve();
|
||||
};
|
||||
cdp.on("Page.loadEventFired", handler);
|
||||
});
|
||||
}
|
||||
|
||||
export async function createTargetAndAttach(cdp: CdpConnection, url: string): Promise<{ targetId: string; sessionId: string }> {
|
||||
const { targetId } = await cdp.send<{ targetId: string }>("Target.createTarget", { url });
|
||||
const { sessionId } = await cdp.send<{ sessionId: string }>("Target.attachToTarget", { targetId, flatten: true });
|
||||
await cdp.send("Network.enable", {}, { sessionId });
|
||||
await cdp.send("Page.enable", {}, { sessionId });
|
||||
return { targetId, sessionId };
|
||||
}
|
||||
|
||||
export async function navigateAndWait(cdp: CdpConnection, sessionId: string, url: string, timeoutMs: number): Promise<void> {
|
||||
const loadPromise = new Promise<void>((resolve, reject) => {
|
||||
const timer = setTimeout(() => reject(new Error("Page load timeout")), timeoutMs);
|
||||
const handler = (params: unknown) => {
|
||||
const p = params as { name?: string };
|
||||
if (p.name === "load" || p.name === "DOMContentLoaded") {
|
||||
clearTimeout(timer);
|
||||
cdp.off("Page.lifecycleEvent", handler);
|
||||
resolve();
|
||||
}
|
||||
};
|
||||
cdp.on("Page.lifecycleEvent", handler);
|
||||
});
|
||||
await cdp.send("Page.navigate", { url }, { sessionId });
|
||||
await loadPromise;
|
||||
}
|
||||
|
||||
export async function evaluateScript<T>(cdp: CdpConnection, sessionId: string, expression: string, timeoutMs: number = 30_000): Promise<T> {
|
||||
const result = await cdp.send<{ result: { value?: T; type?: string; description?: string } }>(
|
||||
"Runtime.evaluate",
|
||||
{ expression, returnByValue: true, awaitPromise: true },
|
||||
{ sessionId, timeoutMs }
|
||||
);
|
||||
return result.result.value as T;
|
||||
}
|
||||
|
||||
export async function autoScroll(cdp: CdpConnection, sessionId: string, steps: number = 8, waitMs: number = 600): Promise<void> {
|
||||
let lastHeight = await evaluateScript<number>(cdp, sessionId, "document.body.scrollHeight");
|
||||
for (let i = 0; i < steps; i++) {
|
||||
await evaluateScript<void>(cdp, sessionId, "window.scrollTo(0, document.body.scrollHeight)");
|
||||
await sleep(waitMs);
|
||||
const newHeight = await evaluateScript<number>(cdp, sessionId, "document.body.scrollHeight");
|
||||
if (newHeight === lastHeight) break;
|
||||
lastHeight = newHeight;
|
||||
}
|
||||
await evaluateScript<void>(cdp, sessionId, "window.scrollTo(0, 0)");
|
||||
}
|
||||
|
||||
export function killChrome(chrome: ChildProcess): void {
|
||||
try { chrome.kill("SIGTERM"); } catch {}
|
||||
setTimeout(() => {
|
||||
if (!chrome.killed) {
|
||||
try { chrome.kill("SIGKILL"); } catch {}
|
||||
}
|
||||
}, 2_000).unref?.();
|
||||
}
|
||||
|
|
@@ -0,0 +1,13 @@
import { resolveUrlToMarkdownChromeProfileDir } from "./paths.js";

export const DEFAULT_USER_AGENT =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36";

export const USER_DATA_DIR = resolveUrlToMarkdownChromeProfileDir();

export const DEFAULT_TIMEOUT_MS = 30_000;
export const CDP_CONNECT_TIMEOUT_MS = 15_000;
export const NETWORK_IDLE_TIMEOUT_MS = 1_500;
export const POST_LOAD_DELAY_MS = 800;
export const SCROLL_STEP_WAIT_MS = 600;
export const SCROLL_MAX_STEPS = 8;
@@ -0,0 +1,223 @@
export interface PageMetadata {
  url: string;
  title: string;
  description?: string;
  author?: string;
  published?: string;
  captured_at: string;
}

export interface ConversionResult {
  metadata: PageMetadata;
  markdown: string;
}

export const cleanupAndExtractScript = `
(function() {
  const removeSelectors = [
    'script', 'style', 'noscript', 'iframe', 'svg', 'canvas',
    'header nav', 'footer', '.sidebar', '.nav', '.navigation',
    '.advertisement', '.ad', '.ads', '.cookie-banner', '.popup',
    '[role="banner"]', '[role="navigation"]', '[role="complementary"]'
  ];

  for (const sel of removeSelectors) {
    try {
      document.querySelectorAll(sel).forEach(el => el.remove());
    } catch {}
  }

  document.querySelectorAll('*').forEach(el => {
    el.removeAttribute('style');
    el.removeAttribute('onclick');
    el.removeAttribute('onload');
    el.removeAttribute('onerror');
  });

  const baseUrl = document.baseURI || location.href;
  document.querySelectorAll('a[href]').forEach(a => {
    try {
      const href = a.getAttribute('href');
      if (href && !href.startsWith('#') && !href.startsWith('javascript:')) {
        a.setAttribute('href', new URL(href, baseUrl).href);
      }
    } catch {}
  });
  document.querySelectorAll('img[src]').forEach(img => {
    try {
      const src = img.getAttribute('src');
      if (src) img.setAttribute('src', new URL(src, baseUrl).href);
    } catch {}
  });

  const getMeta = (names) => {
    for (const name of names) {
      const el = document.querySelector(\`meta[name="\${name}"], meta[property="\${name}"]\`);
      if (el) {
        const content = el.getAttribute('content');
        if (content) return content.trim();
      }
    }
    return undefined;
  };

  const getTitle = () => {
    const ogTitle = getMeta(['og:title', 'twitter:title']);
    if (ogTitle) return ogTitle;
    const h1 = document.querySelector('h1');
    if (h1) return h1.textContent?.trim();
    return document.title?.trim() || '';
  };

  const getPublished = () => {
    const timeEl = document.querySelector('time[datetime]');
    if (timeEl) return timeEl.getAttribute('datetime');
    return getMeta(['article:published_time', 'datePublished', 'date']);
  };

  const main = document.querySelector('main, article, [role="main"], .main-content, .post-content, .article-content, .content');
  const html = main ? main.innerHTML : document.body.innerHTML;

  return {
    title: getTitle(),
    description: getMeta(['description', 'og:description', 'twitter:description']),
    author: getMeta(['author', 'article:author', 'twitter:creator']),
    published: getPublished(),
    html: html
  };
})()
`;

function decodeHtmlEntities(text: string): string {
  return text
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&apos;/g, "'")
    .replace(/&#(\d+);/g, (_, n) => String.fromCharCode(parseInt(n, 10)))
    .replace(/&#x([0-9a-fA-F]+);/g, (_, n) => String.fromCharCode(parseInt(n, 16)))
    // Decode &amp; last so double-escaped entities (e.g. &amp;lt;) are not double-decoded.
    .replace(/&amp;/g, '&');
}

function stripTags(html: string): string {
  return html.replace(/<[^>]+>/g, '');
}

function normalizeWhitespace(text: string): string {
  return text.replace(/[ \t]+/g, ' ').replace(/\n{3,}/g, '\n\n').trim();
}

export function htmlToMarkdown(html: string): string {
  let md = html;

  md = md.replace(/<br\s*\/?>/gi, '\n');
  md = md.replace(/<hr\s*\/?>/gi, '\n\n---\n\n');

  md = md.replace(/<h1[^>]*>([\s\S]*?)<\/h1>/gi, (_, c) => `\n\n# ${stripTags(c).trim()}\n\n`);
  md = md.replace(/<h2[^>]*>([\s\S]*?)<\/h2>/gi, (_, c) => `\n\n## ${stripTags(c).trim()}\n\n`);
  md = md.replace(/<h3[^>]*>([\s\S]*?)<\/h3>/gi, (_, c) => `\n\n### ${stripTags(c).trim()}\n\n`);
  md = md.replace(/<h4[^>]*>([\s\S]*?)<\/h4>/gi, (_, c) => `\n\n#### ${stripTags(c).trim()}\n\n`);
  md = md.replace(/<h5[^>]*>([\s\S]*?)<\/h5>/gi, (_, c) => `\n\n##### ${stripTags(c).trim()}\n\n`);
  md = md.replace(/<h6[^>]*>([\s\S]*?)<\/h6>/gi, (_, c) => `\n\n###### ${stripTags(c).trim()}\n\n`);

  md = md.replace(/<strong[^>]*>([\s\S]*?)<\/strong>/gi, (_, c) => `**${stripTags(c).trim()}**`);
  md = md.replace(/<b[^>]*>([\s\S]*?)<\/b>/gi, (_, c) => `**${stripTags(c).trim()}**`);
  md = md.replace(/<em[^>]*>([\s\S]*?)<\/em>/gi, (_, c) => `*${stripTags(c).trim()}*`);
  md = md.replace(/<i[^>]*>([\s\S]*?)<\/i>/gi, (_, c) => `*${stripTags(c).trim()}*`);
  md = md.replace(/<del[^>]*>([\s\S]*?)<\/del>/gi, (_, c) => `~~${stripTags(c).trim()}~~`);
  md = md.replace(/<s[^>]*>([\s\S]*?)<\/s>/gi, (_, c) => `~~${stripTags(c).trim()}~~`);
  md = md.replace(/<mark[^>]*>([\s\S]*?)<\/mark>/gi, (_, c) => `==${stripTags(c).trim()}==`);

  md = md.replace(/<a[^>]*href=["']([^"']+)["'][^>]*>([\s\S]*?)<\/a>/gi, (_, href, text) => {
    const t = stripTags(text).trim();
    if (!t || href.startsWith('javascript:')) return t;
    return `[${t}](${href})`;
  });

  md = md.replace(/<img[^>]*src=["']([^"']+)["'][^>]*alt=["']([^"']*)["'][^>]*\/?>/gi, (_, src, alt) => `![${alt}](${src})`);
  md = md.replace(/<img[^>]*alt=["']([^"']*)["'][^>]*src=["']([^"']+)["'][^>]*\/?>/gi, (_, alt, src) => `![${alt}](${src})`);
  md = md.replace(/<img[^>]*src=["']([^"']+)["'][^>]*\/?>/gi, (_, src) => `![](${src})`);

  md = md.replace(/<pre[^>]*><code[^>]*>([\s\S]*?)<\/code><\/pre>/gi, (_, code) => `\n\n\`\`\`\n${decodeHtmlEntities(stripTags(code)).trim()}\n\`\`\`\n\n`);
  md = md.replace(/<pre[^>]*>([\s\S]*?)<\/pre>/gi, (_, code) => `\n\n\`\`\`\n${decodeHtmlEntities(stripTags(code)).trim()}\n\`\`\`\n\n`);
  md = md.replace(/<code[^>]*>([\s\S]*?)<\/code>/gi, (_, code) => `\`${decodeHtmlEntities(stripTags(code)).trim()}\``);

  md = md.replace(/<blockquote[^>]*>([\s\S]*?)<\/blockquote>/gi, (_, content) => {
    const lines = stripTags(content).trim().split('\n');
    return '\n\n' + lines.map(l => `> ${l.trim()}`).join('\n') + '\n\n';
  });

  md = md.replace(/<ul[^>]*>([\s\S]*?)<\/ul>/gi, (_, items) => {
    const lis = items.match(/<li[^>]*>([\s\S]*?)<\/li>/gi) || [];
    const lines = lis.map(li => {
      const content = li.replace(/<li[^>]*>([\s\S]*?)<\/li>/i, '$1');
      return `- ${stripTags(content).trim()}`;
    });
    return '\n\n' + lines.join('\n') + '\n\n';
  });

  md = md.replace(/<ol[^>]*>([\s\S]*?)<\/ol>/gi, (_, items) => {
    const lis = items.match(/<li[^>]*>([\s\S]*?)<\/li>/gi) || [];
    const lines = lis.map((li, i) => {
      const content = li.replace(/<li[^>]*>([\s\S]*?)<\/li>/i, '$1');
      return `${i + 1}. ${stripTags(content).trim()}`;
    });
    return '\n\n' + lines.join('\n') + '\n\n';
  });

  md = md.replace(/<table[^>]*>([\s\S]*?)<\/table>/gi, (_, table) => {
    const rows: string[][] = [];
    const trMatches = table.match(/<tr[^>]*>([\s\S]*?)<\/tr>/gi) || [];
    for (const tr of trMatches) {
      const cells: string[] = [];
      const cellMatches = tr.match(/<t[hd][^>]*>([\s\S]*?)<\/t[hd]>/gi) || [];
      for (const cell of cellMatches) {
        const content = cell.replace(/<t[hd][^>]*>([\s\S]*?)<\/t[hd]>/i, '$1');
        cells.push(stripTags(content).trim().replace(/\|/g, '\\|'));
      }
      if (cells.length > 0) rows.push(cells);
    }
    if (rows.length === 0) return '';
    const colCount = Math.max(...rows.map(r => r.length));
    const normalizedRows = rows.map(r => {
      while (r.length < colCount) r.push('');
      return r;
    });
    const header = `| ${normalizedRows[0].join(' | ')} |`;
    const sep = `| ${normalizedRows[0].map(() => '---').join(' | ')} |`;
    const body = normalizedRows.slice(1).map(r => `| ${r.join(' | ')} |`).join('\n');
    return '\n\n' + header + '\n' + sep + (body ? '\n' + body : '') + '\n\n';
  });

  md = md.replace(/<p[^>]*>([\s\S]*?)<\/p>/gi, (_, c) => `\n\n${stripTags(c).trim()}\n\n`);
  md = md.replace(/<div[^>]*>([\s\S]*?)<\/div>/gi, (_, c) => `\n${stripTags(c).trim()}\n`);
  md = md.replace(/<span[^>]*>([\s\S]*?)<\/span>/gi, (_, c) => stripTags(c));

  md = stripTags(md);
  md = decodeHtmlEntities(md);
  md = normalizeWhitespace(md);

  return md;
}

export function formatMetadataYaml(meta: PageMetadata): string {
  const lines = ['---'];
  lines.push(`url: ${meta.url}`);
  lines.push(`title: "${meta.title.replace(/"/g, '\\"')}"`);
  if (meta.description) lines.push(`description: "${meta.description.replace(/"/g, '\\"')}"`);
  if (meta.author) lines.push(`author: "${meta.author.replace(/"/g, '\\"')}"`);
|
||||
if (meta.published) lines.push(`published: "${meta.published}"`);
|
||||
lines.push(`captured_at: "${meta.captured_at}"`);
|
||||
lines.push('---');
|
||||
return lines.join('\n');
|
||||
}
|
||||
|
||||
export function createMarkdownDocument(result: ConversionResult): string {
|
||||
const yaml = formatMetadataYaml(result.metadata);
|
||||
const titleRegex = new RegExp(`^#\\s+${result.metadata.title.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\s*\\n`, 'i');
|
||||
const hasTitle = titleRegex.test(result.markdown);
|
||||
const title = result.metadata.title && !hasTitle ? `\n\n# ${result.metadata.title}\n\n` : '\n\n';
|
||||
return yaml + title + result.markdown;
|
||||
}
|
||||
|
|
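The front-matter assembly above can be exercised on its own. The sketch below re-declares a trimmed copy of `PageMetadata` and the `formatMetadataYaml` logic so it runs standalone; the real type and export live in `html-to-markdown.ts`, so this is an illustration, not the module itself:

```typescript
// Illustrative, self-contained copy of the front-matter logic in this diff.
interface PageMetadata {
  url: string;
  title: string;
  description?: string;
  author?: string;
  published?: string;
  captured_at: string;
}

function formatMetadataYaml(meta: PageMetadata): string {
  const lines = ['---'];
  lines.push(`url: ${meta.url}`);
  // Double quotes inside the title are backslash-escaped so the YAML stays valid.
  lines.push(`title: "${meta.title.replace(/"/g, '\\"')}"`);
  if (meta.description) lines.push(`description: "${meta.description.replace(/"/g, '\\"')}"`);
  if (meta.author) lines.push(`author: "${meta.author.replace(/"/g, '\\"')}"`);
  if (meta.published) lines.push(`published: "${meta.published}"`);
  lines.push(`captured_at: "${meta.captured_at}"`);
  lines.push('---');
  return lines.join('\n');
}

const yaml = formatMetadataYaml({
  url: "https://example.com/post",
  title: 'Hello "World"',
  captured_at: "2026-01-21T00:00:00.000Z",
});
console.log(yaml);
```

Optional fields are simply omitted from the block when absent, which keeps the front matter minimal for pages with sparse metadata.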
@@ -0,0 +1,176 @@
import { createInterface } from "node:readline";
import { writeFile, mkdir, access } from "node:fs/promises";
import path from "node:path";
import process from "node:process";

import { CdpConnection, getFreePort, launchChrome, waitForChromeDebugPort, waitForNetworkIdle, waitForPageLoad, autoScroll, evaluateScript, killChrome } from "./cdp.js";
import { cleanupAndExtractScript, htmlToMarkdown, createMarkdownDocument, type PageMetadata, type ConversionResult } from "./html-to-markdown.js";
import { resolveUrlToMarkdownDataDir } from "./paths.js";
import { DEFAULT_TIMEOUT_MS, CDP_CONNECT_TIMEOUT_MS, NETWORK_IDLE_TIMEOUT_MS, POST_LOAD_DELAY_MS, SCROLL_STEP_WAIT_MS, SCROLL_MAX_STEPS } from "./constants.js";

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function fileExists(filePath: string): Promise<boolean> {
  try {
    await access(filePath);
    return true;
  } catch {
    return false;
  }
}

interface Args {
  url: string;
  output?: string;
  wait: boolean;
  timeout: number;
}

function parseArgs(argv: string[]): Args {
  const args: Args = { url: "", wait: false, timeout: DEFAULT_TIMEOUT_MS };
  for (let i = 2; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--wait" || arg === "-w") {
      args.wait = true;
    } else if (arg === "-o" || arg === "--output") {
      args.output = argv[++i];
    } else if (arg === "--timeout" || arg === "-t") {
      args.timeout = parseInt(argv[++i], 10) || DEFAULT_TIMEOUT_MS;
    } else if (!arg.startsWith("-") && !args.url) {
      args.url = arg;
    }
  }
  return args;
}

function generateSlug(title: string, url: string): string {
  const text = title || new URL(url).pathname.replace(/\//g, "-");
  return text
    .toLowerCase()
    .replace(/[^\w\s-]/g, "")
    .replace(/\s+/g, "-")
    .replace(/-+/g, "-")
    .replace(/^-|-$/g, "")
    .slice(0, 50) || "page";
}

function formatTimestamp(): string {
  const now = new Date();
  const pad = (n: number) => n.toString().padStart(2, "0");
  return `${now.getFullYear()}${pad(now.getMonth() + 1)}${pad(now.getDate())}-${pad(now.getHours())}${pad(now.getMinutes())}${pad(now.getSeconds())}`;
}

async function generateOutputPath(url: string, title: string): Promise<string> {
  const domain = new URL(url).hostname.replace(/^www\./, "");
  const slug = generateSlug(title, url);
  const dataDir = resolveUrlToMarkdownDataDir();
  const basePath = path.join(dataDir, domain, `${slug}.md`);

  if (!(await fileExists(basePath))) {
    return basePath;
  }

  const timestampSlug = `${slug}-${formatTimestamp()}`;
  return path.join(dataDir, domain, `${timestampSlug}.md`);
}

async function waitForUserSignal(): Promise<void> {
  console.log("Page opened. Press Enter when ready to capture...");
  const rl = createInterface({ input: process.stdin, output: process.stdout });
  await new Promise<void>((resolve) => {
    rl.once("line", () => { rl.close(); resolve(); });
  });
}

async function captureUrl(args: Args): Promise<ConversionResult> {
  const port = await getFreePort();
  const chrome = await launchChrome(args.url, port, false);

  let cdp: CdpConnection | null = null;
  try {
    const wsUrl = await waitForChromeDebugPort(port, 30_000);
    cdp = await CdpConnection.connect(wsUrl, CDP_CONNECT_TIMEOUT_MS);

    const targets = await cdp.send<{ targetInfos: Array<{ targetId: string; type: string; url: string }> }>("Target.getTargets");
    const pageTarget = targets.targetInfos.find(t => t.type === "page" && t.url.startsWith("http"));
    if (!pageTarget) throw new Error("No page target found");

    const { sessionId } = await cdp.send<{ sessionId: string }>("Target.attachToTarget", { targetId: pageTarget.targetId, flatten: true });
    await cdp.send("Network.enable", {}, { sessionId });
    await cdp.send("Page.enable", {}, { sessionId });

    if (args.wait) {
      await waitForUserSignal();
    } else {
      console.log("Waiting for page to load...");
      await Promise.race([
        waitForPageLoad(cdp, sessionId, 15_000),
        sleep(8_000)
      ]);
      await waitForNetworkIdle(cdp, sessionId, NETWORK_IDLE_TIMEOUT_MS);
      await sleep(POST_LOAD_DELAY_MS);
      console.log("Scrolling to trigger lazy load...");
      await autoScroll(cdp, sessionId, SCROLL_MAX_STEPS, SCROLL_STEP_WAIT_MS);
      await sleep(POST_LOAD_DELAY_MS);
    }

    console.log("Capturing page content...");
    const extracted = await evaluateScript<{ title: string; description?: string; author?: string; published?: string; html: string }>(
      cdp, sessionId, cleanupAndExtractScript, args.timeout
    );

    const metadata: PageMetadata = {
      url: args.url,
      title: extracted.title || "",
      description: extracted.description,
      author: extracted.author,
      published: extracted.published,
      captured_at: new Date().toISOString()
    };

    const markdown = htmlToMarkdown(extracted.html);
    return { metadata, markdown };
  } finally {
    if (cdp) {
      try { await cdp.send("Browser.close", {}, { timeoutMs: 5_000 }); } catch {}
      cdp.close();
    }
    killChrome(chrome);
  }
}

async function main(): Promise<void> {
  const args = parseArgs(process.argv);
  if (!args.url) {
    console.error("Usage: bun main.ts <url> [-o output.md] [--wait] [--timeout ms]");
    process.exit(1);
  }

  try {
    new URL(args.url);
  } catch {
    console.error(`Invalid URL: ${args.url}`);
    process.exit(1);
  }

  console.log(`Fetching: ${args.url}`);
  console.log(`Mode: ${args.wait ? "wait" : "auto"}`);

  const result = await captureUrl(args);
  const outputPath = args.output || await generateOutputPath(args.url, result.metadata.title);
  const outputDir = path.dirname(outputPath);
  await mkdir(outputDir, { recursive: true });

  const document = createMarkdownDocument(result);
  await writeFile(outputPath, document, "utf-8");

  console.log(`Saved: ${outputPath}`);
  console.log(`Title: ${result.metadata.title || "(no title)"}`);
}

main().catch((err) => {
  console.error("Error:", err instanceof Error ? err.message : String(err));
  process.exit(1);
});
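The slug pipeline in `generateSlug` is a chain of small normalizations, and its behavior is easiest to see on concrete inputs. This standalone copy is for illustration only; the real function lives in `main.ts` above:

```typescript
// Self-contained copy of the slug logic from main.ts, for illustration.
function generateSlug(title: string, url: string): string {
  // Fall back to the URL path (slashes turned into hyphens) when there is no title.
  const text = title || new URL(url).pathname.replace(/\//g, "-");
  return text
    .toLowerCase()
    .replace(/[^\w\s-]/g, "")   // drop punctuation
    .replace(/\s+/g, "-")       // spaces to hyphens
    .replace(/-+/g, "-")        // collapse hyphen runs
    .replace(/^-|-$/g, "")      // trim edge hyphens
    .slice(0, 50) || "page";    // cap length, never return empty
}

console.log(generateSlug("Hello, World! A Test", "https://example.com/x"));
// → hello-world-a-test
console.log(generateSlug("", "https://example.com/blog/my-post/"));
// → blog-my-post
```

Because the result feeds directly into a filename under the per-domain output directory, the final `|| "page"` fallback matters: a title made entirely of punctuation would otherwise reduce to an empty string.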
@@ -0,0 +1,29 @@
import os from "node:os";
import path from "node:path";
import process from "node:process";

const APP_DATA_DIR = "baoyu-skills";
const URL_TO_MARKDOWN_DATA_DIR = "url-to-markdown";
const PROFILE_DIR_NAME = "chrome-profile";

export function resolveUserDataRoot(): string {
  if (process.platform === "win32") {
    return process.env.APPDATA ?? path.join(os.homedir(), "AppData", "Roaming");
  }
  if (process.platform === "darwin") {
    return path.join(os.homedir(), "Library", "Application Support");
  }
  return process.env.XDG_DATA_HOME ?? path.join(os.homedir(), ".local", "share");
}

export function resolveUrlToMarkdownDataDir(): string {
  const override = process.env.URL_DATA_DIR?.trim();
  if (override) return path.resolve(override);
  return path.join(process.cwd(), URL_TO_MARKDOWN_DATA_DIR);
}

export function resolveUrlToMarkdownChromeProfileDir(): string {
  const override = process.env.URL_CHROME_PROFILE_DIR?.trim();
  if (override) return path.resolve(override);
  return path.join(resolveUserDataRoot(), APP_DATA_DIR, URL_TO_MARKDOWN_DATA_DIR, PROFILE_DIR_NAME);
}
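Both resolver functions in `paths.ts` follow the same env-override pattern: a non-blank environment variable wins, otherwise a default path is built. A minimal sketch of that pattern (`resolveDataDir` is a hypothetical stand-in name, not an export of this module):

```typescript
import path from "node:path";

// Sketch of the override pattern used by resolveUrlToMarkdownDataDir:
// a set, non-blank env value wins over the cwd-relative default.
function resolveDataDir(envValue: string | undefined, cwd: string): string {
  const override = envValue?.trim();
  if (override) return path.resolve(override);
  return path.join(cwd, "url-to-markdown");
}

console.log(resolveDataDir(undefined, "/work"));      // cwd-relative default
console.log(resolveDataDir("/tmp/capture", "/work")); // override wins
```

Trimming before the truthiness check means an env var set to whitespace is treated the same as unset, rather than resolving to the current directory by accident.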
@@ -20,7 +20,7 @@ Break down complex content into eye-catching infographic series for Xiaohongshu
/baoyu-xhs-images posts/ai-future/article.md --layout dense

# Combine style and layout
/baoyu-xhs-images posts/ai-future/article.md --style tech --layout list
/baoyu-xhs-images posts/ai-future/article.md --style notion --layout list

# Direct content input
/baoyu-xhs-images
@@ -98,7 +98,7 @@ Each session creates an independent directory named by content slug:
xhs-images/{topic-slug}/
├── source-{slug}.{ext} # Source files (text, images, etc.)
├── analysis.md # Deep analysis results
├── outline-style-[slug].md # Variant A (e.g., outline-style-tech.md)
├── outline-style-[slug].md # Variant A (e.g., outline-style-notion.md)
├── outline-style-[slug].md # Variant B (e.g., outline-style-notion.md)
├── outline-style-[slug].md # Variant C (e.g., outline-style-minimal.md)
├── outline.md # Final selected
@@ -162,7 +162,7 @@ Based on analysis, create three distinct style variants.

| Variant | Selection Logic | Example Filename |
|---------|-----------------|------------------|
| A | Primary recommendation | `outline-style-tech.md` |
| A | Primary recommendation | `outline-style-notion.md` |
| B | Alternative style | `outline-style-notion.md` |
| C | Different audience/mood | `outline-style-minimal.md` |
@@ -188,7 +188,7 @@ Based on analysis, create three distinct style variants.

```
Question 1 (Style): Which style variant?
- A: tech + dense (Recommended) - 专业科技感,适合干货
- A: notion + dense (Recommended) - 知识卡片风格,适合干货
- B: notion + list - 清爽知识卡片
- C: minimal + balanced - 简约高端风格
- Custom: 自定义风格描述
@@ -239,10 +239,10 @@ Location: [directory path]
Images: N total

✓ analysis.md
✓ outline-style-tech.md
✓ outline-style-notion.md
✓ outline-style-chalkboard.md
✓ outline-style-minimal.md
✓ outline.md (selected: tech + dense)
✓ outline.md (selected: notion + dense)

Files:
- 01-cover-[slug].png ✓ Cover (sparse)
@@ -25,9 +25,9 @@ Unlike other platforms, Xiaohongshu content must prioritize:
| Type | Characteristics | Best Style | Best Layout |
|------|----------------|------------|-------------|
| 种草/安利 | Product recommendation, benefits focus | cute/fresh | balanced/list |
| 干货分享 | Knowledge, tips, how-to | notion/tech | dense/list |
| 干货分享 | Knowledge, tips, how-to | notion | dense/list |
| 个人故事 | Personal experience, emotional | warm | balanced |
| 测评对比 | Review, comparison, pros/cons | tech/bold | comparison |
| 测评对比 | Review, comparison, pros/cons | bold/notion | comparison |
| 教程步骤 | Step-by-step guide | fresh/notion | flow/list |
| 避坑指南 | Warnings, mistakes to avoid | bold | list/comparison |
| 清单合集 | Collections, recommendations | cute/minimal | list/dense |
|
@ -55,10 +55,10 @@ Evaluate title/hook potential using these patterns:
|
|||
| Audience | Interests | Preferred Style | Content Focus |
|
||||
|----------|-----------|-----------------|---------------|
|
||||
| 学生党 | 省钱、学习、校园 | cute/fresh | 平价、教程、学习方法 |
|
||||
| 打工人 | 效率、职场、减压 | minimal/tech | 工具、技巧、摸鱼 |
|
||||
| 打工人 | 效率、职场、减压 | minimal/notion | 工具、技巧、摸鱼 |
|
||||
| 宝妈 | 育儿、家居、省心 | warm/fresh | 实用、安全、经验 |
|
||||
| 精致女孩 | 美妆、穿搭、仪式感 | cute/retro | 好看、氛围、品质 |
|
||||
| 技术宅 | 工具、效率、极客 | tech/notion | 深度、专业、新奇 |
|
||||
| 技术宅 | 工具、效率、极客 | notion/chalkboard | 深度、专业、新奇 |
|
||||
| 美食爱好者 | 探店、食谱、测评 | warm/pop | 好吃、简单、颜值 |
|
||||
| 旅行达人 | 攻略、打卡、小众 | fresh/retro | 省钱、避坑、拍照 |
|
||||
|
||||
|
|
@ -165,7 +165,7 @@ recommended_image_count: 6
|
|||
|
||||
## Content Signals
|
||||
|
||||
- "AI工具" → tech + dense
|
||||
- "AI工具" → notion + dense
|
||||
- "效率" → notion + list
|
||||
- "干货" → minimal + dense
|
||||
|
||||
|
|
@@ -180,7 +180,7 @@ recommended_image_count: 6

## Recommended Approaches

1. **Tech + Dense** - 专业科技感,适合干货分享 (recommended)
1. **Notion + Dense** - 知识卡片风格,适合干货分享 (recommended)
2. **Notion + List** - 清爽知识卡片风格
3. **Minimal + Balanced** - 简约高端,适合职场人群
```
@@ -28,4 +28,4 @@ Comparisons, transformations, decision helpers, 对比图

## Best Style Pairings

bold (dramatic contrast), tech (data comparison), warm (before/after stories)
bold (dramatic contrast), notion (data comparison), warm (before/after stories)
@@ -28,4 +28,4 @@ Summary cards, cheat sheets, comprehensive guides, 干货总结

## Best Style Pairings

tech, notion, minimal (clean styles prevent visual overload)
notion, minimal, chalkboard (clean styles prevent visual overload)
@@ -27,4 +27,4 @@ Processes, timelines, cause-effect chains, workflows

## Best Style Pairings

tech (process diagrams), notion (simple flows), fresh (organic flows)
notion (process diagrams), chalkboard (educational flows), fresh (organic flows)
@@ -5,8 +5,8 @@ Template for generating infographic series outlines.
## File Naming

Outline files use style slug in the name:
- `outline-style-tech.md` - Tech style variant
- `outline-style-notion.md` - Notion style variant
- `outline-style-chalkboard.md` - Chalkboard style variant
- `outline-style-minimal.md` - Minimal style variant
- `outline.md` - Final selected (copied from chosen variant)
@@ -43,7 +43,7 @@ NN-{type}-[slug].md (in prompts/)
# Xiaohongshu Infographic Series Outline

---
style: tech
style: notion
default_layout: dense
image_count: 6
generated: YYYY-MM-DD HH:mm
@@ -223,6 +223,6 @@ Three variants should differ meaningfully:
| Audience | Primary target | Secondary target | Broader appeal |

**Example for "AI工具推荐"**:
- `outline-style-tech.md`: Tech + Dense - 专业极客风
- `outline-style-notion.md`: Notion + Dense - 知识卡片风
- `outline-style-notion.md`: Notion + List - 清爽知识卡片
- `outline-style-cute.md`: Cute + Balanced - 可爱易读风