feat(baoyu-url-to-markdown): add HTML snapshot saving and Defuddle fallback pipeline
- Save rendered HTML as sibling -captured.html file alongside markdown - Defuddle-first conversion with automatic fallback to legacy Readability/selector extractor - Add rawHtml, conversionMethod, fallbackReason to ConversionResult - Log converter method and fallback reason in CLI output
This commit is contained in:
parent
3b031c7768
commit
5560db595a
|
|
@ -677,7 +677,7 @@ Utility tools for content processing.
|
||||||
|
|
||||||
#### baoyu-url-to-markdown
|
#### baoyu-url-to-markdown
|
||||||
|
|
||||||
Fetch any URL via Chrome CDP and convert to clean markdown. Supports two capture modes for different scenarios.
|
Fetch any URL via Chrome CDP and convert to clean markdown. Saves rendered HTML snapshot alongside the markdown, and automatically falls back to a legacy extractor when Defuddle fails.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Auto mode (default) - capture when page loads
|
# Auto mode (default) - capture when page loads
|
||||||
|
|
|
||||||
|
|
@ -677,7 +677,7 @@ AI 驱动的生成后端。
|
||||||
|
|
||||||
#### baoyu-url-to-markdown
|
#### baoyu-url-to-markdown
|
||||||
|
|
||||||
通过 Chrome CDP 抓取任意 URL 并转换为干净的 Markdown。支持两种抓取模式,适应不同场景。
|
通过 Chrome CDP 抓取任意 URL 并转换为 Markdown。同时保存渲染后的 HTML 快照,Defuddle 失败时自动回退到旧版提取器。
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# 自动模式(默认)- 页面加载后立即抓取
|
# 自动模式(默认)- 页面加载后立即抓取
|
||||||
|
|
|
||||||
|
|
@ -1,11 +1,11 @@
|
||||||
---
|
---
|
||||||
name: baoyu-url-to-markdown
|
name: baoyu-url-to-markdown
|
||||||
description: Fetch any URL and convert to markdown using Chrome CDP. Supports two modes - auto-capture on page load, or wait for user signal (for pages requiring login). Use when user wants to save a webpage as markdown.
|
description: Fetch any URL and convert to markdown using Chrome CDP. Saves the rendered HTML snapshot alongside the markdown, and automatically falls back to the pre-Defuddle HTML-to-Markdown pipeline when Defuddle fails. Supports two modes - auto-capture on page load, or wait for user signal (for pages requiring login). Use when user wants to save a webpage as markdown.
|
||||||
---
|
---
|
||||||
|
|
||||||
# URL to Markdown
|
# URL to Markdown
|
||||||
|
|
||||||
Fetches any URL via Chrome CDP and converts HTML to clean markdown.
|
Fetches any URL via Chrome CDP, saves the rendered HTML snapshot, and converts it to clean markdown.
|
||||||
|
|
||||||
## Script Directory
|
## Script Directory
|
||||||
|
|
||||||
|
|
@ -21,6 +21,7 @@ Fetches any URL via Chrome CDP and converts HTML to clean markdown.
|
||||||
| Script | Purpose |
|
| Script | Purpose |
|
||||||
|--------|---------|
|
|--------|---------|
|
||||||
| `scripts/main.ts` | CLI entry point for URL fetching |
|
| `scripts/main.ts` | CLI entry point for URL fetching |
|
||||||
|
| `scripts/html-to-markdown.ts` | Defuddle-first conversion with automatic legacy fallback |
|
||||||
|
|
||||||
## Preferences (EXTEND.md)
|
## Preferences (EXTEND.md)
|
||||||
|
|
||||||
|
|
@ -101,7 +102,9 @@ Full reference: [references/config/first-time-setup.md](references/config/first-
|
||||||
|
|
||||||
- Chrome CDP for full JavaScript rendering
|
- Chrome CDP for full JavaScript rendering
|
||||||
- Two capture modes: auto or wait-for-user
|
- Two capture modes: auto or wait-for-user
|
||||||
|
- Save rendered HTML as a sibling `-captured.html` file
|
||||||
- Clean markdown output with metadata
|
- Clean markdown output with metadata
|
||||||
|
- Defuddle-first markdown conversion with automatic fallback to the pre-Defuddle extractor from git history
|
||||||
- Handles login-required pages via wait mode
|
- Handles login-required pages via wait mode
|
||||||
- Download images and videos to local directories
|
- Download images and videos to local directories
|
||||||
|
|
||||||
|
|
@ -149,13 +152,23 @@ ${BUN_X} ${SKILL_DIR}/scripts/main.ts <url> --download-media
|
||||||
|
|
||||||
## Output Format
|
## Output Format
|
||||||
|
|
||||||
YAML front matter with `url`, `title`, `description`, `author`, `published`, `captured_at` fields, followed by converted markdown content.
|
Each run saves two files side by side:
|
||||||
|
|
||||||
|
- Markdown: YAML front matter with `url`, `title`, `description`, `author`, `published`, optional `coverImage`, and `captured_at`, followed by converted markdown content
|
||||||
|
- HTML snapshot: `*-captured.html`, containing the rendered page HTML captured from Chrome
|
||||||
|
|
||||||
|
The HTML snapshot is saved before any markdown media localization, so it stays a faithful capture of the page DOM used for conversion.
|
||||||
|
|
||||||
## Output Directory
|
## Output Directory
|
||||||
|
|
||||||
Default: `url-to-markdown/<domain>/<slug>.md`
|
Default: `url-to-markdown/<domain>/<slug>.md`
|
||||||
With `--output-dir ./posts/`: `./posts/<domain>/<slug>.md`
|
With `--output-dir ./posts/`: `./posts/<domain>/<slug>.md`
|
||||||
|
|
||||||
|
HTML snapshot path uses the same basename:
|
||||||
|
|
||||||
|
- `url-to-markdown/<domain>/<slug>-captured.html`
|
||||||
|
- `./posts/<domain>/<slug>-captured.html`
|
||||||
|
|
||||||
- `<slug>`: From page title or URL path (kebab-case, 2-6 words)
|
- `<slug>`: From page title or URL path (kebab-case, 2-6 words)
|
||||||
- Conflict resolution: Append timestamp `<slug>-YYYYMMDD-HHMMSS.md`
|
- Conflict resolution: Append timestamp `<slug>-YYYYMMDD-HHMMSS.md`
|
||||||
|
|
||||||
|
|
@ -164,6 +177,19 @@ When `--download-media` is enabled:
|
||||||
- Videos are saved to `videos/` next to the markdown file
|
- Videos are saved to `videos/` next to the markdown file
|
||||||
- Markdown media links are rewritten to local relative paths
|
- Markdown media links are rewritten to local relative paths
|
||||||
|
|
||||||
|
## Conversion Fallback
|
||||||
|
|
||||||
|
Conversion order:
|
||||||
|
|
||||||
|
1. Try Defuddle first
|
||||||
|
2. If Defuddle throws, cannot load, returns obviously incomplete markdown, or captures lower-quality content than the legacy pipeline, automatically fall back to the pre-Defuddle extractor
|
||||||
|
3. The fallback path uses the older Readability/selector/Next.js-data based HTML-to-Markdown implementation recovered from git history
|
||||||
|
|
||||||
|
CLI output will show:
|
||||||
|
|
||||||
|
- `Converter: defuddle` when Defuddle succeeds
|
||||||
|
- `Converter: legacy:...` plus `Fallback used: ...` when fallback was needed
|
||||||
|
|
||||||
## Media Download Workflow
|
## Media Download Workflow
|
||||||
|
|
||||||
Based on `download_media` setting in EXTEND.md:
|
Based on `download_media` setting in EXTEND.md:
|
||||||
|
|
@ -193,7 +219,7 @@ Based on `download_media` setting in EXTEND.md:
|
||||||
| `URL_DATA_DIR` | Custom data directory |
|
| `URL_DATA_DIR` | Custom data directory |
|
||||||
| `URL_CHROME_PROFILE_DIR` | Custom Chrome profile directory |
|
| `URL_CHROME_PROFILE_DIR` | Custom Chrome profile directory |
|
||||||
|
|
||||||
**Troubleshooting**: Chrome not found → set `URL_CHROME_PATH`. Timeout → increase `--timeout`. Complex pages → try `--wait` mode.
|
**Troubleshooting**: Chrome not found → set `URL_CHROME_PATH`. Timeout → increase `--timeout`. Complex pages → try `--wait` mode. If markdown quality is poor, inspect the saved `-captured.html` and check whether the run logged a legacy fallback.
|
||||||
|
|
||||||
## Extension Support
|
## Extension Support
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -0,0 +1,191 @@
|
||||||
|
{
|
||||||
|
"lockfileVersion": 1,
|
||||||
|
"workspaces": {
|
||||||
|
"": {
|
||||||
|
"name": "baoyu-url-to-markdown-scripts",
|
||||||
|
"dependencies": {
|
||||||
|
"@mozilla/readability": "^0.6.0",
|
||||||
|
"defuddle": "^0.10.0",
|
||||||
|
"jsdom": "^24.1.3",
|
||||||
|
"linkedom": "^0.18.12",
|
||||||
|
"turndown": "^7.2.2",
|
||||||
|
"turndown-plugin-gfm": "^1.0.2",
|
||||||
|
},
|
||||||
|
},
|
||||||
|
},
|
||||||
|
"packages": {
|
||||||
|
"@asamuzakjp/css-color": ["@asamuzakjp/css-color@3.2.0", "", { "dependencies": { "@csstools/css-calc": "^2.1.3", "@csstools/css-color-parser": "^3.0.9", "@csstools/css-parser-algorithms": "^3.0.4", "@csstools/css-tokenizer": "^3.0.3", "lru-cache": "^10.4.3" } }, "sha512-K1A6z8tS3XsmCMM86xoWdn7Fkdn9m6RSVtocUrJYIwZnFVkng/PvkEoWtOWmP+Scc6saYWHWZYbndEEXxl24jw=="],
|
||||||
|
|
||||||
|
"@csstools/color-helpers": ["@csstools/color-helpers@5.1.0", "", {}, "sha512-S11EXWJyy0Mz5SYvRmY8nJYTFFd1LCNV+7cXyAgQtOOuzb4EsgfqDufL+9esx72/eLhsRdGZwaldu/h+E4t4BA=="],
|
||||||
|
|
||||||
|
"@csstools/css-calc": ["@csstools/css-calc@2.1.4", "", { "peerDependencies": { "@csstools/css-parser-algorithms": "^3.0.5", "@csstools/css-tokenizer": "^3.0.4" } }, "sha512-3N8oaj+0juUw/1H3YwmDDJXCgTB1gKU6Hc/bB502u9zR0q2vd786XJH9QfrKIEgFlZmhZiq6epXl4rHqhzsIgQ=="],
|
||||||
|
|
||||||
|
"@csstools/css-color-parser": ["@csstools/css-color-parser@3.1.0", "", { "dependencies": { "@csstools/color-helpers": "^5.1.0", "@csstools/css-calc": "^2.1.4" }, "peerDependencies": { "@csstools/css-parser-algorithms": "^3.0.5", "@csstools/css-tokenizer": "^3.0.4" } }, "sha512-nbtKwh3a6xNVIp/VRuXV64yTKnb1IjTAEEh3irzS+HkKjAOYLTGNb9pmVNntZ8iVBHcWDA2Dof0QtPgFI1BaTA=="],
|
||||||
|
|
||||||
|
"@csstools/css-parser-algorithms": ["@csstools/css-parser-algorithms@3.0.5", "", { "peerDependencies": { "@csstools/css-tokenizer": "^3.0.4" } }, "sha512-DaDeUkXZKjdGhgYaHNJTV9pV7Y9B3b644jCLs9Upc3VeNGg6LWARAT6O+Q+/COo+2gg/bM5rhpMAtf70WqfBdQ=="],
|
||||||
|
|
||||||
|
"@csstools/css-tokenizer": ["@csstools/css-tokenizer@3.0.4", "", {}, "sha512-Vd/9EVDiu6PPJt9yAh6roZP6El1xHrdvIVGjyBsHR0RYwNHgL7FJPyIIW4fANJNG6FtyZfvlRPpFI4ZM/lubvw=="],
|
||||||
|
|
||||||
|
"@mixmark-io/domino": ["@mixmark-io/domino@2.2.0", "", {}, "sha512-Y28PR25bHXUg88kCV7nivXrP2Nj2RueZ3/l/jdx6J9f8J4nsEGcgX0Qe6lt7Pa+J79+kPiJU3LguR6O/6zrLOw=="],
|
||||||
|
|
||||||
|
"@mozilla/readability": ["@mozilla/readability@0.6.0", "", {}, "sha512-juG5VWh4qAivzTAeMzvY9xs9HY5rAcr2E4I7tiSSCokRFi7XIZCAu92ZkSTsIj1OPceCifL3cpfteP3pDT9/QQ=="],
|
||||||
|
|
||||||
|
"@xmldom/xmldom": ["@xmldom/xmldom@0.8.11", "", {}, "sha512-cQzWCtO6C8TQiYl1ruKNn2U6Ao4o4WBBcbL61yJl84x+j5sOWWFU9X7DpND8XZG3daDppSsigMdfAIl2upQBRw=="],
|
||||||
|
|
||||||
|
"agent-base": ["agent-base@7.1.4", "", {}, "sha512-MnA+YT8fwfJPgBx3m60MNqakm30XOkyIoH1y6huTQvC0PwZG7ki8NacLBcrPbNoo8vEZy7Jpuk7+jMO+CUovTQ=="],
|
||||||
|
|
||||||
|
"asynckit": ["asynckit@0.4.0", "", {}, "sha512-Oei9OH4tRh0YqU3GxhX79dM/mwVgvbZJaSNaRk+bshkj0S5cfHcgYakreBjrHwatXKbz+IoIdYLxrKim2MjW0Q=="],
|
||||||
|
|
||||||
|
"boolbase": ["boolbase@1.0.0", "", {}, "sha512-JZOSA7Mo9sNGB8+UjSgzdLtokWAky1zbztM3WRLCbZ70/3cTANmQmOdR7y2g+J0e2WXywy1yS468tY+IruqEww=="],
|
||||||
|
|
||||||
|
"call-bind-apply-helpers": ["call-bind-apply-helpers@1.0.2", "", { "dependencies": { "es-errors": "^1.3.0", "function-bind": "^1.1.2" } }, "sha512-Sp1ablJ0ivDkSzjcaJdxEunN5/XvksFJ2sMBFfq6x0ryhQV/2b/KwFe21cMpmHtPOSij8K99/wSfoEuTObmuMQ=="],
|
||||||
|
|
||||||
|
"combined-stream": ["combined-stream@1.0.8", "", { "dependencies": { "delayed-stream": "~1.0.0" } }, "sha512-FQN4MRfuJeHf7cBbBMJFXhKSDq+2kAArBlmRBvcvFE5BB1HZKXtSFASDhdlz9zOYwxh8lDdnvmMOe/+5cdoEdg=="],
|
||||||
|
|
||||||
|
"commander": ["commander@12.1.0", "", {}, "sha512-Vw8qHK3bZM9y/P10u3Vib8o/DdkvA2OtPtZvD871QKjy74Wj1WSKFILMPRPSdUSx5RFK1arlJzEtA4PkFgnbuA=="],
|
||||||
|
|
||||||
|
"css-select": ["css-select@5.2.2", "", { "dependencies": { "boolbase": "^1.0.0", "css-what": "^6.1.0", "domhandler": "^5.0.2", "domutils": "^3.0.1", "nth-check": "^2.0.1" } }, "sha512-TizTzUddG/xYLA3NXodFM0fSbNizXjOKhqiQQwvhlspadZokn1KDy0NZFS0wuEubIYAV5/c1/lAr0TaaFXEXzw=="],
|
||||||
|
|
||||||
|
"css-what": ["css-what@6.2.2", "", {}, "sha512-u/O3vwbptzhMs3L1fQE82ZSLHQQfto5gyZzwteVIEyeaY5Fc7R4dapF/BvRoSYFeqfBk4m0V1Vafq5Pjv25wvA=="],
|
||||||
|
|
||||||
|
"cssom": ["cssom@0.5.0", "", {}, "sha512-iKuQcq+NdHqlAcwUY0o/HL69XQrUaQdMjmStJ8JFmUaiiQErlhrmuigkg/CU4E2J0IyUKUrMAgl36TvN67MqTw=="],
|
||||||
|
|
||||||
|
"cssstyle": ["cssstyle@4.6.0", "", { "dependencies": { "@asamuzakjp/css-color": "^3.2.0", "rrweb-cssom": "^0.8.0" } }, "sha512-2z+rWdzbbSZv6/rhtvzvqeZQHrBaqgogqt85sqFNbabZOuFbCVFb8kPeEtZjiKkbrm395irpNKiYeFeLiQnFPg=="],
|
||||||
|
|
||||||
|
"data-urls": ["data-urls@5.0.0", "", { "dependencies": { "whatwg-mimetype": "^4.0.0", "whatwg-url": "^14.0.0" } }, "sha512-ZYP5VBHshaDAiVZxjbRVcFJpc+4xGgT0bK3vzy1HLN8jTO975HEbuYzZJcHoQEY5K1a0z8YayJkyVETa08eNTg=="],
|
||||||
|
|
||||||
|
"debug": ["debug@4.4.3", "", { "dependencies": { "ms": "^2.1.3" } }, "sha512-RGwwWnwQvkVfavKVt22FGLw+xYSdzARwm0ru6DhTVA3umU5hZc28V3kO4stgYryrTlLpuvgI9GiijltAjNbcqA=="],
|
||||||
|
|
||||||
|
"decimal.js": ["decimal.js@10.6.0", "", {}, "sha512-YpgQiITW3JXGntzdUmyUR1V812Hn8T1YVXhCu+wO3OpS4eU9l4YdD3qjyiKdV6mvV29zapkMeD390UVEf2lkUg=="],
|
||||||
|
|
||||||
|
"defuddle": ["defuddle@0.10.0", "", { "dependencies": { "commander": "^12.1.0" }, "optionalDependencies": { "mathml-to-latex": "^1.5.0", "temml": "^0.13.1", "turndown": "^7.2.0" }, "peerDependencies": { "jsdom": "^24.0.0" }, "bin": { "defuddle": "dist/cli.js" } }, "sha512-a43juTtHv6Vs4+sxvahVLM5NxoyDsarO1Ag3UxLORI4Fo/nsNFwzDxuQBvosKVGTIRxCwN/mfnWAzNXmQfieqw=="],
|
||||||
|
|
||||||
|
"delayed-stream": ["delayed-stream@1.0.0", "", {}, "sha512-ZySD7Nf91aLB0RxL4KGrKHBXl7Eds1DAmEdcoVawXnLD7SDhpNgtuII2aAkg7a7QS41jxPSZ17p4VdGnMHk3MQ=="],
|
||||||
|
|
||||||
|
"dom-serializer": ["dom-serializer@2.0.0", "", { "dependencies": { "domelementtype": "^2.3.0", "domhandler": "^5.0.2", "entities": "^4.2.0" } }, "sha512-wIkAryiqt/nV5EQKqQpo3SToSOV9J0DnbJqwK7Wv/Trc92zIAYZ4FlMu+JPFW1DfGFt81ZTCGgDEabffXeLyJg=="],
|
||||||
|
|
||||||
|
"domelementtype": ["domelementtype@2.3.0", "", {}, "sha512-OLETBj6w0OsagBwdXnPdN0cnMfF9opN69co+7ZrbfPGrdpPVNBUj02spi6B1N7wChLQiPn4CSH/zJvXw56gmHw=="],
|
||||||
|
|
||||||
|
"domhandler": ["domhandler@5.0.3", "", { "dependencies": { "domelementtype": "^2.3.0" } }, "sha512-cgwlv/1iFQiFnU96XXgROh8xTeetsnJiDsTc7TYCLFd9+/WNkIqPTxiM/8pSd8VIrhXGTf1Ny1q1hquVqDJB5w=="],
|
||||||
|
|
||||||
|
"domutils": ["domutils@3.2.2", "", { "dependencies": { "dom-serializer": "^2.0.0", "domelementtype": "^2.3.0", "domhandler": "^5.0.3" } }, "sha512-6kZKyUajlDuqlHKVX1w7gyslj9MPIXzIFiz/rGu35uC1wMi+kMhQwGhl4lt9unC9Vb9INnY9Z3/ZA3+FhASLaw=="],
|
||||||
|
|
||||||
|
"dunder-proto": ["dunder-proto@1.0.1", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.1", "es-errors": "^1.3.0", "gopd": "^1.2.0" } }, "sha512-KIN/nDJBQRcXw0MLVhZE9iQHmG68qAVIBg9CqmUYjmQIhgij9U5MFvrqkUL5FbtyyzZuOeOt0zdeRe4UY7ct+A=="],
|
||||||
|
|
||||||
|
"entities": ["entities@6.0.1", "", {}, "sha512-aN97NXWF6AWBTahfVOIrB/NShkzi5H7F9r1s9mD3cDj4Ko5f2qhhVoYMibXF7GlLveb/D2ioWay8lxI97Ven3g=="],
|
||||||
|
|
||||||
|
"es-define-property": ["es-define-property@1.0.1", "", {}, "sha512-e3nRfgfUZ4rNGL232gUgX06QNyyez04KdjFrF+LTRoOXmrOgFKDg4BCdsjW8EnT69eqdYGmRpJwiPVYNrCaW3g=="],
|
||||||
|
|
||||||
|
"es-errors": ["es-errors@1.3.0", "", {}, "sha512-Zf5H2Kxt2xjTvbJvP2ZWLEICxA6j+hAmMzIlypy4xcBg1vKVnx89Wy0GbS+kf5cwCVFFzdCFh2XSCFNULS6csw=="],
|
||||||
|
|
||||||
|
"es-object-atoms": ["es-object-atoms@1.1.1", "", { "dependencies": { "es-errors": "^1.3.0" } }, "sha512-FGgH2h8zKNim9ljj7dankFPcICIK9Cp5bm+c2gQSYePhpaG5+esrLODihIorn+Pe6FGJzWhXQotPv73jTaldXA=="],
|
||||||
|
|
||||||
|
"es-set-tostringtag": ["es-set-tostringtag@2.1.0", "", { "dependencies": { "es-errors": "^1.3.0", "get-intrinsic": "^1.2.6", "has-tostringtag": "^1.0.2", "hasown": "^2.0.2" } }, "sha512-j6vWzfrGVfyXxge+O0x5sh6cvxAog0a/4Rdd2K36zCMV5eJ+/+tOAngRO8cODMNWbVRdVlmGZQL2YS3yR8bIUA=="],
|
||||||
|
|
||||||
|
"form-data": ["form-data@4.0.5", "", { "dependencies": { "asynckit": "^0.4.0", "combined-stream": "^1.0.8", "es-set-tostringtag": "^2.1.0", "hasown": "^2.0.2", "mime-types": "^2.1.12" } }, "sha512-8RipRLol37bNs2bhoV67fiTEvdTrbMUYcFTiy3+wuuOnUog2QBHCZWXDRijWQfAkhBj2Uf5UnVaiWwA5vdd82w=="],
|
||||||
|
|
||||||
|
"function-bind": ["function-bind@1.1.2", "", {}, "sha512-7XHNxH7qX9xG5mIwxkhumTox/MIRNcOgDrxWsMt2pAr23WHp6MrRlN7FBSFpCpr+oVO0F744iUgR82nJMfG2SA=="],
|
||||||
|
|
||||||
|
"get-intrinsic": ["get-intrinsic@1.3.0", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.2", "es-define-property": "^1.0.1", "es-errors": "^1.3.0", "es-object-atoms": "^1.1.1", "function-bind": "^1.1.2", "get-proto": "^1.0.1", "gopd": "^1.2.0", "has-symbols": "^1.1.0", "hasown": "^2.0.2", "math-intrinsics": "^1.1.0" } }, "sha512-9fSjSaos/fRIVIp+xSJlE6lfwhES7LNtKaCBIamHsjr2na1BiABJPo0mOjjz8GJDURarmCPGqaiVg5mfjb98CQ=="],
|
||||||
|
|
||||||
|
"get-proto": ["get-proto@1.0.1", "", { "dependencies": { "dunder-proto": "^1.0.1", "es-object-atoms": "^1.0.0" } }, "sha512-sTSfBjoXBp89JvIKIefqw7U2CCebsc74kiY6awiGogKtoSGbgjYE/G/+l9sF3MWFPNc9IcoOC4ODfKHfxFmp0g=="],
|
||||||
|
|
||||||
|
"gopd": ["gopd@1.2.0", "", {}, "sha512-ZUKRh6/kUFoAiTAtTYPZJ3hw9wNxx+BIBOijnlG9PnrJsCcSjs1wyyD6vJpaYtgnzDrKYRSqf3OO6Rfa93xsRg=="],
|
||||||
|
|
||||||
|
"has-symbols": ["has-symbols@1.1.0", "", {}, "sha512-1cDNdwJ2Jaohmb3sg4OmKaMBwuC48sYni5HUw2DvsC8LjGTLK9h+eb1X6RyuOHe4hT0ULCW68iomhjUoKUqlPQ=="],
|
||||||
|
|
||||||
|
"has-tostringtag": ["has-tostringtag@1.0.2", "", { "dependencies": { "has-symbols": "^1.0.3" } }, "sha512-NqADB8VjPFLM2V0VvHUewwwsw0ZWBaIdgo+ieHtK3hasLz4qeCRjYcqfB6AQrBggRKppKF8L52/VqdVsO47Dlw=="],
|
||||||
|
|
||||||
|
"hasown": ["hasown@2.0.2", "", { "dependencies": { "function-bind": "^1.1.2" } }, "sha512-0hJU9SCPvmMzIBdZFqNPXWa6dqh7WdH0cII9y+CyS8rG3nL48Bclra9HmKhVVUHyPWNH5Y7xDwAB7bfgSjkUMQ=="],
|
||||||
|
|
||||||
|
"html-encoding-sniffer": ["html-encoding-sniffer@4.0.0", "", { "dependencies": { "whatwg-encoding": "^3.1.1" } }, "sha512-Y22oTqIU4uuPgEemfz7NDJz6OeKf12Lsu+QC+s3BVpda64lTiMYCyGwg5ki4vFxkMwQdeZDl2adZoqUgdFuTgQ=="],
|
||||||
|
|
||||||
|
"html-escaper": ["html-escaper@3.0.3", "", {}, "sha512-RuMffC89BOWQoY0WKGpIhn5gX3iI54O6nRA0yC124NYVtzjmFWBIiFd8M0x+ZdX0P9R4lADg1mgP8C7PxGOWuQ=="],
|
||||||
|
|
||||||
|
"htmlparser2": ["htmlparser2@10.1.0", "", { "dependencies": { "domelementtype": "^2.3.0", "domhandler": "^5.0.3", "domutils": "^3.2.2", "entities": "^7.0.1" } }, "sha512-VTZkM9GWRAtEpveh7MSF6SjjrpNVNNVJfFup7xTY3UpFtm67foy9HDVXneLtFVt4pMz5kZtgNcvCniNFb1hlEQ=="],
|
||||||
|
|
||||||
|
"http-proxy-agent": ["http-proxy-agent@7.0.2", "", { "dependencies": { "agent-base": "^7.1.0", "debug": "^4.3.4" } }, "sha512-T1gkAiYYDWYx3V5Bmyu7HcfcvL7mUrTWiM6yOfa3PIphViJ/gFPbvidQ+veqSOHci/PxBcDabeUNCzpOODJZig=="],
|
||||||
|
|
||||||
|
"https-proxy-agent": ["https-proxy-agent@7.0.6", "", { "dependencies": { "agent-base": "^7.1.2", "debug": "4" } }, "sha512-vK9P5/iUfdl95AI+JVyUuIcVtd4ofvtrOr3HNtM2yxC9bnMbEdp3x01OhQNnjb8IJYi38VlTE3mBXwcfvywuSw=="],
|
||||||
|
|
||||||
|
"iconv-lite": ["iconv-lite@0.6.3", "", { "dependencies": { "safer-buffer": ">= 2.1.2 < 3.0.0" } }, "sha512-4fCk79wshMdzMp2rH06qWrJE4iolqLhCUH+OiuIgU++RB0+94NlDL81atO7GX55uUKueo0txHNtvEyI6D7WdMw=="],
|
||||||
|
|
||||||
|
"is-potential-custom-element-name": ["is-potential-custom-element-name@1.0.1", "", {}, "sha512-bCYeRA2rVibKZd+s2625gGnGF/t7DSqDs4dP7CrLA1m7jKWz6pps0LpYLJN8Q64HtmPKJ1hrN3nzPNKFEKOUiQ=="],
|
||||||
|
|
||||||
|
"jsdom": ["jsdom@24.1.3", "", { "dependencies": { "cssstyle": "^4.0.1", "data-urls": "^5.0.0", "decimal.js": "^10.4.3", "form-data": "^4.0.0", "html-encoding-sniffer": "^4.0.0", "http-proxy-agent": "^7.0.2", "https-proxy-agent": "^7.0.5", "is-potential-custom-element-name": "^1.0.1", "nwsapi": "^2.2.12", "parse5": "^7.1.2", "rrweb-cssom": "^0.7.1", "saxes": "^6.0.0", "symbol-tree": "^3.2.4", "tough-cookie": "^4.1.4", "w3c-xmlserializer": "^5.0.0", "webidl-conversions": "^7.0.0", "whatwg-encoding": "^3.1.1", "whatwg-mimetype": "^4.0.0", "whatwg-url": "^14.0.0", "ws": "^8.18.0", "xml-name-validator": "^5.0.0" }, "peerDependencies": { "canvas": "^2.11.2" }, "optionalPeers": ["canvas"] }, "sha512-MyL55p3Ut3cXbeBEG7Hcv0mVM8pp8PBNWxRqchZnSfAiES1v1mRnMeFfaHWIPULpwsYfvO+ZmMZz5tGCnjzDUQ=="],
|
||||||
|
|
||||||
|
"linkedom": ["linkedom@0.18.12", "", { "dependencies": { "css-select": "^5.1.0", "cssom": "^0.5.0", "html-escaper": "^3.0.3", "htmlparser2": "^10.0.0", "uhyphen": "^0.2.0" }, "peerDependencies": { "canvas": ">= 2" }, "optionalPeers": ["canvas"] }, "sha512-jalJsOwIKuQJSeTvsgzPe9iJzyfVaEJiEXl+25EkKevsULHvMJzpNqwvj1jOESWdmgKDiXObyjOYwlUqG7wo1Q=="],
|
||||||
|
|
||||||
|
"lru-cache": ["lru-cache@10.4.3", "", {}, "sha512-JNAzZcXrCt42VGLuYz0zfAzDfAvJWW6AfYlDBQyDV5DClI2m5sAmK+OIO7s59XfsRsWHp02jAJrRadPRGTt6SQ=="],
|
||||||
|
|
||||||
|
"math-intrinsics": ["math-intrinsics@1.1.0", "", {}, "sha512-/IXtbwEk5HTPyEwyKX6hGkYXxM9nbj64B+ilVJnC/R6B0pH5G4V3b0pVbL7DBj4tkhBAppbQUlf6F6Xl9LHu1g=="],
|
||||||
|
|
||||||
|
"mathml-to-latex": ["mathml-to-latex@1.5.0", "", { "dependencies": { "@xmldom/xmldom": "^0.8.10" } }, "sha512-rrWn0eEvcEcdMM4xfHcSGIy+i01DX9byOdXTLWg+w1iJ6O6ohP5UXY1dVzNUZLhzfl3EGcRekWLhY7JT5Omaew=="],
|
||||||
|
|
||||||
|
"mime-db": ["mime-db@1.52.0", "", {}, "sha512-sPU4uV7dYlvtWJxwwxHD0PuihVNiE7TyAbQ5SWxDCB9mUYvOgroQOwYQQOKPJ8CIbE+1ETVlOoK1UC2nU3gYvg=="],
|
||||||
|
|
||||||
|
"mime-types": ["mime-types@2.1.35", "", { "dependencies": { "mime-db": "1.52.0" } }, "sha512-ZDY+bPm5zTTF+YpCrAU9nK0UgICYPT0QtT1NZWFv4s++TNkcgVaT0g6+4R2uI4MjQjzysHB1zxuWL50hzaeXiw=="],
|
||||||
|
|
||||||
|
"ms": ["ms@2.1.3", "", {}, "sha512-6FlzubTLZG3J2a/NVCAleEhjzq5oxgHyaCU9yYXvcLsvoVaHJq/s5xXI6/XXP6tz7R9xAOtHnSO/tXtF3WRTlA=="],
|
||||||
|
|
||||||
|
"nth-check": ["nth-check@2.1.1", "", { "dependencies": { "boolbase": "^1.0.0" } }, "sha512-lqjrjmaOoAnWfMmBPL+XNnynZh2+swxiX3WUE0s4yEHI6m+AwrK2UZOimIRl3X/4QctVqS8AiZjFqyOGrMXb/w=="],
|
||||||
|
|
||||||
|
"nwsapi": ["nwsapi@2.2.23", "", {}, "sha512-7wfH4sLbt4M0gCDzGE6vzQBo0bfTKjU7Sfpqy/7gs1qBfYz2vEJH6vXcBKpO3+6Yu1telwd0t9HpyOoLEQQbIQ=="],
|
||||||
|
|
||||||
|
"parse5": ["parse5@7.3.0", "", { "dependencies": { "entities": "^6.0.0" } }, "sha512-IInvU7fabl34qmi9gY8XOVxhYyMyuH2xUNpb2q8/Y+7552KlejkRvqvD19nMoUW/uQGGbqNpA6Tufu5FL5BZgw=="],
|
||||||
|
|
||||||
|
"psl": ["psl@1.15.0", "", { "dependencies": { "punycode": "^2.3.1" } }, "sha512-JZd3gMVBAVQkSs6HdNZo9Sdo0LNcQeMNP3CozBJb3JYC/QUYZTnKxP+f8oWRX4rHP5EurWxqAHTSwUCjlNKa1w=="],
|
||||||
|
|
||||||
|
"punycode": ["punycode@2.3.1", "", {}, "sha512-vYt7UD1U9Wg6138shLtLOvdAu+8DsC/ilFtEVHcH+wydcSpNE20AfSOduf6MkRFahL5FY7X1oU7nKVZFtfq8Fg=="],
|
||||||
|
|
||||||
|
"querystringify": ["querystringify@2.2.0", "", {}, "sha512-FIqgj2EUvTa7R50u0rGsyTftzjYmv/a3hO345bZNrqabNqjtgiDMgmo4mkUjd+nzU5oF3dClKqFIPUKybUyqoQ=="],
|
||||||
|
|
||||||
|
"requires-port": ["requires-port@1.0.0", "", {}, "sha512-KigOCHcocU3XODJxsu8i/j8T9tzT4adHiecwORRQ0ZZFcp7ahwXuRU1m+yuO90C5ZUyGeGfocHDI14M3L3yDAQ=="],
|
||||||
|
|
||||||
|
"rrweb-cssom": ["rrweb-cssom@0.7.1", "", {}, "sha512-TrEMa7JGdVm0UThDJSx7ddw5nVm3UJS9o9CCIZ72B1vSyEZoziDqBYP3XIoi/12lKrJR8rE3jeFHMok2F/Mnsg=="],
|
||||||
|
|
||||||
|
"safer-buffer": ["safer-buffer@2.1.2", "", {}, "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg=="],
|
||||||
|
|
||||||
|
"saxes": ["saxes@6.0.0", "", { "dependencies": { "xmlchars": "^2.2.0" } }, "sha512-xAg7SOnEhrm5zI3puOOKyy1OMcMlIJZYNJY7xLBwSze0UjhPLnWfj2GF2EpT0jmzaJKIWKHLsaSSajf35bcYnA=="],
|
||||||
|
|
||||||
|
"symbol-tree": ["symbol-tree@3.2.4", "", {}, "sha512-9QNk5KwDF+Bvz+PyObkmSYjI5ksVUYtjW7AU22r2NKcfLJcXp96hkDWU3+XndOsUb+AQ9QhfzfCT2O+CNWT5Tw=="],
|
||||||
|
|
||||||
|
"temml": ["temml@0.13.1", "", {}, "sha512-/fL1utq8QUD9YpcLeZHPRnp9Cbzbexq5hZl5uSBhf8mNYiKkcS4eYbLidDB+/nF8C+RHAcBQbKw2bKoS83mz1Q=="],
|
||||||
|
|
||||||
|
"tough-cookie": ["tough-cookie@4.1.4", "", { "dependencies": { "psl": "^1.1.33", "punycode": "^2.1.1", "universalify": "^0.2.0", "url-parse": "^1.5.3" } }, "sha512-Loo5UUvLD9ScZ6jh8beX1T6sO1w2/MpCRpEP7V280GKMVUQ0Jzar2U3UJPsrdbziLEMMhu3Ujnq//rhiFuIeag=="],
|
||||||
|
|
||||||
|
"tr46": ["tr46@5.1.1", "", { "dependencies": { "punycode": "^2.3.1" } }, "sha512-hdF5ZgjTqgAntKkklYw0R03MG2x/bSzTtkxmIRw/sTNV8YXsCJ1tfLAX23lhxhHJlEf3CRCOCGGWw3vI3GaSPw=="],
|
||||||
|
|
||||||
|
"turndown": ["turndown@7.2.2", "", { "dependencies": { "@mixmark-io/domino": "^2.2.0" } }, "sha512-1F7db8BiExOKxjSMU2b7if62D/XOyQyZbPKq/nUwopfgnHlqXHqQ0lvfUTeUIr1lZJzOPFn43dODyMSIfvWRKQ=="],
|
||||||
|
|
||||||
|
"turndown-plugin-gfm": ["turndown-plugin-gfm@1.0.2", "", {}, "sha512-vwz9tfvF7XN/jE0dGoBei3FXWuvll78ohzCZQuOb+ZjWrs3a0XhQVomJEb2Qh4VHTPNRO4GPZh0V7VRbiWwkRg=="],
|
||||||
|
|
||||||
|
"uhyphen": ["uhyphen@0.2.0", "", {}, "sha512-qz3o9CHXmJJPGBdqzab7qAYuW8kQGKNEuoHFYrBwV6hWIMcpAmxDLXojcHfFr9US1Pe6zUswEIJIbLI610fuqA=="],
|
||||||
|
|
||||||
|
"universalify": ["universalify@0.2.0", "", {}, "sha512-CJ1QgKmNg3CwvAv/kOFmtnEN05f0D/cn9QntgNOQlQF9dgvVTHj3t+8JPdjqawCHk7V/KA+fbUqzZ9XWhcqPUg=="],
|
||||||
|
|
||||||
|
"url-parse": ["url-parse@1.5.10", "", { "dependencies": { "querystringify": "^2.1.1", "requires-port": "^1.0.0" } }, "sha512-WypcfiRhfeUP9vvF0j6rw0J3hrWrw6iZv3+22h6iRMJ/8z1Tj6XfLP4DsUix5MhMPnXpiHDoKyoZ/bdCkwBCiQ=="],
|
||||||
|
|
||||||
|
"w3c-xmlserializer": ["w3c-xmlserializer@5.0.0", "", { "dependencies": { "xml-name-validator": "^5.0.0" } }, "sha512-o8qghlI8NZHU1lLPrpi2+Uq7abh4GGPpYANlalzWxyWteJOCsr/P+oPBA49TOLu5FTZO4d3F9MnWJfiMo4BkmA=="],
|
||||||
|
|
||||||
|
"webidl-conversions": ["webidl-conversions@7.0.0", "", {}, "sha512-VwddBukDzu71offAQR975unBIGqfKZpM+8ZX6ySk8nYhVoo5CYaZyzt3YBvYtRtO+aoGlqxPg/B87NGVZ/fu6g=="],
|
||||||
|
|
||||||
|
"whatwg-encoding": ["whatwg-encoding@3.1.1", "", { "dependencies": { "iconv-lite": "0.6.3" } }, "sha512-6qN4hJdMwfYBtE3YBTTHhoeuUrDBPZmbQaxWAqSALV/MeEnR5z1xd8UKud2RAkFoPkmB+hli1TZSnyi84xz1vQ=="],
|
||||||
|
|
||||||
|
"whatwg-mimetype": ["whatwg-mimetype@4.0.0", "", {}, "sha512-QaKxh0eNIi2mE9p2vEdzfagOKHCcj1pJ56EEHGQOVxp8r9/iszLUUV7v89x9O1p/T+NlTM5W7jW6+cz4Fq1YVg=="],
|
||||||
|
|
||||||
|
"whatwg-url": ["whatwg-url@14.2.0", "", { "dependencies": { "tr46": "^5.1.0", "webidl-conversions": "^7.0.0" } }, "sha512-De72GdQZzNTUBBChsXueQUnPKDkg/5A5zp7pFDuQAj5UFoENpiACU0wlCvzpAGnTkj++ihpKwKyYewn/XNUbKw=="],
|
||||||
|
|
||||||
|
"ws": ["ws@8.19.0", "", { "peerDependencies": { "bufferutil": "^4.0.1", "utf-8-validate": ">=5.0.2" }, "optionalPeers": ["bufferutil", "utf-8-validate"] }, "sha512-blAT2mjOEIi0ZzruJfIhb3nps74PRWTCz1IjglWEEpQl5XS/UNama6u2/rjFkDDouqr4L67ry+1aGIALViWjDg=="],
|
||||||
|
|
||||||
|
"xml-name-validator": ["xml-name-validator@5.0.0", "", {}, "sha512-EvGK8EJ3DhaHfbRlETOWAS5pO9MZITeauHKJyb8wyajUfQUenkIg2MvLDTZ4T/TgIcm3HU0TFBgWWboAZ30UHg=="],
|
||||||
|
|
||||||
|
"xmlchars": ["xmlchars@2.2.0", "", {}, "sha512-JZnDKK8B0RCDw84FNdDAIpZK+JuJw+s7Lz8nksI7SIuU3UXJJslUthsi+uWBUYOwPFwW7W7PRLRfUKpxjtjFCw=="],
|
||||||
|
|
||||||
|
"cssstyle/rrweb-cssom": ["rrweb-cssom@0.8.0", "", {}, "sha512-guoltQEx+9aMf2gDZ0s62EcV8lsXR+0w8915TC3ITdn2YueuNjdAYh/levpU9nFaoChh9RUS5ZdQMrKfVEN9tw=="],
|
||||||
|
|
||||||
|
"dom-serializer/entities": ["entities@4.5.0", "", {}, "sha512-V0hjH4dGPh9Ao5p0MoRY6BVqtwCjhz6vI5LT8AJ55H+4g9/4vbHx1I54fS0XuclLhDHArPQCiMjDxjaL8fPxhw=="],
|
||||||
|
|
||||||
|
"htmlparser2/entities": ["entities@7.0.1", "", {}, "sha512-TWrgLOFUQTH994YUyl1yT4uyavY5nNB5muff+RtWaqNVCAK408b5ZnnbNAUEWLTCpum9w6arT70i1XdQ4UeOPA=="],
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
@ -1,5 +1,7 @@
|
||||||
import { JSDOM } from "jsdom";
|
import { parseHTML } from "linkedom";
|
||||||
import { Defuddle } from "defuddle/node";
|
import { Readability } from "@mozilla/readability";
|
||||||
|
import TurndownService from "turndown";
|
||||||
|
import { gfm } from "turndown-plugin-gfm";
|
||||||
|
|
||||||
export interface PageMetadata {
|
export interface PageMetadata {
|
||||||
url: string;
|
url: string;
|
||||||
|
|
@ -14,8 +16,103 @@ export interface PageMetadata {
|
||||||
export interface ConversionResult {
|
export interface ConversionResult {
|
||||||
metadata: PageMetadata;
|
metadata: PageMetadata;
|
||||||
markdown: string;
|
markdown: string;
|
||||||
|
rawHtml: string;
|
||||||
|
conversionMethod: string;
|
||||||
|
fallbackReason?: string;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
interface ExtractionCandidate {
|
||||||
|
title: string | null;
|
||||||
|
byline: string | null;
|
||||||
|
excerpt: string | null;
|
||||||
|
published: string | null;
|
||||||
|
html: string | null;
|
||||||
|
textContent: string;
|
||||||
|
method: string;
|
||||||
|
}
|
||||||
|
|
||||||
|
type AnyRecord = Record<string, unknown>;
|
||||||
|
|
||||||
|
const MIN_CONTENT_LENGTH = 120;
|
||||||
|
const GOOD_CONTENT_LENGTH = 900;
|
||||||
|
|
||||||
|
const CONTENT_SELECTORS = [
|
||||||
|
"article",
|
||||||
|
"main article",
|
||||||
|
"[role='main'] article",
|
||||||
|
"[itemprop='articleBody']",
|
||||||
|
".article-content",
|
||||||
|
".article-body",
|
||||||
|
".post-content",
|
||||||
|
".entry-content",
|
||||||
|
".story-body",
|
||||||
|
"main",
|
||||||
|
"[role='main']",
|
||||||
|
"#content",
|
||||||
|
".content",
|
||||||
|
];
|
||||||
|
|
||||||
|
const REMOVE_SELECTORS = [
|
||||||
|
"script",
|
||||||
|
"style",
|
||||||
|
"noscript",
|
||||||
|
"template",
|
||||||
|
"iframe",
|
||||||
|
"svg",
|
||||||
|
"path",
|
||||||
|
"nav",
|
||||||
|
"aside",
|
||||||
|
"footer",
|
||||||
|
"header",
|
||||||
|
"form",
|
||||||
|
".advertisement",
|
||||||
|
".ads",
|
||||||
|
".social-share",
|
||||||
|
".related-articles",
|
||||||
|
".comments",
|
||||||
|
".newsletter",
|
||||||
|
".cookie-banner",
|
||||||
|
".cookie-consent",
|
||||||
|
"[role='navigation']",
|
||||||
|
"[aria-label*='cookie' i]",
|
||||||
|
];
|
||||||
|
|
||||||
|
const PUBLISHED_TIME_SELECTORS = [
|
||||||
|
"meta[property='article:published_time']",
|
||||||
|
"meta[name='pubdate']",
|
||||||
|
"meta[name='publishdate']",
|
||||||
|
"meta[name='date']",
|
||||||
|
"time[datetime]",
|
||||||
|
];
|
||||||
|
|
||||||
|
const ARTICLE_TYPES = new Set([
|
||||||
|
"Article",
|
||||||
|
"NewsArticle",
|
||||||
|
"BlogPosting",
|
||||||
|
"WebPage",
|
||||||
|
"ReportageNewsArticle",
|
||||||
|
]);
|
||||||
|
|
||||||
|
const NEXT_DATA_CONTENT_PATHS = [
|
||||||
|
"props.pageProps.content.body",
|
||||||
|
"props.pageProps.article.body",
|
||||||
|
"props.pageProps.article.content",
|
||||||
|
"props.pageProps.post.body",
|
||||||
|
"props.pageProps.post.content",
|
||||||
|
"props.pageProps.data.body",
|
||||||
|
"props.pageProps.story.body.content",
|
||||||
|
];
|
||||||
|
|
||||||
|
const LOW_QUALITY_MARKERS = [
|
||||||
|
/Join The Conversation/i,
|
||||||
|
/One Community\. Many Voices/i,
|
||||||
|
/Read our community guidelines/i,
|
||||||
|
/Create a free account to share your thoughts/i,
|
||||||
|
/Become a Forbes Member/i,
|
||||||
|
/Subscribe to trusted journalism/i,
|
||||||
|
/\bComments\b/i,
|
||||||
|
];
|
||||||
|
|
||||||
export const absolutizeUrlsScript = String.raw`
|
export const absolutizeUrlsScript = String.raw`
|
||||||
(function() {
|
(function() {
|
||||||
const baseUrl = document.baseURI || location.href;
|
const baseUrl = document.baseURI || location.href;
|
||||||
|
|
@ -53,21 +150,791 @@ export const absolutizeUrlsScript = String.raw`
|
||||||
})()
|
})()
|
||||||
`;
|
`;
|
||||||
|
|
||||||
export async function extractContent(html: string, url: string): Promise<ConversionResult> {
|
function pickString(...values: unknown[]): string | null {
|
||||||
const dom = new JSDOM(html, { url });
|
for (const value of values) {
|
||||||
const result = await Defuddle(dom, url, { markdown: true });
|
if (typeof value === "string") {
|
||||||
|
const trimmed = value.trim();
|
||||||
|
if (trimmed) return trimmed;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
|
||||||
const metadata: PageMetadata = {
|
function normalizeMarkdown(markdown: string): string {
|
||||||
|
return markdown
|
||||||
|
.replace(/\r\n/g, "\n")
|
||||||
|
.replace(/[ \t]+\n/g, "\n")
|
||||||
|
.replace(/\n{3,}/g, "\n\n")
|
||||||
|
.trim();
|
||||||
|
}
|
||||||
|
|
||||||
|
function parseDocument(html: string): Document {
|
||||||
|
const normalized = /<\s*html[\s>]/i.test(html)
|
||||||
|
? html
|
||||||
|
: `<!doctype html><html><body>${html}</body></html>`;
|
||||||
|
return parseHTML(normalized).document as unknown as Document;
|
||||||
|
}
|
||||||
|
|
||||||
|
function sanitizeHtml(html: string): string {
|
||||||
|
const { document } = parseHTML(`<div id="__root">${html}</div>`);
|
||||||
|
const root = document.querySelector("#__root");
|
||||||
|
if (!root) return html;
|
||||||
|
|
||||||
|
for (const selector of ["script", "style", "iframe", "noscript", "template", "svg", "path"]) {
|
||||||
|
for (const el of root.querySelectorAll(selector)) {
|
||||||
|
el.remove();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return root.innerHTML;
|
||||||
|
}
|
||||||
|
|
||||||
|
function extractTextFromHtml(html: string): string {
|
||||||
|
const { document } = parseHTML(`<!doctype html><html><body>${html}</body></html>`);
|
||||||
|
for (const selector of ["script", "style", "noscript", "template", "iframe", "svg", "path"]) {
|
||||||
|
for (const el of document.querySelectorAll(selector)) {
|
||||||
|
el.remove();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return document.body?.textContent?.replace(/\s+/g, " ").trim() ?? "";
|
||||||
|
}
|
||||||
|
|
||||||
|
function getMetaContent(document: Document, names: string[]): string | null {
|
||||||
|
for (const name of names) {
|
||||||
|
const element =
|
||||||
|
document.querySelector(`meta[name="${name}"]`) ??
|
||||||
|
document.querySelector(`meta[property="${name}"]`);
|
||||||
|
const content = element?.getAttribute("content");
|
||||||
|
if (content && content.trim()) return content.trim();
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
|
||||||
|
function flattenJsonLdItems(data: unknown): AnyRecord[] {
|
||||||
|
if (!data || typeof data !== "object") return [];
|
||||||
|
if (Array.isArray(data)) return data.flatMap(flattenJsonLdItems);
|
||||||
|
|
||||||
|
const item = data as AnyRecord;
|
||||||
|
if (Array.isArray(item["@graph"])) {
|
||||||
|
return (item["@graph"] as unknown[]).flatMap(flattenJsonLdItems);
|
||||||
|
}
|
||||||
|
|
||||||
|
return [item];
|
||||||
|
}
|
||||||
|
|
||||||
|
function parseJsonLdScripts(document: Document): AnyRecord[] {
|
||||||
|
const results: AnyRecord[] = [];
|
||||||
|
const scripts = document.querySelectorAll("script[type='application/ld+json']");
|
||||||
|
|
||||||
|
for (const script of scripts) {
|
||||||
|
try {
|
||||||
|
const data = JSON.parse(script.textContent ?? "");
|
||||||
|
results.push(...flattenJsonLdItems(data));
|
||||||
|
} catch {
|
||||||
|
// Ignore malformed blocks.
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return results;
|
||||||
|
}
|
||||||
|
|
||||||
|
function isArticleType(item: AnyRecord): boolean {
|
||||||
|
const value = Array.isArray(item["@type"]) ? item["@type"][0] : item["@type"];
|
||||||
|
return typeof value === "string" && ARTICLE_TYPES.has(value);
|
||||||
|
}
|
||||||
|
|
||||||
|
function extractAuthorFromJsonLd(authorData: unknown): string | null {
|
||||||
|
if (typeof authorData === "string") return authorData;
|
||||||
|
if (!authorData || typeof authorData !== "object") return null;
|
||||||
|
|
||||||
|
if (Array.isArray(authorData)) {
|
||||||
|
const names = authorData
|
||||||
|
.map((author) => extractAuthorFromJsonLd(author))
|
||||||
|
.filter((name): name is string => Boolean(name));
|
||||||
|
return names.length > 0 ? names.join(", ") : null;
|
||||||
|
}
|
||||||
|
|
||||||
|
const author = authorData as AnyRecord;
|
||||||
|
return typeof author.name === "string" ? author.name : null;
|
||||||
|
}
|
||||||
|
|
||||||
|
function extractPrimaryJsonLdMeta(document: Document): Partial<PageMetadata> {
|
||||||
|
for (const item of parseJsonLdScripts(document)) {
|
||||||
|
if (!isArticleType(item)) continue;
|
||||||
|
|
||||||
|
return {
|
||||||
|
title: pickString(item.headline, item.name) ?? undefined,
|
||||||
|
description: pickString(item.description) ?? undefined,
|
||||||
|
author: extractAuthorFromJsonLd(item.author) ?? undefined,
|
||||||
|
published: pickString(item.datePublished, item.dateCreated) ?? undefined,
|
||||||
|
coverImage:
|
||||||
|
pickString(
|
||||||
|
item.image,
|
||||||
|
(item.image as AnyRecord | undefined)?.url,
|
||||||
|
(Array.isArray(item.image) ? item.image[0] : undefined) as unknown
|
||||||
|
) ?? undefined,
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
return {};
|
||||||
|
}
|
||||||
|
|
||||||
|
function extractPublishedTime(document: Document): string | null {
|
||||||
|
for (const selector of PUBLISHED_TIME_SELECTORS) {
|
||||||
|
const el = document.querySelector(selector);
|
||||||
|
if (!el) continue;
|
||||||
|
const value = el.getAttribute("content") ?? el.getAttribute("datetime");
|
||||||
|
if (value && value.trim()) return value.trim();
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
|
||||||
|
function extractTitle(document: Document): string | null {
|
||||||
|
const ogTitle = document.querySelector("meta[property='og:title']")?.getAttribute("content");
|
||||||
|
if (ogTitle && ogTitle.trim()) return ogTitle.trim();
|
||||||
|
|
||||||
|
const twitterTitle = document.querySelector("meta[name='twitter:title']")?.getAttribute("content");
|
||||||
|
if (twitterTitle && twitterTitle.trim()) return twitterTitle.trim();
|
||||||
|
|
||||||
|
const title = document.querySelector("title")?.textContent?.trim();
|
||||||
|
if (title) {
|
||||||
|
const cleaned = title.split(/\s*[-|–—]\s*/)[0]?.trim();
|
||||||
|
if (cleaned) return cleaned;
|
||||||
|
}
|
||||||
|
|
||||||
|
const h1 = document.querySelector("h1")?.textContent?.trim();
|
||||||
|
return h1 || null;
|
||||||
|
}
|
||||||
|
|
||||||
|
function extractMetadataFromHtml(html: string, url: string, capturedAt: string): PageMetadata {
|
||||||
|
const document = parseDocument(html);
|
||||||
|
const jsonLd = extractPrimaryJsonLdMeta(document);
|
||||||
|
const timeEl = document.querySelector("time[datetime]");
|
||||||
|
|
||||||
|
return {
|
||||||
url,
|
url,
|
||||||
title: result.title || "",
|
title:
|
||||||
description: result.description || undefined,
|
pickString(
|
||||||
author: result.author || undefined,
|
getMetaContent(document, ["og:title", "twitter:title"]),
|
||||||
published: result.published || undefined,
|
jsonLd.title,
|
||||||
coverImage: result.image || undefined,
|
document.querySelector("h1")?.textContent,
|
||||||
captured_at: new Date().toISOString(),
|
document.title
|
||||||
|
) ?? "",
|
||||||
|
description:
|
||||||
|
pickString(
|
||||||
|
getMetaContent(document, ["description", "og:description", "twitter:description"]),
|
||||||
|
jsonLd.description
|
||||||
|
) ?? undefined,
|
||||||
|
author:
|
||||||
|
pickString(
|
||||||
|
getMetaContent(document, ["author", "article:author", "twitter:creator"]),
|
||||||
|
jsonLd.author
|
||||||
|
) ?? undefined,
|
||||||
|
published:
|
||||||
|
pickString(
|
||||||
|
timeEl?.getAttribute("datetime"),
|
||||||
|
getMetaContent(document, ["article:published_time", "datePublished", "publishdate", "date"]),
|
||||||
|
jsonLd.published,
|
||||||
|
extractPublishedTime(document)
|
||||||
|
) ?? undefined,
|
||||||
|
coverImage:
|
||||||
|
pickString(
|
||||||
|
getMetaContent(document, ["og:image", "twitter:image", "twitter:image:src"]),
|
||||||
|
jsonLd.coverImage
|
||||||
|
) ?? undefined,
|
||||||
|
captured_at: capturedAt,
|
||||||
};
|
};
|
||||||
|
}
|
||||||
|
|
||||||
return { metadata, markdown: result.content || "" };
|
function generateExcerpt(excerpt: string | null, textContent: string | null): string | null {
|
||||||
|
if (excerpt) return excerpt;
|
||||||
|
if (!textContent) return null;
|
||||||
|
const trimmed = textContent.trim();
|
||||||
|
if (!trimmed) return null;
|
||||||
|
return trimmed.length > 200 ? `${trimmed.slice(0, 200)}...` : trimmed;
|
||||||
|
}
|
||||||
|
|
||||||
|
function parseJsonLdItem(item: AnyRecord): ExtractionCandidate | null {
|
||||||
|
if (!isArticleType(item)) return null;
|
||||||
|
|
||||||
|
const rawContent =
|
||||||
|
(typeof item.articleBody === "string" && item.articleBody) ||
|
||||||
|
(typeof item.text === "string" && item.text) ||
|
||||||
|
(typeof item.description === "string" && item.description) ||
|
||||||
|
null;
|
||||||
|
|
||||||
|
if (!rawContent) return null;
|
||||||
|
|
||||||
|
const content = rawContent.trim();
|
||||||
|
const htmlLike = /<\/?[a-z][\s\S]*>/i.test(content);
|
||||||
|
const textContent = htmlLike ? extractTextFromHtml(content) : content;
|
||||||
|
|
||||||
|
if (textContent.length < MIN_CONTENT_LENGTH) return null;
|
||||||
|
|
||||||
|
return {
|
||||||
|
title: pickString(item.headline, item.name),
|
||||||
|
byline: extractAuthorFromJsonLd(item.author),
|
||||||
|
excerpt: pickString(item.description),
|
||||||
|
published: pickString(item.datePublished, item.dateCreated),
|
||||||
|
html: htmlLike ? content : null,
|
||||||
|
textContent,
|
||||||
|
method: "json-ld",
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
function tryJsonLdExtraction(document: Document): ExtractionCandidate | null {
|
||||||
|
for (const item of parseJsonLdScripts(document)) {
|
||||||
|
const extracted = parseJsonLdItem(item);
|
||||||
|
if (extracted) return extracted;
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
|
||||||
|
function getByPath(value: unknown, path: string): unknown {
|
||||||
|
let current = value;
|
||||||
|
for (const part of path.split(".")) {
|
||||||
|
if (!current || typeof current !== "object") return undefined;
|
||||||
|
current = (current as AnyRecord)[part];
|
||||||
|
}
|
||||||
|
return current;
|
||||||
|
}
|
||||||
|
|
||||||
|
function isContentBlockArray(value: unknown): value is AnyRecord[] {
|
||||||
|
if (!Array.isArray(value) || value.length === 0) return false;
|
||||||
|
return value.slice(0, 5).some((item) => {
|
||||||
|
if (!item || typeof item !== "object") return false;
|
||||||
|
const obj = item as AnyRecord;
|
||||||
|
return "type" in obj || "text" in obj || "textHtml" in obj || "content" in obj;
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
function extractTextFromContentBlocks(blocks: AnyRecord[]): string {
|
||||||
|
const parts: string[] = [];
|
||||||
|
|
||||||
|
function pushParagraph(text: string): void {
|
||||||
|
const trimmed = text.trim();
|
||||||
|
if (!trimmed) return;
|
||||||
|
parts.push(trimmed, "\n\n");
|
||||||
|
}
|
||||||
|
|
||||||
|
function walk(node: unknown): void {
|
||||||
|
if (!node || typeof node !== "object") return;
|
||||||
|
const block = node as AnyRecord;
|
||||||
|
|
||||||
|
if (typeof block.text === "string") {
|
||||||
|
pushParagraph(block.text);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (typeof block.textHtml === "string") {
|
||||||
|
pushParagraph(extractTextFromHtml(block.textHtml));
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (Array.isArray(block.items)) {
|
||||||
|
for (const item of block.items) {
|
||||||
|
if (item && typeof item === "object") {
|
||||||
|
const text = pickString((item as AnyRecord).text);
|
||||||
|
if (text) parts.push(`- ${text}\n`);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
parts.push("\n");
|
||||||
|
}
|
||||||
|
|
||||||
|
if (Array.isArray(block.components)) {
|
||||||
|
for (const component of block.components) {
|
||||||
|
walk(component);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (Array.isArray(block.content)) {
|
||||||
|
for (const child of block.content) {
|
||||||
|
walk(child);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
for (const block of blocks) {
|
||||||
|
walk(block);
|
||||||
|
}
|
||||||
|
|
||||||
|
return parts.join("").replace(/\n{3,}/g, "\n\n").trim();
|
||||||
|
}
|
||||||
|
|
||||||
|
function tryStringBodyExtraction(
|
||||||
|
content: string,
|
||||||
|
meta: AnyRecord,
|
||||||
|
document: Document,
|
||||||
|
method: string
|
||||||
|
): ExtractionCandidate | null {
|
||||||
|
if (!content || content.length < MIN_CONTENT_LENGTH) return null;
|
||||||
|
|
||||||
|
const isHtml = /<\/?[a-z][\s\S]*>/i.test(content);
|
||||||
|
const html = isHtml ? sanitizeHtml(content) : null;
|
||||||
|
const textContent = isHtml ? extractTextFromHtml(html) : content.trim();
|
||||||
|
|
||||||
|
if (textContent.length < MIN_CONTENT_LENGTH) return null;
|
||||||
|
|
||||||
|
return {
|
||||||
|
title: pickString(meta.headline, meta.title, extractTitle(document)),
|
||||||
|
byline: pickString(meta.byline, meta.author),
|
||||||
|
excerpt: pickString(meta.description, meta.excerpt, generateExcerpt(null, textContent)),
|
||||||
|
published: pickString(meta.datePublished, meta.publishedAt, extractPublishedTime(document)),
|
||||||
|
html,
|
||||||
|
textContent,
|
||||||
|
method,
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
function tryNextDataExtraction(document: Document): ExtractionCandidate | null {
|
||||||
|
try {
|
||||||
|
const script = document.querySelector("script#__NEXT_DATA__");
|
||||||
|
if (!script?.textContent) return null;
|
||||||
|
|
||||||
|
const data = JSON.parse(script.textContent) as AnyRecord;
|
||||||
|
const pageProps = (getByPath(data, "props.pageProps") ?? {}) as AnyRecord;
|
||||||
|
|
||||||
|
for (const path of NEXT_DATA_CONTENT_PATHS) {
|
||||||
|
const value = getByPath(data, path);
|
||||||
|
|
||||||
|
if (typeof value === "string") {
|
||||||
|
const parentPath = path.split(".").slice(0, -1).join(".");
|
||||||
|
const parent = (getByPath(data, parentPath) ?? {}) as AnyRecord;
|
||||||
|
const meta = {
|
||||||
|
...pageProps,
|
||||||
|
...parent,
|
||||||
|
title: parent.title ?? (pageProps.title as string | undefined),
|
||||||
|
};
|
||||||
|
|
||||||
|
const candidate = tryStringBodyExtraction(value, meta, document, "next-data");
|
||||||
|
if (candidate) return candidate;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (isContentBlockArray(value)) {
|
||||||
|
const textContent = extractTextFromContentBlocks(value);
|
||||||
|
if (textContent.length < MIN_CONTENT_LENGTH) continue;
|
||||||
|
|
||||||
|
return {
|
||||||
|
title: pickString(
|
||||||
|
getByPath(data, "props.pageProps.content.headline"),
|
||||||
|
getByPath(data, "props.pageProps.article.headline"),
|
||||||
|
getByPath(data, "props.pageProps.article.title"),
|
||||||
|
getByPath(data, "props.pageProps.post.title"),
|
||||||
|
pageProps.title,
|
||||||
|
extractTitle(document)
|
||||||
|
),
|
||||||
|
byline: pickString(
|
||||||
|
getByPath(data, "props.pageProps.author.name"),
|
||||||
|
getByPath(data, "props.pageProps.article.author.name")
|
||||||
|
),
|
||||||
|
excerpt: pickString(
|
||||||
|
getByPath(data, "props.pageProps.content.description"),
|
||||||
|
getByPath(data, "props.pageProps.article.description"),
|
||||||
|
pageProps.description,
|
||||||
|
generateExcerpt(null, textContent)
|
||||||
|
),
|
||||||
|
published: pickString(
|
||||||
|
getByPath(data, "props.pageProps.content.datePublished"),
|
||||||
|
getByPath(data, "props.pageProps.article.datePublished"),
|
||||||
|
getByPath(data, "props.pageProps.publishedAt"),
|
||||||
|
extractPublishedTime(document)
|
||||||
|
),
|
||||||
|
html: null,
|
||||||
|
textContent,
|
||||||
|
method: "next-data",
|
||||||
|
};
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} catch {
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
|
||||||
|
function buildReadabilityCandidate(
|
||||||
|
article: ReturnType<Readability["parse"]>,
|
||||||
|
document: Document,
|
||||||
|
method: string
|
||||||
|
): ExtractionCandidate | null {
|
||||||
|
const textContent = article?.textContent?.trim() ?? "";
|
||||||
|
if (textContent.length < MIN_CONTENT_LENGTH) return null;
|
||||||
|
|
||||||
|
return {
|
||||||
|
title: pickString(article?.title, extractTitle(document)),
|
||||||
|
byline: pickString((article as { byline?: string } | null)?.byline),
|
||||||
|
excerpt: pickString(article?.excerpt, generateExcerpt(null, textContent)),
|
||||||
|
published: pickString((article as { publishedTime?: string } | null)?.publishedTime, extractPublishedTime(document)),
|
||||||
|
html: article?.content ? sanitizeHtml(article.content) : null,
|
||||||
|
textContent,
|
||||||
|
method,
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
function tryReadability(document: Document): ExtractionCandidate | null {
|
||||||
|
try {
|
||||||
|
const strictClone = document.cloneNode(true) as Document;
|
||||||
|
const strictResult = buildReadabilityCandidate(
|
||||||
|
new Readability(strictClone).parse(),
|
||||||
|
document,
|
||||||
|
"readability"
|
||||||
|
);
|
||||||
|
if (strictResult) return strictResult;
|
||||||
|
|
||||||
|
const relaxedClone = document.cloneNode(true) as Document;
|
||||||
|
return buildReadabilityCandidate(
|
||||||
|
new Readability(relaxedClone, { charThreshold: 120 }).parse(),
|
||||||
|
document,
|
||||||
|
"readability-relaxed"
|
||||||
|
);
|
||||||
|
} catch {
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
function trySelectorExtraction(document: Document): ExtractionCandidate | null {
|
||||||
|
for (const selector of CONTENT_SELECTORS) {
|
||||||
|
const element = document.querySelector(selector);
|
||||||
|
if (!element) continue;
|
||||||
|
|
||||||
|
const clone = element.cloneNode(true) as Element;
|
||||||
|
for (const removeSelector of REMOVE_SELECTORS) {
|
||||||
|
for (const node of clone.querySelectorAll(removeSelector)) {
|
||||||
|
node.remove();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
const html = sanitizeHtml(clone.innerHTML);
|
||||||
|
const textContent = extractTextFromHtml(html);
|
||||||
|
if (textContent.length < MIN_CONTENT_LENGTH) continue;
|
||||||
|
|
||||||
|
return {
|
||||||
|
title: extractTitle(document),
|
||||||
|
byline: null,
|
||||||
|
excerpt: generateExcerpt(null, textContent),
|
||||||
|
published: extractPublishedTime(document),
|
||||||
|
html,
|
||||||
|
textContent,
|
||||||
|
method: `selector:${selector}`,
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
|
||||||
|
function tryBodyExtraction(document: Document): ExtractionCandidate | null {
|
||||||
|
const body = document.body;
|
||||||
|
if (!body) return null;
|
||||||
|
|
||||||
|
const clone = body.cloneNode(true) as Element;
|
||||||
|
for (const removeSelector of REMOVE_SELECTORS) {
|
||||||
|
for (const node of clone.querySelectorAll(removeSelector)) {
|
||||||
|
node.remove();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
const html = sanitizeHtml(clone.innerHTML);
|
||||||
|
const textContent = extractTextFromHtml(html);
|
||||||
|
if (!textContent) return null;
|
||||||
|
|
||||||
|
return {
|
||||||
|
title: extractTitle(document),
|
||||||
|
byline: null,
|
||||||
|
excerpt: generateExcerpt(null, textContent),
|
||||||
|
published: extractPublishedTime(document),
|
||||||
|
html,
|
||||||
|
textContent,
|
||||||
|
method: "body-fallback",
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
function pickBestCandidate(candidates: ExtractionCandidate[]): ExtractionCandidate | null {
|
||||||
|
if (candidates.length === 0) return null;
|
||||||
|
|
||||||
|
const methodOrder = [
|
||||||
|
"readability",
|
||||||
|
"readability-relaxed",
|
||||||
|
"next-data",
|
||||||
|
"json-ld",
|
||||||
|
"selector:",
|
||||||
|
"body-fallback",
|
||||||
|
];
|
||||||
|
|
||||||
|
function methodRank(method: string): number {
|
||||||
|
const idx = methodOrder.findIndex((entry) =>
|
||||||
|
entry.endsWith(":") ? method.startsWith(entry) : method === entry
|
||||||
|
);
|
||||||
|
return idx === -1 ? methodOrder.length : idx;
|
||||||
|
}
|
||||||
|
|
||||||
|
const ranked = [...candidates].sort((a, b) => {
|
||||||
|
const rankA = methodRank(a.method);
|
||||||
|
const rankB = methodRank(b.method);
|
||||||
|
if (rankA !== rankB) return rankA - rankB;
|
||||||
|
return (b.textContent.length ?? 0) - (a.textContent.length ?? 0);
|
||||||
|
});
|
||||||
|
|
||||||
|
for (const candidate of ranked) {
|
||||||
|
if (candidate.textContent.length >= GOOD_CONTENT_LENGTH) {
|
||||||
|
return candidate;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
for (const candidate of ranked) {
|
||||||
|
if (candidate.textContent.length >= MIN_CONTENT_LENGTH) {
|
||||||
|
return candidate;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return ranked[0];
|
||||||
|
}
|
||||||
|
|
||||||
|
function extractFromHtml(html: string): ExtractionCandidate | null {
|
||||||
|
const document = parseDocument(html);
|
||||||
|
|
||||||
|
const readabilityCandidate = tryReadability(document);
|
||||||
|
const nextDataCandidate = tryNextDataExtraction(document);
|
||||||
|
const jsonLdCandidate = tryJsonLdExtraction(document);
|
||||||
|
const selectorCandidate = trySelectorExtraction(document);
|
||||||
|
const bodyCandidate = tryBodyExtraction(document);
|
||||||
|
|
||||||
|
const candidates = [
|
||||||
|
readabilityCandidate,
|
||||||
|
nextDataCandidate,
|
||||||
|
jsonLdCandidate,
|
||||||
|
selectorCandidate,
|
||||||
|
bodyCandidate,
|
||||||
|
].filter((candidate): candidate is ExtractionCandidate => Boolean(candidate));
|
||||||
|
|
||||||
|
const winner = pickBestCandidate(candidates);
|
||||||
|
if (!winner) return null;
|
||||||
|
|
||||||
|
return {
|
||||||
|
...winner,
|
||||||
|
title: winner.title ?? extractTitle(document),
|
||||||
|
published: winner.published ?? extractPublishedTime(document),
|
||||||
|
excerpt: winner.excerpt ?? generateExcerpt(null, winner.textContent),
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
const turndown = new TurndownService({
|
||||||
|
headingStyle: "atx",
|
||||||
|
hr: "---",
|
||||||
|
bulletListMarker: "-",
|
||||||
|
codeBlockStyle: "fenced",
|
||||||
|
emDelimiter: "*",
|
||||||
|
strongDelimiter: "**",
|
||||||
|
linkStyle: "inlined",
|
||||||
|
});
|
||||||
|
|
||||||
|
turndown.use(gfm);
|
||||||
|
turndown.remove(["script", "style", "iframe", "noscript", "template", "svg", "path"]);
|
||||||
|
|
||||||
|
turndown.addRule("collapseFigure", {
|
||||||
|
filter: "figure",
|
||||||
|
replacement(content) {
|
||||||
|
return `\n\n${content.trim()}\n\n`;
|
||||||
|
},
|
||||||
|
});
|
||||||
|
|
||||||
|
turndown.addRule("dropInvisibleAnchors", {
|
||||||
|
filter(node) {
|
||||||
|
return node.nodeName === "A" && !(node as Element).textContent?.trim();
|
||||||
|
},
|
||||||
|
replacement() {
|
||||||
|
return "";
|
||||||
|
},
|
||||||
|
});
|
||||||
|
|
||||||
|
function convertHtmlToMarkdown(html: string): string {
|
||||||
|
if (!html || !html.trim()) return "";
|
||||||
|
|
||||||
|
try {
|
||||||
|
const sanitized = sanitizeHtml(html);
|
||||||
|
return turndown.turndown(sanitized);
|
||||||
|
} catch {
|
||||||
|
return "";
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
function fallbackPlainText(html: string): string {
|
||||||
|
const document = parseDocument(html);
|
||||||
|
for (const selector of ["script", "style", "noscript", "template", "iframe", "svg", "path"]) {
|
||||||
|
for (const el of document.querySelectorAll(selector)) {
|
||||||
|
el.remove();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
const text = document.body?.textContent ?? document.documentElement?.textContent ?? "";
|
||||||
|
return normalizeMarkdown(text.replace(/\s+/g, " "));
|
||||||
|
}
|
||||||
|
|
||||||
|
function countBylines(markdown: string): number {
|
||||||
|
return (markdown.match(/(^|\n)By\s+/g) || []).length;
|
||||||
|
}
|
||||||
|
|
||||||
|
function countUsefulParagraphs(markdown: string): number {
|
||||||
|
const paragraphs = normalizeMarkdown(markdown).split(/\n{2,}/);
|
||||||
|
let count = 0;
|
||||||
|
|
||||||
|
for (const paragraph of paragraphs) {
|
||||||
|
const trimmed = paragraph.trim();
|
||||||
|
if (!trimmed) continue;
|
||||||
|
if (/^!?\[[^\]]*\]\([^)]+\)$/.test(trimmed)) continue;
|
||||||
|
if (/^#{1,6}\s+/.test(trimmed)) continue;
|
||||||
|
if ((trimmed.match(/\b[\p{L}\p{N}']+\b/gu) || []).length < 8) continue;
|
||||||
|
count++;
|
||||||
|
}
|
||||||
|
|
||||||
|
return count;
|
||||||
|
}
|
||||||
|
|
||||||
|
function countMarkerHits(markdown: string, markers: RegExp[]): number {
|
||||||
|
let hits = 0;
|
||||||
|
for (const marker of markers) {
|
||||||
|
if (marker.test(markdown)) hits++;
|
||||||
|
}
|
||||||
|
return hits;
|
||||||
|
}
|
||||||
|
|
||||||
|
function scoreMarkdownQuality(markdown: string): number {
|
||||||
|
const normalized = normalizeMarkdown(markdown);
|
||||||
|
const wordCount = (normalized.match(/\b[\p{L}\p{N}']+\b/gu) || []).length;
|
||||||
|
const usefulParagraphs = countUsefulParagraphs(normalized);
|
||||||
|
const headingCount = (normalized.match(/^#{1,6}\s+/gm) || []).length;
|
||||||
|
const markerHits = countMarkerHits(normalized, LOW_QUALITY_MARKERS);
|
||||||
|
const bylineCount = countBylines(normalized);
|
||||||
|
const staffCount = (normalized.match(/\bForbes Staff\b/gi) || []).length;
|
||||||
|
|
||||||
|
return (
|
||||||
|
Math.min(wordCount, 4000) +
|
||||||
|
usefulParagraphs * 40 +
|
||||||
|
headingCount * 10 -
|
||||||
|
markerHits * 180 -
|
||||||
|
Math.max(0, bylineCount - 1) * 120 -
|
||||||
|
Math.max(0, staffCount - 1) * 80
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
function shouldCompareWithLegacy(markdown: string): boolean {
|
||||||
|
const normalized = normalizeMarkdown(markdown);
|
||||||
|
return (
|
||||||
|
countMarkerHits(normalized, LOW_QUALITY_MARKERS) > 0 ||
|
||||||
|
countBylines(normalized) > 1 ||
|
||||||
|
countUsefulParagraphs(normalized) < 6
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
function isMarkdownUsable(markdown: string, html: string): boolean {
|
||||||
|
const normalized = normalizeMarkdown(markdown);
|
||||||
|
if (!normalized) return false;
|
||||||
|
|
||||||
|
const htmlTextLength = extractTextFromHtml(html).length;
|
||||||
|
if (htmlTextLength < MIN_CONTENT_LENGTH) return true;
|
||||||
|
|
||||||
|
if (normalized.length >= 80) return true;
|
||||||
|
return normalized.length >= Math.min(200, Math.floor(htmlTextLength * 0.2));
|
||||||
|
}
|
||||||
|
|
||||||
|
async function tryDefuddleConversion(
|
||||||
|
html: string,
|
||||||
|
url: string,
|
||||||
|
baseMetadata: PageMetadata
|
||||||
|
): Promise<{ ok: true; result: ConversionResult } | { ok: false; reason: string }> {
|
||||||
|
try {
|
||||||
|
const [{ JSDOM, VirtualConsole }, { Defuddle }] = await Promise.all([
|
||||||
|
import("jsdom"),
|
||||||
|
import("defuddle/node"),
|
||||||
|
]);
|
||||||
|
|
||||||
|
const virtualConsole = new VirtualConsole();
|
||||||
|
virtualConsole.on("jsdomError", (error: Error & { type?: string }) => {
|
||||||
|
if (error.type === "css parsing" || /Could not parse CSS stylesheet/i.test(error.message)) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
console.warn(`[url-to-markdown] jsdom: ${error.message}`);
|
||||||
|
});
|
||||||
|
|
||||||
|
const dom = new JSDOM(html, { url, virtualConsole });
|
||||||
|
const result = await Defuddle(dom, url, { markdown: true });
|
||||||
|
const markdown = normalizeMarkdown(result.content || "");
|
||||||
|
|
||||||
|
if (!isMarkdownUsable(markdown, html)) {
|
||||||
|
return { ok: false, reason: "Defuddle returned empty or incomplete markdown" };
|
||||||
|
}
|
||||||
|
|
||||||
|
return {
|
||||||
|
ok: true,
|
||||||
|
result: {
|
||||||
|
metadata: {
|
||||||
|
...baseMetadata,
|
||||||
|
title: pickString(result.title, baseMetadata.title) ?? "",
|
||||||
|
description: pickString(result.description, baseMetadata.description) ?? undefined,
|
||||||
|
author: pickString(result.author, baseMetadata.author) ?? undefined,
|
||||||
|
published: pickString(result.published, baseMetadata.published) ?? undefined,
|
||||||
|
coverImage: pickString(result.image, baseMetadata.coverImage) ?? undefined,
|
||||||
|
},
|
||||||
|
markdown,
|
||||||
|
rawHtml: html,
|
||||||
|
conversionMethod: "defuddle",
|
||||||
|
},
|
||||||
|
};
|
||||||
|
} catch (error) {
|
||||||
|
return {
|
||||||
|
ok: false,
|
||||||
|
reason: error instanceof Error ? error.message : String(error),
|
||||||
|
};
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
function convertWithLegacyExtractor(html: string, baseMetadata: PageMetadata): ConversionResult {
|
||||||
|
const extracted = extractFromHtml(html);
|
||||||
|
|
||||||
|
let markdown = extracted?.html ? convertHtmlToMarkdown(extracted.html) : "";
|
||||||
|
if (!markdown.trim()) {
|
||||||
|
markdown = extracted?.textContent?.trim() || fallbackPlainText(html);
|
||||||
|
}
|
||||||
|
|
||||||
|
return {
|
||||||
|
metadata: {
|
||||||
|
...baseMetadata,
|
||||||
|
title: pickString(extracted?.title, baseMetadata.title) ?? "",
|
||||||
|
description: pickString(extracted?.excerpt, baseMetadata.description) ?? undefined,
|
||||||
|
author: pickString(extracted?.byline, baseMetadata.author) ?? undefined,
|
||||||
|
published: pickString(extracted?.published, baseMetadata.published) ?? undefined,
|
||||||
|
},
|
||||||
|
markdown: normalizeMarkdown(markdown),
|
||||||
|
rawHtml: html,
|
||||||
|
conversionMethod: extracted ? `legacy:${extracted.method}` : "legacy:plain-text",
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
export async function extractContent(html: string, url: string): Promise<ConversionResult> {
|
||||||
|
const capturedAt = new Date().toISOString();
|
||||||
|
const baseMetadata = extractMetadataFromHtml(html, url, capturedAt);
|
||||||
|
|
||||||
|
const defuddleResult = await tryDefuddleConversion(html, url, baseMetadata);
|
||||||
|
if (defuddleResult.ok) {
|
||||||
|
if (shouldCompareWithLegacy(defuddleResult.result.markdown)) {
|
||||||
|
const legacyResult = convertWithLegacyExtractor(html, baseMetadata);
|
||||||
|
const legacyScore = scoreMarkdownQuality(legacyResult.markdown);
|
||||||
|
const defuddleScore = scoreMarkdownQuality(defuddleResult.result.markdown);
|
||||||
|
|
||||||
|
if (legacyScore > defuddleScore + 120) {
|
||||||
|
return {
|
||||||
|
...legacyResult,
|
||||||
|
fallbackReason: "Legacy extractor produced higher-quality markdown than Defuddle",
|
||||||
|
};
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return defuddleResult.result;
|
||||||
|
}
|
||||||
|
|
||||||
|
const fallbackResult = convertWithLegacyExtractor(html, baseMetadata);
|
||||||
|
return {
|
||||||
|
...fallbackResult,
|
||||||
|
fallbackReason: defuddleResult.reason,
|
||||||
|
};
|
||||||
}
|
}
|
||||||
|
|
||||||
function escapeYamlValue(value: string): string {
|
function escapeYamlValue(value: string): string {
|
||||||
|
|
|
||||||
|
|
@ -69,6 +69,12 @@ function formatTimestamp(): string {
|
||||||
return `${now.getFullYear()}${pad(now.getMonth() + 1)}${pad(now.getDate())}-${pad(now.getHours())}${pad(now.getMinutes())}${pad(now.getSeconds())}`;
|
return `${now.getFullYear()}${pad(now.getMonth() + 1)}${pad(now.getDate())}-${pad(now.getHours())}${pad(now.getMinutes())}${pad(now.getSeconds())}`;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
function deriveHtmlSnapshotPath(markdownPath: string): string {
|
||||||
|
const parsed = path.parse(markdownPath);
|
||||||
|
const basename = parsed.ext ? parsed.name : parsed.base;
|
||||||
|
return path.join(parsed.dir, `${basename}-captured.html`);
|
||||||
|
}
|
||||||
|
|
||||||
async function generateOutputPath(url: string, title: string, outputDir?: string): Promise<string> {
|
async function generateOutputPath(url: string, title: string, outputDir?: string): Promise<string> {
|
||||||
const domain = new URL(url).hostname.replace(/^www\./, "");
|
const domain = new URL(url).hostname.replace(/^www\./, "");
|
||||||
const slug = generateSlug(title, url);
|
const slug = generateSlug(title, url);
|
||||||
|
|
@ -141,7 +147,7 @@ async function captureUrl(args: Args): Promise<ConversionResult> {
|
||||||
async function main(): Promise<void> {
|
async function main(): Promise<void> {
|
||||||
const args = parseArgs(process.argv);
|
const args = parseArgs(process.argv);
|
||||||
if (!args.url) {
|
if (!args.url) {
|
||||||
console.error("Usage: bun main.ts <url> [-o output.md] [--wait] [--timeout ms]");
|
console.error("Usage: bun main.ts <url> [-o output.md] [--output-dir dir] [--wait] [--timeout ms] [--download-media]");
|
||||||
process.exit(1);
|
process.exit(1);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -166,7 +172,9 @@ async function main(): Promise<void> {
|
||||||
const result = await captureUrl(args);
|
const result = await captureUrl(args);
|
||||||
const outputPath = args.output || await generateOutputPath(args.url, result.metadata.title, args.outputDir);
|
const outputPath = args.output || await generateOutputPath(args.url, result.metadata.title, args.outputDir);
|
||||||
const outputDir = path.dirname(outputPath);
|
const outputDir = path.dirname(outputPath);
|
||||||
|
const htmlSnapshotPath = deriveHtmlSnapshotPath(outputPath);
|
||||||
await mkdir(outputDir, { recursive: true });
|
await mkdir(outputDir, { recursive: true });
|
||||||
|
await writeFile(htmlSnapshotPath, result.rawHtml, "utf-8");
|
||||||
|
|
||||||
let document = createMarkdownDocument(result);
|
let document = createMarkdownDocument(result);
|
||||||
|
|
||||||
|
|
@ -189,7 +197,12 @@ async function main(): Promise<void> {
|
||||||
await writeFile(outputPath, document, "utf-8");
|
await writeFile(outputPath, document, "utf-8");
|
||||||
|
|
||||||
console.log(`Saved: ${outputPath}`);
|
console.log(`Saved: ${outputPath}`);
|
||||||
|
console.log(`Saved HTML: ${htmlSnapshotPath}`);
|
||||||
console.log(`Title: ${result.metadata.title || "(no title)"}`);
|
console.log(`Title: ${result.metadata.title || "(no title)"}`);
|
||||||
|
console.log(`Converter: ${result.conversionMethod}`);
|
||||||
|
if (result.fallbackReason) {
|
||||||
|
console.warn(`Fallback used: ${result.fallbackReason}`);
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
main().catch((err) => {
|
main().catch((err) => {
|
||||||
|
|
|
||||||
|
|
@ -0,0 +1,13 @@
|
||||||
|
{
|
||||||
|
"name": "baoyu-url-to-markdown-scripts",
|
||||||
|
"private": true,
|
||||||
|
"type": "module",
|
||||||
|
"dependencies": {
|
||||||
|
"@mozilla/readability": "^0.6.0",
|
||||||
|
"defuddle": "^0.10.0",
|
||||||
|
"jsdom": "^24.1.3",
|
||||||
|
"linkedom": "^0.18.12",
|
||||||
|
"turndown": "^7.2.2",
|
||||||
|
"turndown-plugin-gfm": "^1.0.2"
|
||||||
|
}
|
||||||
|
}
|
||||||
Loading…
Reference in New Issue