Compare commits
6 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 829ae4d0c0 | |
| | 8d90743584 | |
| | 9cfa0ac5b1 | |
| | ad62e8b8bb | |
| | f9968a4e0d | |
| | 752f555f0c | |
@ -67,3 +67,7 @@ data/
# SaaS version (managed in a separate repository)
saas/

# Personal docs and scripts (not committed)
docs/
scripts/
@ -0,0 +1,382 @@
|
|||
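# WeChat Official Account Article Content Types and Detection Strategy

This document describes the content types and unavailable states of WeChat Official Account articles, together with the detection and handling strategy for each.

---

## I. Article Content Types

WeChat Official Accounts use the `item_show_type` parameter to distinguish content types. It is usually defined in the JavaScript embedded in the page HTML.

### item_show_type values

| Value | Type | Description |
|----|------|------|
| `0` | Standard rich text | The most common text-and-image article |
| `7` | Audio/video share | Dynamic Vue app; content is loaded via JS |
| `8` | Image-text message | Multi-image posts with short text, similar to Xiaohongshu |
| `10` | Short content | Plain text or a reposted message; no `js_content` container |
| Other | Unknown | Other special content, to be documented |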
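
A minimal sketch of how `item_show_type` might be read from static HTML, assuming the page assigns it in inline JavaScript in a form such as `window.item_show_type = '7'` (the form quoted in the audio share section). The regex, the function name `sketch_get_item_show_type`, and the fallback to `'0'` are assumptions for illustration; the project's real `get_item_show_type` may behave differently.

```python
import re

# Assumed pattern: matches `item_show_type = '7'`, `item_show_type: "8"`, etc.
_ITEM_SHOW_TYPE_RE = re.compile(r"item_show_type\s*[:=]\s*['\"]?(\d+)['\"]?")


def sketch_get_item_show_type(html: str) -> str:
    """Return item_show_type as a string, or '0' when the page does not define it."""
    match = _ITEM_SHOW_TYPE_RE.search(html)
    return match.group(1) if match else "0"
```

Returning a string keeps comparisons such as `== '7'` consistent with how the value is compared elsewhere in this document.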

---

### 1. Standard Rich-Text Article

**item_show_type**: `0` (or undefined)

**Characteristics**:
- Contains `<div id="js_content">` or `<div class="rich_media_content">`
- Mixed text and images
- HTML size: usually > 100KB

**Extraction strategy**:
- Extract the full HTML of the `js_content` region
- Extract all image URLs in order (`data-src` or `src` attribute)
- Generate plain text (`plain_content`) for RSS readers
- Route image URLs through the proxy service (to get around hotlink protection)
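
The "image URLs in order" step can be sketched with the standard-library HTML parser. This is an illustration only: the name `sketch_extract_images_in_order`, the preference for `data-src` over `src`, and the de-duplication are assumptions, not the project's `extract_images_in_order`.

```python
from html.parser import HTMLParser
from typing import List


class _ImgCollector(HTMLParser):
    """Collect image URLs in document order, preferring data-src over src."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Assumption: data-src holds the real URL when both attributes exist.
        url = attr_map.get("data-src") or attr_map.get("src")
        if url and url not in self.urls:
            self.urls.append(url)


def sketch_extract_images_in_order(content_html: str) -> List[str]:
    """Return the image URLs found in content_html, in first-seen order."""
    parser = _ImgCollector()
    parser.feed(content_html)
    return parser.urls
```

A parser-based approach avoids depending on attribute order inside the `<img>` tag, which a single regex would have to account for.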

---

### 1.5. Audio Share Article

**item_show_type**: `7`

**Characteristics**:
- Dynamic Vue app (uses the `common_share_audio` module)
- **No traditional `js_content` container**
- `og:image` and `og:description` are usually empty
- The HTML contains `window.item_show_type = '7'`
- Content is loaded dynamically by JavaScript; the static HTML does not contain the actual audio

**Typical accounts**:
- Podcasts (e.g. "马刺进步报告")
- Audio show shares
- Audio content from WeChat Channels

**Extraction strategy**:
```python
# Detection logic
if get_item_show_type(html) == '7':
    # This is an audio share page
    return _extract_audio_share_content(html)
```

**Extractable content**:
- ✅ Title (from `og:title` or `window.msg_title`)
- ✅ Author (from `og:article:author` or `var nickname`)
- ✅ Cover image (from `og:image`, if present)
- ❌ Audio URL (requires JavaScript execution)
- ❌ Duration
- ❌ Audio player

**How it appears in RSS**:
```html
<div style="background:#f6f6f6;padding:20px;border-radius:8px">
  <p>🎵 音频内容 / Audio Content</p>
  <p>这是微信音频分享文章,内容通过JavaScript动态加载,无法直接提取。</p>
  <p>请在微信中查看完整内容</p>
</div>
```

**Known limitations**:
- The real audio URL cannot be extracted (it requires a browser environment to run the JS)
- Only basic information is available: title, author, and cover image
- RSS readers show a placeholder that directs the user to WeChat for the original

**Future improvements**:
- Use a headless browser (Playwright/Puppeteer) to execute the JavaScript
- Reverse-engineer the WeChat audio API
- Present richer metadata

---

### 2. Image-Only Article

**item_show_type**: `0`

**Characteristics**:
- Has a `<div id="js_content">` container
- The content area contains only `<img>` tags and **no text at all**
- HTML size: 2-3MB (a normal size)

**Handling**:
- Extract the HTML and image list as usual
- Generate placeholder text for `plain_content`: `[纯图片文章,共 X 张图片]` ("image-only article, X images in total")

**Note**:
- Audio detection must be strict so that these pages are not misclassified as audio articles

---

### 3. Image-Text Message

**item_show_type**: `8`

**Characteristics**:
- Multi-image posts with short text, similar to Xiaohongshu
- Uses a special mixed image-and-text layout
- Usually created on a mobile device

**Detection**: `is_image_text_message(html)` → `get_item_show_type(html) == '8'`

**Extraction**: `_extract_image_text_content(html)`

---

### 4. Short Content Message

**item_show_type**: `10`

**Characteristics**:
- Plain text, no `js_content` div
- Short text or reposted content, similar to a Moments post
- Simple HTML structure; the content lives in a special container

**Detection**: `is_short_content_message(html)` → `get_item_show_type(html) == '10'`

**Extraction**: `_extract_short_content(html)`

---

### 5. Audio Article (work in progress)

**item_show_type**: `0`

**Characteristics**:
- Contains an audio player component
- May contain `<mpvoice>` or `<mp-common-mpaudio>` tags
- May also contain images (mixed images and audio)

**Detection**: `is_audio_message(html)`
- Matches a real `<mpvoice>` tag
- Matches a `<mp-common-mpaudio>` tag
- Matches a `<div id="js_editor_audio_xxx">` container

**Important**:
- HTML tags must be matched with strict regular expressions
- Do not match strings such as `voice_encode_fileid` in JS code (doing so misclassifies image-only articles)
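
To illustrate why the loose check misfires, the sketch below compares a substring test with a tag-based test. The tag patterns follow the ones used by `is_audio_message` in this change set; the function names `naive_is_audio` and `strict_is_audio` and the sample HTML are made up for the example.

```python
import re


def naive_is_audio(html: str) -> bool:
    # Loose substring check: also fires on JS that merely mentions audio fields.
    return "voice_encode_fileid" in html or "js_editor_audio" in html


def strict_is_audio(html: str) -> bool:
    # Only real tags and containers count as audio.
    if "<mpvoice" in html:
        return True
    if re.search(r"<mp-common-mpaudio[^>]*>", html, re.IGNORECASE):
        return True
    return bool(
        re.search(r"<div[^>]+id=[\"']js_editor_audio[^\"']*[\"']", html, re.IGNORECASE)
    )
```

An image-only page whose script block happens to declare `voice_encode_fileid` passes the naive check but not the strict one.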
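
**Current status**:
- Basic detection is implemented
- Content extraction is incomplete (mixed image-and-audio case)

---

## II. Unavailable Article States

### 1. Verification Page (retryable) ⚠️

**Characteristics**:
- HTML size: 1.5-2MB (very large)
- Contains the full verification component code
- Key markers: `"环境异常"` ("abnormal environment") + `"完成验证后即可继续访问"` ("complete verification to continue") + `"去验证"` ("verify")

**Cause**:
- The proxy IP has been flagged by WeChat's risk control
- Or the server IP is sending requests too frequently

**Handling**:
- ❌ **Do not** mark as permanently unavailable
- ✅ **Do** mark as retryable (`failed`)
- ✅ Switch proxies, or retry after a cooldown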
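
A small sketch of the verification-page rule, using the three markers listed above and requiring all of them to be present. The helper `sketch_fetch_outcome` and its return values are hypothetical and only show where the "retryable" decision would sit.

```python
# Markers quoted from the verification-page section; all three must be present.
VERIFICATION_MARKERS = ("环境异常", "完成验证后即可继续访问", "去验证")


def is_verification_page(html: str) -> bool:
    """True when the page is WeChat's verification challenge, not an article."""
    return all(marker in html for marker in VERIFICATION_MARKERS)


def sketch_fetch_outcome(html: str) -> str:
    """Hypothetical helper: decide what to do with a fetched page."""
    if is_verification_page(html):
        # IP risk control, not a dead article: keep it retryable.
        return "failed"
    return "continue"
```

Requiring all three markers together reduces the chance that a normal article mentioning one of the phrases is treated as a verification page.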
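
---

### 2. Temporarily Unavailable (permanent failure) ❌

**Characteristics**:
- Very small HTML: < 1KB (the check in `get_unavailable_reason` accepts anything under 2,000 characters)
- `<title>该内容暂时无法查看</title>`
- The page contains a single line of notice

**Handling**:
- ✅ Mark as permanently unavailable (`permanent_fail`)
- Reason: `"暂时无法查看"`

---

### 3. Hidden by Author's Privacy Settings (permanent failure) ❌

**Characteristics**:
- HTML size: 10-20KB (the check in `get_unavailable_reason` only requires fewer than 200,000 characters)
- Empty Vue app: `<div id="app"></div>`
- Empty `<title></title>`
- No article content container
- The page displays "根据作者隐私设置,无法查看该内容" ("Per the author's privacy settings, this content cannot be viewed"), loaded dynamically via JS

**Cause**:
- The author has restricted who can view the article
- Usually members-only content

**Handling**:
- ✅ Mark as permanently unavailable (`permanent_fail`)
- Reason: `"根据作者隐私设置不可查看"`

**Note**:
- The error message for this page is not in the static HTML
- Detection must check the combination of empty Vue app + no content container + empty title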
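
The combined check described above could look roughly like this. It is a sketch under stated assumptions: the function name `sketch_is_privacy_page` and the exact patterns are illustrative, and it does not handle audio share pages (`item_show_type` `7`), which are also Vue apps without a content container.

```python
import re

# Container markers taken from the has_article_content description.
CONTENT_CONTAINER_MARKERS = (
    'id="js_content"',
    'class="rich_media_content',
    'id="page-content"',
)


def sketch_is_privacy_page(html: str) -> bool:
    """Empty Vue app + no content container + empty <title>."""
    has_empty_app = re.search(r'<div id="app">\s*</div>', html) is not None
    has_container = any(marker in html for marker in CONTENT_CONTAINER_MARKERS)
    title = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    has_empty_title = title is not None and not title.group(1).strip()
    return has_empty_app and not has_container and has_empty_title
```

Checking all three signals together matters because each one alone also occurs on valid pages.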

---

### 4. Deleted by Publisher (permanent failure) ❌

**Markers**:
- `"该内容已被发布者删除"`
- `"内容已删除"`

**Handling**:
- ✅ Mark as permanently unavailable
- Reason: `"已被发布者删除"`

---

### 5. Policy Violation (permanent failure) ❌

**Markers**:
- `"此内容因违规无法查看"`
- `"涉嫌违反相关法律法规和政策"`
- `"此内容发送失败无法查看"`
- `"接相关投诉,此内容违反"`

**Handling**:
- ✅ Mark as permanently unavailable
- Reason, depending on the marker matched: `"因违规无法查看"`, `"涉嫌违规被限制"`, `"发送失败无法查看"`, or `"因投诉违规被限制"`

---

### 6. Debunked by a Third Party (permanent failure) ❌

**Markers**:
- `"该文章已被第三方辟谣"`

**Handling**:
- ✅ Mark as permanently unavailable
- Reason: `"已被第三方辟谣"`

---

## III. Extraction Flow

```
Fetch HTML
  ↓
Check whether unavailable (is_article_unavailable)
  ├─ Yes → mark permanent_fail + reason
  └─ No  → continue
  ↓
Check for a content container (has_article_content)
  ├─ No  → mark failed (retryable)
  └─ Yes → continue
  ↓
Extract content by type
  ├─ Audio share (type=7)         → _extract_audio_share_content()
  ├─ Image-text message (type=8)  → _extract_image_text_content()
  ├─ Short content (type=10)      → _extract_short_content()
  ├─ Audio article                → _extract_audio_content()
  └─ Standard article             → extract_content()
  ↓
Extract images (extract_images_in_order)
  ↓
Generate plain text (html_to_text)
  ↓
Check whether image-only article
  ├─ Yes → plain_content = "[纯图片文章,共 X 张图片]"
  └─ No  → keep the generated plain text
  ↓
Return result
```

---

## IV. Key Functions

### 1. `get_unavailable_reason(html) -> str | None`

Detects whether an article is permanently unavailable.

**Return value**:
- `None` - the article is fine, or the fetch can be retried
- `str` - the reason it is unavailable

**Detection order**:
1. Exclude first: verification pages
2. Static markers: deleted, violation, debunked, etc.
3. Special pages: "暂时无法查看" (temporarily unavailable) and privacy-settings pages

---

### 2. `is_audio_message(html) -> bool`

Detects whether the page is an audio article.

**Key points**:
- ✅ Match a real `<mpvoice>` tag
- ✅ Use a regex to match the `<mp-common-mpaudio>` tag
- ✅ Use a regex to match the `<div id="js_editor_audio_xxx">` container
- ❌ Do not use a plain `in` check on bare names (it misfires on JS code)

---

### 3. `has_article_content(html) -> bool`

Quickly checks whether the HTML contains an article content container.

**Container markers**:
- `id="js_content"`
- `class="rich_media_content"`
- `id="page-content"` (government/institutional accounts)
- Or a special-type marker

---

## V. Proxy and Anti-Scraping Strategy

1. **Proxy pool rotation**:
   - Use SOCKS5 proxies
   - Cool down for 120 seconds after a failure
   - Fall back to a direct connection once every proxy has failed

2. **TLS fingerprint impersonation**:
   - Use the `curl_cffi` library
   - Impersonate Chrome 120: `impersonate="chrome120"`

3. **Request headers**:
   - `Referer: https://mp.weixin.qq.com/`
   - Add `Cookie` (WeChat token) when needed
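
The proxy rotation rule (item 1) can be sketched as a round-robin pool with a per-proxy cooldown. `SketchProxyPool`, its method names, and the injectable `clock` are hypothetical; only the 120-second cooldown and the direct-connection fallback come from the list above.

```python
import time
from typing import Callable, Dict, List, Optional

COOLDOWN_SECONDS = 120  # from the list above: cool down for 120 seconds after a failure


class SketchProxyPool:
    """Round-robin proxy pool with per-proxy cooldown (illustrative only)."""

    def __init__(self, proxies: List[str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._proxies = list(proxies)
        self._failed_at: Dict[str, float] = {}
        self._clock = clock
        self._next = 0

    def pick(self) -> Optional[str]:
        """Return the next usable proxy URL, or None to signal a direct connection."""
        now = self._clock()
        for _ in range(len(self._proxies)):
            proxy = self._proxies[self._next % len(self._proxies)]
            self._next += 1
            failed = self._failed_at.get(proxy)
            if failed is None or now - failed >= COOLDOWN_SECONDS:
                return proxy
        return None  # every proxy is cooling down: caller connects directly

    def mark_failed(self, proxy: str) -> None:
        """Record a failure so the proxy is skipped until the cooldown elapses."""
        self._failed_at[proxy] = self._clock()
```

Passing the clock in as a parameter makes the cooldown testable without sleeping; the caller would hand the chosen URL to its HTTP client and call `mark_failed` when a request through it fails.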

---

## VI. Database Fields

### Key fields of the `articles` table

| Field | Type | Description |
|------|------|------|
| `status` | `VARCHAR` | Article status: `pending` (waiting) / `fetched` / `failed` (retryable) / `permanent_fail` (permanently unavailable) |
| `fetch_retry_count` | `INTEGER` | Retry count (at most 3) |
| `content` | `TEXT` | HTML content |
| `plain_content` | `TEXT` | Plain-text content (for RSS) |
| `unavailable_reason` | `VARCHAR` | Unavailable reason (set only when `permanent_fail`) |
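
How these fields might change after a fetch attempt is sketched below. The helper names and the dict-based row are assumptions for illustration; what happens to a row once the retry cap is reached is not specified here, so the sketch simply stops retrying.

```python
from typing import Optional

MAX_FETCH_RETRIES = 3  # "at most 3" per the fetch_retry_count row


def sketch_apply_fetch_result(row: dict, fetched_ok: bool,
                              unavailable_reason: Optional[str] = None) -> dict:
    """Return an updated copy of an articles row after one fetch attempt."""
    updated = dict(row)
    if unavailable_reason:
        # Permanently unavailable: record why, never retry.
        updated["status"] = "permanent_fail"
        updated["unavailable_reason"] = unavailable_reason
    elif fetched_ok:
        updated["status"] = "fetched"
    else:
        # Retryable failure (e.g. verification page): bump the counter.
        updated["status"] = "failed"
        updated["fetch_retry_count"] = row.get("fetch_retry_count", 0) + 1
    return updated


def sketch_should_retry(row: dict) -> bool:
    """Assumption: only `failed` rows under the retry cap are fetched again."""
    return (row["status"] == "failed"
            and row.get("fetch_retry_count", 0) < MAX_FETCH_RETRIES)
```

Returning a copy instead of mutating the row keeps the transition easy to test in isolation.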

---

## VII. Contributing

If you find a new article type or error page, please open an Issue or PR:

1. **Provide details**:
   - Article URLs (at least 3 samples)
   - The full HTML source
   - The expected extraction result

2. **Follow the code conventions**:
   - Use strict regex matching (to avoid false positives)
   - Add detailed comments

3. **Test thoroughly**:
   - Check that normal articles are unaffected
   - Check that the new type is detected correctly

---

**Last updated**: 2026-03-24
**Maintainer**: WeChat RSS API project team
|
||||
12
Dockerfile
|
|
@ -21,15 +21,14 @@ FROM python:3.11-slim
|
|||
|
||||
LABEL maintainer="tmwgsicp"
|
||||
LABEL description="WeChat Official Account Article Download API with RSS Support"
|
||||
LABEL version="1.0.0"
|
||||
LABEL version="1.0.5"
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
# Install runtime dependencies (curl for healthcheck)
|
||||
RUN apt-get update && apt-get install -y --no-install-recommends \
|
||||
curl \
|
||||
&& rm -rf /var/lib/apt/lists/* \
|
||||
&& useradd -m -u 1000 appuser
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Copy wheels from builder and install
|
||||
COPY --from=builder /app/wheels /wheels
|
||||
|
|
@ -38,11 +37,8 @@ RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
|
|||
# Copy application code
|
||||
COPY . .
|
||||
|
||||
# Create data directory for SQLite and set permissions
|
||||
RUN mkdir -p /app/data && chown -R appuser:appuser /app
|
||||
|
||||
# Switch to non-root user
|
||||
USER appuser
|
||||
# Create data directory
|
||||
RUN mkdir -p /app/data
|
||||
|
||||
# Environment variables with sensible defaults
|
||||
ENV PYTHONUNBUFFERED=1 \
|
||||
|
|
|
|||
31
README.md
|
|
@ -27,7 +27,7 @@
|
|||
- **公众号搜索** — 按名称搜索公众号,获取 FakeID
|
||||
- **扫码登录** — 微信公众平台扫码登录,凭证自动保存,4 天有效期
|
||||
- **图片代理** — 代理微信 CDN 图片,解决防盗链问题
|
||||
- **Webhook 通知** — 登录过期、触发验证等事件自动推送(支持企业微信机器人)
|
||||
- **Webhook 通知** — 登录过期提醒(提前24h/6h预警+已过期通知)、触发验证等事件自动推送(支持企业微信机器人)
|
||||
- **API 文档** — 自动生成 Swagger UI / ReDoc,在线调试所有接口
|
||||
|
||||
<div align="center">
|
||||
|
|
@ -361,7 +361,9 @@ cp env.example .env
|
|||
| `WECHAT_TOKEN` | 微信 Token(登录后自动填充) | - |
|
||||
| `WECHAT_COOKIE` | 微信 Cookie(登录后自动填充) | - |
|
||||
| `WECHAT_FAKEID` | 公众号 FakeID(登录后自动填充) | - |
|
||||
| `WEBHOOK_URL` | Webhook 通知地址(可选) | 空 |
|
||||
| `WECHAT_EXPIRE_TIME` | 凭证过期时间(登录后自动填充) | - |
|
||||
| `WEBHOOK_URL` | Webhook 通知地址(支持企业微信机器人) | 空 |
|
||||
| `WEBHOOK_NOTIFICATION_INTERVAL` | 同一事件通知最小间隔(秒) | 300 |
|
||||
| `RATE_LIMIT_GLOBAL` | 全局每分钟请求上限 | 10 |
|
||||
| `RATE_LIMIT_PER_IP` | 单 IP 每分钟请求上限 | 5 |
|
||||
| `RATE_LIMIT_ARTICLE_INTERVAL` | 文章请求最小间隔(秒) | 3 |
|
||||
|
|
@ -493,6 +495,7 @@ PROXY_URLS=socks5://myuser:mypass@vps1-ip:1080,socks5://myuser:mypass@vps2-ip:10
|
|||
│ ├── rate_limiter.py # 限频器
|
||||
│ ├── rss_store.py # RSS 数据存储(SQLite)
|
||||
│ ├── rss_poller.py # RSS 后台轮询器
|
||||
│ ├── login_reminder.py # 登录过期提醒(主动检测)
|
||||
│ ├── content_processor.py # 内容处理与图片代理
|
||||
│ ├── image_proxy.py # 图片URL代理工具
|
||||
│ ├── article_fetcher.py # 批量并发获取文章
|
||||
|
|
@ -502,6 +505,21 @@ PROXY_URLS=socks5://myuser:mypass@vps1-ip:1080,socks5://myuser:mypass@vps2-ip:10
|
|||
|
||||
---
|
||||
|
||||
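## Content Types and Fetch Strategy

This project supports several WeChat Official Account content types, including standard rich text, image-only articles, image-text messages, short content, and audio articles.

For details, see **[CONTENT_TYPES.md](CONTENT_TYPES.md)**

**What the document covers**:
- All supported content types and their `item_show_type` values
- Detection of unavailable states (deleted, violation, privacy, verification page, etc.)
- Anti-scraping strategy and proxy configuration
- Key function reference
- Contribution guide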
|
||||
|
||||
---
|
||||
|
||||
## 常见问题
|
||||
|
||||
<details>
|
||||
|
|
@ -525,9 +543,14 @@ PROXY_URLS=socks5://myuser:mypass@vps1-ip:1080,socks5://myuser:mypass@vps2-ip:10
|
|||
</details>
|
||||
|
||||
<details>
|
||||
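<summary><b>How long does the token last?</b></summary>
<summary><b>How long does the token last? How can I know in advance?</b></summary>

Cookie login is valid for about 4 days; after it expires you need to scan the QR code again. Configure `WEBHOOK_URL` to be notified on expiry.
Cookie login is valid for about 4 days. The system will:
1. Show the expiry time in the frontend (the `/api/admin/status` endpoint returns the `expireTime` and `isExpired` fields)
2. **Check proactively in the background every 6 hours** and warn via Webhook 24h / 6h ahead
3. Notify via Webhook as soon as the credentials expire

Configure `WEBHOOK_URL` (WeChat Work group bots are supported) to receive real-time reminders and avoid RSS polling failures or an unavailable search feature caused by expired credentials.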
|
||||
</details>
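
The 24h / 6h / expired thresholds described in this answer can be sketched as a pure function. It assumes the expiry is stored as a millisecond timestamp, as computed in `biz_login` in this change set; the function name `sketch_reminder_level` and the level labels are hypothetical and not taken from `login_reminder.py`.

```python
from typing import Optional

HOUR_MS = 3600 * 1000


def sketch_reminder_level(expire_time_ms: int, now_ms: int) -> Optional[str]:
    """Map remaining credential lifetime to a reminder level, or None if no reminder is due."""
    remaining = expire_time_ms - now_ms
    if remaining <= 0:
        return "expired"
    if remaining <= 6 * HOUR_MS:
        return "warn_6h"
    if remaining <= 24 * HOUR_MS:
        return "warn_24h"
    return None
```

A background task running every 6 hours could call this with the current time and send a Webhook notification whenever the level changes.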
|
||||
|
||||
<details>
|
||||
|
|
|
|||
10
app.py
|
|
@ -10,6 +10,9 @@
|
|||
"""
|
||||
|
||||
from contextlib import asynccontextmanager
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv()
|
||||
|
||||
from fastapi import FastAPI
|
||||
from fastapi.staticfiles import StaticFiles
|
||||
|
|
@ -56,7 +59,14 @@ async def lifespan(app: FastAPI):
|
|||
|
||||
init_db()
|
||||
await rss_poller.start()
|
||||
|
||||
# 启动登录过期提醒器(自动检测凭证有效期并 webhook 通知)
|
||||
from utils.login_reminder import login_reminder
|
||||
await login_reminder.start()
|
||||
|
||||
yield
|
||||
|
||||
await login_reminder.stop()
|
||||
await rss_poller.stop()
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -6,6 +6,11 @@
|
|||
# 2. Edit .env and set SITE_URL to your actual URL
|
||||
# 3. Run: docker-compose up -d
|
||||
# 4. Visit http://localhost:5000/login.html to scan QR code
|
||||
#
|
||||
# Note for NAS users (Synology/QNAP):
|
||||
# If you encounter permission issues, run on NAS:
|
||||
# - chmod -R 777 ./data
|
||||
# - Credentials are automatically saved to ./data directory
|
||||
|
||||
services:
|
||||
wechat-api:
|
||||
|
|
@ -17,10 +22,10 @@ services:
|
|||
ports:
|
||||
- "5000:5000"
|
||||
volumes:
|
||||
# Persist SQLite database
|
||||
# Persist SQLite database and credentials
|
||||
- ./data:/app/data
|
||||
# Config file (writable - login saves credentials here)
|
||||
- ./.env:/app/.env
|
||||
# Config file (read-only - credentials saved to data/)
|
||||
- ./.env:/app/.env:ro
|
||||
environment:
|
||||
- TZ=Asia/Shanghai
|
||||
healthcheck:
|
||||
|
|
|
|||
|
|
@ -497,8 +497,8 @@ async def biz_login(request: Request):
|
|||
import traceback
|
||||
traceback.print_exc()
|
||||
|
||||
# 计算过期时间(30天后)
|
||||
expire_time = int((time.time() + 30 * 24 * 3600) * 1000)
|
||||
# 计算过期时间(4天后,与微信实际有效期一致)
|
||||
expire_time = int((time.time() + 4 * 24 * 3600) * 1000)
|
||||
|
||||
# 保存凭证
|
||||
auth_manager.save_credentials(
|
||||
|
|
|
|||
|
|
@ -11,6 +11,7 @@ RSS 订阅路由
|
|||
|
||||
import csv
|
||||
import io
|
||||
import os
|
||||
import time
|
||||
import logging
|
||||
from datetime import datetime, timezone
|
||||
|
|
@ -28,6 +29,23 @@ from utils.image_proxy import proxy_image_url
|
|||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def get_base_url(request: Request) -> str:
|
||||
"""
|
||||
获取服务的基础 URL,优先使用环境变量 SITE_URL,
|
||||
支持反向代理(检测 X-Forwarded-Proto 和 X-Forwarded-Host)
|
||||
"""
|
||||
# 优先使用配置的 SITE_URL
|
||||
site_url = os.getenv("SITE_URL", "").strip()
|
||||
if site_url:
|
||||
return site_url.rstrip("/")
|
||||
|
||||
# 检测反向代理头部
|
||||
proto = request.headers.get("X-Forwarded-Proto", "http")
|
||||
host = request.headers.get("X-Forwarded-Host") or request.headers.get("Host", "localhost:5000")
|
||||
|
||||
return f"{proto}://{host}"
|
||||
|
||||
router = APIRouter()
|
||||
|
||||
|
||||
|
|
@ -118,7 +136,7 @@ async def get_subscriptions(request: Request):
|
|||
返回每个订阅的基本信息、缓存文章数和 RSS 地址。
|
||||
"""
|
||||
subs = rss_store.list_subscriptions()
|
||||
base_url = str(request.base_url).rstrip("/")
|
||||
base_url = get_base_url(request)
|
||||
|
||||
items = []
|
||||
for s in subs:
|
||||
|
|
@ -195,7 +213,7 @@ async def get_aggregated_rss_feed(
|
|||
|
||||
articles = rss_store.get_all_articles(limit=limit) if subs else []
|
||||
|
||||
base_url = str(request.base_url).rstrip("/")
|
||||
base_url = get_base_url(request)
|
||||
xml = _build_aggregated_rss_xml(articles, nickname_map, base_url)
|
||||
return Response(
|
||||
content=xml,
|
||||
|
|
@ -218,7 +236,7 @@ async def export_subscriptions(
|
|||
- **opml**: 标准 OPML 格式,可直接导入 RSS 阅读器
|
||||
"""
|
||||
subs = rss_store.list_subscriptions()
|
||||
base_url = str(request.base_url).rstrip("/")
|
||||
base_url = get_base_url(request)
|
||||
|
||||
if format == "opml":
|
||||
return _build_opml_response(subs, base_url)
|
||||
|
|
@ -448,7 +466,7 @@ async def get_rss_feed(fakeid: str, request: Request,
|
|||
raise HTTPException(status_code=404, detail="未找到该订阅,请先添加订阅")
|
||||
|
||||
articles = rss_store.get_articles(fakeid, limit=limit)
|
||||
base_url = str(request.base_url).rstrip("/")
|
||||
base_url = get_base_url(request)
|
||||
xml = _build_rss_xml(fakeid, sub, articles, base_url)
|
||||
|
||||
return Response(
|
||||
|
|
|
|||
|
|
@ -8,6 +8,7 @@
|
|||
搜索路由 - FastAPI版本
|
||||
"""
|
||||
|
||||
import os
|
||||
from fastapi import APIRouter, Query, Request
|
||||
from pydantic import BaseModel
|
||||
from typing import Optional, List
|
||||
|
|
@ -18,6 +19,21 @@ from utils.image_proxy import proxy_image_url
|
|||
|
||||
router = APIRouter()
|
||||
|
||||
|
||||
def get_base_url(request: Request) -> str:
|
||||
"""
|
||||
获取服务的基础 URL,优先使用环境变量 SITE_URL,
|
||||
支持反向代理(检测 X-Forwarded-Proto 和 X-Forwarded-Host)
|
||||
"""
|
||||
site_url = os.getenv("SITE_URL", "").strip()
|
||||
if site_url:
|
||||
return site_url.rstrip("/")
|
||||
|
||||
proto = request.headers.get("X-Forwarded-Proto", "http")
|
||||
host = request.headers.get("X-Forwarded-Host") or request.headers.get("Host", "localhost:5000")
|
||||
|
||||
return f"{proto}://{host}"
|
||||
|
||||
class Account(BaseModel):
|
||||
"""公众号模型"""
|
||||
id: str
|
||||
|
|
@ -80,7 +96,7 @@ async def search_accounts(query: str = Query(..., description="公众号名称
|
|||
accounts = result.get("list", [])
|
||||
|
||||
# 获取 base_url 用于图片代理
|
||||
base_url = str(request.base_url).rstrip("/") if request else ""
|
||||
base_url = get_base_url(request) if request else ""
|
||||
|
||||
# 格式化返回数据
|
||||
formatted_accounts = []
|
||||
|
|
|
|||
|
|
@ -34,12 +34,31 @@ class AuthManager:
|
|||
self.base_dir = Path(__file__).parent.parent
|
||||
self.env_path = self.base_dir / ".env"
|
||||
|
||||
# Docker环境下的凭证文件(存储在data目录,权限更可靠)
|
||||
self.credentials_file = self.base_dir / "data" / ".credentials.json"
|
||||
|
||||
# 加载环境变量
|
||||
self._load_credentials()
|
||||
self._initialized = True
|
||||
|
||||
def _load_credentials(self):
|
||||
"""从.env文件加载凭证"""
|
||||
"""
|
||||
从多个来源加载凭证,优先级:
|
||||
1. data/.credentials.json (Docker环境推荐)
|
||||
2. .env 文件 (本地部署)
|
||||
3. 环境变量
|
||||
"""
|
||||
# 先尝试从 JSON 凭证文件加载(Docker 环境)
|
||||
if self.credentials_file.exists():
|
||||
try:
|
||||
import json
|
||||
with open(self.credentials_file, 'r', encoding='utf-8') as f:
|
||||
self.credentials = json.load(f)
|
||||
return
|
||||
except Exception as e:
|
||||
print(f"Warning: Failed to load credentials from {self.credentials_file}: {e}")
|
||||
|
||||
# 回退到 .env 文件(本地部署)
|
||||
if self.env_path.exists():
|
||||
load_dotenv(self.env_path, override=True)
|
||||
|
||||
|
|
@ -54,7 +73,9 @@ class AuthManager:
|
|||
def save_credentials(self, token: str, cookie: str, fakeid: str,
|
||||
nickname: str, expire_time: int) -> bool:
|
||||
"""
|
||||
保存凭证到.env文件
|
||||
保存凭证,支持双存储策略:
|
||||
1. 优先保存到 data/.credentials.json (Docker环境推荐,权限可靠)
|
||||
2. 同时尝试保存到 .env (本地部署兼容)
|
||||
|
||||
Args:
|
||||
token: 微信Token
|
||||
|
|
@ -66,21 +87,33 @@ class AuthManager:
|
|||
Returns:
|
||||
保存是否成功
|
||||
"""
|
||||
# 更新内存中的凭证
|
||||
self.credentials.update({
|
||||
"token": token,
|
||||
"cookie": cookie,
|
||||
"fakeid": fakeid,
|
||||
"nickname": nickname,
|
||||
"expire_time": expire_time
|
||||
})
|
||||
|
||||
success = False
|
||||
|
||||
# 策略1: 保存到 data/.credentials.json (Docker 环境优先)
|
||||
try:
|
||||
import json
|
||||
self.credentials_file.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(self.credentials_file, 'w', encoding='utf-8') as f:
|
||||
json.dump(self.credentials, f, indent=2, ensure_ascii=False)
|
||||
print(f"[OK] 凭证已保存到: {self.credentials_file}")
|
||||
success = True
|
||||
except Exception as e:
|
||||
print(f"[WARN] 无法保存到凭证文件: {e}")
|
||||
|
||||
# 策略2: 同时尝试保存到 .env 文件(本地部署兼容)
|
||||
try:
|
||||
# 更新内存中的凭证
|
||||
self.credentials.update({
|
||||
"token": token,
|
||||
"cookie": cookie,
|
||||
"fakeid": fakeid,
|
||||
"nickname": nickname,
|
||||
"expire_time": expire_time
|
||||
})
|
||||
|
||||
# 确保.env文件存在
|
||||
if not self.env_path.exists():
|
||||
self.env_path.touch()
|
||||
|
||||
# 保存到.env文件
|
||||
env_file = str(self.env_path)
|
||||
set_key(env_file, "WECHAT_TOKEN", token)
|
||||
set_key(env_file, "WECHAT_COOKIE", cookie)
|
||||
|
|
@ -88,11 +121,17 @@ class AuthManager:
|
|||
set_key(env_file, "WECHAT_NICKNAME", nickname)
|
||||
set_key(env_file, "WECHAT_EXPIRE_TIME", str(expire_time))
|
||||
|
||||
print(f"✅ 凭证已保存到: {self.env_path}")
|
||||
return True
|
||||
print(f"[OK] 凭证已同步到: {self.env_path}")
|
||||
success = True
|
||||
except Exception as e:
|
||||
print(f"❌ 保存凭证失败: {e}")
|
||||
print(f"[WARN] 无法写入 .env 文件 (Docker环境正常): {e}")
|
||||
# Docker 环境下 .env 可能只读,不影响功能
|
||||
|
||||
if not success:
|
||||
print(f"[ERROR] 凭证保存完全失败")
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
def get_credentials(self) -> Optional[Dict[str, any]]:
|
||||
"""
|
||||
|
|
@ -155,7 +194,7 @@ class AuthManager:
|
|||
|
||||
def clear_credentials(self) -> bool:
|
||||
"""
|
||||
清除凭证
|
||||
清除凭证(双存储都清除)
|
||||
|
||||
Returns:
|
||||
清除是否成功
|
||||
|
|
@ -178,12 +217,20 @@ class AuthManager:
|
|||
for key in env_keys:
|
||||
os.environ.pop(key, None)
|
||||
|
||||
# 删除凭证文件
|
||||
if self.credentials_file.exists():
|
||||
self.credentials_file.unlink()
|
||||
print(f"[OK] 凭证文件已删除: {self.credentials_file}")
|
||||
|
||||
# 清空 .env 文件中的凭证字段(保留其他配置)
|
||||
if self.env_path.exists():
|
||||
env_file = str(self.env_path)
|
||||
for key in env_keys:
|
||||
set_key(env_file, key, "")
|
||||
print(f"✅ 凭证已清除: {self.env_path}")
|
||||
try:
|
||||
if self.env_path.exists():
|
||||
env_file = str(self.env_path)
|
||||
for key in env_keys:
|
||||
set_key(env_file, key, "")
|
||||
print(f"[OK] .env 凭证已清除: {self.env_path}")
|
||||
except Exception as e:
|
||||
print(f"[WARN] 无法清除 .env 文件 (Docker环境正常): {e}")
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
|
|
|
|||
|
|
@ -53,6 +53,11 @@ def process_article_content(html: str, proxy_base_url: str = None) -> Dict:
|
|||
# 5. 生成纯文本
|
||||
plain_content = html_to_text(content)
|
||||
|
||||
# 6. 纯图片文章处理:如果没有文字但有图片,生成图片描述
|
||||
if not plain_content.strip() and images:
|
||||
plain_content = f"[纯图片文章,共 {len(images)} 张图片]"
|
||||
logger.info(f"检测到纯图片文章: {len(images)} 张图片,无文字内容")
|
||||
|
||||
return {
|
||||
'content': content,
|
||||
'plain_content': plain_content,
|
||||
|
|
@ -100,15 +105,22 @@ def extract_content(html: str) -> str:
|
|||
Extract article body, trying multiple container patterns.
|
||||
Different WeChat account types (government, media, personal) use
|
||||
different HTML structures. We try them in order of specificity.
|
||||
For image-text messages (item_show_type=8) and short posts (item_show_type=10),
|
||||
delegates to helpers.
|
||||
For image-text messages (item_show_type=8), short posts (item_show_type=10),
|
||||
and audio share pages (item_show_type=7), delegates to helpers.
|
||||
"""
|
||||
from utils.helpers import (
|
||||
is_image_text_message, _extract_image_text_content,
|
||||
is_short_content_message, _extract_short_content,
|
||||
is_audio_message, _extract_audio_content,
|
||||
get_item_show_type, _extract_audio_share_content,
|
||||
)
|
||||
|
||||
# Check for audio/video share pages (item_show_type=7) FIRST
|
||||
# These pages use Vue apps and have no js_content div
|
||||
if get_item_show_type(html) == '7':
|
||||
result = _extract_audio_share_content(html)
|
||||
return result.get('content', '')
|
||||
|
||||
if is_image_text_message(html):
|
||||
result = _extract_image_text_content(html)
|
||||
return result.get('content', '')
|
||||
|
|
|
|||
214
utils/helpers.py
|
|
@ -82,11 +82,26 @@ def is_audio_message(html: str) -> bool:
|
|||
"""
|
||||
Detect audio articles (voice messages embedded via mpvoice / mp-common-mpaudio).
|
||||
检测是否为音频文章(包含 mpvoice 标签或音频播放器组件)。
|
||||
|
||||
Important: Must check for ACTUAL audio tags, not just JS code that mentions audio.
|
||||
"""
|
||||
return ('voice_encode_fileid' in html or
|
||||
'<mpvoice' in html or
|
||||
'mp-common-mpaudio' in html or
|
||||
'js_editor_audio' in html)
|
||||
# 方法1: 检查是否有真实的 <mpvoice> 标签(注意:mpvoice 是自定义标签)
|
||||
if '<mpvoice' in html:
|
||||
return True
|
||||
|
||||
# 方法2: 检查是否有音频播放器组件的 **HTML标签**(不是JS代码)
|
||||
# 使用更严格的正则,确保匹配的是标签而不是JS变量
|
||||
import re
|
||||
|
||||
# 匹配实际的音频标签:<mp-common-mpaudio ...>
|
||||
if re.search(r'<mp-common-mpaudio[^>]*>', html, re.IGNORECASE):
|
||||
return True
|
||||
|
||||
# 匹配实际的音频容器:<div id="js_editor_audio_...">
|
||||
if re.search(r'<div[^>]+id=["\']js_editor_audio[^"\']*["\']', html, re.IGNORECASE):
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
|
||||
def _extract_image_text_content(html: str) -> Dict:
|
||||
|
|
@ -346,12 +361,14 @@ def _extract_audio_content(html: str) -> Dict:
|
|||
dur_str = f' ({minutes}:{seconds:02d})'
|
||||
|
||||
display_name = audio['name'] or f'Audio {i + 1}'
|
||||
# 友好提示:音频需要微信鉴权,不提供无法播放的URL
|
||||
html_parts.append(
|
||||
f'<div style="margin:12px 0;padding:12px 16px;background:#f6f6f6;border-radius:8px">'
|
||||
f'<p style="margin:0 0 4px;font-size:15px;font-weight:500">'
|
||||
f'{html_module.escape(display_name)}{dur_str}</p>'
|
||||
f'<a href="{audio["url"]}" style="color:#1890ff;font-size:14px">'
|
||||
f'[Play Audio / Click to Listen]</a>'
|
||||
f'<div style="margin:12px 0;padding:12px 16px;background:#fff9e6;'
|
||||
f'border-left:4px solid #fa8c16;border-radius:4px">'
|
||||
f'<p style="margin:0 0 4px;font-size:14px;color:#595959;font-weight:500">'
|
||||
f'音频内容: {html_module.escape(display_name)}{dur_str}</p>'
|
||||
f'<p style="margin:0;font-size:13px;color:#8c8c8c">'
|
||||
f'此文章包含音频,需要在微信中查看完整内容</p>'
|
||||
f'</div>'
|
||||
)
|
||||
|
||||
|
|
@ -372,6 +389,104 @@ def _extract_audio_content(html: str) -> Dict:
|
|||
}
|
||||
|
||||
|
||||
def _extract_audio_share_content(html: str) -> Dict:
|
||||
"""
|
||||
Extract content from item_show_type=7 audio/video share pages.
|
||||
|
||||
These pages use dynamic Vue applications (common_share_audio module),
|
||||
so most content is loaded via JavaScript. We can only extract basic
|
||||
metadata from the static HTML.
|
||||
|
||||
Example: Podcast episodes, audio shows (e.g., 马刺进步报告)
|
||||
"""
|
||||
import html as html_module
|
||||
|
||||
# 提取标题
|
||||
title = ''
|
||||
title_match = (
|
||||
re.search(r'<meta\s+property="og:title"\s+content="([^"]+)"', html) or
|
||||
re.search(r"window\.msg_title\s*=\s*window\.title\s*=\s*'([^']*)'", html)
|
||||
)
|
||||
if title_match:
|
||||
title = html_module.unescape(title_match.group(1))
|
||||
|
||||
# 提取作者
|
||||
author = ''
|
||||
author_match = (
|
||||
re.search(r'<meta\s+property="og:article:author"\s+content="([^"]+)"', html) or
|
||||
re.search(r'var\s+nickname\s*=\s*"([^"]+)"', html)
|
||||
)
|
||||
if author_match:
|
||||
author = html_module.unescape(author_match.group(1))
|
||||
|
||||
# 提取封面图(如果有)
|
||||
images = []
|
||||
og_image_match = re.search(r'<meta\s+property="og:image"\s+content="([^"]+)"', html)
|
||||
if og_image_match:
|
||||
img_url = og_image_match.group(1)
|
||||
if img_url and ('mmbiz' in img_url or img_url.startswith('http')):
|
||||
images.append(img_url)
|
||||
|
||||
# 生成内容
|
||||
content_parts = []
|
||||
|
||||
# 标题(如果有)
|
||||
if title:
|
||||
content_parts.append(
|
||||
f'<div style="margin:20px 0;text-align:center">'
|
||||
f'<h2 style="margin:0;font-size:22px;font-weight:600;color:#262626">{title}</h2>'
|
||||
f'</div>'
|
||||
)
|
||||
|
||||
# 作者(如果有)
|
||||
if author:
|
||||
content_parts.append(
|
||||
f'<div style="margin:12px 0;text-align:center">'
|
||||
f'<p style="margin:0;font-size:14px;color:#8c8c8c">作者: {author}</p>'
|
||||
f'</div>'
|
||||
)
|
||||
|
||||
# 封面图
|
||||
if images:
|
||||
for img_url in images:
|
||||
content_parts.append(
|
||||
f'<div style="text-align:center;margin:16px 0">'
|
||||
f'<img src="{img_url}" data-src="{img_url}" '
|
||||
f'style="max-width:100%;height:auto;border-radius:8px" />'
|
||||
f'</div>'
|
||||
)
|
||||
|
||||
# 音频占位符(使用中英双语,适配RSS阅读器)
|
||||
content_parts.append(
|
||||
'<div style="background:#f6f6f6;padding:20px;border-radius:8px;'
|
||||
'text-align:center;margin:20px 0;border:2px dashed #d9d9d9">'
|
||||
'<p style="margin:0;font-size:18px;color:#333">🎵 音频内容 / Audio Content</p>'
|
||||
'<p style="margin:12px 0;font-size:14px;color:#666;line-height:1.6">'
|
||||
'这是微信音频分享文章,内容通过JavaScript动态加载,无法直接提取。<br>'
|
||||
'This is a WeChat audio share article. Content is loaded dynamically via JavaScript.</p>'
|
||||
'<p style="margin:8px 0;font-size:13px;color:#999">'
|
||||
'请在微信中查看完整内容 / Please view in WeChat app</p>'
|
||||
'</div>'
|
||||
)
|
||||
|
||||
content = '\n'.join(content_parts)
|
||||
|
||||
# 纯文本
|
||||
plain_content = f"[音频分享文章 / Audio Share Article]\n\n"
|
||||
if title:
|
||||
plain_content += f"标题 / Title: {title}\n"
|
||||
if author:
|
||||
plain_content += f"作者 / Author: {author}\n"
|
||||
plain_content += "\n(此音频内容无法直接提取,请在微信中查看)"
|
||||
plain_content += "\n(Audio content cannot be extracted directly, please view in WeChat)"
|
||||
|
||||
return {
|
||||
'content': content,
|
||||
'plain_content': plain_content,
|
||||
'images': images,
|
||||
}
|
||||
|
||||
|
||||
def extract_article_info(html: str, params: Optional[Dict] = None) -> Dict:
|
||||
"""
|
||||
从HTML中提取文章信息
|
||||
|
|
@ -427,18 +542,29 @@ def extract_article_info(html: str, params: Optional[Dict] = None) -> Dict:
|
|||
except (ValueError, TypeError):
|
||||
pass
|
||||
|
||||
# 检测特殊内容类型
|
||||
if is_image_text_message(html):
|
||||
# 优先处理特殊类型(按 item_show_type 判断)
|
||||
item_type = get_item_show_type(html)
|
||||
|
||||
if item_type == '7':
|
||||
# item_show_type=7: 音频/视频分享页面(动态Vue应用)
|
||||
audio_share_data = _extract_audio_share_content(html)
|
||||
content = audio_share_data['content']
|
||||
images = audio_share_data['images']
|
||||
plain_content = audio_share_data['plain_content']
|
||||
elif item_type == '8' or is_image_text_message(html):
|
||||
# item_show_type=8: 图文消息
|
||||
img_text_data = _extract_image_text_content(html)
|
||||
content = img_text_data['content']
|
||||
images = img_text_data['images']
|
||||
plain_content = img_text_data['plain_content']
|
||||
elif is_short_content_message(html):
|
||||
elif item_type == '10' or is_short_content_message(html):
|
||||
# item_show_type=10: 短内容/转发消息
|
||||
short_data = _extract_short_content(html)
|
||||
content = short_data['content']
|
||||
images = short_data['images']
|
||||
plain_content = short_data['plain_content']
|
||||
elif is_audio_message(html):
|
||||
# 音频文章(mpvoice / mp-common-mpaudio)
|
||||
audio_data = _extract_audio_content(html)
|
||||
content = audio_data['content']
|
||||
images = audio_data['images']
|
||||
|
|
@ -520,6 +646,12 @@ def has_article_content(html: str) -> bool:
|
|||
return True
|
||||
if is_image_text_message(html) or is_short_content_message(html) or is_audio_message(html):
|
||||
return True
|
||||
|
||||
# item_show_type=7: Audio/video share pages (dynamic Vue app)
|
||||
# These pages have no traditional content container, but are valid articles
|
||||
if get_item_show_type(html) == '7':
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
|
||||
|
|
@ -554,21 +686,77 @@ def get_unavailable_reason(html: str) -> Optional[str]:
    """
    Return a human-readable reason if the article is permanently unavailable, else None.

    Important: Must distinguish between:
    1. Verification pages (environment error) - NOT unavailable, should retry
    2. "暂时无法查看" ("temporarily unavailable") standalone page - IS unavailable (HTML < 2KB, minimal structure)
    3. Privacy/payment pages (empty Vue app) - IS unavailable
    4. Truly unavailable articles (deleted/censored) - permanently unavailable
    """
    # Rule out first: WeChat verification pages (not an unavailable article, but IP risk control).
    # Signature: contains "环境异常" (abnormal environment) + "完成验证" (complete verification)
    # + "去验证" (go verify), and the HTML is typically large (>1.5MB).
    verification_markers = ["环境异常", "完成验证后即可继续访问", "去验证"]
    if all(marker in html for marker in verification_markers):
        return None

    # Genuine unavailability markers (explicit text in the static HTML).
    # Note: a normal WeChat article may contain strings such as "已删除" (deleted) or
    # "违规" (violation) inside its JS code, so make sure a keyword sits in the actual
    # content rather than in a JS string literal.
    markers = [
        ("该内容已被发布者删除", "已被发布者删除"),  # deleted by the publisher
        ("内容已删除", "已被发布者删除"),  # content deleted
        ("此内容因违规无法查看", "因违规无法查看"),  # blocked for a violation
        ("涉嫌违反相关法律法规和政策", "涉嫌违规被限制"),  # suspected legal/policy violation
        ("此内容发送失败无法查看", "发送失败无法查看"),  # failed to send
        ("该内容暂时无法查看", "暂时无法查看"),  # temporarily unavailable
        ("根据作者隐私设置,无法查看该内容", "作者隐私设置不可见"),  # hidden by author's privacy settings
        ("接相关投诉,此内容违反", "因投诉违规被限制"),  # restricted after complaints
        ("该文章已被第三方辟谣", "已被第三方辟谣"),  # debunked by a third party
    ]
    for keyword, reason in markers:
        if keyword in html:
            # Extra check: if the HTML is very large (>1MB) and has a real content container,
            # it is a normal article, and the keyword may only be a string in JS code.
            if len(html) > 1000000:
                has_real_content = (
                    'id="js_content"' in html or
                    'class="rich_media_content' in html
                )
                if has_real_content:
                    # Confirm further: look for the keyword in the visible part of <body>
                    # within the first 50KB. If it only appears in a later <script>, skip it.
                    import re
                    body_match = re.search(r'<body[^>]*>(.*?)(?:<script|$)', html[:50000], re.DOTALL | re.IGNORECASE)
                    if body_match and keyword not in body_match.group(1):
                        # Keyword is not near the start of <body>, probably JS code; skip this marker
                        continue
            return reason

    # Special case: standalone "该内容暂时无法查看" (temporarily unavailable) page.
    # Signature: very small HTML (<2KB) + the <title> tag contains this text = standalone error page.
    # Both conditions must hold, to avoid misjudging normal articles that contain this sentence.
    # Note: as written, the marker loop above already returns for this phrase when the
    # HTML is this small, so this branch acts only as a safeguard.
    if "该内容暂时无法查看" in html and len(html) < 2000:
        import re
        title_match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE)
        if title_match and "该内容暂时无法查看" in title_match.group(1):
            return "暂时无法查看"

    # Special case: empty Vue app (dynamic error page for privacy settings).
    # Signature: <div id="app"></div> is empty + no article content container + HTML not huge (<200KB).
    # The error message on such pages is loaded dynamically by JS and is not visible in the static HTML.
    # What the user actually sees: "根据作者隐私设置,无法查看该内容"
    # (cannot be viewed due to the author's privacy settings).
    if '<div id="app">' in html and len(html) < 200000:
        import re
        # Check whether there is an actual article content container
        has_content_container = (
            'id="js_content"' in html or
            'class="rich_media_content' in html or
            'class="rich_media_area_primary_inner' in html
        )
        # No content container and an empty title: this is a privacy-restricted page
        title_match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE)
        if not has_content_container and title_match and not title_match.group(1).strip():
            return "根据作者隐私设置不可查看"

    return None

|
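The large-page guard in the hunk above can be isolated into a small helper to make its behavior easier to test. This is a sketch under assumptions: `keyword_in_visible_body` is a hypothetical name, and the 50,000-character window simply mirrors the slice used in the diff. It reports whether a marker should count as visible page text rather than a string buried in a later `<script>`.

```python
import re


def keyword_in_visible_body(html: str, keyword: str, window: int = 50000) -> bool:
    """Hypothetical helper mirroring the guard in get_unavailable_reason.

    Returns False only when a <body> is found in the first `window` characters
    and the keyword is absent from the text between <body> and the first <script>.
    """
    body_match = re.search(
        r'<body[^>]*>(.*?)(?:<script|$)',
        html[:window],
        re.DOTALL | re.IGNORECASE,
    )
    if body_match is None:
        # As in the diff: with no <body> match the marker is not skipped.
        return True
    return keyword in body_match.group(1)


# Keyword in visible markup versus only inside a script string
visible = '<html><body><p>该内容已被发布者删除</p><script>var a = 1;</script></body></html>'
hidden = '<html><body><div id="js_content">hi</div><script>var s = "该内容已被发布者删除";</script></body></html>'
```

One consequence of this design is that text appearing in the body after the first `<script>` tag is also treated as hidden, so the check is a heuristic rather than a full HTML parse.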
@ -0,0 +1,150 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright (C) 2026 tmwgsicp
# Licensed under the GNU Affero General Public License v3.0
# See LICENSE file in the project root for full license text.
# SPDX-License-Identifier: AGPL-3.0-only
"""
Login expiry reminder (open-source edition).
Periodically checks the expiry status of the local WeChat login credentials
and sends a webhook notification in advance.
"""

import asyncio
import logging
import time
from typing import Optional
from utils.webhook import webhook

logger = logging.getLogger(__name__)


class LoginReminder:
    """Login expiry reminder manager (open-source edition, single-account architecture)."""

    def __init__(self):
        self.check_interval = 6 * 3600  # check every 6 hours
        self.warning_threshold = 24 * 3600  # warn 24 hours ahead
        self.critical_threshold = 6 * 3600  # critical warning 6 hours ahead
        self._running = False
        self._task: Optional[asyncio.Task] = None
        self._last_warning_level = None  # last warning level sent, to avoid repeats

    async def start(self):
        """Start the reminder service."""
        if self._running:
            logger.warning("登录提醒服务已在运行")
            return

        self._running = True
        self._task = asyncio.create_task(self._run())
        logger.info("登录提醒服务已启动,检查间隔: %d 秒", self.check_interval)

    async def stop(self):
        """Stop the reminder service."""
        self._running = False
        if self._task:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
        logger.info("登录提醒服务已停止")

    async def _run(self):
        """Background task loop."""
        while self._running:
            try:
                await self._check_login_status()
            except Exception as e:
                logger.error("检查登录状态失败: %s", e, exc_info=True)

            await asyncio.sleep(self.check_interval)

    async def _check_login_status(self):
        """Check the expiry status of the local login credentials."""
        from utils.auth_manager import auth_manager

        # Fetch the credentials
        creds = auth_manager.get_credentials()
        if not creds or not creds.get("token"):
            logger.debug("无登录凭证,跳过检查")
            return

        expire_time = creds.get("expire_time", 0)
        if expire_time <= 0:
            logger.debug("凭证无过期时间,跳过检查")
            return

        nickname = creds.get("nickname", "未知账号")
        now = int(time.time() * 1000)  # millisecond timestamp
        time_left_ms = expire_time - now
        time_left_sec = time_left_ms / 1000

        # Already expired
        if time_left_sec <= 0:
            if self._last_warning_level != 'expired':
                await self._notify_expired(nickname)
                self._last_warning_level = 'expired'
            return

        # Critical warning (expires within 6 hours)
        if time_left_sec <= self.critical_threshold:
            if self._last_warning_level not in ['critical', 'expired']:
                await self._notify_critical(nickname, time_left_sec)
                self._last_warning_level = 'critical'
            return

        # Regular warning (expires within 24 hours)
        if time_left_sec <= self.warning_threshold:
            if self._last_warning_level not in ['warning', 'critical', 'expired']:
                await self._notify_warning(nickname, time_left_sec)
                self._last_warning_level = 'warning'
            return

        # Status is healthy; reset the warning level
        if self._last_warning_level is not None:
            self._last_warning_level = None
            logger.info("登录状态已恢复正常: %s", nickname)

    async def _notify_warning(self, nickname: str, time_left: float):
        """Send a regular warning notification."""
        hours = time_left / 3600
        logger.warning(
            "登录凭证即将过期 [%s] - 剩余 %.1f 小时",
            nickname, hours
        )

        await webhook.notify('login_expiring_soon', {
            'nickname': nickname,
            'hours_left': round(hours, 1),
            'level': 'warning',
            'message': f'登录凭证将在 {round(hours, 1)} 小时后过期,请及时重新登录',
        })

    async def _notify_critical(self, nickname: str, time_left: float):
        """Send a critical warning notification."""
        hours = time_left / 3600
        logger.error(
            "登录凭证即将过期(紧急)[%s] - 剩余 %.1f 小时",
            nickname, hours
        )

        await webhook.notify('login_expiring_critical', {
            'nickname': nickname,
            'hours_left': round(hours, 1),
            'level': 'critical',
            'message': f'登录凭证将在 {round(hours, 1)} 小时后过期(紧急),请立即重新登录',
        })

    async def _notify_expired(self, nickname: str):
        """Send an already-expired notification."""
        logger.error("登录凭证已过期 [%s]", nickname)

        await webhook.notify('login_expired', {
            'nickname': nickname,
            'message': '登录凭证已过期,API 功能将受限,请重新登录',
        })


# Global singleton
login_reminder = LoginReminder()

|
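The escalation rules inside `_check_login_status` can be read as two pure functions: one that maps the remaining time to a level, and one that decides whether a notification is due. The following is only an illustrative sketch. `classify_time_left`, `should_notify`, and the module-level constants are hypothetical names that reuse the thresholds from the new file; they are not part of the commit.

```python
from typing import Optional

WARNING_THRESHOLD = 24 * 3600   # seconds, same value as warning_threshold above
CRITICAL_THRESHOLD = 6 * 3600   # seconds, same value as critical_threshold above

# Severity ranking used to suppress repeated or lower-level notifications
_SEVERITY = {None: 0, "warning": 1, "critical": 2, "expired": 3}


def classify_time_left(time_left_sec: float) -> Optional[str]:
    """Map the remaining credential lifetime to a warning level, or None if healthy."""
    if time_left_sec <= 0:
        return "expired"
    if time_left_sec <= CRITICAL_THRESHOLD:
        return "critical"
    if time_left_sec <= WARNING_THRESHOLD:
        return "warning"
    return None


def should_notify(level: Optional[str], last_level: Optional[str]) -> bool:
    """Notify only when the level is more severe than the last one that was sent."""
    return level is not None and _SEVERITY[level] > _SEVERITY[last_level]
```

Expressed this way, the three `not in [...]` checks in the class reduce to a single severity comparison, which may make the deduplication behavior easier to unit-test without an event loop or webhook.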
@ -21,6 +21,8 @@ logger = logging.getLogger("webhook")
EVENT_LABELS = {
    "login_success": "登录成功",  # login succeeded
    "login_expired": "登录过期",  # login expired
    "login_expiring_soon": "登录即将过期",  # login expiring soon
    "login_expiring_critical": "登录即将过期(紧急)",  # login expiring soon (urgent)
    "verification_required": "触发验证",  # verification triggered
    "content_fetch_failed": "文章内容获取失败",  # failed to fetch article content
}
