# Hotspot Data Module

> Multi-platform hot-topic data collection and management

## 1. Module Structure

```
domain/hotspot/
├── __init__.py              # Module entry point
├── models.py                # Data models (HotTopic, HotTopicSource)
├── manager.py               # Hotspot manager (caching, aggregation)
└── crawlers/                # Crawlers
    ├── base.py              # Crawler base class
    ├── weibo.py             # Weibo hot search (needs optimization)
    ├── baidu.py             # Baidu hot search ✅ (includes the travel board)
    ├── bing.py              # Bing search suggestions ✅
    ├── calendar.py          # Holiday calendar ✅
    ├── xiaohongshu.py       # Xiaohongshu trending ✅
    └── mediacrawler/        # MediaCrawler integration
        ├── __init__.py
        └── xhs_crawler.py   # Xiaohongshu real-time crawler

libs/MediaCrawler/           # MediaCrawler project (submodule)
api/routers/hotspot.py       # API routes
```

## 2. API Endpoints

### 2.1 Get All Hot Topics

```bash
GET /api/v2/hotspot/all?force=false
```

### 2.2 Get by Platform

```bash
# Weibo hot search (pending optimization)
GET /api/v2/hotspot/weibo?limit=50

# Baidu hot search
GET /api/v2/hotspot/baidu?limit=50

# Xiaohongshu trending (culture and tourism related)
GET /api/v2/hotspot/xiaohongshu?limit=20

# Holiday calendar
GET /api/v2/hotspot/calendar?days=30
```

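As a rough illustration of how a client might address the per-platform endpoints above, the sketch below builds request URLs with the standard library only. The base URL `http://localhost:8000` and the helper name `hotspot_url` are assumptions made for this example; they are not part of the module.

```python
from urllib.parse import urlencode

# Assumed base URL for a local deployment; adjust to your environment.
BASE_URL = "http://localhost:8000"


def hotspot_url(endpoint: str, **params) -> str:
    """Build a URL for /api/v2/hotspot/<endpoint> with optional query parameters."""
    url = f"{BASE_URL}/api/v2/hotspot/{endpoint}"
    if params:
        url += "?" + urlencode(params)
    return url


print(hotspot_url("baidu", limit=50))
print(hotspot_url("calendar", days=30))
```

The resulting strings can be passed to any HTTP client; the response format is not documented here, so parsing is left out.
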
### 2.3 Aggregated Queries

```bash
# Travel-related hot topics (all sources)
GET /api/v2/hotspot/travel?limit=10

# Trending topics (deduplicated and merged)
GET /api/v2/hotspot/trending?limit=20
```

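How `/trending` deduplicates and merges topics is not specified here. One plausible approach, shown purely as a sketch, is to group topics by a normalized title and keep the entry with the highest heat. The `Topic` tuple and the `merge_topics` function are hypothetical stand-ins, not the module's real implementation.

```python
from typing import Dict, List, NamedTuple, Optional


class Topic(NamedTuple):
    """Hypothetical, simplified stand-in for HotTopic."""
    title: str
    source: str
    heat: Optional[int] = None


def merge_topics(topics: List[Topic], limit: int = 20) -> List[Topic]:
    """Deduplicate by normalized title, keep the hottest entry, sort by heat."""
    best: Dict[str, Topic] = {}
    for topic in topics:
        key = "".join(topic.title.split()).lower()  # ignore whitespace and case
        current = best.get(key)
        if current is None or (topic.heat or 0) > (current.heat or 0):
            best[key] = topic
    ranked = sorted(best.values(), key=lambda t: t.heat or 0, reverse=True)
    return ranked[:limit]


merged = merge_topics([
    Topic("哈尔滨 冰雪大世界", "weibo", 900),
    Topic("哈尔滨冰雪大世界", "baidu", 1200),
    Topic("三亚旅游", "xhs"),
])
print([(t.title, t.source) for t in merged])
```

Topics without a heat value are treated as heat 0, so they sort last. Heat scales probably differ between platforms, so a real implementation may need to normalize them before comparing.
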
### 2.4 Custom Hot Topics

```bash
# List
GET /api/v2/hotspot/custom

# Add
POST /api/v2/hotspot/custom
{
  "title": "冬季温泉推荐",
  "tags": ["温泉", "冬季", "度假"],
  "category": "travel"
}

# Delete
DELETE /api/v2/hotspot/custom/{title}
```

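Because the title is part of the DELETE path, a non-ASCII title such as the one above has to be percent-encoded by the client. The sketch below builds (but does not send) both requests with the standard library. The base URL is an assumption, and the response format is not documented here.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumed local deployment

payload = {
    "title": "冬季温泉推荐",
    "tags": ["温泉", "冬季", "度假"],
    "category": "travel",
}

# POST: JSON body as documented above.
add_request = Request(
    f"{BASE_URL}/api/v2/hotspot/custom",
    data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# DELETE: percent-encode the title before placing it in the path.
delete_request = Request(
    f"{BASE_URL}/api/v2/hotspot/custom/{quote(payload['title'], safe='')}",
    method="DELETE",
)

print(add_request.get_method(), add_request.full_url)
print(delete_request.get_method(), delete_request.full_url)
```

Either request can be passed to `urllib.request.urlopen()` to actually send it.
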
## 3. Data Models

### HotTopic

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# HotTopicSource and HotTopicCategory are defined elsewhere in this module (not shown here).


@dataclass
class HotTopic:
    title: str                        # Topic title
    source: HotTopicSource            # Source (weibo/baidu/xhs/calendar/custom)
    rank: Optional[int]               # Rank
    heat: Optional[int]               # Heat score
    category: HotTopicCategory        # Category (travel/food/festival/trending/...; full list below)
    url: Optional[str]                # Original link
    description: Optional[str]        # Description
    tags: List[str]                   # Tags
    fetched_at: datetime              # Time fetched
    expires_at: Optional[datetime]    # Expiry time
```

### HotTopicCategory

- `travel` - Travel
- `food` - Food
- `festival` - Festivals and solar terms
- `event` - Major events
- `trending` - Trending topics
- `season` - Seasonal
- `other` - Other

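The enum definition itself is not shown in this document. Assuming `HotTopicCategory` is a string-valued `Enum` whose values match the list above, it might look like the sketch below. The member names and the `parse_category` helper are guesses for illustration.

```python
from enum import Enum


class HotTopicCategory(str, Enum):
    """Assumed definition; values are taken from the category list above."""
    TRAVEL = "travel"
    FOOD = "food"
    FESTIVAL = "festival"
    EVENT = "event"
    TRENDING = "trending"
    SEASON = "season"
    OTHER = "other"


def parse_category(raw: str) -> HotTopicCategory:
    """Map an arbitrary string to a category, falling back to OTHER."""
    try:
        return HotTopicCategory(raw.strip().lower())
    except ValueError:
        return HotTopicCategory.OTHER


print(parse_category("Travel").value)
print(parse_category("unknown").value)
```

Falling back to `other` keeps the API tolerant of unexpected input, for example a mistyped `category` in a custom hot topic.
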
## 4. Cache Strategy

| Source | Cache TTL | Notes |
|--------|-----------|-------|
| Weibo | 5 minutes | Highly time-sensitive |
| Baidu | 10 minutes | Highly time-sensitive |
| Xiaohongshu | 30 minutes | Preset topics |
| Calendar | 1 hour | Static data |
| Custom | 24 hours | Manually managed |

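The table above can be expressed as a per-source TTL lookup. The sketch below shows one way a cache could apply it. `CACHE_TTL`, `TTLCache`, and the source keys are illustrative names, not the actual `manager.py` implementation.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# TTLs in seconds, mirroring the table above (the keys are assumed names).
CACHE_TTL: Dict[str, int] = {
    "weibo": 5 * 60,
    "baidu": 10 * 60,
    "xiaohongshu": 30 * 60,
    "calendar": 60 * 60,
    "custom": 24 * 60 * 60,
}


class TTLCache:
    """Minimal in-memory cache with a per-source time-to-live."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, source: str, value: Any) -> None:
        expires_at = self._clock() + CACHE_TTL[source]
        self._entries[source] = (expires_at, value)

    def get(self, source: str, force: bool = False) -> Optional[Any]:
        """Return the cached value, or None if missing, expired, or force=True."""
        entry = self._entries.get(source)
        if force or entry is None or self._clock() >= entry[0]:
            return None
        return entry[1]


cache = TTLCache()
cache.set("baidu", ["topic A", "topic B"])
print(cache.get("baidu"))              # fresh entry is returned
print(cache.get("baidu", force=True))  # force bypasses the cache
```

The `force` flag mirrors the `force` query parameter of `/api/v2/hotspot/all`. The injectable `clock` exists only to make expiry testable without sleeping.
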
## 5. Baidu Multi-Board Support

The Baidu crawler supports multiple boards:

```python
# Assumes BaiduCrawler is exported from the same package as XiaohongshuCrawler.
from domain.hotspot.crawlers import BaiduCrawler

# Default: real-time hot topics + travel board
crawler = BaiduCrawler()  # tabs=['realtime', 'travel']

# Custom boards
crawler = BaiduCrawler(tabs=['realtime', 'travel', 'movie'])
```

**Supported boards**:
- `realtime` - Real-time hot topics (50 items)
- `travel` - Travel board (30 items) ⭐
- `movie` - Movies
- `teleplay` - TV series
- `novel` - Novels
- `car` - Cars
- `game` - Games

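How `BaiduCrawler` validates `tabs` is not documented. A defensive caller could check board names against the list above before constructing the crawler. `SUPPORTED_TABS`, `DEFAULT_TABS`, and `normalize_tabs` below are hypothetical helpers written for this example.

```python
from typing import Iterable, List

# Board names taken from the list above.
SUPPORTED_TABS = ("realtime", "travel", "movie", "teleplay", "novel", "car", "game")
DEFAULT_TABS = ("realtime", "travel")


def normalize_tabs(tabs: Iterable[str] = DEFAULT_TABS) -> List[str]:
    """Drop duplicates (keeping order) and reject unknown board names."""
    result: List[str] = []
    for tab in tabs:
        if tab not in SUPPORTED_TABS:
            raise ValueError(f"unsupported Baidu board: {tab!r}")
        if tab not in result:
            result.append(tab)
    return result


print(normalize_tabs(["travel", "movie", "travel"]))
```

The returned list could then be passed as `BaiduCrawler(tabs=...)`.
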
## 6. Xiaohongshu Module Notes

Real-time data is fetched through the MediaCrawler project, which requires a QR-code login.

### 6.1 Using the Crawler

```python
import asyncio

from domain.hotspot.crawlers import XiaohongshuCrawler


async def main():
    crawler = XiaohongshuCrawler()

    # QR-code login (required the first time; the cookie is cached afterwards)
    await crawler.login()

    # Fetch trending topics
    topics = await crawler.fetch()

    # Search notes
    notes = await crawler.search_notes("旅游攻略", page_size=20)

    # Get note details
    detail = await crawler.get_note_detail("note_id")


asyncio.run(main())
```

### 6.2 Using the Bridge Directly

```python
import asyncio

from domain.hotspot.crawlers.mediacrawler import get_xhs_bridge


async def main():
    bridge = get_xhs_bridge()

    # Log in
    await bridge.login()

    # Search notes
    notes = await bridge.search_notes("旅游攻略", page_size=20)

    # Get note details
    detail = await bridge.get_note_detail("note_id")


asyncio.run(main())
```

### 6.3 MediaCrawler Project

Location: `libs/MediaCrawler/`

Source: https://github.com/NanmiCoder/MediaCrawler

Supported platforms: Xiaohongshu, Douyin, Weibo, Bilibili, Kuaishou, Zhihu, Tieba

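### 6.4 Search Keywords

By default, the following culture- and tourism-related keywords are searched:

```python
SEARCH_KEYWORDS = [
    # travel guides, where to go this weekend, family trip recommendations, self-drive routes
    "旅游攻略", "周末去哪玩", "亲子游推荐", "自驾游路线",
    # viral check-in spots, lesser-known attractions, hotel recommendations, homestay recommendations
    "网红打卡地", "小众景点", "酒店推荐", "民宿推荐",
    # winter travel, skiing guides, hot-spring getaways
    "冬季旅行", "滑雪攻略", "温泉度假",
]
```

Keywords can be customized:

```python
# Sanya travel, Harbin ice and snow
crawler = XiaohongshuCrawler(keywords=["三亚旅游", "哈尔滨冰雪"])
```
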
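How the crawler iterates over these keywords is not described. If each keyword is searched independently, the calls could be issued concurrently with a cap on parallelism. The sketch below uses a stand-in `fake_search` coroutine in place of `crawler.search_notes`, and `search_all` is a hypothetical helper.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

SearchFn = Callable[[str], Awaitable[List[dict]]]


async def search_all(
    keywords: List[str], search: SearchFn, max_concurrency: int = 3
) -> Dict[str, List[dict]]:
    """Run `search` for every keyword, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(keyword: str) -> List[dict]:
        async with semaphore:
            return await search(keyword)

    results = await asyncio.gather(*(run_one(k) for k in keywords))
    return dict(zip(keywords, results))


async def fake_search(keyword: str) -> List[dict]:
    """Stand-in for crawler.search_notes; returns one dummy note."""
    await asyncio.sleep(0)
    return [{"keyword": keyword, "title": f"note about {keyword}"}]


notes_by_keyword = asyncio.run(search_all(["旅游攻略", "温泉度假"], fake_search))
print(list(notes_by_keyword))
```

A low `max_concurrency` is a deliberate choice: a logged-in session that fires many searches at once is more likely to be rate-limited.
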
## 7. Usage Examples

### Calling from Python

```python
from domain.hotspot import get_hotspot_manager


async def get_travel_hotspots():
    manager = get_hotspot_manager()

    # Fetch travel-related hot topics
    topics = await manager.get_travel_topics(limit=10)

    for topic in topics:
        print(f"[{topic.source.value}] {topic.title}")

    return topics
```

### Content Generation Integration

```python
# Use hot topics during content generation
from domain.hotspot import get_hotspot_manager


async def generate_with_hotspots():
    manager = get_hotspot_manager()

    # Fetch hot topics
    topics = await manager.get_travel_topics(limit=5)
    hot_topics = [t.title for t in topics]

    # Parameters to pass to the content generation engine
    params = {
        "subject": {...},
        "hot_topics": {
            "events": hot_topics[:2],
            "festivals": ["元旦", "圣诞"],  # New Year's Day, Christmas
            "trending": hot_topics[2:],
        }
    }
    return params
```

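The example above splits topics by position. If topics carry a meaningful `category`, grouping by it may give cleaner buckets. The sketch below shows that alternative with plain `(title, category)` tuples. `bucket_topics` and the category-to-bucket mapping are assumptions, not part of the module.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed mapping from category value to the hot_topics key it feeds.
BUCKET_FOR_CATEGORY = {
    "event": "events",
    "festival": "festivals",
}


def bucket_topics(topics: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (title, category) pairs into events / festivals / trending."""
    buckets: Dict[str, List[str]] = {"events": [], "festivals": [], "trending": []}
    for title, category in topics:
        buckets[BUCKET_FOR_CATEGORY.get(category, "trending")].append(title)
    return buckets


print(bucket_topics([
    ("哈尔滨冰雪节开幕", "event"),
    ("元旦", "festival"),
    ("冬季温泉推荐", "travel"),
]))
```

Any category without an explicit mapping falls into `trending`, so the dictionary always has the three keys the engine parameters expect.
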
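## 8. To Do

1. **Weibo crawler**: requires a cookie or Playwright
2. **Douyin hot search**: to be added
3. **MediaCrawler integration**: fetch real-time Xiaohongshu data
4. **Scheduled updates**: refresh the cache periodically in the background
5. **Persistence**: store hot-topic data in a database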
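
For the scheduled-updates item, one possible shape of a background refresher is an `asyncio` task that calls a refresh coroutine at a fixed interval. This is only a sketch: `refresh_periodically` is hypothetical, and `fake_refresh` stands in for whatever refresh method the manager would expose.

```python
import asyncio
from typing import Awaitable, Callable


async def refresh_periodically(
    refresh: Callable[[], Awaitable[None]], interval: float
) -> None:
    """Call `refresh` forever, sleeping `interval` seconds between runs."""
    while True:
        try:
            await refresh()
        except Exception as exc:  # keep the loop alive if one refresh fails
            print(f"refresh failed: {exc}")
        await asyncio.sleep(interval)


async def main() -> int:
    calls = 0

    async def fake_refresh() -> None:
        nonlocal calls
        calls += 1

    task = asyncio.create_task(refresh_periodically(fake_refresh, interval=0.01))
    await asyncio.sleep(0.05)  # let the refresher run a few times
    task.cancel()              # stop it on shutdown
    try:
        await task
    except asyncio.CancelledError:
        pass
    return calls


refresh_count = asyncio.run(main())
print(refresh_count >= 1)
```

In a web application the task would typically be started on startup and cancelled on shutdown, with the interval chosen per source to match the cache TTLs.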