# Whisper Speech Recognition Guide

A complete speech recognition solution built on OpenAI's Whisper model, with support for transcribing Chinese speech to text.

## 🚀 Quick Start

### 1. Install dependencies

```bash
# Install Whisper and its dependencies
pip install -r whisper_requirements.txt

# Or install the packages individually
pip install openai-whisper torch torchaudio
```

### 2. Install FFmpeg (required)

**Ubuntu/Debian:**
```bash
sudo apt update
sudo apt install ffmpeg
```

**CentOS/RHEL:**
```bash
# FFmpeg is not in the default repositories; a third-party repository
# such as RPM Fusion usually needs to be enabled first
sudo yum install ffmpeg
# or
sudo dnf install ffmpeg
```

**macOS:**
```bash
brew install ffmpeg
```

**Windows:**
Download FFmpeg and add the directory containing `ffmpeg.exe` to the `PATH` environment variable.

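Before running a transcription, it can help to confirm that FFmpeg is reachable from Python. The following is a minimal sketch using only the standard library; the function name `ffmpeg_available` is illustrative and not part of this project.

```python
import shutil
import subprocess


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH and runs."""
    path = shutil.which("ffmpeg")
    if path is None:
        return False
    try:
        # `ffmpeg -version` prints build information and exits with code 0.
        subprocess.run([path, "-version"], capture_output=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return False
    return True


if __name__ == "__main__":
    print("FFmpeg found" if ffmpeg_available() else "FFmpeg missing: install it first")
```
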
## 📋 Choosing a Model Size

| Model | Parameters | English accuracy | Multilingual accuracy | Relative speed | Memory required |
|-------|------------|------------------|-----------------------|----------------|-----------------|
| tiny | 39M | ~32% | ~32% | ~32x | ~1GB |
| base | 74M | ~34% | ~34% | ~16x | ~1GB |
| small | 244M | ~36% | ~36% | ~6x | ~2GB |
| medium | 769M | ~40% | ~40% | ~2x | ~5GB |
| large | 1550M | ~45% | ~44% | 1x | ~10GB |
| large-v2 | 1550M | ~47% | ~47% | 1x | ~10GB |
| large-v3 | 1550M | ~51% | ~52% | 1x | ~10GB |

**Recommendations:**
- **Development and testing**: `base` or `small`
- **Production**: `medium` or `large-v3`
- **Limited resources**: `tiny` or `base`

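The memory column can also drive the choice programmatically. As a rough sketch, the helper below picks the largest model whose approximate memory need fits the memory you have; `pick_model` and `MODEL_MEMORY_GB` are hypothetical names, and the figures are simply the approximate values from the table.

```python
# Approximate memory needs in GB, copied from the table above (small -> large).
MODEL_MEMORY_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large": 10,
    "large-v2": 10,
    "large-v3": 10,
}


def pick_model(available_gb: float) -> str:
    """Return the largest model whose approximate memory need fits.

    Falls back to "tiny" when even the smallest model may not fit.
    """
    chosen = "tiny"
    for name, needed_gb in MODEL_MEMORY_GB.items():  # insertion order: small -> large
        if needed_gb <= available_gb:
            chosen = name
    return chosen


print(pick_model(4))   # small
print(pick_model(16))  # large-v3
```

Because the table values are estimates, leaving some headroom (for example, passing 80% of the memory actually free) is probably wise.
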
## 🎯 Usage

### Command line

#### Basic usage
```bash
# Transcribe a single audio file
python whisper_audio_transcribe.py audio.wav

# Choose the model size
python whisper_audio_transcribe.py audio.wav -m medium

# Specify the language
python whisper_audio_transcribe.py audio.wav -l zh

# Batch-process a directory
python whisper_audio_transcribe.py /path/to/audio/directory/
```

#### Advanced usage
```bash
# Generate an SRT subtitle file
python whisper_audio_transcribe.py audio.wav -f srt

# Translate to English
python whisper_audio_transcribe.py audio.wav -t translate

# Detect the language automatically
python whisper_audio_transcribe.py audio.wav -l auto

# Specify the output directory
python whisper_audio_transcribe.py audio.wav -o ./transcripts/
```

### Python usage

```python
from whisper_audio_transcribe import WhisperTranscriber

# Create a transcriber
transcriber = WhisperTranscriber(model_size="base")

# Transcribe an audio file
result = transcriber.transcribe_audio("audio.wav", language="zh")

# Get the full transcript
text = result["text"]
print(f"Transcript: {text}")

# Iterate over the timestamped segments
for segment in result["segments"]:
    print(f"{segment['start']:.2f}s - {segment['end']:.2f}s: {segment['text']}")

# Save the result
transcriber.save_transcript(result, "output.json", format="json")
transcriber.save_transcript(result, "output.srt", format="srt")
```

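`WhisperTranscriber` is specific to this project and its source is not shown here. As a rough sketch of what such a wrapper might do underneath, the following uses the public `openai-whisper` calls `whisper.load_model` and `model.transcribe`. The helper names `decode_options` and `transcribe_file` are hypothetical, and mapping `auto` to `None` is an assumption about how the wrapper behaves.

```python
from typing import Optional


def decode_options(language: str = "auto", task: str = "transcribe") -> dict:
    """Build keyword arguments for `model.transcribe`.

    Whisper detects the language itself when `language` is None, so the
    "auto" value used in this guide is mapped to None (an assumption).
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    lang: Optional[str] = None if language == "auto" else language
    return {"language": lang, "task": task}


def transcribe_file(path: str, model_size: str = "base", language: str = "auto") -> dict:
    """Transcribe one file with openai-whisper and return its result dict."""
    import whisper  # imported lazily so decode_options works without it installed

    model = whisper.load_model(model_size)
    return model.transcribe(path, **decode_options(language))
```

Keeping the option handling in a separate pure function makes it easy to check without loading a model.
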
## 📊 Output Formats

### JSON
```json
{
  "text": "Full transcript text",
  "language": "zh",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.5,
      "text": "First segment text",
      "confidence": -0.23
    }
  ],
  "transcribe_time": 2.34,
  "model_size": "base",
  "timestamp": "2024-06-03T16:30:00",
  "stats": {
    "total_segments": 10,
    "total_duration": 45.6,
    "text_length": 234,
    "words_count": 67
  }
}
```

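One way to use the JSON output is to flag segments that may need manual review. The sketch below assumes that `confidence` is a log-probability-like score where values closer to 0 mean higher confidence; this guide does not define the field, so treat both that reading and the `-1.0` threshold as assumptions. The sample data is made up and only mirrors the structure shown above.

```python
import json

SAMPLE = """
{
  "text": "hello world",
  "language": "en",
  "segments": [
    {"id": 0, "start": 0.0, "end": 3.5, "text": "hello", "confidence": -0.23},
    {"id": 1, "start": 3.5, "end": 7.2, "text": "world", "confidence": -1.40}
  ]
}
"""


def low_confidence_segments(result: dict, threshold: float = -1.0) -> list:
    """Return the segments whose confidence falls below `threshold`."""
    return [s for s in result["segments"] if s["confidence"] < threshold]


result = json.loads(SAMPLE)
for seg in low_confidence_segments(result):
    print(f"review {seg['start']:.1f}s-{seg['end']:.1f}s: {seg['text']}")
```
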
### SRT subtitles
```
1
00:00:00,000 --> 00:00:03,500
First segment text

2
00:00:03,500 --> 00:00:07,200
Second segment text
```

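The SRT layout above follows directly from the `start`, `end`, and `text` fields of each segment. The following sketch shows one way to produce it; it is not the project's `save_transcript` implementation, and the function names are illustrative.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Render segments (dicts with start, end, text) as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):  # SRT cues are numbered from 1
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)  # a blank line separates consecutive cues


segments = [
    {"start": 0.0, "end": 3.5, "text": "First segment text"},
    {"start": 3.5, "end": 7.2, "text": "Second segment text"},
]
print(segments_to_srt(segments))
```
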
## 🎵 Supported Audio Formats

- WAV (.wav)
- MP3 (.mp3)
- M4A (.m4a)
- FLAC (.flac)
- AAC (.aac)
- OGG (.ogg)

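When batch-processing a directory, the extension list above can serve as a filter. A minimal sketch, assuming a non-recursive scan and case-insensitive extensions (the script's actual directory handling is not documented here, and `find_audio_files` is a hypothetical name):

```python
from pathlib import Path

# Extensions from the list above, lower-cased for comparison.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".aac", ".ogg"}


def find_audio_files(directory: str) -> list:
    """Return sorted paths of supported audio files directly inside `directory`."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
```
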
## 🌍 Supported Languages

Whisper supports 99 languages, including:

- **Chinese**: zh
- **English**: en
- **Japanese**: ja
- **Korean**: ko
- **French**: fr
- **German**: de
- **Spanish**: es
- **Russian**: ru
- **Arabic**: ar
- **Auto-detect**: auto

## ⚡ Performance Tips

### 1. GPU acceleration
```python
# Check whether a GPU is available
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")

# Whisper uses the GPU automatically when one is available
```

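If you prefer to choose the device explicitly rather than rely on the default, something like the following could work. `pick_device` is a hypothetical helper; the commented lines use `whisper.load_model(..., device=...)` and the `fp16` option of `model.transcribe` from `openai-whisper`, and are left as comments so the snippet runs without the package installed.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")

# With openai-whisper installed, the device can be passed explicitly:
#   import whisper
#   model = whisper.load_model("base", device=device)
#   result = model.transcribe("audio.wav", fp16=(device == "cuda"))
```

Disabling `fp16` on CPU avoids the warning Whisper prints when half precision is requested on a device that does not support it.
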
### 2. Batch processing
```python
# Transcribe several files in one call
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = transcriber.transcribe_multiple_files(audio_files)
```

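How `transcribe_multiple_files` behaves when one file fails is not described here. If a single bad file should not abort the whole run, a loop along these lines is one option; `transcribe_many` is a hypothetical helper that accepts any callable taking a path and returning a result dict.

```python
from typing import Callable


def transcribe_many(paths: list, transcribe: Callable[[str], dict]) -> tuple:
    """Apply `transcribe` to each path, collecting results and failures.

    Returns (results, errors): two dicts keyed by path.
    """
    results, errors = {}, {}
    for path in paths:
        try:
            results[path] = transcribe(path)
        except Exception as exc:  # keep going; report the failure afterwards
            errors[path] = str(exc)
    return results, errors
```

With the transcriber from the earlier examples, a call might look like `transcribe_many(audio_files, lambda p: transcriber.transcribe_audio(p, language="zh"))`.
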
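### 3. Memory management
```python
import gc

import torch

# For large files or limited memory, use a smaller model
transcriber = WhisperTranscriber(model_size="base")  # rather than "large"

# Free memory once processing is finished: drop the reference first,
# then collect garbage and release cached GPU memory
del transcriber
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```
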
## 🔧 Troubleshooting

### Common issues

1. **ImportError: No module named 'whisper'**
   ```bash
   pip install openai-whisper
   ```

2. **FFmpeg not found**
   ```bash
   # Ubuntu/Debian
   sudo apt install ffmpeg

   # Or use conda
   conda install ffmpeg
   ```

3. **CUDA out of memory**
   ```python
   # Use a smaller model
   transcriber = WhisperTranscriber(model_size="base")

   # Or force CPU execution; set this before torch initializes CUDA,
   # ideally before torch or whisper is imported
   import os
   os.environ["CUDA_VISIBLE_DEVICES"] = ""
   ```

4. **Inaccurate transcription**
   - Try a larger model (`medium`, `large`)
   - Make sure the audio quality is good
   - Specify the correct language code

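For the out-of-memory case, stepping down through model sizes can be automated. The sketch below is one possible approach, not part of this project: `load_with_fallback` is a hypothetical helper, and it assumes the failure surfaces as a `RuntimeError` whose message contains "out of memory", which is how PyTorch reports CUDA allocation failures. Memory can also run out later, during transcription, which this sketch does not cover.

```python
from typing import Callable

FALLBACK_ORDER = ["large-v3", "medium", "small", "base", "tiny"]


def load_with_fallback(load: Callable[[str], object], sizes=FALLBACK_ORDER):
    """Try each model size in turn until one loads without running out of memory.

    Returns (size, model). Errors other than out-of-memory are re-raised.
    """
    for size in sizes:
        try:
            return size, load(size)
        except RuntimeError as exc:
            if "out of memory" not in str(exc).lower():
                raise
    raise RuntimeError("no model size fit in memory")
```

A call might look like `size, transcriber = load_with_fallback(lambda s: WhisperTranscriber(model_size=s))`.
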
### Performance benchmarks

Approximate performance on different hardware:

| Hardware | Model | Time to transcribe 1 minute of audio |
|----------|-------|--------------------------------------|
| CPU (Intel i7) | base | ~30 s |
| CPU (Intel i7) | medium | ~60 s |
| GPU (RTX 3080) | base | ~5 s |
| GPU (RTX 3080) | large-v3 | ~15 s |

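These figures are easier to compare as a real-time factor (processing time divided by audio duration). The short sketch below computes it for the table rows and extrapolates to a one-hour recording, assuming processing time scales linearly with audio length, which is only an approximation.

```python
def real_time_factor(transcribe_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return transcribe_seconds / audio_seconds


# Figures from the table above, each for 60 seconds of audio.
benchmarks = {
    ("CPU (Intel i7)", "base"): 30,
    ("CPU (Intel i7)", "medium"): 60,
    ("GPU (RTX 3080)", "base"): 5,
    ("GPU (RTX 3080)", "large-v3"): 15,
}
for (hardware, model), seconds in benchmarks.items():
    rtf = real_time_factor(seconds, 60)
    # Estimated time for a one-hour recording, assuming linear scaling.
    print(f"{hardware:15s} {model:9s} RTF={rtf:.2f}  1h audio ~ {rtf * 60:.0f} min")
```
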
## 📝 Examples

### Example 1: Transcribe the audio of a Douyin video
```bash
# First extract the audio (using video2audio.py)
python video2audio.py douyin_video.mp4 -o audio_output

# Then transcribe the audio
python whisper_audio_transcribe.py audio_output/douyin_video.wav -m medium -l zh
```

### Example 2: Batch-process and generate subtitles
```bash
# Batch-transcribe and generate SRT subtitles
python whisper_audio_transcribe.py ./audio_files/ -f srt -m large-v3 -o ./subtitles/
```

### Example 3: Automatic language detection
```bash
# Detect the language automatically and transcribe
python whisper_audio_transcribe.py mixed_language.wav -l auto -m medium
```

## 🔗 Integrating into an Existing Project

The Whisper transcription step can be integrated into a video analysis pipeline:

```python
# End-to-end video analysis pipeline
from video2audio import Video2AudioExtractor
from whisper_audio_transcribe import WhisperTranscriber

# 1. Extract the audio
extractor = Video2AudioExtractor()
video_path, audio_path = extractor.extract_audio_from_video("video.mp4")

# 2. Run speech recognition
transcriber = WhisperTranscriber(model_size="medium")
result = transcriber.transcribe_audio(audio_path, language="zh")

# 3. Get the transcript
speech_text = result["text"]
print(f"Spoken content: {speech_text}")
```

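The extraction step relies on the project's `Video2AudioExtractor`, whose implementation is not shown here. Where that module is unavailable, a direct FFmpeg call might stand in for it. The sketch below is an assumption about a reasonable substitute, not a description of what `video2audio.py` does; it writes 16 kHz mono WAV, the format Whisper resamples audio to internally. The function names are hypothetical.

```python
import subprocess


def build_extract_command(video_path: str, audio_path: str) -> list:
    """Build an ffmpeg command that writes 16 kHz mono audio."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it exists
        "-i", video_path,
        "-vn",           # drop the video stream
        "-ac", "1",      # mono
        "-ar", "16000",  # 16 kHz sample rate
        audio_path,
    ]


def extract_audio(video_path: str, audio_path: str) -> str:
    """Run ffmpeg and return the path of the extracted audio file."""
    subprocess.run(build_extract_command(video_path, audio_path), check=True)
    return audio_path
```

Building the command separately from running it keeps the argument list easy to inspect or log before FFmpeg is invoked.
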
## 📄 License

This project is released under the MIT License. Whisper's code and model weights are published by OpenAI, also under the MIT License.