# Whisper Speech Recognition Model Guide

This is a complete speech recognition solution built on the OpenAI Whisper model. It supports Chinese speech-to-text.

## 🚀 Quick Start

### 1. Install dependencies

```bash
# Install Whisper and its dependencies
pip install -r whisper_requirements.txt

# Or install the packages individually
pip install openai-whisper torch torchaudio
```

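### 2. Install FFmpeg (required)

**Ubuntu/Debian:**
```bash
sudo apt update
sudo apt install ffmpeg
```

**CentOS/RHEL:**
```bash
# ffmpeg is not in the default repositories; enable a repository
# that provides it (for example RPM Fusion) first
sudo yum install ffmpeg
# or, on systems that use dnf
sudo dnf install ffmpeg
```

**macOS:**
```bash
brew install ffmpeg
```

**Windows:**
Download FFmpeg and add the directory containing `ffmpeg.exe` to the PATH environment variable.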
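
Because every platform installs FFmpeg differently, it can help to confirm that the `ffmpeg` executable is actually reachable before running a transcription. The following is a minimal sketch using only the Python standard library; the function name `ffmpeg_version` is hypothetical and not part of this project.

```python
import shutil
import subprocess


def ffmpeg_version():
    """Return the first line of `ffmpeg -version`, or None if ffmpeg is unavailable."""
    exe = shutil.which("ffmpeg")  # searches the directories listed in PATH
    if exe is None:
        return None
    proc = subprocess.run([exe, "-version"], capture_output=True, text=True, check=False)
    lines = proc.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(ffmpeg_version() or "ffmpeg not found on PATH")
```

If this prints "ffmpeg not found on PATH", revisit the installation step for your platform.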

## 📋 Choosing a Model Size

| Model | Parameters | English accuracy | Multilingual accuracy | Relative speed | Memory required |
|-------|------------|------------------|-----------------------|----------------|-----------------|
| tiny | 39M | ~32% | ~32% | ~32x | ~1GB |
| base | 74M | ~34% | ~34% | ~16x | ~1GB |
| small | 244M | ~36% | ~36% | ~6x | ~2GB |
| medium | 769M | ~40% | ~40% | ~2x | ~5GB |
| large | 1550M | ~45% | ~44% | 1x | ~10GB |
| large-v2 | 1550M | ~47% | ~47% | 1x | ~10GB |
| large-v3 | 1550M | ~51% | ~52% | 1x | ~10GB |

**Recommendations:**
- **Development and testing**: `base` or `small`
- **Production**: `medium` or `large-v3`
- **Limited resources**: `tiny` or `base`
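
The memory column can also drive the choice programmatically. The sketch below picks the largest model whose approximate memory requirement fits a given budget. The figures are copied from the table above; the helper name `pick_model` is hypothetical and not part of `whisper_audio_transcribe.py`.

```python
# Approximate memory requirements in GB, copied from the table above.
# Ordered from smallest to largest model.
MODEL_MEMORY_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large-v3": 10,
}


def pick_model(available_gb: float) -> str:
    """Return the largest listed model whose memory requirement fits the budget."""
    fitting = [name for name, need in MODEL_MEMORY_GB.items() if need <= available_gb]
    if not fitting:
        raise ValueError(f"No model fits in {available_gb} GB")
    return fitting[-1]  # dicts keep insertion order, so the last match is the largest


if __name__ == "__main__":
    for budget in (1, 4, 16):
        print(f"{budget} GB -> {pick_model(budget)}")
```

The table values are rough, so leaving some headroom below the real memory limit is a reasonable precaution.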

## 🎯 Usage

### Command line

#### Basic usage
```bash
# Transcribe a single audio file
python whisper_audio_transcribe.py audio.wav

# Choose the model size
python whisper_audio_transcribe.py audio.wav -m medium

# Specify the language
python whisper_audio_transcribe.py audio.wav -l zh

# Process every file in a directory
python whisper_audio_transcribe.py /path/to/audio/directory/
```

#### Advanced usage
```bash
# Generate an SRT subtitle file
python whisper_audio_transcribe.py audio.wav -f srt

# Translate into English
python whisper_audio_transcribe.py audio.wav -t translate

# Detect the language automatically
python whisper_audio_transcribe.py audio.wav -l auto

# Specify the output directory
python whisper_audio_transcribe.py audio.wav -o ./transcripts/
```

### From Python code

```python
from whisper_audio_transcribe import WhisperTranscriber

# Create the transcriber
transcriber = WhisperTranscriber(model_size="base")

# Transcribe an audio file
result = transcriber.transcribe_audio("audio.wav", language="zh")

# Get the full transcript text
text = result["text"]
print(f"Transcript: {text}")

# Iterate over the timed segments
for segment in result["segments"]:
    print(f"{segment['start']:.2f}s - {segment['end']:.2f}s: {segment['text']}")

# Save the results
transcriber.save_transcript(result, "output.json", format="json")
transcriber.save_transcript(result, "output.srt", format="srt")
```

## 📊 Output Formats

### JSON format
```json
{
  "text": "Full transcript text",
  "language": "zh",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.5,
      "text": "Text of the first segment",
      "confidence": -0.23
    }
  ],
  "transcribe_time": 2.34,
  "model_size": "base",
  "timestamp": "2024-06-03T16:30:00",
  "stats": {
    "total_segments": 10,
    "total_duration": 45.6,
    "text_length": 234,
    "words_count": 67
  }
}
```
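
The JSON output lends itself to simple post-processing. The sketch below flags segments that may need manual review. It assumes that `confidence` holds an average log-probability (the negative sample value suggests this, but the source does not say so), and it uses `-1.0` as an illustrative cut-off. The function name `low_confidence_segments` and the sample data are hypothetical.

```python
import json

# Hypothetical sample shaped like the JSON output shown above.
SAMPLE = """
{
  "text": "first second",
  "segments": [
    {"id": 0, "start": 0.0, "end": 3.5, "text": "first", "confidence": -0.23},
    {"id": 1, "start": 3.5, "end": 7.2, "text": "second", "confidence": -1.40}
  ]
}
"""


def low_confidence_segments(result, threshold=-1.0):
    """Return segments whose confidence is below the threshold.

    Assumption: `confidence` is an average log-probability, so values
    closer to zero mean higher confidence.
    """
    return [seg for seg in result["segments"] if seg.get("confidence", 0.0) < threshold]


result = json.loads(SAMPLE)
for seg in low_confidence_segments(result):
    print(f"Review segment {seg['id']} ({seg['start']:.1f}s - {seg['end']:.1f}s): {seg['text']}")
```

To use it on real output, replace `json.loads(SAMPLE)` with `json.load` on a file saved by the transcriber.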

### SRT subtitle format
```
1
00:00:00,000 --> 00:00:03,500
Text of the first segment

2
00:00:03,500 --> 00:00:07,200
Text of the second segment
```
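
For reference, segments with `start`, `end`, and `text` fields map onto this layout in a few lines. This is a minimal sketch, not the project's `save_transcript` implementation; the helper names `srt_timestamp` and `segments_to_srt` are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm, the form SRT expects."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text: numbered cues separated by a blank line."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 3.5, "text": "Text of the first segment"},
        {"start": 3.5, "end": 7.2, "text": "Text of the second segment"},
    ]
    print(segments_to_srt(demo))
```

Working in whole milliseconds avoids floating-point drift when splitting the time into hours, minutes, and seconds.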

## 🎵 Supported Audio Formats

- WAV (.wav)
- MP3 (.mp3)
- M4A (.m4a)
- FLAC (.flac)
- AAC (.aac)
- OGG (.ogg)

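## 🌍 Supported Languages

Whisper supports 99 languages. The most commonly used codes are:

- **Chinese**: `zh`
- **English**: `en`
- **Japanese**: `ja`
- **Korean**: `ko`
- **French**: `fr`
- **German**: `de`
- **Spanish**: `es`
- **Russian**: `ru`
- **Arabic**: `ar`
- **Auto-detect**: `auto`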
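
Note that `auto` is an option value rather than a language code. In the upstream `whisper` package, language detection runs when the `language` argument is `None`. How this project's script translates `auto` is not shown here, so the following is only a sketch of one plausible mapping; `normalize_language` and `KNOWN_CODES` are hypothetical names, and the code set covers only the languages listed above.

```python
# Only the codes listed above; Whisper itself accepts many more.
KNOWN_CODES = {"zh", "en", "ja", "ko", "fr", "de", "es", "ru", "ar"}


def normalize_language(code):
    """Map a user-supplied language option to a value for a `language` argument.

    Returns None for "auto" (or None), meaning "let the model detect the language".
    """
    if code is None:
        return None
    code = code.strip().lower()
    if code == "auto":
        return None
    if code not in KNOWN_CODES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return code


if __name__ == "__main__":
    for option in ("zh", "AUTO", " en "):
        print(repr(option), "->", normalize_language(option))
```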

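## ⚡ Performance Tips

### 1. GPU acceleration
```python
# Check whether a GPU is available
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")

# Whisper uses the GPU automatically when one is available
```

### 2. Batch processing
```python
# Process several files in one call
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = transcriber.transcribe_multiple_files(audio_files)
```

### 3. Memory management
```python
# For large files, use a smaller model
transcriber = WhisperTranscriber(model_size="base")  # instead of large

# Free memory when done: drop the reference first, then collect
import gc
import torch

del transcriber
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # return cached GPU memory to the driver
```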

## 🔧 Troubleshooting

### Common problems

1. **ImportError: No module named 'whisper'**
   ```bash
   pip install openai-whisper
   ```

2. **FFmpeg not found**
   ```bash
   # Ubuntu/Debian
   sudo apt install ffmpeg

   # Or install it with conda
   conda install ffmpeg
   ```

3. **CUDA out of memory**
   ```python
   # Use a smaller model
   transcriber = WhisperTranscriber(model_size="base")

   # Or force CPU; set this before torch or whisper is first imported
   import os
   os.environ["CUDA_VISIBLE_DEVICES"] = ""
   ```

4. **Inaccurate transcription**
   - Try a larger model (`medium`, `large`)
   - Make sure the audio quality is good
   - Specify the correct language code

### Performance benchmarks

Approximate performance on different hardware:

| Hardware | Model | Time to transcribe 1 minute of audio |
|----------|-------|--------------------------------------|
| CPU (Intel i7) | base | ~30 s |
| CPU (Intel i7) | medium | ~60 s |
| GPU (RTX 3080) | base | ~5 s |
| GPU (RTX 3080) | large-v3 | ~15 s |
## 📝 Examples

### Example 1: Transcribe the audio of a Douyin video
```bash
# First extract the audio (using video2audio.py)
python video2audio.py douyin_video.mp4 -o audio_output

# Then transcribe the audio
python whisper_audio_transcribe.py audio_output/douyin_video.wav -m medium -l zh
```

### Example 2: Batch processing with subtitle output
```bash
# Transcribe a directory and generate SRT subtitles
python whisper_audio_transcribe.py ./audio_files/ -f srt -m large-v3 -o ./subtitles/
```

### Example 3: Language detection
```bash
# Detect the language automatically, then transcribe
python whisper_audio_transcribe.py mixed_language.wav -l auto -m medium
```

## 🔗 Integrating into an Existing Project

Whisper transcription can be integrated into a video analysis pipeline:

```python
# Complete video analysis pipeline
from video2audio import Video2AudioExtractor
from whisper_audio_transcribe import WhisperTranscriber

# 1. Extract the audio
extractor = Video2AudioExtractor()
video_path, audio_path = extractor.extract_audio_from_video("video.mp4")

# 2. Run speech recognition
transcriber = WhisperTranscriber(model_size="medium")
result = transcriber.transcribe_audio(audio_path, language="zh")

# 3. Get the transcript text
speech_text = result["text"]
print(f"Spoken content: {speech_text}")
```

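## 📄 License

This project is released under the MIT License. The Whisper code and model weights are published by OpenAI, also under the MIT License.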