Imagine this scenario: you have an hour-long meeting recording that needs to be transcribed into text. The traditional approach is to open an online transcription service, upload your audio file, wait for processing, and then download the results. Not only is this time-consuming, but it also means your voice data is uploaded to a third-party server.
What if your AI assistant could transcribe speech to text directly on your local machine, and your audio files never leave your computer?
That’s exactly what Whisper MCP aims to solve.
🤔 What is MCP? Why Does It Matter?
MCP (Model Context Protocol) is an open protocol launched by Anthropic, designed to standardize how AI assistants connect with external tools and data sources.
Simply put, MCP is like the “USB interface” for the AI world:
- Before: Each AI assistant had its own plugin system, incompatible with each other
- Now: Tools following the MCP protocol can be used by any AI assistant that supports MCP
This means that once Whisper MCP is configured, any MCP-compatible AI assistant, such as Claude, ChatGPT, or Cursor, can use its speech-to-text capabilities directly.
🚀 Core Features of Whisper MCP
1. Local-First, Privacy-First
What sets Whisper MCP apart is that all processing happens on your own computer.
- Audio files are never uploaded to any server
- No internet connection is required for transcription once the model has been downloaded
- Your voice data is completely under your control
For users handling sensitive meetings, personal recordings, or confidential business audio, this is a crucial security guarantee.
2. Dual Backend Architecture, Cross-Platform Support
No matter what operating system you use, Whisper MCP runs smoothly:
| Platform | Backend Engine | Acceleration |
|---|---|---|
| macOS | whisper.cpp | Metal / CoreML |
| Linux | whisper.cpp / faster-whisper | CUDA |
| Windows | faster-whisper | CUDA / CPU |
On Apple Silicon Macs, the CoreML path can also use the Neural Engine, reaching transcription speeds of over 10x real time.
3. Automatic Hardware Acceleration
No manual configuration needed — Whisper MCP automatically detects and uses the best available hardware acceleration:
- Have an NVIDIA GPU? It automatically uses CUDA acceleration
- Using an Apple Silicon Mac? It automatically enables Metal or CoreML
- Only have a CPU? It still runs smoothly, automatically selecting the optimal approach
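The detection logic above can be sketched roughly as follows. This is a minimal illustration, not Whisper MCP's actual code: the function names are hypothetical, the presence of `nvidia-smi` on the PATH is only a rough proxy for a usable NVIDIA GPU, and the choice of whisper.cpp on a Linux machine without a GPU is an assumption.

```python
import platform
import shutil


def pick_acceleration(system: str, machine: str, has_nvidia: bool) -> tuple[str, str]:
    """Return (backend, accelerator), loosely following the platform table.

    Hypothetical helper for illustration; the real project may decide differently.
    """
    if system == "Darwin":
        # Apple Silicon reports "arm64"; Intel Macs fall back to the CPU here.
        accel = "metal" if machine == "arm64" else "cpu"
        return "whisper.cpp", accel
    if has_nvidia:
        # Linux or Windows with an NVIDIA GPU.
        return "faster-whisper", "cuda"
    if system == "Linux":
        # Assumption: CPU-only Linux uses whisper.cpp.
        return "whisper.cpp", "cpu"
    return "faster-whisper", "cpu"


def detect() -> tuple[str, str]:
    """Probe the current machine and pick a backend."""
    has_nvidia = shutil.which("nvidia-smi") is not None
    return pick_acceleration(platform.system(), platform.machine(), has_nvidia)


if __name__ == "__main__":
    print(detect())
```

Keeping the decision in a pure function separate from the probing makes the fallback order easy to test without the hardware.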
4. High-Quality Transcription Model
By default, Whisper MCP uses OpenAI’s large-v3-turbo model, which offers a strong balance of quality and speed within the Whisper family. Quantized variants trade a little accuracy for a smaller download and faster inference:
| Model | Size | Accuracy | Best For |
|---|---|---|---|
| large-v3-turbo | ~1.6 GB | Highest | Default recommended |
| large-v3-turbo-q8_0 | ~874 MB | High | Balance of speed & quality |
| large-v3-turbo-q5_0 | ~574 MB | Good | Speed-first scenarios |
If your computer has limited resources, you can also choose lighter models like medium or small.
5. Multiple Audio Format Support
No need to manually convert formats — direct support for:
- MP3, WAV, M4A, WEBM
- And many more common audio formats
Built-in ffmpeg integration automatically handles format conversion and audio preprocessing.
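As an illustration of what such preprocessing typically looks like, the sketch below builds an ffmpeg command that converts any input to 16 kHz mono 16-bit PCM WAV, the format Whisper models expect. The helper name and output naming scheme are hypothetical; the exact command Whisper MCP runs internally may differ.

```python
from pathlib import Path


def ffmpeg_to_wav_cmd(src: str, dst_dir: str = ".") -> list[str]:
    """Build an ffmpeg command that converts src to 16 kHz mono PCM WAV.

    Hypothetical helper; only the command list is built, nothing is executed.
    """
    out = Path(dst_dir) / (Path(src).stem + ".16k.wav")
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", src,            # input file in any format ffmpeg can decode
        "-ac", "1",           # downmix to mono
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        str(out),
    ]


if __name__ == "__main__":
    print(" ".join(ffmpeg_to_wav_cmd("meeting.mp3")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed.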
6. Timestamps and Segmented Output
Transcription results include not just text, but also:
- Segment-level timestamps: Know exactly where each sentence appears in the audio
- Word-level timestamps (optional): Precision down to individual words
- Multiple output formats: JSON, plain text, or SRT subtitles
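To show how segment-level timestamps map onto SRT, here is a small sketch that converts a list of segments into subtitle text. It assumes each segment is a dict with `start` and `end` in seconds plus a `text` field; that shape is an assumption for illustration, not the documented output schema of Whisper MCP.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments (assumed keys: start, end, text) as SRT cues."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Welcome, everyone."},
        {"start": 2.5, "end": 6.0, "text": " Let's get started."},
    ]
    print(segments_to_srt(demo))
```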
7. Smart Splitting for Long Audio
Facing a recording over an hour long? The transcribe_with_split tool automatically segments the audio for processing, preventing memory issues while maintaining transcription quality.
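One common way to implement this kind of splitting is to cut the audio into fixed-length chunks with a small overlap, so that words at a boundary are not lost. The sketch below computes the chunk boundaries only. The 10-minute chunk length and 5-second overlap are illustrative assumptions, and transcribe_with_split may use a different strategy, such as splitting on silence.

```python
def split_ranges(
    total_s: float, chunk_s: float = 600.0, overlap_s: float = 5.0
) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds covering a recording of total_s seconds.

    Consecutive chunks overlap by overlap_s. Defaults are illustrative only.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    ranges = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        ranges.append((start, end))
        if end >= total_s:
            break
        # Step back slightly so the next chunk re-covers the boundary.
        start = end - overlap_s
    return ranges


if __name__ == "__main__":
    # A 25-minute recording split into 10-minute chunks.
    for start, end in split_ranges(1500.0):
        print(f"{start:7.1f} -> {end:7.1f}")
```

Each chunk can then be transcribed independently, with its timestamps shifted by the chunk's start time before the results are merged.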
🛠️ MCP Tools Overview
Once configured, your AI assistant can directly call the following tools:
transcribe — Transcribe Audio Files
Tell the AI: “Please transcribe this recording: /Users/myname/Documents/meeting.mp3”
The AI will automatically call the transcription tool and return the full text with timestamps.
transcribe_with_split — Long Audio Segmented Transcription
Perfect for podcasts, meeting records, interviews, and other long-form audio content.
get_model_info — View Current Model Information
Learn about the current backend engine, model version, and running device.
check_health — Service Health Check
Confirm that the Whisper MCP service is running normally.
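Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request to the server. The sketch below builds such a message for the transcribe tool. The envelope follows the MCP specification, but the argument names (`file_path`, `language`) are assumptions made for illustration; check the tool's published input schema for the real ones.

```python
import json


def tool_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


if __name__ == "__main__":
    # "file_path" and "language" are hypothetical argument names.
    print(
        tool_call_request(
            1,
            "transcribe",
            {"file_path": "/Users/myname/Documents/meeting.mp3", "language": "en"},
        )
    )
```

In normal use you never write this by hand: the AI assistant generates the call from your natural-language request.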
📦 Quick Start
macOS One-Click Launch
The easiest way is to use the provided launch script:
chmod +x scripts/start_macos.command
open scripts/start_macos.command
This script automatically completes environment checks, dependency installation, model downloads, and service startup.
Windows Global Command
After installing with uv, you can run it from any directory:
uv tool install --editable .
whisper-mcp --check
Configure Claude Desktop
Edit the configuration file (macOS: ~/Library/Application Support/Claude/claude_desktop_config.json; Windows: %APPDATA%\Claude\claude_desktop_config.json):
{
  "mcpServers": {
    "whisper": {
      "command": "/path/to/venv/bin/python",
      "args": ["-m", "whisper_mcp.main"],
      "cwd": "/path/to/whisper-mcp"
    }
  }
}
After restarting Claude Desktop, you can directly ask Claude to transcribe audio for you!
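If you already have other MCP servers configured, take care not to overwrite them when editing the file by hand. The following sketch merges the whisper entry into an existing configuration dictionary. The function name is hypothetical and the paths are the same placeholders used in the configuration above.

```python
import copy
import json


def with_whisper_server(config: dict, python_bin: str, project_dir: str) -> dict:
    """Return a copy of a Claude Desktop config with a "whisper" entry added.

    Existing entries under "mcpServers" are preserved; the input is not mutated.
    """
    updated = copy.deepcopy(config)
    servers = updated.setdefault("mcpServers", {})
    servers["whisper"] = {
        "command": python_bin,
        "args": ["-m", "whisper_mcp.main"],
        "cwd": project_dir,
    }
    return updated


if __name__ == "__main__":
    existing = {"mcpServers": {"other": {"command": "other-server"}}}
    merged = with_whisper_server(
        existing, "/path/to/venv/bin/python", "/path/to/whisper-mcp"
    )
    print(json.dumps(merged, indent=2))
```

To apply it, load the configuration file with `json.load`, pass the result through the function, and write it back with `json.dump`.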
💡 Usage Examples
Example 1: Transcribe Local Audio
“Please help me transcribe this meeting recording: /Users/myname/Documents/meeting_2025.mp3”
Claude will automatically call the transcribe tool and return the complete transcript.
Example 2: Generate Subtitle Files
“Help me convert this video to SRT subtitles: /Users/myname/Documents/video.mp4”
Claude will return the subtitles in standard SRT format, ready to import into video editing software.
Example 3: Transcribe Foreign Language Content
“What does this Japanese podcast say? /Users/myname/Documents/podcast_jp.m4a”
Once the language code is specified (ja for Japanese), Whisper MCP transcribes the content in that language.
🔧 Advanced Configuration
Customize behavior through environment variables or a .env file:
# Set default language
LANGUAGE=zh
# Select model
MODEL_NAME=large-v3-turbo
# Enable GPU acceleration
USE_GPU=true
# CPU thread count
THREADS=8
# Log level
LOG_LEVEL=INFO
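For readers curious how such variables are usually consumed, here is a minimal sketch of a settings loader. The variable names come from the example above, and large-v3-turbo is the default model named earlier; the other fallback values (`auto`, GPU enabled, CPU count, `INFO`) are assumptions, and the project's real loader may parse them differently.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    language: str
    model_name: str
    use_gpu: bool
    threads: int
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from env (defaults to os.environ) with assumed fallbacks."""
    env = os.environ if env is None else env
    truthy = {"1", "true", "yes", "on"}
    return Settings(
        language=env.get("LANGUAGE", "auto"),                 # assumed default
        model_name=env.get("MODEL_NAME", "large-v3-turbo"),
        use_gpu=env.get("USE_GPU", "true").strip().lower() in truthy,
        threads=int(env.get("THREADS", os.cpu_count() or 4)),  # assumed default
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )


if __name__ == "__main__":
    print(load_settings({"LANGUAGE": "zh", "THREADS": "8"}))
```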
🎯 Use Cases
1. Meeting Notes Organization
Directly transcribe meeting recordings into text, quickly generating meeting minutes without manual note-taking.
2. Podcast and Video Subtitles
Automatically generate subtitles for content creators, with SRT export support for direct use in video production.
3. Interview Content Processing
Journalists and researchers can quickly convert interview recordings into text transcripts for easier editing and citation.
4. Personal Voice Notes
Record ideas and inspirations on the go, and let AI help you organize them into structured text notes.
5. Learning Material Processing
Convert audio from online courses and lecture recordings into searchable text for easier review.
🔮 Future Roadmap
- Real-time streaming transcription support
- Speaker diarization
- More output format support
- Model hot-swapping
- Batch audio processing optimization
🤝 Contributing
Whisper MCP is an open-source project, and all forms of contribution are welcome:
- Open issues to report bugs or suggest features
- Submit Pull Requests to improve code
- Improve documentation and tutorials
- Share with more people who need it
Project URL: https://github.com/bitfarer/whisper-mcp
📄 License
MIT License — You are free to use, modify, and distribute this project.
**Let AI truly “hear