Speech to Text

Transcribe audio and video to text with AI. Supports 99 languages, timestamps, and speaker detection.

How It Works

1. Upload Audio

Upload your audio or video file. We support MP3, WAV, FLAC, OGG, M4A, MP4, and WebM formats up to 100MB.
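The format and size limits above can be checked before uploading. The following is a minimal sketch, not the service's own validation: the function name `check_upload` is hypothetical, and treating "100MB" as 100 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

# Limits taken from the text above. Reading "100MB" as 100 * 1024 * 1024 bytes
# is an assumption; the service may count megabytes differently.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg", ".m4a", ".mp4", ".webm"}
MAX_UPLOAD_BYTES = 100 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_UPLOAD_BYTES}")
    return problems
```

For example, `check_upload("talk.MP3", 5_000_000)` returns an empty list, because the extension check is case-insensitive and the file is under the limit.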

2. AI Transcribes

Our AI models process your audio, detecting language, identifying speakers, and generating accurate text with timestamps.

3. Get Your Text

Copy your transcription or download it as TXT or SRT subtitle format. Edit and refine as needed.
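To illustrate what the SRT download contains, here is a minimal sketch that turns timestamped segments into SRT text. The `(start_seconds, end_seconds, text)` segment shape and both function names are assumptions made for illustration, not the tool's actual export code.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm form used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) segments."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each cue: sequence number, time range, text, then a blank line.
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)
```

Each cue is numbered from 1 and cues are separated by a blank line, which is the layout subtitle players expect from an SRT file.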

Use Cases

Speech to text for every industry and workflow

Meetings & Conferences

Automatically transcribe Zoom, Teams, and Google Meet recordings. Never miss an action item again. Export as meeting notes or subtitles.

Interviews & Journalism

Transcribe interviews for articles, research papers, and documentaries. Speaker diarization identifies who said what for easy attribution.

Podcasts & Media

Generate transcripts and show notes for podcast episodes. Create searchable archives of your audio content. Add subtitles to video podcasts.

Lectures & Education

Convert recorded lectures into study notes. Make educational content accessible with accurate captions. Support students with hearing impairments.

Medical Dictation

Transcribe doctor-patient consultations, clinical notes, and medical dictation. Save hours of manual documentation with AI-powered accuracy.

Legal Proceedings

Transcribe depositions, hearings, and client meetings. Accurate timestamps for legal reference. Export in formats suitable for court documentation.

STT Model Comparison

Whisper

OpenAI's robust speech recognition model supporting 99 languages.

99 languages
Translation
Timestamps
Robust to noise

OpenAI

Faster Whisper

Up to 4x faster than the original Whisper implementation through CTranslate2 optimization, with the same accuracy.

Up to 4x faster
Lower memory
All model sizes
Batch processing
VAD filtering

SYSTRAN

SenseVoice

Speech understanding model with emotion detection and support for 50+ languages.

50+ languages
Emotion detection
Audio events
Speaker analysis
Rich metadata

Alibaba (FunAudioLLM)

Speech-to-Text Plans

Start free, upgrade when you need more

Free

1-minute audio limit
Faster Whisper model
Basic transcription
100+ languages

Frequently Asked Questions

What is speech to text (STT)?

Speech to text (STT), also called automatic speech recognition (ASR), converts spoken language into written text. Our models use AI to accurately transcribe audio from meetings, interviews, podcasts, lectures, and more.

Which transcription model is best?

Faster Whisper is recommended for most use cases: it is up to 4x faster than the original Whisper while maintaining the same accuracy. Use SenseVoice if you need emotion detection or audio event detection alongside transcription.

What audio formats can I upload?

We support MP3, WAV, M4A, OGG, FLAC, WebM, and most common audio/video formats. Maximum file size is 50MB. For larger files, consider splitting the audio first.

Is there a time limit for transcription?

Free users can transcribe up to 5 minutes of audio. Paid plans support audio files up to 2 hours. For longer recordings, use our API with batch processing.

How accurate is the transcription?

Our models achieve 95%+ accuracy on clear English speech. Accuracy varies by language, audio quality, and background noise. Faster Whisper and Whisper support 99 languages with varying accuracy levels.

Does speech to text support speaker diarization?

Yes, our advanced transcription modes can identify and label different speakers in the audio. Speaker diarization is especially useful for meeting transcripts, interviews, and multi-person podcasts where you need to know who said what.
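Diarized output is easier to read once consecutive segments from the same speaker are merged into turns. The sketch below assumes a hypothetical segment shape of `(speaker_label, start_seconds, end_seconds, text)`; the service's real output format may differ.

```python
def merge_speaker_turns(segments):
    """Merge consecutive segments from the same speaker into single turns.

    `segments` is a list of (speaker_label, start_seconds, end_seconds, text)
    tuples, a hypothetical shape for diarized output.
    """
    turns = []
    for speaker, start, end, text in segments:
        if turns and turns[-1][0] == speaker:
            # Same speaker keeps talking: extend the previous turn.
            prev_speaker, prev_start, _, prev_text = turns[-1]
            turns[-1] = (prev_speaker, prev_start, end, f"{prev_text} {text.strip()}")
        else:
            turns.append((speaker, start, end, text.strip()))
    return turns


def format_transcript(turns):
    """Render turns as 'Speaker: text' lines."""
    return "\n".join(f"{speaker}: {text}" for speaker, _, _, text in turns)
```

Merging by turn keeps the "who said what" attribution while avoiding a new speaker label on every short segment.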

Can I get real-time transcription?

Real-time streaming transcription is available through our API using Faster Whisper. Audio is processed in chunks as it arrives, delivering partial transcripts with low latency. This is ideal for live captioning and real-time note-taking.
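Chunked processing generally means regrouping incoming audio into fixed-size pieces before each piece is transcribed. The sketch below shows only that buffering step on raw bytes. It does not call the API, and the function names and the 16 kHz, 16-bit mono defaults are assumptions.

```python
from typing import Iterable, Iterator


def rechunk(stream: Iterable[bytes], chunk_bytes: int) -> Iterator[bytes]:
    """Regroup arbitrarily sized incoming byte blocks into fixed-size chunks.

    The final chunk may be shorter than chunk_bytes.
    """
    buffer = bytearray()
    for block in stream:
        buffer.extend(block)
        while len(buffer) >= chunk_bytes:
            yield bytes(buffer[:chunk_bytes])
            del buffer[:chunk_bytes]
    if buffer:
        yield bytes(buffer)


def chunk_size_bytes(seconds: float, sample_rate: int = 16_000, sample_width: int = 2) -> int:
    """Bytes of mono PCM audio covering `seconds` (16 kHz, 16-bit are assumed defaults)."""
    return int(seconds * sample_rate) * sample_width
```

Smaller chunks lower the delay before a partial transcript can appear, at the cost of giving the model less context per chunk.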

Can I generate subtitles or SRT files?

Yes, our transcription output includes word-level timestamps that can be exported as SRT, VTT, or ASS subtitle files. This is perfect for adding captions to YouTube videos, online courses, and social media content.
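Word-level timestamps usually need to be grouped into readable cues before they become subtitles. The following sketch shows one way to do that, assuming words arrive as `(word, start_seconds, end_seconds)` tuples. The function name and the 42-character and 5-second limits are illustrative defaults, not the tool's settings.

```python
def group_words_into_cues(words, max_chars=42, max_duration=5.0):
    """Group (word, start_seconds, end_seconds) tuples into subtitle cues.

    A new cue starts when adding a word would exceed max_chars or max_duration.
    """
    cues = []
    current_words, cue_start, cue_end = [], None, None
    for word, start, end in words:
        candidate = " ".join(current_words + [word])
        too_long = len(candidate) > max_chars
        too_slow = cue_start is not None and end - cue_start > max_duration
        if current_words and (too_long or too_slow):
            # Close the current cue before starting a new one with this word.
            cues.append((cue_start, cue_end, " ".join(current_words)))
            current_words, cue_start = [], None
        if cue_start is None:
            cue_start = start
        current_words.append(word)
        cue_end = end
    if current_words:
        cues.append((cue_start, cue_end, " ".join(current_words)))
    return cues
```

Each resulting cue carries the start time of its first word and the end time of its last word, so it can be written out in any subtitle format.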

Does the transcription include timestamps?

Yes, all transcription results include segment-level timestamps by default. Word-level timestamps are also available, showing the exact start and end time for each word in the audio.

How does the tool handle background noise?

Faster Whisper is trained on diverse audio and handles moderate background noise well. For very noisy recordings, we recommend running the audio through our Audio Enhancer first to improve clarity before transcription.

Is my audio data kept private?

Yes, uploaded audio files are processed on our secure GPU servers and automatically deleted after transcription is complete. We do not store, share, or use your audio for training purposes. All transfers are encrypted.

How much does speech to text cost?

Free users can transcribe up to 5 minutes of audio at no cost. Paid plans use characters based on audio duration: approximately 1,000 characters per minute of audio. Check our pricing page for detailed plan information and character packs.
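The approximate rate of 1,000 characters per minute makes usage easy to estimate. This is a rough sketch of that arithmetic only: the function names are hypothetical, and rounding up to whole characters is an assumption, since the exact billing rules are on the pricing page.

```python
import math

CHARS_PER_MINUTE = 1_000  # approximate rate quoted above


def estimate_characters(duration_seconds: float) -> int:
    """Estimate characters used for a clip; rounding up is an assumption."""
    return math.ceil(duration_seconds / 60 * CHARS_PER_MINUTE)


def minutes_covered(character_balance: int) -> float:
    """Approximate minutes of audio a character balance would cover."""
    return character_balance / CHARS_PER_MINUTE
```

At this rate a 90-second clip comes to about 1,500 characters, and a balance of 15,000 characters covers roughly 15 minutes of audio.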

Transcribe Audio with AI

Get accurate transcriptions in 99 languages. Sign up free and get 15,000 characters to start.
