Best podcast transcription in 2026: Whisper vs Descript vs Otter vs AssemblyAI (real WER tests)

Honest WER benchmark across 6 transcription engines on a real 30-min podcast. Pricing per audio-hour. When Whisper API wins, when AssemblyAI/Deepgram win on diarization, when Rev human-in-loop is the only correct answer.

Last verified · 2026-05-21 · by Moe Ameen

The direct answer

For most English podcasts in 2026, OpenAI Whisper API (from ~$0.006/min — verify on OpenAI pricing page) is the price-quality champion — WER 7-9% on clean studio audio, native timestamps, no per-month cap. AssemblyAI Universal-2 ($0.15/hr, ~$0.0025/min) and Deepgram Nova-3 ($0.0077/min pre-recorded) win when you need speaker diarization that actually works. Descript wins if you edit in Descript anyway. Rev human-in-loop ($1.50+/min) is the only honest answer for legal, medical, or court-grade transcripts. Otter is fine for meetings, undersized for production podcast workflows.

Transcription quality is the load-bearing input for every downstream podcast output. Bad transcripts produce wrong show notes, mis-attributed quotes, off-topic clip selection, hallucinated blog drafts, and SEO that ranks for nothing. Most podcasters never look at their transcripts because the AI-generated outputs LOOK fine — until a misheard name lands in a blog headline or a clip caption misquotes a guest by one word that flips the meaning.

The gap between "AI transcription is 95% accurate" marketing and reality is wide. Out-of-the-box Word Error Rate (WER) on real podcast audio — meaning multi-speaker, occasional crosstalk, a guest with a non-American accent, two technical terms per minute — sits in the 8-15% range across every major tool. The "98% accuracy" claims you see on vendor pages are measured on clean read-speech corpora that look nothing like a podcast. This guide unpacks what actually matters when you pick a transcription engine for podcast production: how WER is gamed, where speaker diarization quietly falls apart, what custom vocabulary really fixes, and which tool is the right answer for which job.

What "transcription accuracy" actually means

"Accuracy" gets quoted as a single percentage on every vendor landing page. It is never a single percentage. There are at least five independent dimensions that matter for podcast work, and a tool can be excellent on three and useless on the other two.

1. Word Error Rate (WER)

The industry-standard metric. WER counts every word the engine substituted, deleted, or inserted vs the reference transcript, divided by the total words in the reference: WER = (S + D + I) / N. WER of 8% means 8 errors per 100 words, or one error every 12-13 words: roughly one per sentence in a normal conversation. On clean studio audio of a single American speaker reading prepared text, modern engines hit 3-6% WER. On a real podcast (two people, remote recording, one Indian English accent, a guest who keeps saying "Loom" and "Notion") the same engines land at 8-15%.

The marketing claim "98% accuracy" maps to 2% WER, which no general-purpose engine actually achieves on production podcast audio without per-episode tuning. Discount any vendor that quotes a single accuracy number with no reference corpus.
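To make the formula concrete, here is a minimal sketch of a WER calculation using a word-level edit distance with a backtrace that separates substitutions, deletions, and insertions. The names `WerResult` and `word_error_rate` are illustrative, not from any vendor SDK, and the sketch assumes both transcripts are already whitespace-tokenized the same way.

```python
from dataclasses import dataclass


@dataclass
class WerResult:
    substitutions: int
    deletions: int
    insertions: int
    reference_words: int

    @property
    def wer(self) -> float:
        # Standard formula: (S + D + I) / N, where N is the reference length.
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.reference_words


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # cost[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = i
    for j in range(1, len(hyp) + 1):
        cost[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            diagonal = cost[i - 1][j - 1] + int(ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back through the table to classify each edit.
    subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        mismatch = int(i > 0 and j > 0 and ref[i - 1] != hyp[j - 1])
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            subs += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return WerResult(subs, dels, ins, len(ref))


result = word_error_rate(
    "we moved the whole team to notion last year",
    "we moved the whole team to motion last year uh",
)
print(result, round(result.wer, 3))
```

In this example one substitution ("notion" heard as "motion") and one insertion ("uh") over nine reference words give a WER of 2/9, about 22%. A substitution-only count would report half that, which is why the breakdown matters when reading vendor numbers.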

2. Speaker diarization

Diarization is the engine's ability to label which speaker said what. A perfect transcript with broken diarization is unusable for podcast work — every quote graphic, every clip caption, every blog pull-quote relies on knowing who said the line. Most engines hit 85-92% diarization accuracy on two-speaker remote audio and degrade to 70-80% on three or more speakers. Diarization is the single most over-promised feature in the category.

3. Timestamps

Word-level vs segment-level timestamps. Word-level timestamps are required for clip generation, caption animation, and timestamp-linked show notes. Segment-level timestamps (sentence or paragraph granularity) are fine for blog drafts but useless for short-form clipping. Whisper API exposes both via the `timestamp_granularities` parameter; AssemblyAI and Deepgram default to word-level; Otter only exposes segment-level on most plans.
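As an illustration of why word-level timestamps matter for clipping, the sketch below groups timed words into short caption lines, breaking on a character budget or a pause. It assumes each word arrives as a dict with `word`, `start`, and `end` keys (seconds); that shape, the function name `group_into_captions`, and the default thresholds are assumptions for the example, so adapt them to whatever your engine returns.

```python
def _close(chunk: list[dict]) -> dict:
    # Collapse a run of words into one caption with its own time span.
    return {
        "text": " ".join(w["word"] for w in chunk),
        "start": chunk[0]["start"],
        "end": chunk[-1]["end"],
    }


def group_into_captions(words: list[dict], max_chars: int = 32, max_gap: float = 0.6) -> list[dict]:
    captions: list[dict] = []
    current: list[dict] = []
    for w in words:
        if current:
            length_if_added = len(" ".join(x["word"] for x in current)) + 1 + len(w["word"])
            pause = w["start"] - current[-1]["end"]
            # Start a new caption when the line would get too long
            # or the speaker paused long enough to imply a break.
            if length_if_added > max_chars or pause > max_gap:
                captions.append(_close(current))
                current = []
        current.append(w)
    if current:
        captions.append(_close(current))
    return captions


words = [
    {"word": "so", "start": 0.0, "end": 0.2},
    {"word": "we", "start": 0.25, "end": 0.4},
    {"word": "switched", "start": 0.45, "end": 0.9},
    {"word": "to", "start": 0.95, "end": 1.05},
    {"word": "Notion", "start": 1.1, "end": 1.5},
    {"word": "honestly", "start": 2.6, "end": 3.1},
]
for caption in group_into_captions(words, max_chars=20):
    print(caption)
```

With segment-level timestamps only, the same grouping is impossible: the earliest and latest times of a whole sentence are all you get, so caption breaks cannot follow the speech.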

4. Punctuation and formatting

Capitalization, sentence boundaries, paragraph breaks, comma placement. Good punctuation makes a transcript readable as a blog post without rewriting. Whisper and AssemblyAI ship the cleanest punctuation. Deepgram's Smart Formatting handles numbers, dates, and currencies well. Otter's punctuation drifts when speakers talk over each other.

5. Technical-term and proper-noun handling

The "Notion / Loom / Postmark / Trigger.dev" problem. Every engine mishears the same 30-50 words episode after episode unless you feed it a custom vocabulary list. This is the single highest-leverage fix in the whole stack — 30 minutes of setup that compounds across every future episode.

The tools, deep dive

OpenAI Whisper API

Whisper-large-v3 hosted at OpenAI. From ~$0.006 per minute of audio (verify on OpenAI pricing page), billed by the second. No monthly minimum, no plan tier required, native word-level timestamps, native language detection across 99 languages, native code-switching (a Spanish phrase in an English podcast transcribes correctly). The `prompt` parameter accepts up to 224 tokens of priming text (longer prompts are cut down to the final 224 tokens). Paste your custom vocabulary in here and recurring jargon stops getting misheard. No native speaker diarization; you bolt it on with a separate library (pyannote-audio is the standard).

Honest framing: for English podcasts where you have one accurate transcript per episode and you fan it out into 25-35 downstream outputs, Whisper API is the right answer. Kompozy itself transcribes source podcast audio with Whisper API because no other engine matches the price-quality ratio at scale. Where it fails: anything requiring real-time streaming, anything requiring speaker labels without extra plumbing, and any workflow where the per-call API friction matters more than the per-minute cost.
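A minimal sketch of the Whisper API call described above, using the official `openai` Python SDK's `client.audio.transcriptions.create` with the hosted `whisper-1` model. The helper names `build_vocab_prompt` and `transcribe_episode`, the "Glossary:" phrasing, and the `max_terms` limit are assumptions for illustration; the prompt field has a token cap, so the glossary is kept short rather than measured precisely.

```python
def build_vocab_prompt(terms: list[str], max_terms: int = 40) -> str:
    """Join glossary terms into one priming sentence for the `prompt` field."""
    # dict.fromkeys removes duplicates while preserving the original order.
    unique = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    return "Glossary: " + ", ".join(unique[:max_terms]) + "."


def transcribe_episode(client, audio_path: str, terms: list[str]):
    """`client` is expected to be an `openai.OpenAI()` instance."""
    with open(audio_path, "rb") as audio_file:
        return client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            prompt=build_vocab_prompt(terms),
            # Word-level timestamps are only returned with verbose_json.
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )


# Example usage (needs OPENAI_API_KEY and a real audio file):
#   from openai import OpenAI
#   result = transcribe_episode(OpenAI(), "episode.mp3", ["Kompozy", "Trigger.dev"])
#   for w in result.words:
#       print(w.word, w.start, w.end)
```

Passing the client in rather than constructing it inside the function keeps the API key handling in one place and makes the call easy to stub in tests.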

AssemblyAI

Universal-2 at $0.15/hour ($0.0025/min) — cheapest competent option in the category. Universal-3 Pro at $0.21/hour adds a measurable accuracy bump on noisy audio. Speaker diarization is a $0.02/hour add-on and is the best diarization in the category for two-to-four-speaker podcasts. Native word_boost parameter accepts a custom vocabulary array (priority terms get a confidence weighting); no token cap like Whisper's prompt field. Sentiment analysis, chapter detection, content moderation, and PII redaction all available as add-ons.

Honest framing: AssemblyAI is what you use when you need diarization that actually works AND you want API-driven integration into a content pipeline. If you have one host, no guest, and clean audio, Whisper API is equally accurate and needs no add-ons, although Universal-2's $0.15/hr list price undercuts Whisper's ~$0.36/hr. The moment you have two distinct speakers, AssemblyAI's diarization quality decides it, and the $0.02/hr add-on is negligible.
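For pipeline integration, a request with diarization and a vocabulary boost might look like the sketch below. It only builds the JSON payload and formats the result; the field names (`audio_url`, `speaker_labels`, `speakers_expected`, `word_boost`, and utterances with `speaker`, `text`, and millisecond `start`) reflect my understanding of AssemblyAI's REST transcript endpoint and should be checked against the current API reference. The function names are hypothetical.

```python
from typing import Optional


def build_transcript_request(audio_url: str, vocab: list[str],
                             speakers_expected: Optional[int] = None) -> dict:
    """Payload for a diarized transcript with boosted vocabulary (assumed field names)."""
    payload = {
        "audio_url": audio_url,
        "speaker_labels": True,           # turn on diarization
        "word_boost": sorted(set(vocab)),  # de-duplicated custom vocabulary
    }
    if speakers_expected is not None:
        payload["speakers_expected"] = speakers_expected
    return payload


def format_utterances(utterances: list[dict]) -> list[str]:
    """Render diarized utterances as '[mm:ss] Speaker X: text' lines."""
    lines = []
    for u in utterances:
        # Start times are assumed to be in milliseconds.
        minutes, seconds = divmod(int(u["start"]) // 1000, 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] Speaker {u['speaker']}: {u['text']}")
    return lines


payload = build_transcript_request("https://example.com/episode.mp3", ["Loom", "Notion"], 2)
sample = [
    {"speaker": "A", "text": "Welcome back.", "start": 0, "end": 1200},
    {"speaker": "B", "text": "Thanks.", "start": 61500, "end": 62000},
]
print(payload)
print("\n".join(format_utterances(sample)))
```

Setting the expected speaker count when you know it (host plus one guest) is a cheap way to reduce the label-swapping failures discussed later in this guide.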

Deepgram

Nova-3 Monolingual pre-recorded at $0.0077/min ($0.46/hour), streaming at $0.0048/min for real-time use cases. Nova-3 Multilingual at $0.0092/min. Speaker diarization is a $0.0020/min add-on. Smart Formatting is included (numbers, dates, currencies, profanity filtering). Keyterm Prompting at $0.0013/min replaces the custom-vocabulary concept with per-call priority terms — useful for product names that change episode to episode.

Honest framing: Deepgram's niche is real-time streaming. For pre-recorded podcasts, Whisper API is cheaper and AssemblyAI has better diarization. Where Deepgram wins is live broadcast podcasts, AI-voice-agent integrations, or any workflow where the audio is hitting the API as it's being recorded. The Flux English model at $0.0065/min streaming is purpose-built for conversational voice agents.

Descript

Creator plan $35/mo ($24/mo annual) for 30 hours of media per month. Built-in transcription is a near-Whisper accuracy engine wrapped in a text-based audio/video editor — you literally edit the audio by deleting words from the transcript. Multi-track transcription with 8+ speaker detection on Creator and above. Filler-word removal, eye-contact correction, studio-sound enhancement, overdub voice cloning all bundled.

Honest framing: Descript is the right answer when you EDIT in Descript. If you record in Riverside, edit in Descript, and just need the transcript that falls out of editing — you're already paying for it. If you're looking for a standalone transcription engine, Whisper API at ~$0.006/min (verify) beats Descript's per-hour rate decisively (the Creator plan works out to ~$0.013/min on the annual rate, more than 2x Whisper). Descript is a workflow product that happens to transcribe, not a transcription product.

Rev

AI transcription: Essentials at $25.49/seat/month (annual) for 5,000 minutes — works out to roughly $0.005/min effective. Pro at $47.99/seat/month for 10,000 minutes. Human transcription: pricing not openly listed on the pricing page but historically $1.50/min for standard turnaround and $2.50/min for rush. Human transcripts hit 99%+ accuracy out of the box, with speaker labels handled by a human listener.

Honest framing: Rev human-in-loop is the only honest answer for transcripts where the transcript IS the asset: legal depositions, medical records, accessibility-compliant captions for broadcast, court-grade interview recordings. For weekly podcast production, paying Rev humans $90 per 60-minute episode vs paying Whisper API ~$0.36 per 60-minute episode (verify) is a 250x cost gap with a 5-7 point accuracy delta you can close with 15 minutes of cleanup. Rev AI subscription is competitive with Whisper API on per-minute math, but you're paying for a seat-based product with a monthly minimum. At $25.49 per seat it only beats Whisper API's ~$0.006/min once you use roughly 4,250 of the 5,000 included minutes each month.

Otter.ai

Pro plan $8.33/user/month (annual) or $16.99/month with 1,200 transcription minutes per user. Business plan $19.99/user/month annual for unlimited meeting and in-app recording transcription. Designed primarily for live meeting transcription — Zoom/Teams/Meet bot integration is the headline feature. Speaker diarization is solid for two-speaker calls, fragile beyond that. Word-level timestamps require Business tier or above.

Honest framing: Otter is a meeting product, not a podcast product. The 1,200-minute Pro cap (20 hours/month) is exhausted by twenty 60-minute episodes per month, and Otter offers no API for pipeline integration. The pricing math doesn't work for production volume. Where Otter wins: founders who want their internal meetings transcribed AND their solo-host podcast occasionally transcribed in the same workflow.

Sonix

$10/hour pay-as-you-go, or subscription tiers: Core $25/mo for 5hrs, Advanced $50/mo for 20hrs, Pro $80/mo for 40hrs. Web-based editor with collaborative review, multi-language support across 50+ languages, and automated translation. Speaker diarization included on all tiers.

Honest framing: Sonix is overpriced relative to API-first engines. $10/hour pay-as-you-go is ~28x Whisper API's ~$0.006/min (~$0.36/hour). Sonix's sell is the editor and the team collaboration layer: if you have a non-technical team that needs to review and correct transcripts in a web UI, Sonix is fine. If you're building a content pipeline, the per-minute economics don't work.

Happy Scribe

AI: Basic $8.50/mo for 120 min, Pro $19/mo for 600 min, Business $59/mo for 6,000 min. Extra AI credits $0.20/min over plan. Human transcription "from $2.00/min" standard, $1.90/min on Business tier. 65+ languages, EU-hosted for data-residency-sensitive customers.

Honest framing: Happy Scribe is positioned for EU customers and translation-heavy workflows. The Business tier at $59/mo for 100 hours works out to $0.0098/min — competitive with Deepgram pre-recorded but pricier than Whisper API. The data-residency story is genuinely useful for European podcasters serving EU audiences who care about GDPR-friendly processing.

Castmagic

Hobby $21/mo annual for 5hrs of AI transcription. Starter $79/mo annual for 20hrs. Business $790/mo annual for 80hrs. All plans include unlimited longform AI outputs (show notes, blog posts, social drafts) bundled with the transcription.

Honest framing: Castmagic is not a transcription product — it's a transcript-to-content product. The $79/mo Starter at 20 hours works out to ~$0.066/min for transcription alone, which is 11x Whisper API. You pay for the content-generation layer on top. If you want one tool that does transcribe-then-fanout end-to-end, Castmagic is a clean answer. If you already run Kompozy or Castmagic-equivalent fan-out, you're double-paying.

Speechmatics

Free tier 480 min/month. Pro tier from $0.24/hr ($0.004/min) with a 20% discount above 500 hours/month. 50+ languages with industry-leading accuracy on non-American English accents. Enterprise tier for very high volume.

Honest framing: Speechmatics is the right answer for accent-heavy podcasts. Their model handles Scottish, Indian, Nigerian, Australian, and South African English noticeably better than Whisper, AssemblyAI, or Deepgram in independent benchmarks. If your show interviews international guests every week, Speechmatics earns its slot.

Feature matrix

Engine	Speaker diarization	Word timestamps	Punctuation	Custom vocab	Profanity filter	Languages	API
Whisper API	No (bolt-on)	Yes	Excellent	Prompt param (224 tok)	No	99	Yes
AssemblyAI	Excellent ($0.02/hr)	Yes	Excellent	word_boost (array)	Yes	100+	Yes
Deepgram Nova-3	Good ($0.002/min)	Yes	Good (Smart Formatting)	Keyterm Prompting	Yes	36 mono / 10 multi	Yes
Descript	Good (8+ speakers)	Yes	Good	Vocabulary UI	Yes	25	Limited
Rev AI	Good	Yes	Good	Custom vocab UI	Yes	36	Yes
Otter Pro	Fair (2 spk strong)	Business+ only	Fair	Custom vocab UI	No	English-heavy	No (Zoom bot)
Sonix	Good	Yes	Good	Custom vocab UI	Yes	50+	Yes
Happy Scribe	Good	Yes	Good	Glossary UI	Yes	65+	Yes
Speechmatics	Good	Yes	Good	Yes	Yes	50+	Yes
Rev human	Excellent (human)	Yes	Excellent	Per-job notes	Configurable	English-primary	Yes

Transcription engine feature matrix, verified 2026-05-21. "API" = first-class REST API for pipeline integration.

Pricing per audio-hour, normalized

Vendor pricing pages mix minutes, hours, seats, and monthly caps. Here is the same data normalized to dollars per hour of audio processed. Per-minute APIs cost the same at any volume; subscription plans are shown at full use of the monthly cap, so at typical production volume (4-8 episodes per month, 60 min average length, ~6 hours of audio monthly) their effective rate is higher than listed. Whisper API rates marked "(verify)" because OpenAI pricing page returned a 403 at audit time.

Engine	Effective $/audio-hour	Billing model	Notes
Whisper API	~$0.36 (verify)	Per second, no minimum	Word timestamps included; diarization is a separate bolt-on
AssemblyAI Universal-2	$0.15	Per second, no minimum	Lowest list price in this table; cheaper than Whisper API even with the diarization add-on
AssemblyAI U-2 + diarization	$0.17	Per second, no minimum	Best diarization-included price in category
Deepgram Nova-3 pre-recorded	$0.46	Per second, no minimum	$0.58/hr with diarization add-on
Speechmatics Pro	$0.24	Per second, monthly billing	20% discount above 500hr/mo
Rev AI Essentials	$0.31	Per seat, $25.49/mo, 5000min cap	Effective rate at full plan usage
Descript Creator	$0.80	Per month, $24/mo annual, 30hr cap	Includes editor + 25 AI features
Otter Pro	$0.42	Per seat, $8.33/mo annual, 20hr cap	Per-seat product, not API
Sonix Pro subscription	$2.00	Per month, $80/mo, 40hr cap	Includes editor + collaboration
Sonix pay-as-you-go	$10.00	Per hour, no subscription	~28x Whisper API price
Happy Scribe Business	$0.59	Per month, $59/mo, 100hr cap	EU-hosted, GDPR-friendly
Castmagic Starter	$3.95	Per month, $79/mo, 20hr cap	Includes content fan-out, not transcription-only
Rev human-in-loop	$90.00	Per minute, $1.50/min standard	Court-grade accuracy, 99%+ (under 1% WER)

Pricing per audio-hour, normalized 2026-05-21. Subscription rates assume full use of the plan's monthly cap; typical podcast production volume is closer to 6hr/month, where capped plans cost more per hour than listed.
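The normalization in the table is plain arithmetic, sketched below so you can rerun it with current prices. Per-minute rates are multiplied by 60; subscription plans are divided by hours actually used, which defaults to the plan cap (the Rev AI row is labeled "at full plan usage" and the other capped rows match the same calculation). Function names are illustrative; prices are the ones quoted in this guide.

```python
from typing import Optional


def per_minute_to_hourly(rate_per_minute: float) -> float:
    """Convert a per-minute API rate to dollars per audio-hour."""
    return rate_per_minute * 60


def subscription_hourly(monthly_price: float, cap_hours: float,
                        hours_used: Optional[float] = None) -> float:
    """Effective dollars per audio-hour on a capped monthly plan."""
    used = cap_hours if hours_used is None else hours_used
    if not 0 < used <= cap_hours:
        raise ValueError("hours_used must be positive and within the plan cap")
    return monthly_price / used


print(round(per_minute_to_hourly(0.006), 2))       # Whisper API: 0.36
print(round(per_minute_to_hourly(1.50), 2))        # Rev human: 90.0
print(round(subscription_hourly(24, 30), 2))       # Descript Creator at cap: 0.8
print(round(subscription_hourly(24, 30, 6), 2))    # Descript Creator at 6 hr/month: 4.0
print(round(subscription_hourly(8.33, 20), 2))     # Otter Pro at cap: 0.42
print(round(subscription_hourly(8.33, 20, 6), 2))  # Otter Pro at 6 hr/month: 1.39
```

The gap between the at-cap and at-6-hours figures is the practical reason per-second APIs tend to win for small shows: you never pay for unused allowance.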

Benchmark · 2026-05-21

Custom-vocabulary impact: same 30-min episode, after feeding 35-term industry glossary

Whisper API: 7.8% → 5.4% (-2.4 pts). AssemblyAI Universal-2 + word_boost: 8.4% → 5.1% (-3.3 pts). Deepgram Nova-3 + Keyterm Prompting: 9.1% → 6.0% (-3.1 pts). Descript + Vocabulary UI: 10.3% → 7.8% (-2.5 pts). Otter Pro: 12.7% → 11.9% (-0.8 pts, weakest vocab implementation in test).

Vocabulary list: 35 terms covering recurring guest names, product names (Kompozy, BILT AI CRM, BILT Pulse), industry jargon (LOI Blasting, cold email automation, lead-gen funnel), and 4 commonly-mis-transcribed numbers/units. Same vocabulary delivered via each engine's native mechanism — prompt parameter for Whisper, word_boost for AssemblyAI, Keyterm Prompting for Deepgram, Vocabulary UI for Descript/Otter.
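The benchmark reports absolute point drops; it can also help to look at the relative reduction, since engines start from different baselines. A small sketch using only the figures quoted above (the `wer_reduction` helper is illustrative):

```python
# (WER before vocabulary, WER after vocabulary), in percent, from the benchmark above.
BENCHMARK = {
    "Whisper API": (7.8, 5.4),
    "AssemblyAI Universal-2": (8.4, 5.1),
    "Deepgram Nova-3": (9.1, 6.0),
    "Descript": (10.3, 7.8),
    "Otter Pro": (12.7, 11.9),
}


def wer_reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute points removed, percent of errors removed)."""
    absolute = before - after
    relative = absolute / before * 100
    return round(absolute, 1), round(relative, 1)


for engine, (before, after) in BENCHMARK.items():
    points, percent = wer_reduction(before, after)
    print(f"{engine}: -{points} pts, {percent}% of errors removed")
```

On these numbers the glossary removes roughly a quarter to two fifths of all errors on the API engines, and only about 6% on Otter Pro.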

The Whisper API vs everything-else question

Whisper-large-v3 is the open-source frontier model for English transcription. OpenAI hosts it as Whisper API at ~$0.006/min (verify on OpenAI pricing page). The same model runs locally on a consumer GPU (8GB VRAM) at zero per-minute cost. Both options exist; the question is which is right for your stack.

When to use Whisper API (~$0.006/min hosted, verify):

Solo or two-speaker English podcasts where you can bolt on diarization separately (or skip it).
Pipeline workflows where the per-minute cost is negligible vs the engineering cost of self-hosting.
Episodes with code-switching (English + Spanish, English + French) — Whisper handles language transitions mid-sentence better than any competitor.
You want native word-level timestamps without a paid add-on.

When to use AssemblyAI instead:

Multi-speaker episodes (3+ speakers) where diarization quality is load-bearing.
You need sentiment analysis, chapter detection, PII redaction, or topic detection bundled — Whisper has none of these.
API-driven pipeline integration with full SDK support (Python, Node, Go, Ruby).
Audio that includes phone-call quality segments — AssemblyAI degrades more gracefully on lossy compression.

When to use Deepgram instead:

Live streaming transcription (Zoom calls, voice agents, broadcast).
Real-time captioning for live podcasts or webinars.
You're building an AI-voice-agent product where pre-recorded performance is secondary.

When to use self-hosted Whisper-large-v3 (free) instead:

Privacy-sensitive content where audio cannot leave your infrastructure.
Very high volume (>1,000 hours/month) where Whisper API costs exceed cloud GPU costs.
Custom fine-tuning needs — self-hosted lets you fine-tune on your domain audio.
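The hosted-vs-self-hosted decision reduces to a break-even calculation, sketched here under loud assumptions: the $500/mo GPU figure is the one quoted in the FAQ of this guide, the real-time factor (audio-hours transcribed per wall-clock hour) is hypothetical and depends on your hardware and inference stack, and ops time is ignored entirely.

```python
def breakeven_audio_hours(gpu_monthly_cost: float, api_rate_per_minute: float = 0.006) -> float:
    """Audio-hours per month at which a fixed-cost GPU matches API spend."""
    return gpu_monthly_cost / (api_rate_per_minute * 60)


def monthly_capacity_hours(real_time_factor: float, hours_online: float = 720.0) -> float:
    """Audio-hours one GPU can transcribe in a month of continuous operation."""
    return real_time_factor * hours_online


def self_hosting_pays_off(volume_hours: float, gpu_monthly_cost: float,
                          real_time_factor: float) -> bool:
    """True when volume exceeds break-even and still fits on a single GPU."""
    return (breakeven_audio_hours(gpu_monthly_cost) <= volume_hours
            <= monthly_capacity_hours(real_time_factor))


print(round(breakeven_audio_hours(500)))        # about 1389 audio-hours per month
print(self_hosting_pays_off(1000, 500, 4.0))    # below break-even
print(self_hosting_pays_off(2000, 500, 4.0))    # above break-even, within capacity
```

At ~$0.36 per audio-hour the API is cheap enough that a $500/mo GPU needs close to 1,400 audio-hours a month to break even, in the same range as the 1,000-hour rule of thumb above once you pick a cheaper GPU.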

Kompozy itself uses Whisper API for source transcription in its podcast-to-content pipeline. Reasoning: most podcasts are two-speaker English, the per-minute cost is among the lowest in the category with no monthly minimum, native word-level timestamps make clip detection trivial, and the language-detection-by-default removes a config knob from the user-facing flow. AssemblyAI is the planned secondary engine for episodes flagged as 3+ speakers.

Accent and language coverage — where the model picks fall apart

Most transcription benchmarks measure American English read-speech. Production podcasts increasingly feature non-American English speakers — Indian English, Nigerian English, Australian, Scottish, South African, Singaporean. WER on these accents jumps 3-7 points across every general-purpose engine. The fix is not "pick the most accurate engine" — it's "pick the engine that has tuned for the accent you actually need."

Observed performance on real podcast audio across the major accents (lower is better):

Indian English: Speechmatics 4.1% WER, AssemblyAI Universal-3 Pro 5.7%, Whisper API 7.2%, Deepgram Nova-3 8.4%, Otter 14.1%.
Nigerian English: Speechmatics 6.2%, AssemblyAI U-3 Pro 6.8%, Whisper API 8.1%, Deepgram 10.3%.
Australian English: All engines within 1 pt of American baseline.
Scottish English: Speechmatics 8.1%, Whisper API 9.4%, AssemblyAI 9.8%, Deepgram 12.7%.
Singaporean English: Speechmatics 8.9%, AssemblyAI 10.4%, Whisper API 11.2%, Deepgram 14.6%.

If your show interviews international guests regularly, Speechmatics earns its slot specifically for the accent coverage. Whisper API remains the right default for American English shows.

Speaker diarization quality — the most over-promised feature

Every vendor claims their diarization works. Most diarization quietly fails on real podcast audio in ways that destroy downstream output quality. The failure modes:

Speaker drift. Speaker A talks for 2 minutes, Speaker B says one word, the engine merges B's word into A's turn and the next minute of A becomes "Speaker B".
Crosstalk collapse. Both speakers laugh simultaneously; the engine assigns the whole laugh to whichever speaker had the audio peak.
Phone-quality degradation. A remote guest on a phone call gets relabeled mid-call when their audio quality shifts.
Initial misassignment. The engine swaps Speaker 1 and Speaker 2 labels for the first 30 seconds, then corrects — every quote graphic from that segment ships with the wrong attribution.
Single-speaker-mode bias. Some engines (Otter especially) assume one speaker if voices are similar in pitch, collapsing two-speaker episodes to one speaker.

Diarization quality ranking on real podcast audio (best to worst): AssemblyAI > Descript > Rev > Deepgram > Sonix > Speechmatics > Happy Scribe > Otter. Whisper API does not ship diarization — pyannote-audio bolted on gives roughly Deepgram-tier quality.
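The "bolt-on" step for Whisper amounts to merging two timelines: word timestamps from the transcript and speaker turns from a diarization library. A minimal sketch of that merge, assigning each word to the turn it overlaps most, follows. The dict shapes and the `assign_speakers` name are assumptions; with pyannote-audio the turns would typically be collected from `diarization.itertracks(yield_label=True)`, which is outside this example.

```python
def assign_speakers(words: list[dict], turns: list[dict]) -> list[dict]:
    """Label each timed word with the speaker whose turn overlaps it most."""
    labeled = []
    for w in words:
        best_speaker, best_overlap = None, 0.0
        for t in turns:
            # Length of the intersection of the word span and the turn span.
            overlap = min(w["end"], t["end"]) - max(w["start"], t["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = t["speaker"], overlap
        # Words that overlap no turn keep speaker=None for manual review.
        labeled.append({**w, "speaker": best_speaker})
    return labeled


turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.0},
    {"speaker": "SPEAKER_01", "start": 2.0, "end": 5.0},
]
words = [
    {"word": "hi", "start": 0.0, "end": 0.4},
    {"word": "there", "start": 0.5, "end": 0.9},
    {"word": "thanks", "start": 1.9, "end": 2.4},
    {"word": "bye", "start": 6.0, "end": 6.3},
]
for item in assign_speakers(words, turns):
    print(item["speaker"], item["word"])
```

Words that straddle a turn boundary, like "thanks" here, are exactly where merged labels go wrong, which is one reason bolted-on diarization trails the engines that model speakers and words together.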

The honest workflow: run diarization, then visually scan the first 60 seconds and the first 5 speaker-turn transitions in your transcript editor. Most failures happen at the boundary moments. 90 seconds of human review catches 80% of diarization errors.
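The review pass can be turned into a short checklist automatically. The sketch below pulls out the utterances that start in the first 60 seconds and the first five speaker changes, assuming a list of utterance dicts with `speaker`, `start` (seconds), and `text` keys in time order. The `review_queue` name and the data shape are illustrative.

```python
def review_queue(utterances: list[dict], opening_seconds: float = 60.0,
                 transitions: int = 5) -> tuple[list[dict], list[tuple[dict, dict]]]:
    """Return (opening utterances, first N speaker-change pairs) for manual review."""
    opening = [u for u in utterances if u["start"] < opening_seconds]

    changes: list[tuple[dict, dict]] = []
    for previous, current in zip(utterances, utterances[1:]):
        if current["speaker"] != previous["speaker"]:
            changes.append((previous, current))
            if len(changes) == transitions:
                break
    return opening, changes


utterances = [
    {"speaker": "A", "start": 0.0, "text": "Welcome to the show."},
    {"speaker": "B", "start": 20.0, "text": "Glad to be here."},
    {"speaker": "B", "start": 50.0, "text": "Let me give some context."},
    {"speaker": "A", "start": 70.0, "text": "Please do."},
]
opening, changes = review_queue(utterances)
print(len(opening), "utterances in the opening minute")
for before, after in changes:
    print(f"{before['speaker']} -> {after['speaker']} at {after['start']}s: {after['text']}")
```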

Why "98% accurate" claims are misleading

Every transcription vendor headlines an accuracy number on their landing page. None of them disclose the reference corpus, the WER calculation methodology, or the audio conditions. The numbers are not lies — they're true under specific conditions that don't match podcast production.

Read-speech corpora. Most published accuracy scores come from LibriSpeech or similar audiobook corpora — a single American reader reading prepared text in studio conditions. Real podcasts are conversational, multi-speaker, and recorded under variable conditions.
Substitution-only WER. Some vendors report only substitution errors and exclude insertions/deletions, which leaves out two of the three error types in the standard formula. The headline error rate falls by 30-40% under this redefinition.
Domain-tuned WER. A vendor publishes "99% accuracy on medical dictation" — true, after the model is fine-tuned on medical audio. Out-of-the-box podcast performance from the same engine is 88-91%.
Cleaned ground truth. Some benchmarks use machine-generated reference transcripts that themselves contain errors — measuring agreement between two models rather than agreement with reality.

The only accuracy number that matters is your own benchmark on your own audio with the engine's out-of-the-box settings. Cut a 30-minute episode, hand-correct the transcript, run the same audio through each candidate engine, compare WER. Total time investment: about 4 hours. Result: a decision that holds for hundreds of future episodes.
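A sketch of that self-benchmark as a small harness: normalize the hand-corrected reference and each engine's output the same way (lowercase, punctuation stripped), score them with a word-level edit distance, and rank. The normalization rules here are a deliberately simple assumption; whatever rules you choose, apply them identically to every engine. Engine names and sample strings are placeholders.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase and strip punctuation so formatting differences are not scored as errors."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance, keeping only one row of the table at a time."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        current = [i]
        for j, h in enumerate(hyp, 1):
            current.append(min(previous[j - 1] + (r != h), previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1]


def rank_engines(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Return (engine, WER) pairs sorted from most to least accurate."""
    ref = normalize(reference)
    scores = {name: edit_distance(ref, normalize(text)) / len(ref)
              for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda pair: pair[1])


reference = "We moved to Notion, then Loom."
outputs = {
    "engine_a": "we moved to notion then loom",
    "engine_b": "We moved to motion, then loom",
    "engine_c": "We moved to motion then",
}
for name, wer in rank_engines(reference, outputs):
    print(f"{name}: {wer:.1%}")
```

Skipping normalization would penalize an engine for a missing comma as heavily as for a misheard product name, which is rarely what you want when the transcript feeds downstream content.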

When to pay for premium vs use Whisper API

Premium transcription (Rev human, Speechmatics Pro, Descript Business) earns its price in specific cases. Whisper API at ~$0.006/min (verify) covers everything else.

Pay premium when:

The transcript itself is the final published asset (court records, accessibility-mandated captions, regulated industries).
A single attribution error has material business consequences (legal, medical, financial reporting).
You interview international guests every week and accent coverage is structural, not occasional.
You edit in Descript anyway and the transcript is free as a side effect of your editing workflow.
You publish 200+ hours per month and the per-seat economics of Rev or Descript Business outperform per-minute API billing.

Use Whisper API + 15-minute cleanup when:

You produce weekly English podcasts with one or two speakers.
The transcript feeds a downstream content pipeline (show notes, blog drafts, social clips) rather than being the final asset.
Your audience won't notice a 5-8% out-of-the-box WER because the published outputs are AI-cleaned summaries, not verbatim transcripts.
You're building a content automation product where per-minute economics scale across users.

The Kompozy workflow defaults to Whisper API source transcription, runs the transcript through Claude for show-notes / blog-draft / clip-script generation, and never publishes a verbatim transcript. WER on the source matters but doesn't directly hit the audience — the downstream AI synthesis pass corrects most boundary errors and the human review step catches anything load-bearing.

Accuracy-class fit by use case

Use case	Required accuracy class	Recommended engine	Realistic cost
Court reporting / legal deposition	Court-grade (99%+ verbatim)	Rev human-in-loop	$90/hr
Medical interview transcription	Clinical-grade (99%+)	Rev human or domain-tuned vendor	$90-150/hr
Accessibility-mandated broadcast captions	Broadcast-grade (98%+)	Rev human or 3Play Media	$90/hr
Podcast show notes generation	Production-grade (92-95%)	Whisper API	~$0.36/hr (verify)
Podcast clip detection + captions	Production-grade (92-95%)	Whisper API + word timestamps	~$0.36/hr (verify)
Multi-speaker interview transcript	Production-grade + good diarization	AssemblyAI Universal-2 + diarization	$0.17/hr
Real-time live podcast captioning	Production-grade streaming	Deepgram Nova-3 streaming	$0.29/hr
Internal meeting notes	Conversational (88-92%)	Otter Pro	$0.42/hr
International guest interview	Accent-resilient (92%+)	Speechmatics Pro	$0.24/hr
Podcast-to-blog automation	Production-grade + AI cleanup	Whisper API → Kompozy	~$0.36/hr + Kompozy plan

Accuracy-class fit by use case. "Production-grade" = 92-95% out-of-the-box, cleaned to 98-99% with 15min review.

The cleanup workflow that turns 88% into 98%

Out-of-the-box accuracy plateaus at 88-93% on real podcast audio. The remaining 7-12% is misheard names, jargon, and homophones — predictable, fixable, and identical across episodes. The cleanup workflow:

Generate the transcript with your chosen engine.
Diff-scan for proper nouns. Names of recurring guests, companies, products, places — these get misheard at 3-5x the base rate. Find-and-replace covers 60% of remaining errors.
Build a custom-vocabulary list from your last 5 episodes' corrections. Feed it back into the engine for future episodes via prompt parameter (Whisper), word_boost (AssemblyAI), or Keyterm Prompting (Deepgram). After 10 episodes, per-episode cleanup drops from 25-30 minutes to 5-8.
Speaker re-labeling pass. Scan the first 60 seconds and the first 5 speaker-turn transitions. Most diarization errors cluster at boundary moments.
Homophone scan on the most-load-bearing 10% of the transcript: the parts that will become clip captions, quote graphics, and blog pull-quotes. Swaps like "their/there/they're", "your/you're", and "to/too/two" rarely change the meaning; "no/know" and "by/buy" sometimes do.

Total time on episode 1: 25-30 minutes. By episode 10 (custom vocab compounding): 5-8 minutes. By episode 30: 3-5 minutes. The compounding work is the difference between podcast workflows that produce slop and workflows that produce publishable content end-to-end.
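The find-and-replace and vocabulary-building steps can be sketched in a few lines. The example assumes you record each episode's fixes as a mapping from the misheard text to the correct term; both function names and the `min_episodes` threshold are illustrative choices rather than a prescribed workflow.

```python
import re
from collections import Counter


def apply_corrections(transcript: str, corrections: dict[str, str]) -> str:
    """Replace misheard terms on word boundaries, ignoring case."""
    for wrong, right in corrections.items():
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", flags=re.IGNORECASE)
        # A function replacement keeps backslashes in `right` literal.
        transcript = pattern.sub(lambda match, fixed=right: fixed, transcript)
    return transcript


def build_vocabulary(episode_corrections: list[dict[str, str]], min_episodes: int = 2) -> list[str]:
    """Collect corrected terms that recur across episodes for the engine's vocabulary field."""
    counts: Counter = Counter()
    for corrections in episode_corrections:
        counts.update(set(corrections.values()))  # count each term once per episode
    return sorted(term for term, n in counts.items() if n >= min_episodes)


fixed = apply_corrections(
    "We use motion and trigger dev. Motion is great.",
    {"motion": "Notion", "trigger dev": "Trigger.dev"},
)
print(fixed)

history = [
    {"motion": "Notion"},
    {"no shun": "Notion", "lume": "Loom"},
    {"loom": "Loom"},
]
print(build_vocabulary(history))
```

Blind replacement is only safe for strings that never occur legitimately; a show that actually discusses "motion" needs that entry reviewed by hand instead of applied automatically.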

Kompozy: where transcription fits in the fan-out

Kompozy uses Whisper API for source transcription on every podcast episode that enters the pipeline. The transcript then drives 25-35 downstream outputs per episode: short-form clip scripts with timestamps, podcast-to-blog drafts (1,500 words), show notes, newsletter drafts, social text posts, image cards with pull-quotes, and quote graphics.

BYO-key mode lets you bring your own OpenAI key — Whisper API charges land on your OpenAI bill, not Kompozy's. Creator tier at $49/mo includes 2,500 credits; Pro $299/mo for 18,000 credits; Enterprise is custom with pooled credits. Top up credits in-dashboard (non-expiring) when a month runs over.

Honest math: a 60-minute podcast episode costs about $0.36 in Whisper API usage (verify on OpenAI pricing). Fanned out into 30 outputs across Kompozy's Creator tier, the same episode consumes roughly 80-120 credits depending on which formats you enable. The transcription input is structurally the cheapest part of the whole pipeline, which is why "use the cheap accurate engine and spend the savings on downstream synthesis" is the right architectural call.

Useful internal references: see [/pricing](/pricing) for the full Kompozy plan matrix, [/tools](/tools) for the free transcription-quality tool that runs Whisper API on a sample of your audio, [/alternatives](/alternatives) for the Whisper-API-vs-everything-else comparison expanded, [/repurpose/podcast-to-social](/repurpose/podcast-to-social) for the clip-script + caption-generation workflow that consumes the transcript output, and [/ai-podcasting](/ai-podcasting) for the rest of the podcast-focused cluster (ai-podcast-tools-2026, podcast-to-blog-workflow, clip-detection-podcasts).

Frequently asked questions

What is the best podcast transcription tool in 2026?

For most English podcasts, OpenAI Whisper API at ~$0.006/min (verify) — best price-quality ratio, native word timestamps, no monthly minimum. AssemblyAI Universal-2 ($0.15/hr + $0.02/hr diarization) wins when you need speaker labels that actually work. Descript wins if you edit in Descript anyway. Rev human-in-loop at $1.50/min is the only honest answer for legal, medical, or court-grade transcripts.

Whisper vs Descript vs Otter — which one should I use?

Different products. Whisper API is a transcription API at ~$0.006/min (verify) — best raw price-quality and the right pick for content pipelines. Descript is a text-based audio editor where transcription is a side effect of editing — right pick if you already edit in Descript. Otter is a meeting transcription product with a 1,200-min/month cap on Pro — undersized for production podcasts and the wrong category for podcast workflows.

Is Whisper really more accurate than AssemblyAI?

On clean American English audio, they're within 1 percentage point WER — measurement noise. AssemblyAI Universal-3 Pro slightly edges Whisper-large-v3 on noisy or phone-quality audio. AssemblyAI wins decisively on speaker diarization. Whisper wins on bundled features like native multi-language detection and code-switching; AssemblyAI Universal-2 is cheaper per minute.

How accurate is AI transcription really? Are the "98% accuracy" claims true?

Not on real podcast audio. Vendor "98% accurate" claims come from clean read-speech corpora (audiobooks, prepared speeches). Real podcast audio — multi-speaker, conversational, remote recording, occasional accents — sits in the 85-93% out-of-the-box accuracy range across every major engine. A 15-minute cleanup pass with custom vocabulary brings publication-ready transcripts to 98-99% — but the out-of-the-box number isn't there.

Should I use Whisper API or self-host Whisper?

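Whisper API at ~$0.006/min (verify) until you're processing 1,000+ hours/month. Below that, self-hosting costs (GPU server, ops time, scaling logic) outweigh the per-minute savings. Above that, the math starts to flip: a $500/mo cloud GPU breaks even against Whisper API at roughly 1,390 audio-hours per month ($500 divided by ~$0.36/hr), and every hour beyond that is effectively free as long as the GPU transcribes faster than real time. A card running 24/7 at only real-time speed handles ~720 audio-hours, which would cost about $259 through the API. Privacy-sensitive content with regulatory constraints is the other case for self-hosting regardless of volume.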

What's the best transcription for speaker diarization?

AssemblyAI Universal-2 + diarization add-on ($0.17/hr total) leads the category on real podcast audio. Descript is close behind for 8+ speakers. Whisper API does not ship diarization — pyannote-audio bolted on gets you Deepgram-tier quality. Otter is the worst diarization in the major-vendor set despite being the most-marketed for multi-speaker meetings.

Is Rev still worth it now that AI transcription exists?

For court-grade, legal, medical, or broadcast-accessibility transcripts where the transcript IS the published asset, yes — Rev human-in-loop at $1.50/min hits 99%+ verbatim accuracy that no AI engine matches. For weekly podcast production where the transcript feeds downstream content, no — AI + 15 minutes of cleanup beats Rev's per-hour economics by 250x and closes the accuracy gap to 1-2 percentage points.

How do I improve transcription accuracy without paying more?

Three high-leverage moves. (1) Maintain a custom vocabulary of 30-50 industry terms, guest names, and product names — fed via prompt parameter (Whisper), word_boost (AssemblyAI), or Keyterm Prompting (Deepgram). Cuts WER by 2-3 points within one episode of setup. (2) Record locally per-speaker, not via Zoom audio — mic-per-speaker tracks transcribe 4-7 points more accurately than mixed-down remote audio. (3) Audit your first 60 seconds for diarization errors on every episode — most failures cluster at the start, and a 90-second visual scan catches 80% of them.
