For researchers • Verbatim output with timestamps

Audio to Text — Free Online Transcription with AI

Convert audio recordings to verbatim text instantly. Choose Whisper AI for free, 100% local processing with no API key and no upload to any server, or Anthropic API for the fastest processing and highest accuracy across 99+ languages (audio is sent to Anthropic's servers). Both engines produce timestamped, speaker-labeled verbatim transcripts downloadable as TXT or DOCX.

Verbatim + Timestamps · Speaker Labels · TXT & DOCX Export · AI-Powered
Step 1

Choose Your Transcription Engine

Both engines produce the same transcript formats. The difference is where your audio is processed, and how fast and accurate the result is.

✦ Recommended

Whisper AI

Runs entirely in your browser · OpenAI model
100% Free · No API Key · 100% Offline · Privacy: Max
✔ Advantages
Completely free — no API key, no account needed
Audio never leaves your device (100% local)
Works offline after model is downloaded once
No usage limit — transcribe as many files as you want
Model cached in browser for instant future use
⚠ Limitations
First-time model download required (39–244 MB)
Slower processing — runs on your CPU/GPU
Less accurate on noisy audio or heavy accents
Requires a modern browser (Chrome 88+ recommended)
⚡ Fastest & Most Accurate

Anthropic API

Cloud-powered · Claude AI model
API Key Required · Cloud-Based · Highest Accuracy · No Download
✔ Advantages
Fastest transcription — no waiting for model download
Best accuracy for accents, noisy audio, multiple speakers
No setup — start transcribing immediately with your key
Handles very long files with high consistency
Your API key is never stored on Alfreto's servers
⚠ Limitations
Requires an Anthropic API key (free tier available)
Audio is sent to Anthropic's servers for processing
Usage costs apply beyond free tier credits
Requires internet connection at all times
💡 Not sure which to pick? Start with Whisper AI if your data is sensitive (e.g. confidential interviews) or if you have no API key. Switch to Anthropic API when you need faster results or are dealing with difficult audio (heavy accents, background noise, overlapping speakers).

Side-by-Side Comparison

All features compared
Feature | Whisper AI | Anthropic API
Cost | 100% free forever | Free tier + paid per use
API Key Required | No | Yes (Anthropic account)
Audio Privacy | Stays on your device | Sent to Anthropic servers
Works Offline | Yes (after first download) | No (requires internet)
Setup Time | Model download once (39–244 MB) | Instant (no download)
Transcription Speed | Slower (runs on your CPU) | Fast (cloud processing)
Accuracy (clear audio) | Excellent | Excellent
Accuracy (noisy / accented) | Good (Small model) | Best
File Size Limit | Unlimited (auto-chunked) | Up to 25 MB per chunk
Languages Supported | 8 languages | 99+ languages
Output Formats | Verbatim, Clean, SRT | Verbatim, Clean, SRT
Speaker Labels | Yes (manual naming) | Yes (manual naming)
Download TXT / DOCX | Yes | Yes
Best For | Sensitive data, no-cost use | Speed, difficult audio, research volume

How to Convert Audio to Text Online

Two engines, one purpose — free and accurate transcription

🎙️ What This Tool Does

Alfreto's Audio to Text tool converts spoken audio recordings into written text — complete with timestamps, speaker labels, and verbatim accuracy. It is designed for anyone who needs a transcript: researchers conducting qualitative interviews, journalists recording source conversations, students transcribing lectures, podcasters creating show notes, or professionals generating meeting minutes.

The tool offers two distinct transcription engines. Whisper AI processes your audio entirely inside your browser using OpenAI's open-source speech recognition model, so nothing is uploaded to any server. Anthropic API uses Claude's multimodal AI via the cloud for faster processing and support for 99+ languages; it requires an Anthropic API key (free tier available).

🔒 Privacy for sensitive recordings
When you use Whisper AI mode, your audio file is processed locally in your browser and never transmitted to any server. This makes it suitable for ethically sensitive interviews, confidential recordings, and research data that must remain under your control at all times.

⚙️ How Transcription Works

Both engines follow the same general pipeline, but the processing location differs significantly:

  • Your audio file is read from your device. At this point, no data has been transmitted anywhere — the file is only in your browser's memory.
  • Whisper AI path: OpenAI's Whisper model runs via WebAssembly directly in your browser tab. Audio is chunked into overlapping segments and decoded using automatic speech recognition (ASR) on your own device's CPU.
  • Anthropic API path: Audio segments are sent securely to Claude's API. The model applies multimodal understanding to transcribe speech with high accuracy, then streams the result back to your browser.
  • Both engines produce timestamped segments with speaker labels. You can name speakers manually (e.g., "Interviewer", "Respondent") before or after transcription.
  • The final transcript is displayed in an editable output box and can be exported as .txt or .docx — formatted with timestamps, speaker turns, and paragraph breaks.
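
The chunking step in the pipeline above can be sketched as follows. This is a minimal illustration, not Alfreto's actual code: the function name `chunk_bounds` and the 5-second overlap are assumptions, while the 30-second window matches the input length the Whisper model works on.

```python
from typing import List, Tuple


def chunk_bounds(
    total_s: float,
    chunk_s: float = 30.0,    # Whisper's native input window
    overlap_s: float = 5.0,   # assumed overlap; the page does not state a value
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for overlapping chunks."""
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")
    step = chunk_s - overlap_s  # how far each new chunk advances
    bounds: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:      # the last chunk reached the end of the audio
            break
        start += step
    return bounds


if __name__ == "__main__":
    for start, end in chunk_bounds(70):
        print(f"{start:6.1f}s to {end:6.1f}s")
```

Each chunk repeats the last few seconds of the previous one, so a word cut off at a boundary appears whole in at least one chunk. Merging the per-chunk transcripts then needs to drop the duplicated text in the overlap, which this sketch does not attempt.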

Frequently Asked Questions

What is the difference between Whisper AI and Anthropic API modes?

Whisper AI runs entirely in your browser, so your audio is never uploaded anywhere. It requires a one-time model download (39–244 MB depending on model size) and works offline after that. It supports 8 languages and is the best choice when privacy is critical. Anthropic API sends audio to Claude's cloud servers for faster, more accurate transcription in 99+ languages. It requires an Anthropic API key (free tier available) and an internet connection during transcription.

Which mode should I choose for my use case?

Choose Whisper AI if: you are transcribing sensitive or confidential recordings; you want to work offline; you do not want to create any external accounts; or your audio is in one of the 8 supported languages and is reasonably clear. Choose Anthropic API if: you need faster transcription of long files; your audio has heavy accents, background noise, or multiple overlapping speakers; you need support for more than 8 languages; or you are processing high volumes of audio regularly.
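
The guidance above can be written as a small decision rule. This is only a sketch of the advice in this answer: the `Job` fields and the `choose_engine` function are hypothetical names, and letting privacy take precedence over speed is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a transcription job (illustrative fields)."""
    sensitive: bool = False         # confidential interviews, research data
    has_api_key: bool = False       # an Anthropic API key is available
    online: bool = True             # internet connection during transcription
    whisper_language: bool = True   # audio is in one of the 8 Whisper-mode languages
    difficult_audio: bool = False   # heavy accents, noise, overlapping speakers
    needs_speed: bool = False       # long files or high volume


def choose_engine(job: Job) -> str:
    """Return "whisper" or "anthropic" following the FAQ guidance."""
    # Cloud mode needs a key and a connection, and is ruled out for sensitive audio.
    cloud_allowed = job.has_api_key and job.online and not job.sensitive
    if not job.whisper_language:
        if not cloud_allowed:
            raise ValueError("language needs Anthropic API mode, "
                             "but cloud processing is not available")
        return "anthropic"
    if cloud_allowed and (job.difficult_audio or job.needs_speed):
        return "anthropic"
    return "whisper"


if __name__ == "__main__":
    print(choose_engine(Job(sensitive=True, has_api_key=True)))
    print(choose_engine(Job(has_api_key=True, difficult_audio=True)))
```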

What audio formats are supported?

Both engines support MP3, WAV, M4A, OGG, FLAC, OPUS, WEBM, AAC, and MP4. For best accuracy, use clear recordings at 16 kHz or higher sample rate with minimal background noise. Very long recordings are automatically split into chunks and reassembled into a single continuous transcript.
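
If you prepare files in bulk, a quick pre-check against the list above can save a failed upload. The sketch below only looks at the file extension and a sample rate you supply yourself; the function names are hypothetical and it does not inspect the audio data.

```python
from pathlib import Path

# Extensions taken from the format list in this answer.
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".ogg", ".flac", ".opus", ".webm", ".aac", ".mp4",
}
RECOMMENDED_MIN_SAMPLE_RATE_HZ = 16_000


def is_supported(filename: str) -> bool:
    """True if the file extension is one of the listed formats (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def sample_rate_ok(sample_rate_hz: int) -> bool:
    """True if the sample rate meets the 16 kHz recommendation."""
    return sample_rate_hz >= RECOMMENDED_MIN_SAMPLE_RATE_HZ


if __name__ == "__main__":
    for name in ["Interview_01.MP3", "lecture.flac", "notes.txt"]:
        print(name, is_supported(name))
```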

What output formats are available?

Three output formats are available: Verbatim + Timestamps ([MM:SS] Speaker: text) for research coding and analysis; Clean text only with no timestamps for easy reading and editing; and SRT subtitles for use with video players and editing software. All formats can be downloaded as TXT or DOCX files.
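
To make the three formats concrete, here is a minimal sketch that renders the same segments each way. The `Segment` type and function names are hypothetical, the Clean format is assumed to drop speaker labels as well as timestamps, and the SRT layout follows the common SubRip convention (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float   # seconds from the beginning of the recording
    end: float
    speaker: str
    text: str


def mmss(seconds: float) -> str:
    """Format seconds as MM:SS (minutes keep growing past 99 for long files)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def srt_time(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_verbatim(segments: List[Segment]) -> str:
    return "\n".join(f"[{mmss(s.start)}] {s.speaker}: {s.text}" for s in segments)


def to_clean(segments: List[Segment]) -> str:
    # Assumption: "clean" means text only, one segment per line.
    return "\n".join(s.text for s in segments)


def to_srt(segments: List[Segment]) -> str:
    blocks = [
        f"{i}\n{srt_time(s.start)} --> {srt_time(s.end)}\n{s.text}"
        for i, s in enumerate(segments, start=1)
    ]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    demo = [
        Segment(5.0, 9.2, "Interviewer", "Can you describe your role?"),
        Segment(9.2, 14.0, "Respondent", "Sure, I manage the field team."),
    ]
    print(to_verbatim(demo), to_clean(demo), to_srt(demo), sep="\n\n")
```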

Is there a file size or duration limit?

There is no hard limit. Long recordings are automatically split into overlapping segments and merged into one continuous transcript. The Whisper AI engine can handle files of any size as long as your device has sufficient memory. The Anthropic API engine supports audio up to 25 MB per chunk and can process recordings of several hours in total.
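
As a rough worked example of what 25 MB per chunk means in practice, the sketch below estimates how much uncompressed audio fits in one chunk. It assumes 16 kHz, 16-bit mono PCM and binary megabytes; compressed formats such as MP3 fit far more audio per chunk, and the calculation ignores the overlap between segments.

```python
import math

MAX_CHUNK_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" read as binary megabytes


def max_chunk_seconds(
    sample_rate_hz: int = 16_000,
    bytes_per_sample: int = 2,   # 16-bit PCM
    channels: int = 1,
    limit_bytes: int = MAX_CHUNK_BYTES,
) -> float:
    """Seconds of uncompressed PCM audio that fit within the byte limit."""
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return limit_bytes / bytes_per_second


def chunks_needed(duration_s: float, chunk_s: float) -> int:
    """Number of non-overlapping chunks needed to cover the recording."""
    return math.ceil(duration_s / chunk_s)


if __name__ == "__main__":
    per_chunk = max_chunk_seconds()
    print(f"{per_chunk:.1f} s per chunk")            # about 13.7 minutes
    print(chunks_needed(2 * 60 * 60, per_chunk), "chunks for a 2-hour file")
```

Under these assumptions one chunk holds about 819 seconds of audio, so a two-hour recording needs 9 chunks.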

Can I add speaker names to the transcript?

Yes. Before transcribing, set the number of speakers and enter their names (e.g., "Interviewer", "Dr. Smith", "Respondent 1"). The tool uses these labels throughout the transcript. You can also edit the transcript directly in the output box after transcription is complete.
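
If you export a transcript with placeholder labels and want to rename speakers afterwards outside the tool, a small script is enough. The sketch below assumes the Verbatim line layout `[MM:SS] Speaker: text`; the default labels "Speaker 1" and "Speaker 2" and the function name are hypothetical.

```python
import re
from typing import Dict

# Timestamp, then a speaker label without colons, then ": " and the text.
LINE = re.compile(r"^(\[\d{2,}:\d{2}\] )([^:]+)(: .*)$")


def rename_speakers(transcript: str, names: Dict[str, str]) -> str:
    """Replace speaker labels at the start of each line using the names map."""
    out = []
    for line in transcript.splitlines():
        match = LINE.match(line)
        if match and match.group(2) in names:
            line = match.group(1) + names[match.group(2)] + match.group(3)
        out.append(line)   # lines that do not match are kept unchanged
    return "\n".join(out)


if __name__ == "__main__":
    raw = "[00:05] Speaker 1: Hello.\n[00:09] Speaker 2: Hi there."
    print(rename_speakers(raw, {"Speaker 1": "Interviewer", "Speaker 2": "Dr. Smith"}))
```

Only the label at the start of a line is replaced, so a name that is merely mentioned inside someone's speech is left alone.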
