Is my audio uploaded to a server?

In Whisper AI mode, no: all processing happens locally in your browser and no audio is transmitted. In Anthropic API mode, audio is sent to Anthropic's servers for transcription and is handled under Anthropic's privacy policy.

For researchers • Verbatim output with timestamps

Audio to Text — Free Online Transcription with AI

Convert audio recordings to verbatim text for free. Choose Whisper AI for 100% local processing with no upload and no API key, or Anthropic API for faster, more accurate cloud transcription across 99+ languages. Both engines produce timestamped, speaker-labeled verbatim transcripts downloadable as TXT or DOCX.

Verbatim + Timestamps · Speaker Labels · TXT & DOCX Export · AI-Powered

Step 1

Choose Your Transcription Engine

Both engines produce the same verbatim output format. The main difference is where your audio is processed, which also affects speed and accuracy on difficult audio.

✦ Recommended

Whisper AI

Runs entirely in your browser · OpenAI model

100% Free · No API Key · Offline After Download · Privacy: Max

✔ Advantages

Completely free — no API key, no account needed

Audio never leaves your device (100% local)

Works offline after model is downloaded once

No usage limit — transcribe as many files as you want

Model cached in browser for instant future use

⚠ Limitations

First-time model download required (39–244 MB)

Slower processing — runs on your CPU/GPU

Less accurate on noisy audio or heavy accents

Requires a modern browser (Chrome 88+ recommended)

⚡ Fastest & Most Accurate

Anthropic API

Cloud-powered · Claude AI model

API Key Required · Cloud-Based · Highest Accuracy · No Download

✔ Advantages

Fastest transcription — no waiting for model download

Best accuracy for accents, noisy audio, multiple speakers

No setup — start transcribing immediately with your key

Handles very long files with high consistency

Your API key is never stored on Alfreto's servers

⚠ Limitations

Requires an Anthropic API key (free tier available)

Audio is sent to Anthropic's servers for processing

Usage costs apply beyond free tier credits

Requires internet connection at all times

💡 Not sure which to pick? Start with Whisper AI if your data is sensitive (e.g. confidential interviews) or if you have no API key. Switch to Anthropic API when you need faster results or are dealing with difficult audio (heavy accents, background noise, overlapping speakers).

Side-by-Side Comparison

All features compared

Feature	Whisper AI	Anthropic API
Cost	100% Free forever	Free tier + paid per use
API Key Required	No	Yes (Anthropic account)
Audio Privacy	Stays on your device	Sent to Anthropic servers
Works Offline	Yes (after first download)	No — requires internet
Setup Time	Model download once (39–244 MB)	Instant (no download)
Transcription Speed	Slower (runs on your CPU)	Fast (cloud processing)
Accuracy — Clear audio	Excellent	Excellent
Accuracy — Noisy / Accented	Good (Small model)	Best
File Size Limit	No fixed limit (auto-chunked, bounded by device memory)	Up to 25 MB per chunk
Languages Supported	8 languages	99+ languages
Output Formats	Verbatim, Clean, SRT	Verbatim, Clean, SRT
Speaker Labels	Yes (manual naming)	Yes (manual naming)
Download TXT / DOCX	Yes	Yes
Best For	Sensitive data, no-cost use	Speed, difficult audio, research volume

How to Convert Audio to Text Online

Two engines, one purpose — free and accurate transcription

🎙️ What This Tool Does

Alfreto's Audio to Text tool converts spoken audio recordings into written text — complete with timestamps, speaker labels, and verbatim accuracy. It is designed for anyone who needs a transcript: researchers conducting qualitative interviews, journalists recording source conversations, students transcribing lectures, podcasters creating show notes, or professionals generating meeting minutes.

The tool offers two distinct transcription engines. Whisper AI processes your audio entirely inside your browser using OpenAI's open-source speech recognition model — nothing is uploaded to any server. Anthropic API uses Claude's multimodal AI via the cloud for faster processing and support for over 99 languages, requiring a free Anthropic API key.

🔒 Privacy for sensitive recordings

When you use Whisper AI mode, your audio file is processed locally in your browser and never transmitted to any server. This makes it suitable for ethically sensitive interviews, confidential recordings, and research data that must remain under your control at all times.

⚙️ How Transcription Works

Both engines follow the same general pipeline, but the processing location differs significantly:

1. Your audio file is read from your device. At this point, no data has been transmitted anywhere; the file is only in your browser's memory.
2. Whisper AI path: OpenAI's Whisper model runs via WebAssembly directly in your browser tab. Audio is chunked into overlapping segments and decoded using automatic speech recognition (ASR) on your own device's CPU.
3. Anthropic API path: Audio segments are sent securely to Claude's API. The model applies multimodal understanding to transcribe speech with high accuracy, then streams the result back to your browser.
4. Both engines produce timestamped segments with speaker labels. You can name speakers manually (e.g., "Interviewer", "Respondent") before or after transcription.
5. The final transcript is displayed in an editable output box and can be exported as .txt or .docx, formatted with timestamps, speaker turns, and paragraph breaks.
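
To make the "overlapping segments" idea in the steps above concrete, here is a minimal sketch of how chunk boundaries could be planned. It is an illustration only, not the tool's actual code: the function name `plan_chunks`, the 30-second window, and the 5-second overlap are all assumptions chosen for the example.

```python
def plan_chunks(duration_s: float, window_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) times in seconds for overlapping chunks.

    window_s and overlap_s are assumed example values, not the tool's settings.
    """
    if window_s <= overlap_s:
        raise ValueError("window must be longer than overlap")
    step = window_s - overlap_s  # how far each new chunk advances
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break  # the last chunk reached the end of the recording
        start += step
    return chunks


if __name__ == "__main__":
    for start, end in plan_chunks(70.0):
        print(f"{start:5.1f}s to {end:5.1f}s")
```

Each chunk repeats the last few seconds of the previous one, so a word cut off at one boundary is heard in full in the next chunk. Merging the overlapping text back into one transcript is a separate step that this sketch does not cover.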

Frequently Asked Questions

What is the difference between Whisper AI and Anthropic API modes?

Whisper AI runs entirely in your browser — your audio is never uploaded anywhere. It requires a one-time model download (39–244 MB depending on size) and works offline after that. It supports 8 languages and is the best choice when privacy is critical. Anthropic API sends audio to Claude's cloud servers for faster, more accurate transcription in 99+ languages. It requires a free API key from Anthropic and an internet connection during transcription.

Which mode should I choose for my use case?

Choose Whisper AI if: you are transcribing sensitive or confidential recordings; you want to work offline; you do not want to create any external accounts; or your audio is in one of the 8 supported languages and is reasonably clear. Choose Anthropic API if: you need faster transcription of long files; your audio has heavy accents, background noise, or multiple overlapping speakers; you need support for more than 8 languages; or you are processing high volumes of audio regularly.

What audio formats are supported?

Both engines support MP3, WAV, M4A, OGG, FLAC, OPUS, WEBM, AAC, and MP4. For best accuracy, use clear recordings at 16 kHz or higher sample rate with minimal background noise. Very long recordings are automatically split into chunks and reassembled into a single continuous transcript.
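
If you want to check a recording before uploading it, the sketch below tests the file extension against the list above and, for WAV files only, reads the sample rate with Python's standard `wave` module. The function name `check_audio` is hypothetical, and the check is deliberately simple: it does not inspect the sample rate of compressed formats such as MP3 or M4A.

```python
from pathlib import Path
import wave

# Extensions taken from the supported-format list above.
SUPPORTED = {".mp3", ".wav", ".m4a", ".ogg", ".flac", ".opus", ".webm", ".aac", ".mp4"}


def check_audio(path: str, min_rate_hz: int = 16_000) -> list[str]:
    """Return a list of warnings; an empty list means no problem was found."""
    problems = []
    p = Path(path)
    ext = p.suffix.lower()
    if ext not in SUPPORTED:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    elif ext == ".wav":
        # Only WAV headers are read here; other formats are not opened.
        with wave.open(str(p), "rb") as wav:
            rate = wav.getframerate()
        if rate < min_rate_hz:
            problems.append(
                f"sample rate {rate} Hz is below the recommended {min_rate_hz} Hz"
            )
    return problems
```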

What output formats are available?

Three output formats are available: Verbatim + Timestamps ([MM:SS] Speaker: text) for research coding and analysis; Clean text only with no timestamps for easy reading and editing; and SRT subtitles for use with video players and editing software. All formats can be downloaded as TXT or DOCX files.
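
For readers who post-process transcripts themselves, the three formats can be sketched as below. This is an assumed illustration, not the tool's exporter: the `Segment` class and function names are hypothetical, and the "clean" variant here simply joins the spoken text, which may differ from how the tool lays out its clean output.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds from the beginning of the recording
    end: float
    speaker: str
    text: str


def mmss(seconds: float) -> str:
    """Format seconds as MM:SS (minutes keep counting past 59 for long files)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_verbatim(segments) -> str:
    return "\n".join(f"[{mmss(s.start)}] {s.speaker}: {s.text}" for s in segments)


def to_clean(segments) -> str:
    return " ".join(s.text for s in segments)


def to_srt(segments) -> str:
    blocks = [
        f"{i}\n{srt_time(s.start)} --> {srt_time(s.end)}\n{s.text}"
        for i, s in enumerate(segments, start=1)
    ]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    demo = [
        Segment(0.0, 2.5, "Interviewer", "Can you describe your role?"),
        Segment(75.4, 80.0, "Respondent", "I manage the field team."),
    ]
    print(to_verbatim(demo))
    print(to_srt(demo))
```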

Is there a file size or duration limit?

There is no hard limit. Long recordings are automatically split into overlapping segments and merged into one continuous transcript. The Whisper AI engine can handle files of any size as long as your device has sufficient memory. The Anthropic API engine supports audio up to 25 MB per chunk and can process recordings of several hours in total.
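
As a rough worked example of what the 25 MB per-chunk figure means in practice, the sketch below estimates how many seconds of audio fit in one chunk at a given bitrate and how many chunks a recording would need. It assumes constant-bitrate audio and decimal megabytes (25,000,000 bytes), and it ignores chunk overlap and container overhead, so treat the numbers as estimates. The function names are hypothetical.

```python
import math

CHUNK_LIMIT_BYTES = 25 * 1000 * 1000  # assumption: 1 MB = 1,000,000 bytes


def max_chunk_seconds(bitrate_kbps: float, limit_bytes: int = CHUNK_LIMIT_BYTES) -> float:
    """Seconds of constant-bitrate audio that fit in one chunk."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return limit_bytes / bytes_per_second


def chunks_needed(duration_s: float, bitrate_kbps: float,
                  limit_bytes: int = CHUNK_LIMIT_BYTES) -> int:
    """Estimated chunk count for a recording, ignoring overlap."""
    return math.ceil(duration_s / max_chunk_seconds(bitrate_kbps, limit_bytes))


if __name__ == "__main__":
    # A 128 kbps file uses 16,000 bytes per second, so one chunk holds 1562.5 s.
    print(max_chunk_seconds(128))        # 1562.5
    # A two-hour recording (7200 s) at 128 kbps needs 5 chunks.
    print(chunks_needed(7200, 128))      # 5
```

At 128 kbps, one chunk holds about 26 minutes of audio, so a two-hour interview would be split into roughly five chunks under these assumptions.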

Can I add speaker names to the transcript?

Yes. Before transcribing, set the number of speakers and enter their names (e.g., "Interviewer", "Dr. Smith", "Respondent 1"). The tool uses these labels throughout the transcript. You can also edit the transcript directly in the output box after transcription is complete.