PDF to Text — Extract Clean Text from Any PDF
Pull readable plain text out of any text-based PDF — entirely in your browser, with no upload required. Powered by PDF.js, the tool extracts text page by page, optionally adding page separators, cleaning up excess whitespace, rejoining hyphenated words, and preserving or collapsing line breaks. Download the result as a .txt file ready for editing, analysis, or import into any application.
Scanned PDFs (images) require OCR and are not supported by this tool. For scanned documents, use an OCR tool first.
How to Convert PDF to Text
📄 What This Tool Does
This tool extracts the text layer from a text-based PDF — one where the text was originally typed or generated digitally, not photographed or scanned. It uses PDF.js, Mozilla's open-source PDF engine, running entirely inside your browser. The result is a clean .txt file containing all the text from the PDF, ready for editing, pasting, or importing into other applications.
PDF is a layout-first format — text is stored as positioned characters on a page, not as flowing prose. Extracting it cleanly requires reconstructing reading order, detecting line breaks, rejoining hyphenated words, and collapsing layout whitespace. This tool handles all of that automatically.
🔧 Understanding the Options
Three checkboxes control how the extracted text is processed and structured:
Page separators — Inserts a visible divider line (────────────────────────) between each page's content in the output file. This makes it easy to identify where one page ends and the next begins when working with multi-page documents. Disable this if you want all text as a continuous flow with no page markers.
Clean text — Applies two post-processing steps: (1) Dehyphenation — words that were split across lines with a hyphen (e.g., "impor-\ntant") are rejoined into the correct word ("important"). (2) Whitespace normalization — excess spaces, tabs, and runs of more than two blank lines are collapsed, preventing the output from having large empty gaps caused by hidden PDF layout elements.
Preserve line breaks — When enabled, the original line structure from the PDF is kept in the output. When disabled, text on each page is joined into a single continuous paragraph. This is useful when you need to paste text into a system that handles its own line wrapping, or when you want the cleanest possible prose output without any line break artifacts from the PDF layout.
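As a rough illustration of how the page separator and line break options could shape the output, here is a minimal sketch. It is not the tool's actual source: `formatPages`, its option names, the 24-character divider length, and the blank line used between pages when separators are off are all assumptions made for this example.

```javascript
// Hypothetical helper: combine per-page line arrays into one output string.
// `pages` is an array of pages, each page being an array of line strings.
const SEPARATOR = "─".repeat(24); // divider length is an assumption

function formatPages(pages, { pageSeparators = true, preserveLineBreaks = true } = {}) {
  const pageTexts = pages.map((lines) =>
    // Keep one output line per PDF line, or join the page into one paragraph.
    preserveLineBreaks ? lines.join("\n") : lines.join(" ")
  );
  // With separators, a divider line sits between pages; otherwise a blank
  // line is used (an assumption, the page does not specify this).
  const joiner = pageSeparators ? `\n${SEPARATOR}\n` : "\n\n";
  return pageTexts.join(joiner);
}
```

Calling `formatPages([["a", "b"], ["c"]], { preserveLineBreaks: false })` would join the first page into `"a b"` and place the divider line before `"c"`.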
⚙️ Step-by-Step: How to Use
- Upload your PDF. Click Choose PDF or drag and drop a PDF file into the upload area. The file is read directly from your device — nothing is sent to any server at this step or any subsequent step.
- Set your options. Choose whether to include page separators, apply clean text processing, and preserve line breaks. For most use cases, enabling all three options gives the best result.
- Click Convert to Text. PDF.js parses the file page by page, extracting text items from each page's content stream while tracking their Y-coordinate positions to reconstruct reading order. A progress bar shows real-time page-by-page progress.
- Check the preview. The first three pages of extracted text appear in the preview box on the right. Verify the text looks correct: readable words and sentences, not garbled symbols. If the preview is empty or shows only symbols, the PDF is likely scanned or uses a font encoding that cannot be decoded, and this tool cannot process it.
- Download the .txt file. Click Download .txt to save the full output. The filename matches the original PDF. The file is encoded in UTF-8, compatible with all modern text editors.
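The conversion step above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the tool's real code: `itemsToLines`, `extractPages`, the `tolerance` value, and the `onProgress` callback are hypothetical names. The `pdf` argument is expected to be the document object that PDF.js resolves from `pdfjsLib.getDocument(...).promise`, whose `numPages`, `getPage()`, and `getTextContent()` members are used here.

```javascript
// Group PDF.js text items into lines. A new line starts when the Y
// coordinate (transform[5]) changes by more than `tolerance` units.
// Items are joined as-is, assuming they carry their own spaces.
function itemsToLines(items, tolerance = 2) {
  const lines = [];
  let current = [];
  let lastY = null;
  for (const item of items) {
    const y = item.transform[5];
    if (lastY !== null && Math.abs(y - lastY) > tolerance) {
      lines.push(current.join(""));
      current = [];
    }
    current.push(item.str);
    lastY = y;
  }
  if (current.length > 0) lines.push(current.join(""));
  return lines;
}

// Walk the document page by page and report progress after each page.
// `onProgress` is a hypothetical callback, for example to drive a progress bar.
async function extractPages(pdf, onProgress = () => {}) {
  const pages = [];
  for (let n = 1; n <= pdf.numPages; n++) {
    const page = await pdf.getPage(n);
    const content = await page.getTextContent();
    pages.push(itemsToLines(content.items));
    onProgress(n, pdf.numPages);
  }
  return pages;
}
```

Because items are processed in the order PDF.js returns them, this sketch only detects line breaks; it does not reorder text across columns.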
🎯 Who Uses This Tool?
Researchers and academics — Extract text from journal articles, dissertations, and technical reports for corpus analysis, NLP pipelines, or citation work without exposing confidential pre-publication documents to external servers.
Students — Pull text from lecture slides, textbook PDFs, or study materials to paste into notes, translate, or feed into summarization tools.
Professionals and legal staff — Extract text from contracts, legal documents, or reports for quick search, editing, or comparison — without uploading sensitive files to a cloud service.
Developers and data scientists — Convert batches of PDF documents to plain text as a preprocessing step for text analysis, machine learning datasets, or information extraction pipelines.
Writers and editors — Pull text from a PDF proof, draft, or submitted manuscript into a plain text editor for revision without needing PDF editing software.
Frequently Asked Questions
Is my PDF file uploaded to a server?
No — your PDF never leaves your device. The tool uses PDF.js, Mozilla's open-source PDF rendering engine, running entirely inside your browser tab via modern JavaScript APIs. Your file is read from your device's local memory, processed there, and the text output is written back to your device. No data is transmitted over the internet at any point during the conversion.
Does this tool work on scanned PDFs?
No. Scanned PDFs store pages as raster images — photographs of paper. They do not contain a digital text layer, so there is nothing for this tool to extract. If you upload a scanned PDF, the preview will show nothing or garbled symbols. To extract text from a scanned PDF, you first need an OCR (Optical Character Recognition) tool, which analyzes the image and recognizes characters. This tool only works with text-based PDFs — those originally created digitally from Word, InDesign, LaTeX, or similar software.
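A simple way to flag a probable scan is to check whether any page produced text at all. The sketch below is only a heuristic under that assumption; `looksScanned` is a hypothetical name, and a PDF with a damaged or empty text layer would trigger it too.

```javascript
// Heuristic sketch: treat a PDF as "probably scanned" when no page yields
// any non-whitespace text. `pageTexts` is an array of extracted page strings.
function looksScanned(pageTexts) {
  return pageTexts.every((text) => text.trim().length === 0);
}
```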
Why does the extracted text look garbled or disordered?
Several PDF characteristics can cause imperfect extraction:
- Complex multi-column layouts: PDF.js reads text items in the order they are stored internally, which may not match visual reading order for multi-column academic papers.
- Unusual fonts or encoding: Some PDFs use embedded fonts with custom character mappings that PDF.js cannot always decode.
- Tables: Table cells are stored as individual text fragments and may appear merged or in an unexpected order.
- Right-to-left text: Arabic, Hebrew, and other RTL scripts may require additional processing for correct reading order.
Enable the Clean text option and try toggling Preserve line breaks on or off to improve results for complex layouts.
What does the "Clean text" option do?
Clean text applies two automatic post-processing steps. Dehyphenation detects words that were split with a hyphen at the end of a line — a common occurrence in justified text — and rejoins them into the correct word. For example, "impor-\ntant" becomes "important". Whitespace normalization collapses multiple consecutive spaces and tabs into a single space, and reduces runs of more than two blank lines to two, eliminating the large empty gaps that often appear when PDF layout elements are stripped.
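The two steps can be approximated with a few regular expressions. This is a minimal sketch of the described behavior, not the tool's actual implementation; `cleanText` is a hypothetical name, and the extra trimming of spaces before a line break is an assumption added for tidiness.

```javascript
function cleanText(text) {
  return text
    // Dehyphenation: drop a line-end hyphen and its line break when it
    // sits between two letters ("impor-\ntant" becomes "important").
    .replace(/(\p{L})-\n(?=\p{L})/gu, "$1")
    // Collapse runs of spaces and tabs into a single space.
    .replace(/[ \t]+/g, " ")
    // Remove spaces left dangling before a line break (assumed extra step).
    .replace(/ +\n/g, "\n")
    // Two blank lines equal three newlines; reduce longer runs to that.
    .replace(/\n{4,}/g, "\n\n\n");
}
```

One trade-off of this simple rule is that a genuine compound broken at its hyphen (for example "well-" at a line end followed by "known") also loses the hyphen, since the pattern cannot tell the two cases apart.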
What is the difference between "Preserve line breaks" on and off?
With Preserve line breaks enabled, the tool maintains the line structure from the PDF — each visual line in the PDF becomes a separate line in the output. This is useful when structure matters, such as for poetry, code, or formatted documents. With it disabled, all text from each page is joined into a single continuous paragraph with spaces between former line breaks. This is better for prose documents where you want clean flowing text without mid-sentence line breaks caused by PDF column widths.
Is there a file size limit?
There is no server-imposed limit since all processing runs locally on your device. Very large PDFs (hundreds of pages or files over 100 MB) may be slow on devices with limited memory, and the browser may struggle if the extracted text is extremely long. For best performance, use a desktop computer with a modern browser. PDFs up to 50 MB and 200 pages typically process quickly on most devices.
Can I extract text from a password-protected PDF?
It depends on the kind of password. A user password (also called an open password) must be entered before the file can be viewed at all; if your PDF has one, the conversion will fail. An owner password only restricts actions such as printing or text copying, and typically does not block PDF.js from accessing the text content, since the file can still be opened and parsed. If your PDF is encrypted, try using Alfreto's Unlock PDF tool first to remove the protection, then run the text extraction.
What file format is the output?
The output is a plain text file (.txt) encoded in UTF-8. This format is universally compatible — it opens in every text editor (Notepad, TextEdit, VS Code, Sublime Text), word processor (Word, LibreOffice), and can be imported into virtually any data analysis, NLP, or corpus tool. The filename is based on your original PDF filename (e.g., report.pdf becomes report.txt).
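The naming rule and the encoding can be illustrated with two small helpers. Both `outputFilename` and `toUtf8Bytes` are hypothetical names for this sketch; `TextEncoder` is the standard web API, which always encodes to UTF-8.

```javascript
// Derive the .txt name from the PDF name (report.pdf -> report.txt).
// The extension match is case-insensitive; other names just get ".txt" appended.
function outputFilename(pdfName) {
  return pdfName.replace(/\.pdf$/i, "") + ".txt";
}

// Encode the extracted text as UTF-8 bytes.
function toUtf8Bytes(text) {
  return new TextEncoder().encode(text);
}
```

In a browser, the resulting bytes or the string itself could be wrapped in a `Blob` with the type `text/plain;charset=utf-8` and offered through a temporary download link.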
Which browsers are supported?
The tool runs in all modern browsers: Google Chrome, Microsoft Edge, Mozilla Firefox, and Safari. PDF.js is developed and maintained by Mozilla and has excellent cross-browser compatibility. For the best performance with large PDFs, Chrome or Edge on a desktop computer is recommended.
How is this different from copying text from a PDF viewer?
Manually copying text from a PDF viewer (such as Adobe Reader or a browser preview) has several limitations: in most viewers you can only copy one page at a time, and text from multi-column layouts often copies in the wrong order. This tool processes all pages automatically, applies reading-order reconstruction from the PDF's internal coordinate data, and handles dehyphenation and whitespace cleanup, producing a cleaner result than manual copy-paste, especially for long or complex documents.
Why Use Alfreto to Extract Text from PDF?
Most PDF-to-text tools online require uploading your document to a cloud server — which introduces privacy risks, file size restrictions, and dependency on the service being available. Alfreto uses PDF.js to process everything locally, meaning your document never leaves your device under any circumstances.