OCR

Also known as: Optical Character Recognition, text recognition

OCR (Optical Character Recognition) is the process of turning images of text — a scanned page, a photo of a receipt, a screenshot — into machine-readable characters, so a picture of words becomes text you can select, search, and copy.

Overview

OCR analyses the pixel shapes in an image, matches them against known character forms, and outputs the underlying text. It is what lets you digitise a stack of paper, make an old scanned book searchable, or pull the total off a photographed receipt — converting image-only documents into something a computer can actually read.
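The shape-matching idea can be sketched with a toy example. The code below is a minimal illustration, not how any particular OCR engine works: the 3x5 glyph bitmaps in `TEMPLATES` are made up for the example, the glyphs are assumed to be already cut out of the page, and modern engines generally rely on trained models rather than fixed templates.

```python
# Toy glyph templates: 3 columns x 5 rows, '#' = ink, '.' = background.
# These shapes are invented for illustration, not taken from a real font.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def pixel_distance(a, b):
    """Count the pixels at which two equally sized glyph bitmaps differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def recognise_glyph(bitmap):
    """Return the template character whose bitmap differs in the fewest pixels."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(TEMPLATES[ch], bitmap))


def recognise_line(glyphs):
    """Recognise a sequence of already-segmented glyph bitmaps."""
    return "".join(recognise_glyph(g) for g in glyphs)


# A "T" with one stray ink pixel, standing in for scanner noise.
noisy_t = ["###", ".#.", ".#.", "##.", ".#."]
```

Calling `recognise_line([TEMPLATES["L"], TEMPLATES["O"], noisy_t])` picks the nearest template for each glyph, so a single stray pixel does not change the result. Heavier noise is where real recognition starts to fail.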

The reason OCR comes up constantly with PDFs is the text-layer distinction. A PDF exported directly from an application, such as a word processor or a browser, normally has its text embedded, so you read that layer directly and no OCR is needed. A scanned PDF is just images of pages, so there is nothing to extract until OCR has recognised the characters.

Worth being precise about scope: extracting text from a PDF that already has a text layer is not OCR — it is just reading what is already there. True OCR (recognising text inside a flat image) is a heavier, separate step, and its accuracy depends on scan quality, fonts, and language.

Common questions about OCR

Do I need OCR to extract text from a PDF?

Only if the PDF is a scan or otherwise image-only. PDFs with a real text layer — most exports from word processors, browsers, and report generators — can be read directly without any OCR step.

How can I tell if a PDF needs OCR?

Open it in any viewer and try to select a line of text with your cursor. If text highlights, there is a text layer and no OCR is needed. If nothing selects and the page behaves like an image, it is image-only and needs OCR.
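The same check can be approximated in code. The sketch below assumes you have already obtained the text of each page from whatever PDF library you use (that step is not shown); `page_needs_ocr`, `pages_needing_ocr`, and the `min_chars` threshold of 20 are hypothetical names and values chosen for illustration.

```python
def page_needs_ocr(extracted_text, min_chars=20):
    """Guess whether a page is image-only from the text a PDF reader returned.

    A page with (almost) no extractable characters is treated as a scan.
    The threshold is an assumption; tune it for your own documents.
    """
    visible = "".join(extracted_text.split())  # drop all whitespace
    return len(visible) < min_chars


def pages_needing_ocr(page_texts, min_chars=20):
    """Return the 1-based numbers of pages that look image-only."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if page_needs_ocr(text, min_chars)
    ]
```

For example, `pages_needing_ocr(["Invoice 1042\nTotal due: 318.50 EUR", "", "  \n "])` flags pages 2 and 3. A heuristic like this can misjudge pages that are genuinely almost blank, so treat the result as a hint rather than a verdict.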

Is OCR perfectly accurate?

No. Accuracy depends on the resolution and contrast of the scan, the fonts, and the language. Clean, high-contrast scans are usually recognised very well; faxes, handwriting, and low-resolution photos produce more errors, so OCR output should always be proofread.
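One common way to put a number on accuracy is the character error rate: the edit distance between the OCR output and a hand-checked reference, divided by the length of the reference. The sketch below is a plain-Python illustration; the function names are chosen for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)
```

With a reference of `"Total: 100"` and an OCR result of `"Tota1: lOO"` (digits and letters confused), `character_error_rate` returns 0.4: four wrong characters out of ten.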

Tools that work with OCR

Extract Text from PDF

Pull the selectable text out of a PDF — in your browser.

PDF to Images

Render each PDF page to a PNG or JPEG — in your browser.

PDF

PDF (Portable Document Format) is a page-description file format that fixes the layout, fonts, and graphics of a document so it renders consistently on any device, which is why contracts, invoices, bank statements, and forms are almost always shared as PDFs.

External references

https://en.wikipedia.org/wiki/Optical_character_recognition