OCR
Also known as: Optical Character Recognition, text recognition
OCR (Optical Character Recognition) is the process of turning images of text — a scanned page, a photo of a receipt, a screenshot — into machine-readable characters, so a picture of words becomes text you can select, search, and copy.
Overview
OCR analyses the pixel shapes in an image, matches them against known character forms, and outputs the underlying text. It is what lets you digitise a stack of paper, make an old scanned book searchable, or pull the total off a photographed receipt — converting image-only documents into something a computer can actually read.
The reason OCR comes up constantly with PDFs is the text-layer distinction. A PDF exported directly from software such as a word processor usually has its text embedded already, so you read that layer directly and no OCR is needed. A scanned PDF is typically just images of pages, so there is nothing to extract until OCR has recognised the characters.
Worth being precise about scope: extracting text from a PDF that already has a text layer is not OCR; it is just reading what is already there. True OCR (recognising text inside a flat image) is a separate, more computationally expensive step, and its accuracy depends on scan quality, fonts, and language.
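The text-layer distinction can be sketched as a per-page decision: use the embedded text when a page has some, and fall back to OCR only when it does not. The snippet below is a minimal illustration, not a definitive implementation. The names page_needs_ocr and extract_page_text are hypothetical, and the min_chars threshold of 20 is an arbitrary assumption. It also assumes the embedded text has already been read by whatever PDF library you use, and that the caller supplies the OCR step as a function.

```python
from typing import Callable, Optional


def page_needs_ocr(embedded_text: Optional[str], min_chars: int = 20) -> bool:
    """Return True when a page has too little embedded text to rely on.

    min_chars is an assumed threshold, not a standard value: a scanned
    page usually yields no text at all, but a stray page number or
    watermark can produce a few characters.
    """
    text = (embedded_text or "").strip()
    return len(text) < min_chars


def extract_page_text(
    embedded_text: Optional[str],
    run_ocr: Callable[[], str],
    min_chars: int = 20,
) -> str:
    """Prefer the existing text layer; call the OCR step only as a fallback.

    run_ocr is a hypothetical caller-supplied function that recognises
    the text in the page image and returns it as a string.
    """
    if page_needs_ocr(embedded_text, min_chars):
        return run_ocr()
    # page_needs_ocr returned False, so embedded_text is a non-empty string.
    return embedded_text or ""


if __name__ == "__main__":
    def fake_ocr() -> str:
        # Stand-in for a real OCR engine, used only for this demonstration.
        return "text recognised from the page image"

    exported_page = "Invoice 1042. Total due: 318.50. Payment within 30 days."
    scanned_page = ""  # an image-only page has no text layer to read

    print(extract_page_text(exported_page, fake_ocr))
    print(extract_page_text(scanned_page, fake_ocr))
```

Passing the OCR step in as a function means it runs only for pages that actually need it, which matters because recognition is the expensive part.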