This is a tool to “transcribe” a PDF file into paragraphs (etc.) that remain associated with regions of the original pages. (This way, one can verify or edit the transcription at any time.) It is designed for (scanned) PDFs that are mostly lines of text (paragraphs, headings, verses, footnotes: not illustrations, math, tables, forms).
Load a PDF, use OCR to detect lines, group lines into "chunks", save the result.
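The "group lines into chunks" step can be sketched roughly as follows. This is only an illustration, not the tool's actual code: the `Line` and `Chunk` types, the `group_lines` function, and the `max_gap_ratio` threshold are all hypothetical names. It assumes OCR has already produced one bounding box per line, and that a large vertical gap marks a chunk boundary. Each chunk keeps its lines' page regions, so the text stays tied to the scan.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Line:
    """One OCR-detected line: its text plus its bounding box on a page (hypothetical type)."""
    page: int
    x0: float
    y0: float  # top edge; y grows downward
    x1: float
    y1: float  # bottom edge
    text: str


@dataclass
class Chunk:
    """A group of consecutive lines (e.g. a paragraph) that keeps each line's region."""
    lines: list[Line] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(line.text for line in self.lines)


def group_lines(lines: list[Line], max_gap_ratio: float = 0.6) -> list[Chunk]:
    """Start a new chunk on a page change, or when the vertical gap to the
    previous line exceeds max_gap_ratio times that line's height.
    The threshold is an assumed heuristic, not taken from the tool."""
    chunks: list[Chunk] = []
    prev = None
    for line in sorted(lines, key=lambda l: (l.page, l.y0)):
        starts_chunk = (
            prev is None
            or line.page != prev.page
            or (line.y0 - prev.y1) > max_gap_ratio * (prev.y1 - prev.y0)
        )
        if starts_chunk:
            chunks.append(Chunk())
        chunks[-1].lines.append(line)
        prev = line
    return chunks


if __name__ == "__main__":
    sample = [
        Line(0, 10, 10, 100, 20, "First line"),
        Line(0, 10, 22, 100, 32, "continues here."),
        Line(0, 10, 50, 100, 60, "A new paragraph."),
        Line(1, 10, 10, 100, 20, "Next page."),
    ]
    chunks = group_lines(sample)
    # Saving the result: every chunk serializes together with its line regions.
    print(json.dumps([asdict(c) for c in chunks], indent=2))
```

A single gap threshold is the simplest possible grouping rule. In practice headings, verses, and footnotes would likely need manual correction, which is why keeping the regions alongside the text matters.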
Warning: This page may change at any time, and older files may not load. You are strongly advised to save this page as an HTML file and use it locally. Some stable versions are here.
The word chāyā is Sanskrit for "shade" or "shadow", and also denotes a kind of gloss that gives the Sanskrit equivalent of a Prakrit text. The files generated by this tool are intended to serve as a "companion" or "sidecar" to the scanned images in the PDF file.
Source on GitHub. See also: Ambuda: Proofing, Simon Willison's tools: OCR (starting point for this project), Scribe OCR (future inspiration?).