
Note: As far as I know, Calibre does not do OCR, so a PDF that contains only scanned images will not work.


I've had good luck using Tesseract [0] for scanned PDFs, including ones downloaded from archive.org. If you're not CLI-inclined, there are several GUIs available for it [1].

I didn't know Calibre could do this. I had been opening each PDF and searching it individually.

[0]: https://github.com/tesseract-ocr/tesseract
[1]: https://www.opait.com/tessstudio/


OCRmyPDF is a tool built on Tesseract and designed specifically for PDFs. I would recommend it over plain Tesseract.

https://github.com/ocrmypdf/OCRmyPDF
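
For anyone with a whole folder of scans, here is a rough sketch of batching them through the ocrmypdf command line from Python. It assumes ocrmypdf is installed and on your PATH. The function names, folder arguments, and the default language "eng" are mine and purely illustrative; the sketch only relies on the --language and --skip-text options.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src, dst, language="eng", skip_text=True):
    """Return the argv list for one ocrmypdf run (nothing is executed here)."""
    cmd = ["ocrmypdf", "--language", language]
    if skip_text:
        # --skip-text leaves pages that already have a text layer untouched,
        # so PDFs mixing scanned and born-digital pages do not abort the run.
        cmd.append("--skip-text")
    cmd += [str(src), str(dst)]
    return cmd


def ocr_folder(in_dir, out_dir, language="eng"):
    """OCR every *.pdf in in_dir into out_dir and return the output paths."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for src in sorted(Path(in_dir).glob("*.pdf")):
        dst = out_dir / src.name
        # check=True raises CalledProcessError if ocrmypdf exits non-zero.
        subprocess.run(build_ocrmypdf_command(src, dst, language), check=True)
        results.append(dst)
    return results
```

Calling ocr_folder("scans", "searchable") would write one OCRed copy per input file and leave the originals alone, so the searchable copies can then be added to Calibre or any other full-text indexer.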


I recommend running any such PDFs through OCRmyPDF.

https://github.com/ocrmypdf/OCRmyPDF



