66.7 PDF Extract Text


The text of a PDF document can be extracted using pdf2txt. This works well for text-based PDF documents. However, some PDF documents are primarily a container for images, even when those images are scanned pages of text. In that case there is no text layer to extract, and we may need to use optical character recognition (OCR) through the command ocrmypdf. See Section 66.12 for details.
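As a minimal sketch of the workflow above, the following shell functions try pdf2txt first and fall back to ocrmypdf only when the extracted file contains no visible text. The function names (txt_name, pdf_to_text) and the file name report.pdf are hypothetical, chosen for illustration. The sketch assumes both commands are installed and that the pdfminer command is available as pdf2txt (some installations name it pdf2txt.py instead).

```shell
#!/bin/sh
# Sketch only: assumes pdf2txt (from pdfminer) and ocrmypdf are on the PATH.
# The function names below are illustrative, not part of either tool.

# Derive the output file name: report.pdf -> report.txt
txt_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Extract the text of a PDF into a .txt file beside it.
# If the result holds only whitespace, the PDF is probably a
# scanned image, so run OCR and save the recognised text instead.
pdf_to_text() {
    out=$(txt_name "$1")
    pdf2txt -o "$out" "$1" || return 1
    if ! grep -q '[^[:space:]]' "$out"; then
        # --sidecar writes the recognised text to a plain text file;
        # the final argument is the searchable PDF that ocrmypdf creates.
        ocrmypdf --sidecar "$out" "$1" "${1%.pdf}-ocr.pdf"
    fi
}

# Usage (hypothetical file name):
#   pdf_to_text report.pdf
```

The whitespace check is a simple heuristic rather than a reliable test: a document that mixes text pages with scanned pages will pass it and the scanned pages will be skipped.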

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0