Dear @linux and @academicchatter folks:
Could you please suggest libre/open source tools for extracting text and images from scientific PDF documents?
P.S.: I’m on a Linux machine. I’d like something terminal-friendly, if possible!
The first tool I can think of is LibreOffice Draw.
Maybe there are other tools, but I think LibreOffice Draw does the job pretty well.
Edit: If the PDF has written text, you may wanna use an OCR tool, but I don’t have any to suggest
gImageReader is a graphical front-end to the open-source OCR program Tesseract, so that might be just what you’re looking for. The default settings don’t add the OCR’d text to the PDF but you can do that.
@ajayiyer@mastodon.social OCRmyPDF is exactly what you are looking for
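Since a terminal-friendly workflow was requested, here is a rough sketch of how an OCRmyPDF run might be scripted from Python. It only assumes that the `ocrmypdf` command is installed and on the PATH. The helper names (`build_ocrmypdf_command`, `ocr_pdf`) and the file names are made up for illustration. The `--sidecar` option asks OCRmyPDF to also write the recognised text to a plain-text file. Note that this covers the text side only, not image extraction.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src, dst, sidecar=None, language="eng"):
    """Build the argument list for an `ocrmypdf` call (does not run it).

    Hypothetical helper: only the `ocrmypdf` command and its `-l` and
    `--sidecar` options come from the tool itself.
    """
    cmd = ["ocrmypdf", "-l", language]
    if sidecar is not None:
        # --sidecar also writes the recognised text to a plain-text file.
        cmd += ["--sidecar", str(sidecar)]
    cmd += [str(src), str(dst)]
    return cmd


def ocr_pdf(src, dst, sidecar=None, language="eng"):
    """Run OCRmyPDF on `src`, writing a searchable PDF to `dst`."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(build_ocrmypdf_command(src, dst, sidecar, language), check=True)
    return Path(dst)


if __name__ == "__main__":
    # Hypothetical file names; prints the command instead of running it.
    print(" ".join(build_ocrmypdf_command("paper.pdf", "paper_ocr.pdf", "paper.txt")))
```

The same call can be typed directly in a shell as `ocrmypdf -l eng --sidecar paper.txt paper.pdf paper_ocr.pdf`; the wrapper is only useful when looping over many files.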