

As a fan of open source (and automation) I hate to say this, but the best results I just got (on quite a large, complex PDF) were to open it in Adobe Reader, then choose File|Save As Text. My second choice is ebook-convert. (I am pre-processing for text analysis experiments, not as a reader, but I think my first and second choice would be the same.)

I've been comparing the output side-by-side.

Adobe: Left in FF for page breaks, left in page numbers, hasn't converted headings/paragraphs to single lines, but it has fixed hyphens. Junk that was hidden in the PDF did not get output. Correctly got the big capitals at start of sections, e.g. "The", not "T he" or even "T he".

ebook-convert: Left in page numbers, and some hidden junk in header/footer (but no FFs). Converts most paragraphs to be single lines. The ones it missed are double-spaced though! Bullets don't always line up with the text. Correctly got "The" at the start of the chapter.

Pdftotext (without -layout): Not bad, bullets line up, but header/footer noise. Worst for start of chapter big letters: "T\n\nhe".

Pdftotext (with -layout): Similar, but more indents.

Pdftohtml > pdfreflow > htmltotext: It removed page numbers, but still junk in header/footer. (It uses multiple lines per paragraph, yet they are not the same line breaks as in the other versions!)

ebook-convert vs pdftotext concrete minimal example

ebook-convert was previously mentioned by frabjous, and I would like to illustrate it with a minimal example.

The problem with pdftotext from poppler-utils 22.12.0 is that it adds newlines within paragraphs when the paragraph is longer than the PDF page width, e.g. something like:

1:1 In the beginning God created the heaven and
1:2 And the earth was without form, and void and
darkness was upon the face of the deep.
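For anyone stuck with output that still has line breaks inside paragraphs or FF characters at page breaks, here is a minimal post-processing sketch in Python. It is not part of any of the tools above. The function name unwrap_paragraphs is made up for illustration, and it assumes that paragraphs are separated by at least one blank line and that a form feed can be treated as a paragraph break (which will wrongly split a paragraph that runs across a page).

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines so each paragraph ends up on a single line.

    Assumptions (not guaranteed for every PDF): paragraphs are separated by
    blank lines, and a form feed (\\f) marks a page break.
    """
    # Treat each form feed as a paragraph break instead of keeping it.
    text = text.replace("\f", "\n\n")
    # Split wherever there is at least one blank (or whitespace-only) line.
    paragraphs = re.split(r"\n\s*\n", text)
    joined = []
    for para in paragraphs:
        # Collapse every run of whitespace, including newlines, to one space.
        line = " ".join(para.split())
        if line:
            joined.append(line)
    # Separate the rebuilt paragraphs with a single blank line.
    return "\n\n".join(joined)


if __name__ == "__main__":
    sample = "first line of a\nwrapped paragraph\n\nsecond\nparagraph\fnext page"
    print(unwrap_paragraphs(sample))
```

This only helps when the converter leaves a blank line between paragraphs. If it breaks lines without any paragraph separator, the breaks cannot be told apart from real paragraph ends this way.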
