Today's article isn't going to apply to many of you...

KGIII

Today we simply convert PDF to text. This is an easy process in the terminal but not something most folks are going to do. After all, one of the things about PDF is that whole F thing. They quite like the formatting.


Still, there are a few reasons why one might want to do this. There are tools that process .txt files that aren't able to process .pdf files properly. After all, open a .pdf file in your regular old plain text editor and you'll see the markup and all that...

So, it's an option.
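
For anyone who'd rather script it, here's a minimal sketch in Python. The post above doesn't name the converter, so this assumes it's `pdftotext` from poppler-utils; the function names `build_command` and `convert` are just made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path=None, keep_layout=False):
    """Build the pdftotext argument list without running anything."""
    pdf = Path(pdf_path)
    # Default to the same file name with a .txt extension.
    txt = Path(txt_path) if txt_path else pdf.with_suffix(".txt")
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout
        # as closely as plain text allows.
        cmd.append("-layout")
    cmd += [str(pdf), str(txt)]
    return cmd


def convert(pdf_path, txt_path=None, keep_layout=False):
    """Run the conversion and return the path of the text file."""
    # Assumption: pdftotext (poppler-utils) is installed and on PATH.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    cmd = build_command(pdf_path, txt_path, keep_layout)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

Calling `convert("report.pdf")` should leave a `report.txt` next to the original, which you can then open with cat or any editor.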
 


This would also be a cool thing for people using (neo)mutt to view .pdf files directly inside their favorite mail client.

Oh, good point. I hadn't even thought of that use. It's pretty easy to convert and then use cat, or even Vi/Nano if you want to edit it.

I haven't played with Qubes in a long time.
 
Does this work for flattened or password-protected PDFs?

I ask because sometimes PDFs can't be altered, depending on what you use to convert them, and they are often skewed. I've had PDFs that are password protected. They can't be altered in any way without the password.

Forgive me if you already know this, it explains part of my question
A flattened PDF is more like an image. If you've ever used Photoshop or GIMP, you'll know that images are manipulated in things called layers. It's something like laying one picture over another. Flattening the layers makes a single image from the two. PDFs can also be flattened.
 
After I get settled, I'll try it

I'm not actually sure what a 'flattened' PDF is, so I can't speculate. You can remove passwords easily enough. I actually recently shared an article about how you can brute force PDF passwords.
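
If you just want to know up front whether a file is password protected, here's a rough sketch in Python. It relies on the fact that encrypted PDFs carry an `/Encrypt` entry in their trailer; searching the raw bytes for that name is only a heuristic (it can give a false positive if the string shows up elsewhere in the file), and the function names are made up for illustration.

```python
def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: encrypted PDFs reference an /Encrypt dictionary."""
    # This is a plain byte search, not a real PDF parse.
    return b"/Encrypt" in pdf_bytes


def file_looks_encrypted(path) -> bool:
    """Read a file from disk and apply the same heuristic."""
    with open(path, "rb") as handle:
        return looks_encrypted(handle.read())
```

If the converter in question is `pdftotext` (an assumption on my part), it accepts `-upw` for the user password and `-opw` for the owner password, so a protected file can still be converted when you know the password.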
 
As far as I understand it, in a typical .pdf file you can have multiple layers, containing text, images, editable form fields, text annotations etc.

With flattened pdf files, everything gets merged down onto a single layer. So each page is basically an image.

So I think to extract text from a flattened pdf file, you’d probably have to extract the single-layer images of each page and then use OCR (Optical Character Recognition) software to retrieve text from each page.

Flattening pdf files is useful when somebody sends you a pdf form to fill in.
After you’ve filled the form in, if you flatten it before sending it back, it means that nobody can easily edit the file to change any of your personal details that were on the form. Because in the flattened .pdf, the fields in the filled-in form are no longer editable. They’ve been flattened down to an image……. I think!
I could be wrong, but it’s something like that.
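
If that's right, one way to spot the pages that would need OCR is to convert the file first and look for pages that come back empty. A minimal sketch in Python, assuming the text came from `pdftotext`, which by default ends each page with a form feed character; the 20-character threshold and the function names are arbitrary choices of mine.

```python
def probably_needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Guess that a page is image-only if almost no text came out."""
    visible = [ch for ch in page_text if not ch.isspace()]
    return len(visible) < min_chars


def pages_needing_ocr(full_text: str, min_chars: int = 20):
    """Return 1-based page numbers that look like they have no text."""
    # Assumption: pages are separated by form feeds ("\f").
    pages = full_text.split("\f")
    # A trailing form feed leaves one empty string at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return [
        number
        for number, page in enumerate(pages, start=1)
        if probably_needs_ocr(page, min_chars)
    ]
```

Pages flagged this way would be the ones to hand to an OCR tool. A page that is genuinely blank gets flagged too, so treat the result as a hint.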
 
So I think to extract text from a flattened pdf file, you’d probably have to extract the single-layer images of each page and then use OCR (Optical Character Recognition) software to retrieve text from each page.

Then it might not work with those, thanks. I saw no reference to OCR in the man pages, so I'll assume that's not an option. (It was just this moment that I learned what 'flattened' meant in regards to PDF, so thanks again.)

I could be wrong, but it’s something like that.

Given your history, I'll err on the side of you being correct.
 
Flattening pdf files is useful when somebody sends you a pdf form to fill in.
Yes. If you've ever received a form in an ODT file, you'll have seen the text after the characters you type move to the right, or move down the page when you reach the end of the line or hit 'enter'.

If information is entered into a table, the characters you type and the text in the form won't move until you reach the character limit in that field. Depending on the formatting, the boundaries of the cells will do one of three things (at least those that I know of from my experience editing and creating PDFs).
Either the characters you type run past the boundaries and you don't see it, nothing in the table is shifted.... But I need to tell you that stuff.

None of those things happen in a PDF that is flattened. Filling out a PDF is literally like inserting a page into a typewriter. If a layer is locked, you have to click or tab over to that area before you can type in it. If you keep hitting space, and I'm not totally positive which it is, either the cursor disappears behind the text box (a layer) or it 'returns'.

If you've ever converted a PDF into Word format, you'll have seen all the tables used to create the document. Editing a PDF that's been converted into an ODT file can be confusing. Take any PDF to the Adobe website. They convert it for free. Open the result in LibreOffice and you'll probably see what I'm talking about.

I start a PDF in LibreOffice, export the ODT file as a PDF, then flatten it with something else. It might be possible to flatten it in LibreOffice Draw, but Draw skews things in PDFs, so I haven't bothered.

 

