Today's article isn't going to apply to many of you...

KGIII · Mar 25, 2024

Today we simply convert PDF to text. This is an easy process in the terminal but not something most folks are going to do. After all, one of the things about PDF is that whole F thing. They quite like the formatting.

Turn PDF Into Text • Linux Tips

Today's exercise is simple, though it will rely on the terminal because we're just going to turn PDF into text.

linux-tips.us

Still, there are a few reasons why one might want to do this. There are tools that process .txt files that aren't able to process .pdf files properly. After all, open a .pdf file in your regular old plain text editor to see the markup and all that...

So, it's an option.

blunix · Mar 25, 2024

smooth!

Qubes has an option to turn untrusted PDFs into harmless ones: https://blog.invisiblethings.org/2013/02/21/converting-untrusted-pdfs-into-trusted.html

It essentially turns the PDF into an RGB bitmap, which you can then view reasonably securely. The whole conversion happens in a "disposable VM" in Qubes, meaning it will spawn a VM, convert the file in it, and then throw the VM away again.

blunix · Mar 25, 2024

this would also be a cool thing for people using (neo)mutt to view .pdf files inside their favorite mail client directly.

KGIII · Mar 25, 2024

blunix said:
this would also be a cool thing for people using (neo)mutt to view .pdf files inside their favorite mail client directly.

Oh, good point. I hadn't even thought of that use. It's pretty easy to convert and then use cat, or even Vi/Nano if you want to edit it.
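
Here is a minimal sketch of that convert-then-read step. It assumes the converter is `pdftotext` from poppler-utils (the thread itself doesn't name the tool) and that it is on your PATH. The function names and file names are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path=None, keep_layout=False):
    """Build the pdftotext argument list without running anything."""
    pdf = Path(pdf_path)
    # Default output name: same file name with a .txt suffix.
    txt = Path(txt_path) if txt_path else pdf.with_suffix(".txt")
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout as best it can.
        cmd.append("-layout")
    cmd += [str(pdf), str(txt)]
    return cmd


def convert(pdf_path, txt_path=None, keep_layout=False):
    """Run pdftotext and return the extracted text (assumes poppler-utils)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    cmd = pdftotext_command(pdf_path, txt_path, keep_layout)
    subprocess.run(cmd, check=True)
    # The last argument is the output file we just asked pdftotext to write.
    return Path(cmd[-1]).read_text(encoding="utf-8", errors="replace")
```

Calling `convert("attachment.pdf")` would leave `attachment.txt` next to the PDF, ready for cat, Vi, or Nano.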

I haven't played with Qubes in a long time.

Sherri is a Cat · Mar 26, 2024

Does this work for flattened or password protected PDF's?

I ask this because sometimes PDFs can't be altered, depending on what you use to convert them, and they are often skewed. I've had PDFs that are password protected. They can't be altered in any way without a password.

Forgive me if you already know this, it explains part of my question.
A flattened PDF is more like an image. If you've ever used Photoshop or GIMP, a lot of images are manipulated in things called layers. Something like laying a picture over another picture. Flattening the layers makes a single image from the two. PDFs can also be flattened.

Sherri is a Cat · Mar 26, 2024

KGIII said:
After all, one of the things about PDF is that whole F thing.

I'm not sure exactly what you mean.
Format?
Portable Document Format?

KGIII · Mar 26, 2024

Sherri is a Cat said:
I'm not sure exactly what you mean.
Format?
Portable Document Format?

Yes.

Sherri is a Cat said:
Does this work for flattened or password protected PDF's?

Maybe and yes - after you remove the password.

Sherri is a Cat · Mar 26, 2024

KGIII said:
Maybe...

After I get settled, I'll try it

KGIII · Mar 26, 2024

Sherri is a Cat said:
After I get settled, I'll try it

I'm not actually sure what a 'flattened' PDF is, so I can't speculate. You can remove passwords easily enough. I actually recently shared an article about how you can brute force PDF passwords.
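
For the case where you already know the password, a rough sketch of the removal step could look like the following. It assumes `qpdf` is installed, which is my choice of tool and not something named in the thread. The function names are hypothetical.

```python
import subprocess


def qpdf_decrypt_command(src, dst, password):
    """Build a qpdf call that writes an unencrypted copy of src to dst."""
    return ["qpdf", f"--password={password}", "--decrypt", src, dst]


def remove_password(src, dst, password):
    # check=True raises CalledProcessError on any non-zero exit status,
    # which is what you would expect to see with a wrong password.
    subprocess.run(qpdf_decrypt_command(src, dst, password), check=True)
```

The unencrypted copy can then be converted to text like any other PDF. `pdftotext` also accepts a `-upw` option for a known user password, which skips the separate decrypt step.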

JasKinasis · Mar 27, 2024

As far as I understand it, in a typical .pdf file you can have multiple layers, containing text, images, editable form fields, text annotations etc.

With flattened pdf files, everything gets merged down onto a single layer. So each page is basically an image.

So I think to extract text from a flattened pdf file, you’d probably have to extract the single-layer images of each page and then use OCR (Optical Character Recognition) software to retrieve text from each page.
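
As a hedged sketch of that idea: if the flattened file really is image-only, plain text extraction comes back nearly empty, and that can be used as a cue to fall back to OCR. The tools assumed here (`pdftoppm` from poppler-utils and `tesseract`) are my own picks, and the helper names and the 20-character threshold are arbitrary.

```python
def looks_like_image_only(extracted_text, min_chars=20):
    """Heuristic: almost no visible characters suggests there is no text layer."""
    visible = [ch for ch in extracted_text if not ch.isspace()]
    return len(visible) < min_chars


def rasterize_command(pdf_path, prefix="page", dpi=300):
    """pdftoppm call that renders each page to a PNG named from the prefix."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]


def tesseract_command(png_path):
    """tesseract call; it writes its result to <output base>.txt."""
    base = png_path[:-4] if png_path.endswith(".png") else png_path
    return ["tesseract", png_path, base]
```

The flow would be: extract text, check it with `looks_like_image_only`, and only if that returns True run the rasterize command followed by one tesseract call per page image (for example via `subprocess.run`). OCR output usually needs proofreading.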

Flattening pdf files is useful when somebody sends you a pdf form to fill in.
After you’ve filled the form in, if you flatten it before sending it back, it means that nobody can easily edit the file to change any of your personal details that were on the form. Because in the flattened .pdf, the fields in the filled-in form are no longer editable. They’ve been flattened down to an image……. I think!
I could be wrong, but it’s something like that.

KGIII · Mar 27, 2024

JasKinasis said:
So I think to extract text from a flattened pdf file, you’d probably have to extract the single-layer images of each page and then use OCR (Optical Character Recognition) software to retrieve text from each page.

Then it might not work with those, thanks. I saw no reference to OCR in the man pages, so I'll assume that's not an option. (It was just this moment that I learned what 'flattened' meant in regards to PDF, so thanks again.)

JasKinasis said:
I could be wrong, but it’s something like that.

Given your history, I'll err on the side of you being correct.

Sherri is a Cat · Mar 28, 2024

JasKinasis said:
Flattening pdf files is useful when somebody sends you a pdf form to fill in.

Yes. If you've ever received a form in an ODT file, you'll see the text after the characters you type move to the right, or move down the page when you reach the end of the line or hit 'enter'.

If information is entered into a table, the characters you type and the text in the form won't move until you reach the character limit in that field. Depending on the formatting, the boundaries of the cells will do one of three things (at least those that I know of from my experience editing and creating PDFs).
Either the characters you type run past the boundaries and you don't see it, nothing in the table is shifted.... But I need to tell you that stuff.

None of those things happen in a PDF that is flattened. Filling out a PDF is literally like inserting a page into a typewriter. If a layer is locked, you have to click or tab over to that area before you can type in it. If you keep hitting space, and I'm not totally positive which it is, either the cursor disappears behind the text box (a layer) or 'returns'.

If you've ever converted a PDF into Word format, you'll see all the tables used to create the document. Editing a PDF that's been converted into an ODT file can be confusing. Take any PDF to the Adobe website. They convert it for free. Open it in LibreOffice and you'll probably see what I'm talking about.

I start a PDF as an ODT file in LibreOffice, export it as a PDF, then flatten it with something else. It might be possible to flatten it in LibreOffice Draw, but Draw skews things in PDFs, so I haven't bothered.

JasKinasis said:
After you’ve filled the form in, if you flatten it before sending it back, it means that nobody can easily edit the file to change any of your personal details that were on the form. Because in the flattened .pdf, the fields in the filled-in form are no longer editable. They’ve been flattened down to an image……. I think!
I could be wrong, but it’s something like that.
