how to edit a scanned document?

marbles

hi-

i scanned a document and from what i've read, in order to edit it, i need to convert the ocr to text

can someone please tell me how to do this via a GUI program. i don't know how to do terminal window stuff (and i don't have enough data to learn it. i'm sure i'd have some questions and i just don't have that much data to ask those questions)

the only editing i need to do in the scanned pdf is just use the eraser and maybe add a text box

i know there are some smart people on here, so if you want a challenge :), here you go:

please let me know if there's a GUI scanning software program that will
1. scan a document and create the ocr AND
2. convert the ocr to text AND
3. allow me to edit it (use an eraser)

so i don't have to use 3 different programs

thx for helping
 


I have not used it but Master PDF Editor may have that capability. It gets good words from people I trust, anyway.

In a pinch, I have used GIMP for non-extensive editing.
 
I don't know which applications to recommend for scanning and OCR on Linux. I use a Mac. I sent a PM to @marbles to help with terminology and how OCR works. I changed my mind and decided to post it here for others to find if they search from the internet:

"I can't help with the answer, but I can help with terminology.

"You use scanner software to scan the paper document to a digital 'bitmap'. The bitmap is whatever is on the scanned area on the paper, one dot / blank spot at a time. The scanner software may keep the bitmap in memory or it could write the bitmap to a graphics file like .JPG, .PNG, .TIFF, .RAW, or another file format. Those graphics file formats know nothing about the characters (letters) on the paper. They only know the bitmap (dots) image.

" 'OCR' stands for 'optical character recognition'. It is the process that converts a bitmap into a text file. The OCR software looks at the dots in the bitmap and figures out which are characters and what they are. The OCR software can save the text to a text file or a document file for you.

"NOTE: OCR software is not perfect. It makes errors and requires proof reading. I type fast and accurately. For me, it isn't always worth it.

"The scanner software may have OCR capabilities built-in. You may need two programs - a scanner program to create the image file, and a separate OCR program to convert the image file to a text or document file. That's the part where others can help. I use Mac software for scanning and OCR. I don't know which Linux software would be appropriate for your needs."

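To make the bitmap-versus-text distinction concrete, here is a toy sketch in Python. It is my own illustration under simplifying assumptions, not how any real OCR program works: the 3x3 letter templates and the function names are made up for the example. It treats each scanned character as a small grid of dots and picks the template that differs in the fewest dots, which also hints at why OCR output needs proofreading.

```python
# Toy illustration only: real OCR engines are far more sophisticated.
# A "bitmap" here is a tuple of strings: '#' is a dark dot, '.' is a blank spot.

# Hypothetical 3x3 templates for three letters (invented for this example).
TEMPLATES = {
    "T": ("###",
          ".#.",
          ".#."),
    "L": ("#..",
          "#..",
          "###"),
    "I": (".#.",
          ".#.",
          ".#."),
}


def dot_distance(a, b):
    """Count the positions where two same-sized bitmaps differ."""
    return sum(ca != cb for row_a, row_b in zip(a, b) for ca, cb in zip(row_a, row_b))


def recognize_glyph(bitmap):
    """Return the letter whose template differs from the bitmap in the fewest dots."""
    return min(TEMPLATES, key=lambda ch: dot_distance(TEMPLATES[ch], bitmap))


def recognize_line(glyphs):
    """Turn a sequence of glyph bitmaps into a text string."""
    return "".join(recognize_glyph(g) for g in glyphs)


# The "scan": just dots. The middle glyph has one stray dot, as a real scan might.
scanned = [
    ("###", ".#.", ".#."),
    (".#.", ".#.", "##."),
    ("#..", "#..", "###"),
]

text = recognize_line(scanned)
print(text)
```

The image file only ever holds the dots in `scanned`; the string in `text` exists only after the recognition step has run.
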
-> I hope someone here who knows Linux applications can recommend appropriate scanner software and/or OCR software.
 
marbles wrote:
please let me know if there's a GUI scanning software program that will
1. scan a document and create the ocr AND
2. convert the ocr to text AND
3. allow me to edit it (use an eraser)
so i don't have to use 3 different programs

Without great expertise in PDF editing, I cannot advise on a single program to achieve your aims. However, the way I've been able to do these things is to scan a doc to PDF format, then open the doc in the Xournal program and fiddle about. One can erase anything by painting over it in white (or any other colour), and add text anywhere, including over the erasure. Erasing can be fiddly because it's painting over a section of the PDF. It hasn't mattered in my case whether the PDF is of a picture or text; erasing and filling in text has been adequate, but I can't say I've achieved professional-looking docs. If the PDF is merely a form to fill in with empty spaces, a professional-looking outcome is possible. My expertise with Xournal, though sufficient for my needs, is not high, and I've never needed to work with OCR. YMMV.
 
There is Tesseract, which is a command-line utility that does OCR.

Have a look here - https://lindevs.com/install-tesseract-ocr-on-ubuntu/ and here - https://www.howtogeek.com/682389/how-to-do-ocr-from-the-linux-command-line-using-tesseract/
There is also a GUI for Tesseract called YAGF, which is in the Debian repos. I do not know if anyone else has it or not. In fact, both Tesseract and YAGF are in the Debian repos.

Also GOCR here - https://jocr.sourceforge.net/

Of the two, the only downside to GOCR is that it does not handle multiline layouts very well. Also, as far as I know, GOCR has not been updated for a couple of years and might be dead, so there is that.
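
For anyone who does end up on the Tesseract route, here is a minimal Python sketch of what a GUI front end does behind the scenes. It assumes Tesseract is installed and uses its basic command form (`tesseract IMAGE OUTPUTBASE -l LANG`). The file names `scan.png` and `scan_text` and both function names are hypothetical, chosen just for the example.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run Tesseract on an image and return the path of the text file it writes.

    Returns None if the tesseract executable is not found on PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # By default Tesseract writes plain text to OUTPUTBASE.txt.
    return output_base + ".txt"


# Show the command that would be run for a hypothetical scan.
print(" ".join(build_tesseract_command("scan.png", "scan_text")))
```

Calling `ocr_image("scan.png", "scan_text")` on a real scanned image should leave the recognized text in `scan_text.txt`, which can then be opened in any text editor for proofreading.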
 
As mentioned before, whenever I need to do OCR these days, I just use a search engine to find an online OCR service (there are perfectly free services). I think I still have a dedicated scanner that does OCR, but I haven't bothered with that in ages.
 
Google Docs has an OCR function. I needed OCR once, and it was the best I found.
 
Open the scanned PDF file in Acrobat.
Choose Tools > Edit PDF. Acrobat automatically applies OCR to your document and converts it to a fully editable copy of your PDF.
 
Open the scanned PDF file in Acrobat.
Choose Tools > Edit PDF. Acrobat automatically applies OCR to your document and converts it to a fully editable copy of your PDF.
Unless you didn't notice that you are on a Linux forum, Acrobat doesn't have Linux support.
 
I believe Adobe Reader once had a Linux port, but 9.5.5 - the very last release - although it still works fine, IS now more than 10 years old.....

(shrug)

I use Qoppa's PDF Studio, Master PDF Editor & Foxit Reader to handle PDF stuff. Thankfully, I've never needed OCR capabilities.


Mike. ;)
 
As other folks point out, PDF editors can be used for this. However, it's hard to duplicate the original format, and that's where OCR comes into play.

I'm sure you could also try converting all the text to a different font.
 
I would give Master PDF a try. I often need to add text to a PDF invoice and found it to be a challenge on Linux. On Windows it was easy-peasy. Master PDF works well for the task.
 
I gave a quick look at Convertio's website, and it may be worth your time to look at their security page.

Before you blindly upload any document to a website to have it OCR'd or for any other purpose, give a thought to the sensitivity of the content in that document. Even if a website promises to destroy your data, think about how much you can trust them to get it right. There is also the possibility that they have been hacked without their knowledge.
 
Hi, try https://openpaper.work/. You can scan documents, it does OCR in common languages, and you can rotate the image/PDF and change colors. Have a look at it and tell me how you like it :)
 
