Questions on PDFs & JSON files.

Nik-Ken-Bah

OK crew, here is the problem.
I have a Public Domain book as a PDF that was digitised by Bezos's mob. They did it using page images (which look to me like plain photocopies), so you cannot copy any of the text you wish to add to your study notes. I came across this many moons ago, but I can generally get around it by locating copies digitised by the universities or, oddly enough, Microsoft (the only plus I know for them) and other sponsors that hold the hard-copy books. From those I can extract the text I want and paste it into LibreOffice Writer.
Today I have been installing and testing the PDF programs that let you edit, merge, split, etc. PDFs, but I could not get any of them to turn the pages into text, so I uninstalled each one when it didn't. I even had LibreOffice open the file; it opened in Draw, and I still wasn't able to get at the text alone.
So is there a program that can convert images containing text into just text?

What is a JSON file, how are they created, and what can read them?
 


Have you tried an "OCR" application?
I think OCR is Optical Character Recognition (perhaps).
Try some of those and let us know if it works.
 
What is a JSON file, how are they created, and what can read them?

You can see an example of a JSON file at the end of this thread: https://linux.org/threads/usb-linux-boot-ventoy.29944/post-116948

I just use Geany, a light IDE/text editor, to create them.

You can use Python etc. to convert JSON into a format your own program can use.

Another example, package.json:

Code:
{
  "name": "CI4",
  "version": "1.0.0",
  "description": "sass & gulp",
  "main": "gulpfile.js",
  "dependencies": {
    "bootstrap": "^4.5.0",
    "browser-sync": "^2.26.10",
    "gulp-autoprefixer": "^6.1.0",
    "gulp-clean": "^0.4.0",
    "gulp-clean-css": "^4.3.0",
    "gulp-concat": "^2.6.1",
    "gulp-minify-css": "^1.2.4",
    "gulp-rename": "^1.4.0",
    "gulp-sass": "^4.1.0",
    "gulp-uglify": "^3.0.2"
  },
  "devDependencies": {
    "gulp": "^4.0.2",
    "gulp-sourcemaps": "^2.6.5"
  },
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "andy",
  "license": "ISC"
}
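
To show what "convert JSON into a Python usable format" looks like, here is a minimal sketch using Python's built-in json module. It is only an illustration: `package_text` is a cut-down copy of the package.json above, and `notes` is a made-up dictionary, neither comes from a real project.

```python
import json

# A cut-down copy of the package.json example, held as plain text.
package_text = """
{
  "name": "CI4",
  "version": "1.0.0",
  "dependencies": {
    "bootstrap": "^4.5.0",
    "gulp-sass": "^4.1.0"
  }
}
"""

# Reading: JSON text becomes ordinary Python dicts, lists, strings and numbers.
package = json.loads(package_text)
print(package["name"])
print(sorted(package["dependencies"]))

# Creating: any dict/list structure can be turned back into JSON text.
# "notes" is a hypothetical example structure.
notes = {"title": "Study notes", "pages": [12, 47], "finished": False}
notes_text = json.dumps(notes, indent=2)
print(notes_text)
```

So a JSON file is just a text file holding nested name/value pairs and lists. Any text editor can create one, and most programming languages ship with something that can read it.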
 
@Nik-Ken-Bah
You might want to look at this as an opportunity to refresh your typing skills!!! LOL.

I don't know for sure but Calibre may be of some help. If you don't have it installed, I'm sure you can find it in Software Manager. {LM 20.1}

I have seen some PDF files that I could copy text from and paste into LO Writer or a text editor.

I have had trouble with Amazon stuff in the past. I think they fall into the proprietary category.

Mike Gancarz was right when he said in his book, "Linux and the Unix Philosophy", that text is more valuable than pictures, videos, audio and anything else on a computer. All those things require text to explain them.

OG TC
 
I think they fall into the proprietary category.
The ones I am looking at are Public Domain books which means that their copyright has lapsed and a lot of the books I obtain from the Internet Archive are over 100 years old.
When you look at the title of a book Bezos's mob digitised, you can buy it from Amazon. I am unaware whether you have to purchase the digitised version before you are free to do what you like with it. I do know that the print form of the book costs you, which is fair given the cost of printing and all that. But the digitised version, nah, because they do not own the copyright: it expired bloody decades ago and no longer exists.
 
I think OCR is Optical Character Recognition (perhaps)
Thanks and you are right in what you thought it meant.
So I will have a burl at it tomorrow as I think there is a program that uses OCR.
 
There are a ton of OCR tools - some of them exist online and you simply upload the picture and it spits out the results with fairly decent accuracy. So, if you don't want to install anything, you can look for free online services.
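
For anyone who would rather run OCR locally on a scanned PDF, here is a rough sketch in Python. It assumes the `pdftoppm` and `tesseract` command-line tools are installed (on Mint/Ubuntu they come from the poppler-utils and tesseract-ocr packages). The function names and the `ocr_work` folder are made up for illustration, and accuracy will depend on scan quality.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, image_prefix, dpi=300):
    """Build the command that renders each PDF page to <image_prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the command that OCRs one image and writes <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def page_number(image_path):
    """Pull the page number out of a name like page-07.png."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, work_dir="ocr_work", lang="eng"):
    """Render every page to PNG, OCR each one, and return the combined text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, work_dir / "page"), check=True)

    pages = []
    for image in sorted(work_dir.glob("page-*.png"), key=page_number):
        base = image.with_suffix("")  # tesseract adds ".txt" itself
        subprocess.run(tesseract_command(image, base, lang), check=True)
        pages.append(base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

Calling `ocr_pdf("book.pdf")` would leave one .png and one .txt per page in the work folder and hand back the whole text as a single string, ready to paste into your notes.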
 
@Nik-Ken-Bah
Nik,
I just went over to the Calibre app and it does offer the ability to change the format of PDFs, EPUBs, etc. into several other formats. RTF is another one I noticed.
I changed the Mike Gancarz book from EPUB over to txt format. The quality is very good.
OG TC
 
Have you tried an "OCR" application?
I think OCR is Optical Character Recognition (perhaps).
Try some of those and let us know if it works.
Thanks for pointing me in the right direction. I installed and used Master PDF Editor 5, as it has an OCR function, which cleaned the book up nicely.

@70 Tango Charlie , @KGIII , @captain-sensible
Appreciated the input as it helped brush the cobwebs of thought away.
Now that I know what to do, I can include American and European libraries in my searches at the Internet Archive, as so many of their books were scanned by Bezos's peons.
 
