How do I back up every page in a website?

Fancy · Mar 22, 2020

I need to make an off-line copy of a wiki, with the following requirements.

1. Nothing outside wiki.website.com is copied.
2. Only webpages, images, css, and javascript files are copied.
3. It must be time-delayed to prevent harm to the website, something like a 10 to 30 second delay between fetching each webpage.

I vaguely know that tools like wget or curl would be the best way to do this, but I don't know where to go from there.

Vrai · Mar 22, 2020

wget would seem to be the answer.
See "man wget".
It has options to lighten the server load. From the man page:
"
-w seconds
--wait=seconds
Wait the specified number of seconds between the retrievals. Use
of this option is recommended, as it lightens the server load by
making the requests less frequent. Instead of in seconds, the time
can be specified in minutes using the "m" suffix, in hours using
"h" suffix, or in days using "d" suffix.
"

Oh, by the way, are you using Linux to do this?
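
For the 10 to 30 second requirement, a small sketch of how the numbers work out. According to the wget manual, adding --random-wait varies each pause between 0.5 and 1.5 times the --wait value, so --wait=20 should give roughly 10 to 30 seconds. The host is the placeholder from the question, and the command is only printed here, not run:

```shell
#!/bin/sh
# With --random-wait, wget varies each pause between 0.5x and 1.5x of --wait.
WAIT=20
MIN_DELAY=$((WAIT / 2))        # shortest expected pause, in seconds
MAX_DELAY=$((WAIT * 3 / 2))    # longest expected pause, in seconds
echo "delay range: ${MIN_DELAY}s to ${MAX_DELAY}s"

# Print (do not run) a matching command; wiki.website.com is a placeholder host.
WAIT_CMD="wget --recursive --wait=$WAIT --random-wait https://wiki.website.com/"
echo "$WAIT_CMD"
```

Recursive wget stays on the starting host unless --span-hosts is given, which should cover the "nothing outside wiki.website.com" requirement.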

Fancy · Mar 22, 2020

Yes, Ubuntu 18.04.

edit

@Vrai, I'm getting a weird error and undesired behavior.

Here's my script (not the real site for which I want to make a local backup):

Code:

wget --limit-rate=20k --tries=100 ‐w25 --random-wait --no-clobber \
--user-agent="Mozilla/5.0 (Android 8.0.0; Mobile; rv:61.0) Gecko/61.0 Firefox/68.0" \
--retry-connrefused \
-adjust-extension --recursive --level=7 \
--reject=bmp,png,gif,jpg,jpeg,zip,rar,7z,7zip,tar,gz,php,txt,pdf,js,css,ico \
--exclude-directories='https://SOMESITE.com/wiki/tools/*','https://SOMESITE.com/wiki/Special:*' \
--include-directories='https://SOMESITE.com/wiki/*' \
https://SOMESITE.com/wiki/Main_Page

Here's the file djust-extension, which showed up in the directory from which I ran the script above.

Code:

--2020-03-23 03:06:52--  http://xn--w25-pn0aa/
Resolving xn--w25-pn0aa (xn--w25-pn0aa)... failed: Name or service not known.
wget: unable to resolve host address ‘xn--w25-pn0aa’
--2020-03-23 03:06:52--  https://SOMESITE.com/wiki/Main_Page
Resolving SOMESITE.com (SOMESITE.com)... 168.420.64.140
Connecting to SOMESITE.com (SOMESITE.com)|168.420.64.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘SOMESITE.com/wiki/Main_Page’

     0K .......... .......... ....                             20.2K=1.2s

2020-03-23 03:06:54 (20.2 KB/s) - ‘SOMESITE.com/wiki/Main_Page’ saved [24858]

FINISHED --2020-03-23 03:06:54--
Total wall clock time: 1.9s
Downloaded: 1 files, 24K in 1.2s (20.2 KB/s)

1. It's trying to resolve the wait parameter (-w25) as a URL. Why is it doing this and how do I make it stop?

2. The undesired behavior is that it's only downloading:

https://SOMESITE.com/wiki/Main_Page

I want it to download the entire site within:
https://SOMESITE.com/wiki/*

... but excluding:

https://SOMESITE.com/wiki/Tools/*

https://SOMESITE.com/wiki/Special:*

I don't know why it's doing this and I doubt the proprietors of SOMESITE.com appreciate me debugging my script against their site, so if someone can tell me which parameter is doing this I would really appreciate it.

edit

Issue #1 is some sort of unicode character thing, retyping everything between the beginning of the script to the end of -w25 fixed it somehow.
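
For anyone hitting the same thing: the host name xn--w25-pn0aa in the log uses the xn-- prefix of an internationalized (Punycode) name, which suggests the dash in front of w25 was a typographic hyphen rather than the ASCII "-", so wget read it as a URL instead of an option. A minimal sketch for finding such characters, assuming a script called mirror.sh (a hypothetical name; the sketch writes its own small sample file so it can be run as-is):

```shell
#!/bin/sh
# Build a sample script: line 1 has a non-ASCII hyphen (U+2010, bytes E2 80 90)
# before "w25", line 2 has a plain ASCII "-". mirror.sh is a hypothetical name.
DEMO_DIR=$(mktemp -d)
printf 'wget --tries=100 \342\200\220w25 --random-wait\n' > "$DEMO_DIR/mirror.sh"
printf 'wget --tries=100 -w25 --random-wait\n' >> "$DEMO_DIR/mirror.sh"

# In the C locale, bytes above 0x7F are neither printable nor whitespace,
# so this lists the line numbers that contain non-ASCII bytes.
BAD_LINES=$(LC_ALL=C grep -n '[^[:print:][:space:]]' "$DEMO_DIR/mirror.sh" | cut -d: -f1)
echo "lines with non-ASCII bytes: $BAD_LINES"

rm -r "$DEMO_DIR"
```

Characters like this tend to sneak in when a command is copied from a web page or word processor that "prettifies" dashes and quotes.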

Vrai · Mar 23, 2020

Fancy said:
2. The undesired behavior is that it's only downloading:
https://SOMESITE.com/wiki/Main_Page
I want it to download the entire site within:
https://SOMESITE.com/wiki/*

... but excluding:
https://SOMESITE.com/wiki/Tools/*
https://SOMESITE.com/wiki/Special:*
I don't know why it's doing this

Just throwing out a quick possibility here - some sites do not allow downloading of the entire site.

Have you tried with any different sites?
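
Two other things in the script may be worth checking, offered as guesses rather than a diagnosis. First, per the wget manual, --include-directories and --exclude-directories take comma-separated directory paths such as /wiki, not full URLs, so the patterns as written may match nothing and every link after the start page gets skipped. Second, -adjust-extension with a single dash is parsed as -a djust-extension, that is --append-output=djust-extension, which would explain the log file of that name. A sketch with those points changed; SOMESITE.com is the placeholder from the post, the shorter reject list and the --reject-regex pattern are assumptions, and DRY_RUN is a hypothetical switch so that nothing is fetched by default:

```shell
#!/bin/sh
# Sketch only: not tested against any real site. SOMESITE.com is the placeholder
# from the post; the reject list and the regex are assumptions.
set -- \
  --limit-rate=20k --tries=100 --wait=25 --random-wait --no-clobber \
  --retry-connrefused \
  --adjust-extension --recursive --level=7 \
  --reject=zip,rar,7z,tar,gz,pdf \
  --include-directories=/wiki \
  --exclude-directories=/wiki/tools,/wiki/Tools \
  --reject-regex='/wiki/Special:' \
  https://SOMESITE.com/wiki/Main_Page

# DRY_RUN is a hypothetical safety switch: print the command unless DRY_RUN=0.
if [ "${DRY_RUN:-1}" = "1" ]; then
  FIXED_CMD="wget $*"
  echo "$FIXED_CMD"
else
  wget "$@"
fi
```

Special: pages are files inside /wiki rather than a directory of their own, which is why the sketch uses --reject-regex for them instead of a directory exclude. The original reject list also included php, png, js and css, which would drop index.php style page URLs as well as the images and stylesheets the backup is supposed to keep.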

Fancy · Mar 23, 2020

How would I know if another site I test it on doesn't also disallow mirroring?

Given that I'm disguising my traffic as a mobile browser, using very little bandwidth, and downloading pages excruciatingly slowly, it's unlikely that any automated system would detect it.

Here's https://SOMESITE.com/robots.txt:

Code:

User-agent: *
Disallow: /FTP/
Disallow: /paste/

User-agent: Mediapartners-Google*
Disallow: /

User-agent: Googlebot-Mobile
Disallow: /

edit

I got it to work, but it's interpreting wiki web page names as directories and all kinds of things. Pretty useless piece of software when it comes to modern websites.

Vrai · Mar 24, 2020

Fancy said:
How would I know if another site I test it on doesn't also disallow mirroring?

Ask the site's WebMaster ?

Fancy · Mar 24, 2020

The site's webmaster is not known for his mental stability or for reacting well to reasonable requests. Asking this would evolve into a weeks-long struggle session and a public mental breakdown on social media, as well as all his brown-nosers bad-mouthing me to fellow members of their cliques and getting me struggle-sessioned on their sites as well. These people do not care in the slightest about reverse-engineering, but unfortunately they are impossible to remove from the hobby so I must route around them.

Given the probability of the same occurring here and the uselessness of wget, consider this thread unsolved and closed.

---

For anyone reading this in the future, I gave up on making a personal backup due to wget misbehaving with regard to certain characters in URLs.

If it runs into a "/" in a URL, it creates a new directory named after the text between that slash and the previous one, and uses the remainder of the URL as the file name. For example:

https://www.somesite.com/wiki/square_matrix/vector

... saves as:
~/wiki/square_matrix/vector.html

It also treats whatever follows a dot as a file extension, and it does not seem to treat files saved that way as HTML. Example:

https://www.somesite.com/wiki/Main.bin

... saves as:
~/wiki/Main.bin

... and none of the URLs located in Main.bin (a bunch of disassembled routines in this case) are saved. This is undesired behavior.

It does something similar with other characters such as "%", which makes no sense, as it should at least look for something at the beginning of the file to indicate it is HTML before it does anything else.
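
wget's own --adjust-extension option is meant to append .html to pages served as text/html (in the script posted earlier it was typed with a single dash, so it may never have been active). Failing that, a post-processing pass that looks at the start of each file can be sketched as follows. The directory tree and file names are made up for the demonstration, and the HTML check is a crude heuristic, not what wget does internally:

```shell
#!/bin/sh
# Demo tree standing in for a wget download; all names are hypothetical.
DEMO=$(mktemp -d)
mkdir -p "$DEMO/wiki/square_matrix"
printf '<!DOCTYPE html><html><body>vector</body></html>\n' > "$DEMO/wiki/square_matrix/vector"
printf '<html><body>routines</body></html>\n' > "$DEMO/wiki/Main.bin"
printf 'plain text, not a page\n' > "$DEMO/wiki/notes"

# Append .html to any file whose first 512 bytes look like HTML.
find "$DEMO" -type f ! -name '*.html' | while IFS= read -r f; do
  if head -c 512 "$f" | grep -qi -e '<!doctype html' -e '<html'; then
    mv "$f" "$f.html"
  fi
done

# Count how many files now carry the .html suffix.
RENAMED=$(find "$DEMO" -type f -name '*.html' | wc -l | tr -d ' ')
echo "renamed $RENAMED file(s)"
rm -r "$DEMO"
```

Renaming after the fact does not update the links inside the pages; wget's --convert-links is the option intended for that when used together with --adjust-extension.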

I tried several different searches with various things on different search engines, none of which produced information on how to undo this. I uncovered lots of articles which are 1 for 1 copies of one another for supposed advanced wget usage. I already wasted a day trying to get wget to work, so I gave up and decided to manually download what I need and do all the copying and pasting by hand.

Thanks for trying to help me.

Vrai · Mar 24, 2020

Just came across this item I had saved some years ago.
May be something here that can help.
A number of good links in the article to check.

How to Use wget, the Ultimate Command Line Downloading Tool

Newer isn’t always better, and the wget command is proof. First released back in 1996, this application is still one of the best download managers on the planet. Whether you want to download a single file, an entire folder, or even mirror an entire website, wget lets you do it with just a few...

www.howtogeek.com
