The first example downloads the pages and writes a log of what was copied to a separate file. With this version you should be able to move the result to a new webroot on a webserver and use the site as it was.
Of course, you also need to install any services the website uses.
The second example does the same but converts the links so that you can browse the website locally from the directory.
Create a five levels deep mirror image of the GNU web site, with the same directory structure the original has, with only one try per document, saving the log of the activities to gnulog:
wget -r https://www.gnu.org/ -o gnulog
The same as the above, but convert the links in the downloaded files to point to local files, so you can view the documents off-line:
wget --convert-links -r https://www.gnu.org/ -o gnulog
I actually installed that and couldn't get the GUI version to run.
I also ran the Fedora version from the CLI and it didn't work as expected.
I suspect maybe my inputs are not correct.
I tried to reach out to the devs and also post a report of the problem on the forum, but I think their website has some issues, as each time I submit a post the page doesn't load.
[solomon@fedora ~]$ httrack
Welcome to HTTrack Website Copier (Offline Browser) 3.49-2
Copyright (C) 1998-2017 Xavier Roche and other contributors
To see the option list, enter a blank line or try httrack --help
Enter project name :math
Base path (return=/home/solomon/websites/) :math
Enter URLs (separated by commas or blank spaces) :https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/
(enter) 1 Mirror Web Site(s)
2 Mirror Web Site(s) with Wizard
3 Just Get Files Indicated
4 Mirror ALL links in URLs (Multiple Mirror)
5 Test Links In URLs (Bookmark Test)
Proxy (return=none) :
You can define wildcards, like: -*.gif +www.*.com/*.zip -*img_*.zip
Wildcards (return=none) :
You can define additional options, such as recurse level (-r<number>), separated by blank spaces
To see the option list, type help
Additional options (return=none) :
---> Wizard command line: httrack https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/ -O "math/math" --get -%v
Ready to launch the mirror? (Y/n) :Y
HTTrack3.49-2 launched on Wed, 06 Apr 2022 15:54:39 at https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/
(winhttrack https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/ -O "math/math" -qg -%v )
Information, Warnings and Errors reported for this mirror:
note: the hts-log.txt file, and hts-cache folder, may contain sensitive information,
such as username/password authentication for websites mirrored in this project
do not share these files/folders if you want these information to remain private
Mirror launched on Wed, 06 Apr 2022 15:54:39 by HTTrack Website Copier/3.49-2 [XR&CO'2014]
mirroring https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/ with the wizard help..
15:54:49 Warning: Warning: store text/html without scan: math/math/index-2.html
HTTrack Website Copier/3.49-2 mirror complete in 10 seconds : 1 links scanned, 1 files written (92455 bytes overall) [15966 bytes received at 1596 bytes/sec], 92455 bytes transferred using HTTP compression in 1 files, ratio 16%
(No errors, 1 warnings, 0 messages)
Thanks for using HTTrack!
It’s not being singled out—when someone notices one, they point it out, that’s all. If you find a post that’s off-topic, please flag it so that the mods can take a look.
There are so many general Linux forums: unix.stackexchange/stackoverflow are good ones for a start (and there are hundreds of search results for “wget download a website”, because you’re certainly not the first one wishing to do this :))
It is the example from gnu.org that you posted in your initial request, not mine.
You have to use the second command, the one with --convert-links. Otherwise, when you click on a link it sends you to the itute website. And I cannot guarantee that you get all the files. It says the default is 5 levels deep, and it tries just once per document. There are endless settings you can adjust; just check out wget --help
You will not be able to get the source code of .php files or other server-side languages; the server renders them, so you will get HTML instead.
You'd better rename the log file at the end of the command to something that reminds you which page it came from (itute-log or so).
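Putting those pieces together, a sketch of the full command. The depth and retry flags just spell out what the gnu.org example describes, and itute-log is only an illustrative name; I'm printing the command with echo so you can review it before running it (drop the echo to actually start the mirror):

```shell
# -r: recursive, -l 5: five levels deep (wget's default depth),
# -t 1: one try per document, --convert-links: rewrite links so the
# copy is browsable offline, -o itute-log: log under a name that
# identifies the site. The echo only prints the command for review;
# remove it to run the mirror for real.
echo wget -r -l 5 -t 1 --convert-links \
    -o itute-log \
    "https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/"
```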
Not sure why the web version, webhttrack, doesn't run for you. It works fine for me, but I've had it installed for years.
I tried using it with the link you provided, https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/, and it succeeded in downloading some of the site structure but not the PDFs. Maybe it's due to the way the site is laid out.
I retried using just the root, https://www.itute.com, and it is downloading the PDF documents too.
Btw, in both cases I added +*.pdf to the wildcards to make sure PDF files are downloaded.
The log file shows this is the command being used: webhttrack -q -%i -iC2 https://www.itute.com/ -O "/media/data/math" -n -%P -N0 -s2 -p7 -D -a -K0 -c8 -%k -A250000 -F "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2014], %s -->" +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js +*.pdf -ad.doubleclick.net/* -%s -%u
You could try the above on the command line, maybe with httrack if webhttrack still does not work for you, of course after suitably changing the output destination given after -O.
This could grab more content than you need, and it could take a while (hours) as the program limits download speeds to avoid overwhelming the server, but it should ultimately get you what you want.
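For the command-line variant, here is a sketch of the log's invocation adapted for httrack, keeping only the filter list (the DEST path is a placeholder you'd replace with your own output directory). Again printed with echo so you can check it first; drop the echo to run it:

```shell
# DEST is a placeholder -- point it at your own output directory,
# i.e. the path you'd otherwise give after -O.
DEST="$HOME/websites/math"

# Filters are quoted so the shell doesn't glob the * characters.
# +*.pdf makes sure the PDF files are fetched; the doubleclick rule
# excludes ad content. The echo only prints the command for review.
echo httrack "https://www.itute.com/" -O "$DEST" \
    "+*.png" "+*.gif" "+*.jpg" "+*.jpeg" "+*.css" "+*.js" "+*.pdf" \
    "-ad.doubleclick.net/*"
```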
Just wanted to share that, when I needed to do something similar a few weeks ago, I couldn't get wget to download the PDFs I was trying to retrieve. It turned out I was just too impatient when testing: I would cancel the command after two minutes when I noticed no PDFs were downloaded. When I finally let it run for 10 minutes, to my surprise, the PDFs were present. A similar experience is documented here.
The command I used was:
wget -r --level=4 --limit-rate=100k --wait=2 --random-wait -A pdf "$url"
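Since the rate limit and waits make runs slow, rather than cancelling early it helps to check whether PDFs are actually landing. A quick way, run from the directory wget is downloading into:

```shell
# Count the PDFs retrieved so far; a growing number on repeated runs
# means the mirror is working even if it looks stalled.
find . -type f -name '*.pdf' | wc -l
```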