I did that sort of thing 15 or so years ago. But websites were generally smaller and did not contain much, if any, JavaScript back then. YMMV with more modern "dynamic" websites.
The first example downloads the pages and writes a log of what was copied to a separate file. With this version you should be able to move the result to a new webroot on a webserver and use the site as it was.
Of course, you need to install all the services the website uses.
The second example does the same but converts the links, so that you can read the website locally from the directory.
Create a five levels deep mirror image of the GNU web site, with the same directory structure the original has, with only one try per document, saving the log of the activities to gnulog:
wget -r https://www.gnu.org/ -o gnulog
The same as the above, but convert the links in the downloaded files to point to local files, so you can view the documents off-line:
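The corresponding command from the manual should be:
wget --convert-links -r https://www.gnu.org/ -o gnulog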
Thanks, I should be clearer. Basically, I need to download every maths exam on this page.
It would take many hours to manually click each one, so I thought I could use a command like wget or a web crawler to download the website contents of that page.
So, according to your syntax, I could try the following command.
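Something like this, I guess, with the exam page URL swapped in (not sure if these options are right):
wget -r https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/ -o gnulog  # just my guess at the adaptation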
As @glb indicates, wget should work fine for a "static" web page, but more recently web pages have dynamic loads based on embedded JavaScript, in which case you'd get a copy of the script but not all the math exam pages you seek. If each exam has a link off the top-level page, then the approach @ilikelinux shows should work, but again those exam pages themselves may be dynamic.
I installed that, actually, and couldn't get the GUI version to run.
I also ran the Fedora version of the CLI and it didn't actually work as expected.
I suspect maybe my inputs are not correct.
I tried to reach out to the devs and also post a report of the problem on their forum, but I think their website has some issues, as each time I submit a post it doesn't load the page.
[solomon@fedora ~]$ httrack
Welcome to HTTrack Website Copier (Offline Browser) 3.49-2
Copyright (C) 1998-2017 Xavier Roche and other contributors
To see the option list, enter a blank line or try httrack --help
Enter project name :math
Base path (return=/home/solomon/websites/) :math
Enter URLs (separated by commas or blank spaces) :https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/
Action:
(enter) 1 Mirror Web Site(s)
2 Mirror Web Site(s) with Wizard
3 Just Get Files Indicated
4 Mirror ALL links in URLs (Multiple Mirror)
5 Test Links In URLs (Bookmark Test)
0 Quit
: 3
Proxy (return=none) :
You can define wildcards, like: -*.gif +www.*.com/*.zip -*img_*.zip
Wildcards (return=none) :
You can define additional options, such as recurse level (-r<number>), separated by blank spaces
To see the option list, type help
Additional options (return=none) :
---> Wizard command line: httrack https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/ -O "math/math" --get -%v
Ready to launch the mirror? (Y/n) :Y
HTTrack3.49-2 launched on Wed, 06 Apr 2022 15:54:39 at https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/
(winhttrack https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/ -O "math/math" -qg -%v )
Information, Warnings and Errors reported for this mirror:
note: the hts-log.txt file, and hts-cache folder, may contain sensitive information,
such as username/password authentication for websites mirrored in this project
do not share these files/folders if you want these information to remain private
Mirror launched on Wed, 06 Apr 2022 15:54:39 by HTTrack Website Copier/3.49-2 [XR&CO'2014]
mirroring https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/ with the wizard help..
15:54:49 Warning: Warning: store text/html without scan: math/math/index-2.html
HTTrack Website Copier/3.49-2 mirror complete in 10 seconds : 1 links scanned, 1 files written (92455 bytes overall) [15966 bytes received at 1596 bytes/sec], 92455 bytes transferred using HTTP compression in 1 files, ratio 16%
(No errors, 1 warnings, 0 messages)
Done.
Thanks for using HTTrack!
*
[solomon@fedora ~]$
This is sort of off-topic for this forum because it really isn't anything Fedora-specific. You will have better luck on a general Linux forum with this query.
Respectfully, I see quite a fair bit of off-topic posts and discussions on Ask Fedora very often; I'm not sure why mine gets singled out. Is there any Linux forum you could recommend?
It's not being singled out; when someone notices one, they point it out, that's all. If you find a post that's off-topic, please flag it so that the mods can take a look.
There are so many general Linux forums: unix.stackexchange/stackoverflow are good ones for a start (and there are hundreds of search results for "wget download a website", because you're certainly not the first one wishing to do this :))
It is the example from gnu.org that you posted in your initial request, not mine.
You have to take the second example, with --convert-links. Otherwise, when you click on a link it sends you to the itute website. And I cannot guarantee that you will get all the files. The default is 5 levels deep, and it tries each document just once. There are endless settings you can adjust; just check out wget --help.
You will not be able to get the source code of .php or other server-side language files; they will be rendered to HTML.
You had better rename the log file given at the end of the command to something that reminds you which page it is from (itute-log or so).
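Putting that together, the command would presumably be along these lines:
wget --convert-links -r https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/ -o itute-log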
Not sure why the web version, webhttrack, doesn't run for you. It works fine for me, but I've had it installed for years.
I tried using it with the link you provided, https://www.itute.com/download-free-vce-maths-resources/free-maths-exams/, and it succeeded in downloading some of the site structure but not the PDFs. Maybe it's due to the way the site is laid out.
I retried using just the root https://www.itute.com and it is downloading the PDF documents too.
Btw, in both cases I added +*.pdf to the wildcards to make sure PDF files are downloaded.
The log file shows this is the command being used: webhttrack -q -%i -iC2 https://www.itute.com/ -O "/media/data/math" -n -%P -N0 -s2 -p7 -D -a -K0 -c8 -%k -A250000 -F "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2014], %s -->" +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js +*.pdf -ad.doubleclick.net/* -%s -%u
You could try the above on the command line, maybe with httrack if webhttrack still does not work for you, of course after suitably changing the output destination given after -O.
This could grab more content than you need, and it could take a while (hours) as the program limits download speeds to avoid overwhelming the server, but it should ultimately get you what you want.
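For example, a pared-down run keeping just the URL, the filters, and the output directory from the command above might look like this (the path after -O is only a placeholder):
httrack https://www.itute.com/ -O "/path/to/output" +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js +*.pdf -ad.doubleclick.net/* -v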
Thanks so much. I didn't realise this task would be so complex. When I download customers' website updates, I do it through FTP access and it's quite easy.
I downloaded the official Fedora version and also tried the binary, and couldn't get it to run. I will try again, thanks for the insights.
Are you using the command-line version from the Fedora repo?
I'm using the httrack package from the Fedora repo, which provides both the httrack and webhttrack commands/programs.
It's not so complex, but there are many configurable options. I found I did not need to play around with them much to download entire websites, but for partial websites it seems trickier.
If you post the errors you're getting when installing or running httrack, it will be easier to sort it out. Definitely a useful tool to have.
Just wanted to share that, when I needed to do something similar a few weeks ago, I couldn't get wget to download the PDFs I was trying to retrieve. It turned out I was too impatient when testing: I would cancel the command after two minutes when I noticed no PDFs were downloaded. I finally let it run for 10 minutes and, to my surprise, the PDFs were present. Similar experience documented here.
The command I used was:
wget -r --level=4 --limit-rate=100k --wait=2 --random-wait -A pdf "$url"
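For reference, here is the same command with each option annotated (behaviour unchanged, just my reading of the flags):
# -r                       recurse into links found on the pages
# --level=4                limit the recursion to 4 levels deep
# --limit-rate=100k        cap the download speed at about 100 KB/s
# --wait=2 --random-wait   pause roughly 2 seconds (randomised) between requests
# -A pdf                   keep only files ending in .pdf; wget still fetches the HTML pages to find links, then deletes them
wget -r --level=4 --limit-rate=100k --wait=2 --random-wait -A pdf "$url"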