Using Wget to download entire website content (and running httrack from Fedora repo)

Hi there,

I am looking for the correct wget command to download the entire content of a website.

I have read the documentation but cannot see a simple way to do this:

Has anybody done this sort of thing before?

I need all the data from a website as the server keeps going offline.

I did that sort of thing 15 or so years ago. But websites were generally smaller and did not contain much if any javascript back then. YMMV with more modern “dynamic” websites.

Check out the “advanced example” section.

The first example downloads the site and logs what was copied to a separate file. With this version you should be able to move the copy to a new webroot on a web server and use the site as it was; of course, you would need to install all the services the website uses.

The second example does the same, but converts the links so that you can browse the website locally from the directory.

  • Create a five levels deep mirror image of the GNU web site, with the same directory structure the original has, with only one try per document, saving the log of the activities to gnulog:

wget -r -t 1 https://www.gnu.org/ -o gnulog

  • The same as the above, but convert the links in the downloaded files to point to local files, so you can view the documents off-line:

wget --convert-links -r -t 1 https://www.gnu.org/ -o gnulog
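As a sketch, the same mirror can be written with the implicit settings spelled out long-form (the URL is the GNU site from the manual's example; the command is only echoed here as a dry run):

```shell
# The second example above, with the defaults written out:
#   --recursive      follow links (same as -r)
#   --level=5        recursion depth (wget's default)
#   --tries=1        one try per document
#   --convert-links  rewrite links so pages work offline
cmd="wget --recursive --level=5 --tries=1 --convert-links -o gnulog https://www.gnu.org/"
echo "$cmd"   # dry run; paste and execute the command once it looks right
```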

Hi there,

Thanks, I should have been clearer. Basically, I need to download every maths exam on this page.

It would take many hours to manually click each one, so I thought I could use a command like wget or a web crawler to download that page's contents.

So, according to your syntax, I could try the following command:

wget -r -o gnulog

WebHTTrack will probably do what you want:

It’s in the Fedora repo too: dnf info httrack

I’ve used it successfully for similar needs.

As @glb indicates, wget should work fine for a ‘static’ web page, but more recently web pages have dynamic loads based on embedded javascript, in which case you’d get a copy of the script but not all the math exam pages you seek. If each exam has a link off the top level page, then the approach @ilikelinux shows should work, but again those exam pages themselves may be dynamic.

Thanks yes I did try that one.

I actually installed it and couldn’t get the GUI version to run.

I also ran the Fedora CLI version, and it didn’t work as expected.

I suspect my inputs may not be correct.

I tried to reach out to the devs and post a report of the problem on their forum, but their website seems to have issues: each time I submit a post, the page doesn’t load.

[solomon@fedora ~]$ httrack

Welcome to HTTrack Website Copier (Offline Browser) 3.49-2
Copyright (C) 1998-2017 Xavier Roche and other contributors
To see the option list, enter a blank line or try httrack --help

Enter project name :math

Base path (return=/home/solomon/websites/) :math

Enter URLs (separated by commas or blank spaces) :

(enter)	1	Mirror Web Site(s)
	2	Mirror Web Site(s) with Wizard
	3	Just Get Files Indicated
	4	Mirror ALL links in URLs (Multiple Mirror)
	5	Test Links In URLs (Bookmark Test)
	0	Quit
: 3

Proxy (return=none) :

You can define wildcards, like: -*.gif +www.*.com/*.zip -*img_*.zip
Wildcards (return=none) :

You can define additional options, such as recurse level (-r<number>), separated by blank spaces
To see the option list, type help
Additional options (return=none) :

---> Wizard command line: httrack  -O "math/math" --get  -%v  

Ready to launch the mirror? (Y/n) :Y

HTTrack3.49-2 launched on Wed, 06 Apr 2022 15:54:39 at
(winhttrack -O "math/math" -qg -%v )

Information, Warnings and Errors reported for this mirror:
note:	the hts-log.txt file, and hts-cache folder, may contain sensitive information,
	such as username/password authentication for websites mirrored in this project
	do not share these files/folders if you want these information to remain private

Mirror launched on Wed, 06 Apr 2022 15:54:39 by HTTrack Website Copier/3.49-2 [XR&CO'2014]
mirroring with the wizard help..
15:54:49	Warning: 	Warning: store text/html without scan: math/math/index-2.html

HTTrack Website Copier/3.49-2 mirror complete in 10 seconds : 1 links scanned, 1 files written (92455 bytes overall) [15966 bytes received at 1596 bytes/sec], 92455 bytes transferred using HTTP compression in 1 files, ratio 16%
(No errors, 1 warnings, 0 messages)
Thanks for using HTTrack!
[solomon@fedora ~]$ 

Right, thanks, I see.

So it seems like wget is not the right command for downloading media from this type of site, only the actual HTML.

I have read that wget works for this type of thing but I’m unsure of the approach; the wget documentation is quite heavy.

This is sort of off-topic for this forum because it really isn’t anything Fedora specific. You will have better luck on a general Linux forum with this query.

Respectfully, I see quite a few off-topic posts and discussions on Ask Fedora; I’m not sure why mine gets singled out. Is there a Linux forum you could recommend?

It’s not being singled out—when someone notices one, they point it out, that’s all. If you find a post that’s off-topic, please flag it so that the mods can take a look.

There are so many general Linux forums: unix.stackexchange/stackoverflow are good ones for a start (and there are hundreds of search results for “wget download a website”, because you’re certainly not the first one wishing to do this :))

No worries, thanks, I will check that out. I’ll also try to be more Fedora-related in future; I thought wget was an RHEL-exclusive command.

That’s not the case. wget is available on most Linux distributions, including Fedora. (It’s a commonly used utility)

It is the example from the documentation you posted in your initial request, not mine.

You have to use the second example, with --convert-links. Otherwise, when you click on a link, it sends you to the itute website. And I cannot guarantee that you get all the files: the default recursion depth is 5 levels, and it tries each document just once. There are endless settings you can adjust; just check out wget --help.

You will not be able to get the source code of .php or other server-side files; the server renders them to HTML before wget sees them.

You had better rename the log file at the end of the command to something that reminds you which page it came from (itute-log, say).
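Putting that advice together, an invocation for this case might look like the sketch below; the URL is a placeholder for the real exam page, and itute-log is just the log name suggested above (the command is only echoed as a dry run):

```shell
url="https://www.example.com/maths-exams/"   # placeholder; substitute the real page

# -r: recurse (default depth 5); --convert-links: make links work offline;
# -o itute-log: name the log after the site so you remember where it came from
cmd="wget -r --convert-links -o itute-log $url"
echo "$cmd"   # dry run; execute once the URL is filled in
```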

Not sure why the web version, webhttrack, doesn’t run for you. It works fine for me, but I’ve had it installed for years.

I tried using it with the link you provided and it succeeded in downloading some of the site structure, but not the PDFs. Maybe it’s due to the way the site is laid out.

I retried using just the site root and it is downloading the PDF documents too.

Btw, in both cases I added +*.pdf to the wildcards to make sure PDF files are downloaded.

The log file shows this is the command being used:
webhttrack -q -%i -iC2 -O "/media/data/math" -n -%P -N0 -s2 -p7 -D -a -K0 -c8 -%k -A250000 -F "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2014], %s -->" +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js +*.pdf* -%s -%u

You could try the above on the command line, maybe with httrack if webhttrack still does not work for you, of course after suitably changing the output destination given after -O.

This could grab more content than you need, and it could take a while (hours) as the program limits download speeds to avoid overwhelming the server, but it should ultimately get you what you want.
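For reference, a much shorter httrack invocation along the same lines may be enough; this is a sketch with a placeholder URL and output path, using the +*.pdf filter mentioned above (echoed as a dry run):

```shell
url="https://www.example.com/maths-exams/"   # placeholder; substitute the real page
out="$HOME/websites/math"                    # placeholder output directory

# -O: output path; "+*.pdf": also keep PDF links; -%v: verbose progress
cmd="httrack $url -O $out +*.pdf -%v"
echo "$cmd"   # dry run; execute once the URL and output path look right
```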

Thanks so much. I didn’t realise this task would be so complex. When I download customers’ website updates, I do it through FTP access and it’s quite easy.

I installed the official Fedora version and also tried the binary, and couldn’t get it to run. I will try again; thanks for the insights.

Are you using the commandline version from Fedora repo?

I’m using httrack package from the Fedora repo which provides both the httrack and webhttrack commands/programs.

It’s not so complex but there are many configurable options. I found I did not need to play around with them much to download entire websites, but for partial websites it seems trickier.

If you post the errors you’re getting with installing or running httrack it will be easier to sort it out. Definitely a useful tool to have.

Just wanted to share that, when I needed to do something similar a few weeks ago, I couldn’t get wget to download the PDFs I was trying to retrieve. It turned out I was too impatient when testing: I would cancel the command after two minutes when I noticed no PDFs had been downloaded. I finally let it run for 10 minutes and, to my surprise, the PDFs were present. There’s a similar experience documented here.

The command I used was:

wget -r --level=4 --limit-rate=100k --wait=2 --random-wait -A pdf "$url"

Many variations are possible, of course.
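One variation worth knowing, if the exams all hang off a single directory, is adding --no-parent so the crawl never climbs above the starting page. All the flags here are standard wget options; the URL is a placeholder, and the command is echoed as a dry run:

```shell
url="https://www.example.com/maths-exams/"   # placeholder starting page

# --no-parent: never ascend above the starting directory
# -A pdf: accept only PDFs (plus the HTML needed to find them)
# --limit-rate/--wait/--random-wait: be gentle with the server
cmd="wget -r --level=4 --no-parent -A pdf --limit-rate=100k --wait=2 --random-wait $url"
echo "$cmd"   # dry run; execute once the URL is filled in
```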