Wget versus wget2 | how to fix old scripts

An example here:

Problem in wget < > Solution / Workaround in wget2

If anyone wants to share difficulties and solutions related to the wget > wget2 change:
I made the top post a wiki, so collaboration is possible. It might be a good idea to explain in a separate reply which difficulties you faced, and to list in the first post the old task in wget > the new task in wget2.

I am not sure that there is really an issue at present. On my system (f41) I see this

$ ls -l /usr/bin/wget*
lrwxrwxrwx. 1 root root      5 Dec  1 18:00 /usr/bin/wget -> wget2
-rwxr-xr-x. 1 root root 191952 Dec  1 18:00 /usr/bin/wget2

which would tend to indicate that usage should be almost identical. The link keeps existing scripts working across the command name change, while giving users time to update those scripts to call the new command by name.
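To check which implementation a script will actually get, something like the following sketch may help. It assumes the first line of `wget --version` starts with `GNU Wget2 ` for wget2 and `GNU Wget ` for the classic wget; the helper name `classify_wget_banner` is made up for this example.

```shell
#!/bin/sh
# classify_wget_banner: map the first line of `wget --version` to a flavor name.
# Assumption: wget2 prints "GNU Wget2 <version> ...", classic wget prints "GNU Wget <version> ...".
classify_wget_banner() {
  case "$1" in
    "GNU Wget2 "*) echo wget2 ;;
    "GNU Wget "*)  echo wget1 ;;
    *)             echo unknown ;;
  esac
}

if command -v wget >/dev/null 2>&1; then
  # Only the first line of the version output is inspected.
  banner=$(wget --version 2>/dev/null | head -n 1)
  echo "wget resolves to: $(command -v wget) ($(classify_wget_banner "$banner"))"
else
  echo "wget is not installed"
fi
```

A script could branch on the result and pick wget2-specific options only when the flavor is `wget2`.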

Obviously, if the usage has been significantly altered so the command line output is not compatible with earlier usage that should be an issue and reported as a bug. Adding options should work; altering existing options to function differently should never be the case when replacing the same command.

Unfortunately this is not the case. As with dnf, the changes should have been made in smaller steps and in parallel with the old tool. If you check the link on Gitlab, you will see that.

Gitlab summary

Wget2 Introduction

The development of Wget2 started and everybody is invited to contribute, test, discuss, etc.
The codebase is hosted in the ‘wget2’ branch of wget’s git repository, on Gitlab and on Github - all will be regularly synced.

Wget2 on Savannah (currently just has wget infos)

Wget2 on Gitlab

Wget2 on Github

The idea is to have a fresh and maintainable codebase with features like multithreaded downloads, HTTP2, OCSP, HSTS, Metalink, IDNA2008, Public Suffix List, Multi-Proxies, Sitemaps, Atom/RSS Feeds, compression (gzip, deflate, lzma, bzip2), support for local filenames, etc.
Some of these features have been built into Wget in the meantime, but some others are really hard to implement in the old codebase.

Most of the functionality is exposed via a library API (libwget), to allow external programs to make use of it. E.g. have a look at examples/print_css_urls.c - just a few lines of C to parse and print out all URLs from a CSS file.

Wget2 will remain its own executable, separate from Wget.
So you can install and test Wget2 without endangering your existing architecture and scripts.

What is missing

  • FTP(S) support
  • WARC support
  • Some Wget options are missing (almost all are WARC or FTP related)
  • API documentation incomplete

New options

--check-hostname Check the server’s certificate’s hostname. (default: on)
--chunk-size Download large files in multithreaded chunks. (default: 0 (=off))
Example: wget --chunk-size=1M
--cookie-suffixes Load public suffixes from file. They prevent ‘supercookie’ vulnerabilities.
--cut-file-get-vars Cut HTTP GET vars from file names. (default: off)
--cut-url-get-vars Cut HTTP GET vars from URLs. (default: off)
--dns-cache-preload File to be used to preload the DNS cache.
--follow-sitemaps Follow the URLs listed in the sitemaps of robots.txt. (default: on)
--force-atom Treat input file as Atom Feed. (default: off)
--force-css Treat input file as CSS. (default: off)
--force-metalink Treat input file as Metalink. (default: off)
--force-rss Treat input file as RSS Feed. (default: off)
--force-sitemap Treat input file as Sitemap. (default: off)
--fsync-policy Use fsync() to wait for data being written to the physical layer. (default: off)
--max-threads Max. concurrent download threads. (default: 5)
--metalink Parse and follow metalink files and don’t save them. (default: on)
--ocsp Use OCSP server access to verify server’s certificate. (default: on)
--ocsp-file Set file for OCSP caching. (default: .wget_ocsp)
--ocsp-date Check if OCSP response is too old. (default: on)
--ocsp-nonce Allow nonce checking when verifying OCSP response. (default: on)
--ocsp-server Set OCSP server address. (default: OCSP server given in certificate)
--ocsp-stapling Use OCSP stapling to verify the server’s certificate. (default: on)
--hsts Use HTTP Strict Transport Security (HSTS). (default: on)
--hsts-file Set file for HSTS caching. (default: ~/.wget-hsts)
--hsts-preload Preload HTTP Strict Transport Security (HSTS) data via libhsts.
--hsts-preload-file Set name for the HSTS Preload file (DAFSA format).
--http2 Use HTTP/2 protocol if possible. (default: on)
--http2-request-window Max. number of parallel streams per HTTP/2 connection. (default: 30)
--http-proxy Set HTTP proxy/proxies, overriding environment variables.
--https-enforce Use secure HTTPS instead of HTTP.
--https-proxy Set HTTPS proxy/proxies, overriding environment variables.
--input-encoding Character encoding of the file contents read with --input-file. (default: local encoding)
--random-file File to be used as source of random data.
--retry-on-http-status Specify a list of HTTP statuses for which the download will be retried.
--robots Respect robots.txt standard for recursive downloads. (default: on)
--save-content-on Specify a list of response codes whose response body should be saved on error status.
--stats-dns Print DNS stats. (default: off)
--stats-ocsp Print OCSP stats. (default: off)
--stats-server Print server stats. (default: off)
--stats-site Print site stats. (default: off)
--stats-tls Print TLS stats. (default: off)
--tcp-fastopen Enable TCP Fast Open (TFO). (default: on)
--tls-false-start Enable TLS False Start (needs GnuTLS 3.5+). (default: on)
--tls-resume Enable TLS Session Resumption. (default: on)
--tls-session-file Set file for TLS Session caching. (default: ~/.wget-session)

## Different behavior of Wget2

* new 'include' statement for config files, e.g. to load /etc/wget/conf.d/*.conf
* --input-file - (reading URLs from stdin) starts downloading with the first URL, so that slow URL generators can feed Wget2
* check HTTP 'ETag' to avoid parsing duplicates
* use HTTP 'Accept-Encoding': gzip, deflate, lzma, bzip2, br
* CLI string options can be set to NULL by prepending a --no-, e.g. --no-user-agent
* boolean CLI options can all be set to true or false
* $WGETRC is not read so far
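
Two of the points above (reading URLs from stdin and unsetting a string option with --no-) can be tried with a small sketch like this one. The `generate_urls` function, the `example.com` URLs, and the `RUN_DOWNLOAD` / `URL_DELAY` variables are placeholders invented for the example; the download only runs if wget2 is installed and `RUN_DOWNLOAD=1` is set.

```shell
#!/bin/sh
# Hypothetical slow URL generator: emits one placeholder URL at a time,
# pausing between them (URL_DELAY seconds, default 1).
generate_urls() {
  for n in 1 2 3; do
    printf 'https://example.com/file%s.txt\n' "$n"
    sleep "${URL_DELAY:-1}"
  done
}

# Only attempt the download when the caller opts in and wget2 is available.
if [ "${RUN_DOWNLOAD:-0}" = 1 ] && command -v wget2 >/dev/null 2>&1; then
  # --input-file=- reads URLs from stdin; per the list above, wget2 should
  # start fetching as soon as the first URL arrives.
  # --no-user-agent sets the user-agent string option to NULL.
  generate_urls | wget2 --input-file=- --no-user-agent
fi
```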

## Differing CLI options Wget/Wget2

| Option | Wget1.x | Wget2 | Comment |
| --- | --- | --- | --- |
|--config|✓|✓|Same as --config-file, for compatibility with Wget1.x|
|--egd-file|✓|✓|A Noop for compatibility (GnuTLS can be compiled/configured to use EGD)|
|--glob|✓||Wget1.x has --no-glob to turn off FTP globbing|
|--if-modified-since|✓||Wget2 uses If-Modified-Since when timestamping is turned on|
|--input-metalink|✓|(✓)|Wget2 uses a combination of --input-file and --force-metalink|
|--metalink-over-http|✓||Wget2 does this automatically|
|--netrc-file||✓|Mainly for test code usage to test .netrc files|
|--preferred-location|✓||Wget2 respects priorities and order of locations|
|--robots||✓|Wget1.x has a robots command but no option, -e robots=1 does the job|
|--show-progress|✓|(✓)|Wget2 has a --force-progress option which is better named|
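
For old scripts, a few of these rows can be turned into a mechanical translation. The following is only a sketch based on the table above, not an official compatibility layer: the function name `translate_wget_args` is made up, it handles only the `--option=value` spelling of --input-metalink, and it has not been checked against every wget2 release.

```shell
#!/bin/sh
# translate_wget_args: print wget2-style arguments, one per line.
# Mappings are taken from the comparison table only:
#   --show-progress        -> --force-progress
#   --input-metalink=FILE  -> --input-file=FILE --force-metalink
#   --metalink-over-http   -> dropped (Wget2 does this automatically)
translate_wget_args() {
  for arg in "$@"; do
    case "$arg" in
      --show-progress)      printf '%s\n' --force-progress ;;
      --input-metalink=*)   printf '%s\n' "--input-file=${arg#--input-metalink=}" --force-metalink ;;
      --metalink-over-http) ;;  # nothing to emit
      *)                    printf '%s\n' "$arg" ;;
    esac
  done
}
```

In bash, the result can be read back with `mapfile -t args < <(translate_wget_args "$@")` and passed on as `wget2 "${args[@]}"`. Arguments that contain newlines would break this line-based approach.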

What I do not understand is why, after almost two Fedora releases, people are only now asking for this to be reverted. Normally that is done before the change is released. So the users complaining now did not do their homework during the Rawhide/Beta tests.

The table at the top is meant for examples of how to fix the parts of a script that depend on removed or changed syntax.

Of course we can also hide this topic if not needed :slight_smile:

It is a rewrite that cleans up the code base and removes old, risky parts of it to make it more secure.

But if it results in different output for the same command options, then it will definitely break scripts that use that output. That is what should not have changed without giving users time to adjust. A sudden switch with differing output is what I understand from this thread.