Wednesday, October 3, 2012

Using wget to download restricted pages or whole sites

Hi all,

Today the problem is how to download a whole website for legitimate purposes. One interesting tool we can use is wget. For example, we can write:

$ wget -r URL

to download the site recursively (that is what the -r option does).
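
If we also want the local copy to be comfortable to browse offline, wget offers a few standard options we can combine (which ones to use is a matter of taste; these are regular wget flags, nothing specific to any particular site):

$ wget --mirror --convert-links --page-requisites --no-parent --wait=1 URL

Here --mirror turns on recursion with timestamping, --convert-links rewrites the links so the copy works offline, --page-requisites also fetches images and CSS, --no-parent keeps wget below the starting directory, and --wait=1 adds a polite pause between requests.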

Sometimes this doesn't work because the site's developers restrict access for robots, certain browsers, etc. In that case we receive an error page instead of the real content. wget allows us to mimic the HTTP headers of a regular browser. For example, an easy way to download this kind of site is to define an alias:

alias wgets='H="--header"; wget $H="Accept-Language: en-us,en;q=0.5" $H="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" $H="Connection: keep-alive" -U "Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2" --referer=http://www.google.com/ '
and then repeat our first command, this time adding an "s" after wget:

$ wgets -r URL
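
For reference, the alias simply stores the common headers, so wgets -r URL ends up running something like the full command below (we can also type it by hand instead of defining the alias):

$ wget --header="Accept-Language: en-us,en;q=0.5" \
       --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
       --header="Connection: keep-alive" \
       -U "Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2" \
       --referer=http://www.google.com/ \
       -r URL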

There are other tools, such as curl, that are more powerful than wget for more general purposes.
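
For instance, curl can send the same kind of headers when we only need a single page (curl does not follow links recursively by itself, so this is just a sketch for one URL; page.html is only an example output file name):

$ curl -L \
       -A "Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2" \
       -e http://www.google.com/ \
       -H "Accept-Language: en-us,en;q=0.5" \
       -o page.html \
       URL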

Have fun!