How to rip most websites

Ripping a website means creating a local copy of it for offline browsing. Creating a website mirror can be a good idea for several reasons. Even with all the caches that store website information, a lot of content is lost for good when a website disappears from the Internet. A local copy is also useful if you need the information on a computer with no Internet access, or only temporary access, an HTML course for example.
One of the most efficient ways to rip websites is the program HTTrack, which may look a little confusing at first because of its many options. I would like to walk you through the process of ripping a website. Please note that this method does not work on every website, but it does work on most.
To begin, download and install the software HTTrack Website Copier. Start it once it has been installed and you will be greeted with a new project dialog. Each project creates the offline copy of one or more URLs.
[Screenshot: httrack_rip_websites.jpg - the new project dialog]

The first screen manages the properties of the project. Just add a name - I prefer the name of the website that I want to rip - and a location on your hard drive where you want to save the copy. Make sure you have enough free disk space on that drive. Click Next to continue.
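For reference, HTTrack also ships with a command line client. A minimal sketch of the same step on the command line, with a placeholder URL and output folder that you would replace with your own, looks like this:

httrack "http://www.example.com/" -O "C:\My Web Sites\example"

The -O option sets the folder the copy is saved to and corresponds to the project location chosen in the dialog.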
On the next screen you add the URLs and pick the kind of action that you want HTTrack to perform. The standard action downloads an exact copy of the website and makes it available offline. The most important element here is the Set Options button, which opens the configuration for the project.
It is very important to open the options and make a few changes there. Click on the Browser ID tab and change the ID to Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0). Some websites check for HTTrack's default ID and deny it access; using a common browser ID prevents that from happening.
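The command line client accepts the browser ID through the -F option; a sketch with the same placeholder URL and folder as above:

httrack "http://www.example.com/" -O "C:\My Web Sites\example" -F "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"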
Access the Limits tab afterwards. Here you select the maximum mirroring and external depths. The first defines how many levels of links are followed starting from the homepage. If you set it to 2, for instance, the homepage is scanned, page 1 which is linked from the homepage is scanned, and page 1.1 which is linked from page 1 is scanned as well.
If you leave the first option blank, all links on that website are scanned. External links are not scanned by default, which can also be changed in this menu; I suggest leaving it at that because following external links would really bloat the project. While you are there, increase the maximum transfer rate in the same menu to the highest value to ensure faster downloads.
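On the command line the same limits map to -r (mirror depth), -%e (external depth) and -A (maximum transfer rate in bytes per second). A sketch with a depth of 2, no external links and a rate cap of roughly 100 KB/s, values you should adjust to your needs:

httrack "http://www.example.com/" -O "C:\My Web Sites\example" -r2 -%e0 -A100000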
The Scan Rules tab is another important one. Here you can include and exclude files. If you do not want to download .exe files, for instance, add the rule -*.exe (without quotes) to the form.
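On the command line, scan rules are simply appended to the command as filters, one per rule; the sketch below skips executables on the placeholder site:

httrack "http://www.example.com/" -O "C:\My Web Sites\example" "-*.exe"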
Password-protected websites:
Password-protected websites are usually harder to rip. You need to supply HTTrack with the username and password for the website. The easiest way to do so is to add them to the URL in the main menu. Instead of adding the URL as http://www.example.com you would add it as http://username:password@www.example.com
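The command line equivalent is the same URL with the credentials embedded, for example:

httrack "http://username:password@www.example.com/" -O "C:\My Web Sites\example"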
That works for websites with basic authentication, meaning sites that show a popup asking for a username and password. It is more difficult if the website uses form-based logins. Your best option to rip those websites is to click on the Add URL button in the main menu and use the capture URL feature.
This requires you to set a proxy in your favorite browser for a short time and log into the website that you want to rip, so that HTTrack can record how the login is performed and hopefully replay it when ripping the website.
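The command line client exposes the same capture feature through the --catchurl helper, which starts a temporary local proxy and displays the captured URL so you can pass it to a regular httrack call afterwards; the exact behavior may vary between HTTrack versions:

httrack --catchurl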
