Webmirror 2.0 Feature List

Recursive retrieval of web pages that fit a domain.

The program downloads the starting web pages and then recursively follows the links they contain, visiting more and more pages. You can define the domain to download, and it is not restricted to a single server: any wildcard pattern can be included in the domain or excluded from it.
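
The idea behind the recursion is a simple queue of URLs: pages are fetched, scanned for links, and the links that fit the domain are queued in turn. The following Perl sketch shows the principle only; the domain test and the printing stand in for the program's own pattern matching and file saving:

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use URI;

    my @queue  = ('http://www.cnn.com/');          # the starting page(s)
    my $domain = qr{^http://[^/]*\.cnn\.com/}i;    # illustrative domain test
    my %visited;

    while (my $url = shift @queue) {
        next if $visited{$url}++;                  # fetch every page only once
        my $html = get($url);
        next unless defined $html;                 # skip unresponsive servers
        print "retrieved $url\n";                  # the real program saves the file here
        # make every link absolute and queue the ones inside the domain
        while ($html =~ /href\s*=\s*["']?([^"'\s>]+)/gi) {
            my $link = URI->new_abs($1, $url)->as_string;
            push @queue, $link if $link =~ $domain;
        }
    }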

Follow relative and absolute URLs.

The program does not care how a URL is specified in an HTML file. Other mirroring programs follow only relative URLs, even when an absolute URL points to a page on the same server in the same directory. This mirroring program first converts every URL to absolute form and then decides whether the targeted file should be retrieved. Therefore it does not matter how a URL is written in the source file.
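
Converting a link to absolute form takes the URL of the page the link appeared on as a base. A small illustration with the standard Perl URI module (not necessarily the module the program uses internally); note that both link forms resolve to the same absolute URL:

    use URI;

    my $base = 'http://www.cnn.com/world/index.html';
    print URI->new_abs('../sport/results.html', $base), "\n";
    print URI->new_abs('http://www.cnn.com/sport/results.html', $base), "\n";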

Frame handling.

The program supports frames and downloads the files that are necessary to display the frames.
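
Frames bring their member pages in through the src attribute of the frame tags, so those files have to be collected like ordinary links. A minimal self-contained sketch of that extraction (a simple regex stands in for the program's real HTML scanning):

    use strict;
    use warnings;

    my $html = '<frameset cols="20%,80%">'
             . '<frame src="menu.html"><frame src="body.html">'
             . '</frameset>';

    # collect the src attribute of every <frame> tag
    my @frame_srcs;
    while ($html =~ /<frame[^>]*\bsrc\s*=\s*["']?([^"'\s>]+)/gi) {
        push @frame_srcs, $1;
    }
    print "@frame_srcs\n";    # menu.html body.html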

Picture retrieval can be switched off.

If you want to download only textual information, picture download can be switched off entirely. This way you get the information fast, without the nifty pictures, and use less bandwidth.

Retrieval of pictures from outside the defined domain can be switched on.

In some situations the pictures come from a wide variety of servers: usually ad banners, or dedicated picture servers in the case of heavy-traffic web farms. You can instruct the program to download every referenced picture, even if it is outside the download domain.

The retrieval domain can be defined with include and exclude patterns.

You can define as many patterns as you like to be retrieved. You are not limited to one machine: a single download can process many servers, with multiple starting points. You can say

    include http://*.cnn.com/*
    include http://*.times.com/*

to download both cnn and times. On the other hand, you can exclude patterns, saying

    exclude *.exe
    exclude *.js

not to download any file with the extension exe or js.
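
Under the hood, a wildcard pattern of this kind maps naturally onto a regular expression. A sketch of the translation in Perl (the program's actual matcher may differ):

    use strict;
    use warnings;

    # turn a wildcard pattern into an anchored regular expression
    sub pattern_to_regex {
        my $pat = quotemeta shift;    # escape regex metacharacters ...
        $pat =~ s/\\\*/.*/g;          # ... then let each * match anything
        return qr/^$pat$/i;
    }

    my $include = pattern_to_regex('http://*.cnn.com/*');
    print "included\n" if 'http://edition.cnn.com/WORLD/' =~ $include;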

Define the maximal depth of web pages to follow.

You can define how deep into a site you are interested. A level of five means that the page can be reached with five mouse clicks from the starting page. Pages that are too deep are not retrieved; clicking their links takes you to the original location instead.
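
Depth limiting can be expressed by carrying a click count along with every queued URL, as in this Perl sketch (links_of is an illustrative helper, stubbed out here, not the program's actual routine):

    use strict;
    use warnings;

    my $max_depth = 5;                       # five mouse clicks from the start
    my @queue     = map { [ $_, 0 ] } ('http://www.cnn.com/');
    my %visited;

    while (my $item = shift @queue) {
        my ($url, $depth) = @$item;
        next if $visited{$url}++;
        next if $depth > $max_depth;         # too deep: leave a leaf page instead
        for my $link (links_of($url)) {      # illustrative fetch-and-extract helper
            push @queue, [ $link, $depth + 1 ];
        }
    }

    sub links_of { return () }               # stub so the sketch runs as-is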

Define the maximal total size of a retrieval.

You can limit the total size of the download. If the domain contains more bytes than you expected, it will not fill up your disk.

Limit the size of a single file.

You can also limit the size of a single file, saying, for example, that you do not want to download any file larger than one megabyte.
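
One way to enforce such a limit is to ask the server for the headers first and skip the body when the declared length is over the threshold. A sketch with Perl's LWP (the URL is an example, and the limit is the one-megabyte case from above):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua        = LWP::UserAgent->new;
    my $max_bytes = 1_000_000;               # "one meg"

    my $head = $ua->head('http://www.cnn.com/video/clip.mpg');
    my $len  = $head->content_length;
    if (defined $len && $len > $max_bytes) {
        print "skipped: $len bytes exceeds the limit\n";
    }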

Basic authentication support.

You can define a username and password to download pages that are password protected. What is more, you can define different usernames and passwords for different servers within a single download.
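
The mechanism can be illustrated with Perl's LWP, which registers basic-authentication credentials per server and realm; the host names, realms, and passwords below are of course made up:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    # different username/password pairs for different servers
    $ua->credentials('members.example.com:80',  'Members Only', 'alice', 's3cret');
    $ua->credentials('intranet.example.com:80', 'Staff Pages',  'bob',   'hunter2');

    my $res = $ua->get('http://members.example.com/private/index.html');
    print $res->status_line, "\n";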

Multiple proxy definition.

You can define different proxies to be used for different groups of servers. This is extremely useful if you are using a multinational corporate network service that has more than one firewall to the Internet.
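
A per-server proxy choice boils down to matching the target host against a rule list before each request. A sketch of the idea in Perl (the firewall and host names are invented):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI;

    # proxy rules tried in order; the first matching host pattern wins
    my @proxy_rules = (
        [ qr/\.example\.de$/i, 'http://fw-europe.example.com:8080/'  ],
        [ qr/./,               'http://fw-default.example.com:8080/' ],
    );

    sub proxy_for {
        my $host = URI->new(shift)->host;
        for my $rule (@proxy_rules) {
            return $rule->[1] if $host =~ $rule->[0];
        }
        return undef;
    }

    my $ua  = LWP::UserAgent->new;
    my $url = 'http://www.example.de/index.html';
    $ua->proxy('http', proxy_for($url));    # route through the right firewall
    my $res = $ua->get($url);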

Multiple start pages.

You can define as many starting pages as you wish. The starting pages are downloaded first and then checked for further links that fit the specified download domain.

Configuration files can include each other.

There is a simple way to include other files in a download configuration file. This way you can create one or more configuration files that define general parameters, such as proxy definitions and usernames, and include them in the retrieval definition files that drive the mirroring process.
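
Reading such a file is a small recursive exercise: whenever the parser meets an include directive, it descends into the named file before going on. A sketch of the mechanism (the directive name and file name used here are illustrative, not the program's documented syntax):

    use strict;
    use warnings;

    # expand a configuration file, descending into included files as they appear
    sub read_config {
        my ($file, $lines) = @_;
        open my $fh, '<', $file or die "cannot open $file: $!";
        while (my $line = <$fh>) {
            if ($line =~ /^\s*config\s+(\S+)/) {    # illustrative include directive
                read_config($1, $lines);            # recurse into the included file
            }
            else {
                push @$lines, $line;
            }
        }
        close $fh;
    }

    my @lines;
    read_config('retrieval.def', \@lines);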

Automatic or manual configuration of local network card usage.

You may run this program on a machine that has more than one network card. Although the program automatically uses the best card, you can configure which card to use for the different directions. If you have only one Ethernet card, you need not worry.
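
Manual selection comes down to binding the outgoing socket to the address of the card you want to use. In Perl this is the LocalAddr parameter of IO::Socket::INET; the local address below is an example:

    use strict;
    use warnings;
    use IO::Socket::INET;

    # bind the outgoing connection to one particular local card
    my $sock = IO::Socket::INET->new(
        PeerAddr  => 'www.cnn.com',
        PeerPort  => 80,
        LocalAddr => '192.168.1.10',    # address of the chosen network card
        Proto     => 'tcp',
    ) or die "connect failed: $!";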

Detailed log file generation.

The program generates a detailed log file. From this file you can see how the download process proceeds, which servers are responsive and which are not, how pages are linked, what header information the servers send back, and so on.

Redirecting HTML page generation for pages not retrieved.

When a page is not downloaded for some reason, the program generates a so-called leaf page that tells the user the page was not mirrored and redirects the user to the original location.
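
A leaf page of this kind can be as simple as a meta refresh that sends the browser on to the original URL. A Perl sketch of generating such a file (the wording and layout of the page are illustrative):

    use strict;
    use warnings;

    # write a leaf page that redirects the reader to the original location
    sub write_leaf_page {
        my ($file, $original) = @_;
        open my $fh, '>', $file or die "cannot write $file: $!";
        print $fh "<html><head>\n",
                  qq{<meta http-equiv="refresh" content="0; url=$original">\n},
                  "</head><body>\n",
                  qq{This page was not mirrored; continuing to <a href="$original">$original</a>.\n},
                  "</body></html>\n";
        close $fh;
    }

    write_leaf_page('leaf.html', 'http://www.cnn.com/too/deep/page.html');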

Define the user agent reported to the server.

Some web servers pay attention to the type of browser that visits them and return different pages for different browsers. For example, they might present a slightly different page to Netscape and to Microsoft browsers. When you download a page, the server does not know which browser you will use to view the pages offline. To state your choice, you can specify the user agent in the retrieval definition file, and the mirroring process will claim to be Netscape, Internet Explorer, or anything else you like.
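
In LWP terms the lie is a single call; the agent string below is an example of the sort a Netscape browser of the era sent:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/4.5 [en] (WinNT; I)');    # pretend to be Netscape
    my $res = $ua->get('http://www.cnn.com/');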

Automatic default file name creation (usually index.html).

The program automatically finds out what file to create for URLs like http://www.cnn.com that do not name a file. This is usually index.html, but there are other common choices; the program automatically finds the name that best fits the server.
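
One way to discover the server's default name is to fetch the bare directory URL and compare it with the common candidates until one returns the same content. A sketch of that probing (the candidate list and the comparison are illustrative, not necessarily the program's method):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;

    # find which default file name the server serves for a bare URL
    sub default_name {
        my $dir_url = shift;                   # e.g. http://www.cnn.com/
        my $dir     = $ua->get($dir_url);
        for my $name (qw(index.html index.htm default.htm default.html)) {
            my $res = $ua->get($dir_url . $name);
            return $name if $res->is_success
                         && $dir->is_success
                         && $res->content eq $dir->content;
        }
        return 'index.html';                   # the usual fallback
    }

    print default_name('http://www.cnn.com/'), "\n";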

Cookie support.

The program is able to send back cookies. This way you can mirror pages that are generated by CGI processes and rely on cookie state information. You can also disable cookies.
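
The mechanism is the familiar cookie jar: what a server sends is stored and handed back on later requests. An LWP illustration (the CGI URL is an example):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Cookies;

    my $ua = LWP::UserAgent->new;
    $ua->cookie_jar(HTTP::Cookies->new);    # remember cookies and send them back
    # a CGI page that sets a session cookie on the first visit
    my $res = $ua->get('http://www.example.com/cgi-bin/session.cgi');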

Object-oriented development.

The program was developed in an object-oriented way; the source code is clean and well documented. This means that if you only want to use the program, you can expect it to be more bug free. If you want to modify it, learn from the code, or reuse fragments of it, you have an easy way to do so. The software is GNU GPL.

Supports Windows NT, Windows 95 and UNIX operating systems.

The software runs on Windows NT (it was actually developed on one) and should run under Windows 95 or 98, as it does not use any special features; it was also tested by beta testers under different brands of UNIX. It requires no external binary Perl extension modules.

Software is GNU GPL.

As it says. Don't worry about your budget.