Recursive retrieval of web pages that fit a domain.
The program starts downloading web pages and recursively follows the pages they
reference, visiting more and more pages. You can define a download domain
that is not restricted to a single server. Any wildcard pattern can be
included in the domain or excluded from it.
Follows relative and absolute URLs.
The program does not care how a URL is written in an HTML file. Some mirroring
programs follow only relative URLs, and ignore absolute URLs even when they point to a page
on the same server in the same directory. This mirroring program first converts all URLs to absolute
form and then decides whether the targeted file is to be retrieved. Therefore it does not
matter how a URL is written in the source file.
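For example, if the page http://www.cnn.com/news/index.html (a path made up only for illustration) contains the relative link
pics/map.gif
the program first resolves it to the absolute URL
http://www.cnn.com/news/pics/map.gif
and only then checks whether that URL fits the download domain.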
Frame handling.
The program supports frames and downloads the files that are necessary to display
the frames.
Picture retrieval can be switched off.
If you want to download the textual information without pictures, all picture downloading can
be switched off. This way you get the information fast, without the nifty pictures,
and use less bandwidth.
Retrieval of pictures from outside of the defined domain can be switched on.
In some situations the pictures come from a wide variety of servers: usually ad banners,
or dedicated picture servers in the case of heavy-traffic web farms. You can instruct the program
to download all referenced pictures, even those outside the download domain.
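As a rough sketch, these two switches might look like this in a retrieval definition file; the directive names and the # comment syntax are assumptions made for illustration, not the program's actual keywords:
# hypothetical: turn off all picture retrieval
pictures off
# hypothetical: or fetch referenced pictures even from outside the domain
outside-pictures on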
The retrieval domain can be defined with include and exclude patterns.
You can define as many patterns as you like to be retrieved; you are not limited to one machine.
A single download can process many files from multiple starting points. You can say
include http://*.cnn.com/*
include http://*.times.com/*
to download both CNN and the Times. On the other hand, you can exclude patterns, saying
exclude *.exe
exclude *.js
so that no files with the extension exe or js are downloaded.
Define the maximal depth of web pages to follow.
You can define how many levels deep you are interested in. A
depth of five means that a page can be reached within five mouse clicks from the
starting page. Pages that are deeper are not retrieved; you can still
reach them at their original location by clicking the link.
Define the maximal total size of the retrieval.
You can limit the total size of the download, so if the domain contains more bytes than you
expected, it will not fill up your disk.
Limit the size of a single file.
You can also limit the size of a single file, saying that you do not want to download
files larger than, say, one megabyte.
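As a rough sketch, the three limits just described might be written like this; the directive names, the units, and the # comment syntax are assumptions for illustration only:
# hypothetical directives; sizes in bytes
maxdepth 5
maxtotalsize 50000000
maxfilesize 1000000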
Basic authentication support.
You can define a username and password to download pages that
are password protected. What is more, you can define different usernames
and passwords for different servers within a single download.
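A sketch of per-server credentials; the directive name, the argument order, and the # comment syntax are assumptions, and the hosts and passwords are made up:
# hypothetical: URL pattern, username, password
auth http://intranet.example.com/* alice secret1
auth http://partner.example.com/* bob secret2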
Multiple proxy definition.
You can define different proxies to be used for different groups of servers.
This is extremely useful if you are on a multinational corporate network
that has more than one firewall to the Internet.
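A sketch of proxy definitions for two server groups; the directive name, the argument layout, the host names, and the # comment syntax are all assumptions for illustration:
# hypothetical: URL pattern, then proxy host:port
proxy http://*.example.com/* proxy-eu.corp.example:8080
proxy http://*.example.org/* proxy-us.corp.example:8080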
Multiple start pages.
You can define as many starting pages as you wish. The starting pages are downloaded
first and then checked for further links that fit the specified download domain.
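A sketch with two starting pages; the directive name and the # comment syntax are assumptions, and the real keyword may differ:
# hypothetical: one starting page per line
url http://www.cnn.com/
url http://www.times.com/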
Configuration files can include each other.
There is a simple way to include other files in a download configuration file.
This way you can create one or more configuration files that define general
parameters, such as proxy definitions and usernames, and these files can be included
in the retrieval definition files that drive the mirroring process.
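A sketch of a retrieval definition file pulling in a shared settings file; the directive name, the file name, and the # comment syntax are made up for illustration:
# hypothetical: read common proxy and password settings
load common-settings.cfg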
Automatic or manual configuration of local net card usage.
You may run this program on a machine that has more than one network card. Although the
program automatically uses the best card, you can configure which card to use for
the different directions. If you have only one Ethernet card, you need not
worry.
Detailed log file generation.
The program generates a detailed log file. From this file you can see how the download
process proceeds, which servers are responsive and which are not, how pages are linked,
what header information the servers send back, and so on.
Redirecting HTML page generation for pages not retrieved.
When a page is not downloaded for some reason, the program generates a so-called leaf page
that tells the user the page was not retrieved and redirects the user
to the original location.
Define user agent reported to the server.
Some web servers pay attention to the type of browser that visits them and return different
pages for different browsers. For example, they might present a slightly different page to
Netscape than to Microsoft browsers. When you download a page, the server does not know what browser
you will use to view the pages offline. To state your choice, you can specify the user agent
in the retrieval definition file, and the mirroring process will then claim to be Netscape,
Internet Explorer, or anything else you like.
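A sketch of such a setting; the directive name and the # comment syntax are assumptions, while the value is the kind of user-agent string Internet Explorer 4 reported:
# hypothetical directive name
useragent Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)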
Automatic default file name creation (usually index.html).
The program automatically figures out what file name to create for URLs like http://www.cnn.com.
This is usually index.html, but there are other common choices; the program automatically
finds the file name that best fits the server.
Cookie support.
The program is able to send back cookies. This way you can mirror pages that are generated by
CGI processes and rely on cookie-based state information. You can also disable cookies.
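A sketch of switching cookies off; the directive name and the # comment syntax are assumptions for illustration:
# hypothetical directive name
cookies off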
Object oriented development.
The program was developed in an object-oriented way; the source code is clean and well documented.
This means you can rely on the program being relatively bug free if you only want to use it,
and if you want to modify it, learn from the code, or reuse fragments of it, you have an easy way to do so.
The software is licensed under the GNU GPL.
Supports Windows NT, Windows 95 and UNIX operating systems.
The software runs on Windows NT (it was actually developed on one), should
run under Windows 95 or 98 as it does not use any special features, and was
tested by beta testers under different brands of UNIX. It requires no external
binary Perl extension modules.