-h [site name] : site (host) to spider
Example of valid site name:
-D : enables debug mode
-T [report type] : Default is HTML
1 : ASCII
2 : HTML
-U [site username] : set the username for Basic Authentication
-P [site password] : set the password for Basic Authentication
-X [proxy host] : set the proxy host to use
-C [proxy port] : set the proxy port to use. Default is 80.
-l [site list file] : filename with sites to spider (see Site List File format).
-F [urls to include file] : filename with URLs to include in the spider (see URL to include file Format).
-E [email domain] : email domain for host specified by -h. Default will be root domain.
-L [custom login xml filename] : XML configuration file name to use for custom login support (see Custom Login Handler).
S : Pressing the 's' key in the console where Metis is running stops the spider.
more to come ...
Metis configuration files
You can control how Metis behaves with the following configuration files, located in the conf directory of your installation path.
|default_port||Default port for HTTP connections||
0 : disable debug
1 : enable debug
|useragent||The value will be used in the User-Agent request header for all HTTP/SSL requests||
|proxy_host||Proxy host to use|
|http_password||Password used for each request|
|http_user||User name used for each request|
|timeout_retries||Number of times to retry a request when a timeout occurs||
|socket_timeout||Number of milliseconds before a request will time out||
|proxy_port||Proxy port to use||
|max_sites_in_parallel||The maximum number of sites to spider in parallel.
-1 means no limit.
|max_request_download||This sets the maximum file size the engine will download. It protects against downloading large files when the engine assumes the content is simple HTML. Setting this value to 0 tells the engine to download files of any size.||
|potential_html_file_ext||This is a list of file extensions that the engine treats as containing HTML. It is used to bypass some web server protection software, and was developed to allow spidering a site protected by SecureIIS from Eeye.com.||
This option writes core spider data to the file system instead of keeping it in memory. This lets the spider use much less memory and spider much larger sites with less RAM.
0 : disable
|site_element_handler||This option tells the spider engine which class to use to manage the site elements (URLs done, emails, site URLs, ...). The class specified here has to implement the java.util.List interface, which lets users manage the data as they wish. It is only used if user_file_io_for_spider_data is enabled.||
This is a list of keywords you would like the spider to search for. Keywords must be separated by ; (like word1;word2;word3).
|max_threads||This sets the maximum number of requests the engine will run concurrently.||
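The semicolon-separated keyword list described in the table can be split into individual keywords as follows. This is only a sketch, not Metis's own code: the class and method names are hypothetical, and it assumes that surrounding whitespace and empty entries should be ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Metis: splits a keyword value such as
// "word1;word2;word3" into its individual keywords.
class KeywordListSketch {

    static List<String> parseKeywords(String value) {
        List<String> keywords = new ArrayList<>();
        for (String part : value.split(";")) {
            String word = part.trim();
            // Assumption: empty entries (for example from ";;") are skipped.
            if (!word.isEmpty()) {
                keywords.add(word);
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        // Prints [word1, word2, word3]
        System.out.println(parseKeywords("word1;word2;word3"));
    }
}
```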
Site List File format
The format of the file is simple. To include sites to spider, add one per line. You can also specify the email domain for a site with the following syntax : [site,emaildomain].
For example, listing www.site.com on one line and www.domain.com with the email domain domain.com on another will spider www.site.com, and will spider www.domain.com with the email domain set to domain.com.
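The line format described above can be read with a few lines of Java. This is a minimal sketch, not Metis's own parser: the class and method names are hypothetical, and it assumes that the brackets in [site,emaildomain] are notation rather than literal characters and that blank lines are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader for the site list format; not taken from Metis.
class SiteListSketch {

    /** One site list entry: a site and an optional email domain. */
    static final class SiteEntry {
        final String site;
        final String emailDomain; // null when the line gives no email domain

        SiteEntry(String site, String emailDomain) {
            this.site = site;
            this.emailDomain = emailDomain;
        }
    }

    /** Parses lines of the form "site" or "site,emaildomain". */
    static List<SiteEntry> parse(List<String> lines) {
        List<SiteEntry> entries = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            int comma = line.indexOf(',');
            if (comma < 0) {
                entries.add(new SiteEntry(line, null));
            } else {
                String site = line.substring(0, comma).trim();
                String domain = line.substring(comma + 1).trim();
                entries.add(new SiteEntry(site, domain.isEmpty() ? null : domain));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("www.site.com", "www.domain.com,domain.com");
        for (SiteEntry e : parse(lines)) {
            String domain = (e.emailDomain == null) ? "(default)" : e.emailDomain;
            System.out.println(e.site + " -> " + domain);
        }
    }
}
```

An entry whose email domain is null stands for the case where no domain is given in the file; how Metis then picks its default is not modelled here.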
URL to include file Format
Like the site list file format, the format of this file is simple and follows the same concept. To add a path to include in all spiders, add one per line. You can also specify whether the path points to a file or a folder with the following syntax : [type:path]. The type can be either File or Folder. If the type is not specified, the spider will try to determine the type itself.
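The [type:path] syntax above can be interpreted along these lines. The sketch is illustrative only and is not Metis's implementation: the class, enum, and example paths are hypothetical, and it assumes that the type prefix is matched case-insensitively and that any line without a File or Folder prefix is a plain path of unspecified type.

```java
import java.util.Locale;

// Hypothetical reader for one line of the "URL to include" file; not taken from Metis.
class IncludeUrlSketch {

    enum Type { FILE, FOLDER, UNSPECIFIED }

    static final class Entry {
        final Type type;
        final String path;

        Entry(Type type, String path) {
            this.type = type;
            this.path = path;
        }
    }

    /** Parses "File:path", "Folder:path", or a bare "path". */
    static Entry parseLine(String raw) {
        String line = raw.trim();
        int colon = line.indexOf(':');
        if (colon > 0) {
            String prefix = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String rest = line.substring(colon + 1).trim();
            if (prefix.equals("file")) {
                return new Entry(Type.FILE, rest);
            }
            if (prefix.equals("folder")) {
                return new Entry(Type.FOLDER, rest);
            }
        }
        // No recognised prefix: keep the whole line as the path and leave
        // the type open, since the spider is said to work it out itself.
        return new Entry(Type.UNSPECIFIED, line);
    }

    public static void main(String[] args) {
        // Example paths are made up for illustration.
        String[] lines = { "File:/admin/login.html", "Folder:/backup/", "/cgi-bin/" };
        for (String l : lines) {
            Entry e = parseLine(l);
            System.out.println(e.type + " " + e.path);
        }
    }
}
```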
Custom Login Handler
There is only one custom login handler provided with the default installation. Please refer to custom_login_help.txt and CustomFormLoginCookie.txt for more information.