Metis : User documentation

Usage:

-h [site name] : site to spider
Examples of valid site names:
http://site.com
https://site.com
https://site.com:port
site.com

-D : enables debug mode

-T [report type] : Default is HTML
Report types:
1 : ASCII
2 : HTML

-U [site username] : set the username for Basic Authentication

-P [site password] : set the password for Basic Authentication

-X [proxy host] : set the proxy host to use

-C [proxy port] : set the proxy port to use. Default is 80.

-l [site list file] : name of a file listing the sites to spider (see Site List File format).

-F [URLs to include file] : name of a file listing URLs to include in the spider (see URL to include file format).

-E [email domain] : email domain for the host specified by -h. Defaults to the root domain.

-L [custom login XML filename] : name of the XML configuration file to use for custom login support (see Custom Login Handler).
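The site name forms accepted by -h can be illustrated with a short Python sketch that splits each form into scheme, host and port. This is not Metis code: the function name normalize_site is hypothetical, and the choices to treat a bare name such as site.com as HTTP and to use port 443 for HTTPS are assumptions. Only the default HTTP port of 80 comes from this documentation (the default_port option).

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def normalize_site(name: str, default_port: int = 80) -> Tuple[str, Optional[str], int]:
    """Split a site name into (scheme, host, port)."""
    if "://" not in name:
        # Assumption: a bare host name is spidered over plain HTTP.
        name = "http://" + name
    parts = urlsplit(name)
    port = parts.port
    if port is None:
        # Assumption: HTTPS falls back to 443, HTTP to default_port.
        port = 443 if parts.scheme == "https" else default_port
    return parts.scheme, parts.hostname, port


for site in ("http://site.com", "https://site.com", "https://site.com:8443", "site.com"):
    print(site, "->", normalize_site(site))
```

The port 8443 above only stands in for the "port" placeholder of the https://site.com:port form.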

Special keys
S : Pressing the 's' key in the console where Metis is running stops the spider.

more to come ...

Metis configuration files
You can control how Metis behaves through the following configuration files, located in the conf directory of your installation path.

webglobal.properties

Option	Description	Default Value
default_port	Default port for HTTP connections	80
debug	0: disable debug; 1: enable debug	0
useragent	Value sent in the User-Agent request header of every HTTP/SSL request	metis 2.1
proxy_host	Proxy host to use
http_password	Password used for each request
http_user	User name used for each request
timeout_retries	Number of times to retry a request when a timeout occurs	0
socket_timeout	Number of milliseconds before a request times out	10000
proxy_port	Proxy port to use	80
max_sites_in_parallel	The maximum number of sites to spider in parallel. -1 means no limit.	-1
max_request_download	Sets the maximum file size the engine will download. This protects against downloading large files when the engine believes the content is simple HTML. Setting this value to 0 tells the engine to download files of any size.	1572864
potential_html_file_ext	List of file extensions that the engine treats as containing HTML. It is used to bypass some web server protection software, and was developed to allow spidering a site protected by SecureIIS from Eeye.com.	htm;html;shtml;stm;pl;cfm;php;php3;mv;cgi;asp;css
use_file_io_for_spider_data	Writes core spider data to the file system instead of keeping it in memory. This lets the spider use much less memory and handle much larger sites. 0: disable; 1: enable	1
site_element_handler	Indicates which class the spider engine uses to manage the site elements (URLs done, emails, site URLs, ...). The class specified here has to implement the java.util.List interface, which lets users manage the data as they wish. This option is only used if use_file_io_for_spider_data is enabled.	faust.sacha.util.FileModeList
keyword_search_list	List of keywords you would like the spider to search for. Keywords must be separated by ; (for example word1;word2;word3).
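
To show how the semicolon-separated options in the table above can be read, here is a minimal Python sketch. It assumes that webglobal.properties uses plain key=value lines with # or ! comments, as Java properties files usually do; this documentation does not show the file syntax. The helper names parse_properties and split_list and the sample content are hypothetical, and the sketch does not handle escapes or line continuations.

```python
from typing import Dict, List


def parse_properties(text: str) -> Dict[str, str]:
    """Read simple key=value lines, skipping blank lines and comments."""
    options: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            options[key.strip()] = value.strip()
    return options


def split_list(value: str) -> List[str]:
    """Split a ';'-separated option value, dropping empty items."""
    return [item.strip() for item in value.split(";") if item.strip()]


# Hypothetical excerpt of conf/webglobal.properties using documented defaults.
sample = """\
# sample values only
socket_timeout=10000
max_request_download=1572864
keyword_search_list=word1;word2;word3
"""

options = parse_properties(sample)
print(int(options["socket_timeout"]) / 1000, "seconds")
print(split_list(options["keyword_search_list"]))
```

The default max_request_download of 1572864 equals 1.5 x 1024 x 1024, which suggests the value is expressed in bytes, although the unit is not stated here.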

threadmanager.properties

Option	Description	Default Value
max_threads	Sets the maximum number of requests the engine will run concurrently.	100

Site List File format
The format of the file is simple. To add a site to spider, put it on its own line. You can also specify the email domain for a site with the following syntax: [site,emaildomain].
Example:
www.site.com
[www.domain.com,domain.com]

This will spider www.site.com and www.domain.com, and set the email domain for www.domain.com to domain.com.
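
The format can also be read programmatically. The following Python sketch parses a site list into (site, email domain) pairs. It is an illustration only, not the parser Metis uses: the function name parse_site_list is hypothetical, and ignoring blank lines is an assumption. A site without an explicit email domain gets None, leaving the default (the root domain) to Metis.

```python
from typing import List, Optional, Tuple


def parse_site_list(text: str) -> List[Tuple[str, Optional[str]]]:
    """Parse site list file content into (site, email_domain) pairs."""
    entries: List[Tuple[str, Optional[str]]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith("[") and line.endswith("]"):
            # Bracketed form: [site,emaildomain]
            site, _, email_domain = line[1:-1].partition(",")
            entries.append((site.strip(), email_domain.strip() or None))
        else:
            # Plain form: one site per line, no explicit email domain
            entries.append((line, None))
    return entries


example = "www.site.com\n[www.domain.com,domain.com]\n"
for site, email_domain in parse_site_list(example):
    print(site, email_domain)
```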

URL to include file format
Like the site list file, this file has a simple format and follows the same concept. To add a path to include in all spiders, put it on its own line. You can also specify whether the path points to a file or a folder with the following syntax: [type:path]. The type can be either File or Folder. If the type is not specified, the spider will try to determine it.
Example:
[Folder: /test]
[File: /ldjflskdjflskdjf?arg=test]
/SingleSiteFile
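
As a sketch of how these entries break down, the Python snippet below parses the example into (type, path) pairs. It is not Metis code: the function name parse_include_urls is hypothetical, and matching the type names case-sensitively, ignoring blank lines, and keeping unrecognized bracketed lines as plain paths are all assumptions. A type of None means the type is left for the spider to determine.

```python
from typing import List, Optional, Tuple

KNOWN_TYPES = ("File", "Folder")


def parse_include_urls(text: str) -> List[Tuple[Optional[str], str]]:
    """Parse 'URL to include' file content into (type, path) pairs."""
    entries: List[Tuple[Optional[str], str]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith("[") and line.endswith("]"):
            # Bracketed form: [type:path]; split on the first colon only,
            # so colons inside the path are preserved.
            kind, sep, path = line[1:-1].partition(":")
            if sep and kind.strip() in KNOWN_TYPES:
                entries.append((kind.strip(), path.strip()))
                continue
        # Plain form (or unrecognized bracket form): type left unspecified
        entries.append((None, line))
    return entries


example = "[Folder: /test]\n[File: /ldjflskdjflskdjf?arg=test]\n/SingleSiteFile\n"
for kind, path in parse_include_urls(example):
    print(kind, path)
```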

Custom Login Handler
Only one custom login handler is provided with the default installation. Please refer to custom_login_help.txt and CustomFormLoginCookie.txt for more information.