The Omgili bot crawls public websites, including news sites, message boards (forums), and blogs, along with their comments.
If our spider doesn't behave well (e.g. too many requests, recursive URLs), please contact us so we can fix the issue!
Why do you access my website?
We crawl your website to include it in our public search engine (http://omgili.com/).
Additionally, we process and filter this data so that our clients can access it through our web service APIs. Our clients include market research companies, marketing agencies, search engines, and other web applications that use us to outsource their crawling and save resources (imagine every media company crawling your site).
How much bandwidth do you use?
We try to be as efficient as possible when we request your website, keeping bandwidth and the number of requests as low as possible:
- In addition to our caching, we use gzip compression to save bandwidth between your servers and ours.
- We also use the If-Modified-Since and ETag HTTP headers to skip requesting unchanged web pages.
- We adapt the crawl rate based on the hits found, the rank of the site, and our internal caches, and we use state-of-the-art compression algorithms to further reduce bandwidth usage.
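The bandwidth-saving techniques listed above can be sketched from the client side as follows. This is a minimal illustration in Python, not the Omgili bot's actual code: the helper names `conditional_headers` and `decode_body` are hypothetical. It assumes the crawler stored the ETag and Last-Modified values from a previous response. The stored ETag is sent back in the If-None-Match request header, and a 304 Not Modified status means the cached copy can be reused without transferring the page again.

```python
import gzip


def conditional_headers(etag=None, last_modified=None):
    """Build request headers for a bandwidth-friendly re-fetch (hypothetical helper)."""
    headers = {"Accept-Encoding": "gzip"}  # ask the server for a compressed body
    if etag:
        # Validator taken from the ETag header of the previous response.
        headers["If-None-Match"] = etag
    if last_modified:
        # Value of the Last-Modified header of the previous response.
        headers["If-Modified-Since"] = last_modified
    return headers


def decode_body(status, body, content_encoding=None, cached=None):
    """Return the page content, reusing the cached copy on 304 Not Modified."""
    if status == 304:
        return cached  # the server sent no body because the page is unchanged
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body
```

With both validators present, a server that supports conditional requests can answer an unchanged page with an empty 304 response, so only the headers cross the wire.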
How can I recognize your bot when it accesses my website?
We use the following user agents:
How can I prevent the Omgili bot from accessing my website?
If you would like to exclude part of your website, or the entire site, from all crawlers, you can do so by creating a file called "robots.txt" in the root directory of your website:
Or, if you would like to exclude only our spider from accessing your site:
The robots.txt file is a widely recognized format for telling crawlers what they may and may not crawl. More information about robots.txt is available at http://www.robotstxt.org/
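The two exclusion cases described above can be checked with Python's standard `urllib.robotparser` module before a robots.txt file is deployed. This is only a sketch: the user-agent token `omgilibot` is an assumption used as a placeholder, so substitute the user agent the bot actually announces, and the helper name `is_allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Rules that exclude every crawler from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Rules that exclude only one crawler. "omgilibot" is an assumed token,
# used purely as a placeholder for the bot's real user agent.
BLOCK_ONE = """\
User-agent: omgilibot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Check whether the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, `is_allowed(BLOCK_ONE, "otherbot", "http://example.com/page")` is true, while the same call with `"omgilibot"` is false, which confirms that the second rule set affects only the named crawler.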
Most people want their website indexed, because it can then be found in search engines and generate traffic.