Our Dataprovider.com spider
At Dataprovider.com we use a spider developed in-house to analyze domains. To save bandwidth and avoid inconvenience to site owners, the spider downloads only 10 to 20 pages per website. It identifies itself with a user agent, which makes it visible in your logs and in any site statistics programs you use. Look for the following agent:
- "Mozilla/5.0 (compatible; Dataprovider.com)"
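If you want to see how often the spider visits, a minimal sketch such as the following can pick its requests out of an access log. It assumes that your server writes the user agent verbatim into each log line; the two sample lines and the function name `spider_requests` are made up for illustration.

```python
# User agent string as published by Dataprovider.com.
SPIDER_AGENT = "Mozilla/5.0 (compatible; Dataprovider.com)"


def spider_requests(log_lines):
    """Return the log lines that contain the spider's user agent."""
    return [line for line in log_lines if SPIDER_AGENT in line]


# Hypothetical log lines in a common access-log layout (illustration only).
sample_log = [
    '203.0.113.5 - - [01/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; Dataprovider.com)"',
    '198.51.100.7 - - [01/Jan/2024:10:00:05 +0000] "GET /about HTTP/1.1" 200 1024 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

hits = spider_requests(sample_log)
print(len(hits), "request(s) from the Dataprovider.com spider")
```

In practice you would pass the lines of your real log file instead of `sample_log`.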
Robot Exclusion Protocol
The robot exclusion protocol is a set of instructions that specifies which areas of a website a robot may and may not process. We adhere to these instructions and skip any directory or content that is marked as not to be indexed. There are two ways to instruct a spider not to index your website: you can use a META tag in the HTML, or add a robots.txt file to the root folder of your website.
Robots META tag
In the HTML of a page you can add a robots META tag. With this tag, a robot can be instructed whether or not to index a given webpage, and whether or not to follow its links to other pages. If you want to exclude pages of your website from indexing, you can embed the robots META tag as shown in the following HTML code:
- <!DOCTYPE html>
- <html lang="en">
- <meta charset="utf-8">
- <meta name="robots" content="noindex, nofollow">
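To check which directives a page actually sends, the sketch below reads the robots META tag with Python's standard `html.parser` module. It only illustrates how a robot might interpret the tag and is not the Dataprovider.com spider's own code; the class name `RobotsMetaParser`, the helper `robots_directives`, and the sample page are assumptions made for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated and case-insensitive.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of robots directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that asks robots not to index it or follow its links.
page = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="robots" content="noindex, nofollow">
</head>
</html>"""

directives = robots_directives(page)
print("index allowed:", "noindex" not in directives)
print("follow allowed:", "nofollow" not in directives)
```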
Robots.txt file
The robot exclusion protocol uses the robots.txt file, which must be placed in the root directory of a site. For example, if your website is mydomain.com, create the file at mydomain.com/robots.txt. Before the spider indexes any website, it always looks for this file. To address the Dataprovider.com spider, set the user-agent to ‘dataprovider’ (if you only want to disallow Dataprovider.com) or to ‘*’ (if you want to disallow all spiders).
- User-agent: dataprovider
- Disallow: /
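Before publishing a robots.txt file you can test its rules locally. The following sketch uses Python's standard `urllib.robotparser` module to confirm that the two lines above block the ‘dataprovider’ agent while leaving other robots unaffected. The agent name `SomeOtherBot` is hypothetical, and the module's matching logic is not necessarily identical to the spider's.

```python
from urllib.robotparser import RobotFileParser

# The same rules you would place in mydomain.com/robots.txt.
rules = """\
User-agent: dataprovider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch() returns True when the named agent may request the URL.
blocked = not parser.can_fetch("dataprovider", "http://mydomain.com/")
others_allowed = parser.can_fetch("SomeOtherBot", "http://mydomain.com/")

print("dataprovider blocked:", blocked)
print("other robots allowed:", others_allowed)
```

Replacing `dataprovider` with `*` in the rules would make the check fail for every agent, which matches the "disallow all spiders" case.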
For more information on how to block spiders, see http://www.robotstxt.org/robotstxt.html