Tuesday 13 March 2012

How to create a robots.txt file


 A robots.txt file is a simple text file you can create to tell a search-engine spider (robot) that you don't want it to crawl your site, or certain parts of it.

  1. Open your favorite text editor. It doesn't matter which one you use; Notepad works just fine if you're on a PC, and can be found under "Accessories."
  2. Enter two lines: one for the name of the spider that will be crawling your web page, and one for the directory or file name you want to exclude from its search. This is the syntax:
    User-Agent: [Spider or Bot name]
    Disallow: [Directory or File Name]
    For example:
    User-Agent: Googlebot
    Disallow: /mywebsite/private.html
    where "Googlebot" is the robot sent out by Google, and "private.html" is the file in the directory "mywebsite" that you do not want the robot to index.
  3. Exclude a file or section of your site from all spiders. If you do not want any robot to index a certain file or section of your site, use the "*" character after User-Agent. Your file would look like this:
    User-Agent: *
    Disallow: /mywebsite/private.html
  4. Exclude your whole site from all robots. If you don't want any of your site to be visible to robots (e.g. if you are still building your website and it is not ready to be viewed by the public), insert a "*" character after User-Agent and a "/" after Disallow. For example:
    User-Agent: *
    Disallow: /
  5. If you want to allow all robots to access your whole site, simply add the asterisk as before and leave the Disallow line empty, as follows:
    User-Agent: *
    Disallow:
  6. Save the file as robots.txt (not robot.txt — robots will only look for that exact name), and place it in the root directory of your website, for example http://www.mywebsite.com/robots.txt.
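Once the file is written, you can sanity-check how robots will interpret it with Python's standard urllib.robotparser module. The sketch below parses the example rules from steps 2, 4 and 5; the host name and paths are just the sample values used in this article:

```python
from urllib.robotparser import RobotFileParser

def check(rule_lines, agent, url):
    """Parse robots.txt lines and report whether `agent` may fetch `url`."""
    rp = RobotFileParser()
    rp.parse(rule_lines)
    return rp.can_fetch(agent, url)

# Step 2: only Googlebot is excluded from the private file.
googlebot_rule = ["User-Agent: Googlebot", "Disallow: /mywebsite/private.html"]
print(check(googlebot_rule, "Googlebot",
            "http://www.mywebsite.com/mywebsite/private.html"))  # False
print(check(googlebot_rule, "OtherBot",
            "http://www.mywebsite.com/mywebsite/private.html"))  # True

# Step 4: "*" plus "Disallow: /" blocks every robot from the whole site.
block_all = ["User-Agent: *", "Disallow: /"]
print(check(block_all, "Googlebot", "http://www.mywebsite.com/index.html"))  # False

# Step 5: an empty Disallow line allows everything.
allow_all = ["User-Agent: *", "Disallow:"]
print(check(allow_all, "Googlebot", "http://www.mywebsite.com/index.html"))  # True
```

Note that robots.txt is purely advisory: well-behaved crawlers consult it exactly as this parser does, but nothing forces a robot to obey it.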