Security value of robots.txt or time bomb security flaw?

You might be surprised to hear that one small text file, robots.txt, could be the downfall of your website. So, is it a security asset or a ticking time bomb?

The robots.txt is a very simple text file that is placed in your root directory. An example would be www.yourdomain.com/robots.txt. This file tells search engines and other robots which areas of your site they are allowed to visit and index.

You can ONLY have one robots.txt on your site and ONLY in the root directory (where your home page is):

[Image: robots.txt sits in the site’s root directory]

A robots.txt-free website, or a website that cares about privacy?

Occasionally, a website has a robots.txt file which includes the following directives:

User-agent: *
Disallow: /

The “User-agent: *” means this section applies to all robots. The “Disallow: /” tells the robot that it should not visit any pages on the site.

This is telling all bots to ignore THE ENTIRE domain, meaning none of that website’s pages or files would be listed at all by the search engines!!!

The aforementioned example highlights the importance of properly implementing a robots.txt file, so be sure to check yours to ensure you’re not unknowingly restricting your chances of being indexed by search engines.

If you have a very important website and you don’t want crawlers / search engines to access and scan the entire site – this is a good way to start dealing with it 🙂

With this example, all search engines are told that they cannot index anything on your website. It is very important to understand what “all search engines” really means: all search engines that respect robots.txt. This does include all major search engines, but there’s nothing preventing a rogue search engine from simply ignoring these rules.

What do you need in robots.txt?

There are often disagreements about what should and shouldn’t be put in robots.txt files. Please note again that robots.txt isn’t meant to deal with security issues for your website, so I’d recommend that the locations of any admin or private pages on your site are not included in the robots.txt file. If you want to securely prevent robots from accessing any private content on your website, then you need to password protect the area where it is stored. Remember, robots.txt is designed to act as a guide for web robots, and not all of them will abide by your instructions.
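To make that last point concrete, here is a minimal Python sketch of what password protecting an area actually means: the server itself refuses unauthenticated requests instead of politely asking robots to stay away. The /private/ path and the credentials are made up for the example; a real site would normally do this in the web server or application framework rather than a toy handler:

import base64
from http.server import BaseHTTPRequestHandler, HTTPServer

USERNAME, PASSWORD = "admin", "s3cret"  # hypothetical credentials

class AuthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith("/private/"):
            expected = base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()
            if self.headers.get("Authorization", "") != f"Basic {expected}":
                # Unlike a robots.txt rule, this is actually enforced:
                # a bot that ignores it gets a 401, not the content.
                self.send_response(401)
                self.send_header("WWW-Authenticate", 'Basic realm="Private"')
                self.end_headers()
                return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello")

if __name__ == "__main__":
    HTTPServer(("", 8000), AuthHandler).serve_forever()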

Let’s look at different examples of how you may want to use the robots.txt file:

Allow everything and submit the sitemap – This is the best option for most websites: it allows all search engines to fully crawl the website and index all of its data, and it even shows the search engines where the XML sitemap is located so they can find new pages very quickly:

User-agent: *
Allow: /

# Sitemap Reference
Sitemap: http://www.example.com/sitemap.xml

Allow everything apart from one sub-directory – Sometimes there is an area of your website which you don’t want to appear in search engine results. This could be a checkout area, image files, an irrelevant part of a forum or an adult section of a website, for example, as shown below. Any URL containing a disallowed path will be excluded by the search engines:

User-agent: *
Allow: /

# Disallowed Sub-Directories
Disallow: /checkout/
Disallow: /website-images/
Disallow: /forum/off-topic/
Disallow: /adult-chat/

Allow everything apart from certain files – Sometimes you may want to show media on your website or provide documents, but you don’t want them to appear in image search results, social network previews or document search engine listings. Files you may wish to block could be animated GIFs, PDF instruction manuals or any development PHP files, as shown below (the trailing $ anchors each pattern to the end of the URL):

User-agent: *
Allow: /

# Disallowed File Types
Disallow: /*.gif$
Disallow: /*.pdf$
Disallow: /*.PDF$
Disallow: /*.php$

Allow everything apart from certain webpages – Some webpages on your website may not be suitable to show in search engine results, and you can block individual pages with the robots.txt file as well. Webpages you may wish to block could be your terms and conditions page, a page which you need to remove quickly for legal reasons, or a page with sensitive information that you don’t want to be searchable (remember that anyone can still read your robots.txt file, and the pages will still be visited by unscrupulous crawler bots):

User-agent: *
Allow: /

# Disallowed Web Pages
Disallow: /terms.html
Disallow: /blog/how-to-blow-up-the-moon
Disallow: /secret-list-of-contacts.php

Allow everything apart from certain patterns of URLs – Lastly, you may have an awkward pattern of URLs which you wish to disallow, ones which may not be nicely grouped into a single sub-directory. Examples of URL patterns you may wish to block might be internal search result pages, leftover test pages from development, or the 2nd, 3rd, 4th etc. pages of an ecommerce category (a sketch of how these patterns are matched follows the example below):

User-agent: *
Allow: /

# Disallowed URL Patterns
Disallow: /*search=
Disallow: /*_test.php$
Disallow: /*?page=*
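If you’re curious how these wildcard rules behave, here is a rough Python sketch of the matching convention most modern crawlers follow, where * matches any run of characters and a trailing $ anchors the rule to the end of the URL. This is only an illustration of the convention, not any search engine’s actual code:

import re

def rule_to_regex(rule):
    # Translate a Disallow pattern into an equivalent regular expression.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"  # a trailing "$" anchors the end of the URL
    return re.compile(pattern)

rules = ["/*search=", "/*_test.php$", "/*?page="]
for url in ["/shop?search=hat", "/old_test.php", "/category?page=3", "/page.html"]:
    blocked = any(rule_to_regex(r).match(url) for r in rules)
    print(url, "->", "blocked" if blocked else "allowed")

Running this, the first three URLs come out blocked and /page.html stays allowed.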

How to test robots.txt?

You can test robots.txt in one very easy way 🙂 just browse directly to the file from your local or remote browser. As an example, you can view my robots.txt file this way.

I recommend you test your robots.txt to ensure that search crawlers can access it from any location (or a specific one). In addition, you can “ask” Google to check the robots.txt file, and YES! Google can check it 🙂
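You can also test rules programmatically. Below is a quick sketch using urllib.robotparser from Python’s standard library; note that this parser implements the original robots.txt protocol and does not understand the * and $ wildcard extensions, and www.example.com is just a placeholder domain:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # placeholder domain
rp.read()  # fetches and parses the live file

for url in ("http://www.example.com/", "http://www.example.com/checkout/basket"):
    print(url, "->", "allowed" if rp.can_fetch("*", url) else "blocked")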

By taking a good look at your website’s robots.txt file and making sure that the syntax is set up correctly, you’ll avoid search engine ranking problems. By disallowing the search engines from indexing duplicate content on your website, you can also overcome duplicate content issues that might hurt your rankings.

The robots.txt tester, located under the Crawl section of Google Webmaster Tools, will now let you test whether there’s an issue in your file that’s blocking Google. (This section of GWT used to be known as Blocked URLs.)

[Screenshot: the robots.txt tester in Google Webmaster Tools]

Here you’ll see the current robots.txt file and can test new URLs to see whether they’re disallowed for crawling. To guide your way through complicated directives, it will highlight the specific one that led to the final decision. You can make changes in the file and test those too; you’ll just need to upload the new version of the file to your server afterwards for the changes to take effect. Google’s developers site has more about robots.txt directives and how the files are processed.

Additionally, you’ll be able to review older versions of your robots.txt file and see when access issues blocked Google from crawling. For example, if Googlebot sees a 500 server error for the robots.txt file, it will generally pause further crawling of the website.

Since there may be some errors or warnings shown for your existing sites, it’s worth double-checking their robots.txt files. You can also combine the tester with other parts of Webmaster Tools: for example, you might use the updated Fetch as Google tool to render important pages on your website. If any blocked URLs are reported, you can use the robots.txt tester to find the directive that’s blocking them and, of course, then fix it. A common problem comes from old robots.txt files that block CSS, JavaScript, or mobile content; fixing that is often trivial once you’ve spotted it.

 

Conclusion

I must say my opinion about robots.txt is quite uncommon, but I think it is a security flaw for websites and big systems running on the web.

Many attackers and hackers can use the information in robots.txt to help them achieve a successful penetration of the system.
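To make that concrete, here is a small sketch of the attacker’s side: a few lines of Python that download a site’s robots.txt and list every path the owner asked crawlers to avoid. Those paths are often a neat map of the most interesting parts of a site. The target domain below is a placeholder; any of the sites in the examples that follow would answer the same way:

import urllib.request

def disallowed_paths(site):
    # Fetch /robots.txt and return every Disallow'd path.
    with urllib.request.urlopen(site.rstrip("/") + "/robots.txt") as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()
    return [line.split(":", 1)[1].strip()
            for line in lines
            if line.lower().startswith("disallow:")]

for path in disallowed_paths("http://www.example.com"):  # placeholder target
    print(path)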

Here are a few quick examples of the problem, taken from Alexa’s top sites (http://www.alexa.com/topsites):

# Example 1

website: http://baskino.com/

robots file: http://baskino.com/robots.txt

[Screenshot: baskino.com robots.txt]

Cool info that isn’t linked anywhere on the site: http://baskino.com/statistics.html

[Screenshot: the hidden baskino.com statistics page]

# Example 2

website: http://news.sky.com/

robots file: http://news.sky.com/robots.txt

[Screenshot: news.sky.com robots.txt]

Cool info that isn’t linked anywhere on the site: http://news.sky.com/status/status.json

[Screenshot: the hidden status.json feed]

# Example 3

website: http://kukuruku.co/

robots file: http://kukuruku.co/robots.txt

[Screenshot: kukuruku.co robots.txt]

Cool info that isn’t linked anywhere on the site: http://kukuruku.co/include/

[Screenshot: the exposed /include/ directory]

# Example 4

website: https://www.yahoo.com/

robots file: https://www.yahoo.com/robots.txt

[Screenshot: yahoo.com robots.txt]

Cool info that isn’t linked anywhere on the site: https://www.yahoo.com/_remote

[Screenshot: the hidden /_remote endpoint]

Cool info that isn’t linked anywhere on the site: https://www.yahoo.com/_tdpp_api

 

# Example 5

website: http://www.arcas.co.uk/

robots file: http://www.arcas.co.uk/robots.txt

[Screenshot: arcas.co.uk robots.txt]

Cool info that isn’t linked anywhere on the site: http://www.arcas.co.uk/administrator/

[Screenshot: the exposed Joomla administrator login]

As you can see, in 60 seconds I can find out what a website is hiding, and that information may help me penetrate the site successfully.

Stay safe!

Good luck