Even the most experienced SEO professionals sometimes mess up robots.txt and take a hit on the organic performance of their websites. If experienced professionals can get it wrong, think of the marketers who are implementing it for the first time.

That said, I don’t want to underestimate the capabilities of experienced professionals or of upcoming marketers who have a decent understanding of the domain. More often than not, the mess happens because of an inappropriate process and a poor understanding of how important robots.txt implementation is in SEO.

In this article, I will walk you through the importance of robots.txt and the process template you should use to implement it correctly.

Before we get into anything bigger, let’s start with the basics.

What is a robots.txt file in SEO?

Robots.txt is a simple text file containing a list of commands or instructions that define what bots should and should not crawl on the website. It helps bots understand which types of URLs they can crawl and which they cannot.

Why is robots.txt important in SEO?

Anyone who has thoroughly understood how search engines function will have already figured out the reasons for this. If you haven’t yet got a hold of how search engines work, please go through this article.

In brief: it avoids the wastage of crawl resources!

Google bots visit your website with a certain amount of energy or bandwidth, and to make use of it effectively you have to help the bots understand what should be ignored and what should be considered for crawling.

E.g.: If a bot comes to your website with a time budget of 10 minutes and has to spend all of it on the wrong URLs, it never gets to the valid URLs. As a result, your valid web pages never rank for the keywords they are supposed to rank for; sometimes they do not get crawled and indexed at all.

Who wants the real performer to be ditched? No one wants the best to lose. By implementing robots.txt we help the best/valid pages get crawled often, rank for as many keywords as possible, and appear in the top fold of the SERP.

Note: If your website is not getting crawled, it will never rank for any keywords.

The SEO rule of thumb is to improve crawl stats as much as possible even before you touch any other optimizations for the visitors coming to the site. Read more about improving crawl stats and how it helps SEO.

What happens if you let bots waste their energy crawling invalid/wrong pages?

There is more loss than you can imagine. It is not just the wasted crawl resources; your website may also run the risk of a search engine penalty, such as a Google Panda algorithm penalty, and possibly others, when your invalid pages lead to:

  1. Content duplication
  2. Creation of thin webpages
  3. Competition between valid and invalid web pages/URLs, so that both stop performing
  4. Very slow domain authority (DA) growth, as most of it gets distributed across various other pages

With the above problems, your website eventually loses all of its organic traffic to its competitors.

How does one stop the wastage of crawl resources?

That is where robots.txt implementation, along with a couple of other implementations such as noindex, nofollow and canonical tags, comes into the picture. Here, though, we are going to discuss only the robots.txt implementation.

Implementing or creating the robots.txt file involves understanding three different things:

  1. Understanding the valid and invalid URLs on the website
  2. Understanding the commands used in robots.txt
  3. The process of evaluation

It is really important to follow the process when creating robots.txt files. If you fail to do so, you risk blocking the valid pages and allowing the invalid ones, and the organic performance of the website goes for a toss, losing all of your traffic to competitors. You can use this template to make sure you have enough clarity about the valid and invalid URLs.

Understanding the valid and invalid URLs on the website:

How does one understand what is valid and what is not? If you are aware of the URL structure, how the system is built, the type of website (dynamic/static), the products and more, you will find it easy to figure out which URLs are valid and which are not.

If you do not have this understanding, sit with your developer, product manager or SEO head, who can help you understand which URLs on the system are valid and which are not.

After understanding the valid and invalid URLs, the immediate next step is to find out what is happening with those invalid URLs. To understand that, you can follow the process below.

Understanding the crawling of valid and invalid web pages (we have to stop bots from crawling invalid web pages)

If your website is small, with a few tens or hundreds of web pages, follow the process below to understand what is and is not getting crawled by the bots.

Rely on webmaster data:

Log in to Google Webmaster Tools, or if you do not have an account, create one right away. Once you have it, here is what you have to do to figure out whether your invalid URLs are getting crawled.

You have to figure out all the invalid URL structures, patterns or themes.

Go through the HTML Improvements (HTML errors) report:

In the case of invalid URLs originating from parent/valid URLs, you will see the invalid URLs containing a question mark (?), an equals sign (=), an ampersand (&) or just an ID (132, 198, etc.) with no change, or only a minor change, in the meta tags.

Typically these invalid pages will have exactly the same content as the parent page.

Alternatively, you may find that they have no title or description, or only one or two words of content; but this is possible with your valid pages as well, so just make sure they are not your valid web pages/URLs.
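
If you want to spot such patterns in bulk rather than eyeballing them, here is a minimal Python sketch. The patterns and sample URLs are hypothetical placeholders; adjust them to your own site's structure.

import re

# Hypothetical patterns that often indicate auto-generated/invalid URLs:
# query strings (?, = or &) and URLs ending in a bare numeric ID.
INVALID_PATTERNS = [
    re.compile(r"[?=&]"),      # parameterised URLs such as /page?id=132
    re.compile(r"/\d+/?$"),    # URLs ending in a numeric ID such as /198
]

def looks_invalid(url):
    # Returns True if the URL matches any of the suspicious patterns.
    return any(pattern.search(url) for pattern in INVALID_PATTERNS)

# Hypothetical example URLs
urls = [
    "https://example.com/services/seo",
    "https://example.com/services/seo?sort=price&page=2",
    "https://example.com/products/132",
]

for url in urls:
    print(url, "->", "possibly invalid" if looks_invalid(url) else "looks valid")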


Download the internal links report from Google Webmaster Tools:

You can download the internal links report, evaluate it against your valid URLs, and figure out the invalid URLs easily.

Download the external links data from Google Webmaster Tools:

This shows the URLs to which links have been built from external websites. You can download and process this data against your valid URLs to figure out the invalid ones.
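
Both the internal and external links exports can be checked the same way. Here is a minimal Python sketch, assuming a CSV export with the URL in the first column; the file name and the list of valid path prefixes are placeholders you would replace with your own.

import csv
from urllib.parse import urlparse

# Hypothetical list of path prefixes you consider valid; adjust to your site.
VALID_PREFIXES = ("/blog/", "/services/", "/about")

def is_valid(url):
    path = urlparse(url).path
    return path == "/" or path.startswith(VALID_PREFIXES)

# "links_export.csv" is a placeholder for the report downloaded from
# Google Webmaster Tools (the URL is assumed to be in the first column).
with open("links_export.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header row
    suspicious = [row[0] for row in reader if row and not is_valid(row[0])]

print(len(suspicious), "URLs do not match any valid pattern")
for url in suspicious[:20]:
    print(url)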

Take a look at your website's crawl errors in Webmaster Tools:

In this section you can find the crawl errors, which are typically either wrong URLs that are getting crawled or valid URLs that have an issue. Understanding the crawl errors associated with valid URLs is a different topic, and you can read all about it here. Continuing with the crawl errors of invalid URLs: you just have to identify the pattern, theme or structure of such wrong URLs and also understand from where they are getting crawled, so that you can fix the problem at its root instead of only defining rules in robots.txt.

Search console landing pages:

Go through your Search Console landing pages, either in Google Webmaster Tools or in Analytics if you have linked the two, in order to get more data instead of just a sample. From these landing pages you will be able to spot the invalid URLs, if there are any, when you check them against your valid URL list.

Note: The data you get from Google Webmaster Tools is sampled data, so you cannot rely on it completely or take it for granted.

Do a site:domain.com search on Google:

site:yourdomainname shows you all the types of URLs that have been crawled and indexed by search engines. You may have to go through all of the results if the index count is small; for massive websites there is a different approach, which I discuss below. From the results you will be able to understand the invalid URL patterns.

For massive websites, to get the same data you can rely on tools like:

Majestic SEO:

Majestic SEO gives you the backlinks, i.e. the URLs that are getting crawled through backlinks. You can download this report and check it against your valid URL list.

IIS Crawler:

The IIS crawler crawls the complete site and generates an SEO report that includes a lot of SEO-related issues. If the internal linking on the system is good, this report also includes the complete list of URLs, so you can see every URL on the website and identify the invalid ones.

MOZ Bar:

If there are too many search results, set the results count per results page, download all the results, process them, and identify the wrong URLs if there are any.

Apart from the tools mentioned above, you can rely on any other tool that gives you the same details.
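
Whichever tool the URL list comes from, the processing step is the same. Here is a minimal Python sketch, assuming a plain text file with one URL per line (the file name is a placeholder), that summarises the most common first path segments and query parameters so that invalid URL families stand out.

from collections import Counter
from urllib.parse import urlparse, parse_qs

segment_counts = Counter()
param_counts = Counter()

# "crawled_urls.txt" is a placeholder for the URL list exported from
# whichever tool you used, one URL per line.
with open("crawled_urls.txt", encoding="utf-8") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        parsed = urlparse(url)
        path = parsed.path.strip("/")
        # First path segment, e.g. "/services/seo?x=1" -> "/services"
        segment_counts["/" + path.split("/")[0] if path else "/"] += 1
        # Query parameter names, e.g. "?sort=price&page=2" -> sort, page
        for name in parse_qs(parsed.query):
            param_counts[name] += 1

print("Most common first path segments:", segment_counts.most_common(10))
print("Most common query parameters:", param_counts.most_common(10))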

What’s next after identifying the invalid URLs?

Simple: you just have to block all of them from being crawled by search engines by writing a specific set of commands/instructions.

Don’t worry! This doesn’t need any technical knowledge; you can do it yourself.

Defining the commands in the robots.txt file:

You just have to know about the following five elements, their usage and their functionality.

  • User-agent:

User-agent is the first thing you should declare in the robots.txt file; this is where you define which bot the rules apply to. For example, "User-agent: *" means the rules apply to all bots, while "User-agent: Googlebot" means they apply only to Googlebot. You just have to figure out the bot names and use them accordingly for whichever bots you want to address.

  • Allow

Allow is used when you want web pages to be crawled.

  • Disallow

Disallow is used when you want web pages to be blocked from crawling. It is very simple and clear: to allow, write Allow; to block, write Disallow.

  • Hash #

Hash (#) is used for commenting.

  • Asterisk *

Asterisk (*) is a wildcard that makes a rule apply to everything matching the pattern.

Here is an example/sample robots.txt with its function explained; kindly go through it.

User-agent: Googlebot
Allow: /
Disallow: /*? #Restricts crawling of URLs containing "?" from the second level
Disallow: /*= #Restricts crawling of URLs containing "=" from the second level
Disallow: /*_ #Restricts crawling of URLs containing "_" from the second level
Disallow: /*payment #Restricts crawling of URLs containing the term "payment" from the second level
Disallow: /*.php #Restricts crawling of URLs containing ".php" from the second level
Disallow: /*.index #Restricts crawling of URLs containing ".index" from the second level
Disallow: /wp-blog* #Restricts crawling of URLs starting with "wp-blog" at the second level
Disallow: /community/profile/* #Blocks crawling of user profiles
Disallow: /services/* #Restricts crawling of URLs under "services/" at the second level

Note: Here, the first level indicates the domain name (e.g. https://kandra.pro/), and the second level is whatever follows it in the path.

How to evaluate the Robots.txt Implementation (Usage of robots.txt Checker)

Google has multiple tools, each for a different purpose, and here is one tool from Google for checking the robots.txt file's functionality: you can find it in Google Webmaster Tools under the Crawl section as the robots.txt Tester. With it you can easily check whether the valid/invalid URLs are allowed or blocked as intended by each command.
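
If you prefer to spot-check URLs in bulk from a script, here is a rough Python sketch that approximates how the wildcard Disallow rules from the sample above match URL paths. It is only a simplification of Googlebot's actual matching (it ignores Allow precedence and the "$" end-anchor), and the sample paths are hypothetical.

import re

# Disallow patterns copied from the sample robots.txt above.
disallow_rules = ["/*?", "/*=", "/*_", "/*payment", "/*.php",
                  "/*.index", "/wp-blog*", "/community/profile/*", "/services/*"]

def rule_to_regex(rule):
    # Rough approximation: "*" matches anything; the rule is anchored to the path start.
    return re.compile(".*".join(re.escape(part) for part in rule.split("*")))

compiled_rules = [(rule, rule_to_regex(rule)) for rule in disallow_rules]

def blocked_by(path):
    # Returns the first Disallow rule that matches the path, if any.
    for rule, pattern in compiled_rules:
        if pattern.match(path):
            return rule
    return None

# Hypothetical paths to spot-check.
for path in ["/blog/robots-txt-guide", "/services/seo",
             "/product?id=132", "/community/profile/jane"]:
    rule = blocked_by(path)
    print(path, "->", "blocked by " + rule if rule else "allowed")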


Conclusion:

Robots.txt implementation is an integral part of search engine optimization, and you cannot skip working on it, especially when the website is massive, has various kinds of assets or is dynamic in nature. Saving crawl resources through robots.txt is like building backlinks from a few hundred credible and relevant websites, so even before you start working on lower-priority SEO tasks you have to deal with it in order to see quick progress on your website on the SEO front.


Manjunath Chowdary

Digital Marketing Consultant

-Kandra Digital

An agency built with the core purpose of delivering quality digital marketing in an era where digital marketing services are often just a business, rather than a source of value for the business, the business owners and their resources/time.
