What Is Robots.txt & What Can You Do With It?


What is a robots.txt file?

Robots.txt is a short text file that instructs web crawlers (such as Googlebot) what they are allowed to crawl on your website.

From an SEO standpoint, robots.txt helps search engines crawl your most important pages first and keeps bots away from non-critical pages.

Here’s what robots.txt might look like:

robots.txt example

Where to find robots.txt

Finding a robots.txt file is quite easy – go to any domain's homepage and add "/robots.txt" to the end of the URL.

This will show you a real, working robots.txt file. Here is an example:

https://yourdomain.com/robots.txt

The robots.txt file is publicly accessible, so you can check it on practically any website – including sites like Amazon, Facebook, or Apple.

Why is robots.txt important?

The purpose of the robots.txt file is to let crawlers know which parts of your website they can access and how they should interact with your pages.

Generally speaking, you want the content of your website to be crawled and indexed – search engines need to find your pages before they can appear as search results.

However, in some cases it is better to block web crawlers from visiting certain pages (such as blank pages, your website's login page, etc.).

You can do this with a robots.txt file, which well-behaved crawlers check before they start crawling a website.

Note: The robots.txt file may prevent search engines from crawling, but not from indexing.

Although crawlers may be prohibited from visiting a particular page, search engines may still index it if an external link points to it.

This indexed page may therefore appear as a search result, but without any useful content – since crawlers could not crawl any data from the page:

Indexed page is blocked by robots.txt

To prevent Google from indexing your pages, use other appropriate methods (such as noindex meta tags) to indicate that you do not want certain parts of your website to appear as search results.

In addition to the basic purpose of the robots.txt file, there are some SEO benefits that can be effective in certain situations.

1. Optimize crawl budgets

A crawl budget determines how many pages a web crawler like Googlebot will crawl (or recrawl) in a given period of time.

Large websites often have many unimportant pages that don't need to be crawled and indexed frequently (or at all).

Using robots.txt tells search engines which pages to crawl and which to skip altogether, which optimizes crawling efficiency and frequency.
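For example, a robots.txt sketch for this might look like the following – the /search/ and /tag/ paths are just hypothetical low-value sections; use the paths that actually exist on your site:

# Keep crawlers focused on the important content (hypothetical low-value paths)
User-agent: *
Disallow: /search/
Disallow: /tag/
Sitemap: https://yourdomain.com/sitemap.xml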

2. Manage duplicate content

Robots.txt can help you avoid crawling similar or duplicate content on your pages.

Many websites contain some duplicate content – pages with URL parameters, www vs. non-www pages, identical PDF files, etc.

By listing these pages in robots.txt, you can keep crawlers away from content that does not need to be crawled and help search engines crawl only the pages you want to appear in Google Search.
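As a rough sketch (the parameter and file pattern below are hypothetical), you could keep crawlers away from parameterized duplicates and PDF copies of existing pages like this:

# Block parameterized duplicate URLs and PDF versions of pages (hypothetical patterns)
User-agent: *
Disallow: /*?sessionid=
Disallow: /*.pdf$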

3. Prevent server overload

Using robots.txt can help prevent website servers from crashing.

Generally speaking, Googlebot (and other reputable crawlers) are good at determining how fast they can crawl your website without overwhelming its server.

However, you may want to limit crawlers that visit your site too often and too aggressively.

In this case, robots.txt may tell crawlers that they should focus on specific pages, leaving other parts of the website alone and thus preventing site overload.

Or, as Martin Splitt, Google's developer advocate, explains:

It is basically how much pressure we can put on your server through the crawl rate without crashing anything or overloading the server.

You may also want to block specific bots that cause problems on your site – for example, a "bad" bot that overloads your site with requests, or a scraper that tries to copy all the content on your website.
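A minimal sketch of blocking a single troublesome bot could look like this – "BadBot" is a placeholder user-agent name, and keep in mind that truly malicious scrapers often ignore robots.txt entirely:

# Block one specific bot from the whole site; everyone else may crawl everything
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: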

How does the robots.txt file work?

The basic principles of how a robots.txt file works are very simple – it consists of 2 basic elements that specify which crawler is being addressed and what it should (or should not) do:

  • User-agent: specifies which crawler the following instructions apply to.
  • Instructions: tell the user-agent what it should do (or avoid doing) with specific pages.

Here is the simplest example of what a robots.txt file might look like with these two components:

User-agent: Googlebot
Disallow: /wp-admin/

Let’s take a closer look at both of them.

User-agent

A user-agent is the name of the specific crawler that the instructions on how to crawl your website are addressed to.

For example, the user-agent of Google's main crawler is "Googlebot", Bing's crawler is "Bingbot", Yahoo's is "Slurp", etc.

You can also address all web crawlers at once with the "*" symbol (called a wildcard) – it represents every bot that "respects" the robots.txt file.

It looks like this in the robots.txt file:

User-agent: * 
Disallow: /wp-admin/

Note: Keep in mind that there are many types of user-agents, each focusing on crawling for different purposes.

If you want to see which user-agents Google uses, check out the Google Crawler Overview.
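For instance, Google uses a separate user-agent for image crawling (Googlebot-Image). A sketch that keeps it out of a hypothetical /private-images/ folder while leaving the main Googlebot unrestricted might look like this:

# Only Google's image crawler is restricted here
User-agent: Googlebot-Image
Disallow: /private-images/

User-agent: Googlebot
Disallow: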

Instructions

Robots.txt instructions are rules that specific user-agents will follow.

By default, crawlers crawl every page they can reach – robots.txt then specifies which pages or sections of your website should not be crawled.

The 3 most common rules used are:

  • Disallow – tells crawlers not to access anything that matches the specified path. You can assign multiple Disallow instructions to a user-agent.
  • Allow – tells crawlers that they may access certain pages within an otherwise disallowed section of the site.
  • Sitemap – if you have set up an XML sitemap, robots.txt can point web crawlers to it so they can find the pages you want crawled.

Here’s an example of what robots.txt might look like with these 3 simple instructions:

User-agent: Googlebot
Disallow: /wp-admin/ 
Allow: /wp-admin/random-content.php 
Sitemap: https://www.example.com/sitemap.xml

With the first line, we determine that the instruction applies to a specific crawler – Googlebot.

In the second line (the instruction), we told Googlebot that we do not want it to access a specific folder – in this case, the admin (login) area of a WordPress site.

On the third line, we added an exception – although Googlebot cannot access the /wp-admin/ folder as a whole, it may still visit this one specific URL.

With the fourth line, we told Googlebot where to find the sitemap with the list of URLs we want crawled.

There are also a few other useful symbols that can be used in your robots.txt file – especially if your site has thousands of pages to manage.

* (Wildcard)

The wildcard * is a symbol that applies a rule to every URL matching a given pattern.

This rule is especially useful for websites that have a lot of content, filtered product pages, and so on.

For example, instead of disallowing each product category under /products/ individually (as in the example below):

User-agent: * 
Disallow: /products/shoes?
Disallow: /products/boots?
Disallow: /products/sneakers?

We can disallow them all at once using the wildcard:

User-agent: * 
Disallow: /products/*?

In the example above, every user-agent is instructed not to crawl any URL under the /products/ section that contains a question mark "?" (often used for parameterized product category URLs).

$

The $ symbol marks the end of a URL – it lets you tell crawlers which URLs with a specific ending they should (or should not) crawl:

User-agent: *
Disallow: /*.gif$

The "$" sign tells bots to ignore all URLs that end in ".gif".

#

The # sign serves as a comment for human readers – it does not affect any user-agent and does not act as an instruction:

# We don't want any crawler to visit our login page! 
User-agent: *
Disallow: /wp-admin/

How to create a robots.txt file

Creating your own robots.txt file is not rocket science.

If you use WordPress for your site, you will have a basic robots.txt file already created – similar to the one shown above.
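Depending on your WordPress version, this default (virtual) file typically looks something like this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php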

However, if you plan to make additional changes in the future, there are a few popular plugins that can help you manage your robots.txt file.

These plugins make it easy to control what you want to allow and what you want to disallow without having to write the syntax yourself.

Alternatively, you can also edit your robots.txt file via FTP – uploading a text file is quite easy if you are confident in accessing and editing it.

However, this method is more error-prone, and mistakes can slip in quickly.

How to check a robots.txt file

There are several ways to check (and test) your robots.txt file – first, try to find it yourself.

If you have not set up anything specific, your file will be hosted at "https://yourdomain.com/robots.txt" – the details may differ if you use another website builder.

To find out if search engines like Google can actually find and “comply” with your robots.txt file, you can do either:

  • Use the robots.txt Tester – a simple Google tool that can help you determine whether your robots.txt file works properly.
  • Check Google Search Console – look at the Coverage report for any errors caused by robots.txt, and make sure no URLs inadvertently report the "Blocked by robots.txt" message.
Google Search Console - blocked by robots.txt example

Robots.txt best practices

Robots.txt files can easily get complicated, so it's best to keep things as simple as possible.

Here are some tips to help you create and update your own robots.txt file:

  • Use a separate file for each subdomain – if your website has multiple subdomains, treat them as separate websites and create a separate robots.txt file for each one.
  • Specify each user-agent only once – try to combine all the instructions assigned to a specific user-agent into a single group. This keeps your robots.txt file simple and organized (see the sketch after this list).
  • Be specific – be sure to specify the correct URL paths and pay attention to any trailing slashes or special characters that are present (or missing) in your URLs.
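As a sketch of the second tip, all the rules for one crawler can be grouped under a single User-agent line (the paths below are only examples):

# One group per user-agent keeps the file easy to read and maintain
User-agent: *
Disallow: /wp-admin/
Disallow: /search/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://yourdomain.com/sitemap.xml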


