Needing to deindex pages from Google, that is, to remove them from its index, is more common than you might think. It can help clean up a site and sometimes recover from a penalty. Here’s how…
Why do we sometimes have to deindex pages?
There are several situations:
- You didn’t realize that some pages were indexed – and that wasn’t planned
- Some pages create internal duplicate content and you want to get rid of it
- You feel that low-quality pages expose you to too much risk with Google’s algorithm (or even with the search quality team, should it review your site) and you want to remove them from Google
- One or more pages cause you legal problems and you must remove them as soon as possible from your site and Google
What is the difference between indexable pages and indexed pages?
An indexable page is a page that meets all the technical requirements for it to be indexed.
An indexed page is a page that Google has crawled and “decided” to add to its index (it happens that Google crawls an indexable page and does not index it anyway).
I remind you that to be indexable, a page must obviously also be “crawlable”!
A crawlable page is a page that Google is allowed to crawl: in plain terms, one that is not blocked in the robots.txt file (with one special case that I cover below). It must also be accessible (to Google) and in a supported format.
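To make the crawlability condition concrete, here is a minimal sketch using Python’s standard library `urllib.robotparser`; the robots.txt rules and URLs are hypothetical, purely for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for an example site
robots_txt = """User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A page blocked by robots.txt is not crawlable, hence not indexable
print(parser.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/public/page.html"))   # True
```

The same check works against a live site by calling `set_url()` and `read()` instead of `parse()`.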
Conversely, a non-indexable page is a page that Google is told not to index. As you may have guessed, to deindex a page from Google, you need to follow 2 steps:
- Make it non-indexable for Google
- Then deindex it
I detail these 2 steps below.
How do I make a page non-indexable?
The first question to ask yourself is probably the following: should the page you want to deindex remain viewable by Internet users?
How to deindex a page that is still accessible to Internet users?
In this case, you must choose from these solutions (the links give the details if necessary):
- Add a noindex robots meta tag (or none): this tells engines that you don’t want the page indexed. If it is currently indexed, it will be deindexed when Google detects this tag in the page; if it is not yet indexed, it will not be indexed in the future either (so it also works as prevention)
- Send a special HTTP header (X-Robots-Tag): this is the same idea as the noindex robots meta tag. It is necessary when the document to be deindexed is not an HTML page, because in that case you cannot add meta tags (PDF, Word or Excel documents, etc.).
- Set a canonical URL different from the URL of the page to be deindexed. For example, a product listing is accessible at both URL A and a temporary URL B because of a promotion. You can define in page B a canonical URL pointing to A. Be careful: the canonical URL is a hint that you provide to Google, which does not commit to respecting it in 100% of cases.
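To verify the first signal on your own pages, you can scan the HTML for a robots meta tag. Here is a minimal sketch using Python’s standard-library `html.parser`; the sample page is hypothetical:

```python
from html.parser import HTMLParser

class RobotsMetaScanner(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]

def is_noindex(html):
    """True if the page asks engines not to index it (noindex or none)."""
    scanner = RobotsMetaScanner()
    scanner.feed(html)
    return "noindex" in scanner.directives or "none" in scanner.directives

# Hypothetical page that asks engines not to index it
page = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
print(is_noindex(page))  # True
```

For non-HTML documents, the equivalent check is to look at the `X-Robots-Tag` response header instead of the page body.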
Then you either wait for Google to deindex the page, or you speed up the process (see below).
Since you have followed my explanations, you have understood that your page must remain crawlable, right? Because if you forbid Google to crawl it, it will never be able to see that you are asking for it to be deindexed.
How do I deindex a page that is no longer accessible?
In this case, you must choose from these most common solutions:
- Send an HTTP 404 or 410 code: this tells Google that the page does not exist (404) or no longer exists (410). The 410 code seems more efficient, because with a 404 code it can take several months before Google finally decides to deindex the page! If you’re lost in all of these HTTP codes, check out my list.
- Send a special HTTP header (X-Robots-Tag): the same idea as the noindex robots meta tag. It is necessary when the document to be deindexed is not an HTML page, because in that case you cannot add meta tags (PDF, Word or Excel documents, etc.).
- Redirect with a 301 to another page: use this method when you think the URL to be removed had earned (good) backlinks (for example on ecommerce or classified-ads sites). To avoid losing that benefit, set up a permanent redirect. Note that if you do this on a large number of URLs, Google is likely to treat them as soft 404s, and in the end the pages may not be deindexed.
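The decision logic above can be sketched as a small routing function. This is a hypothetical example, not a real server configuration: the paths, the "gone" set and the redirect map are all invented for illustration.

```python
# Hypothetical removed paths: permanently gone pages get a 410,
# pages whose backlinks we want to keep get a 301 to a replacement URL.
GONE = {"/old-promo", "/duplicate-listing"}
REDIRECTS = {"/old-product": "/new-product"}

def response_for(path):
    """Return the (status_code, redirect_target) pair for a removed URL."""
    if path in REDIRECTS:
        return 301, REDIRECTS[path]  # permanent redirect preserves link equity
    if path in GONE:
        return 410, None             # explicitly "gone": faster deindexing than 404
    return 404, None                 # unknown URL: plain not found

print(response_for("/old-promo"))    # (410, None)
print(response_for("/old-product"))  # (301, '/new-product')
```

In practice the same mapping would live in your web server configuration (e.g. rewrite rules) rather than in application code, but the decision tree is the same.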
Then, either you wait for Google to deindex the page (it can be very long), or you speed up the process (see below).
How do I check that a page is non-indexable?
You can use different tools to verify that you are in one of the situations described above.
However, I recommend that you go through specialized software (such as RM Tech, the one I designed at My Ranking Metrics). After an exhaustive analysis of your site, it will list all the URLs of non-indexable HTML pages.
You will be able to confirm that the non-indexable pages are the ones you intended. Conversely, if the tool lists non-indexable pages that should have been indexable, that is a rather serious error…
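The conditions such a tool checks can be summarized in a single verdict function. This is only a simplified sketch of the signals discussed in this article (crawlability, HTTP status, X-Robots-Tag, robots meta tag); the function name and parameters are my own, not any tool’s API:

```python
def is_indexable(crawlable, status_code, x_robots_tag="", meta_robots=""):
    """Simplified indexability verdict combining the signals discussed above."""
    # Collect directives from both the HTTP header and the meta tag
    tokens = {t.strip() for t in (x_robots_tag + "," + meta_robots).lower().split(",")}
    if not crawlable:
        return False  # Google cannot even fetch the page to see its directives
    if status_code != 200:
        return False  # 404, 410, 301... are not indexable responses
    if tokens & {"noindex", "none"}:
        return False  # the page explicitly refuses indexing
    return True

print(is_indexable(True, 200))                         # True
print(is_indexable(True, 200, meta_robots="noindex"))  # False
print(is_indexable(False, 200))                        # False
```

Note the subtlety from earlier: a page blocked from crawling returns `False` here, yet blocking the crawl alone will not deindex a page that is already in the index.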
How long will it take for Google to delete my pages?
Now that you have verified that the page(s) to be removed from Google are “non-indexable”, whether they are still online or not, you have to wait…
Indeed, the page will only be deindexed when Googlebot next tries to access it. And again, in the case of a 404 error, as I said, that can take a long time…
How do I quickly delete a page from Google?
Delete a page with Search Console
If you have only one page to delete, or a small number, the most effective approach is certainly to make an explicit request in Google Search Console. It used to be called the URL removal tool; since September 2015, Google has slightly changed the terms used, but the idea remains the same.
If, on the other hand, you have many URLs, it may be tedious or even impossible in practice to go through individual requests in Search Console.
Rest assured, I have a tip 🙂
It is not very well known and I share it here: list all the URLs to deindex in a sitemap file! A simple text file with one URL per line is more than enough (UTF-8 encoding), with a name of your choice. Declare this file in Search Console (Crawl > Sitemaps) and wait.
The idea is that a sitemap is not used to index pages, but to encourage Google to crawl URLs.
With this sitemap:
- Google will come quite quickly to crawl all these URLs
- It will find that they must be deindexed
- As it crawls them, it will deindex them
- In addition, each time you go to Search Console, you will know how many URLs of this sitemap are still present in the index.
As soon as all URLs are deindexed, you can delete this sitemap.
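Generating such a file is trivial. Here is a minimal sketch with hypothetical URLs and a hypothetical filename; the only requirements, as stated above, are one URL per line and UTF-8 encoding:

```python
# Hypothetical list of URLs to deindex; in practice you would export this
# from your crawler or CMS.
urls_to_deindex = [
    "https://example.com/old-page-1",
    "https://example.com/old-page-2",
    "https://example.com/temporary-promo",
]

# A plain text sitemap: one URL per line, UTF-8 encoded.
with open("deindex-sitemap.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(urls_to_deindex) + "\n")
```

You would then upload `deindex-sitemap.txt` to your site’s root and declare it in Search Console like any other sitemap.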
Can I use the robots.txt file to deindex pages?
Quick answer: no, for the simple reason that the robots.txt file manages crawling, not indexing.
Concretely, if you only forbid crawling of a URL, Google will no longer come to crawl it, that’s all. If the URL was already indexed, this will not deindex it! Google will simply never come to update it again. This is a classic mistake.
Admittedly, there is one small caveat: it is possible to delete a page via Search Console and then block it in the robots.txt file to prevent it from returning to Google’s index in the future. So it is not putting it in robots.txt that deindexes it, but the combination of a deindexation request in GSC plus blocking in robots.txt.
One last point: the Noindex directive placed in the robots.txt file. For years, Google took it into account even though it was never part of the standard and Google never mentioned it anywhere in its help pages. But in July 2019, Google indicated that it should no longer be used, because it would stop supporting it on September 1, 2019.