Crawling and indexation are different. We dissect your robots.txt versus your meta robots tags, outline safe practices, and help you recover pages that were removed from search by mistake.
In the SEO toolbox, almost nothing is as handy—or as easily muddled—as the noindex tag and the Disallow rule. Both guide search engine bots as they roam your site, but they serve separate missions. Pick the wrong one and you risk vanishing your pages from search results or leaving sensitive pages exposed. This guide shows you how to apply noindex and Disallow correctly every time.
Core Concepts: A Library Analogy
Picture the internet as the world’s biggest library, and Google is the head librarian. For example, each web page is like a book the librarian visits, skims, and decides whether to file in the card catalog that visitors use later.
The “Disallow” Directive
The “Disallow” notice works like a “Staff Only” sign taped to a back-room door. The full message lives in a small file called robots.txt at the site’s entrance. When a library helper sees the sign, they turn away. They do not open the door, flip the light switch, or peer through a window to see inside.
However, if they’re cataloging in the main hall and a patron mentions a book behind that door (like another site linking to your page), they may still write the title—the URL—on the card index. Since they never stepped inside, the catalog flags the title as “entry unverified.” Consequently, a page locked by “Disallow” can still appear in search results with only the URL and a note that no preview is available.
The “Noindex” Directive
Think of noindex as a note on a book’s first page that says, “Leave this out of the public inventory.” Before the librarian can read the note, they must enter the room, pull the book, and flip to that first page. The catch is the book must remain reachable on the shelf. If a “Disallow” sign locks the room, the librarian never sees the book, the note goes unread, and noindex never gets a voice.
Blocking Traffic With Disallow
So, what’s this Disallow command? It’s a simple rule in robots.txt that says, “Hey, Googlebot, don’t knock on this URL, folder, or the whole site.” Google explains it mainly manages how many requests the crawler sends so your server stays cool. However, robots.txt is a polite “please don’t,” not an iron “can’t.” Most well-behaved bots obey. Yet uncouth bots may ignore it. Therefore, if a page needs real secrecy, use passwords or access controls.
How to Make Your robots.txt Work
To keep bad robots from hammering your site, start with a plain robots.txt file. Next, follow these rules.
- Where to Put It: Your file should be a UTF-8 text file named robots.txt in the root folder. For example, if your site is at https://www.example.com, the file must live at https://www.example.com/robots.txt, not in a subfolder. Name it exactly robots.txt—don’t add extra characters or switch the letters around.
- How to Write It: robots.txt uses rule groups. Each starts with a User-agent line naming which robot is addressed. Then add Disallow to block paths or Allow to permit them. If you mean “everyone,” use the asterisk (*). Finally, remember that paths are case-sensitive, so type rules exactly right.
Disallow a Single Folder
User-agent: Googlebot
Disallow: /secret-folder/
Disallow Multiple Folders
User-agent: *
Disallow: /hidden-dashboard/
Disallow: /order-summary/
Disallow an Entire Site
User-agent: *
Disallow: /
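If you want to sanity-check rules like these before deploying them, Python’s standard library ships a robots.txt parser. Here is a minimal sketch using the example rules above (the domain and paths are just those placeholders):

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, combined into one robots.txt file.
rules = """\
User-agent: Googlebot
Disallow: /secret-folder/

User-agent: *
Disallow: /hidden-dashboard/
Disallow: /order-summary/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its own group, so only /secret-folder/ is off-limits to it.
print(rp.can_fetch("Googlebot", "https://www.example.com/secret-folder/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/hidden-dashboard/"))        # True

# Every other bot falls back to the * group.
print(rp.can_fetch("SomeOtherBot", "https://www.example.com/order-summary/"))        # False
```

Note that a crawler obeys only the most specific group that names it, which is why Googlebot here may still fetch /hidden-dashboard/.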
HINT: Disallowing a URL Isn’t a Guarantee
Here’s the nuance: robots.txt tells polite bots to stay away, but it does not prevent URL discovery. If Googlebot sees a link to a “disallowed” URL on another page, it may still record that URL. In that case, you’ll usually see a bare URL in results, no snippet, and a note that the site blocked access. Therefore, if your goal is to keep a URL out of results entirely, Disallow alone won’t do it. Use stronger access controls, or pair the removal with noindex once the page is crawlable again.
Using the Noindex Directive on Your Website
What Does Noindex Do?
When you place a noindex rule on a page, you are asking search engines, “Please do not save this page in your index.” Unlike “disallow,” noindex focuses on whether the page appears in results, not on whether the bot may read it. However, for noindex to work, search engines must crawl the page first. Also, note that putting noindex in robots.txt hasn’t worked with Google since 2019.
Using the Meta Robots Tag
The easiest way to say “noindex” is with a meta robots tag. For example, add the tag to the page’s head.
- Where to put it: Place the tag in the <head> section of your HTML.
- What it looks like: The tag can instruct all crawlers—or a specific one like Googlebot—to skip indexing.
HTML
<!DOCTYPE html>
<html>
<head>
<meta name="robots" content="noindex">
<title>This Page Will Not Be Indexed</title>
</head>
<body>
...
</body>
</html>
Bonus tip: You can set the content to noindex, follow. That way, search engines skip saving this page but still follow its links to discover more pages.
Using the X-Robots-Tag Header
Sometimes you’re handling non-HTML files like PDFs, JPGs, or MP4s. You can’t embed a <meta> tag there. The X-Robots-Tag solves this. When you configure the server to send this header, the instruction appears in the file’s HTTP header. Consequently, you get the same effect as the meta tag, but it works on any file type.
Here’s how a server response might look.
HTTP/1.1 200 OK
Date: Sat, 25 May 2024 21:42:43 GMT
(…)
X-Robots-Tag: noindex, nofollow
(…)
You usually set this in a server config. On Apache, for instance, you can cover every PDF with one rule (this requires the mod_headers module). Add this to .htaccess, and it goes to work.
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
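When auditing, you can verify the header is actually being sent by parsing the response. A small sketch of that check; for simplicity it ignores the optional user-agent prefix (e.g. "googlebot: noindex") that the header also supports:

```python
def parse_x_robots_tag(value):
    """Split an X-Robots-Tag value like 'noindex, nofollow' into a set of directives."""
    return {directive.strip().lower() for directive in value.split(",") if directive.strip()}

# Headers as they might arrive in the server response shown above.
headers = {"X-Robots-Tag": "noindex, nofollow"}

directives = parse_x_robots_tag(headers.get("X-Robots-Tag", ""))
print("noindex" in directives)    # True
print("nofollow" in directives)   # True
print("noarchive" in directives)  # False
```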
To recap quickly: use noindex when the page should be served but not listed in search results. “Disallow,” by contrast, belongs in robots.txt to tell crawlers not to peek at a page at all.
Quick Comparison
| Feature | Disallow (robots.txt) | Noindex (Meta Tag or X-Robots-Tag) |
| --- | --- | --- |
| What It Does | Stops search engines from crawling the page | Stops the page from showing in search engine results |
| How It Works | A line in a plain text file that lives at the top level of your website | A tag in the <head> of the page, or an HTTP header sent with the response |
| Where It Applies | The whole site, whole folders, or a single page | One page or one file at a time |
| Crawling Effect | Search engines skip the page entirely, so they never read it | Search engines must still crawl the page to read the tag |
| Indexing Effect | Doesn’t stop search engines from indexing the URL if another site links to it | Reliably keeps the URL out of results once the page has been crawled |
| Crawl Budget Effect | Broad, immediate savings, because the bot never loads the page | Costs a crawl request every time the bot fetches the page to read the tag |
| Link Effect | Links on the page are never seen, so they pass no value | Links may pass value at first, but that fades if the bot stops revisiting the page |
Avoiding Conflicts: Noindex and Disallow
We give robots two big jobs in SEO. One is to save our crawl budget by ignoring unneeded areas like endless filters or private API paths. The other is to remove thin pages—such as thank-you pages, empty pages, or on-site search results—from the public index. Block both kinds of pages, right? Almost. Just don’t combine the two directives the wrong way.
Mixing Blocking and Removing
If we stack both methods on the same page, we create confusion. We say, “Stay out—the door is locked” (disallow) and also say, “If you get in, forget what you saw” (noindex). Googlebot obeys “stay out” first and never sees the noindex tag. Consequently, the page can remain indexed via links even though the crawler cannot read it. The key we hoped would remove it never arrives, and the page lingers until we change the setup.
How to Remove a Page Correctly
If a page is blocked and still shows up in searches, use this process to clean it up:
- Let Robots In. Open robots.txt and delete the line that blocks that page. Googlebot must walk in.
- Put the Noindex Tag. Ensure the page’s HTML has <meta name="robots" content="noindex"> or that the HTTP response sends an equivalent X-Robots-Tag header.
- Give It Time. Allow Google a few days to recrawl, see noindex, and remove the page. Track this in the “Pages” report in Google Search Console.
- Request Indexing (Optional). If you’re in a hurry, enter the URL in the URL Inspection tool and click “Request Indexing.”
- Block It Again (if Needed). Only after the page disappears from the index should you add Disallow again. Use this to save crawl resources later.
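The conflict described above, and each step of this cleanup, can be expressed as a small decision function. This is a sketch, not a real API: `removal_status` is a hypothetical helper, it assumes Googlebot is the crawler you care about, and its regex assumes the meta tag lists name before content (the typical order):

```python
import re
from urllib.robotparser import RobotFileParser

# Matches <meta name="robots" content="...noindex..."> (name before content).
NOINDEX_META = re.compile(
    r'<meta[^>]+name=["\']robots["\'][^>]+content=["\'][^"\']*noindex',
    re.IGNORECASE)

def removal_status(robots_txt, url, html):
    """Classify whether a page is set up to drop out of Google's index."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    crawlable = rp.can_fetch("Googlebot", url)
    noindexed = bool(NOINDEX_META.search(html))
    if not crawlable and noindexed:
        return "conflict: the noindex tag is invisible while the URL is disallowed"
    if crawlable and noindexed:
        return "ok: the page will drop out after the next recrawl"
    if not crawlable:
        return "blocked only: the URL can still be indexed via external links"
    return "indexable: the page can be crawled and indexed normally"

robots = "User-agent: *\nDisallow: /old-page/\n"
page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(removal_status(robots, "https://www.example.com/old-page/", page))
# conflict: the noindex tag is invisible while the URL is disallowed
```

Running it on each problem URL tells you which step of the process above you are stuck on.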
When to Use Noindex
Common Use Cases
Choose the noindex tag when a page should remain accessible via links but shouldn’t appear for search queries.
- Utility Pages: Thank-you screens, login forms, and personal profile pages help customers but aren’t useful landing pages from search.
- Low-Value Content: Pages you don’t care to rank, like dated archives or tag collections with an empty body, should be hidden to avoid lowering perceived quality.
- Internal Search Results: On-site search pages create near-duplicate URLs that waste crawl budget and should stay out of the index.
- Campaign Landing Pages: Pages built purely for paid ads or newsletters are meant for specific audiences and shouldn’t appear in organic results.
- Staging Environments: During testing, apply noindex to keep unfinished content out of search until the full version is ready.
- Duplicate Content: For thin duplicates like printer-friendly pages, noindex is a fallback when a canonical tag is impractical.
When to Use Disallow
Best Uses
Choose Disallow when you want robots to avoid specific sections. Consequently, you protect your crawl budget and reduce server load.
- Optimizing Crawl Budget: Ensure Googlebot spends time on key pages. For example, e-commerce filters create endless combinations. Disallow duplicates so Google focuses on main product and category links.
- Blocking Unnecessary Files: If scripts, stylesheets, or API paths aren’t needed to understand the page, disallow them. As a result, crawlers skip extras and the server runs smoothly.
- Securing Core Folders: It’s common to block /wp-admin/. However, this does not lock the door; always use strong credentials.
- Restricting Heavy Media: Disallow large videos, PDFs, or images that hog bandwidth. Keep them in a separate folder and disallow the folder so crawlers pass them by.
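Taken together, those use cases might look like this in a single robots.txt file. The folder names are hypothetical placeholders; substitute your own paths:

```
User-agent: *
# Crawl-budget traps: faceted filter URLs
Disallow: /products/filter/
# API paths crawlers don't need
Disallow: /api/
# Admin area (pair this with real access controls)
Disallow: /wp-admin/
# Heavy media kept in its own folder
Disallow: /assets/videos/
```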
Advanced Topics and Strategy
Optimizing Your Crawl Budget
If you run a big site, watch crawl budget, which is how many pages Google can scan in a window of time. Smart choices between noindex and disallow shape performance. Disallow stops the crawl before it starts, conserving budget and server power. For massive sites, this keeps the train on time.
By contrast, noindex arrives late and uses budget. For noindex to apply, the crawler must visit the page first. You are effectively paying a crawl just to say “don’t index.” Therefore, use disallow to keep unwanted URLs out of the crawl queue. On big sites, trying to noindex every faceted URL will burn budget fast. The smarter play is Disallow. You may see a few URLs surface without descriptions via stray links, but that trade-off is usually worth it.
How Link Value is Affected
These directives also influence link value, or PageRank. For example, when you Disallow a URL, Googlebot skips the crawl. No links on that page are counted, so no PageRank flows from it.
When you add noindex, link value can fade gradually. Initially, Google crawls the page, notices noindex, and may still follow its links, passing some value. As time passes, Google sees the page remains noindexed. Consequently, it crawls the page less and may eventually stop fetching it. At that point, Google no longer evaluates the page’s links, and the effect resembles noindex, nofollow.
The lesson is simple: don’t route key internal links through noindexed islands. Instead, keep vital pages connected to indexable pages rich in internal links.
Implementation and Troubleshooting
Check Your Site
- Look for noindexed pages: Use a crawler like Screaming Frog and check the “Directives” tab. Google Search Console shows the same in the “Pages” report under “Excluded by ‘noindex’ tag.”
- Check robots.txt: Visit yourdomain.com/robots.txt to view the exact file search engines read.
Using Google Search Console
Google Search Console (GSC) shows how Google views your pages.
- URL Inspection Tool: This Swiss Army knife reveals index status, robots.txt blocking, last crawl date, and any noindex tag. Consequently, it’s ideal for spot checks.
- robots.txt Report: The standalone robots.txt Tester has been retired. In GSC’s settings, the robots.txt report shows Google’s latest fetched copy and any parsing errors; to test whether a specific URL is allowed or blocked, use the URL Inspection tool.
Common Problems and How to Fix Them
- “Submitted URL marked ‘noindex’”: This means a URL in your sitemap says “don’t index.” Either remove the noindex tag if the URL should be indexed, or remove it from the sitemap if not.
- “Blocked by robots.txt”: Google found a link but couldn’t read the page due to robots.txt. If you want it in results, delete the blocking line.
Conclusion
Key Takeaways
Figuring out the difference between “Disallow” and “Noindex” is more than a technical detail. In short, it’s the secret sauce of smart SEO. When you know what each one does, you help search engines understand your site.
The main takeaways are:
- Disallow tells the crawler “Skip this page, please.” Use it to manage your crawl budget.
- Noindex tells the crawler “Don’t let this page appear in search results.” Use it to curate what users see.
- Don’t mix the two. If you want a page removed from results, first remove Disallow so the crawler can see and obey noindex.
Final Thoughts
Knowing when and how to use each rule lets you direct how your site appears in search. Consequently, search engines highlight your most important content, tuck less useful links out of sight, and spend time where it delivers the best results.
Implementation steps
- Create a list of pages you want to hide in search; decide which still need a visit from the crawler.
- Tag pages that should pass value via links but stay out of search with a meta robots “noindex, follow” or the X-Robots-Tag.
- Use a robots.txt Disallow for actual traps or sensitive places you don’t want scanned.
- Serve a 410 or 404 for pages you don’t need, and file removal requests for urgent cases.
- Check the Pages (formerly Coverage) and Removals reports in Search Console to see how status has shifted.
Frequently Asked Questions
What does ‘noindex’ do?
Stops pages from showing up in search—but still lets link power pass with ‘follow’ so other pages stay strong.
What does robots.txt ‘disallow’ do?
Keeps bots from crawling a page; doesn’t stop it from showing up if someone links to it from outside.
Which should I use for facets?
Pair ‘noindex,follow’ with canonicals; skip full robots blocks for parameter URLs so pages still get some love.
How do I remove something quickly?
Serve a 410/404, add ‘noindex’, and ask for removal in GSC if it needs a nudge.
What about staging sites?
Password-protect it, add ‘noindex’, and also disallow to keep it off the radar and off the index.