The Big Picture

Why You Need to Know Who’s Looking at Your Website

Imagine you own a restaurant. You have a dining area where customers sit, a kitchen where food is prepared, and a back office where you handle payroll and supplier invoices. You want customers to see your dining area and menu—but you definitely don’t want random people wandering into your back office and reading your financial records.

Your website works the same way. Some pages are meant for the public: your homepage, your services page, your blog posts. Other pages are meant for internal use only: your admin dashboard, your staging pages, your test content. Search engines like Google send automated programs called crawlers (also known as bots or spiders) to visit websites, read their content, and decide what to show in search results. AI search platforms like ChatGPT, Perplexity, Gemini, and Copilot send their own crawlers too.

The question is: how do you tell these crawlers which pages they’re welcome to visit and which ones they should stay away from?

That’s where two important tools come in: robots.txt and meta robots tags. Together, they act like the security system for your website—one controls access at the building entrance, and the other puts locks on individual rooms.

Fernando, who runs a Filipino restaurant in Quezon City, learned about this the hard way. During development, his web developer added a line to the site’s robots.txt file that told Google to stay away from the entire website. The developer meant to remove it before launch but forgot. For three months, Fernando’s restaurant was completely invisible on Google. No search results, no clicks from his Google Maps listing to his site, nothing. He only discovered the problem when a friend told him, “I can’t find your restaurant on Google anymore.”

This article will teach you exactly what robots.txt and meta robots tags are, how they work, when to use each one, and—most importantly—how to make sure you haven’t accidentally hidden your business from search engines.

The Front Door

What Is robots.txt? Your Website’s Bouncer

A robots.txt file is a simple text file that lives at the very root of your website. If your website is fernandoskitchen.ph, then your robots.txt file would be found at fernandoskitchen.ph/robots.txt. Anyone can see it—just type that URL into your browser and hit enter.

Think of robots.txt as a set of instructions posted at your building’s front door. When a search engine crawler arrives at your website, the first thing it does is check the robots.txt file to see if there are any rules about where it can and cannot go. It’s like a security guard checking a list before letting someone into the building.

Here’s what a basic robots.txt file looks like:

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /staging/

Sitemap: https://fernandoskitchen.ph/sitemap.xml

Let’s break that down in plain English:

  • User-agent: * — This means “the following rules apply to all crawlers.” The asterisk is a wildcard that means “everyone.”
  • Allow: / — This says “you’re allowed to visit everything starting from the homepage.”
  • Disallow: /admin/ — This says “stay out of anything in the admin folder.”
  • Disallow: /staging/ — This says “don’t visit the staging area either.”
  • Sitemap: — This tells crawlers where to find your sitemap so they can discover all your public pages. (We covered sitemaps in the previous article.)

Fernando’s original robots.txt file had one devastating line: Disallow: /. That single forward slash after “Disallow” told every crawler to stay away from the entire website. No exceptions. Google obeyed the instruction, stopped crawling the site, and his pages gradually dropped out of search results.

Here’s an important thing to understand: robots.txt is a request, not a wall. Well-behaved crawlers like Googlebot, Bingbot, and most AI crawlers will respect your robots.txt rules. But malicious bots or scrapers might ignore them entirely. So robots.txt is great for managing search engines, but it’s not a security measure for protecting sensitive data. If you have truly private information, it should be behind a login or password, not just blocked by robots.txt.

The Room-Level Lock

What Are Meta Robots Tags? Page-by-Page Control

While robots.txt works at the building entrance, meta robots tags work at the individual room level. They’re small pieces of code placed inside the HTML of a specific page, and they tell search engines exactly what to do with that particular page.

Rina, a fashion blogger based in Manila, uses meta robots tags to manage her content carefully. She writes seasonal style guides that she eventually updates and merges into bigger articles. The old versions get a noindex tag so they stop appearing in search results, while the new combined article stays fully visible. This way, she doesn’t have dozens of outdated seasonal posts competing with her current content in Google and AI search results.

A meta robots tag looks like this in your page’s HTML code:

<meta name="robots" content="noindex, nofollow">

This tag sits inside the <head> section of your page—the part that browsers and crawlers read but visitors don’t see on screen. Let’s look at the most common values you can use:

  • index — Go ahead and add this page to your search results. (This is the default behavior, so you don’t technically need to write it, but many people include it for clarity.)
  • noindex — Do NOT add this page to your search results. The page will still exist on your website, but it won’t show up when people search on Google, Bing, or AI platforms.
  • follow — Go ahead and follow the links on this page to discover other pages. (Also the default.)
  • nofollow — Do NOT follow the links on this page. Don’t pass any ranking value through them.
  • noarchive — Don’t save a cached copy of this page.
  • nosnippet — Don’t show a text snippet or preview of this page in search results.

You can combine these values. For example, content="noindex, follow" means “don’t show this page in search results, but go ahead and follow its links to discover other pages.” This is useful for pages like category archives or tag pages that you don’t want cluttering up search results but that contain links to your actual content.
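
Here’s what that combined tag looks like in the <head> of a category or tag archive page:

<meta name="robots" content="noindex, follow">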

Understanding the Difference

Noindex vs. Nofollow: They’re Not the Same Thing

This is one of the most common points of confusion, so let’s clear it up with a simple comparison.

Marco, a real estate agent in Taguig, had a situation that illustrates the difference perfectly. He had a page on his website listing all the condos he’s sold—a “past transactions” page. He didn’t want this page showing up in Google because the listings were old and no longer available. But that page contained links to his testimonials page, his services page, and his contact page. He still wanted Google to follow those links.

The right approach for Marco was noindex, follow. Here’s why:

  • Noindex told Google: “Don’t show this past-transactions page in search results.”
  • Follow told Google: “But do follow the links on this page to discover my testimonials, services, and contact pages.”

If Marco had used noindex, nofollow instead, Google would have ignored the page and ignored all the links on it. That could have slowed down Google’s ability to discover and rank his other important pages.

Here’s a simple way to think about it:

  • Noindex = “Don’t show this page in search results.”
  • Nofollow = “Don’t follow the links on this page.”

They control two completely different things. Noindex is about the page itself. Nofollow is about the links the page contains. You can use one without the other, or you can use them together—it depends on what you’re trying to accomplish.

When should you use each one? Here are some common scenarios:

  • Thank-you pages after form submissions: Use noindex. There’s no reason for these to appear in search results.
  • Internal search results pages: Use noindex. Google doesn’t want to index a page that shows your site’s own search results—it considers that low-quality content.
  • Login or admin pages: Use noindex, nofollow. Keep them out of search results entirely and don’t pass link value through them.
  • Affiliate or sponsored pages: Consider nofollow on the affiliate links specifically, or use rel="sponsored" on individual links instead (see the examples after this list).
  • Old content being redirected: Use noindex on the old page while setting up a redirect. (We’ll cover redirects in the next article.)
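
To make a couple of these concrete: the first tag below goes in the <head> of a thank-you page, and the second shows rel="sponsored" on a single affiliate link (the URL is just a placeholder):

<meta name="robots" content="noindex">

<a href="https://example.com/product" rel="sponsored">Check out this product</a>
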
Choosing the Right Tool

robots.txt vs. Meta Robots Tags: When to Use Which

Now you know what both tools do, but when should you use robots.txt versus a meta robots tag? Here’s the key difference:

  • robots.txt prevents crawlers from visiting a page. The crawler never even looks at the page’s content.
  • Meta robots tags let crawlers visit the page but tell them what to do with what they find.

This distinction matters more than you might think. If you block a page with robots.txt, Google can’t see the page at all—which means it also can’t see a meta robots tag on that page. So if you block a page in robots.txt but also put a noindex tag on it, Google will never read the noindex tag because it was told not to visit the page in the first place.
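
For example, suppose your robots.txt contains this rule (using a made-up /old-pricing/ folder):

User-agent: *
Disallow: /old-pricing/

A noindex tag on any page inside that folder will never be read. Google stops at the front door, so it never sees the lock on the room.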

Here’s an important catch: if other websites link to a page you’ve blocked in robots.txt, Google might still show that page in search results—just with very limited information. Google knows the page exists because of the external links, but it can’t read the page to get details. The result? A search listing with your URL but no title or description. That looks unprofessional.

Dalisay, who runs a dental clinic in Metro Manila, discovered this problem when her clinic’s pricing page kept appearing in Google with the description “A description for this result is not available because of this site’s robots.txt.” She had blocked the pricing page with robots.txt because she didn’t want competitors seeing her prices. But a local health directory had linked to that page, so Google knew it existed and showed it anyway—just without any useful information.

The fix? Dalisay removed the robots.txt block on the pricing page and added a noindex meta tag instead. Now Google could visit the page, read the noindex tag, and properly remove it from search results. No more embarrassing half-listings.
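
If her pricing page lived at /pricing/ (a hypothetical path; the real one isn’t shown here), the change meant deleting Disallow: /pricing/ from robots.txt and adding this tag to the page’s <head>:

<meta name="robots" content="noindex">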

General rule of thumb: Use robots.txt to block entire sections of your site that crawlers don’t need to waste time on (like your admin area or a folder full of old test pages). Use meta robots tags when you want to keep individual pages out of search results but still want Google to be able to read them and follow their links.

The Silent SEO Killer

How to Check If You Accidentally Blocked Google

Patricia owns a bakery in Quezon City. She hired a freelance developer to redesign her website. The developer built the new site on a staging server and—correctly—added a noindex tag to every page during development so the unfinished site wouldn’t appear in Google. But when the developer moved the finished site to Patricia’s live domain, those noindex tags came along for the ride.

For four months, Patricia’s bakery was completely invisible on Google. She kept getting fewer and fewer online orders. She thought the economy was slow. It wasn’t the economy—it was a few lines of code that told Google to pretend her website didn’t exist.

This happens more often than you’d think. Here’s how to check if it’s happening to you:

Step 1: Do a Site Search on Google

Open Google and type site:yourdomain.com (replace “yourdomain.com” with your actual domain). This shows you pages from your website that Google has indexed. It’s not an exact or complete list, but it works as a quick health check: if you see zero results or far fewer pages than you expect, something might be blocking Google. Try the same search on Bing to compare results.

Step 2: Check Your robots.txt File

Open a browser and go to yourdomain.com/robots.txt. Look for any Disallow rules. If you see Disallow: / with nothing after the slash (see the example below), that means your entire website is blocked. If you see rules blocking specific important pages (like your homepage, services page, or blog), those need to be removed.
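
The worst-case version of the file looks like this; if yours matches, every crawler is being turned away:

User-agent: *
Disallow: /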

Step 3: Check Your Page Source for Noindex Tags

Visit one of your important pages, right-click anywhere on the page, and select “View Page Source.” Use your browser’s find function (Ctrl+F on Windows, Cmd+F on Mac) and search for “noindex.” If you find a meta tag with “noindex” in it, that page is being hidden from search results.
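
The tag you’re hunting for can appear in a couple of forms; the second one targets Google’s crawler specifically:

<meta name="robots" content="noindex">

<meta name="googlebot" content="noindex">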

Step 4: Use Google Search Console

If you have Google Search Console set up (and you should—we covered this in an earlier article), use the URL Inspection tool. Paste any important page’s URL and Google will tell you whether the page is indexed, whether it found any blocking directives, and what it sees when it visits the page. This is the most reliable way to diagnose crawling and indexing problems.

Step 5: Check for X-Robots-Tag in Server Headers

There’s one more place where noindex instructions can hide: your server’s HTTP headers. This is more technical, but some hosting providers or security plugins add an X-Robots-Tag: noindex header without you knowing. You can check this by using a tool like httpstatus.io or Google Search Console’s URL Inspection—it will show you if any HTTP headers are telling Google not to index your pages.
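
If you’re comfortable with a command line, you can also see your server’s response headers yourself with curl (swap in your own URL):

curl -I https://yourdomain.com/

In the output, look for a line like X-Robots-Tag: noindex. If it’s there and you didn’t add it, ask your host or check your security and SEO plugins.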

If you find any issues, fix them immediately. Every day your pages are blocked is a day you’re invisible to people searching for your products and services—on Google, Bing, and every AI search platform that relies on traditional search indexes.

The New Question

Should You Block AI Crawlers? The Trade-Offs

Here’s a question that didn’t exist a few years ago but is now very relevant: should you block AI crawlers like ChatGPT’s GPTBot, PerplexityBot, ClaudeBot, or Google’s AI systems from visiting your website?

AI companies send crawlers to websites to collect content that helps train their models and provide answers to users. When someone asks ChatGPT, Perplexity, Gemini, or Copilot a question, the answer might be based on content that was crawled from websites like yours.

You can block these AI crawlers using robots.txt. Here’s an example:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

Each User-agent line targets a specific AI crawler, and Disallow: / tells it to stay away from your entire site.
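
Blocking doesn’t have to be all or nothing, either. If you only wanted to keep GPTBot away from a premium-content folder (a hypothetical /premium/ path), the rule would be:

User-agent: GPTBot
Disallow: /premium/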

But before you rush to block everything, consider the trade-offs:

Reasons you might want to block AI crawlers:

  • You don’t want your original content used to train AI models without compensation.
  • You’re concerned about AI platforms reproducing your content and reducing traffic to your website.
  • You create premium or paywalled content that shouldn’t be freely available through AI answers.

Reasons you might want to allow AI crawlers:

  • When someone asks ChatGPT or Perplexity to recommend a restaurant in Quezon City, Fernando’s restaurant could appear in that recommendation—but only if the AI has been able to crawl his website.
  • AI-powered search is growing rapidly. Blocking AI crawlers today could mean being invisible in the search tools that millions of people will use tomorrow.
  • For most small businesses, the benefit of appearing in AI search results outweighs the risk of content being used for training.
  • Rina noticed that a significant portion of her blog traffic started coming from people who first discovered her through AI search recommendations. Blocking those crawlers would cut off that discovery channel.

The practical advice for most small businesses: Unless you have a specific reason to block AI crawlers (like protecting premium content), leave them open. The visibility benefits of appearing in ChatGPT, Perplexity, Gemini, and Copilot responses generally outweigh the risks for small business owners. You can always block specific AI crawlers later if your situation changes.

Keep in mind that this landscape is evolving quickly. Google’s own AI Overviews draw from the same index that Googlebot builds, so blocking Googlebot to avoid AI usage would also remove you from regular search results—which would be a disaster. The key is to be intentional and informed, not reactionary.

Avoid These Pitfalls

Five Common Mistakes That Hide Your Website from Search

Over the years, I’ve seen the same mistakes come up again and again. Here are the five most common ones—and how to avoid them:

Mistake 1: Leaving development noindex tags on your live site. This is what happened to Patricia’s bakery. Always do a final check before launching a redesigned site. Search the source code of every important page for “noindex” and remove any tags that shouldn’t be there.

Mistake 2: Using “Disallow: /” instead of “Disallow: /specific-folder/”. That single forward slash blocks your entire site. It’s the nuclear option. Only use it if you truly want to hide everything—which, as a business, you almost never do.
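
The difference in the file is only a few characters. Lines starting with # are comments, and the folder name here is made up:

# Blocks your ENTIRE site:
Disallow: /

# Blocks only one folder:
Disallow: /old-tests/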

Mistake 3: Blocking CSS and JavaScript files in robots.txt. Some older guides recommended blocking CSS and JS files from crawlers. This is outdated advice. Google needs to see your CSS and JavaScript to understand how your pages look and function. Blocking them can actually hurt your rankings because Google can’t render your pages properly.
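
On WordPress sites, this outdated advice often looked something like the rules below. If you find lines like these in your robots.txt, remove them so Google can render your pages fully:

Disallow: /wp-includes/
Disallow: /wp-content/themes/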

Mistake 4: Confusing robots.txt blocking with page removal. Blocking a page in robots.txt does not remove it from Google’s index if it’s already been indexed. To remove a page from search results, you need a noindex meta tag (and the page must remain crawlable so Google can see that tag) or a manual removal request through Google Search Console.

Mistake 5: Not having a robots.txt file at all. While not having one won’t break anything—crawlers will just visit everything—it’s a missed opportunity to include your sitemap URL and to organize your crawl instructions clearly. Every professional website should have one.

Your Action Plan

A Simple robots.txt and Meta Tags Checklist

Here’s a practical checklist you can work through today, even if you’re not technical:

  1. Check your robots.txt file. Visit yourdomain.com/robots.txt. Make sure it doesn’t contain Disallow: / (which blocks everything). Make sure your sitemap URL is listed.
  2. Do a site: search on Google. Search for site:yourdomain.com and make sure your important pages appear. If your homepage or main service pages are missing, investigate immediately.
  3. Check your key pages for noindex tags. View the source of your homepage, services page, and any other important pages. Search for “noindex.” If you find one where it shouldn’t be, remove it or ask your developer to remove it.
  4. Set up Google Search Console. If you haven’t already, get Google Search Console connected to your website. Use the URL Inspection tool to check your most important pages.
  5. Decide on your AI crawler policy. Review which AI crawlers you want to allow or block. For most small businesses, allowing them is the better choice for now.
  6. Add noindex tags to pages that shouldn’t rank. Thank-you pages, login pages, internal search results, old draft content—these should all have noindex tags.
  7. Schedule a quarterly check. Set a calendar reminder to review your robots.txt and run a site: search every three months. This catches problems before they cost you months of visibility.

Marco makes this part of his quarterly website review. Every three months, he checks his robots.txt, runs a site: search, and inspects his top five pages in Google Search Console. It takes about fifteen minutes, and it has caught a problem once already—a plugin update that silently added noindex tags to his property listing pages.

Common Questions

Frequently Asked Questions

What is a robots.txt file and does my website need one?

A robots.txt file is a small text file that lives at your website’s root (yoursite.com/robots.txt). It gives instructions to search engine crawlers about which parts of your site they are allowed or not allowed to visit. Most websites benefit from having one, even if it simply confirms that everything is open for crawling. Without one, crawlers will attempt to access every page, which is usually fine but can waste resources on pages you don’t need indexed.

What is the difference between noindex and nofollow?

Noindex tells search engines not to include a specific page in their search results, while nofollow tells them not to follow or pass ranking value through the links on that page. You can use them separately or together. For example, you might noindex a thank-you page after a form submission but still allow Google to follow links on it.

How do I check if I accidentally blocked Google from my website?

The easiest way is to search Google for site:yourdomain.com. If no pages appear or key pages are missing, something may be blocking Google. You can also use Google Search Console’s URL Inspection tool to test a specific page, or visit yourdomain.com/robots.txt directly to see if a Disallow rule is hiding content. Check your page source code for meta robots noindex tags as well.

Can robots.txt completely prevent a page from appearing in Google?

Not always. Robots.txt tells crawlers not to visit a page, but if other websites link to that blocked page, Google may still show it in search results with limited information. To truly prevent a page from appearing in search results, use a meta robots noindex tag on the page itself. That tag directly tells Google to exclude the page from its index.

Should I block AI crawlers like ChatGPT and Perplexity using robots.txt?

It depends on your goals. If you want to prevent AI platforms from training on your content, you can add Disallow rules for crawlers like GPTBot (ChatGPT), PerplexityBot, and others. However, blocking AI crawlers may also reduce your visibility in AI-powered search results and answer engines. Many small businesses benefit from appearing in AI responses, so consider the trade-offs before blocking.

Where do I put a meta robots tag on my page?

Meta robots tags go inside the <head> section of your HTML page, between the opening <head> tag and the closing </head> tag. They look like: <meta name="robots" content="noindex, nofollow">. If you use WordPress, plugins like Yoast SEO or Rank Math let you set these without touching code. The tag must be on the specific page you want to control—it does not work site-wide like robots.txt.

Will blocking pages with robots.txt or noindex hurt my SEO?

Blocking the right pages actually helps your SEO by keeping low-value pages out of Google’s index. Pages like admin panels, duplicate content, internal search results, and staging sites should be blocked. Problems only arise when you accidentally block important pages like your homepage, service pages, or blog posts. Always double-check your robots.txt and meta tags to make sure your key money pages remain visible.

Reference

Glossary

robots.txt
A plain text file at the root of your website that gives instructions to search engine crawlers about which pages or sections they can and cannot visit.
Meta Robots Tag
A small piece of HTML code placed in a page’s <head> section that tells search engines what to do with that specific page—for example, whether to index it or follow its links.
Noindex
A directive that tells search engines not to include a page in their search results. The page still exists on your website but won’t appear when people search.
Nofollow
A directive that tells search engines not to follow or pass ranking value through the links on a page.
Crawler (Bot/Spider)
An automated program sent by search engines or AI platforms to visit websites, read their content, and report back. Googlebot, Bingbot, and GPTBot are examples.
User-agent
The name a crawler uses to identify itself. In robots.txt, you use the user-agent name to write rules that apply to specific crawlers.
Disallow
A command in robots.txt that tells crawlers they are not permitted to visit a specific page or folder on your website.
X-Robots-Tag
A noindex or nofollow instruction sent through your server’s HTTP headers instead of through HTML meta tags. It works the same way but is set at the server level.

The Bottom Line

Take Control of What Search Engines See

Your robots.txt file and meta robots tags are the gatekeepers of your website. Used correctly, they help search engines and AI platforms focus on your best content while keeping private, outdated, or low-value pages out of the spotlight. Used incorrectly—or accidentally left in the wrong state—they can make your entire business invisible online.

Take fifteen minutes this week to check your robots.txt file and view the source code of your most important pages. That small investment of time can prevent the kind of costly mistakes that Fernando, Patricia, and Dalisay experienced. And in the next article, we’ll cover redirects—what happens when you move or delete a page and need to make sure both visitors and search engines end up in the right place.

Need Expert Help?

Not Sure If Google Can See Your Website?

A technical SEO audit will check your robots.txt file, meta robots tags, server headers, and indexing status to make sure nothing is accidentally hiding your business from Google and AI search platforms. If something’s wrong, you’ll know exactly what to fix and how.