
Robots.txt


Name Trader (formerly @stub), Top Member
I'm struggling to find a good compromise for my robots.txt file. Should I allow only the Google, Yahoo, and Bing bots and exclude everything else? What are the upsides and downsides of doing that? What are the names of the Google/Yahoo/Bing bots? I don't want Google scraping any of my images. A sample of a strong robots.txt file would be useful.

Also, what should I do about all the bots/scrapers that ignore robots.txt? I have been putting entries into my .htaccess file to exclude them, but I lost track of the website that listed all these bots. Any help with this would be appreciated.
 
There's not much point in using robots.txt to exclude bots, as the ones you don't want will just ignore it. Also, nosy people have been known to scrape robots.txt files to find out what you're trying to block and don't want seen. It's best to use .htaccess for those. But you can keep Google out of your images in robots.txt by disallowing the folder they are in.
The basic bots are Googlebot and Bingbot, but each company has several. Here are Google's: https://support.google.com/webmasters/answer/1061943?hl=en
(If you run AdSense, don't block Mediapartners-Google!)
Here are Bing's: http://www.bing.com/webmaster/help/which-crawlers-does-bing-use-8c184ec0
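
To illustrate the point about disallowing the image folder, here is a minimal sketch that checks such a rule with Python's standard `urllib.robotparser`. The folder name `/images/`, the file names, and the helper `build_parser` are assumptions made for the example. Python's parser is only one implementation, so a real crawler may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out of one image folder only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /images/
"""


def build_parser(text):
    """Parse robots.txt text held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A path inside the disallowed folder versus an ordinary page.
image_blocked = not parser.can_fetch("Googlebot", "/images/logo.png")
page_allowed = parser.can_fetch("Googlebot", "/index.html")

print("image blocked:", image_blocked)
print("page allowed:", page_allowed)
```

This only confirms what a well-behaved crawler is asked to do. As noted above, a bot that ignores robots.txt is unaffected.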
 
Thanks for the links, enlytend. Do you have anything similar for Yahoo, or do they use Bingbot these days? It's my intention to disallow all bots and then allow these two bots only. Can the bots be written on one line, or do they need separate lines? For example:

User-agent: *
Disallow: /

User-agent: Googlebot, Bingbot
Disallow: /images/

Will this work as intended? If not, how should it be written?

Do you have any links to any sites which can update me about all the evil bots/scrapers to exclude in .htaccess?
 
Thanks for the links, enlytend. Do you have anything similar for Yahoo, or do they use Bingbot these days? It's my intention to disallow all bots and then allow these two bots only. Can the bots be written on one line, or do they need separate lines? For example:

User-agent: *
Disallow: /

User-agent: Googlebot, Bingbot
Disallow: /images/

Will this work as intended? If not, how should it be written?

Yahoo is using Bing now. Legit businesses that have crawlers generally have an information page on their crawler posted somewhere on their site.

You can't combine user agents on one comma-separated line. Give each agent its own section (or stack several User-agent lines above a shared set of rules):

User-agent: Googlebot
Disallow: /images/

User-agent: Bingbot
Disallow: /images/
Disallow: /something-else/
Disallow: /anotherthing/
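
As a rough check of the difference, the sketch below feeds both forms to Python's standard `urllib.robotparser`. The helper `allowed` and the test paths are assumptions made for the example, and other parsers may treat the comma-separated line differently. With this parser, the combined name is not expected to match either bot, so Googlebot falls through to the catch-all group.

```python
from urllib.robotparser import RobotFileParser

# The comma-separated form asked about in the question.
COMBINED = """\
User-agent: *
Disallow: /

User-agent: Googlebot, Bingbot
Disallow: /images/
"""

# One section per user agent, as recommended above.
SEPARATE = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /images/

User-agent: Bingbot
Disallow: /images/
"""


def allowed(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


combined_result = allowed(COMBINED, "Googlebot", "/index.html")
separate_result = allowed(SEPARATE, "Googlebot", "/index.html")

print("combined form lets Googlebot in:", combined_result)
print("separate form lets Googlebot in:", separate_result)
```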

Do you have any links to any sites which can update me about all the evil bots/scrapers to exclude in .htaccess?

There's ongoing discussion over at WebmasterWorld's "Search Engine Spider and User Agent Identification" forum. With a quick search I found a 2013 .htaccess excerpt here: http://perishablepress.com/2013-user-agent-blacklist/. Just be careful when you use someone else's list, and be aware that something they are blocking may not be something YOU want to block.
 
A quick question about something that always confuses me: should these "allow" sections come before or after the "disallow all" section in robots.txt, or doesn't it matter?

I want to block everything apart from these two bots; that includes Googlebot-Image and all scrapers. The list I have in my .htaccess file is probably a couple of years old already, so I feel it needs updating. Thanks for the link. I'll check it out.
 
This is what I ended up with as my standard robots.txt:

User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /

Of course, this can be amended depending on the domain and its usage.
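
For anyone who wants to sanity-check a file like this, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name `SomeOtherBot`, the path `/page.html`, and the helper `verdicts` are made up for the example. It also parses a reordered copy with the catch-all group first, since the order question came up earlier. Under the Robots Exclusion Protocol a crawler is supposed to pick the group that names it and fall back to `*` only when none does, so group order should not change the outcome. Individual crawlers may still differ.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the post above.
FINAL = """\
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /
"""

# Same groups, with the catch-all group moved to the top.
REORDERED = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:
"""


def verdicts(robots_txt, agents, path="/page.html"):
    """Map each agent name to whether it may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


# "SomeOtherBot" is a hypothetical crawler that has no group of its own.
AGENTS = ["Googlebot", "Bingbot", "SomeOtherBot"]

final_verdicts = verdicts(FINAL, AGENTS)
reordered_verdicts = verdicts(REORDERED, AGENTS)

print(final_verdicts)
print("same result when reordered:", final_verdicts == reordered_verdicts)
```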
 
It clearly tells each bot what it is allowed and not allowed to crawl. A bot reads the robots.txt file before it crawls the website.
 
Yes. But a generic robots.txt file like the one I posted above doesn't give away any clues about what you're hiding. A rogue bot is going to crawl anyway, whatever your robots.txt file says. You need to deal with rogue bots in .htaccess or by some other method. I haven't got a good handle on those rogue bots yet.
 