Tech question(s) about bots?

Chris2412

I am a n00b developer and I would like to talk about bots and how they crawl a web page.

I tried searching for the keyword “bot” on the forum but got nil, just a bunch of random results. I’m sure there is a thread covering this, so a link will do just fine.

So, Google has “bots” that crawl your pages. They look for keywords, phrases, etc. for cataloging purposes. Which is good, because you want your website cataloged in their search engine.

But there are other kinds of bots, too. Yes?

Some of these bots are evil pawns sent out to do... what, exactly?

Eat your bandwidth?

I read some simple PHP code a year or two ago that basically makes bots sleep (or time out).

I am not even close to HTML 5 yet so perhaps I am getting way ahead of myself.

There’s no harm in asking. Perhaps I can bump this thread as I get more knowledge and have additional questions regarding coding.
 
•••
The views expressed on this page by users and staff are their own, not those of NamePros.
If you're trying to be a designer, then you definitely don't need to know this!
 
•••
I could ask you, "why not"... but then I would get scolded again.
 
•••
A designer would typically learn front end technologies. Javascript and CSS mostly.
 
•••
Knowledge is power, I am a bit of a nut.

When I took courses in college I was too busy staring at the brunette two terminals down to actually learn anything (I'm not really a blonde gal type of guy).

The whole education system is bollocks, anyways. If you want something, you go out and learn it for yourself.

I know nothing about bots and I wanted to talk about it. No harm intended.

I will never pretend to be something I'm not.

Worst case scenario you all make fun of me, so what? At least I am not purchasing WP templates like most of these people (nothing wrong with that, just saying).

I want to create something for myself. I want to be an architect.

It's a bit daunting starting from a completely blank page. But, to know I can create a decent layout from scratch, well that's at least getting the ball rolling.

I want to continue to grow. If that means making a fool of myself, then so be it.

That being said... after all this, I still have very little knowledge about bots.
 
•••
Ok, either you're just really weird (no offense)... or you are actually trying to do something malicious with bots, because otherwise your obsession makes zero sense.

Seriously, it's not a thing you should even think about.
 
•••
I feel like we are beating a dead horse, Beezy.

I do appreciate your time and efforts.

I certainly am not trying to program a malicious bot. Quite the contrary. I wanted to learn about bots, their purpose, if/how/when to make them sleep, etc.

I'm currently learning spry elements. So yeah, I'm a n00b. Go easy.
 
•••
Rest assured no one in this forum will laugh at your ignorance; we all have our shortcomings.

You wanna be a developer, start reading. You are worried about bots, just pick strong passwords for now. Down the road you'll learn how to block certain bots. A bot is not malicious by default.
 
•••
Chris2412 said:
If NP forum is not a community to help, advise, and educate in web development... then you are correct: I am in the wrong place.
You're posting in the right place and it's absolutely fine to ask :). I'm just saying that we don't have many posts on the subject because the overwhelming interest on NP is in buying and selling domains :).

iowadawg said:
Worse offender, once they find your blog/site?
Baidu!

Baidu is the #1 search engine in China and the #2 search engine worldwide (and is expanding its reach through various internet acquisitions), but it does spider aggressively. It respects robots.txt, so if China is not your audience you can block it there. I think you still have to register with their webmaster tools to change the crawl frequency. Here's their FAQ: http://help.baidu.com/question?prod_en=master&class=Baiduspider&id=1000973

And just like with Googlebot, there are rogue bots who use the Baidu user agent to try to get past your malicious bot blocking strategy.
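
A common way to check whether a visitor claiming to be Googlebot or Baiduspider is genuine is a reverse DNS lookup on its IP, followed by a forward lookup to confirm the hostname points back to the same IP. Here's a minimal Python sketch of that idea. The function name, the suffix list, and the injectable lookup parameters are my own choices for illustration, so check each search engine's documentation for the hostnames it actually uses.

```python
import socket

# Hostname suffixes assumed for illustration; confirm them against the
# search engine's own documentation before relying on this list.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")


def is_genuine_crawler(ip, allowed_suffixes,
                       reverse_lookup=socket.gethostbyaddr,
                       forward_lookup=socket.gethostbyname_ex):
    """Return True if `ip` reverse-resolves to an allowed hostname
    and that hostname resolves forward to the same IP."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no reverse DNS record for this IP
    if not hostname.lower().endswith(tuple(allowed_suffixes)):
        return False  # hostname is outside the crawler's domain
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False  # hostname does not resolve
    # The forward lookup must point back at the original IP.
    return ip in addresses
```

You would call it with the visitor's IP, e.g. `is_genuine_crawler(visitor_ip, GOOGLEBOT_SUFFIXES)`. The lookup functions are parameters only so the check can be tried out without real DNS queries.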

Why someone would send them to eat bandwidth?
Most of them don't do it deliberately. Some might, as in a denial-of-service (DoS) attack.

Why people are engineering bots instead of... basically anything else they could be doing?

Because it scales tasks which would be impossible to do manually.

Most of what hits your site is not a new bot that somebody constructed; it's software they're running to accomplish a task. Malicious reasons include spamming comments to get backlinks to their site, scraping content so they can use your content without writing their own, harvesting email addresses so they can send spam, scanning for vulnerabilities so they can get confidential data like passwords, using your site to host illegal downloads, or installing their bots on your server so they have a bigger network of server power performing their tasks ...

Why some bots are "good" and others are evil?
Software isn't evil; intent is up to the party who runs it.
 
•••
If you use Cloudflare on your site, they will do a great job of keeping offending bots away.

Here's another good resource: http://www.distilnetworks.com/

Features:
  • THEFT BOTS: Block bots from siphoning away your data & revenue
  • FORM FRAUD: Submitting fake forms. Your forms are being flooded with fake information and clogging your database with bad leads.
  • CLICK FRAUD: Clicking on paid ads. Your daily ad budget is maxing out because of bots, not potential buyers.
  • COMMENT SPAM: Interrupting your users. Spend your time moderating your actual visitors, not bots
 
•••
Baidu uses so many different Chinese IPs to send its bot army out that using robots.txt would mean getting all those IPs and listing them.
And the list is LONG.
I have not seen this myself, but others are now saying Baidu is using IPs that are not Chinese to avoid being blocked.
They are aggressive and are worse than Google!
 
•••
No, using robots.txt means listing the user agent, not the IPs. I think there are about 3 user agents.
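
To show what blocking by user agent looks like, here's a small Python sketch that feeds an example robots.txt to the standard library's `urllib.robotparser` and asks which crawlers may fetch a page. The robots.txt content and the `example.com` URL are made up for illustration, and this only affects bots that choose to honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block Baidu's crawler, allow everyone else.
# An empty "Disallow:" line means nothing is disallowed.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, url="https://example.com/some-page"):
    """Return True if this robots.txt lets `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)


print("Baiduspider allowed:", allowed("Baiduspider"))
print("Googlebot allowed:", allowed("Googlebot"))
```

No IP addresses appear anywhere in the file; the rules are keyed on the name the crawler announces.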

Someone probably has an .htaccess list of the IPs if you want to go that route. There are lists to block entire countries, so I'm sure someone has a Baidu list.

Google is easier to control.
 
•••
I use a WordPress plugin to block ALL of China!
End of story.
 
•••
Sorry, I've been meaning to reply to this thread for a while, but I've had a lot of work to do and ended up getting a nasty cold earlier this week.

This is certainly a great place to ask about bots, because bots are an important part of SEO, and SEO is a big part of domaining. Plus, many of us have web development experience.

Also, knowledge of bots is definitely relevant to web designers--often more so to them than anyone else. Modern search engine bots, like Googlebot, care quite a bit about design and layout, and will penalize a website if they dislike the design. It's also important for designers to understand how bots interpret content within different types of layouts, and how to make the key content stand out.

I'll write up some information about different types of bots and post it here shortly.
 
•••
Bots are computerized visitors to your website, typically with a very specific purpose. They range from simple to very complex.

There are many varieties of bots, but the most important tend to be search engine crawlers. Crawlers roam your site without interacting, much like a human clicking random links. Their purpose is to gather data. Search engine crawlers mostly gather data about your content and how it's relevant to users. Googlebot and Bingbot are probably the most relevant to you. For the most part, Bingbot copies Googlebot. There are a lot of myths about Googlebot, and even Google often states misleading information.

The practice of designing a website to appeal to search engine crawlers is called SEO: Search Engine Optimization. Googlebot is pretty smart, and does a good job of seeing a website the same way a user would. It likes to see modern coding and design, a simple layout with a focus on content, emphasis on security, and compatibility with modern browsers. If the text on a website sounds like a sales pitch, Googlebot won't be happy. It also tries to identify what purpose a website serves: if the site doesn't seem to offer any particular service or contribution to the internet, it won't rank the site as high.

Googlebot focuses particularly on "above-the-fold" content: what's visible when you first load a page, without scrolling. There should be useful content up there, and it should load quickly. Googlebot dislikes when the entire page has to load before the above-the-fold area is legible. If a page takes a long time to load, Googlebot will penalize the site. If the site doesn't load at all--like with the idea you proposed--then the website will be removed from search results on Google. Generally, Bingbot tries to do all of the same things.

There are a lot of crawlers that are used for research purposes. Many are run by non-profit organizations that shape the future of the internet. Others aren't quite as innocent, but for the most part they're still harmless. Almost all crawlers identify themselves with a unique User-Agent header and will listen to any restrictions you describe in your robots.txt file. As an example, the Wayback Machine attempts to crawl every website and archive the history of the internet. You can view historical snapshots of websites using their free service.
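
As a rough illustration of how a site can tell which crawler a request claims to come from, here's a minimal Python sketch that looks for well-known tokens in the User-Agent header. The token table and function name are my own and far from exhaustive, and since any client can send any User-Agent it likes, treat the result as a claim rather than proof.

```python
# Tokens that well-known crawlers include in their User-Agent header.
# This short table is illustrative, not exhaustive.
KNOWN_CRAWLER_TOKENS = {
    "googlebot": "Googlebot",
    "bingbot": "Bingbot",
    "baiduspider": "Baiduspider",
}


def identify_crawler(user_agent):
    """Return the crawler name claimed by a User-Agent string, or None."""
    ua = (user_agent or "").lower()
    for token, name in KNOWN_CRAWLER_TOKENS.items():
        if token in ua:
            return name
    return None


ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(identify_crawler(ua))
```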

Scrapers are simple bots used to pull structured information off specific pages of websites. Usually the intent is to steal information like e-mail addresses or telephone numbers that can be used for malicious purposes. Sometimes websites will use scrapers to copy content from other websites.

More info to follow when I have more time.
 
•••
Aha! Was hoping Paul would weigh in on this topic ...

Taking exception to one comment though:
If the text on a website sounds like a sales pitch, Googlebot won't be happy.

Googlebot doesn't care if you have a sales pitch as long as you have a genuine product or service to sell. Car dealerships can rank very well for car purchase queries and they aren't exactly fine examples of quality content (or, in most cases, good code ... or even unique content). What they do have is a real product (they sell the cars, they're not sending you to an Adsense ad for cars) and they don't have other ads plastered all over the page :).

End of detour - back to your regularly-scheduled bot talk ;)
 
•••
If you use Cloudflare on your site, they will do a great job of keeping offending bots away (as well as the other benefits).

I thought this too.....

Although in the Bitcoin world, Cloudflare sites have notoriously been attacked and taken down by DDoS attacks and extorted for bitcoin, even though I thought that's what Cloudflare specialized in (DDoS protection).

I'm unsure why Cloudflare websites get targeted by DDoS extortionists. I'm not sure if it's Cloudflare they are attracted to or if it's just that the Bitcoin websites they choose to attack typically just "happen to use" Cloudflare.

You keep talking about bots, and I'm growing curious whether you are actually talking about DDoS attacks: bots as in zombie farms that are used to carry out the attacks sent to your site. You refer to "eating your bandwidth" and bots, and this is where my head goes with these questions.

If I'm hitting the nail on the head with what you are referring to, the purpose of these "bots" or attacks is typically just to be annoying: to disrupt your service, to disrupt you.

Like I said above, this method has been used for extortion: a DDoS is sent to a website, along with an email requesting a bitcoin payment. The emails usually say something like "We found a vulnerability in your website, please pay us xxx amount of btc to xxxxxxxxxxxxxxxxx wallet to get the ddos attack to stop".

Typically, the person writing the letter merely bought the DDoS attack online from some source, usually a group that runs a bot farm for this type of service: you pay them, they attack, you send the email and extort.

If this is not referring to anything you were referencing or asking about, then I apologize for misunderstanding what you might be referring to.

Cheers! And I hope this helps a little bit!
z3
 
•••
Oh, I wanted to add something worth mentioning about my experience with bots on websites other than DDoS. I've seen it mostly when I build forums, like SMF or WordPress sites. Without my even advertising the site, tons of bots find it and register with spam addresses, making accounts and posting spammy links to the forum or WordPress site. There are ways to keep them out; I think there are Ajax scripts that help keep them out.
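
One widely used trick against automated sign-ups is a honeypot field: a form input hidden from human visitors with CSS, which simple bots tend to fill in anyway. This is a different technique from whatever script is being recalled above, and the field name `website_url` and the function name are hypothetical, so take it as a rough Python sketch of the server-side check only.

```python
# Hypothetical name of a form field that is hidden from humans with CSS.
HONEYPOT_FIELD = "website_url"


def looks_like_bot(form_data):
    """Flag a submission if the hidden honeypot field was filled in.

    `form_data` is assumed to be a dict of submitted field names to strings.
    """
    return bool(form_data.get(HONEYPOT_FIELD, "").strip())


human = {"username": "chris", "email": "chris@example.com", "website_url": ""}
bot = {"username": "xx", "email": "spam@example.com", "website_url": "http://spam.example"}
print(looks_like_bot(human), looks_like_bot(bot))
```

A honeypot only stops unsophisticated bots, so forum software usually combines it with other measures such as CAPTCHAs or email confirmation.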
 
•••
Aha! Was hoping Paul would weigh in on this topic ...

Taking exception to one comment though:


Googlebot doesn't care if you have a sales pitch as long as you have a genuine product or service to sell. Car dealerships can rank very well for car purchase queries and they aren't exactly fine examples of quality content (or, in most cases, good code ... or even unique content). What they do have is a real product (they sell the cars, they're not sending you to an Adsense ad for cars) and they don't have other ads plastered all over the page :).

End of detour - back to your regularly-scheduled bot talk ;)

Well, everything I said was an oversimplification. There are always exceptions, and context is important. The idea is that you shouldn't be creating a website that is blatantly promotional without really contributing anything to the internet. Generally, Googlebot dislikes promotional content, but if everyone in your industry is doing the same thing, it doesn't matter all that much.

Promotional websites that rank higher than useful websites have typically tricked Google in some way. For example, exact match domains are very effective, and can easily be used to outrank competitors or more legitimate websites. Google maintains that exact match domains are not effective, but that's a load of horse dung.

And, of course, if your only purpose is to sell something, then Google's going to expect you to sound somewhat promotional.
 
•••
Googlebot dislikes promotional content, but if everyone in your industry is doing the same thing, it doesn't matter all that much.

Promotional websites that rank higher than useful websites have typically tricked Google in some way.

Ecommerce sites aren't "useful" and are "tricking" Google in some way??

It's about query intent. Transactional, Informational, Navigational. Ecomm websites SHOULD rank above informational sites for transactional queries.

If I want to buy a pair of shoes I don't need Wikipedia telling me what a shoe is. If I want pizza, I don't need a scholarly treatise on pizza. If I want to buy a BMW I don't need sites telling me what a car is or an extensive history of the BMW company ... show me some dealerships so I can see what they have in stock. Transactional queries, all :).

--- a-a-a-a-nd, back to bot talk.
 
•••
Paul Buonopane, please marry me.

Ok, either you're just really weird (no offense)... or you are actually trying to do something malicious with bots, because otherwise your obsession makes zero sense.

Seriously, it's not a thing you should even think about.

Tech people ARE weird. We obsess over this stuff all the time because we find it fascinating and we're so eager to learn about it.

His question makes absolute sense to me. When I first started web development one of my client's websites was hacked. I set out to learn what hacking methods were used so I could prevent it in the future. Learning how hackers hack taught me how to secure my websites. Chris2412 simply wants to learn about bots so he can best deal with them.
 