Website Scrapers?

This question is for people with well-established domains-for-sale websites. It's a serious question.

How do you stop the scrapers? According to my host's logs, one domain on my website had 139K visitors from one URL in the space of 2 1/2 weeks. It's obvious in this case. But how do you tell the scrapers from the real people, so you can stop the scrapers and let the real people through to your website? There must be some solution other than letting the scrapers run wild. Any free or paid solutions considered.
 
•••
The views expressed on this page by users and staff are their own, not those of NamePros.
•••
It depends on what the website is. My highest-traffic websites are forums. I have filters in place that stop spammers from registering, so they never get to the point of being able to post anything.

I also have blog websites, some medium traffic and some low, and I have the same kind of filters in place there to keep spammers from posting anything.

Otherwise, I don't care who stops by, spammer or not.
 
•••
Would you mind explaining further? How does that help?

Hi, it's best to read what they say for themselves. By default they filter out a lot of bot traffic, and they can do more:
https://www.cloudflare.com/dns/dns-firewall/

Not sure what you mean by scrapers - collectors/copiers of data? Just repeat bot visitors consuming bandwidth & maybe slowing you down or costing you money?

If for example using WP with Wordfence, that blocks a lot of known bad IPs. And usually in your hosting you can create a list of blocked IPs. Others have created honeypots only bots would enter, and when they do that IP is banned.
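As a rough illustration of the honeypot idea, here is a minimal sketch. It assumes a trap URL that is hidden from human visitors and disallowed in robots.txt, so only a misbehaving bot would request it. The trap paths, the in-memory ban set, and the `handle_request` helper are all hypothetical names made up for the example, not part of Wordfence or any hosting panel.

```python
# Hypothetical trap paths: linked invisibly in the page and disallowed in
# robots.txt, so a human or a well-behaved crawler should never request them.
TRAP_PATHS = {"/do-not-follow/", "/trap.html"}

# In-memory ban list for the sketch; a real site would persist this somewhere.
banned_ips = set()


def handle_request(ip: str, path: str) -> int:
    """Return an HTTP status code, banning any IP that requests a trap path."""
    if ip in banned_ips:
        return 403  # already banned: refuse everything
    if path in TRAP_PATHS:
        banned_ips.add(ip)  # the visitor walked into the honeypot
        return 403
    return 200  # normal visitor, serve the page
```

Once an address has requested a trap path, every later request from it gets a 403, while other visitors are unaffected.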
 
•••
cloudflare
don't waste your time
I tried a lot of ways
: cloudflare
 
•••
I get A LOT of "scraper-like" activity: IP addresses going to the same pages over and over again. A single IP might hit one page anywhere from 10 times in 2 weeks up to 3/4M times in 2 weeks, and there are maybe 500+ different IP addresses in total (or more). They don't appear to be actually doing any scraping, though. But they sure eat up a lot of bandwidth; 50GB is not unusual. All of them are basically hitting this one page (sometimes from a different domain).

What would be the best method of attack? Rename the old page and put a 404 on the old URL? Employ Cloudflare and ban all their IP addresses? I'm getting a lot of grumbling from my host that the sheer volume of blacklisted IPs is slowing access to my other websites to a crawl.

I'm kind of overwhelmed by the problem, actually. I've fixed a lot of stuff related to this, but it just keeps getting worse. These pages have very little content, so there would be no need to visit a page more than once. They are bare static pages. It just doesn't make any sense.

@frank-germany. I suppose your recommendation is a bit like @carob's. It needs some specifics.
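One way to get a handle on this kind of repeat traffic is to count hits per IP and page from the access log. The sketch below assumes the host's log is in the usual Common/Combined Log Format (client IP first, request line in quotes); the regular expression, function names, and threshold are illustrative assumptions, so adjust them to whatever your host actually writes.

```python
import re
from collections import Counter

# Matches the start of a Common/Combined Log Format line:
# client IP, two ignored fields, [timestamp], then "METHOD /path ...".
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (\S+)')


def count_hits(lines):
    """Count requests per (ip, path) pair, skipping lines that do not parse."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            hits[(match.group(1), match.group(2))] += 1
    return hits


def heavy_hitters(hits, threshold):
    """Return (ip, path, count) tuples at or above threshold, busiest first."""
    return [(ip, path, n) for (ip, path), n in hits.most_common() if n >= threshold]
```

Reading the log with `open(...)` and passing the file object to `count_hits` is enough; anything `heavy_hitters` returns for a bare static page is a candidate for a block or challenge rule.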
 
•••
Details available from Cloudflare website.

Another thing you can do is use robots.txt to set a crawl delay so - if they obey - they space out their requests.
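For what the crawl delay looks like in practice, here is a small sketch. `Crawl-delay` is a non-standard robots.txt directive that only some crawlers honor, so this helps with polite bots and does nothing against ones that ignore robots.txt. The 10-second value, the `/private/` path, and the `ExampleBot` user agent are assumptions chosen for the example; Python's standard `urllib.robotparser` is used only to show how a compliant crawler would read the file.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: ask every crawler to wait 10 seconds between requests
# and to stay out of /private/. Both values are placeholders.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(text):
    """Parse robots.txt text the way a compliant crawler would."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
delay = rules.crawl_delay("ExampleBot")  # seconds the bot is asked to wait
may_fetch = rules.can_fetch("ExampleBot", "/private/page.html")
```

A bot that respects the file would space its requests by `delay` seconds and skip anything for which `can_fetch` is false.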
 
•••
If it makes it to Archive.org, there is just about no way to stop it, I don't think. From the archive I can pull just about everything, from any time (if recorded). I'm not exactly sure how it's all stored, but it seems you can even pull what you wouldn't find by searching it. I'm not an expert on this, just passing on something I found very recently.
 
•••
@carob - The only thing I found out at the Cloudflare website was that they would serve all these requests, probably through their server cache, or directly from my server if the page is not in the cache. As I see it, the problem is not resolved. The scraping goes on unabated, either through their cache or directly via my website. There is no penalty that makes them stop what they are doing. This really is the crux of my beef with these people. Without punishment they will never learn good manners. Hence my strong feelings.
 
•••
@carob they would serve all these requests probably thru their server cache

Precisely, and that means far fewer requests reach your website.

And they do block what they regard as malicious traffic.
 
•••
Go to Firewall; under Access Rules you can ban or challenge visitors by country or IP.

 
•••
I think what Frank-Germany is saying is do not waste your time with Cloudflare. Anyway, doesn't cloudflare just slow things down?
 
•••
just do it
if it's recommended

you can always rewind

why ask otherwise?
 
•••
I think what Frank-Germany is saying is do not waste your time with Cloudflare. Anyway, doesn't cloudflare just slow things down?

I said
use cloudflare : no hassle
 
•••
Thank you @techpr. This is what I was trying to get at: where to ban this stuff in Cloudflare. I haven't tried it yet, but it looks like a much better idea/implementation than blocking these in the host's firewall, which my host seems to be saying is not a good idea because it will slow things down a lot.
 
•••
OK. I've set up my firewall rules in Cloudflare (which I've only been using for a couple of days, since you guys have been solidly recommending it). So I'm a noob. I've still got some tinkering to do based on the last report from my host, but I'm set to go. I told my host to delete all the firewall rules I had asked them to add. We'll see how things operate from here.

PS: I'm liking CloudFlare very much indeed.
 
•••
I think what Frank-Germany is saying is do not waste your time with Cloudflare. Anyway, doesn't cloudflare just slow things down?

It's principally a caching (CDN) system. It speeds things up for the visitor.
 
•••
I'm still analyzing my host's records, and I'm not finished yet. But I'm finding that a lot of these connections are coming from Cloudflare IPs. This was before I started with Cloudflare. How can that be? I thought they were supposed to be protecting against inbound activity. How does someone use a Cloudflare IP for outbound scraping of a website?
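To double-check whether the addresses in the log really fall inside Cloudflare's network, a range check like the one below can help. This is only a sketch: the CIDR blocks listed are an illustrative subset recalled from Cloudflare's published IPv4 list and are an assumption here, so refresh them from https://www.cloudflare.com/ips/ before relying on the result. The `is_cloudflare` helper is a hypothetical name.

```python
import ipaddress

# ASSUMPTION: an illustrative subset of Cloudflare's published IPv4 ranges.
# The authoritative, current list is at https://www.cloudflare.com/ips/
CLOUDFLARE_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "173.245.48.0/20",
        "104.16.0.0/13",
        "172.64.0.0/13",
        "162.158.0.0/15",
    )
]


def is_cloudflare(ip: str) -> bool:
    """Return True if ip falls inside one of the listed Cloudflare ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in CLOUDFLARE_RANGES)
```

Running each logged address through `is_cloudflare` separates requests that arrived via Cloudflare's network from those that merely look similar.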
 
•••
the visitors will come from cloudflare
if you need to see the origin country they supply it
$country = $_SERVER["HTTP_CF_IPCOUNTRY"];

there may be a way to get the original ip as well if you need it

if they don't
you need to redirect the original traffic
-as I do it -
and store the ip
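On getting the original IP: when a site is proxied through Cloudflare, the visitor's address is passed along in the `CF-Connecting-IP` request header and the country code in `CF-IPCountry`. The sketch below shows the same lookup in Python for a WSGI-style `environ` dictionary, where header names are upper-cased and hyphens become underscores. The `visitor_info` helper is a hypothetical name, and the fallback to `REMOTE_ADDR` is an assumption about how you would want direct traffic handled.

```python
def visitor_info(environ):
    """Return (ip, country) for a request, preferring Cloudflare's headers.

    environ is a WSGI-style mapping. Without the Cloudflare headers the
    function falls back to the connecting address and an empty country.
    """
    ip = environ.get("HTTP_CF_CONNECTING_IP") or environ.get("REMOTE_ADDR", "")
    country = environ.get("HTTP_CF_IPCOUNTRY", "")
    return ip, country
```

These headers are only trustworthy when the connection itself comes from Cloudflare, since a client talking directly to the server can send any header it likes.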
 
•••
@frank-germany - Thanks. I'm a tad tired. It's been a long day. I'll re-read what you said in the morning.
 