Website Scrapers?

This question is for people with well-established domains-for-sale websites. It's a serious question.

How do you stop the scrapers? According to my host's logs, I have 1 domain on my website with 139K visits from 1 URL in the space of 2 1/2 weeks. It's obvious in this case. But how do you tell the scrapers from the real people, so you can stop the scrapers and let the real people through to your website? There must be some solution for this other than letting the scrapers run wild. Any free or paid solutions considered.
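One rough way to spot the obvious cases is to count requests per IP in the access log and flag anything above a threshold. The sketch below assumes a common/combined log format where the client IP is the first field; the sample lines, the IP addresses and the threshold are all made up for illustration, and a distributed scraper would not show up this way.

```python
from collections import Counter

def requests_per_ip(log_lines):
    """Count requests per client IP, assuming the IP is the first whitespace-separated field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip empty lines
            counts[fields[0]] += 1
    return counts

def suspected_scrapers(log_lines, threshold):
    """Return {ip: request_count} for every IP with at least `threshold` requests."""
    counts = requests_per_ip(log_lines)
    return {ip: n for ip, n in counts.items() if n >= threshold}

# Hypothetical sample data (documentation-range IPs, invented timestamps).
sample_log = [
    '203.0.113.7 - - [01/Jan/2000:10:00:00 +0000] "GET /details.php?d=example.com HTTP/1.1" 200 512',
    '203.0.113.7 - - [01/Jan/2000:10:00:01 +0000] "GET /details.php?d=example.com HTTP/1.1" 200 512',
    '198.51.100.23 - - [01/Jan/2000:10:00:02 +0000] "GET / HTTP/1.1" 200 1024',
    '203.0.113.7 - - [01/Jan/2000:10:00:02 +0000] "GET /details.php?d=example.com HTTP/1.1" 200 512',
]

if __name__ == "__main__":
    for ip, n in suspected_scrapers(sample_log, threshold=3).items():
        print(ip, n)
```

A sensible threshold depends on the time window the log covers; a real visitor rarely requests the same landing page thousands of times in a few weeks.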
 
2
•••
Changing to CloudFlare is a huge learning curve. I'm a 10-day-old beginner, and for most of that time I've been on with their support or their community support, who are all as thick as two short planks. Only IMHO.

Without CF, if somebody types a domain I'm selling on my domains-for-sale website into their browser, they get delivered to that domain's page on my website. The URL obviously includes a subpage of the main domain and a variable, the domain name.

With CF, I've just (after 8 days... it's a very long story) got them to acknowledge they are seeing the full address. But when I ask, OK, so why isn't it being delivered to that address on my website instead of the home page, I get long scientific mumbo jumbo which doesn't make any sense to me. I never get a clear answer, or even how to fix it.

You would have thought that if they are receiving the full URL, they would deliver the full URL to the website. But NO. They just dump it on the home page. Which is better than not resolving at all (which has been happening at times) when I've been trying to follow their instructions, which turned out to be incorrect.

It's got to the point that I can't understand what they are telling me and they don't listen to what I am telling them. Of course frustration is building on both sides. So I'm left waiting for yet another day without resolution.

So you might ask, OK, if it doesn't work, why not just leave them? Well, my answer is that this is such a basic problem that I just cannot believe I've been led up several dead ends trying to fix it over the last 6 days. They keep asking me what I want, and I keep saying: if somebody types a domain in their browser, I'm expecting them to be delivered to the page on the website which is selling that domain. But after answering that question, I get silence in response. ALSO, they are doing an excellent job at blocking IP addresses. I really need to get this fixed.

So has anybody who has recommended CloudFlare had any similar problems with them? I would say it would be a miracle if you haven't experienced something similar. And do you know how to fix this?

I have 2 TIMES changed my servers back to where they were before CloudFlare, and everything works fine. Which proves it's CF which is messing this up. OK, I know they are the biggest of their kind in the world. But is there any other company offering a CDN with a similar free level of service? Because I've now had about 2 weeks of potential customers being unable to reach my website by typing the domain they want into their browser.

Or might you magically know how to fix this problem?
 
0
•••
Here is how I do it.

I have a dedicated server, and I use that IP to point all my domains' DNS to, in a /defaultfolder.

From there I redirect them to
https://BuyUsedDomain.com/ subfolder/ index.php
where the domain is recognized and the LP is shown.

https://BuyUsedDomain.com/ is at Cloudflare with https. All the others are not.
 
2
•••
Can you give an example of a url that does not work, and one that does?
 
0
•••
OK. I discovered that my developer, who is installing Invisible ReCaptcha v3 on my website, has changed the way things are working. It was only put on the user pages and the landing pages for each domain. Both have proved to be unreliable with Google's ranking of visitors between 1-10. The lower the score, the more bot-like you are. 5+ is allowed by default. But I've taken this limit all the way down to 2 on the user pages and I still get "recaptcha failed". It was put on the landing pages because of the huge number of bots going through my website.

In a nutshell, I'm going to have to rethink the strategy on both the user pages and the landing pages. The user pages should be an easy fix, by replacing the invisible recaptcha with an "I am not a robot" button. For the landing pages I am going to try the button approach as well. If CloudFlare still insists on sending the direct link for a landing page to the home page (which is happening now because of the programming) after the landing page asks you to click "I am not a robot" (which I think is possible, but I'm not sure), then I'll have to forget about any recaptcha at all on the landing pages.

This was all working cleanly for the landing pages until we tried to put recaptcha on them and/or use CloudFlare at the same time. At that time xxxxx.com/details.php?d=domainname.com (the landing page being visited), just domainname.com in the browser, or even clicking on the link to the landing page on the website all worked fine. Typing domainname.com in your browser went to its respective landing page. But put recaptcha and CloudFlare into the mix and all hell broke loose. The programmer had to decide what to do if you failed the recaptcha, which is to send the visitor to the home page instead. Probably Google has not been kind to my own score, which means all my attempts to go to the landing pages were failing. So it looked to me as if every attempt at reaching a landing page, whether by typing domainname.com or xxxxx.com/details.php?d=domainname.com, or by clicking on the link in the website, was being taken to the home page rather than to the landing page. And the Google Recaptcha score can be high or low for anybody else too.
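As a side note on the scoring: Google's reCAPTCHA v3 documentation describes the score as a value from 0.0 (very likely a bot) to 1.0 (very likely a human), with 0.5 suggested as a starting threshold, so the 1-10 scale mentioned above is presumably the same number scaled by the site's own code. The sketch below is not the poster's implementation; the function and action names are hypothetical. It only shows the design choice discussed here: when the score is too low, fall back to a visible challenge on the same landing page instead of redirecting to the home page.

```python
def decide_action(score, threshold=0.5):
    """Pick what to do with a visitor given a reCAPTCHA v3 score in [0.0, 1.0].

    The action names are hypothetical labels for this sketch.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= threshold:
        return "show_landing_page"
    # Low score: ask for an explicit "I am not a robot" check on the same
    # URL, so a real buyer is not silently bounced to the home page.
    return "show_checkbox_challenge"

if __name__ == "__main__":
    for s in (0.9, 0.3, 0.1):
        print(s, decide_action(s))
```

Keeping the visitor on the requested URL also makes failures easier to diagnose, because a low score no longer looks like a broken redirect.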

So my basic proposition is that Google Invisible ReCaptcha is not ready for prime time. We are going to test whether the normal recaptcha "I am not a robot" button solves this issue. I think it's 50/50 whether CloudFlare will handle this correctly or not, but I am always the optimist until proven wrong. We shall see.

How does this affect my statements about CloudFlare staff? Not a lot. They talk with twisted lips, and no mere mortal can figure out what they are talking about. But they have in the end pointed out there was something wrong with the programming. It has taken 8 or 9 days of absolute rubbish to get to something they should have pointed out on day 1. It's been a terrible ride with their support. Just awful. They asked me to use URL forwarding to solve the issue, which is not required and has only made things worse.
 
2
•••
Can you share which script you are having the problem with?

I can confirm that folioTrader, dndork and db2 work well with Cloudflare enabled.

kam
 
0
•••
I'm using my own developed website. Not a script.
 
0
•••
I suggest you check the settings in your control panel (cPanel, DirectAdmin, etc.) and make sure that you have SSL enabled.
I had a similar experience before, just because I forgot to enable SSL in the control panel.
 
1
•••
This was working fine until we put CF & ReCaptcha into the mix, so it will work again someday soon. Most of the problems are with Recaptcha. Once we get that working to our satisfaction, I'm hoping it turns out CF is not at fault here. IMHO. But that doesn't mean their support isn't awful. It is awful.

Your setup is not much different from mine. I point every domain on the website via my registrar's DNS directly to xxxxx.com/details.php?d=domainname.com. Before CF and Invisible ReCaptcha 3, this was all I needed to do. It worked fine. You have to understand that at the moment it's a fluid situation which I'm working on with my programmer, so not everything is tied down yet. I mention all the above just to let you know where I'm at and what we are doing about it before we test it again.

But I have a couple of questions about what you are doing.

1) Why do you point all your domains to a /defaultfolder instead of pointing them directly, like me?
2) How exactly do you do these redirects? In CloudFlare? Could you also give me an example?

I operate exactly like you say in your last 3 lines too. So we are almost the same, except I'm doing the redirect at the registrar's DNS whereas you are doing it from within /defaultfolder somehow. I'd be happy to learn exactly how you are doing this. But my method seems more direct, from my understanding.

I'm on a VPS rather than a dedicated server. But that should not make any difference.

We, me and my programmer, are working through our software issues, which have been caused by our handling of ReCaptcha. But. I am expecting those to get resolved, sooner rather than later, and I am expecting this to get resolved in CloudFlare. I can see the end of the tunnel. But I'm not quite sure how far away the end is. But the hole looks about 1/4 size of my screen. So it's not too far distant in the future.
 
0
•••
I'm 100% sure it is enabled. But to be sure. I'll check it in a few minutes time. Thank you.

Yep it's turned on.
 
0
•••
Mildly put, you're experiencing a DDoS attack on your website or server.
In the worst-case scenario, it's a brute-force attack from hackers trying to compromise your server with malicious code injections, backdoors etc.
The aim is to gain illegal access.

Possible Solutions:
The first line of defense is your .htaccess file. You need to ban certain dangerous requests (trace, debug, track, delete, allow_url_include|auto_prepend_file|auto_append_file) to your website. Bad bots, too, should be denied access. Deny directory browsing.
You should also deny access to protected critical server files: .htaccess, .htpasswd and everything starting with a dot.
(The list of deny/allow rules is long, so you can PM me for a copy of the .htaccess file I use on my server.)
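The poster's actual .htaccess rules are not shown in the thread, so the following is only an application-level sketch of the same idea in Python: reject the request methods named above and any path that contains a dot-prefixed segment. The function name and the exact rule set are assumptions for illustration.

```python
# Request methods the post suggests banning.
BLOCKED_METHODS = {"TRACE", "TRACK", "DEBUG", "DELETE"}

def is_blocked(method, path):
    """Return True if the request should be refused under the rules above."""
    if method.upper() in BLOCKED_METHODS:
        return True
    # Ignore the query string, then look for any segment starting with a dot
    # (.htaccess, .htpasswd, .git, ...). Note this also blocks "..", and it
    # would block /.well-known/ too, which some sites need to allow.
    path_only = path.split("?", 1)[0]
    return any(segment.startswith(".") for segment in path_only.split("/") if segment)

if __name__ == "__main__":
    print(is_blocked("GET", "/.htaccess"))
    print(is_blocked("GET", "/details.php?d=example.com"))
```

Doing this in the web server configuration, as the post suggests, is usually preferable because the request is refused before any application code runs.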

Secondly, implement a rate limiter to limit requests per IP per page, for example by capping the number of requests per second to a given page from the same IP. For that, you can use this simple script:
http://www.omniceps.com/stop-brute-force-attacks-php-throttling/
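The linked script is not reproduced here. As a rough illustration of the idea, the sketch below keeps a sliding window of recent request times for each (IP, page) pair in memory; the class name, limits and addresses are hypothetical, and a real deployment would need shared storage if more than one server process handles requests.

```python
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `max_requests` per `window_seconds` for each (ip, page) pair."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # (ip, page) -> timestamps of allowed requests

    def allow(self, ip, page, now):
        """Return True and record the request if it is within the limit."""
        hits = self._hits[(ip, page)]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

if __name__ == "__main__":
    import time
    limiter = RateLimiter(max_requests=2, window_seconds=1.0)
    for _ in range(3):
        print(limiter.allow("203.0.113.7", "/details.php", time.monotonic()))
```

Passing `now` in explicitly keeps the class easy to test; in production it would typically come from `time.monotonic()` as in the usage lines.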

Be aware, however, that attacks can be distributed over multiple IPs within seconds, so you should ban not just a particular IP but the whole IP range.
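Banning a range rather than a single address can be sketched with Python's standard `ipaddress` module. The /24 prefix below is an arbitrary assumption for IPv4; the right size depends on who owns the range, and a wider ban will also catch more legitimate visitors.

```python
import ipaddress

def containing_network(ip, prefix_len=24):
    """Return the network of the given prefix length that contains `ip`."""
    # strict=False lets us pass a host address instead of a network address.
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)

def is_banned(ip, banned_networks):
    """Return True if `ip` falls inside any banned network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in banned_networks)

if __name__ == "__main__":
    # Hypothetical offender from a documentation-only address range.
    banned = [containing_network("203.0.113.7")]
    print(banned[0])
    print(is_banned("203.0.113.200", banned))
    print(is_banned("198.51.100.1", banned))
```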

Thirdly, if you are on a managed VPS, ask your ISP to install ModSecurity on the server (it's not my software). It's free and very powerful.

Another option is to subscribe to a service such as Sucuri.
 
0
•••
I think you are correct about the DDoS attack. At least that was happening earlier in October, I think, which was why we moved to CloudFlare. But I am surprised the web host never mentioned it as such. Perhaps the DDoS protection they praise in their literature was just inadequate. 750,000 connections from 1 IP address on 1 page over a 2.5 week period, plus another 200 IPs with anywhere from 100 to 750,000 connections each. Doesn't seem so smart to me. My bandwidth was going through the roof. It probably quadrupled for the month of October.

But I'm not so sure that is my current problem. I THINK the root cause is currently a wrong setup of ReCaptcha. However, you raise a lot of good points. I will study them tomorrow and take steps to implement as much as I can. I will PM you also. Thank you.
 
0
•••
The default folder is what you see when you open the IP, so all I have to do is use a nameserver with A records pointing to that IP.

The redirect includes country / IP / domain information.

The only domain with https is the BuyUsedDomain / folder / index, which is where I redirect to.

I have done it this way for https, as I didn't want to change 6K domains.
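The redirect script itself was not shared, so the following is only a guess at the core step: take the hostname the visitor typed (the HTTP Host header) and build the landing-page URL from it. It borrows the `details.php?d=` URL shape and the `xxxxx.com` placeholder quoted earlier in the thread; the function name, the https scheme and the `www.` stripping are assumptions.

```python
from urllib.parse import urlencode

# Placeholder sales site, as written in the thread; not a real address.
LANDING_BASE = "https://xxxxx.com/details.php"

def landing_url(host_header):
    """Build the landing-page URL for the domain found in a Host header."""
    host = host_header.strip().lower()
    host = host.split(":", 1)[0]      # drop an optional port such as ":80"
    if host.startswith("www."):
        host = host[len("www."):]     # treat www.domain and domain alike
    return f"{LANDING_BASE}?{urlencode({'d': host})}"

if __name__ == "__main__":
    print(landing_url("WWW.DomainName.com:80"))
```

With every parked domain's A record pointing at the same server IP, a default virtual host can run this logic for whatever name was requested, which is why no per-domain configuration is needed.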
 
1
•••
If reCAPTCHA is causing issues, why not take it off?

Cloudflare will stop most of the junk traffic.
 
0
•••
So you have 6K of A records doing the redirection? Is that correct?
 
0
•••
Taking reCAPTCHA off will be my last resort, though there's probably almost a 50% chance it comes to that. I think we have solutions to implement and try before we abandon it. Abandoning it would probably solve all our problems, but I don't want to give up on it without trying out the solutions first :)
 
0
•••
OK. So I've now corrected my last known problem with CloudFlare, and got ReCaptcha 3 working with it also. Primarily this problem was a programming decision made without my knowledge. I still have to "fine-tune" Recaptcha, where I can see some problems coming. But we'll wait and see what happens.
 
0
•••
I know the thread is a little old, but I just wanted to give my opinion.

Cloudflare does not help prevent scrapers. I have scraped a lot of sites that sit behind Cloudflare. One example is tutorialzine (.com), for my tutorial search engine :)

The only way would be to ban the IP or the IP class, but banning an IP range would cost you some visitors.
 
1
•••
Me too. Their firewall will cut down on referrer spam, Tor users, etc., though.
 
0
•••