Analytics can be deceiving

Most modern analytics platforms, including Google Analytics, rely on a script that's loaded when users visit your site. Several years ago, this was an effective way to discern real people from automated crawlers: crawlers didn't use full web browsers, so they never ran the script and their visits were never recorded.

That's changed. As more websites require JavaScript to function, crawlers have had to switch to full browsers. They don't just download the code and pick it apart to find the content; they load webpages just like a human does, which makes it difficult to tell them apart from real people. Projects like Selenium have made this possible.
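
To give a rough idea of how little effort that takes, here's a minimal sketch in Python, using Selenium to drive headless Chrome (the URL is just a placeholder, not anything we've seen in the wild):

    # Minimal sketch: load a page in a real (headless) Chrome via Selenium.
    # Because the full browser runs the page's JavaScript, any analytics
    # snippet on the page fires just as it would for a human visitor.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")        # no window or display required
    driver = webdriver.Chrome(options=options)

    driver.get("https://www.example.com/")    # placeholder URL
    print(driver.title)                       # content is available after scripts run
    driver.quit()

A handful of lines like that is all it takes for a crawler to look like an ordinary browser to most analytics tools.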

Overall, this is great news: it means search engines are getting smarter. Tools like Selenium are also great for automated testing, meaning that web developers can test their sites in a variety of web browsers for every change they make. At NamePros, we use New Relic Synthetics to test various parts of our site at set intervals. The automated monitor can log in, edit its own post, and alert us of any problems. It'll even take screenshots of any errors it encounters.

The problem is that these tools aren't always used for good. Recently, we've seen an influx of malicious crawlers that use Selenium or similar technologies. Google Analytics typically thinks these are real people, which can throw off our metrics quite a bit.

We keep track of what sort of technologies our visitors use, and we plan our features accordingly. For example, if we have a lot of visitors using iPhones, but not many using iPads, we'll spend more time planning for, developing for, and testing on iPhones. One technology metric that we track is screen resolution, and we do this with the help of Google Analytics.

When I looked at our screen metrics earlier today, I noticed something unusual: the resolution 800x600 had made its way into the top 10 resolutions for the past week. 800x600 means the screen is 800 pixels wide by 600 pixels tall. That was one of the most popular resolutions two decades ago; today, it's rarely seen. This immediately raised red flags: Google Analytics was telling me I should be testing on 800x600 screens, but my experience told me that didn't make any sense.

Upon closer inspection, the metrics became even more suspicious: apparently, nearly 17,000 users over the past 7 days had a screen resolution of 800x600, were using the exact same version of Chrome on Linux, had a bounce rate of 99.61%, and spent 2 seconds on the site each time they visited.

Fortunately, we have our own analytics system that's much more detailed than Google Analytics. We mostly use it for detecting and preventing fraud, but it also comes in handy for troubleshooting technical issues. According to our analytics, this unusual traffic started on June 20th and proceeded at a rate of 400 to 600 visits per hour. They were clearly automated requests, not real people, yet Google was treating each and every one of them as a unique user.
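
The pattern is easy to spot once you group raw requests by hour. As a rough illustration only (this is not our actual tooling; the log format and user agent below are placeholder assumptions):

    # Sketch: bucket requests from one suspicious browser fingerprint by hour.
    # Assumes a common Apache/nginx-style access log; adjust the regex to your format.
    import re
    from collections import Counter

    SUSPECT_UA = "Mozilla/5.0 (X11; Linux x86_64)"   # placeholder fingerprint
    hits_per_hour = Counter()

    with open("access.log") as log:
        for line in log:
            if SUSPECT_UA not in line:
                continue
            # e.g. [20/Jun/2018:14:05:32 +0000] -> bucket by date and hour
            match = re.search(r"\[(\d{2}/\w{3}/\d{4}):(\d{2}):", line)
            if match:
                hits_per_hour[match.group(1) + " " + match.group(2) + ":00"] += 1

    for hour, count in sorted(hits_per_hour.items()):
        print(hour, count)   # a steady 400-600 per hour is a strong bot signal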
 
•••
My guess would be a remote server or VM running at 800x600 with a scraper utilizing browser rendering. Most of my remote connections and VMs are set up at 800x600 or 1440x900. Though of course there are other explanations, such as open-source scraping projects or code.

Depending on the number of unique IPs versus "users", it's probably a new NPs-specific scraper in town. There is a lot of data here that I imagine some people lurking in the shadows probably want.

I manage a few decent-sized websites myself, and it is amazing to see the variety of tactics employed to gain data. But I have been on the other side as well, and as a programmer you must be smart to outsmart a smart technical web operator.

This person or group would have been smart to at least vary the user agent. Using an outdated screen resolution was certainly a huge indicator. I think some admins would be scratching their heads wondering if the resolution was making a comeback and if they should reformat their website. :banghead:

But don't get the wrong idea... I am not your NPs scraper. :xf.grin:
 
•••
@Paul Buonopane garbage in, garbage out ;)

Good spot on the screen res! Hope this isn't a silly question, but have you enabled GA's default Bot Filtering? Under View Settings, you can tick the box for 'Exclude all hits from known bots and spiders'. You can test it on a new view for the property and compare over time to the current view. It's not perfect but it catches at least 50% of non-human traffic.

Other options are to look for regular indicators, such as Hostname values (e.g. 'not set') or IP ranges, and use these to create an exclude rule within an Advanced Segment (or the reverse: filter only for hostnames related to NP properties and services). Common behaviours of sophisticated invalid traffic also include these three characteristics together: 100% New Users, a 99%+ Bounce Rate, and 1 Page per Session.

The next-generation bots (Ghost Traffic) don't even load the website itself; they send the spam traffic directly to the GA servers using the property ID. They're almost impossible to block with traditional filters or advanced segments, but I think there's a way to mitigate it through GTM.
 
•••
Thank you for the explanation. Very insightful. Is this traffic also the source of the multiple spam posts from new users, usually with unformatted junk content that I think is related to the ongoing football World Cup?
 
•••
Lol... Just opened my analytics on my personal domain profile and was greeted by this message.

29.59% of your sessions have "Screen Resolution = 800x600". They perform worse on some key metrics.

I am guessing you received a similar message about NPs, which triggered this post?

Seems some mass scraper has emerged using a low resolution - or Google just added a feature to notify webmasters about resolution issues in the message center. (I never noticed such a message before myself, though I understand the metrics are there.)

Google's bot blocking is generally pretty good (if you compare it to raw logs). You'd assume their filter would recognize this by now - but it could possibly be malware running on personal PCs, making it more complicated than blocking IPs after X questionable requests in a certain timeframe across all Google-tracked sites.

Edited to add: Hmmm... Now I'm curious... Do my websites have the same message?

Edited again: No messages in my message center on my other sites (which have nothing to do with domains). I would have to look at past months' reports and compare them to now to see if there has been an uptick in 800x600.
 
•••
The next-generation bots (Ghost Traffic) don't even load the website itself; they send the spam traffic directly to the GA servers using the property ID. They're almost impossible to block with traditional filters or advanced segments, but I think there's a way to mitigate it through GTM.
What would be the reason for someone to do this?

Most bots are actually going in the opposite direction. They now render the site in a browser (which triggers the analytics), whereas in the past they would just load the page source. This is because newer websites generate content via JavaScript rather than serving traditional static HTML pages.

But I do not see what someone operating a bot would gain by hitting a site's analytics only?
 
•••
What would be the reason for someone to do this?

Most bots are actually going in the opposite direction. They now render the site in a browser (which triggers the analytics), whereas in the past they would just load the page source. This is because newer websites generate content via JavaScript rather than serving traditional static HTML pages.

But I do not see what someone operating a bot would gain by hitting a site's analytics only?

They don't load the analytics JS script; they bypass it using GA's Measurement Protocol.
Mostly referral spam, some malicious (if you follow the URL, you may get malware), but the rest is designed to make money from ads when GA users click through to the referrer or hostname. On the extreme end, it's calculated sabotage by polluting a competitor's dataset (fake traffic, fake custom events, etc.).
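
For anyone wondering how that works, a "ghost" hit is just a single HTTP request to GA's collection endpoint; nothing on the target site is ever loaded. A rough sketch (the property ID, client ID, and referrer are placeholders):

    # Sketch of a Universal Analytics Measurement Protocol "pageview" hit.
    # A spammer only needs the target's property ID (UA-XXXXXXX-1 is a placeholder).
    import requests

    payload = {
        "v": "1",                               # protocol version
        "tid": "UA-XXXXXXX-1",                  # target site's GA property ID
        "cid": "35009a79-1a05-49d7-b876-2b884d0f825b",  # arbitrary client ID
        "t": "pageview",                        # hit type
        "dp": "/",                              # page path to fake
        "dr": "http://referral-spam.example/",  # referrer the spammer wants you to see
    }
    requests.post("https://www.google-analytics.com/collect", data=payload)

That single request shows up in reports as a session, which is why JS-based defences never see it.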
 
•••
They don't load the analytics JS script; they bypass it using GA's Measurement Protocol.
Mostly referral spam, some malicious (if you follow the URL, you may get malware), but the rest is designed to make money from ads when GA users click through to the referrer or hostname. On the extreme end, it's calculated sabotage by polluting a competitor's dataset (fake traffic, fake custom events, etc.).
I believe I see what you are saying... But I don't think this is a big problem, widespread, or an issue that a webmaster would normally face, aside from, as you say, a saboteur trying to manipulate your analytics or cause G to penalize your Ad account.

This tactic would generally be employed by webmasters themselves, to artificially inflate their traffic numbers or to maliciously try to earn illegitimate profit from G Ads.

As far as bypassing the analytics JS goes, that would be easy to do by manipulating the source before rendering it in the browser. I'd imagine a cunning scraper would do this to avoid being detected, if they suspected the webmaster only used Google Analytics.
 
•••
I believe I see what you are saying... But I don't think this is a big problem, widespread, or an issue that a webmaster would normally face, aside from, as you say, a saboteur trying to manipulate your analytics or cause G to penalize your Ad account.

This tactic would generally be employed by webmasters themselves, to artificially inflate their traffic numbers or to maliciously try to earn illegitimate profit from G Ads.

As far as bypassing the analytics JS goes, that would be easy to do by manipulating the source before rendering it in the browser. I'd imagine a cunning scraper would do this to avoid being detected, if they suspected the webmaster only used Google Analytics.

It's surprisingly common - has been going on for a few years now. For referral spam, it can be cheaper and easier for spammers to push data directly into GA's servers (Google Analytics), rather than executing the whole page in the browser or running source code.

These are two good articles covering the topic:

http://help.analyticsedge.com/article/definitive-guide-to-removing-google-analytics-spam/
https://carloseo.com/removing-google-analytics-spam/
 
•••
It's surprisingly common - has been going on for a few years now. For referral spam, it can be cheaper and easier for spammers to push data directly into GA's servers (Google Analytics), rather than executing the whole page in the browser or running source code.

These are two good articles covering the topic:

http://help.analyticsedge.com/article/definitive-guide-to-removing-google-analytics-spam/
https://carloseo.com/removing-google-analytics-spam/
Ahh..

Most sites don't post their traffic logs and most web operators don't blindly click on referral links - but I got ya.

It is certainly a best practice to filter as much bad data as you can out of your dataset if you want it to be useful - so those links are helpful for web operators.
 
•••
FYI:

From the analyticsedge link above.

2018-06-23 final update. This article was started in January 2015 to help Google Analytics users deal with a new form of ‘ghost referrals’, or false information, being injected into their analytics reports. Over the last 3 1/2 years, a number of people have used a variety of techniques to manipulate the data being reported, and I have maintained a series of filters for limiting their effectiveness. For the most part, Google is now responding on its own, and I am moving on.

So it looks like Google is mitigating this issue on its own now - pretty much to the author's satisfaction.

To break this down for anyone reading - @Nikul Sanghvi is referring to a type of spam attack that tries to trick a web operator into clicking on a malicious link from their Google Analytics dashboard - or hopes that they publicly post their access logs so that they can gain referral links to boost their SEO.

This method of attack is pretty outdated, though it was revived in 2014-15 using Google Analytics to reach the web operator with the malicious URL.

Google has since acted, and it seems this is not very common these days (nor do I see much benefit for someone to do it). Though I would imagine it persists, as all spam does.

I do not believe this would be related to @Paul Buonopane 's original post.

(Not trying to argue with you, @Nikul Sanghvi - you piqued my interest, so I was trying to understand what you were describing and what the benefit to the person responsible would be. I hadn't heard the term Ghost Spam before, but I am aware of the old referral trick, which originated in a less mature internet.)
 
•••
@Paul Buonopane garbage in, garbage out ;)

Good spot on the screen res! Hope this isn't a silly question, but have you enabled GA's default Bot Filtering? Under View Settings, you can tick the box for 'Exclude all hits from known bots and spiders'. You can test it on a new view for the property and compare over time to the current view. It's not perfect but it catches at least 50% of non-human traffic.

Other options are to look for regular indicators, such as Hostname values (e.g. 'not set') or IP ranges, and use these to create an exclude rule within an Advanced Segment (or the reverse: filter only for hostnames related to NP properties and services). Common behaviours of sophisticated invalid traffic also include these three characteristics together: 100% New Users, a 99%+ Bounce Rate, and 1 Page per Session.

The next-generation bots (Ghost Traffic) don't even load the website itself; they send the spam traffic directly to the GA servers using the property ID. They're almost impossible to block with traditional filters or advanced segments, but I think there's a way to mitigate it through GTM.

It's unlikely that there was actually a screen. Typically, these types of bots would be running on a headless Linux server without any desktop environment. They'll report an 800x600 display, but there's actually no visual system on the server, only text. You wouldn't be able to RDP or VNC into them.

It's also unlikely that it was specific to NamePros. The bots were crawling in a manner that will work on most sites, but not NamePros. They were ignoring our <base> tag, meaning that most of the URLs they tried were invalid and returned 404 or 410.
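
To illustrate the mistake they were making (the URLs and <base> href below are made-up examples, not our actual ones): relative links on our pages are supposed to resolve against the <base> href, and a crawler that resolves them against the page URL instead ends up requesting paths that don't exist.

    # Sketch of the URL-resolution mistake; all URLs here are hypothetical.
    from urllib.parse import urljoin

    page_url  = "https://www.namepros.com/threads/some-thread.12345/"
    base_href = "https://www.namepros.com/"        # hypothetical <base href> value
    link      = "threads/another-thread.67890/"    # a relative link on the page

    print(urljoin(base_href, link))  # correct: https://www.namepros.com/threads/another-thread.67890/
    print(urljoin(page_url, link))   # ignoring <base>: .../some-thread.12345/threads/another-thread.67890/ -> 404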

I did confirm that these were actually rendering the pages in a headless browser, rather than just spamming GA. They were fully capable of doing just about anything that a normal browser can. Specifically, they were using Headless Chrome.
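
That also ties back to the 800x600 mystery: as far as I know, Headless Chrome's default window size is 800x600, and with no real display attached, that's what it reports as the screen. You can check it yourself with a few lines of Selenium (a sketch; exact flags may vary by version):

    # Sketch: ask a stock headless Chrome what it thinks the screen size is.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")    # note: no --window-size override
    driver = webdriver.Chrome(options=options)
    driver.get("about:blank")
    print(driver.execute_script("return screen.width + 'x' + screen.height"))
    # Typically prints 800x600 - the same value Google Analytics was recording.
    driver.quit()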

We block traffic with suspicious headers. If you don't have a Host header set, you won't be able to load most pages on NamePros.

It's unlikely that referral spam was involved here, but it's technically possible. Like you said, though, it's unusual to use something as heavyweight as a headless browser for that. Usually these headless browsers are looking to scrape some form of content. For example, we have threads talking about proxies, and many users edit the domain name out of sales threads after domains are sold, resulting in something of the form: xxx.com. It shouldn't be hard to guess what sort of bots are scraping those threads.
 
•••