Do raw logs show everything?

Dr_Test · Jul 3, 2008

I have a question... Are raw logs supposed to show *everything* that users are doing on my site? My site uses cPanel, which spits out what it calls a "raw access log", I think. It seems to show all site file requests, etc., but sometimes I don't see certain things...

Example: Often, someone will load my main page, but instead of getting a log entry showing that they're downloading index.htm, I'll get one showing that they're downloading header.htm (an inline frame that displays my header.htm file).

Am I missing something here?

Also, if there is NO reference in my logs to someone downloading a certain file, does that mean the file was never downloaded, period? I'm wondering how the Yahoo bot used up 20 gigs last month on my tiny site, if it didn't download my huge RAR archives. (There's no mention of them ever being downloaded in the logs...)

weblord · Jul 3, 2008

A raw log shows you the following:
- IP address
- date/time
- GET, which means someone requested a file
- the filename (if it's an .exe, it shows that too)
- browser used (the user agent), which also shows if it's a search engine bot

If your file is being downloaded by someone, you can see the IP.
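To make those fields concrete, here is a minimal Python sketch that splits one log line into its parts. It assumes the log is in Apache's "combined" format, which is what the entries quoted later in this thread look like; the names `LOG_PATTERN` and `parse_line` are made up for this example, and your host's format may differ.

```python
import re

# Assumed layout: Apache "combined" log format, matching the entries
# quoted later in this thread. Adjust the pattern if your host differs.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent; count it as zero bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = (
    '66.249.70.104 - - [03/Jul/2008:05:27:15 -0700] '
    '"GET /files/Movies/deftonesvid.zip HTTP/1.1" 206 16777216 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
entry = parse_line(sample)
print(entry["ip"], entry["method"], entry["path"], entry["status"], entry["size"])
```

Note that status 206 means "Partial Content": the client asked for a byte range of the file, so one large download can show up as several log entries.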

nielsencl · Jul 5, 2008

The log files have everything in the way of site activity, so you should see all the page files, graphic files, include files, page errors, and any other kind of file that can be found on your site.

If you don't see any activity for a file, then it may not have been accessed. However, a busy web server may not always record 100% of the activity. I only say this because I know that in some of my reseller accounts I will see some gaps in the reporting, where it will look like the site was down for a couple of days, but I can tell that it was still working. The logging process may not always work as it should, but in general it is very good and does show you everything.

If you look at the entire log, you should see by IP what a person does, but the files may not be in the order that you expect them to be, and log entries from other visitors may be interleaved with theirs.

Keep in mind that the home page could have been a request for ".../index.htm" or for ".../".
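If you count page hits with a script, a small helper like the one below keeps those two forms from being counted separately. This is only a sketch: `normalize_path` is a made-up name, and it assumes `index.htm` is the directory index on your server.

```python
from collections import Counter

def normalize_path(path, index_name="index.htm"):
    """Map '/dir/index.htm' to '/dir/' so both forms count as the same page."""
    path = path.split("?", 1)[0]  # ignore any query string
    if path.endswith("/" + index_name):
        return path[: -len(index_name)]
    return path

# Hypothetical request paths pulled from a log.
requested = ["/", "/index.htm", "/header.htm", "/index.htm?ref=1"]
hits = Counter(normalize_path(p) for p in requested)
print(hits["/"], hits["/header.htm"])
```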

Dr_Test · Jul 5, 2008

Okay, thanks.

So, here's the scenario: I suspect that a certain user might be using a program like UpdatePatrol or NeoDownloader (or both) to keep tabs on my site, and basically bum-rush my files at certain intervals (kind of like a bot, I guess). Could it be that this torrent of transfers is creating gaps in my raw logs? I imagine it could put the site under stress if there are suddenly 10 requests for 700 MB RAR files, all at the same instant.

nielsencl · Jul 6, 2008

It's hard to say if you have missing entries or just large gaps between file requests. If your server is really overloaded, it could be that the logging process is a low priority and data gets dropped when the server can't keep up. Your hosting provider may be able to tell you more about that.

Your web stats program that comes with your hosting should show you a section of information with the IP addresses of those that use the most bandwidth. Using something like network-tools.com you can find out some information about them and make sure it's someone you are having a problem with. Then if you are on a Linux host you can block them from downloading any files if they keep using the same IP address.

If you have one or more 700 MB RAR files, then you can easily start burning through your bandwidth even with "normal" requests. One thing to keep in mind is that some clients, like web site copiers, can make many requests for many files at one time. And some programs can make many requests to copy just one file. You may also be getting hit with spiders and bots that are just trying to see what you have on your site. Using a robots.txt file can help keep them away from your large download files.

I have a site with about 25 million expired domain names in HTML files. When I spot an IP address downloading over 500MB during a month I take a look at where they are located. If they are in China or some other countries I may block them. I can't afford to have people sucking down huge parts of my site if it's going to cost me more for bandwidth.
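If your stats program doesn't give you that per-IP total, you can get a rough one from the raw log yourself. The sketch below assumes combined-format lines (the byte count is the second number after the quoted request); `bytes_per_ip` and `heavy_users` are made-up names, the sample lines use made-up documentation IPs, and the 500 MB threshold is just the figure from my post above.

```python
from collections import defaultdict

def bytes_per_ip(lines):
    """Sum the response sizes in combined-format log lines, keyed by client IP."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue  # not a recognizable log line
        before_request = parts[0].split()
        status_and_size = parts[2].split()
        if not before_request or len(status_and_size) < 2:
            continue
        size = status_and_size[1]
        if size.isdigit():  # a size of "-" means nothing was sent
            totals[before_request[0]] += int(size)
    return dict(totals)

def heavy_users(totals, threshold_bytes=500 * 1024 * 1024):
    """Return (ip, total_bytes) pairs above the threshold, largest first."""
    over = [(ip, total) for ip, total in totals.items() if total > threshold_bytes]
    return sorted(over, key=lambda item: item[1], reverse=True)

# Hypothetical sample lines; in practice, iterate over the open log file.
sample_lines = [
    '192.0.2.10 - - [03/Jul/2008:05:25:48 -0700] "GET /files/a.rar HTTP/1.1" 200 734003200 "-" "UA"',
    '192.0.2.10 - - [03/Jul/2008:05:27:15 -0700] "GET /files/b.zip HTTP/1.1" 206 16777216 "-" "UA"',
    '192.0.2.20 - - [03/Jul/2008:05:30:00 -0700] "GET /index.htm HTTP/1.1" 200 20480 "-" "UA"',
]
totals = bytes_per_ip(sample_lines)
for ip, total in heavy_users(totals):
    print(ip, total)
```

Keep in mind this only adds up what the log recorded, so if entries really are being dropped, the totals will be low.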

Dr_Test · Jul 8, 2008

Hmm, okay, thanks for the help, all.

New confusion to add to the mix:

An IP that has the User Agent of the Google Bot (IP: 66.249.70.104 / Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) has been downloading stuff from my site, though I have ALL bots disabled.

My robots.txt says this:

User-agent: *
Disallow: /archive/
Disallow: /files/
Disallow: /

(Notice that last line, which should block ALL the bots, at least the ones that respect robots.txt, from all dirs, not to mention the ones above, which are *still* being entered.)

But look what this IP is doing:

66.249.70.104 - - [03/Jul/2008:05:25:48 -0700] "GET /files/Thief/Faceless%20Part2%20-%20Ingame.zip HTTP/1.1" 206 16777216 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.70.104 - - [03/Jul/2008:05:27:15 -0700] "GET /files/Movies/deftonesvid.zip HTTP/1.1" 206 16777216 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

I notice also that while IPs with "googlebot" user agents have been visiting and downloading all month long, they haven't made a single grab for my robots.txt file, as far as I can see in my logs. The Yahoo bot has been hitting nothing BUT that file, and leaves immediately after.
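One way to rule out a mistake in the rules themselves is to run them through a robots.txt parser. This is a sketch using Python's standard `urllib.robotparser`, with the rules copied from above (minus the note on the last line). It only shows how a rule-following crawler would read the file; it says nothing about a client that never fetches robots.txt at all.

```python
from urllib.robotparser import RobotFileParser

# The rules as posted above.
rules = """\
User-agent: *
Disallow: /archive/
Disallow: /files/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

agent = "Googlebot"
for path in ("/files/Movies/deftonesvid.zip", "/index.htm"):
    # can_fetch() returns False when the rules forbid this agent from the path.
    print(path, parser.can_fetch(agent, path))
```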

Argh... I just wish the whole thing were more simple.

weblord · Jul 8, 2008

So if you don't want search engines to index your site, try putting that IP in your IP Deny Manager and also in a deny entry in .htaccess.

Do you have a firewall installed? If so, block that IP there as well.

Dr_Test · Jul 8, 2008

Ok, I blocked a bunch of IPs... I'll see what happens.

Btw, I DID manage to figure out for sure that my raw logs are not logging everything. My site has had 300 visits so far this month, so I did a text-search in the log for one of the images on the front page, and it only came up about 30 times. Also, I examined some visits carefully, and noticed that not all of the images on index.htm were reported as downloaded. Usually they were, but about 1/4 of the visitors are reported as downloading a PORTION of the images (which are a combined total of about 20k. It's a fast-loading page).

weblord · Jul 8, 2008

It would also help to change the filenames of the frequently downloaded files you suspect are being targeted. The same goes for any other huge files you might have. You could also apply some basic form of encryption if you're not into SEO.

Also, instead of linking directly to the download file, put some download tracking software in between, like http://www.whatcounter.com/, which is free and can be used to count actual downloads. Better yet, do what some download sites do and put a basic form (only email and/or first name) or a CAPTCHA before the download starts, anything to discourage mass downloads.

Another thing is to report these attackers to their ISP or hosting provider, so you can at least delay their abusive actions on your site while you take the time to implement the tips above.

Give us an update.

nielsencl · Jul 8, 2008

Dr_Test said:
Btw, I DID manage to figure out for sure that my raw logs are not logging everything. My site has had 300 visits so far this month, so I did a text-search in the log for one of the images on the front page, and it only came up about 30 times. Also, I examined some visits carefully, and noticed that not all of the images on index.htm were reported as downloaded.

What you are seeing may or may not be an indication that not all the traffic is being logged.

- When a visitor hits a page with images, the page file and the image files are all requested, UNLESS the visitor has the images turned off (rare, but it is possible).

- The second time that same visitor hits the same page, it may only load the page file and load the graphics from their cache. Also if the same graphic is used on all pages, it may only get loaded once.

- But the most likely thing you are seeing is spider and bot traffic. In general they don't care about your graphics and so will only load your pages. For your site, they may only load your zip and other download files.

Blocking by IP is what I was going to suggest. If you use the network lookup at Network-Tools.com it can show you information about the visitor and help you to decide if you really want to block them or not. And it can also show you if the "visitor" is really a bot if the IP is located at an ISP like "The Planet". Then you can block all traffic from there, since it's not likely to be users (although it may be user proxy traffic). When you get a range to block like
NetRange: 66.249.64.0 - 66.249.95.255

Just enter in
66.249.64.
66.249.65.
66.249.66.
etc.
etc.
66.249.95.

to keep them all out.
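If you'd rather not type all of those prefixes by hand, a few lines of Python can generate the list from the NetRange. This is a sketch using the standard `ipaddress` module; `slash24_prefixes` is a made-up name, and it assumes the range starts and ends on 256-address boundaries, as the one above does.

```python
import ipaddress

def slash24_prefixes(first, last):
    """List the 'a.b.c.' prefixes that cover the range from first to last."""
    start = ipaddress.IPv4Address(first)
    end = ipaddress.IPv4Address(last)
    prefixes = []
    for network in ipaddress.summarize_address_range(start, end):
        if network.prefixlen > 24:
            # The range does not line up with whole 'a.b.c.' blocks.
            raise ValueError("range is not aligned to 256-address blocks")
        for subnet in network.subnets(new_prefix=24):
            # '66.249.64.0' -> '66.249.64.'
            prefixes.append(str(subnet.network_address).rsplit(".", 1)[0] + ".")
    return prefixes

prefixes = slash24_prefixes("66.249.64.0", "66.249.95.255")
print(len(prefixes), prefixes[0], prefixes[-1])
```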

And since Googlebot seems awfully interested in your zip files, I would either contact Google about this, or add a line to your robots.txt that names Googlebot. It should be all you need. Perhaps the IP is being spoofed and it's not really Google...?
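To check whether a visitor with a Googlebot user agent is really Google, the usual approach is a reverse DNS lookup on the IP followed by a forward lookup on the resulting hostname. Here is a rough Python sketch of that check. `is_real_googlebot` is a made-up name, the lookup functions can be swapped out so the logic can be tried without a network connection, and the hostnames in the demo are invented for illustration, not actual lookup results.

```python
import socket

def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip)
    except OSError:  # no reverse DNS record
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # The hostname must resolve back to the same IP, otherwise the
        # reverse record could simply be faked by whoever owns the IP.
        return ip in forward_lookup(host)
    except OSError:
        return False

# Offline demo with stand-in lookups (hostnames here are hypothetical).
fake_reverse = {"66.249.70.104": "crawl-66-249-70-104.googlebot.com",
                "192.0.2.50": "host.example.net"}
fake_forward = {"crawl-66-249-70-104.googlebot.com": ["66.249.70.104"]}

for ip in fake_reverse:
    print(ip, is_real_googlebot(ip, fake_reverse.__getitem__,
                                lambda host: fake_forward.get(host, [])))
```

Calling `is_real_googlebot(ip)` with no extra arguments does the real DNS lookups.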

Finally, remember that robots.txt is pretty good for keeping out robots and spiders, but it also works in reverse with people. If someone wants to know where you keep stuff they might be interested in getting, the first place to look is your robots.txt.

One thing that can work well is to create a directory with a password-type name, like /883uJhh44-H3. Then all you have to do is control how people/bots learn about this directory and only tell people you want to know about it. The nice thing is that it's easy to change the folder name from time to time and keep old users out. Just make sure "directory browsing" is not enabled.

Dr_Test · Jul 9, 2008

Thanks for the pointers...
Btw:

nielsencl said:
And since Googlebot seems awfully interested in your zip files, I would either contact Google about this...

That's what I've been trying to do, but I can't find any contact information.

Any ideas? I tried their extensive Help and support sections, but they are highly frustrating to navigate, and seem bent on diverting you to endless help pages, with no contact info in sight.
