Post #1 by Dr_Test (thread starter), 07-03-2008, 12:13 AM

Do raw logs show *everything*?


I have a question... Are raw logs supposed to show *everything* that users are doing on my site? My site uses cPanel, which produces what it calls a "raw access log", I think. It seems to show all site file requests, etc... but sometimes I don't see certain things...

Example: Often, someone will load my main page, but instead of getting a log entry showing that they're downloading index.htm, I'll get one showing that they're downloading header.htm (an inline frame that displays my header.htm file).

Am I missing something here?

Also, if there is NO reference in my logs to someone downloading a certain file, does that mean the file was never downloaded, period? I'm wondering how the Yahoo bot used up 20 gigs last month on my tiny site, if it didn't download my huge RAR archives. (There's no mention of them ever being downloaded in the logs...)

Post #2 by weblord, 07-03-2008, 12:53 AM

A raw log does show you the following:
IP address
date/time
GET <- someone accessed a file
the filename (if it's an .exe, it shows that too)
browser used
whether the visitor is a search engine bot
If your file is being downloaded by someone, you can see their IP.
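
For anyone who wants to pull those fields out of a raw log with a script, here is a minimal Python sketch. It assumes the host writes Apache's "combined" log format, which matches the sample lines quoted later in this thread, but check your own log first. The names `LOG_PATTERN` and `parse_line` are made up for this example and are not part of any library.

```python
import re

# Combined log format:
# ip ident user [time] "method path protocol" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" in the size column means no body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Sample line taken from a later post in this thread.
sample = (
    '66.249.70.104 - - [03/Jul/2008:05:27:15 -0700] '
    '"GET /files/Movies/deftonesvid.zip HTTP/1.1" 206 16777216 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
entry = parse_line(sample)
print(entry["ip"], entry["method"], entry["path"], entry["status"], entry["size"])
```

Lines that do not fit the pattern come back as `None`, so a script can count them separately instead of silently skipping them.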
__________________
Nabaza.com - Amaia

Post #3 by nielsencl, 07-05-2008, 02:01 PM

The log files have everything in the way of site activity, so you should see all the page files, graphic files, include files, page errors, and any other kind of file that can be found on your site.

If you don't see any activity for a file, then it may not have been accessed. However, a busy web server may not always record 100% of the activity. I only say this because in some of my reseller accounts I see gaps in the reporting, where it looks like the site was down for a couple of days even though I can tell it was still working. The logging process may not always work as it should, but in general it is very good and does show you everything.

If you look at the entire log, you should be able to follow by IP what a person does, but the files may not be in the order you expect, and entries from other visitors will be mixed in between.

Keep in mind that the home page could have been a request for ".../index.htm" or for ".../". :-)
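
The two points above (following one visitor by IP, and treating "/" and "/index.htm" as the same page) can be sketched in a few lines of Python. This is only an illustration: the IP addresses and requests are made up, the function names are hypothetical, and `index.htm` as the directory index is an assumption about how the server is configured.

```python
from collections import defaultdict


def normalize_path(path, index_name="index.htm"):
    """Treat a directory request such as "/" as a request for its index file."""
    return path + index_name if path.endswith("/") else path


def requests_by_ip(entries):
    """Group (ip, path) pairs by IP, keeping each visitor's requests in log order."""
    grouped = defaultdict(list)
    for ip, path in entries:
        grouped[ip].append(normalize_path(path))
    return dict(grouped)


# Made-up entries, interleaved the way two visitors would appear in a real log.
entries = [
    ("10.0.0.1", "/"),
    ("10.0.0.2", "/index.htm"),
    ("10.0.0.1", "/header.htm"),
    ("10.0.0.2", "/header.htm"),
]
visits = requests_by_ip(entries)
print(visits["10.0.0.1"])
```

After grouping, both visitors show the same two requests, even though one asked for "/" and the other for "/index.htm".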

Post #4 by Dr_Test (thread starter), 07-05-2008, 02:35 PM

Okay, thanks.

So, here's the scenario: I'm suspecting that a certain user might be using a program like UpdatePatrol or NeoDownloader (or both) to keep tabs on my site, and basically bum-rush my files at certain intervals. (kind of like a bot, I guess) Could it be that this torrent of transfers is creating gaps in my raw logs? I imagine it could put the site under stress, if there are suddenly 10 requests for 700mb RAR files, all at the same instant.

Post #5 by nielsencl, 07-05-2008, 05:39 PM

It's hard to say if you have missing entries or just large gaps between file requests. If your server is really overloaded, it could be that the logging process is a low priority and data may get dropped if the server can't keep up. Your hosting provider may be able to tell you more about that.

Your web stats program that comes with your hosting should show you a section of information with the IP addresses of those that use the most bandwidth. Using something like network-tools.com you can find out some information about them and make sure it's someone you are having a problem with. Then, if you are on a Linux host, you can block them from downloading any files if they keep using the same IP address.

If you have one or more 700 MB RAR files, then you can easily start burning through your bandwidth even with "normal" requests. One thing to keep in mind is that some clients, like web site copiers, can make many requests for many files at one time. And some programs can make many requests to copy just one file. You may also be getting hit with spiders and bots that are just trying to see what you have on your site. Using a robots.txt file can help keep them away from your large download files.

I have a site with about 25 million expired domain names in HTML files. When I spot an IP address downloading over 500MB during a month I take a look at where they are located. If they are in China or some other countries I may block them. I can't afford to have people sucking down huge parts of my site if it's going to cost me more for bandwidth.
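
If your stats program doesn't break bandwidth down by IP, a rough version of the check described above can be scripted. The Python sketch below assumes you have already extracted (IP, bytes sent) pairs from the log. The records are invented for illustration, the function names are hypothetical, and the 500 MB limit is just the figure mentioned above.

```python
from collections import Counter

LIMIT_BYTES = 500 * 1024 * 1024  # 500 MB per month, the threshold mentioned above


def bytes_by_ip(records):
    """Sum the bytes sent to each IP from (ip, size) pairs."""
    totals = Counter()
    for ip, size in records:
        totals[ip] += size
    return totals


def heavy_users(totals, limit=LIMIT_BYTES):
    """Return (ip, total_bytes) pairs over the limit, largest first."""
    return [(ip, total) for ip, total in totals.most_common() if total > limit]


# Invented records: one IP pulls a 700 MB archive, the other just loads pages.
records = [
    ("192.0.2.10", 700 * 1024 * 1024),
    ("192.0.2.20", 20000),
    ("192.0.2.10", 1000),
]
totals = bytes_by_ip(records)
for ip, total in heavy_users(totals):
    print(ip, round(total / (1024 * 1024)), "MB")
```

Whether to block an address that shows up here is still a judgment call, so the script only reports totals and leaves the decision to you.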

Post #6 by Dr_Test (thread starter), 07-07-2008, 11:00 PM

Hmm, okay, thanks for the help, all.

New confusion to add to the mix:

An IP that has the User Agent of the Google Bot (IP: 66.249.70.104 / Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)) has been downloading stuff from my site, though I have ALL bots disabled.

My robots.txt says this:
Quote:
User-agent: *
Disallow: /archive/
Disallow: /files/
Disallow: /
(Notice the last line, which should block ALL the bots, at least the ones that respect robots.txt, from all dirs, not to mention the ones above, which are *still* being entered.)
But look what this IP is doing:
Quote:
66.249.70.104 - - [03/Jul/2008:05:25:48 -0700] "GET /files/Thief/Faceless%20Part2%20-%20Ingame.zip HTTP/1.1" 206 16777216 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.70.104 - - [03/Jul/2008:05:27:15 -0700] "GET /files/Movies/deftonesvid.zip HTTP/1.1" 206 16777216 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I notice also that while IP's with "googlebot" user agents have been visiting and downloading all month long, they haven't made a single grab for my robots.txt file, as far as I see in my logs. The Yahoo bot has been hitting nothing BUT that file, and leaves immediately after.
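
One way to double-check that the robots.txt rules themselves are not the problem is to feed them to Python's standard `urllib.robotparser` and ask whether a given agent may fetch a given path. This is only a sketch of how a compliant crawler should read those rules. It says nothing about what the client at that IP actually does, and a real crawler's parser may differ in details.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, without the inline comment.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
Disallow: /files/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

agent = "Googlebot"
for path in ("/files/Movies/deftonesvid.zip", "/index.htm"):
    allowed = parser.can_fetch(agent, path)
    print(path, "allowed" if allowed else "disallowed")
```

Both paths come back as disallowed, so a bot that reads and respects this file should not be requesting them. As for the log lines themselves, status 206 is HTTP "Partial Content" and 16777216 bytes is exactly 16 MB, so each entry records one chunk of a ranged download rather than a complete file.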

Argh... I just wish the whole thing were more simple.
Last edited by Dr_Test; 07-07-2008 at 11:15 PM.

Post #7 by weblord, 07-07-2008, 11:03 PM

So if you don't want search engines to index your site, try adding that IP in your IP Deny Manager and also as a deny entry in .htaccess.

Do you have a firewall installed? If so, block that IP there as well.


Post #8 by Dr_Test (thread starter), 07-08-2008, 01:18 AM

Ok, I blocked a bunch of IP's... I'll see what happens.

Btw, I DID manage to figure out for sure that my raw logs are not logging everything. My site has had 300 visits so far this month, so I did a text-search in the log for one of the images on the front page, and it only came up about 30 times. Also, I examined some visits carefully, and noticed that not all of the images on index.htm were reported as downloaded. Usually they were, but about 1/4 of the visitors are reported as downloading a PORTION of the images (which are a combined total of about 20k. It's a fast-loading page).
Last edited by Dr_Test; 07-08-2008 at 01:23 AM.

Post #9 by weblord, 07-08-2008, 01:26 AM

It would also help to rename the frequently downloaded files that you suspect are being targeted, and the same goes for any other huge files you might have. You could also apply some basic form of encryption if you're not concerned about SEO.

Also, instead of linking directly to the file offered for download, put some download tracking software in between, like
http://www.whatcounter.com/
which is free and can be used to count actual downloads. Better still, do what some download sites do and put a basic form (only email and/or first name) or a captcha in front of the download, anything to discourage mass downloads.

Another thing is to report these attackers to their ISP or hosting provider, so you can at least delay their abusive actions on your site while you take the time to implement the tips I gave you.

Give us an update.



Post #10 by nielsencl, 07-08-2008, 08:21 AM

Quote:
Btw, I DID manage to figure out for sure that my raw logs are not logging everything. My site has had 300 visits so far this month, so I did a text-search in the log for one of the images on the front page, and it only came up about 30 times. Also, I examined some visits carefully, and noticed that not all of the images on index.htm were reported as downloaded.
What you are seeing may or may not be an indication that not all the traffic is being logged.

- When a visitor hits a page with images, the page file and the image files are all requested, UNLESS the visitor has the images turned off (rare, but it is possible).

- The second time that same visitor hits the same page, it may only load the page file and load the graphics from their cache. Also if the same graphic is used on all pages, it may only get loaded once.

- But the most likely thing you are seeing is spider and bot traffic. In general they don't care about your graphics and so will only load your pages. For your site, they may only load your zip and other download files.

Blocking by IP is what I was going to suggest. If you use the network lookup at Network-Tools.com it can show you information about the visitor and help you to decide if you really want to block them or not. And it can also show you if the "visitor" is really a bot if the IP is located at an ISP like "The Planet". Then you can block all traffic from there, since it's not likely to be users (although it may be user proxy traffic). When you get a range to block like
NetRange: 66.249.64.0 - 66.249.95.255

Just enter in
66.249.64.
66.249.65.
66.249.66.
etc.
etc.
66.249.95.

to keep them all out.
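
Typing out every prefix by hand is error-prone, so here is a small Python sketch that generates the list from the NetRange endpoints using the standard `ipaddress` module. It also prints the range in CIDR notation, which may be shorter to enter if your deny manager or .htaccess accepts that form (an assumption to check with your host). The variable names are just for this example, and the expansion step assumes each block is /24 or larger.

```python
import ipaddress

# Endpoints copied from the NetRange line above.
first = ipaddress.IPv4Address("66.249.64.0")
last = ipaddress.IPv4Address("66.249.95.255")

# Collapse the range into the smallest list of CIDR blocks.
blocks = list(ipaddress.summarize_address_range(first, last))

# Expand each block into "a.b.c." prefixes like the ones listed above.
prefixes = []
for block in blocks:
    for subnet in block.subnets(new_prefix=24):
        prefixes.append(str(subnet.network_address).rsplit(".", 1)[0] + ".")

print([str(block) for block in blocks])
print(len(prefixes), prefixes[0], prefixes[-1])
```

For this range the whole thing collapses to a single block, 66.249.64.0/19, which covers the same 32 prefixes from 66.249.64. through 66.249.95.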

And since Googlebot seems awfully interested in your zip files, I would either contact Google about this, or add a line to your robots.txt that names Googlebot specifically. It should be all you need. Perhaps the IP is being spoofed and it's not really Google...?

Finally, remember that Robots.txt is pretty good for keeping out robots and spiders, but it also works in reverse with people. If you want to know where people have stuff that you might be interested in getting, the first place to look is the robots.txt. :-(

One thing that can work well is to create a directory with a password-type name, like /883uJhh44-H3. Then all you have to do is control how people/bots learn about this directory and only tell people you want to know about it. The nice thing is that it's easy to change the folder name from time to time and keep old users out. Just make sure "directory browsing" is not enabled.
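
If you don't want to invent the folder name by hand, something like this Python sketch can generate one with the standard `secrets` module. The function name and the 12-character length are arbitrary choices for this example.

```python
import secrets


def random_directory_name(num_bytes=9):
    """Return a URL-safe random name; 9 random bytes encode to 12 characters."""
    return secrets.token_urlsafe(num_bytes)


name = random_directory_name()
print("/" + name)
```

Each run prints a different name made of letters, digits, "-" and "_", so rotating the folder is just a matter of running it again and renaming the directory.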

Post #11 by Dr_Test (thread starter), 07-08-2008, 06:28 PM

Thanks for the pointers...
Btw:

Quote:
And since Googlebot seems awfully interested in your zip files, I would either contact Google about this...
That's what I've been trying to do, but I can't find any contact information. Any ideas? I tried their extensive Help and support sections, but they are highly frustrating to navigate, and seem bent on diverting you to endless help pages, with no contact info in sight.