
The Official (!) "developed LLLL.com" distributed scanning project thread


What platform / language would you run a scanning "script" type application on?


filter

(Apologies to Sashas and other like-minded NPers for starting yet another "Official" thread ... couldn't help myself ... sense of humor and all that.)


Picking up where this left off ... http://www.namepros.com/2364748-post433.html
filter said:
Maybe time to put together a distributed scanning project (as was done to discover that 25% of L-L-L.com are currently developed). For the bigger LLLL.com universe, might take something like 26 people scanning 17,576 LLLL.com each ... One person scans all the A---.com, another scans all the B---.com, and so on. (Okay, so I better start another thread to track this, I keep finding new ways to clutter up this "sales report" thread!)

Will take me a couple days to put together a user-friendly app to do this, what needs to be worked out in the meantime is how many people can participate, and what block of LLLL namespace you'd like to scan for developed sites.

ah ... hmmm. How about this - post :sold: to claim your preferred block ->

A---.com
B---.com
C---.com
[...]
X---.com
Y---.com
Z---.com

like this :) ->

Z---.com :sold:
 
•••
This is a great idea. I will be happy to take part in this. We just need some thoughts on how we can achieve this successfully and chart it. This is uncharted territory and would be a valuable piece of the puzzle.

I claim A---.com

Let's discuss how we can get started.

-NK
 
•••
netklick said:
This is a great idea. I will be happy to take part in this. We just need some thoughts on how we can achieve this successfully and chart it. This is uncharted territory and would be a valuable piece of the puzzle.

I claim A---.com

Let's discuss how we can get started.

-NK

alrighty then! Thanks netklick! We've got A---.com and Z---.com covered ... the rest is just details!

I did a quick pass at this once before on Linux - just used a bash script to call "curl" in a loop to write a file for each target, then grepped the files to see which had returned some HTML vs. those not resolving ... This could then be refined to flag the obvious fingerprints of "parked" pages, pruning things down to a smaller pool of files to check more carefully.
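
Something like this hypothetical grep pass, for instance (the "parked" patterns here are purely illustrative - a real list would be built up by eyeballing actual parked pages):

Code:
# flag saved pages containing obvious parked-page markers
# (loop rather than one grep over www.* so a huge directory
#  doesn't blow the shell's argument-list limit)
for f in www.*; do
  grep -q -i -E 'domain (is|may be) for sale|sedoparking' "$f" && echo "$f"
done > parked_candidates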

For distributing an app to run on other folks' computers, we'd probably want to give them a script so they can see exactly what they're running on their box ... Perl / Python / PHP interpreters are all easy to install on PCs (and come already built in on Mac OS X), so that seems like the way to go. Java is an option as well if there's demand for it. All these platforms have libraries to grab web pages. That's the core of the application, plus some logic to make sure the HTTP requests go out at a reasonable rate so the scan doesn't hog your bandwidth as it runs in the background - the rest is just parsing the results to see which pages are developed vs. parked vs. not even parked ...
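
In bash terms, the pacing part could be as simple as this sketch (the filenames and the 2-second pause are arbitrary choices):

Code:
#!/bin/bash
# fetch one page at a time, with a pause between requests so the
# scan trickles along politely in the background
for domain in $( cat domainlist ); do
  curl -s -m 30 -A "Mozilla/4.0" -o "page.$domain" "http://www.$domain"
  sleep 2   # ~30 requests/minute - tune to taste
done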
 
•••
I don't know anything about scanning scripts, but I'd be happy to help out in any way I can :)
 
•••
I can't help with programming the script, but I could run it on my machine.

B---.com :sold:
 
•••
nullmind said:
I can't help with programming the script, but I could run it on my machine.

B---.com :sold:

^^ Same here. I'll take C, S, and E :)
 
•••
I---.com here
I'll be glad to help with the scanning.
 
•••
filter said:
alrighty then! Thanks netklick! We've got A---.com and Z---.com covered ... the rest is just details!

[...]

Thanks Filter.

I have some ideas to make it easier, and I can help program some of this.

Option 1)

- Get a list of all of the top parking services (e.g., Sedo and the other big ones)
- Find out their nameservers
- Scan the domains starting from AAAA to AZZZ and pull their WHOIS information to get the nameservers
- Store the nameserver information in a database for each domain name
- ELIMINATE the domain names whose nameservers belong to a well-known parking service (you know these pages are not developed, just parked!)
- Scan the rest of the domains as Filter described above - see if each returns HTML, and if it does, whether it contains phrases like "for sale", "sale", "domain for sale", anything like that. If it does, the domain is most likely not developed. Put those aside and scan the rest.

I think this way we can avoid scanning each and every domain name (rough sketch below, after Option 2).

Option 2)
- Proceed like Filter explained above.
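
Something like this rough sketch, maybe (using dig instead of whois for the nameserver lookup; the parking patterns are placeholders, not a vetted list):

Code:
#!/bin/bash
# first pass: skip domains whose nameservers match a known parking
# service (the patterns below are illustrative placeholders only)
PARKING='sedoparking|parked\.com|hitfarm|namedrive|trafficz'

for domain in $( cat domainlist ); do
  ns=$( dig +short NS "$domain" )
  if echo "$ns" | grep -E -i -q "$PARKING"; then
    echo "$domain" >> skip_parked      # parked - no HTTP request needed
  else
    echo "$domain" >> worth_scanning   # fetch these in the second pass
  fi
done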

Any thoughts? I can help with some of this. I will set this up on my site. It would be nice if someone already has such a thing - otherwise I will have to make some time between regular work hours!

Thanks,
NK
 
•••
If you need help, I'm available.

Just send me a PM once you know what I need to do.

Thanks
 
•••
I'd happily run a script on my dedicated server. It should be able to scan quickly and constantly.
 
•••
cooool - good to see interested volunteers and great suggestions!

It's tempting to re-invent the wheel here, but it also makes sense to at least compare notes with the tools already out there (and maybe solicit some advice from the L-L-L scanmeisters and others who have done scans on this scale before). It's going to take a few hours tonight to research all the angles, with the goal of putting some PHP code up by Wednesday for folks to start using on their chosen LLLL territories. (PHP looks like the popular choice from the poll feedback so far. People running Windows may need to install PHP - I'll post a basic how-to on setting that up as well.)

Thinking about Netklick's ideas for improving efficiency - very cool, keep the good ideas coming! There are some details to consider, but the main idea holds up -> first pass: check nameservers ... if a domain is non-resolving or matches known parking nameservers, then there's probably no point in doing an HTTP request for its home page. (I say "probably" because I wonder whether it's worth tabulating the custom content portions of pages at parked.com.)

One detail to consider -> parked pages may not use "parking service" nameservers; they may use redirects or framed page loads via another server (which I've done with a few of mine, just to check whether reported traffic matches my own logs). So the second pass may still hit many parked sites ... It still makes sense to prune whatever we can in the first pass if doing so saves significant time / bandwidth.

So ... what kind of time / bandwidth are we looking at, anyway? 17,576 sites at, say, 5 per minute would take about 3,515 minutes, or roughly 58.6 hours - about 2.5 days ... If each site's homepage averages 100 kbytes, that's roughly 1.8 Gigabytes of bandwidth used for the scan. (Just be aware of this if you're scanning from a dedicated server and on the hook for overage fees if you hit a densely populated region of developed sites - you might want to code the scan to do "text only" downloads to save bandwidth.)
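
One way to push the bandwidth savings further - a sketch, assuming the target servers answer HEAD requests sensibly - is to pull headers only:

Code:
# headers-only variant: -I makes curl send an HTTP HEAD request,
# so no page body is transferred at all
curl -s -I -A "Mozilla/4.0" -o "header.$domain" "http://www.$domain"

A HEAD response can't be grepped for "for sale" text, of course, but it's enough to spot redirects and server types cheaply.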

Re-inventing the wheel ... fun stuff. Time to check in with people who've done this before I think!

Also - just found this - haven't looked at it carefully yet but am about to check it out to see if it might in fact be a useful "off the shelf" open-source tool for this LLLL scan project -> http://phpdig.net ...

Going to google some more & research for a bit now!
 
•••
Personally, I much prefer to use a good sample rather than look up all 456,976 domains.

I discovered that 25% of L-L-L.com were developed in January, using a sample of 520 domains. It took a lot longer for the Arnie/Bluebecker team to manually go through the whole 17,576, and the results were the same. At the same time I searched through 520 LLL.biz and found 15% developed.

I also used sampling to analyse the LLLL.com countdown trends before DYYO.com was launched. I only trusted DYYO.com stats because I knew that I was getting similar results. I trust sampling to give a very accurate picture of developed rates and I would think that a sample of 2500+ should be more than enough.
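
To put a number on that: for a developed rate near 25%, a simple random sample of n = 2500 gives a 95% margin of error of about 1.96 × sqrt(0.25 × 0.75 / 2500) ≈ ±1.7 percentage points - plenty tight for this kind of estimate.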

I'm a statistician. That's my job.

I have attached a stratified random sample of 3600 LLLL.com domains, which should give a very accurate estimate of how many 4-letter domains are developed. If you want the huge job of analysing all 456,976 LLLL.com, go ahead. I recommend using this sample and not wasting your time on the total list.

Also, if you do end up doing the big list, this sample (or a similar one) may be good to work through first to get an initial estimate.
 
•••
wow, we have a very supportive community here ..
I feel bad I can't help, since I don't know the scripting stuff like reece does ..
but good job ~ thank you to all who contributed
 
•••
thanks Malaysia - and VURG, right on! A perfect starting point. I'm psyched to build a whole little "mini-Google" to track the LLLL.com universe (though why not use the big Google for that, I don't know) ... but obviously some quick results from a "statistician-generated" sample will be a great appetizer and good food for thought for our hungrily inquiring minds! I've got to go offline for a few hours now; will update later tonight with The First Step (for which VURG's list will come in very handy!) ...

Anyone need help installing PHP on their PC? PM me & I'll help work out any necessary details (a good place to start with that is of course php.net -> http://www.php.net/downloads.php )

:sold:
 
•••
Glad to help. Good luck with everything.
 
•••
okay ... well ... I'm letting the computer do the work tonight with the help of VURG's randomized sample. It may take another 5 hours or so to chunk through the whole list, but it will probably take me at least that long to hack together the more sophisticated "divide and conquer / check DNS first / do other clever things" PHP script tonight! So I've just set loose a simple bash script that calls the "curl" program in a loop over the list. (I'm running this on Mac OS X right now, but it should run fine on any Linux distro that has "curl" installed ... which most do, if I recall correctly!)

Code:
#!/bin/bash

# fetch each domain's home page:
#   -s  silent mode (no progress output)
#   -D  dump the HTTP response headers to a per-domain file
#   -A  send a browser-like User-Agent string
#   -o  save the returned HTML to a per-domain file
for domain in $( cat domainlist ); do
  curl -s -D header.$domain -A "Mozilla/4.0" -o www.$domain http://www.$domain
  exitcode=$?
  echo "$domain: curl exitcode $exitcode" >> curl_status_log
done

I got the clue to use the '-D' option to tell curl to grab the HTTP headers - this flags redirected domains (though it doesn't grab the content of the target page, since that content is served from whatever domain the browser gets redirected to, rather than "as from" the LLLL.com itself ...)

will update with partial findings in a couple hours (should be at least a third of the way through VURG's list by then).

edit: ah ... it would be terribly irresponsible of me not to mention that this quick-and-dirty brute-force bash script will litter whatever directory it runs in with thousands of files (if run against the entire ~3900 LLLL.com random sample in the list VURG provided). A simple improvement would be to at least consolidate the "header" capture files into a single file ... and there's really no good reason why all the HTML from the live LLLL.com sites shouldn't be concatenated into one big file as well.

But if you do run the script as is ... beware - some file systems don't do so well with thousands of files in a single directory ... and I'm about to find out if the version of Mac OS X I'm running right now might be one of them! :hehe:
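
For the header files, a one-liner along these lines would do the consolidation after the fact (filenames as produced by the script above):

Code:
# merge the per-domain header dumps into one file, tagged by domain
# (glob loop instead of 'ls' so thousands of files don't overflow
#  the command line)
for f in header.*; do echo "== ${f#header.}"; cat "$f"; done > headers.txt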
 
•••
[unixgeek] heh ... woke up to find no major problems on my trusty (old) Mac OS X notebook, now slogging around approximately 7000 files in the same directory - though I do get an "argument list too long" error when I try to pipe or redirect the output of the 'ls' (directory list) command ... It works fine if I just do whatever I need to do with the files in a one-liner loop from the bash command line - i.e., 'for f in *; do whatever; done' - which is a good trick to know if you ever need to remove 10,000 selected files from a directory that makes the 'ls' command give you that "argument list too long" backtalk! [/unixgeek]

anyway - meta-results from this first pass over VURG's 3900 LLLL.com stratified sample: it took about 4 hours to retrieve the 3094 HTML pages returned from the curl poll of http://www.LLLL.com (so it would have missed the few that serve only via SSL at https:// URLs, and/or those without a 'www.' subdomain set up in DNS) - that's about 1000 domains an hour over a 6 Mbit/s DSL connection (many of the 896 that didn't return HTML probably took longer than the ones that did serve up a page, since I just used the default timeout setting in curl).

There were 3674 HTTP header dumps returned (courtesy of the '-D' option to the curl command) - so at first glance this seems to indicate that 580 of the 3900 sample are redirected to another domain, and so are definitely not "developed" (maybe parked, though; I'll have to check the details of the headers to get a better idea about that).
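
One way to pull those out of the header dumps - a sketch, working from the header.* files written by the scan script:

Code:
# list domains whose response status line was a 3xx redirect
for f in header.*; do
  head -1 "$f" | grep -q ' 30[1237] ' && echo "${f#header.}"
done > redirected_domains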

Other tidbits from the first glance at results ->

141 of the curl queries came back with "exitcode 6" (no DNS); 74 came back with "exitcode 7" (no connection to host).

I also ran a script to get the DNS info (as suggested by Netklick - using "dig" rather than "whois", though) ->
Code:
#!/bin/bash

# query DNS for each domain and append the full dig output
# (nameservers, plus A/CNAME records if any) to one log file
for domain in $( cat domainlist ); do
  dig $domain >> dig_log
done

The "dig_log" file containing info about the nameservers + hosts if any for the LLLL.com sample is slightly under 3 MB in size - I can make it available on a server when I get some more time later today ... Running dig in a loop over 3900 domains took about half an hour.

:yell: :yell: :yell:
 
•••
raw results online now here -> http://Lxiq.com/np
(will prompt for login - username "np" / password "np")

3 files currently online (results from scanning VURG's 3900 stratified sample) ->

curl_status.txt - not that interesting; just reports exit code 0 (successful connection made; may or may not redirect), exit code 6 (no DNS), exit code 7 (DNS but no connection), or exit code 18 (incomplete page returned)

dig_results.txt - very interesting (big file) - details of all LLLL which had nameservers set (and also host IP(s) as per A or CNAME records if any)

headers.txt - also very interesting - HTTP headers for all resolving LLLL - indicates redirection if any, server type, protocols, etc ...

hopefully I will get some time later tonight to crunch this raw data down to some useful numbers ... (I also have about 3000 files of actual HTML pages returned from the resolving, non-redirected sites to look at - but I'm probably not going to put all that online.) There is some pretty funky stuff coming back from a few of these sites - some interesting / creative acronyms! (i.e., some fairly specific and obscure porn niche concepts!)
:alien: :lol: :alien: :sold: :alien:
 
•••
I'm very confused about your results so far. I thought you were just looking for a number. (Percentage of Developed Domains)

Look forward to more updates.
 
•••
VURG said:
I'm very confused about your results so far. I thought you were just looking for a number. (Percentage of Developed Domains)

Look forward to more updates.
I guess this is the "data exploration" phase - trying to find angles that help narrow down the pool of candidates to vet for the "definitely developed, not just parked" category.

A first pass through the "dig" (DNS) info allows a quick prune of several hundred "low-hanging fruit", just based on the nameservers associated with known parking/aftermarket sites. (We may still find plenty more parked via redirection or frames rather than DNS, so these are certainly not definitive results, just a minimal baseline ...)
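
The prune itself is just pattern matching over the dig output - a sketch, assuming the NS lines made it into dig_log, and with deliberately rough patterns (a bare 'parked' would also match sedoparking lines, so these need care):

Code:
# tally unique domains whose dig records mention each parking provider
for ns in sedoparking buydomains hitfarm 'parked\.com' namedrive trafficz; do
  echo "$ns: $( grep -i "$ns" dig_log | awk '{print $1}' | sort -u | wc -l )"
done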

VURG - would you venture to extrapolate any interesting (albeit limited and tentative) trends for the rest of the LLLL.com universe from the following "known nameserver" results?

285 at sedo:
aaqu adwf aeyw ajxw akly aqnk axnb axuh axyg bbmu bdvy bexd bgdw bgwx bmag bpjk bqeg brjg buqy cgwt ckhd cmkz coha cqif crrz cuyg cuzd cxbl dbgc dgdo difu dkgj dnny dojh dqvf dsry dtzy dvjv edqj egkp eism enkb epft epgk eztn fbwy fdph fdso ffve ffvf fhvf fitd fjfo fkqr fmfi fmqv fpmk fptx fyuf gbad gdhw giyt gker gnhe gozr gqip gupc hasb heqn hfxv hksz hlga hoqp hozq hqdp hurj hxxv hyef hzzv iadn ialz ibnw iczr idnm ihrj ihty ijnv immw ipdv ipwm iqyd irfz iuzs ivib ivyf iwge izmk jdnh jllz jubn jumm jwht jxze jyrg jzwc kbif kbwh kfqe kglx kkcl klme kmhn kmiq knqk kojq kpdo kpef kqpb kqvi ktku lccv ldve lecr lgmm lnsb lpau lvrk mbho meqm mevj mfct mhda mkvk msez mtst mtun mtup muan mvwz mwwc mwzp myqx nbzt ngix nknx npnz nqrs nsei nwat nyqg nyrl nzyw ocqu oecw oidf oikc omvb opkn osqx osqz otbi ouok oyew pbjz phgv pjxt ppyo pwaq pwmb pxyd pyds qazb qkao qkfe qkvt qkxi qlzu qpvy quvh qvfe qxaj qyqe qzpq rbmq rexe rjfy rjnc rlsw rmqp rnum rozk rrto rrtu rwxo rxqv rzzt sahh sasn sgrf sltm sqfx tdvg tehz tipq tlkm tnfd tnxj tpja tqcd tqde tsux txug ualq udwn ueey uijo umlk upxs uvgd uvhz uvua uwxf uxuj uxyc vaiy vdqc vjbu vjcf vjgd vjlf vkhu vlkt vlrt vokp vqel vtlg vwie vxay vxpu vzeo wcro wdfy wejq widx wivv wmez wmuy wnyz wpzb wrqj wtqb wynf wznu xeqg xfnk xfzi xmqo xogc xoow xqdi xqny xtph yblv ydve yewx yfmo yfne yhiy yhka yiqp yokj yxas yxjy zekt zglk zmqr znpr zoyf zsqo zttl zvgy zvsa zxek zxwv zziv

102 at buydomains:
abgj achq akpn avsd bhpi cuyn ddmu dmxk drvs dujf dzes dzka eblv eegu elvf ento erhf fbww fclp fgnm gmsy gpyc gsyn hbfu hdsc hgcz hkcz hnjo iadb icbn ieoe ifvy ikyr isgj jbem jgnv jjsh jlna jorz jpug jyyf keja kjbv ldts ljpm lnjf lnrd lslb ltck lwbd mcgp mcjw mlnf mnmj mrtn muwf nper nshf nulm nvur octm ojxl owkc oxbk pebj pjbp psym puhl pwyf qhfw qlmr qmce qtlo qxgx rcff rosl sffj sfqu smds solg szso tbdf tipb tjhe tlkj udgw upia utji vekj vnsb wbfx wobu wtly wtqn xbdf xwma ymyf zcis zcpe zldb zxnc

93 at hitfarm:
aegp ahdt ajhg anim aruv auin auir barp bdif brvv btcn ccfg ccpk cnku curj cvvb dcmm dgrl ditn druj dvpf eted eyhs fbkh fshf groi gtxt gubd gwmg haal heae hrtn hxxx idni igtg inhn ioge iqna iwmg jatn jcpj jftb jhdt jjmv jrop jtmp kdsq kogc kxbo kxpt lfla mdcl mirn mjod mpif mrmc mutj myuj nneb nnsm ntoe nvav nvsh orng pdrk psyn pynu pyor qkmn rglh rvca sctd shrs snob soii subw thar thmp tlec trkg tvcc txas tyin ubro uppm upwp utif vlox whwt wtgp wtrb yacm zskf

90 at parked.com:
apmu argv bbbu bpbu cbnj cvob dckq dxgn dzum eahx ejqz ekzw elqm enpe epxw ewqc eyty fqtz fxwu fyev fzxt hfvy hfxx hnqa hpyi hyyt imzv ispf jrkm jwoh kbvx lfib lhnz lhtk lzot mpuf nnvq obxk odgx olzi onrz oqak otox owkr oxds pbuo pqjj pyub qdsh qewv qfwc qial qlvn quoy qyeo rmzs rvyu sccq sxmg twyi uamj uffr uidq ujwr uqmo urgk urxh uxiw vejm vtlg vwal wsiy xdri xecu xfjw xgfa xjap xmvj xovw xqag yczy ydtd yrbw ysds zeap zevm zkaa zlhe zqez zxih

64 at namedrive:
brok bvxw byjv cujz dzmc ekyy etgt fcyk ghjs gioz grkc gssj hgoq hoaj ihbb ikhx jsmw jwhj lfhu lfsx lrqw mhvq mrnr nclo nlom noum opij owor pszw qfmk qhuh qpyt qpzt rfvk rinr rwvx rxcj swru ugma uqjr uwgz uxof vccx vfqg vhqt vjom vjqf vkev vlnp vpae vqgh vqzr vrqj vyxy wiak wksw wnln xkxb xtgf xvxq yfga ymae yoin yvkz

53 at trafficz:
aown awwz bmzl bvbt bydi cusz cztk dmlc fehq fkir fzip gkeu hnod iobo jcwa jgem kniy kzeu lgol mnut nyaz obvw ofvw ojob olrk pdid pflm pxdg qbku qeeu qgex rdfr revn rqvv rvek sjuv sldo spjh sxvf uabc ujsh unnw uwam vlnr vsgo vyes wfxh wlpp wteu ymad zelx zubx zvee

28 at trafficclub:
ccrm chvt cigy cuqz drvg dvpb ebrz hoql irhn isww iyxn jfbq krhf lzuh mxqk nigu niuv pgcq pqmh qvde rlss rqcu usoz wjfg xbpz ykks ypxo zkoj

26 at bodis:
ejnu fvzj lvbq midv omuu ooyp qgor svxs tgyv tqkr uixe ujpv ulxu uudw uupe uvse uzac vwth vykd wxvz xpvy ygvl yrfj yvdl yvfy yvjf

19 at tradenames:
bmtq fejd gsmn ifrd isyp mium mvty olat otzp ovmh pyhk qfnb rkvv tnzk ttvf twuk vmru zbxx zvdf

17 at smartbuy:
abbp aphw aqjd bcvm bysx cvvi czxo exgc fcgn hsow ixoo msmz plef qfcx vnfo ywga yxmr

16 at smartname:
auux cycc jfgd kofu lkqk mhxy mwgo nqyi spow tang urre vdzd wyxz xscr yxsu zfec

15 at mdnhparking:
asrp ffbz fhmb hban hceo igoh nvbi psom ptbw qrro rcvn sedr tdlh vcwe zjpr

6 at parkingpanel:
apsv gvuv iewj myyl ntwu sjsu

... there are probably quite a few more in the thin end of the tail here that I missed in this iteration (I see a couple with DNS on activeaudience nameservers, but then also 46 at fabulous, which I can't assume are parked, then 70 with "dsredirection" nameservers ... and so on.)

I'll need to find some more time to crunch through the real bulk of the sample scan, but I'm starting to get swamped with some other things coming up this week (day job) - so I wanted to at least get this bit of processed data up for inspection. Hoping to see some other interested number crunchers (hi netklick) give their take on the data collected so far, and to hear ideas on how best to move forward with scanning larger samples - and ultimately, yes, just maybe the whole LLLL universe, once we've got an efficient scan method + enough volunteers + time to put the details (multi-platform code) together nicely. That may be a bit over-ambitious to jump into at this point, but ... it still seems like fun - though I expect more useful results from smaller samples long before we get the whole ball of LLLL.com wax together! :yell:
 
•••