Thank you for the additional reply @jmcc, although I feel a number of your arguments overlook the fact that W3Techs does not use a sample-based approach.
- You say "It is not a good sample and far from random. " Of course the Alexa 10M used as the starting point (not the final selection) by the W3Techs is not random (it is after all the 10M most visited websites). W3Techs does not use a sample approach. It is not logical to argue it is poor sampling when it is not a sample based approach!
- You say "Web usage is generally expressed as a percentage of domain names in the TLD" That is only by those who are using a sample approach since they need to correct their data to project to the entire set had they been able to sample it all. It is not relevant to the W3Techs approach, since it is not a sample approach. They (and I in using their data) always stressed that it was based on the most popular sites. I did note in my writeup that might bias the data for some TLDs (I suspect mainly against new and newly popular extensions).
- I totally accept, and always have, that there are arguments for and against each of the two approaches (sampling vs. analysis of major websites). A most-visited-websites approach emphasizes the sites with a lot of traffic, whereas with sampling it is possible that a hugely important website is missed and the data skewed as a result, despite the attempted correction factors. Public opinion polls are sampling based, and there have been famous cases where their predictions were very wrong. On the other hand, the Alexa 10M, or any similar list, could be, and probably at least occasionally is, skewed by attempts to make a few sites look more popular than they genuinely are. This is partly mitigated by the fact that W3Techs don't use the actual ranking, just the list as a starting point, and they do adjust it for things like redirect traffic and subdomains (a rough sketch of that kind of tally follows this list). The sampling approach is more resource intensive (at least if the sample is large), and so is the task of working out the adjustments (you have given us some idea of the correction complexity in your comments). I think there is no simple answer as to which is better; probably some combination using data from both is best.
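To make concrete what I mean by a most-visited-sites tally, as opposed to a projected sample, here is a minimal sketch in Python. It is purely illustrative and not W3Techs' actual methodology: the input list, the naive domain normalization, and the per-TLD share calculation are my own stand-in assumptions.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical list of popular sites (a stand-in for something like the Alexa 10M).
top_sites = [
    "https://www.example.com/",
    "https://blog.example.com/post",  # subdomain of a site already in the list
    "https://example.org",
    "https://example.net/",
]

def registrable_domain(url):
    """Very naive normalization: keep the last two labels of the hostname.
    A real methodology would use the Public Suffix List so that names like
    'example.co.uk' are handled correctly."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host

# Deduplicate so subdomains (and, in principle, redirect targets) count as one site.
sites = {registrable_domain(u) for u in top_sites}

# Tally the share of sites per TLD (the last label): no sampling, no correction factor.
tld_counts = Counter(d.rsplit(".", 1)[-1] for d in sites)
total = sum(tld_counts.values())
for tld, n in tld_counts.most_common():
    print(f".{tld}: {n} site(s), {100 * n / total:.1f}% of the list")
```

The point of the sketch is simply that no projection is applied: the result describes the list of popular sites itself, not the whole population of registered domains, which is exactly the distinction I am drawing above.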
I am not arguing that you should abandon your sample and correction based approach. Not at all. As in all big debates, I totally accept that each approach has virtues. No data set is perfect.
I do, however, feel comfortable using the W3Techs data. I pointed out earlier that their data is widely used (their site itself ranks within the top 1,000 of all global websites). I also, just now, checked how often their data is used in professional circles. I realize there will be occasional duplication, but a quick check on Google Scholar shows that W3Techs is cited in just over 2,500 scientific papers and studies. Obviously the W3Techs dataset is complex, covering many factors, and simple web usage is not the only or even the major use, but the fact that their data is so widely used by computer science and public policy professionals is encouraging, at least to me.
I continue to feel that the W3Techs data can inform temporal studies of website use in different TLDs. I accept that you strongly feel that a sample-based approach is the only, or at least the best, way to collect the data. I don't think we will settle the debate at NPs, and I don't plan to invest additional effort in detailed responses on this topic, as I think we have both said what needed to be stressed.
Should you know of freely available website usage data, with a clearly stated methodology, that covers most TLDs and has at least five years of results, I, and I am sure others, would welcome a link.
Thank you for the length, and tone, of your last two replies (the first one, not so much).
Have a good day.
Bob