It just occurred to me that I should record in the database the distinct useragents that request robots.txt, and then I could run the referrer logs against this list and come up with, I think, sort of okay human traffic numbers. Maybe filter the robot inserts through a blacklist of real browser useragents to cut down on the chances of incorrect robot identifications.
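Roughly what I'm imagining, just as a sketch in python/sqlite (the requests table, its columns, and the blacklist entries are made up for the example):

    import sqlite3

    # Known human browser useragent prefixes -- only a stub here.
    BROWSER_BLACKLIST = ("Mozilla/5.0 (Windows", "Mozilla/5.0 (Macintosh")

    def record_robot_useragents(db_path):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS robot_useragents (useragent TEXT PRIMARY KEY)")
        # every distinct useragent that ever asked for robots.txt
        rows = conn.execute(
            "SELECT DISTINCT useragent FROM requests WHERE path = '/robots.txt'"
        )
        for (ua,) in rows:
            # skip anything that looks like a real human browser
            if any(ua.startswith(b) for b in BROWSER_BLACKLIST):
                continue
            conn.execute(
                "INSERT OR IGNORE INTO robot_useragents (useragent) VALUES (?)", (ua,)
            )
        conn.commit()
        conn.close()

Then the "human" traffic numbers would just be requests whose useragent isn't in robot_useragents.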
Would any real robots obeying the robots.txt convention identify themselves with actual browser useragent strings? And how many robots don't request robots.txt? And how many human browsers do? (Hackers seeing where you don't want robots to look? Noob web developers looking for examples? Can't amount to much.)
The blacklist (of known human browser useragents) could be compiled similarly, by inserting the distinct useragents of account holders into the database.
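Continuing the same sketch, and assuming the made-up requests table has an account_id column set for logged-in users:

    # Account holders are almost certainly real humans, so their useragents
    # seed the browser blacklist.
    conn.execute("CREATE TABLE IF NOT EXISTS browser_useragents (useragent TEXT PRIMARY KEY)")
    conn.execute("""
        INSERT OR IGNORE INTO browser_useragents (useragent)
        SELECT DISTINCT useragent FROM requests WHERE account_id IS NOT NULL
    """)
    conn.commit()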
Probably not worth it, but as another barrier against false robot identifications you could check whether a newly identified-as-robot useragent subsequently requests javascript files, since robots probably don't request those.
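A quick check along those lines, again against the same made-up tables:

    def flag_suspect_robots(conn):
        # Useragents we called robots that later request javascript are suspect --
        # most crawlers don't fetch .js files, so these may be misidentified humans.
        return [ua for (ua,) in conn.execute("""
            SELECT DISTINCT r.useragent
            FROM robot_useragents r
            JOIN requests q ON q.useragent = r.useragent
            WHERE q.path LIKE '%.js'
        """)]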
And while I'm thinking about this, the distinct IP totals might be improved by counting distinct IP plus distinct useragent combinations within the same IP. So, for instance, a page requested from the same IP by 2 different useragents should probably be counted as 2 people, not 1 as the "by distinct IP" view would give it. This might be wrong, as I could use Safari today and Firefox tomorrow while still being one person, but I think it's at least as likely that I actually am two people behind a NAT. Distinct IP really gives something more akin to the number of households (or businesses) requesting, not the number of people.
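In query terms the difference is just counting distinct (ip, useragent) pairs instead of distinct IPs (same made-up schema as above):

    def visitor_counts(conn):
        # more like households/businesses: one per IP
        by_ip = conn.execute(
            "SELECT COUNT(DISTINCT ip) FROM requests"
        ).fetchone()[0]
        # closer to people: two browsers behind one NAT count as two
        by_ip_and_ua = conn.execute(
            "SELECT COUNT(*) FROM (SELECT DISTINCT ip, useragent FROM requests)"
        ).fetchone()[0]
        return by_ip, by_ip_and_ua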
- jim 10-23-2007 3:42 am