We've been getting a lot of hits from the TurnitinBot. This comes from the Turnitin site, which provides anti-plagiarism services to educational institutions. I don't get a real happy feeling from them so I blocked their robot. I'm not exactly against them, I'm just not for them enough to give them so much bandwidth (they are second in total requests to the much more lovable googlebot.) If anyone else here wants to debate this move, feel free to speak up. (Or if you have any other candidates for blocking...)
- jim 9-09-2002 5:57 pm

I'm not all that crazy about recent swells in hits to my page. I liked it better when it was 15 to 100 a month. Block away, at your most excellent discretion. I guess I could be a private page if I was all that concerned about it.
- jimlouis 9-10-2002 12:18 am


Where are your hits coming from? I could block google for your page if you want, which would probably cut out a lot of the random hits.
- jim 9-10-2002 12:31 am


I've looked into this a little more. Most of your hits are coming from robots. Most of those are google, but altavista (I'm 90% sure it's altavista) has recently started hitting us hard too (about half as hard as google.) That robot seems to like the /pageback/ links which is really filling up people's logs. In any case, you're only getting slightly more hits from actual people than you used to.

You can see this by changing the pull-down at the top of your log page to 'by useragent'. This shows you how each visitor is identifying the browser they are using. Google is the googlebot (obviously) and altavista is Scooter/3.3. The one that contains 'Slurp/cat' is inktomi, which is used by microsoft.

I'm not sure how to think about it. Having more than one search engine in the world is obviously a good thing. But I'm tempted to just ban everyone except google in order to decrease traffic (and since I only ever use google.)
- jim 9-10-2002 1:09 am


Okay, thanks for looking. I don't know what I want to do or if it really matters. I know I asked for this, due to my use of the language, but I wasn't all that happy about recently being the # 1 google hit, out of 39,000, for "Black N(word)." I'll think about it. I might visit next month, maybe I'll know by then. Hey Bill, can I visit next month?
- jimlouis 9-10-2002 1:26 am


see u then
- bill 9-10-2002 2:50 am


Thanks for turning off TurnitinBot. What a creepy concept. The other one I'd love to see disabled, which I mentioned in the core in June, is SlySearch, the bot for plagiarism.org. The "pageback crawl" from Scooter/altavista is new and rather intrusive, but by and large I don't mind the non-snitch-oriented spiders.
- tom moody 9-10-2002 2:51 am


I've gone ahead and attempted to block alta-vista as well. It's not that I don't like them, it's just that their robot seems sort of stupid. It is absolutely hammering us, and I can see in my logs that it requests the same pages over and over again (every day it seems like!) Even pages that never change. And even pages that aren't there any more. It just seems a little out of control. We'll see if I have the robots.txt syntax correct to actually stop it. Might take me a few times to get it right.
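For reference, the robots.txt record I'm trying looks more or less like this — a record keyed to the useragent the robot announces (which, as it turns out below, may not be the name it actually listens for):

```
User-agent: Scooter
Disallow: /
```

The file has to live at the top level of the site (/robots.txt), and a well-behaved robot is supposed to fetch it and skip everything under the Disallow paths.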
- jim 9-11-2002 6:31 pm


Now I'm getting mad. I'm pretty sure the robot I want to stop is from Alta Vista. I know they use Scooter/3.3 (although I think that is just a robot program that any number of people might be using.) All the hits are coming from 64.152.75.193, which definitely resolves to an alta-vista owned domain. Maybe someone else is renting space from them? I don't think so though, because I've seen that ip block listed in public lists of search engine ip addresses. 99% sure this is alta vista.

Like I said, the robot identifies itself as Scooter/3.3, but when I disallow that useragent in robots.txt this robot pays no attention. So I went to alta vista, and after searching around a bit I finally found a page where they talk about this:

AltaVista crawlers obey the Robots Exclusion Standard. This standard allows you to indicate to visiting robots, such as the AltaVista crawlers, which parts of your site should not be visited by the robot. If you would like the AltaVista crawlers to not crawl and index your Web site, please read the Robots Exclusion Standard and add a robots.txt file to your Web pages.
Yeah, thanks, I'm familiar with the standard. What I fscking need to know is what useragent you respond to (SINCE IT DOESN'T SEEM TO BE THE ONE YOU IDENTIFY YOURSELF AS!)

By way of comparison, the googlebot not only identifies itself as Googlebot/2.1, it also adds this onto the useragent string: (+http://www.googlebot.com/bot.html) Going to that address details exactly how to prevent their robot from hitting your site.
robots.txt is a standard document that can tell Googlebot not to download some or all information from your web server. The format of the robots.txt file is specified in the Robot Exclusion Standard. When deciding which pages to crawl on a particular host, Googlebot will obey the first record in the robots.txt file with a User-Agent starting with "googlebot". If no such entry exists, it will obey the first entry with a User-Agent of "*".
See how easy it is? Really seems like alta vista doesn't want you to be able to block their robots, and they take a demeaning tone when you try to figure out how ("read the standard for yourself!") while not giving you the one piece of information you need.
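Going by Google's own description above, a robots.txt like this should work — a googlebot-specific record first, then a catch-all for everyone else (the paths here are just for illustration):

```
User-agent: googlebot
Disallow: /pageback/

User-agent: *
Disallow: /
```

Googlebot matches the first record whose User-Agent starts with "googlebot", and falls back to the "*" record only if no such entry exists.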

Enough. Now I'm just sniffing for that ip at the top of the setup PHP file that always gets called first for any hit to the site. If it's from that ip I just die().
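Roughly, the check at the top of that setup file is just a few lines like the following sketch (the function name and single-address list are mine; the IP is the one from the logs above):

```php
<?php
// Sketch of the IP check at the top of the shared setup script.
// The address is Scooter's, taken from the logs; the function name is mine.
function is_blocked_ip($ip) {
    $blocked = array('64.152.75.193'); // Scooter / alta vista
    return in_array($ip, $blocked);
}

// Bail out before any database work if the request is from a blocked robot.
if (isset($_SERVER['REMOTE_ADDR']) && is_blocked_ip($_SERVER['REMOTE_ADDR'])) {
    die(); // send a blank page: still costs a few cycles, almost no bandwidth
}
```

Since every request funnels through this one script, the robot never reaches any real page, no matter what useragent it claims to be.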
- jim 9-11-2002 7:54 pm


I've noticed Scooter before under "useragent" (usually several hundred hits over a several day period). It usually follows the googlebot. This is the first time it's been in the "log (but not you)" part of the log. It does seem awfully redundant.
- tom moody 9-11-2002 8:04 pm


So what is the plan... I am having the same problem: scooter is running rampant through our websites, absolutely killing our bandwidth. Do you think it is wise to block the scooter bot at this time, when they are actually doing a refresh of their index, as opposed to "riding it out" and potentially getting some listings? Talk back to me!
- LaptopGear.com (guest) 9-18-2002 12:00 am


I stopped it, but I'm not trying to sell anything. Maybe it's worth the hassle for you, LaptopGear.com.

I couldn't seem to exclude it in robots.txt (although maybe someone more skilled could make that work.) But since every request here results in a call to the same PHP script to do authentication and set up the database, I just stuck a few lines at the top of that to check whether the request is from that IP address, and if so, I just send a blank page and die(). So it's still hogging cycles, but almost no bandwidth.

Depending on how your site is built you might consider a middle course. Let it index main pages, but not every page on the site. Or let it index every page, but don't send any graphics to that IP. Or build one page, no graphics, with a list of every product (and a link to the full page that product appears on) and let the bot only index the one big list.

For me it wasn't so much the bandwidth as the huge referer logs my system was generating. The googlebot doesn't send a referer when it follows intra-site links, and that works out much better for my log format. I'll just see that the googlebot made 5,000 requests. With Scooter I see all 5,000 hits listed individually in my log as it follows, say, every link in my archive.
- jim 9-18-2002 12:23 am


quit your snivelling... do some research, it's pretty simple

http://www.ahref.com/guides/technology/200009/0922piou.html for example
- mottie (guest) 8-10-2003 11:17 pm




