Friday, February 25, 2011

Stopping Scrapers From The Start


1) In .htaccess, block a list of IPs from Spamhaus.
Cool.
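For reference, the Spamhaus DROP list entries can be dropped straight into .htaccess as deny rules. A minimal sketch (the CIDR ranges below are documentation placeholders, not real list entries — substitute the currently published list):

```apache
# Block known-bad netblocks (e.g. from the Spamhaus DROP list).
# The ranges below are placeholder examples only.
<Limit GET POST HEAD>
  Order Allow,Deny
  Allow from all
  Deny from 192.0.2.0/24
  Deny from 198.51.100.0/24
</Limit>
```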

2) In .htaccess, block a large list of IPs from other countries?
Possibly... definitely a personal-discretion call on this one, IMO.

3) In .htaccess, block a long list of user agents (get the code from WebmasterWorld)?
Definitely.
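The user-agent blocking usually looks something like the following — the agent names here are illustrative, not the full WebmasterWorld list:

```apache
# Deny requests from common scraper user agents (partial,
# illustrative list -- extend with the WebmasterWorld set).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (HTTrack|WebCopier|WebZIP|wget) [NC]
RewriteRule .* - [F,L]
```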

4) Whitelist Google, Yahoo and MSN in robots.txt.
Sure, but don't rely on robots.txt for anything.
A real scraper probably won't even fetch it, let alone obey it.
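A whitelist-style robots.txt along these lines (bot names as of that era — Yahoo's crawler is Slurp, MSN's is msnbot; an empty Disallow means "allow everything"):

```
# Allow only the major crawlers; disallow everyone else.
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
```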

5) Block Google and the other bots from crawling my images. I think this will block all robots from crawling GIFs at any level of my site?
User-agent: Googlebot
Disallow: /*.gif$

User-agent: *
Disallow: /*.gif$

I wouldn't do that, though... If I really didn't want them crawled, I'd block user agents that don't send a referrer in the .htaccess:

RewriteEngine On
# Forbid any .gif request that arrives with an empty Referer header
RewriteCond %{HTTP_REFERER} !.+
RewriteRule \.gif$ - [F]

6) Then I think I'd like to block IPs from hosting companies. Is there an easy-to-use list of those IPs?
Not sure on this one.

7) after that I should do some IP blocking dynamically I think. Like trigger a block if someone is crawling too many pages too fast. But since I'm serving static html, how do I do that? Set up a cron job to run a script every minute that reads the log and takes action? This seems complex and burdensome.

ADDED: You can usually prepend PHP to your .html pages via the .htaccess, so you don't need to convert all your pages to PHP or parse all your HTML as PHP.
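One way to set that up, assuming mod_php (directive names and availability vary by host and PHP version, and the script path below is hypothetical):

```apache
# Serve .html through the PHP handler and prepend a blocking
# script to every page. Check your host's phpinfo() output for
# the correct handler type; the path below is an example only.
AddType application/x-httpd-php .html
php_value auto_prepend_file /home/user/botblock.php
```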

The Apache Forum [webmasterworld.com] here is a great place for help with the .htaccess stuff.

There are also a couple of PHP bot-blocking scripts posted here somewhere that are great. I've used them on a couple of sites. (It's been a long time, so I don't remember the specifics, but you should be able to find what you need by searching around a bit.)

There are actually a couple, I think: one that times out rapid requests and blocks the IP, and another 'honey-pot' script that blocks the IP if a bot visits a page disallowed in robots.txt.
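A minimal sketch of the rate-limit idea (this is not one of the scripts referenced above — just an illustration of the technique, with hypothetical names): record recent request timestamps per IP and refuse service when too many land inside the window.

```php
<?php
// Sketch only: decide whether an IP is requesting too fast.
// $timestamps is the list of that IP's recent request times.
function is_rate_limited($timestamps, $now, $window = 10, $max = 20)
{
    // Keep only hits inside the last $window seconds.
    $recent = array_filter($timestamps, function ($t) use ($now, $window) {
        return $t > $now - $window;
    });
    return count($recent) >= $max;
}

// In the prepended script you would load/store the per-IP hit
// list (flat file, APC, database) and send a 403 when limited:
//   if (is_rate_limited($hits, time())) {
//       header('HTTP/1.1 403 Forbidden');
//       exit;
//   }
```

The honey-pot variant is the same storage idea: link a page that robots.txt disallows, and have that page's prepended script add the visiting IP to the block list.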

8) Since the content is static, Google and the rest don't need to download the HTML 8 times a month. Once a year is fine. What's the best way to tell the bots that a page hasn't changed, so there's no need to crawl it? ETags? I think that stuff requires changing the page headers, and that's tough to do with static HTML pages.

ETags are usually set and sent by the server, but if you prepend PHP to your pages you can send any headers there if necessary, including the following: set Expires and Cache-Control headers in the .htaccess with a date far in the future. Personally, I would also set a low priority on those pages in an XML sitemap.
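Assuming mod_expires and mod_headers are available on the server, the .htaccess piece might look like this:

```apache
# Tell crawlers the static pages won't change for a long time.
# Requires mod_expires and mod_headers.
ExpiresActive On
ExpiresByType text/html "access plus 1 year"
Header set Cache-Control "public, max-age=31536000"
```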




Source: http://www.webmasterworld.com/search_engine_spiders/4267704.htm

