Why Websites Block Spiders, Crawlers, and Bots
There are many reasons to use automated systems to visit websites rather than having people browse them manually in a web browser such as Firefox or Chrome. Not surprisingly, websites have their own reasons for trying to block that automated activity. Often an endless game of cat-and-mouse ensues: collectors try to stay unnoticed during automated pulls of website data, while websites and their administrators try to discover and stop them.
Common reasons websites try to block automated collection include protecting valuable data, limiting website expenses, maintaining website performance and accessibility, and simply trying to stop or prevent “bad” behavior. Most websites exist to market or inform people about a product, service, or cause. Using a web browser is about as common as having a cell phone nowadays. We buy, review, research, and comment through a web browser, and sometimes we do all of it from our phones! Since there are costs associated with developing and maintaining a website, what’s being marketed isn’t really “free” for the host. That is why websites work so hard to get your attention and ultimately want you to either buy something or subscribe.
When spiders, crawlers, and bots visit websites (an activity commonly referred to as “web scraping” or “web harvesting”), they typically aren’t there to buy. They are there to collect data, which is essentially intelligence. The collected data can be reused on other websites and platforms or relied on to make informed business and financial decisions. Take online consumer retail or the travel industry, for instance. For a business in these highly price-sensitive and volatile industries, it’s important to know your competitors’ prices and trends so you can set yours accordingly. It’s imperative that you have access to accurate data, and that you get it fast. Your organization’s success could be in jeopardy if a competitor identifies you and then blocks access to their website or serves you fake data.
Whatever a website’s flavor, size, or objective, there’s usually an expense incurred to maintain it. This is especially true for sites that change their design or data frequently and receive a steady stream of visitors. As traffic to a website increases, so do the owner’s costs for bandwidth and for any hardware upgrades needed to maintain the site’s performance. Websites often block web harvesting because of this extra expense and because they realize these visitors aren’t buying! After all, they are often competitors.
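To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it (requests per day, page size, price per gigabyte) is a made-up assumption for illustration, and the function name is hypothetical; real hosting bills depend on the provider and on caching, compression, and similar factors.

```python
def monthly_bandwidth_cost(requests_per_day, avg_response_kb, price_per_gb, days=30):
    """Estimate a month's bandwidth cost, in the same currency as price_per_gb."""
    total_kb = requests_per_day * avg_response_kb * days
    total_gb = total_kb / (1024 * 1024)  # convert KB to GB
    return total_gb * price_per_gb


# Hypothetical figures: 10,000 human page views a day versus 40,000 bot
# requests a day, 500 KB per response, 0.10 per GB transferred.
human_cost = monthly_bandwidth_cost(10_000, 500, 0.10)
bot_cost = monthly_bandwidth_cost(40_000, 500, 0.10)

print(f"Human traffic: {human_cost:.2f} per month")
print(f"Bot traffic:   {bot_cost:.2f} per month")
```

Under these assumed numbers, automated traffic costs the site owner four times as much as its paying audience, without generating any sales.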
Read more on why websites block automated collection activities and how to successfully use spiders, crawlers, and bots.