Why Websites Block Spiders, Crawlers, and Bots


There are many reasons to use automated systems to visit websites instead of having people browse manually in a web browser such as Firefox or Chrome. Not surprisingly, websites have their own reasons for trying to block that non-human activity. Often an endless game of cat-and-mouse ensues: collectors try to stay unnoticed during automated pulls of website data, while websites and their administrators try to detect and stop them.
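To make the "discover" side of that cat-and-mouse game concrete, here is a minimal sketch of one common signal, request rate per client, assuming a site can see a client identifier and a timestamp for each request. The class name `RateWatcher` and both thresholds are hypothetical and chosen only for illustration. Real sites typically combine several signals rather than relying on rate alone.

```python
from collections import defaultdict, deque

# Hypothetical thresholds, chosen only for illustration.
WINDOW_SECONDS = 10.0
MAX_REQUESTS_PER_WINDOW = 20


class RateWatcher:
    """Tracks recent request timestamps per client and flags heavy hitters."""

    def __init__(self, window=WINDOW_SECONDS, limit=MAX_REQUESTS_PER_WINDOW):
        self.window = window
        self.limit = limit
        # client id -> timestamps of requests still inside the window
        self.hits = defaultdict(deque)

    def looks_automated(self, client_id, timestamp):
        """Record one request; return True if the client exceeds the limit."""
        recent = self.hits[client_id]
        recent.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.limit
```

A person clicking through a few pages a minute stays well under such a limit, while a script requesting hundreds of pages in seconds does not. That is one reason collectors slow down or spread their requests out.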

Common reasons websites try to block automated collection activities include protecting valuable data, limiting website expenses, maintaining website performance and accessibility, and simply trying to stop or prevent “bad” behavior. Most websites exist to market or inform about a product, service, or cause. Using a web browser is about as common as having a cell phone nowadays. We buy, review, research, and comment through a web browser, and sometimes we do all of it from our phones. Because developing and maintaining a website costs money, what’s being marketed isn’t really “free” for the host. That is why websites work so hard to get your attention and ultimately want you to either buy something or subscribe.

When spiders, crawlers, and bots visit websites (an activity commonly referred to as “web scraping” or “web harvesting”), they typically aren’t there to buy. They are collecting data, which is effectively business intelligence. The data collected can be reused on other websites and platforms or relied on to make informed business and financial decisions. Take the online consumer retail or travel industry, for instance. For a business in these highly price-sensitive and volatile industries, it’s important to know your competitors’ prices and trends so you can set yours accordingly. It’s imperative that you have access to accurate data, and that you get it fast. Your organization’s success could be in jeopardy if a competitor identifies you and then blocks access to their website or serves you fake data.

Whatever a website’s flavor, size, or objective, there is usually an expense incurred to maintain it. This is especially true for sites that change design or data frequently and receive a steady stream of visitors. As traffic to a website increases, so do the owner’s costs for bandwidth and for any hardware upgrades needed to maintain the site’s performance. Websites often block web harvesting activities because of this extra expense and the realization that these visitors aren’t buying. After all, they are often competitors.
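The link between traffic and cost can be sketched as simple arithmetic. The function below is a rough back-of-the-envelope estimate, not a billing formula from any particular host. The function name, its parameters, and the figures in the usage note are all hypothetical.

```python
def monthly_bandwidth_cost(requests_per_day, avg_response_kb, price_per_gb, days=30):
    """Estimate monthly transfer cost in the same currency as price_per_gb.

    All inputs are assumptions supplied by the caller; real hosting bills
    depend on the provider's pricing tiers, caching, and compression.
    """
    total_kb = requests_per_day * avg_response_kb * days
    total_gb = total_kb / (1024 * 1024)  # KB -> GB
    return total_gb * price_per_gb


# Hypothetical bot load: 1,000,000 requests a day at 512 KB each,
# billed at 0.10 per GB of outbound transfer.
bot_cost = monthly_bandwidth_cost(1_000_000, 512, 0.10)
```

With those made-up figures, the bot traffic alone comes to roughly 1,465 per month in that currency, spent on visitors who never intended to buy.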

