Ntrepid Podcast 4: Internet Cookies and Web Scraping

Transcript

Welcome to the Ntrepid Podcast, Episode #4.

My name is Lance Cottrell, Chief Scientist for Ntrepid Corporation.

In this episode I will be talking about how cookies and other information you provide can impact your web scraping success.

When setting up a web scraping process, many people’s first instinct is to remove as much identifying information as possible in order to be more anonymous. Unfortunately, that can actually make you stand out even more, and cause you to be quickly flagged and blocked by the websites you are trying to collect from.

Take cookies, for example: the best-known and most easily removed identifiers. While they can be used to track visitors, they are often required for the website to function correctly. When a website tries to set a cookie, whether in a response header or in JavaScript, that cookie should be accepted and returned to the website.
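As a rough sketch of that browser-like behavior in Python (not from the episode; the site, cookie name, and value are placeholders), the standard library’s CookieJar accepts cookies a response sets and replays them on later requests:

```python
import urllib.request
from email.message import Message
from http.cookiejar import CookieJar

# A CookieJar mimics a normal browser: it stores cookies a response sets
# and attaches them to later requests to the same site.
jar = CookieJar()

# Simulate a response whose headers carry a Set-Cookie. In real use,
# urllib.request.HTTPCookieProcessor(jar) performs this extraction for you.
class FakeResponse:
    def __init__(self, headers):
        self._headers = headers
    def info(self):
        return self._headers

headers = Message()
headers["Set-Cookie"] = "session_id=abc123; Path=/"
first_request = urllib.request.Request("http://example.com/")
jar.extract_cookies(FakeResponse(headers), first_request)

# The jar now attaches the cookie to any later request to the site.
next_request = urllib.request.Request("http://example.com/search")
jar.add_cookie_header(next_request)
print(next_request.get_header("Cookie"))  # session_id=abc123
```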

That is not to say that you should let cookies hang around forever, and therein lies the art. The key is to keep them for a moderate number of queries, but no more than a human might reasonably be expected to perform in a single sitting.
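One way to implement that “single sitting” limit is a per-session query budget. This is a sketch, not the podcast’s own tooling, and the 20–40 range is an illustrative guess to be tuned per target site:

```python
import random

class SessionBudget:
    """Retire a scraping session after a human-plausible number of queries.

    The 20-40 default range is illustrative; tune it to what a real
    visitor to the target site might plausibly do in one sitting.
    """

    def __init__(self, low=20, high=40):
        self.low, self.high = low, high
        self.remaining = random.randint(low, high)

    def record_query(self):
        self.remaining -= 1

    def should_rotate(self):
        return self.remaining <= 0

    def rotate(self):
        # In a real scraper this is also where you would clear cookies,
        # switch IP address, and change the browser fingerprint.
        self.remaining = random.randint(self.low, self.high)
```

Randomizing the budget, rather than fixing it at a constant, avoids the tell of every session being exactly the same length.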

Cookies need to be managed in concert with many other identifiers, and changed together between sessions. The most important identifiers after cookies are IP addresses, and it is particularly important that these two change together. Many websites will actually embed a coded version of the visitor’s IP address in a cookie, and then check on every page that the two still match. If you change IP midstream while keeping the cookies, the website will flag your activity, and is likely to return an error page or bounce you back to the home page without the data you were looking for.
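A minimal sketch of rotating both at once, assuming a pool of HTTP proxies as the source of fresh IPs (the proxy addresses below are placeholders, not real infrastructure):

```python
import itertools
import urllib.request
from http.cookiejar import CookieJar

# Placeholder proxy addresses standing in for a pool of exit IPs.
PROXIES = itertools.cycle([
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
])

def new_session():
    """Start a fresh session: a new exit IP and an empty cookie jar,
    changed as a unit so the site never sees old cookies arriving from
    a new address (or vice versa)."""
    proxy = next(PROXIES)
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy}),
        urllib.request.HTTPCookieProcessor(jar),
    )
    return opener, jar, proxy
```

The point of bundling the proxy and the jar into one constructor is that it becomes impossible to rotate one without the other.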

When switching to a new session, we suggest going back to an appropriate landing page, and working down through the website from there. Some websites will set a cookie on their landing pages. If they don’t see it when a visitor hits a deep page, it is evidence that the hit is from a scraper, and not from a real person who came to the website and navigated to that page.
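That ordering can be expressed as a simple request plan, landing page first (a sketch; the URLs are illustrative):

```python
from urllib.parse import urljoin

def session_plan(landing_page, deep_paths):
    """Order a session's requests so it starts at the landing page,
    picking up any cookies set there, before touching deep pages."""
    return [landing_page] + [urljoin(landing_page, p) for p in deep_paths]

plan = session_plan("https://example.com/", ["search?q=widgets", "items/42"])
# plan[0] is the landing page; the deep pages follow in order.
```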

When you change sessions, it is also a good time to change your browser fingerprint. Browser and OS versions, supported languages, fonts, and plugins can collectively create an almost unique identifier of your computer. Changing these slightly between sessions reduces the likelihood of being detected and blocked.
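One sketch of this is to rotate whole, internally consistent browser profiles rather than individual attributes, since a mismatched combination (say, a macOS browser reporting Windows-only fonts) is itself a red flag. The header values below are illustrative, not real current browser strings:

```python
import random

# Each profile is an internally consistent set of headers; rotate the
# whole profile per session, never mix attributes across profiles.
PROFILES = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/115.0",
     "Accept-Language": "en-US,en;q=0.9"},
    {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
     "Accept-Language": "en-GB,en;q=0.8"},
]

def pick_profile():
    """Choose one coherent fingerprint and keep it for the whole session."""
    return dict(random.choice(PROFILES))
```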

Finally, you can get tripped up by the information that you explicitly pass to the target website. Many scraping activities require filling out search fields or other forms. In one case, a customer was caught because they used the same shipping ZIP code for every query. That ZIP code became so dominant in the website’s traffic that its operators investigated and discovered the scraping activity.
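A minimal sketch of avoiding that failure mode: draw values like shipping ZIP codes from a broad pool so no single value dominates the target’s logs (the ZIP codes and parameter names below are arbitrary examples):

```python
import random

# Arbitrary example ZIP codes; in practice, use a pool that matches the
# geographic spread of the site's real customers.
ZIP_POOL = ["10001", "60614", "94110", "73301", "30303", "98101"]

def search_params(term):
    """Build form parameters with a varied ZIP code for each query."""
    return {"q": term, "zip": random.choice(ZIP_POOL)}
```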

It is important to avoid detection if at all possible because it keeps the target at a lower level of alertness. Once they are aware of scraping activity, they are more likely to take countermeasures, and to look more carefully for future scraping. Staying below the radar from the start will make things much easier in the long run.

For more information about anonymous web scraping tools, best practices, and other Ntrepid products, please visit us on the web at ntrepidcorp.com, and follow us on Facebook and on Twitter @ntrepidcorp.

You can reach me with any questions or suggestions for future topics by email at lance.cottrell@ntrepidcorp.com.

Thanks for listening.