Ntrepid Podcast 4: Internet Cookies and Web Scraping
Transcript
Welcome to the Ntrepid Podcast, Episode #4.
My name is Lance Cottrell, Chief Scientist for Ntrepid Corporation.
In this episode I will be talking about how cookies and other information you provide can impact your web scraping success.
When setting up a web scraping process, many people’s first instinct is to remove as much identifying information as possible in order to be more anonymous. Unfortunately, that can actually make you stand out even more, and cause you to be quickly flagged and blocked by the websites from which you are trying to collect.
Take cookies, for example: the best-known and most easily removed identifiers. While they can be used to track visitors, they are often required for the website to function correctly. When a website tries to set a cookie, whether in a response header or through JavaScript, that cookie should be accepted and returned to the website.
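
To make that concrete, here is a minimal sketch in Python; the requests library and the example URLs are assumptions for illustration, not anything named in the episode. A requests Session accepts cookies from Set-Cookie response headers and returns them on later requests automatically, so the main point is simply not to strip or disable that behavior.

    import requests

    # A Session keeps a cookie jar: anything the server sets via a
    # Set-Cookie response header is stored and sent back automatically
    # on every later request in the same session.
    session = requests.Session()

    # Hypothetical target URL, used purely for illustration.
    session.get("https://example.com/")
    print(session.cookies.get_dict())  # whatever the server just set

    # The follow-up request carries those cookies back to the site.
    session.get("https://example.com/search?q=widgets")

    # Note: requests does not execute JavaScript, so cookies set in
    # client-side scripts require a real browser engine to capture.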
That is not to say that you should let them hang around forever, and therein lies the art. The key is to keep cookies around for a moderate number of queries, but only as many as a human might reasonably make in a single sitting.
Cookies need to be managed in concert with many other identifiers, and changed together between sessions. The most important identifier after cookies is the IP address, and it is particularly important that these two change together. Many websites will embed a coded version of the visitor’s IP address in a cookie and then, on every page, check that the two still match. If you change IP midstream while keeping the cookies, the website will flag your activity, and is likely to return an error page or bounce you back to the home page without the data you were looking for.
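
A sketch of what changing them together might look like, again in Python with requests; the proxy endpoints, query limits, and URLs here are illustrative assumptions, not anything from the podcast:

    import itertools
    import random
    import requests

    # Illustrative proxy pool; real endpoints would come from whatever
    # anonymization service supplies your exit IPs.
    PROXIES = itertools.cycle([
        "http://proxy1.example.net:8080",
        "http://proxy2.example.net:8080",
        "http://proxy3.example.net:8080",
    ])

    def new_session():
        """Fresh cookie jar and a new exit IP, changed together."""
        session = requests.Session()
        proxy = next(PROXIES)
        session.proxies = {"http": proxy, "https": proxy}
        return session

    session = new_session()
    queries_done = 0
    # A moderate, human-plausible number of queries per sitting.
    queries_allowed = random.randint(15, 40)

    for item in range(200):  # hypothetical pages to collect
        if queries_done >= queries_allowed:
            # Retire the cookies and the IP at the same moment; changing
            # one without the other is exactly the mismatch sites flag.
            session = new_session()
            queries_done = 0
            queries_allowed = random.randint(15, 40)
        session.get(f"https://example.com/item/{item}")
        queries_done += 1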
When switching to a new session, we suggest going back to an appropriate landing page, and working down through the website from there. Some websites will set a cookie on their landing pages. If they don’t see it when a visitor hits a deep page, it is evidence that the hit is from a scraper, and not from a real person who came to the website and navigated to that page.
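
Under the same assumptions, a new session might be opened like this, hitting the landing page before any deep page:

    import requests

    def start_session(base_url="https://example.com/"):
        """Open a new session the way a person would: front door first."""
        session = requests.Session()
        # The landing page may set a cookie that deeper pages check for.
        session.get(base_url)
        return session

    session = start_session()
    # Work down through the site rather than jumping straight to the
    # target; the URLs here are hypothetical.
    session.get("https://example.com/category/widgets")
    session.get("https://example.com/category/widgets/item/42")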
When you change sessions, it is also a good time to change your browser fingerprint. Browser and OS versions, supported languages, fonts, and plugins can collectively create an almost unique identifier for your computer. Changing these slightly between sessions reduces the likelihood of being detected and blocked.
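
Header-level pieces of the fingerprint are straightforward to vary from a script; fonts and plugins require driving a real browser. A sketch, with made-up but plausible header values:

    import random
    import requests

    # A small pool of internally consistent, header-visible fingerprints.
    # The values are illustrative; real ones should match actual browser
    # releases, since an implausible combination is itself a flag.
    FINGERPRINTS = [
        {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36",
         "Accept-Language": "en-US,en;q=0.9"},
        {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                       "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                       "Version/17.1 Safari/605.1.15",
         "Accept-Language": "en-US,en;q=0.8"},
    ]

    def new_fingerprinted_session():
        """Fresh session with a different header-level fingerprint."""
        session = requests.Session()
        session.headers.update(random.choice(FINGERPRINTS))
        return session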
Finally, you can get tripped up by the information that you explicitly pass to the target website. Many scraping activities require filling out search fields or other forms. We had one situation where a customer was tripped up because they used the same shipping ZIP code for every query. That ZIP code became so dominant in the site’s traffic that the operators investigated and discovered the scraping activity.
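
The fix in a case like that is to draw such inputs from a pool rather than hard-coding a single value. A sketch, where the form field name and ZIP codes are made-up examples:

    import random
    import requests

    # Vary per-query inputs instead of reusing one value everywhere.
    ZIP_CODES = ["10001", "60614", "94110", "30339", "73301"]

    session = requests.Session()
    for product_id in range(20):
        session.post(
            "https://example.com/shipping-quote",  # hypothetical endpoint
            data={
                "product": str(product_id),
                "ship_to_zip": random.choice(ZIP_CODES),
            },
        )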
It is important to avoid detection if at all possible because it keeps the target at a lower level of alertness. Once they are aware of scraping activity, they are more likely to take countermeasures, and to look more carefully for future scraping. Staying below the radar from the start will make things much easier in the long run.
For more information about anonymous web scraping tools, best practices, and other Ntrepid products, please visit us on the web at ntrepidcorp.com, and follow us on Facebook and on Twitter @ntrepidcorp.
You can reach me with any questions or suggestions for future topics by email at lance.cottrell@ntrepidcorp.com.
Thanks for listening.