Welcome to the Ntrepid Podcast, Episode #5
My name is Lance Cottrell, Chief Scientist for Ntrepid Corporation.
In this episode I will be talking about how browser fingerprinting can impact your web scraping activities.
In the last podcast I touched on the issue of browser fingerprinting. In this episode I want to dig a little deeper.
The three primary identifiers that a website can track are IP address, cookies, and browser fingerprint.
Of course, the website can tell whether you are using Firefox, IE, Safari, Chrome, or some other browser. It also knows which version you are running, and which operating system and version you are running on: Windows 8, Mac Mountain Lion, or Linux, for example.
A website can also see which fonts and plugins you have installed. You almost certainly have a ton of both. Many programs and websites install fonts or plugins. For example, if you download audio from Amazon, you get a plugin. If you update your GPS from your computer, you get a plugin. If you configure your Jambox Bluetooth speaker, you get a plugin, and so on.
Lots of software uses non-standard fonts to look unique, or to allow the user more design flexibility. At the moment I have 299 fonts installed on my home computer, and I have made no particular effort to collect fonts.
Taken together, all this information creates a virtually unique pattern, your browser fingerprint. Even if you change your IP address and delete all your cookies, a website can recognize you just by recognizing your browser’s fingerprint.
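To make that concrete, here is a minimal sketch of how a set of browser attributes might be boiled down to a single identifier. It is only an illustration: the `fingerprint` function, the attribute names, and the example values are assumptions made for this sketch, not a description of how any particular website does it.

```python
import hashlib
import json


def fingerprint(attributes: dict) -> str:
    """Hash a dict of browser attributes into a short identifier.

    Hypothetical helper for illustration only.
    """
    # Sort list-like values so the order in which fonts or plugins
    # are reported does not change the result.
    normalized = {
        key: sorted(value) if isinstance(value, (list, set, tuple)) else value
        for key, value in attributes.items()
    }
    # sort_keys gives a stable serialization for the same attributes.
    canonical = json.dumps(normalized, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Assumed example values; a real browser exposes many more attributes.
visitor = {
    "browser": "Firefox",
    "os": "Windows 8",
    "fonts": ["Arial", "Calibri", "Segoe UI"],
    "plugins": ["example-plugin-a", "example-plugin-b"],
}

print(fingerprint(visitor))
```

Notice that neither the IP address nor any cookie goes into the hash, which is why changing both leaves the identifier untouched.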
But do they actually do that?
A recent study showed that over 400 of the top 10,000 websites are actively using this technique to track users who may be trying to prevent that by changing their IP addresses or deleting cookies. These are major mainstream websites, not hackers or security agencies. They are using browser fingerprinting to identify visitors to their websites, and the practice is growing quickly.
So, how does this impact you if you are engaged in web scraping?
I will assume that you are already addressing cookies and IP addresses in a way that emulates many different virtual visitors. This would include making sure that any multi-step process on a website is conducted using a single IP address and a single set of cookies until the process is complete, and then changing them all at once.
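One way to picture that discipline is the sketch below. It is an assumption-laden outline rather than a working scraper: `VirtualVisitor`, `VisitorPool`, and `run_process` are hypothetical names, the addresses come from a documentation-only range, and each step is just a callable standing in for a real request.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualVisitor:
    """One simulated visitor: identity stays fixed for a whole process."""
    proxy: str                      # exit IP address used for every step
    fingerprint: dict               # browser attributes shown to the site
    cookies: dict = field(default_factory=dict)


class VisitorPool:
    """Hands out identities in turn (hypothetical helper).

    Identities are reused once the list is exhausted, so a real pool
    would need enough of them for the volume of traffic.
    """

    def __init__(self, proxies, fingerprints):
        self._identities = list(zip(proxies, fingerprints))
        self._next = 0

    def new_visitor(self) -> VirtualVisitor:
        proxy, fp = self._identities[self._next % len(self._identities)]
        self._next += 1
        # Fresh cookie jar, new IP, and new fingerprint, all at once.
        return VirtualVisitor(proxy=proxy, fingerprint=dict(fp))


def run_process(pool: VisitorPool, steps) -> VirtualVisitor:
    """Run every step of one multi-step process as the same visitor."""
    visitor = pool.new_visitor()
    for step in steps:
        step(visitor)  # each step sees the same proxy and cookies
    return visitor


pool = VisitorPool(
    proxies=["192.0.2.10", "192.0.2.20"],
    fingerprints=[{"browser": "Firefox"}, {"browser": "Chrome"}],
)
run_process(pool, [lambda v: v.cookies.update(session="abc")])
```

The point of the design is that nothing about the identity changes in the middle of a process, and everything changes together between processes.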
So, for each virtual visitor you are trying to create, you should have an individual fingerprint discoverable by the website. Those fingerprints need to be created with care; you can't just generate them at random. For example, a very new browser might not be able to run on an older operating system, certain fonts might be unique and specific to a particular OS, and certain plugins are only compatible with certain browsers (and even certain versions of those browsers).
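A rough sketch of what "created with care" could look like: pick the operating system first, then draw the browser, fonts, and plugins only from what is plausible for that choice. The tables below are small illustrative assumptions, not a complete or authoritative compatibility list, and the plugin names are placeholders.

```python
import random

# Illustrative assumptions only; a real table would be far larger
# and would also track browser and OS versions.
PROFILES = {
    "Windows 8": {
        "browsers": ["IE", "Firefox", "Chrome"],
        "fonts": ["Segoe UI", "Calibri", "Arial"],
    },
    "Mac Mountain Lion": {
        "browsers": ["Safari", "Firefox", "Chrome"],
        "fonts": ["Helvetica Neue", "Lucida Grande", "Arial"],
    },
    "Linux": {
        "browsers": ["Firefox", "Chrome"],
        "fonts": ["DejaVu Sans", "Liberation Sans"],
    },
}

# Placeholder plugin names, keyed by the browser that can load them.
PLUGINS_BY_BROWSER = {
    "IE": ["example-plugin-ie"],
    "Safari": ["example-plugin-safari"],
    "Firefox": ["example-plugin-a", "example-plugin-b"],
    "Chrome": ["example-plugin-a"],
}


def make_fingerprint(rng: random.Random) -> dict:
    """Build a fingerprint whose parts are compatible with each other."""
    os_name = rng.choice(sorted(PROFILES))
    profile = PROFILES[os_name]
    browser = rng.choice(profile["browsers"])
    # Choose a random non-empty subset of the fonts that fit this OS.
    fonts = rng.sample(profile["fonts"], k=rng.randint(1, len(profile["fonts"])))
    return {
        "os": os_name,
        "browser": browser,
        "fonts": sorted(fonts),
        "plugins": list(PLUGINS_BY_BROWSER[browser]),
    }


def is_consistent(fp: dict) -> bool:
    """Check a fingerprint against the same compatibility tables."""
    profile = PROFILES[fp["os"]]
    return (
        fp["browser"] in profile["browsers"]
        and set(fp["fonts"]) <= set(profile["fonts"])
        and set(fp["plugins"]) <= set(PLUGINS_BY_BROWSER[fp["browser"]])
    )


print(make_fingerprint(random.Random()))
```

Choosing each attribute independently at random would sooner or later produce something like Safari on Linux with Windows-only fonts, which is exactly the kind of mismatch a website could flag.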
In many cases, mobile devices may be the best thing to emulate. Most do not allow installing any additional plugins or fonts, so there is much less variation, and therefore the fingerprint is much smaller. A tradeoff is that you may be shown the mobile version of the website, but because that is usually smaller and less graphics-intensive, that might actually be an advantage for you.
Ntrepid can help you optimize your browser fingerprints, and other web scraping tools and techniques, to stay ahead in this accelerating arms race.
For more information about anonymous web scraping tools and other Ntrepid products, please visit us on the web at ntrepidcorp.com, and follow us on Facebook and on Twitter @ntrepidcorp.
You can reach me directly with any questions or suggestions for future topics by email at firstname.lastname@example.org.
Thanks for listening.