Browser Fingerprinting and Its Effects On Web Scraping
Ntrepid Podcast 5: Browser Fingerprinting and Its Effects On Web Scraping
Transcript
Welcome to the Ntrepid Podcast, Episode #5.
My name is Lance Cottrell, Chief Scientist for Ntrepid Corporation.
In this episode I will be talking about how browser fingerprinting can impact your web scraping activities.
In the last podcast I touched on the issue of browser fingerprinting. In this episode I want to dig a little deeper.
The three primary identifiers that a website can track are IP address, cookies, and browser fingerprint.
By browser fingerprint, I mean all the information a website can obtain about your web browser and computer from within a web page, using JavaScript and/or Flash. It turns out that there is a lot more information there than you might guess.
Of course, the website can tell if you are using Firefox, IE, Safari, Chrome, or whatever other browser. It also knows which version you are running, and which operating system and operating system version you are running on: Windows 8, Mac Mountain Lion, or Linux, for example.
Using JavaScript and Flash, the website can see much more. It can get your time zone, screen size, and color depth. But the real goldmine is in the fonts and plugins.
You almost certainly have a ton of both. Many programs and websites install fonts or plugins. For example, if you download audio from Amazon, you get a plugin. If you update your GPS from your computer, you get a plugin. If you configure your Jambox Bluetooth speaker, you get a plugin, and so on.
Lots of software installs non-standard fonts to give itself a unique look, or to allow the user more design flexibility. At the moment I have 299 fonts installed on my home computer, and I have made no particular effort to collect fonts.
Taken together, all this information creates a virtually unique pattern, your browser fingerprint. Even if you change your IP address and delete all your cookies, a website can recognize you just by recognizing your browser’s fingerprint.
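To make the idea concrete, here is a minimal Python sketch of how a site might collapse those attributes into a single identifier. The attribute names and values are invented for illustration, and a real tracker would likely collect more signals and use its own scheme. The point is only that neither the IP address nor any cookie enters the calculation.

```python
import hashlib
import json


def fingerprint_id(attributes: dict) -> str:
    """Collapse browser attributes into one stable identifier.

    Assumption: the attribute names are illustrative, not any real tracker's schema.
    """
    # Sort list values so the same fonts or plugins in a different order
    # still serialize to the same string.
    normalized = {
        key: sorted(value) if isinstance(value, list) else value
        for key, value in attributes.items()
    }
    canonical = json.dumps(normalized, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical values a page could collect with JavaScript and/or Flash.
visit_one = {
    "user_agent": "ExampleBrowser/1.0 (ExampleOS 8)",
    "time_zone": "UTC-8",
    "screen": "1920x1080x24",
    "fonts": ["Arial", "Helvetica", "Some Vendor Font"],
    "plugins": ["Audio Downloader", "GPS Updater"],
}

# The same machine on a later visit: fonts reported in a different order.
visit_two = dict(visit_one, fonts=["Some Vendor Font", "Arial", "Helvetica"])

# A different machine that has one extra font installed.
other_machine = dict(visit_one, fonts=visit_one["fonts"] + ["Another Font"])

same_visitor = fingerprint_id(visit_one) == fingerprint_id(visit_two)
different_visitor = fingerprint_id(visit_one) != fingerprint_id(other_machine)
```

In this sketch the two visits from the same machine produce the same identifier no matter which IP address or cookies accompanied them, while a single extra font is enough to produce a different one.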
But do they actually do that?
A recent study showed that over 400 of the top 10,000 websites are actively using this technique to track users who may be trying to prevent that by changing their IP addresses or deleting cookies. These are major mainstream websites, not hackers and not security agencies. They are using browser fingerprinting to identify visitors to their websites, and the practice is growing quickly.
So, how does this impact you if you are engaged in web scraping?
I will assume that you are already addressing cookies and IP addresses in a way that emulates many different virtual visitors. This includes making sure that any multi-step process on a website is conducted from a single IP address, with cookies retained until the process is complete, and then changing them all at once.
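One way to picture that discipline is the following Python sketch, which ties an exit IP and a cookie jar together so that they can only be replaced as a pair. The class names, the `rotate` method, and the proxy addresses are all hypothetical placeholders, not part of any particular scraping tool.

```python
import http.cookiejar
import itertools
from dataclasses import dataclass, field


@dataclass
class VirtualVisitor:
    """One simulated visitor: an exit IP plus the cookies gathered through it."""
    proxy: str
    cookies: http.cookiejar.CookieJar = field(default_factory=http.cookiejar.CookieJar)


class VisitorPool:
    """Hands out identities and replaces IP and cookies together, never separately."""

    def __init__(self, proxies):
        # Assumption: `proxies` is a list of proxy addresses you already control.
        self._proxies = itertools.cycle(proxies)
        self.current = VirtualVisitor(proxy=next(self._proxies))

    def rotate(self) -> VirtualVisitor:
        # Call this only after a multi-step process has finished.
        self.current = VirtualVisitor(proxy=next(self._proxies))
        return self.current


# Placeholder addresses from a range reserved for documentation.
pool = VisitorPool(["203.0.113.10:8080", "203.0.113.11:8080"])

# Every step of one multi-step process uses the same identity.
search_step = pool.current
checkout_step = pool.current

# Process complete: swap the IP address and the cookie jar in one move.
next_visitor = pool.rotate()
```

Keeping the two values in one object makes it hard to change the IP address in the middle of a session while accidentally carrying the old cookies along, or the reverse.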
If you are not also addressing your browser fingerprint, however, any website could still identify you as being the same person, defeating your attempts to hide. You can reduce the size of your browser fingerprint by blocking Flash and/or JavaScript. Now, many people block Flash for security reasons, so you will not stand out too much if you choose to do the same. However, blocking JavaScript will really stand out, because for a real person it would break most of the interesting websites on the Internet.
So, for each virtual visitor you are trying to create, you should have an individual fingerprint discoverable by the website. Those fingerprints need to be created with care; you can’t just randomly create them. For example, a very new browser might not be able to run on an older operating system, certain fonts might be unique and specific to a particular OS, and certain plugins are only compatible with certain browsers (and even certain versions of browsers).
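The three kinds of conflict just described can be checked mechanically before a profile is ever used. The Python sketch below does that against small lookup tables. Every browser, operating system, font, and plugin name in it is made up, and real compatibility data would have to be collected separately.

```python
from dataclasses import dataclass

# Hypothetical compatibility data, invented for illustration only.
MIN_OS_VERSION_FOR_BROWSER = {("ExampleBrowser", 30): 10}  # browser v30 needs OS v10+
OS_ONLY_FONTS = {"ExampleOS Sans": "ExampleOS"}            # font ships with one OS
PLUGIN_BROWSERS = {"Example Media Plugin": {"ExampleBrowser"}}


@dataclass
class Profile:
    browser: str
    browser_version: int
    os_name: str
    os_version: int
    fonts: list
    plugins: list


def find_conflicts(profile: Profile) -> list:
    """Return reasons the profile could not plausibly exist on a real machine."""
    conflicts = []
    required = MIN_OS_VERSION_FOR_BROWSER.get((profile.browser, profile.browser_version))
    if required is not None and profile.os_version < required:
        conflicts.append("browser too new for operating system")
    for font in profile.fonts:
        owner = OS_ONLY_FONTS.get(font)
        if owner is not None and owner != profile.os_name:
            conflicts.append(f"font '{font}' is specific to {owner}")
    for plugin in profile.plugins:
        supported = PLUGIN_BROWSERS.get(plugin)
        if supported is not None and profile.browser not in supported:
            conflicts.append(f"plugin '{plugin}' does not run in {profile.browser}")
    return conflicts


plausible = Profile("ExampleBrowser", 30, "ExampleOS", 10,
                    ["ExampleOS Sans"], ["Example Media Plugin"])
too_new = Profile("ExampleBrowser", 30, "ExampleOS", 8, [], [])
mismatched = Profile("OtherBrowser", 5, "OtherOS", 7,
                     ["ExampleOS Sans"], ["Example Media Plugin"])
```

Here `find_conflicts(plausible)` returns an empty list, `too_new` is flagged for its browser and OS pairing, and `mismatched` is flagged once for the font and once for the plugin.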
In many cases, mobile devices may be the best thing to emulate. Most do not allow installing any additional plugins or fonts, so there is much less variation, and therefore the fingerprint is much smaller. A tradeoff is that you may be shown the mobile version of the website, but because that is usually smaller and less graphics-intensive, it might actually be an advantage for you.
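One common way to put a number on "less variation" is Shannon entropy, measured in bits: the fewer distinct fingerprints a group of visitors shows, the fewer bits each fingerprint reveals. The Python sketch below compares two tiny populations. Both samples are invented purely to show the arithmetic and say nothing about real-world proportions.

```python
import math
from collections import Counter


def entropy_bits(observed_fingerprints) -> float:
    """Shannon entropy, in bits, of a collection of observed fingerprints."""
    counts = Counter(observed_fingerprints)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Invented sample: eight desktop visitors, each with a different font/plugin mix.
desktop_visitors = [f"desktop-config-{i}" for i in range(8)]

# Invented sample: eight mobile visitors sharing only two stock configurations.
mobile_visitors = ["phone-model-a"] * 4 + ["phone-model-b"] * 4

desktop_bits = entropy_bits(desktop_visitors)  # 8 equally likely values: 3 bits
mobile_bits = entropy_bits(mobile_visitors)    # 2 equally likely values: 1 bit
```

With fewer bits to work with, a site has a harder time telling one emulated mobile visitor from the real ones that share its configuration.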
Ntrepid can help you optimize your browser fingerprints, and other web scraping tools and techniques, to stay ahead in this accelerating arms race.
For more information about anonymous web scraping tools and other Ntrepid products, please visit us on the web at ntrepidcorp.com, and follow us on Facebook, and on Twitter @ntrepidcorp.
You can reach me directly by email with any questions or suggestions for future topics at lance.cottrell@ntrepidcorp.com.
Thanks for listening.