Overcome Big Data Web Scraping Obstacles

Ntrepid Podcast 1: Overcome Big Data Web Scraping Obstacles

Big data is the big buzzword right now, and rightly so. There are really two kinds of big data out there: the internally generated data you collect in the course of business, and the big data that you go out and get. I’m going to focus on the second kind here. Going from basic Internet data collection to big data Internet collection introduces some real problems.

So let’s consider a couple of scenarios. In the first case, imagine you’re trying to collect a large amount of data from a web search engine to look at your SEO (Search Engine Optimization) rankings. You want to look at lots and lots of different search terms, and not just the first page of results but many pages of results, and that adds up to a lot of hits on the search engine’s site. They’re fairly quickly going to detect this activity: you’ll hit their throttles and they’ll block you from doing further searches. And staying below that threshold may make the job take hours or days, versus just minutes if you could go as fast as you possibly can.
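
To make the scale concrete, here is a minimal sketch of that kind of collection in Python. The search endpoint, its parameters, the page counts, and the delay values are illustrative assumptions, not a real search engine’s API; the pacing is exactly what stretches the job from minutes into hours or days.

```python
# Throttled collection of search rankings. SEARCH_URL, its parameters, and the
# delay values are illustrative assumptions, not a real search engine's API.
import time
import requests

SEARCH_URL = "https://search.example.com/results"   # hypothetical endpoint
TERMS = ["term one", "term two", "term three"]       # in practice, thousands of terms
PAGES_PER_TERM = 10                                  # not just the first page of results
DELAY_SECONDS = 20                                   # conservative pacing to stay under the throttle

def collect_rankings():
    session = requests.Session()
    for term in TERMS:
        for page in range(1, PAGES_PER_TERM + 1):
            resp = session.get(SEARCH_URL, params={"q": term, "page": page}, timeout=30)
            if resp.status_code in (403, 429):
                # Throttled or blocked: back off instead of hammering the site.
                time.sleep(10 * DELAY_SECONDS)
                continue
            yield term, page, resp.text
            time.sleep(DELAY_SECONDS)   # this pacing is why the job takes hours, not minutes
```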

Another scenario would be competitive intelligence. Imagine you need to get information on pricing or products, watch for trademark infringement, or monitor your resellers: there are lots of reasons you’d want to look at your competitors, or even your own subsidiaries, on the Internet. We see a lot of blocking here too when you do too much activity and exceed some kind of threshold. But we’re also seeing sites get really smart.

Imagine you’re an airline and you want to look up pricing for your competitors. Airline A wants to look at Airline B’s prices, and not just one price: every pair of cities, for every departure time, for every day between now and several months from now, because these prices aren’t static, they’re changing continuously. Now, if you’re detected, you actually get fed wrong information. The prices will be systematically incorrect. The site may make all the prices look higher than they really are, to trick you into competing against those prices so you won’t fill your seats. Or it may make them look lower than they really are, so you underprice and undercut your margins. So it’s very important to avoid detection when you’re going about these kinds of activities.

Many things can lead to these variations in information. It may not just be who you are; it may be your location, the time of day, or many other characteristics. For example, Orbitz for quite a while was showing more expensive hotels to people searching from Mac computers than to people searching from Windows computers.

The general principle here is that websites aren’t things. We often talk about “the” Internet, but that’s really very misleading. Much of the web is now created on the fly; it’s dynamic, more of a process than a thing. When you go to a webpage, it’s created in the moment you look at it, based on who you are, where you’re coming from, and what information the site has in its database. The page is assembled to order, just for you.

So the Internet, rather than being some thing that you can look at, is more like a hologram: you need to be able to look at it from multiple perspectives to really understand what it looks like.

When I talk about the main obstacles to big data collection, I’m usually thinking about blocking and cloaking. Blocking is what it says it is: the website simply prevents access, as in the initial scenarios. Cloaking is when a website is set up to provide different, false information, as in the airline example. You need to be able to reach the data without being blocked, and when you do reach it, the data you get back needs to be correct and real; that second part is avoiding cloaking.

In some cases, you just want to understand the targeting. If a website is providing different information to different people, you may simply want to understand who they show which information to, because that may be important from a competitive positioning point of view.
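
As a rough illustration of looking at a site from multiple perspectives, here is a minimal sketch, assuming you already have a couple of proxy exit points to route through; the proxy URLs below are placeholders. It fetches the same page through each vantage point and compares fingerprints of the responses. In practice you would compare extracted fields such as prices rather than raw HTML, which varies on every request.

```python
# Fetch the same URL through several vantage points and fingerprint the answers.
# The proxy URLs are placeholders for whatever exit points you actually have.
import hashlib
import requests

VANTAGE_POINTS = {
    "direct": None,
    "region-a": "http://proxy-a.example.net:8080",   # hypothetical exits
    "region-b": "http://proxy-b.example.net:8080",
}

def fingerprint(url):
    digests = {}
    for name, proxy in VANTAGE_POINTS.items():
        proxies = {"http": proxy, "https": proxy} if proxy else None
        body = requests.get(url, proxies=proxies, timeout=30).text
        digests[name] = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return digests

# Differing digests mean different viewers are being shown different content,
# whether that is legitimate personalization or deliberate cloaking.
print(fingerprint("https://www.example.com/pricing"))
```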

The real thing that sets big data collection on the Internet apart from simple data collection is volume. You could be hitting a website hundreds of thousands to millions of times in a relatively short period.

And so even if you’re anonymous, even if you’ve done a thorough job of hiding who you are and where you’re coming from, it’s still going to be obvious to the website that someone is hitting them a hundred thousand times. It’s like shining a huge spotlight on their website: they’re going to see the activity, and your IP address will show up right at the top of their logs. So the trick here is to diffuse your activity. Rather than looking like one huge visitor hitting the site a hundred thousand times, you need to look like a huge number of relatively low-activity visitors, all of them behaving in a normal way, at normal levels of intensity.
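
A minimal sketch of that diffusion, assuming a pool of proxy exit addresses (the addresses below are placeholders): each “visitor” gets its own session and its own source IP, and requests are spread across them so no single address piles up an implausible hit count.

```python
# Spread requests across many low-activity "visitors". Each persona has its own
# session (cookies) and its own exit address; the proxy pool is a placeholder.
import itertools
import requests

PROXY_POOL = [f"http://exit-{i}.example.net:8080" for i in range(100)]   # hypothetical

def make_persona(proxy_url):
    session = requests.Session()
    session.proxies = {"http": proxy_url, "https": proxy_url}
    return session

personas = itertools.cycle(make_persona(p) for p in PROXY_POOL)

def fetch(url):
    # Each request is attributed to a different source IP and cookie jar, so no
    # single address accumulates an implausible hit count.
    return next(personas).get(url, timeout=30).text
```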

What’s the give-away? The IP address is the real common denominator: it’s the thing everyone tracks, and it’s one of the hardest things to hide. The magic metric to watch is hits per target, per source IP address, per time period.
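
From the site’s side, that metric is easy to compute. Here is a small sketch, assuming a simplified access log of (timestamp, source IP) pairs, that counts hits per source IP per minute; a scraper funnelling all its traffic through one address tops this list immediately.

```python
# Compute the give-away metric from the site's point of view: hits per source
# IP per time bucket. The log format (timestamp, source IP) is an assumption.
from collections import Counter
from datetime import datetime

def hits_per_ip_per_minute(log):
    counts = Counter()
    for timestamp, source_ip in log:
        bucket = timestamp.replace(second=0, microsecond=0)
        counts[(source_ip, bucket)] += 1
    return counts

log = [
    (datetime(2013, 5, 1, 9, 0, 5), "203.0.113.7"),
    (datetime(2013, 5, 1, 9, 0, 9), "203.0.113.7"),
    (datetime(2013, 5, 1, 9, 1, 2), "198.51.100.4"),
]
# A scraper concentrating everything on one address tops this list at once.
print(hits_per_ip_per_minute(log).most_common(3))
```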

You need a realistic number of connections, not just per day but also per hour and per minute, to look plausible. You need to stay human. Looking at hits per source IP per day, you might want to stay below, say, fifty pages, while on a per-minute basis you probably need to stay below five pages, depending on the website. And you’ll notice that the per-minute number multiplied by the number of minutes in a day does not add up to the per-day number, because no one sits at their computer clicking continuously on the same website all day. Looking realistic involves all the different timescales.
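
One way to respect several timescales at once is a small pacer that remembers recent hits per window. The sketch below uses the example caps above (five pages per minute, fifty per day); the right numbers vary by target site and are an assumption here.

```python
# A pacer that enforces plausible limits on more than one timescale at once.
# The caps below mirror the example numbers above and will vary by target site.
import time
from collections import deque

class HumanPacer:
    def __init__(self, per_minute=5, per_day=50):
        # (window length in seconds, cap, timestamps of recent hits)
        self.windows = [(60, per_minute, deque()), (86400, per_day, deque())]

    def wait_for_slot(self):
        while True:
            now = time.time()
            delays = []
            for length, cap, hits in self.windows:
                while hits and now - hits[0] > length:
                    hits.popleft()                    # forget hits outside the window
                if len(hits) >= cap:
                    delays.append(hits[0] + length - now)
            if not delays:
                break
            time.sleep(max(delays))                   # wait until every window has room
        for _, _, hits in self.windows:
            hits.append(time.time())

pacer = HumanPacer()
# Call pacer.wait_for_slot() before each page fetch from a given source address.
```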

Now, some more paranoid sites are also looking for realistic surfing patterns. They look more closely at how you visit the website and how you load the pages. Do you, say, just grab the text off the pages and not the images? That’s very common for basic web harvesting because it cuts down a lot on the amount of data you need to grab, but it also really stands out: it looks mechanical, not the way a human accesses things. Most scraping is also faster than a human can browse the web; if you’re clicking to a new page every second, that doesn’t leave a lot of time for reading the information that’s there. So, when trying to go against more sophisticated or paranoid websites, it’s very important to make sure your patterns look appropriate.
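
A minimal sketch of a less mechanical fetch, using only the Python standard library plus requests: it pulls a page’s images the way a browser would and then dwells for a randomized, human-scale interval before moving on. The asset limit and the timings are illustrative assumptions.

```python
# A less mechanical fetch: load the page's images as a browser would, then
# dwell for a randomized, human-scale interval. Limits and timings are
# illustrative assumptions.
import random
import time
from html.parser import HTMLParser
from urllib.parse import urljoin
import requests

class ImageCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def browse(session, url):
    page = session.get(url, timeout=30)
    collector = ImageCollector()
    collector.feed(page.text)
    for src in collector.sources[:10]:       # pull supporting assets, not just the text
        session.get(urljoin(url, src), timeout=30)
    time.sleep(random.uniform(8, 40))        # leave plausible reading time between pages
    return page.text
```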

Cookies and other tracking mechanisms are another give-away. If they’re blocked entirely, many sites will simply fail. But they also need to be turned over frequently, or all the activity gets correlated: if you’re pretending to be a hundred people, you can’t have all hundred of them using the same cookie, or you’ve undone all the work. Many sites also check that all traffic with a given cookie comes from the same IP address. In many cases, they’ll embed an encrypted or scrambled version of the IP address in one of their cookies so they can very quickly check that you haven’t changed addresses mid-session. They’re mostly doing this to prevent session hijacking, but it always causes problems for scrapers.
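
Here is one way to handle that cookie hygiene, continuing the persona idea from the earlier sketch: each persona keeps its own cookie jar bound to a single exit address (placeholder URLs again), and the jar is retired after a modest number of requests so activity never correlates across personas.

```python
# Cookie hygiene per persona: fresh cookies every so often, always paired with
# the same exit address. Proxy URLs are placeholders.
import requests

class Persona:
    def __init__(self, proxy_url, max_requests=40):
        self.proxy_url = proxy_url
        self.max_requests = max_requests
        self._new_session()

    def _new_session(self):
        # New cookie jar, same exit IP: cookies and source address stay
        # consistent within a session, which is exactly what sites check.
        self.session = requests.Session()
        self.session.proxies = {"http": self.proxy_url, "https": self.proxy_url}
        self.requests_made = 0

    def get(self, url):
        if self.requests_made >= self.max_requests:
            self._new_session()                  # turn the cookies over
        self.requests_made += 1
        return self.session.get(url, timeout=30)

persona = Persona("http://exit-1.example.net:8080")   # hypothetical exit
```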

Ntrepid solutions integrate quickly with your existing scraping tools to let you spread your activity across thousands of different source addresses. For more sophisticated targets, we enable the creation of massively parallel, independent sessions that emulate large numbers of realistic individual agents, ensuring the traffic will stand up to even detailed scrutiny.

Transcript

Welcome to the Ntrepid audio briefs: Issue Number 1. My name is Lance Cottrell, Chief Scientist for Ntrepid Corporation. In this issue, I will be talking about collecting big data against resisting targets.

Big data is the big buzzword right now, and rightly so. There’s really two kinds of big data out there: there’s what you collect in the course of business, internally generated big data, and the big data that you go out and get. And I’m really going to focus on the second here. Going from basic Internet data collection to big data Internet collection introduces some real problems.

So let’s consider a couple scenarios. In the first case, imagine you’re trying to collect a large amount of data from a web search engine to look at your SEO (Search Engine Optimization) rankings. So you’re going to want to look at lots and lots of different search terms and not just the first page of results, but many pages of results, and this is going to add up to a lot of hits on the search engine site. They’re fairly quickly going to detect this activity and you’ll hit their throttles and they’ll block your activity – they’ll prevent you from being able to do the searches. And staying below that threshold may make your activity take hours or days versus just minutes if you could go as fast as you possibly can.

Another scenario would be looking for competitive intelligence. So, imagine you need to be getting information on pricing or product information, trademark infringement, monitoring your resellers – lots of different reasons you’d want to look at your competitors or even subsidiaries on the Internet. And we see a lot of blocking here too when you’re doing too much activity and exceeding some kind of threshold. But we’re also seeing sites getting really smart.

So, imagine you’re an airline and you want to look up pricing for your competitors. So, Airline A wants to look at Airline B’s prices, and they don’t want to just look at one price, they want to look at every pair of cities for every departure time for every day between now and several months from now, because we know these prices aren’t static, they’re changing continuously. Now what happens is that if you’re detected, you actually get fed wrong information, right? The prices will be systematically incorrect, they may make all the prices higher than they appear, higher than they really are, to trick you into competing against those prices, and therefore, you won’t get to fill your seats. Or they’ll make them look lower than they really are, get you to underprice and undercut your margins. So it’s really very important to avoid detection when you’re going about these kinds of activities.

Now there’s a lot of things that can lead to these variations in information. It may not just be who you are, it may be by location, or time of day or many other kinds of characteristics. For example, Orbitz for quite a while was showing more expensive hotels to people searching from Mac computers versus Windows computers.

The general principle here is that websites aren’t things. We often talk about “the” Internet, but that’s really very misleading. Much of the web is now created on the fly, it’s all dynamic, it’s more of a process than a thing. So, when you go to the webpage, it’s created in the moment you look at it, based on who you are, where you’re coming from, what information they have in the database. And then they assemble that page to order, just for you.

So the Internet, rather than being some thing that you can look at, is more like a hologram: you need to be able to look at it from multiple perspectives to really understand what it looks like.

So when I talk about the main obstacles to big data collection, I’m usually thinking about blocking and cloaking. And blocking is what it says it is, the website simply prevents access, and I talked about that in the initial scenarios. And cloaking is when a website is set up to provide different, false information, and that was what I talked about in the airline example – you need to get access to some kind of data, and it’s important that you be able to access it, that you not be blocked, and that when you do access it, the data you’re getting is correct and real, and that’s avoiding cloaking.

In some cases, you just want to understand the targeting. So, if a website is providing different information to different people, you may simply want to understand who is it they show what information to because that may be important from a competitive positioning point of view.

The real thing that sets big data collection on the Internet apart from simple data collection on the Internet, is volume. You could be hitting a website hundreds of thousands to millions of times in a relatively short period.

And so even if you’re anonymous, even if you’ve done a thorough job of hiding who you are and where you’re coming from, it’s still going to be obvious to the website that someone is hitting them a hundred thousand times. It’s like shining a huge spotlight on their website. They’re going to see this activity; your IP address will show up right at the top of their logs. So the trick here is to diffuse your activity – rather than looking like one huge visitor hitting a hundred thousand times, you need to look like a huge number of relatively low activity visitors, all of which are sort of behaving in a normal way, at normal levels of intensity.

So what’s the give-away? The IP address is the real common denominator, it’s the thing everyone tracks, and it’s one of the hardest things to hide yourself. And the magic metric that you want to watch is the hits per target, per source IP address, per time period.

So you need a realistic number of connections coming, not just per day, but also per hour and per minute, to look plausible. You need to stay human. So, when you’re looking at hits per source IP per day, you might want to stay below, say, looking at fifty pages, while looking at it at a per minute basis, you probably need to make sure you’re staying below five pages depending on the website. And you’ll notice here that the number per minute, multiplied by the number of minutes in a day does not add up to the number per day, because no one sits at their computer clicking continuously all day on the same website, right? So, looking realistic involves all different timescales.

Now, some more paranoid sites are also looking for realistic surfing patterns. They’re looking more closely at how you visit the website, how you load the pages, do you, say, just grab the text off the pages and not the images, which is very common for basic web harvesting because it cuts down on the amount of data you need to grab a lot. But, it also really stands out – it looks very mechanical, it’s not the way a human accesses things. And also most scraping is faster than humans can access the web – if you’re clicking to a new page every second, now that doesn’t leave a lot of time for reading the information that’s out there. So, when trying to go against more sophisticated or paranoid websites, it’s very important to make sure your patterns look appropriate.

Cookies and other tracking mechanisms are another give-away. If they’re blocked entirely, many sites will just fail. But they also need to be turned over frequently or all the activity gets correlated. If you’re pretending to be a hundred people, you can’t have all hundred people using the same cookie, or you’ve undone all the work.

Many sites also check that all traffic with a given cookie comes from the same IP address. In many cases, they’ll embed an encrypted or scrambled version of the IP address in one of their cookies, so they can very quickly check to make sure that you haven’t changed addresses in mid-session. They’re mostly doing this to avoid session hijacking, but it always causes problems for scrapers.

So Ntrepid solutions enable quick integration with your existing scraping solutions to allow you to spread your activity across thousands of different source addresses.

For more sophisticated targets, we enable the creation of massively parallel independent sessions to emulate large numbers of individual realistic agents, ensuring the traffic will stand up to even detailed scrutiny.

For more information about this, and other Ntrepid products, please visit us at ntrepidcorp.com. You can also reach me directly with any questions or suggestions for future topics at lance.cottrell@ntrepidcorp.com. Thank you for listening.