Overcome Big Data Web Scraping Obstacles

Ntrepid Podcast 1: Overcome Big Data Web Scraping Obstacles

Big data is the big buzzword right now, and rightly so. There are really two kinds of big data out there: the internally generated data you collect in the course of business, and the big data that you go out and get. I’m going to focus on the second kind here. Going from basic Internet data collection to big data Internet collection introduces some real problems.

So let’s consider a couple of scenarios. In the first case, imagine you’re trying to collect a large amount of data from a web search engine to look at your SEO (Search Engine Optimization) rankings. You want to look at lots and lots of different search terms, and not just the first page of results but many pages of results, and that adds up to a lot of hits on the search engine’s site. They’re fairly quickly going to detect this activity: you’ll hit their throttles and they’ll block you from doing further searches. And staying below that threshold may make the job take hours or days, versus just minutes if you could go as fast as you possibly can.
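
To make the scale concrete, here is a minimal sketch of that kind of collection in Python. The search endpoint, its parameters, the page counts, and the delay values are illustrative assumptions, not a real search engine’s API; the pacing is exactly what stretches the job from minutes into hours or days.

```python
# Throttled collection of search rankings. SEARCH_URL, its parameters, and the
# delay values are illustrative assumptions, not a real search engine's API.
import time
import requests

SEARCH_URL = "https://search.example.com/results"   # hypothetical endpoint
TERMS = ["term one", "term two", "term three"]       # in practice, thousands of terms
PAGES_PER_TERM = 10                                  # not just the first page of results
DELAY_SECONDS = 20                                   # conservative pacing to stay under the throttle

def collect_rankings():
    session = requests.Session()
    for term in TERMS:
        for page in range(1, PAGES_PER_TERM + 1):
            resp = session.get(SEARCH_URL, params={"q": term, "page": page}, timeout=30)
            if resp.status_code in (403, 429):
                # Throttled or blocked: back off instead of hammering the site.
                time.sleep(10 * DELAY_SECONDS)
                continue
            yield term, page, resp.text
            time.sleep(DELAY_SECONDS)   # this pacing is why the job takes hours, not minutes
```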

Another scenario would be competitive intelligence. Imagine you need to get information on pricing or products, watch for trademark infringement, or monitor your resellers: there are lots of reasons you’d want to look at your competitors, or even your own subsidiaries, on the Internet. We see a lot of blocking here too when you do too much activity and exceed some kind of threshold. But we’re also seeing sites get really smart.

Imagine you’re an airline and you want to look up pricing for your competitors. Airline A wants to look at Airline B’s prices, and not just one price: every pair of cities, for every departure time, for every day between now and several months from now, because these prices aren’t static, they’re changing continuously. Now, if you’re detected, you actually get fed wrong information. The prices will be systematically incorrect. The site may make all the prices look higher than they really are, to trick you into competing against those prices so you won’t fill your seats. Or it may make them look lower than they really are, so you underprice and undercut your margins. So it’s very important to avoid detection when you’re going about these kinds of activities.

Many things can lead to these variations in information. It may not just be who you are; it may be your location, the time of day, or many other characteristics. For example, Orbitz for quite a while was showing more expensive hotels to people searching from Mac computers than to people searching from Windows computers.

The general principle here is that websites aren’t things. We often talk about “the” Internet, but that’s really very misleading. Much of the web is now created on the fly; it’s dynamic, more of a process than a thing. When you go to a webpage, it’s created in the moment you look at it, based on who you are, where you’re coming from, and what information the site has in its database. The page is assembled to order, just for you.

So the Internet, rather than being some thing that you can look at, is more like a hologram: you need to be able to look at it from multiple perspectives to really understand what it looks like.

When I talk about the main obstacles to big data collection, I’m usually thinking about blocking and cloaking. Blocking is what it says it is: the website simply prevents access, as in the initial scenarios. Cloaking is when a website is set up to provide different, false information, as in the airline example. You need to be able to reach the data without being blocked, and when you do reach it, the data you get back needs to be correct and real; that second part is avoiding cloaking.

In some cases, you just want to understand the targeting. If a website is providing different information to different people, you may simply want to understand who they show which information to, because that may be important from a competitive positioning point of view.
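
As a rough illustration of looking at a site from multiple perspectives, here is a minimal sketch, assuming you already have a couple of proxy exit points to route through; the proxy URLs below are placeholders. It fetches the same page through each vantage point and compares fingerprints of the responses. In practice you would compare extracted fields such as prices rather than raw HTML, which varies on every request.

```python
# Fetch the same URL through several vantage points and fingerprint the answers.
# The proxy URLs are placeholders for whatever exit points you actually have.
import hashlib
import requests

VANTAGE_POINTS = {
    "direct": None,
    "region-a": "http://proxy-a.example.net:8080",   # hypothetical exits
    "region-b": "http://proxy-b.example.net:8080",
}

def fingerprint(url):
    digests = {}
    for name, proxy in VANTAGE_POINTS.items():
        proxies = {"http": proxy, "https": proxy} if proxy else None
        body = requests.get(url, proxies=proxies, timeout=30).text
        digests[name] = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return digests

# Differing digests mean different viewers are being shown different content,
# whether that is legitimate personalization or deliberate cloaking.
print(fingerprint("https://www.example.com/pricing"))
```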

The real thing that sets big data collection on the Internet apart from simple data collection is volume. You could be hitting a website hundreds of thousands to millions of times in a relatively short period.

And so even if you’re anonymous, even if you’ve done a thorough job of hiding who you are and where you’re coming from, it’s still going to be obvious to the website that someone is hitting them a hundred thousand times. It’s like shining a huge spotlight on their website: they’re going to see the activity, and your IP address will show up right at the top of their logs. So the trick here is to diffuse your activity. Rather than looking like one huge visitor hitting the site a hundred thousand times, you need to look like a huge number of relatively low-activity visitors, all of them behaving in a normal way, at normal levels of intensity.
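
A minimal sketch of that diffusion, assuming a pool of proxy exit addresses (the addresses below are placeholders): each “visitor” gets its own session and its own source IP, and requests are spread across them so no single address piles up an implausible hit count.

```python
# Spread requests across many low-activity "visitors". Each persona has its own
# session (cookies) and its own exit address; the proxy pool is a placeholder.
import itertools
import requests

PROXY_POOL = [f"http://exit-{i}.example.net:8080" for i in range(100)]   # hypothetical

def make_persona(proxy_url):
    session = requests.Session()
    session.proxies = {"http": proxy_url, "https": proxy_url}
    return session

personas = itertools.cycle(make_persona(p) for p in PROXY_POOL)

def fetch(url):
    # Each request is attributed to a different source IP and cookie jar, so no
    # single address accumulates an implausible hit count.
    return next(personas).get(url, timeout=30).text
```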

What’s the give-away? The IP address is the real common denominator: it’s the thing everyone tracks, and it’s one of the hardest things to hide. The magic metric to watch is hits per target, per source IP address, per time period.
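
From the site’s side, that metric is easy to compute. Here is a small sketch, assuming a simplified access log of (timestamp, source IP) pairs, that counts hits per source IP per minute; a scraper funnelling all its traffic through one address tops this list immediately.

```python
# Compute the give-away metric from the site's point of view: hits per source
# IP per time bucket. The log format (timestamp, source IP) is an assumption.
from collections import Counter
from datetime import datetime

def hits_per_ip_per_minute(log):
    counts = Counter()
    for timestamp, source_ip in log:
        bucket = timestamp.replace(second=0, microsecond=0)
        counts[(source_ip, bucket)] += 1
    return counts

log = [
    (datetime(2013, 5, 1, 9, 0, 5), "203.0.113.7"),
    (datetime(2013, 5, 1, 9, 0, 9), "203.0.113.7"),
    (datetime(2013, 5, 1, 9, 1, 2), "198.51.100.4"),
]
# A scraper concentrating everything on one address tops this list at once.
print(hits_per_ip_per_minute(log).most_common(3))
```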

You need a realistic number of connections, not just per day but also per hour and per minute, to look plausible. You need to stay human. Looking at hits per source IP per day, you might want to stay below, say, fifty pages, while on a per-minute basis you probably need to stay below five pages, depending on the website. And you’ll notice that the per-minute number multiplied by the number of minutes in a day does not add up to the per-day number, because no one sits at their computer clicking continuously on the same website all day. Looking realistic involves all the different timescales.
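
One way to respect several timescales at once is a small pacer that remembers recent hits per window. The sketch below uses the example caps above (five pages per minute, fifty per day); the right numbers vary by target site and are an assumption here.

```python
# A pacer that enforces plausible limits on more than one timescale at once.
# The caps below mirror the example numbers above and will vary by target site.
import time
from collections import deque

class HumanPacer:
    def __init__(self, per_minute=5, per_day=50):
        # (window length in seconds, cap, timestamps of recent hits)
        self.windows = [(60, per_minute, deque()), (86400, per_day, deque())]

    def wait_for_slot(self):
        while True:
            now = time.time()
            delays = []
            for length, cap, hits in self.windows:
                while hits and now - hits[0] > length:
                    hits.popleft()                    # forget hits outside the window
                if len(hits) >= cap:
                    delays.append(hits[0] + length - now)
            if not delays:
                break
            time.sleep(max(delays))                   # wait until every window has room
        for _, _, hits in self.windows:
            hits.append(time.time())

pacer = HumanPacer()
# Call pacer.wait_for_slot() before each page fetch from a given source address.
```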

Now, some more paranoid sites are also looking for realistic surfing patterns. They look more closely at how you visit the website and how you load the pages. Do you, say, just grab the text off the pages and not the images? That’s very common for basic web harvesting because it cuts down a lot on the amount of data you need to grab, but it also really stands out: it looks mechanical, not the way a human accesses things. Most scraping is also faster than a human can browse the web; if you’re clicking to a new page every second, that doesn’t leave a lot of time for reading the information that’s there. So, when trying to go against more sophisticated or paranoid websites, it’s very important to make sure your patterns look appropriate.
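
A minimal sketch of a less mechanical fetch, using only the Python standard library plus requests: it pulls a page’s images the way a browser would and then dwells for a randomized, human-scale interval before moving on. The asset limit and the timings are illustrative assumptions.

```python
# A less mechanical fetch: load the page's images as a browser would, then
# dwell for a randomized, human-scale interval. Limits and timings are
# illustrative assumptions.
import random
import time
from html.parser import HTMLParser
from urllib.parse import urljoin
import requests

class ImageCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def browse(session, url):
    page = session.get(url, timeout=30)
    collector = ImageCollector()
    collector.feed(page.text)
    for src in collector.sources[:10]:       # pull supporting assets, not just the text
        session.get(urljoin(url, src), timeout=30)
    time.sleep(random.uniform(8, 40))        # leave plausible reading time between pages
    return page.text
```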

Cookies and other tracking mechanisms are another give-away. If they’re blocked entirely, many sites will simply fail. But they also need to be turned over frequently, or all the activity gets correlated: if you’re pretending to be a hundred people, you can’t have all hundred of them using the same cookie, or you’ve undone all the work. Many sites also check that all traffic with a given cookie comes from the same IP address. In many cases, they’ll embed an encrypted or scrambled version of the IP address in one of their cookies so they can very quickly check that you haven’t changed addresses mid-session. They’re mostly doing this to prevent session hijacking, but it always causes problems for scrapers.
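
Here is one way to handle that cookie hygiene, continuing the persona idea from the earlier sketch: each persona keeps its own cookie jar bound to a single exit address (placeholder URLs again), and the jar is retired after a modest number of requests so activity never correlates across personas.

```python
# Cookie hygiene per persona: fresh cookies every so often, always paired with
# the same exit address. Proxy URLs are placeholders.
import requests

class Persona:
    def __init__(self, proxy_url, max_requests=40):
        self.proxy_url = proxy_url
        self.max_requests = max_requests
        self._new_session()

    def _new_session(self):
        # New cookie jar, same exit IP: cookies and source address stay
        # consistent within a session, which is exactly what sites check.
        self.session = requests.Session()
        self.session.proxies = {"http": self.proxy_url, "https": self.proxy_url}
        self.requests_made = 0

    def get(self, url):
        if self.requests_made >= self.max_requests:
            self._new_session()                  # turn the cookies over
        self.requests_made += 1
        return self.session.get(url, timeout=30)

persona = Persona("http://exit-1.example.net:8080")   # hypothetical exit
```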

Ntrepid solutions integrate quickly with your existing scraping tools to let you spread your activity across thousands of different source addresses. For more sophisticated targets, we enable the creation of massively parallel, independent sessions that emulate large numbers of realistic individual agents, ensuring the traffic will stand up to even detailed scrutiny.

Transcript

Welcome to the Ntrepid audio briefs: Issue Number 1. My name is Lance Cottrell, Chief Scientist for Ntrepid Corporation. In this issue, I will be talking about collecting big data against resisting targets.

Big data is the big buzzword right now, and rightly so. There’s really two kinds of big data out there: there’s what you collect in the course of business, internally generated big data, and the big data that you go out and get. And I’m really going to focus on the second here. Going from basic Internet data collection to big data Internet collection introduces some real problems.

So let’s consider a couple scenarios. In the first case, imagine you’re trying to collect a large amount of data from a web search engine to look at your SEO (Search Engine Optimization) rankings. So you’re going to want to look at lots and lots of different search terms and not just the first page of results, but many pages of results, and this is going to add up to a lot of hits on the search engine site. They’re fairly quickly going to detect this activity and you’ll hit their throttles and they’ll block your activity – they’ll prevent you from being able to do the searches. And staying below that threshold may make your activity take hours or days versus just minutes if you could go as fast as you possibly can.

Another scenario would be looking for competitive intelligence. So, imagine you need to be getting information on pricing or product information, trademark infringement, monitoring your resellers – lots of different reasons you’d want to look at your competitors or even subsidiaries on the Internet. And we see a lot of blocking here too when you’re doing too much activity and exceeding some kind of threshold. But we’re also seeing sites getting really smart.

So, imagine you’re an airline and you want to look up pricing for your competitors. So, Airline A wants to look at Airline B’s prices, and they don’t want to just look at one price, they want to look at every pair of cities for every departure time for every day between now and several months from now, because we know these prices aren’t static, they’re changing continuously. Now what happens is that if you’re detected, you actually get fed wrong information, right? The prices will be systematically incorrect, they may make all the prices higher than they appear, higher than they really are, to trick you into competing against those prices, and therefore, you won’t get to fill your seats. Or they’ll make them look lower than they really are, get you to underprice and undercut your margins. So it’s really very important to avoid detection when you’re going about these kinds of activities.

Now there’s a lot of things that can lead to these variations in information. It may not just be who you are, it may be by location, or time of day or many other kinds of characteristics. For example, Orbitz for quite a while was showing more expensive hotels to people searching from Mac computers versus Windows computers.

The general principle here is that websites aren’t things. We often talk about “the” Internet, but that’s really very misleading. Much of the web is now created on the fly, it’s all dynamic, it’s more of a process than a thing. So, when you go to the webpage, it’s created in the moment you look at it, based on who you are, where you’re coming from, what information they have in the database. And then they assemble that page to order, just for you.

So the Internet, rather than being some thing that you can look at, is more like a hologram: you need to be able to look at it from multiple perspectives to really understand what it looks like.

So when I talk about the main obstacles to big data collection, I’m usually thinking about blocking and cloaking. And blocking is what it says it is, the website simply prevents access, and I talked about that in the initial scenarios. And cloaking is when a website is set up to provide different, false information, and that was what I talked about in the airline example – you need to get access to some kind of data, and it’s important that you be able to access it, that you not be blocked, and that when you do access it, the data you’re getting is correct and real, and that’s avoiding cloaking.

In some cases, you just want to understand the targeting. So, if a website is providing different information to different people, you may simply want to understand who is it they show what information to because that may be important from a competitive positioning point of view.

The real thing that sets big data collection on the Internet apart from simple data collection on the Internet, is volume. You could be hitting a website hundreds of thousands to millions of times in a relatively short period.

And so even if you’re anonymous, even if you’ve done a thorough job of hiding who you are and where you’re coming from, it’s still going to be obvious to the website that someone is hitting them a hundred thousand times. It’s like shining a huge spotlight on their website. They’re going to see this activity; your IP address will show up right at the top of their logs. So the trick here is to diffuse your activity – rather than looking like one huge visitor hitting a hundred thousand times, you need to look like a huge number of relatively low activity visitors, all of which are sort of behaving in a normal way, at normal levels of intensity.

So what’s the give-away? The IP address is the real common denominator, it’s the thing everyone tracks, and it’s one of the hardest things to hide yourself. And the magic metric that you want to watch is the hits per target, per source IP address, per time period.

So you need a realistic number of connections coming, not just per day, but also per hour and per minute, to look plausible. You need to stay human. So, when you’re looking at hits per source IP per day, you might want to stay below, say, looking at fifty pages, while looking at it at a per minute basis, you probably need to make sure you’re staying below five pages depending on the website. And you’ll notice here that the number per minute, multiplied by the number of minutes in a day does not add up to the number per day, because no one sits at their computer clicking continuously all day on the same website, right? So, looking realistic involves all different timescales.

Now, some more paranoid sites are also looking for realistic surfing patterns. They’re looking more closely at how you visit the website, how you load the pages, do you, say, just grab the text off the pages and not the images, which is very common for basic web harvesting because it cuts down on the amount of data you need to grab a lot. But, it also really stands out – it looks very mechanical, it’s not the way a human accesses things. And also most scraping is faster than humans can access the web – if you’re clicking to a new page every second, now that doesn’t leave a lot of time for reading the information that’s out there. So, when trying to go against more sophisticated or paranoid websites, it’s very important to make sure your patterns look appropriate.

Cookies and other tracking mechanisms are another give-away. If they’re blocked entirely, many sites will just fail. But they also need to be turned over frequently or all the activity gets correlated. If you’re pretending to be a hundred people, you can’t have all hundred people using the same cookie, or you’ve undone all the work.

Many sites also check that all traffic with a given cookie comes from the same IP address. In many cases, they’ll embed an encrypted or scrambled version of the IP address in one of their cookies, so they can very quickly check to make sure that you haven’t changed addresses in mid-session. They’re mostly doing this to avoid session hijacking, but it always causes problems for scrapers.

So Ntrepid solutions enable quick integration with your existing scraping solutions to allow you to spread your activity across thousands of different source addresses.

For more sophisticated targets, we enable the creation of massively parallel independent sessions to emulate large numbers of individual realistic agents, ensuring the traffic will stand up to even detailed scrutiny.

For more information about this, and other Ntrepid products, please visit us at ntrepidcorp.com. You can also reach me directly with any questions or suggestions for future topics at lance.cottrell@ntrepidcorp.com. Thank you for listening.