A Global Perspective on Web Scraping

Ntrepid Podcast 3: A Global Perspective on Web Scraping

Transcript

Welcome to the Ntrepid Podcast, Episode #3.

My name is Lance Cottrell, Chief Scientist for Ntrepid Corporation.

In this episode, I will be talking about a global perspective on information scraping.

I have a problem with the phrase “The Internet”, because it implies that there is a “thing” out there, and that if we all look we will all see the same thing. In reality, the Internet is more like a hologram: it looks different to every viewer and from every direction.

In the early days, web pages were simply flat files. If you requested a web page, that file was just sent to you, and the same file would be sent to everyone who asked. That is not how things work anymore. These days, most web pages are dynamically generated: the page literally does not exist except as a set of rules and logic for how to create it on request. Those rules can draw on the date and time, recent events, evolving content on the server, the visitor’s location, and that visitor’s history of activity on the site. The server pulls all of this together and delivers the page the visitor sees, which might be slightly or significantly different from what any other visitor sees.
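To make that concrete, here is a minimal sketch, in Python with Flask, of a page that exists only as rules evaluated per request. The lookup_region helper, the address prefix it checks, and the price table are hypothetical stand-ins, not any real site’s logic:

```python
# Minimal sketch of a dynamically generated page (Python/Flask).
from datetime import date

from flask import Flask, request

app = Flask(__name__)

# Hypothetical per-region price table.
PRICES = {"US": "$499", "AU": "$449"}

def lookup_region(ip: str) -> str:
    """Hypothetical stand-in; a real server would query a GeoIP database."""
    return "AU" if ip.startswith("203.") else "US"

@app.route("/")
def product_page():
    # The page is assembled per request: the visitor's location, the date,
    # and the visitor's history (here, a cookie) all shape the response.
    region = lookup_region(request.remote_addr or "")
    greeting = "Welcome back" if request.cookies.get("visited") else "Hello"
    return f"<p>{greeting}! Today ({date.today()}) the {region} price is {PRICES[region]}.</p>"
```

Two visitors requesting the same URL from different networks get different pages, and neither ever sees a “canonical” version.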

A news site, for example, might show stories about your local area, a search engine could rank results based on your previous patterns of interest, and storefronts might adjust prices based on income levels in your area. There have even been examples of targeting based on computer brand, where more expensive hotels were shown to Mac users than to Windows users.

Consider this scenario: You’re traveling to Australia for a summer vacation. You plan to fly into Sydney and use that as your home base. Throughout the three weeks you will be Down Under, you will be making trips to Brisbane, Perth, and Melbourne.

Being the early planner that you are, you book your flights within Australia before you leave the U.S., from your U.S.-based IP address. Now, flash forward to your vacation… Once settled into your hotel room in Sydney, happily connected to the local hotel WiFi, you happen to browse flight prices from Sydney to your other Aussie destinations. Not only are you getting killed by the exchange rate between the Australian and American dollars, but the Australian airline knocks an additional 10% off for its domestic travelers, a discount you never saw when booking from your U.S. IP address.

So it is not enough to get just one picture of a website. To really understand what is there, it must be observed from multiple perspectives. One of the most important perspectives is location: altering content based on the country or region of the visitor is quite common.

Imagine that you are the Product Manager for a high-tech consumer product. You are constantly keeping an eye on your competitors to make sure you are staying ahead of them in technology, market share, and price. You are in the U.S., but your main competitors are overseas. So you conduct your research from your work computer, unaware that your corporate-branded U.S. IP address stands out like a sore thumb every time you hit their site. In fact, they have noticed your pattern of checking pricing on Mondays and Fridays, tech specs every Tuesday, and financials on the first of every month. After a while, you might notice that their site is getting quite stagnant. While they used to adjust their pricing weekly and their tech specs every month, they have not changed a thing in the last couple of months… or so you thought.

Some emails from overseas partners suggest that you are missing something. It turns out your competition got wise to what you were doing and is now spoofing you, serving old data every time their website is visited from your company’s IP address range. If you had access to non-attributable U.S. IP addresses, or better yet, IP addresses regionally close to your competitor, you would be able to get the scoop on what they were doing, and they would be none the wiser.

Obviously this pattern would have been even clearer to them, and the change even less noticeable to you, if you had been doing automated scraping rather than browsing by hand. To detect this kind of manipulation, your scraping activity needs to be duplicated from different origins: periodically scrape random samples of data from the site through another location and compare them with your standard scraping results. If they differ, you may need to repeat most or all of your activity from one, or even more than one, location in addition to your primary scraping location.
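Here is one way that spot check might look: a minimal sketch in Python using the requests library. The proxy endpoint is a placeholder, not a real service:

```python
# Sketch: fetch sample pages directly and through a proxy in another
# region, then compare fingerprints; mismatches suggest the site is
# serving different content to different origins.
import hashlib
import random

import requests

# Placeholder endpoint; substitute a real exit in another region.
OTHER_REGION_PROXY = {"http": "http://proxy.example.com:8080",
                      "https": "http://proxy.example.com:8080"}

def fingerprint(html: str) -> str:
    """Hash the page body so large responses are cheap to compare."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def spot_check(urls: list[str], sample_size: int = 5) -> list[str]:
    """Return sampled URLs whose content differs between the two origins."""
    suspicious = []
    for url in random.sample(urls, min(sample_size, len(urls))):
        direct = requests.get(url, timeout=30).text
        relayed = requests.get(url, timeout=30, proxies=OTHER_REGION_PROXY).text
        if fingerprint(direct) != fingerprint(relayed):
            suspicious.append(url)
    return suspicious
```

In practice raw HTML rarely matches byte for byte across origins (timestamps, session tokens, rotating ads), so a real comparison would normalize the pages or compare extracted fields rather than whole-page hashes.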

Ntrepid maintains facilities in many different countries around the world specifically for this purpose. It is easy to specify the location of origin of any given scraping activity. Our large pools of IP addresses in each location allow you to disguise your activity just as you would when scraping from our domestic IP address space.
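The plumbing for that kind of per-request origin selection can be quite simple. The sketch below is a generic illustration in Python; the regional pools and proxy URLs are hypothetical and do not represent Ntrepid’s actual interface:

```python
# Generic illustration of choosing a scraping origin per request from
# regional proxy pools. All pool entries here are hypothetical.
import random

import requests

PROXY_POOLS = {
    "us": ["http://us1.proxy.example:8080", "http://us2.proxy.example:8080"],
    "de": ["http://de1.proxy.example:8080", "http://de2.proxy.example:8080"],
    "au": ["http://au1.proxy.example:8080"],
}

def fetch_from(region: str, url: str) -> str:
    """Fetch url through a randomly chosen exit in the given region."""
    proxy = random.choice(PROXY_POOLS[region])
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    resp.raise_for_status()
    return resp.text

# Scrape the same page as seen from the U.S. and from Australia:
# us_view = fetch_from("us", "https://example.com/pricing")
# au_view = fetch_from("au", "https://example.com/pricing")
```

Rotating randomly within a large pool also keeps any single address from showing the telltale access pattern described above.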

For more information about anonymous web scraping tools, best practices, and other Ntrepid products, please visit us on the web at ntrepidcorp.com, and follow us on Facebook and on Twitter @ntrepidcorp.

You can also reach me by email with any questions or suggestions for future topics at lance.cottrell@ntrepidcorp.com.

Thanks for listening.