International Data Scraping Episode


Welcome to the Ntrepid Podcast, Episode #3.

My name is Lance Cottrell, Chief Scientist for Ntrepid Corporation.

In this episode, I will be talking about a global perspective on information scraping.

I have a problem with the phrase “The Internet”, because it implies that there is a “thing” out there, and that if we all look we will all see the same thing. In reality, the Internet is more like a hologram: it looks different to every viewer and from every direction.

In the early days, web pages were simply flat files. If you requested a web page, that file was just sent to you. The same file would be sent to everyone who asked. That is not how things work any more. These days, most web pages are dynamically generated. The page literally does not exist except as a set of rules and logic for how to create the page when requested. Those rules can include information about date, time, recent events, evolving content on the server, the location of the user, and that visitor’s history of activity on the website. The server then pulls together and delivers the page the visitor sees, which might be slightly or significantly different from what any other visitor sees.
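To make the idea of a page as "rules and logic" concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the `Visitor` fields, the `render_page` function, and the regional headlines are assumptions, not taken from any real site.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Visitor:
    """Hypothetical summary of what a server might know about a visitor."""
    country: str  # assumed to be derived from the visitor's IP address
    viewed_topics: list = field(default_factory=list)  # history on this site


# Invented regional content, purely for illustration.
LOCAL_HEADLINES = {
    "US": "Local story for U.S. readers",
    "AU": "Local story for Australian readers",
}


def render_page(visitor: Visitor, today: date) -> str:
    """Build the page on request from rules instead of sending a flat file."""
    lines = [f"Date: {today.isoformat()}"]
    # Rule 1: choose a headline based on the visitor's location.
    lines.append(LOCAL_HEADLINES.get(visitor.country, "General world news"))
    # Rule 2: add a suggestion based on the visitor's most recent activity.
    if visitor.viewed_topics:
        lines.append(f"More on {visitor.viewed_topics[-1]}")
    return "\n".join(lines)
```

Calling `render_page(Visitor("US"), date.today())` and `render_page(Visitor("AU", ["flights"]), date.today())` returns two different pages for the same request, which is the point: no single stored file is "the" page.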

A news site, for example, might show stories about your local area, a search engine could rank results based on your previous patterns of interest, and storefronts might adjust prices based on income levels in your area. There have even been examples of targeting based on computer brand, where more expensive hotels were shown to Mac users than to Windows users.

Consider this scenario: You’re traveling to Australia for a summer vacation. You plan to fly into Sydney and use that as your home base. Throughout the three weeks you will be Down Under, you will be making trips to Brisbane, Perth, and Melbourne.

Being the early planner that you are, you book your flights within Australia before you leave the U.S., from your U.S.-based IP address. Now, flash forward to your vacation… Once settled into your hotel room in Sydney, happily connected to the local hotel WiFi, you happen to browse flight prices from Sydney to your other Aussie destinations. Not only did you get killed by the exchange rate of the Australian dollar to the American dollar, but the Australian airline knocks off an additional 10% for its domestic travelers.

So it is not enough to get just one picture of a website. To really understand what is there, it must be observed from multiple different perspectives. One of the most important perspectives is location. Altering content based on the country or region of the visitor is really quite common.

Imagine that you are the Product Manager for a high-tech consumer product. You are constantly keeping your eyes on your competitors to make sure you are staying ahead of them in technology, market share, and price. You are in the U.S., but your main competitors are overseas. So you conduct your research from your work computer, unaware that your corporate-branded U.S. IP address sticks out like a sore thumb every time you hit their site. In fact, they have noticed your pattern of checking pricing on Mondays and Fridays, tech specs every Tuesday, and financials on the first of every month. After a while, you notice that their site is getting quite stagnant. While they used to adjust their pricing weekly and their tech specs every month, they have not changed a thing in the last couple of months… or so you think.

Some emails from overseas partners suggest that you are missing something. Turns out, your competition got wise to what you were doing and is now spoofing you by posting old data every time their website is visited by your company’s IP address range. If you had access to non-attributable U.S. IP addresses, or better yet, IP addresses that are regionally close to your competitor, you would be able to get the scoop on what they were doing, and they would be none the wiser.

Obviously this pattern would have been even clearer, and the change probably less noticeable, if you had been doing automated scraping, as opposed to just being a human at the keyboard. To detect this, your scraping activity needs to be duplicated and originate from different areas. Test any given website for this kind of modification by scraping random samples of data from the site from another location and comparing them to your standard scraping results. If they differ, you may need to repeat most or all of your activity from one or more other locations in addition to your primary scraping location.
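The sampling check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `primary_pages` is assumed to be a dictionary of URL to page text from your standard scraping run, and `fetch_from_elsewhere` is a hypothetical function you would supply that fetches a URL from a different location. Neither name comes from any particular tool.

```python
import hashlib
import random


def fingerprint(content: str) -> str:
    """Hash page text after collapsing whitespace, so spacing changes are ignored."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pick_sample(urls, sample_size, seed=None):
    """Choose a random subset of URLs to re-check (all of them if there are few)."""
    rng = random.Random(seed)
    candidates = sorted(urls)
    return rng.sample(candidates, min(sample_size, len(candidates)))


def find_mismatches(primary_pages, fetch_from_elsewhere, sample_size=5, seed=None):
    """Re-fetch a random sample from another location; return URLs that differ.

    primary_pages: dict of URL -> text from the standard scraping run (assumed).
    fetch_from_elsewhere: hypothetical callable, URL -> text from another location.
    """
    mismatched = []
    for url in pick_sample(primary_pages, sample_size, seed):
        if fingerprint(fetch_from_elsewhere(url)) != fingerprint(primary_pages[url]):
            mismatched.append(url)
    return mismatched
```

A non-empty result suggests the site may be serving different content by location. Keep in mind that dynamic pages can also differ for harmless reasons such as timestamps or rotating ads, so in practice you would probably compare only the fields you care about, for example the price, instead of the whole page.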

Ntrepid maintains facilities in many different countries around the world specifically for this purpose. It is easy to specify the location of origin of any given scraping activity. Our large pools of IP addresses in each location allow you to disguise your activity just as you would when scraping from our domestic IP address space.

For more information about anonymous web scraping tools, best practices, and other Ntrepid products, please visit us on the web at ntrepidcorp.com, and follow us on Facebook and on Twitter @ntrepidcorp.

You can also reach me by email with any questions or suggestions for future topics at lance.cottrell@ntrepidcorp.com.

Thanks for listening.
