Advanced OSINT Training Series: Using the Internet Archive

What is The Internet Archive?

The Internet Archive, or Archive.org, is a digital library that provides public access to an archive of both current and historical versions of digitized materials, such as web pages, newspapers, software applications, images, books, and more. The non-profit organization started this massive web archiving project in 1996, and it has grown to be one of the most powerful tools for open-source research. Doubling as an activist organization, Archive.org advocates for a free and open internet for everyone and states its mission as providing “universal access to all knowledge.”

Though not the only web archiving service available, the Internet Archive is expansive. The site currently provides digital access to over 618 billion web pages, 28 million books and texts, 14 million audio recordings, 6 million videos (including 2 million TV news programs), 3.5 million images, and 580,000 software programs. The bulk of this data is automatically indexed by web crawlers that comb through the public web, attempting to preserve as much as possible. However, anyone can create a free account with the Internet Archive and upload media. The site serves millions of users each day, putting it in the top 300 most-used sites in the world. Its “Wayback Machine” feature lets users search through archived web history.

How Can I Use the Internet Archive for OSINT Research?

When conducting OSINT research in these fast-paced digital times, analysts often need access to historical versions of websites or content that no longer exists. This is where the Wayback Machine comes into play. For instance, an analyst could search for the original version of an online news article that was altered after publication. Older versions of web pages can often provide relevant information like names, phone numbers, social media posts, email addresses, images, metadata, or even illicit or illegal content that has since been deleted or hidden.
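As a rough illustration of the altered-article scenario, the sketch below asks the Wayback Machine's availability endpoint for the capture closest to a given date. The function names (`build_availability_url`, `closest_snapshot`, `lookup`) are hypothetical helpers written for this example, and the response layout shown in the comments is an assumption based on the endpoint's usual JSON output, so treat it as a starting point rather than a finished tool.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Public availability endpoint of the Wayback Machine.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url: str, timestamp: str = "") -> str:
    """Build a query for the capture closest to `timestamp` (e.g. YYYYMMDD)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the closest capture's URL from a parsed response, or None.

    Assumed layout: {"archived_snapshots": {"closest": {"available": true,
    "url": "...", "timestamp": "..."}}}
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url: str, timestamp: str = "") -> Optional[str]:
    """Perform the network request and return the closest capture URL."""
    with urlopen(build_availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `lookup("www.example.com", "20200101")` would return a capture URL near that date if one exists. Comparing that capture against the live page is then a manual step.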

In a recent blog post, the team at OSINTCurio.us reviewed some methods to extract information using the Wayback Machine. We have included a few of their helpful techniques below:

Quick Search for a URL

To see all archived captures of a particular web page, enter https://web.archive.org/web/*/ into your browser, followed by the URL of interest. Example: https://web.archive.org/web/*/www.example.com
You can also manually access the Wayback Machine (https://archive.org/web) and enter the URL of interest into the search bar.
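When checking many URLs, the quick-search pattern can be generated rather than typed by hand. The following is a minimal sketch: `capture_index_url` and `snapshot_url` are hypothetical helper names, and the second function assumes the common Wayback convention of placing a numeric timestamp (YYYYMMDD or longer) in the path to request the capture nearest that time.

```python
WAYBACK_BASE = "https://web.archive.org/web"


def capture_index_url(target: str) -> str:
    """Return the Wayback URL that lists all captures of `target`."""
    return f"{WAYBACK_BASE}/*/{target.strip()}"


def snapshot_url(target: str, timestamp: str) -> str:
    """Return a Wayback URL for the capture of `target` nearest `timestamp`."""
    if not timestamp.isdigit():
        raise ValueError("timestamp must contain digits only, e.g. 20200101")
    return f"{WAYBACK_BASE}/{timestamp}/{target.strip()}"


if __name__ == "__main__":
    # Print index URLs for a small, made-up list of targets.
    for site in ["www.example.com", "www.example.org/about"]:
        print(capture_index_url(site))
```

The generated links can be pasted into a browser or a case log. No request is sent until a link is opened.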

Quick Search for a Domain

To view all archived pages under a particular domain, add an asterisk at the end of the URL as a wildcard: https://web.archive.org/web/*/www.example.com/*
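The same wildcard lookup can be approximated programmatically through the Wayback Machine's CDX server, which returns capture listings as JSON. This is a sketch under assumptions: the helper names are hypothetical, the chosen parameters (`fl`, `collapse`, `limit`) are one reasonable selection rather than a required set, and the sample rows in the usage note are invented for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(site: str, limit: int = 50) -> str:
    """Build a CDX query for captures whose URL starts with `site`."""
    params = {
        "url": f"{site.rstrip('/')}/*",  # trailing wildcard = prefix match
        "output": "json",
        "fl": "timestamp,original",      # return only these two fields
        "collapse": "urlkey",            # one row per distinct URL
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def rows_to_records(rows: list) -> list:
    """Convert CDX JSON output (header row + data rows) into dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def list_captures(site: str, limit: int = 50) -> list:
    """Perform the network request and return capture records."""
    with urlopen(build_cdx_url(site, limit), timeout=60) as resp:
        body = resp.read()
    # An empty body means no captures matched the query.
    return rows_to_records(json.loads(body)) if body.strip() else []
```

Given rows such as `[["timestamp", "original"], ["20200101000000", "http://www.example.com/a"]]`, `rows_to_records` yields one dictionary per capture, which is easier to filter or export than the raw nested list.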

Advanced Search Techniques

OSINT researchers can easily conduct basic keyword searches for topics or persons of interest. The archive also offers advanced search features for more targeted queries. Some files require an account to access the content. Additionally, the email address connected to the account that uploaded a file is often discoverable using a few short techniques. This exposure leads most users to create research accounts using a pseudonym and a burner email address. After identifying an uploader's email address, OSINT researchers can run additional searches to see if it has been used elsewhere. Because of these types of risks, Ntrepid recommends using a comprehensive managed attribution system when collecting from, or uploading content to, the Internet Archive.
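For repeatable keyword searches, a query can be assembled for the archive's advanced search endpoint instead of the web form. The sketch below is illustrative only: `build_search_url` and `item_urls` are hypothetical helpers, the query wraps the keywords in quotes for a phrase match, and the response layout (`response` then `docs`) is assumed from the endpoint's usual JSON output.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(keywords: str, mediatype: str = "", rows: int = 25) -> str:
    """Build an advanced-search URL for a phrase, optionally by media type."""
    query = f'"{keywords}"'
    if mediatype:
        query += f" AND mediatype:{mediatype}"
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # fields to return for each item
        ("fl[]", "title"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def item_urls(response: dict) -> list:
    """Turn a parsed search response into item detail-page URLs."""
    docs = response.get("response", {}).get("docs", [])
    return [
        f"https://archive.org/details/{doc['identifier']}"
        for doc in docs
        if "identifier" in doc
    ]
```

For example, `build_search_url("acme corp", mediatype="texts")` produces a URL restricted to text items. Requests made this way are subject to the same attribution considerations as browsing the site directly.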

Archive.org continuously adds to its already massive library of digital content. However, some sites or files may not be available, either because of robots.txt restrictions or because the site owner requested their removal. Luckily, researchers can use additional web archive services, such as Archive.today, CachedPages, or Webcitation, to help locate their desired content.
