Advanced OSINT Training Series: Using the Internet Archive

What Is the Internet Archive?

The Internet Archive, or Archive.org, is a digital library that provides public access to an archive of both current and historical versions of digitized materials, such as web pages, newspapers, software applications, images, books, and more. The non-profit organization started this massive web archiving project in 1996, and it has grown to be one of the most powerful tools for open-source research. Doubling as an activist organization, Archive.org advocates for a free and open internet for everyone, stating their mission to provide “universal access to all knowledge.”  

Though not the only web archiving service available, the Internet Archive is expansive. The site currently provides digital access to over 618 billion web pages, 28 million books and texts, 14 million audio recordings, 6 million videos (including 2 million TV news programs), 3.5 million images, and 580,000 software programs. The bulk of this data is collected automatically by web crawlers that comb through the public web, attempting to preserve as much as possible. However, anyone can create a free account with the Internet Archive and upload media. The site serves millions of users each day, placing it among the 300 most-used sites in the world. Its “Wayback Machine” feature lets users search through archived web history.

How Can I Use the Internet Archive for OSINT Research? 

When conducting OSINT research, analysts often need access to historical versions of websites or to content that no longer exists. This is where the Wayback Machine comes into play. For instance, an analyst could search for the original version of an online news article that was altered after publication. Older versions of web pages can often provide relevant information like names, phone numbers, social media posts, email addresses, images, metadata, or even illicit content that has since been deleted or hidden.
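As one way to look for an earlier version of a page programmatically, the sketch below uses the Wayback Machine's public availability endpoint, which returns the capture closest to a requested date. The endpoint is not mentioned in the original post, the function names are illustrative, and the sample response is a hypothetical, hand-written example of the JSON shape rather than real data.

```python
from typing import Optional
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint (not named in the original post).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page: str, timestamp: str = "") -> str:
    """Build a query asking for the capture of `page` closest to `timestamp`."""
    params = {"url": page}
    if timestamp:
        # Timestamp format is YYYYMMDD, optionally extended with hhmmss.
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the archived URL from a parsed JSON response, or None if absent."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Hypothetical response, written by hand to show the expected structure.
sample_response = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20200101000000/http://www.example.com/story",
            "timestamp": "20200101000000",
            "status": "200",
        }
    }
}
```

An analyst could fetch `availability_url("www.example.com/story", "20200101")` with any HTTP client, parse the JSON, and pass the result to `closest_snapshot` to get a capture dated near the article's original publication for comparison with the live page.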

In a recent blog post, the team at OSINTCurio.us reviewed some methods to extract information using the Wayback Machine. We have included a few of their helpful techniques below:

Quick Search for a URL  

  • To see all archived versions of a particular page, enter https://web.archive.org/web/*/ into your browser, followed by the URL of interest. Example: https://web.archive.org/web/*/www.example.com
  • You can also manually access the Wayback Machine (https://archive.org/web) and enter the URL of interest into the search bar.   
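The URL pattern above is simple enough to generate for a list of targets. A minimal sketch, assuming the target is passed exactly as it would be typed into the browser; the function name is illustrative and not part of any Internet Archive tooling.

```python
# Prefix used by the Wayback Machine for the "all captures" view of a URL.
WAYBACK_PREFIX = "https://web.archive.org/web/*/"


def wayback_calendar_url(target: str) -> str:
    """Return the Wayback Machine listing URL for a single page."""
    # Only surrounding whitespace is removed; the target itself is appended unchanged.
    return WAYBACK_PREFIX + target.strip()


# Build listing URLs for several pages of interest.
targets = ["www.example.com", "www.example.com/about"]
listing_urls = [wayback_calendar_url(t) for t in targets]
```

Each entry in `listing_urls` can be pasted into a browser or written to a case log alongside the original URL.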

Quick Search for a Domain 

  • To view all archives for a particular domain, add an asterisk at the end of the address as a wildcard. Example: https://web.archive.org/web/*/www.example.com/*
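The same wildcard idea can be expressed as a query to the Wayback Machine's CDX server, which returns a machine-readable list of captured URLs. The CDX endpoint is not covered in the original post, and the function names, the default limit, and the choice of `collapse=urlkey` are assumptions made for illustration.

```python
from urllib.parse import urlencode

WAYBACK_PREFIX = "https://web.archive.org/web/*/"
# Wayback Machine CDX server endpoint (not named in the original post).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def wayback_domain_url(domain: str) -> str:
    """Browser URL listing every archived path under `domain`."""
    return WAYBACK_PREFIX + domain.strip().rstrip("/") + "/*"


def cdx_query_url(domain: str, limit: int = 50) -> str:
    """CDX query URL for captures under `domain`, returned as JSON."""
    params = {
        "url": domain.strip().rstrip("/") + "/*",  # trailing wildcard requests a prefix match
        "output": "json",
        "collapse": "urlkey",  # ask for one row per unique URL rather than every capture
        "limit": limit,
    }
    return CDX_ENDPOINT + "?" + urlencode(params)
```

Fetching the result of `cdx_query_url("www.example.com")` with an HTTP client should yield rows that can be filtered for file types or paths of interest; the limit keeps exploratory queries small.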

Advanced Search Techniques 

OSINT researchers can easily conduct basic keyword searches for topics or persons of interest, and the archive also offers advanced search features for more targeted queries. Some files require an account to access. Additionally, the email address connected to the account that uploaded a file is often discoverable using a few short techniques, and once an email address is identified, OSINT researchers can run additional searches to see if it has been used elsewhere. Since the same exposure applies to an analyst's own account, most researchers create research accounts using a pseudonym and a burner email address. Given these risks, Ntrepid recommends using a comprehensive managed attribution system when collecting from, or uploading content to, the Internet Archive.
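For the targeted queries mentioned above, a keyword search can also be expressed as a URL against archive.org's advanced search endpoint. This is a rough sketch: the endpoint is not named in the original post, and the function name, the default media type, and the returned fields are illustrative choices.

```python
from urllib.parse import urlencode

# archive.org advanced search endpoint (not named in the original post).
ADVANCED_SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def advanced_search_url(keyword: str, mediatype: str = "texts", rows: int = 25) -> str:
    """Build a JSON search URL for an exact phrase within one media type."""
    # Quote the keyword so it is searched as a phrase, then restrict by media type.
    query = f'"{keyword}" AND mediatype:{mediatype}'
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # fields to return for each matching item
        ("fl[]", "title"),
        ("rows", rows),
        ("output", "json"),
    ]
    return ADVANCED_SEARCH_ENDPOINT + "?" + urlencode(params)


search_url = advanced_search_url("press release")
```

Requests like this are subject to the same attribution considerations as browsing the site, so they are best issued from the research environment rather than a personal connection.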

Archive.org continuously adds to its already massive library of digital content. However, some sites or files may not be available, either because of restrictions in a site's robots.txt file or because the site owner requested removal. Luckily, researchers can use additional web archive services, such as Archive.today, CachedPages, or Webcitation, to help locate their desired content.
