Grabbing Data From the Web? Our Copyright Guide Outlines What You Need to Know About Web-Scraping, Web-Crawling and APIs

By Taylor Faires, Copyright Student Fellow

The Center for Academic Innovation has released a new guide to help people understand the complexities and considerations involved in fetching and using content such as databases, metadata, and code from other sources.

The guide outlines the laws governing online material, gives relevant examples of those laws in action, and provides a copyright consideration checklist that people can use when evaluating any potential collection and use of online content.

Engaging With Online Content

When you think of online content, you might first think of the front-facing content you can see: text, images, and videos. Using front-facing content raises a handful of considerations, including who owns the content, whether the content is copyrightable, and whether your use of the content would be a fair use. But sometimes you want to use back-facing content, such as databases, metadata, or code, and using these types of content requires additional considerations.

The purpose of this guide is to go over some common engagements with online content that go beyond the use of front-facing matter on websites: using online databases or APIs, and using web-scraping or web-crawling to obtain and use online content. The first section of this guide provides a brief explanation of each type of engagement, followed by a section describing the rules and regulations that affect these engagements. Finally, this guide closes with a checklist to help you start working through your own intended use.

Types of Engagement

Web-Scraping and Web-Crawling

Web-scraping is using a program to gather content from a website. This is most commonly done on the site's HTML, which governs the structure of its pages. Scraping the HTML allows a user to download and organize a website's content, including the links and data it uses. Web-crawling works similarly; however, web-scraping handles only one web page at a time, while web-crawling automatically scrapes a web page and all pages that are linked from that web page.
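To make the difference concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not a method taken from the guide: the `scrape_links` and `crawl` helpers, the `max_pages` limit, and the in-memory `PAGES` "site" are all hypothetical, and the example reads from a dictionary instead of the network so it can be run safely.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def scrape_links(html, base_url):
    """Scraping: pull the links out of a single page's HTML."""
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links such as "/about" against the page's own URL.
    return [urljoin(base_url, link) for link in parser.links]


def crawl(start_url, fetch, max_pages=10):
    """Crawling: scrape a page, then follow its links and scrape those too.

    `fetch` is a caller-supplied function that maps a URL to HTML text,
    or returns None when the page is unavailable.
    """
    seen, queue = [], [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        html = fetch(url)
        if html is None:
            continue
        seen.append(url)
        queue.extend(scrape_links(html, url))
    return seen


# Hypothetical in-memory "site" so the sketch runs without network access.
PAGES = {
    "https://example.org/": '<a href="/about">About</a> <a href="/data">Data</a>',
    "https://example.org/about": '<a href="/">Home</a>',
    "https://example.org/data": "<p>No links here.</p>",
}

print(scrape_links(PAGES["https://example.org/"], "https://example.org/"))
print(crawl("https://example.org/", PAGES.get))
```

The scraper touches exactly one page, while the crawler keeps following links until it runs out of new pages or reaches `max_pages`. That automatic reach is why crawling tends to raise more of the concerns discussed in this guide.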

APIs and Databases

API stands for Application Programming Interface. It has many different uses, but this guide will only discuss APIs that are meant to allow users to easily request and store a website’s data through programming. Databases can, but do not always, take the form of an API. 
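As a rough illustration of requesting a website's data through programming, the sketch below builds a request URL and reads a JSON reply. Everything specific here is an assumption made for the example: the `api.example.org` endpoint, the `q` and `limit` parameters, and the shape of the sample response do not describe any real service, and no network request is actually sent.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, used for illustration only.
BASE_URL = "https://api.example.org/v1/records"


def build_request_url(query, limit=5):
    """Builds the URL a program would request from the (hypothetical) API."""
    return BASE_URL + "?" + urlencode({"q": query, "limit": limit})


def parse_titles(response_text):
    """Extracts the 'title' field from each record in a JSON response."""
    payload = json.loads(response_text)
    return [record["title"] for record in payload["records"]]


# A made-up response body standing in for what such an API might return.
SAMPLE_RESPONSE = '{"records": [{"title": "First"}, {"title": "Second"}]}'

print(build_request_url("open data"))
print(parse_titles(SAMPLE_RESPONSE))
```

Unlike scraped HTML, an API response usually arrives already structured, which is part of why APIs make it easy to store a site's data. The provider's terms for the API still govern what you may do with that data.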

Laws Governing Online Content

Although it is legal to use web-crawling, web-scraping, and APIs to gather data, there are limitations. The limitations typically fall within these four categories:

  1. Computer Fraud and Abuse Act
  2. Terms of Service
  3. Copyright
  4. Trespass to Chattels

This guide will go through each of these categories in more depth. But first, there are a few common questions that can help ensure your use is authorized:

  1. Does the website’s Terms of Service forbid any of your uses? If the terms say that reproduction or copying is not permitted, you cannot legally display any information you gather from the site. If the Terms of Service prohibits web-crawling or web-scraping, you should not use web-crawling or web-scraping on the site. 
  2. Is the information private? Determine whether the information you’re using could be considered “private.” Did you have to log in to see the information? Is it personal in nature? If so, there may be additional restrictions on its use.
  3. Is there a “robots.txt” file on the website that disallows web-crawling or web-scraping? Laws surrounding web-crawling and web-scraping prohibit you from bypassing these instructions.

If you answer “yes” to any of these questions, consider finding another source for the data. Answering “no” to all of these questions does not guarantee that your use is legal, though it does make it more likely that you are able to use the content. 
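The robots.txt question is the one item on this list that a program can check for itself. The sketch below uses Python's standard `urllib.robotparser` module to test a URL against a site's rules. The `is_allowed` helper, the `my-bot` user agent, and the sample robots.txt content are made up for illustration, and the rules are parsed from a string so that nothing is fetched from a real site.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real file is served at the root of a
# website, for example https://example.org/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Returns True if the robots.txt rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-bot" is a hypothetical crawler name.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.org/public/page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.org/private/report"))
```

A check like this covers only the third question. Reading the Terms of Service and judging whether information is private still have to be done by a person.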
