Updated · Jan 10, 2024
Muninder Adavelli is a core team member and Digital Growth Strategist at Techjury. With a strong bac... | See full bio
Updated · Oct 25, 2023
Lorie is an English Language and Literature graduate passionate about writing, research, and learnin... | See full bio
IBM defines web scraping as the process that:
“[...] involves using a program or algorithm to extract and process large amounts of data from the web.”
Most people wonder: “Is web scraping legal?” Since the process involves collecting data from someone else’s website, it raises disputes about ethics and legality.
Web scraping is generally legal. However, that does not mean you can scrape every type of data. Legality still depends on what you scrape, how you use the scraped data, and how it affects the target site.
This article will cover more about the legality of web scraping and how you can perform web scraping legally and ethically.
Read on.
As mentioned above, web scraping is the process of extracting specific data sets from websites. It can be done manually by copying and pasting or automatically with scraper tools.
Though it is a common practice already, web scraping legality remains disputed. Web scraping legal issues have come to light as more websites file lawsuits against scrapers.
The following section will discuss some of these cases.
👍 Helpful Article Web scraping is often compared to APIs, as both are effective methods for extracting data. However, they differ in many ways. If you’re still deciding between an API and web scraping, knowing their differences will help you see which approach best suits your project.
The best way to understand web scraping ethics and legality is to see the related cases. These lawsuits have cleared some of the "gray" areas of the matter.
Here are some of the most crucial and recent scraping lawsuits:
The hiQ Labs v. LinkedIn case is a landmark in understanding web scraping legality.
It established that scraping publicly available data is legal. Also, it clarified the scope of the Computer Fraud and Abuse Act (CFAA).
The case started in 2017 when LinkedIn sent a cease-and-desist letter to hiQ Labs. This letter aimed to stop the analytics company’s scraping operations of LinkedIn profiles.
hiQ Labs used the scraped data for human resources services, such as gauging an employee's "flight" probability.
The Ninth Circuit's opinion in April 2022 declared that the CFAA is an anti-hacking law, so it's not applicable to LinkedIn’s case—or scraping public websites in general.
The 9th Circuit affirmed its prior decision, holding that LinkedIn could not block hiQ, a scraping entity, from scraping public LinkedIn profiles. The court found it was unlikely that hiQ had violated the CFAA. Read the analysis: https://t.co/Se8G6UvHlF pic.twitter.com/asvb3vUqca — Farella Braun+Martel (@FarellaBraun) April 27, 2022
However, in October 2022, LinkedIn succeeded in its breach of contract claims.
To be able to scrape and advertise, hiQ agents created LinkedIn accounts. This means the agents agreed to LinkedIn's Terms of Use.
The eBay v. Bidder's Edge case produced the first successful "trespass to chattels" claim in an online dispute.
📖 Definition Trespass to chattels is a civil claim in which someone intentionally interferes with another party’s possession of their property. The claim centers on deliberate interference, which can lead to compensation for damages.
The case started in December 1999 when Bidder's Edge refused to stop scraping eBay listings. Bidder's Edge is an auction aggregator company that uses data collected from sites like eBay.
It's worth noting that Bidder's Edge's scraping amounted to only 1.53% of the requests eBay received at that time. Bidder's Edge sent around 80,000 to 100,000 requests to eBay daily.
However, the danger of a "slippery slope" supported the injunction. Without it, other companies could have followed in scraping eBay, which could eventually have overloaded its servers.
The court ordered a preliminary injunction for Bidder's Edge to stop scraping. Both parties settled the legal dispute with an undisclosed amount in March 2001.
Though it is not an actual web scraping case, Authors Guild v. Google is an excellent example of how "fair use" applies to data collection.
In 2005, the Authors Guild and the Association of American Publishers filed two separate lawsuits against Google. The lawsuits alleged that the Google Print Project amounted to massive copyright infringement.
The Google Print Project (now Google Books) aims to digitize books for online indexing. To index a book, Google scans the entire text and makes it searchable, showing snippets in search results.
The project started with books from the public domain. It later expanded to copyrighted books through partnerships with major libraries.
After the two failed settlement attempts, the court finally ruled in November 2013. The court decided that the digitization is non-infringing and the project is highly transformative.
Moreover, its social benefits, especially to the sciences and arts, are of great value.
Generative AI has wholly taken over the world stage in the last few months. However, AI still has many legal obstacles—especially in training.
Web scraping is still largely unregulated. However, existing statutes cover some of its elements. Here are some of those laws:
One common argument against web scraping is that it "exceeds authorized access" to website resources. Unauthorized access was the primary concern of the CFAA when Congress enacted it in 1986.
However, in hiQ Labs v. LinkedIn, the Ninth Circuit found that this broad reading of the law is unlikely to apply to scraping public websites.
📝 Note The original purpose of the CFAA was to prevent hacking into credential-protected computer systems. Publicly rendered web pages are not protected that way.
The California Consumer Privacy Act (CCPA) gives consumers control over the personal data companies collect about them, including the rights to know what is collected, to have it deleted, and to opt out of its sale.
In this light, scraping personal information that is not publicly available can violate the CCPA.
Contract law enforces agreements made between parties. It applies to web scraping through "browsewrap" or "clickwrap" agreements.
If a pop-up window asks you to agree to a website's Terms of Use before doing any activity, clicking through it binds you to a contract.
This means that if the Terms of Use prohibit any form of web scraping, you should comply. Scraping anyway is a breach of contract, which can expose you to a civil lawsuit.
Developments in the US are reassuring, but these statutes do not apply in other countries. The legal treatment of web scraping differs from country to country.
Here are some of the web scraping regulations in other countries.
EU countries regulate web scraping based on personal data protection, copyright, and database ownership. Below are the EU policies that cover each of these.
The GDPR is an essential data protection and privacy law in the European Economic Area (EEA). It took effect in May 2018 and has inspired many countries' data privacy laws.
The scope and key provisions of the GDPR are similar to those of the CCPA. The significant difference is that the GDPR protects personal data whether or not it is publicly available.
The Digital Single Market (DSM) Directive was approved to protect press publications. It also aims to reduce the profit gap between online platforms and creators, and it introduces copyright exceptions for text and data mining (TDM).
Here are some of its crucial components:
In practice, a site's machine-readable policy generally means its robots.txt file.
According to Directive 96/9/EC, a database is protected if its maker has made a substantial investment in obtaining, verifying, or presenting its contents.
The exceptions stated in the DSM Directive also apply here. However, implementing these directives depends on the EU's member states.
Indian courts have not yet ruled on the legality of web scraping except for some intellectual property rights claims.
In the case of OLX BV and Ors. v. Padawan Ltd., the Delhi High Court ruled in favor of OLX to restrain Padawan Ltd. from its scraping activities.
Padawan Ltd. posted the data it scraped from OLX on its own product listing aggregator website. The defendant had scraped almost the entire database and listing compilations of OLX, which reflected significant investment and included OLX's logos and trademarks.
India's Information Technology Act of 2000 penalizes unauthorized computer access, including extracting data from a computer resource without the owner's consent (Sections 43 and 66).
A similar provision under the CFAA was found inapplicable to scraping public websites.
Section 72 penalizes breaches of confidentiality and privacy involving electronic records. This provision could be used against scraping personal data.
There has not been any notable case against web scrapers under the IT Act of 2000. However, since the Act is a comprehensive law, such a case is still possible.
Like most countries, Canada has no law that specifically targets web scraping. Instead, rulings against web scraping rely on other laws.
Some examples are contract law and Canada's Personal Information Protection and Electronic Documents Act (PIPEDA).
In 2021, three provincial privacy protection authorities ordered Clearview AI to:
Clearview AI stopped its services in Canada. However, it did not stop its data collection.
This outcome stems from the Office of the Privacy Commissioner's lack of order-making powers under PIPEDA. Some people have called for amendments to this law for better enforcement.
Ethical and legal web scraping is possible. Considering the laws and lawsuits discussed above, here's a guide on how you can scrape ethically and legally:
1. Avoid Scraping Password-Protected Content
Courts have found that the CFAA is unlikely to apply to scraping public websites. However, the situation is different if your target content is password-protected or sits behind technical barriers.
2. Be Cautious in Scraping Personal Data
Be cautious and selective in scraping personal data. While the CCPA excludes publicly available personal information from its protection, it is still best to be wary when handling personal data.
Also, keep in mind that the GDPR protects personal data even when it is publicly available. If you’re scraping sensitive personal information, you may face legal consequences.
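One possible precaution is to strip obvious personal identifiers from scraped text before storing it. The snippet below is only a minimal sketch: the function name `redact_personal_data` and both regular expressions are illustrative assumptions, pattern matching will miss many kinds of personal data, and it is not a substitute for a legal review.

```python
import re

# Illustrative patterns only (assumptions): they catch common email and
# phone formats but are far from exhaustive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_personal_data(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[email removed]", text)
    return PHONE_RE.sub("[phone removed]", text)


sample = "Contact jane.doe@example.com or +1 (555) 010-9999, price 25 USD"
cleaned = redact_personal_data(sample)
print(cleaned)
```

Short numbers such as prices are left alone because the phone pattern requires at least nine characters that start and end with a digit.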
3. Avoid Overburdening Any Website Server
Servers are much more powerful now than in the internet's infancy, so a trespass to chattels claim may not hold up as it did in eBay v. Bidder's Edge.
However, not overwhelming your target website with requests is basic decency and respect.
Do not scrape websites at peak hours, limit the number of requests per session, and space your requests out so they do not strain the site’s server.
✅ Pro Tip The most common way to avoid being blocked when scraping is to use proxy servers. Results depend on the type of proxy: residential proxies usually work better than data center ones because your requests look like they come from different devices. Proxies do not reduce the load on the target site’s server, so keep your request rate low either way.
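The advice on limiting requests can be sketched in a few lines of Python. This is a hedged illustration, not a standard recipe: the class name `PoliteSession`, the two-second interval, and the budget of 100 requests are all assumptions, and a suitable pace depends on the target site.

```python
import time


class PoliteSession:
    """Enforces a minimum delay between requests and a per-session cap."""

    def __init__(self, min_interval_s, max_requests,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self.max_requests = max_requests
        self.count = 0
        self._clock = clock      # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def acquire(self):
        """Call before every request. Returns the seconds spent waiting."""
        if self.count >= self.max_requests:
            raise RuntimeError("request budget for this session is used up")
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval_s - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        self.count += 1
        return delay


# Hypothetical settings: at most one request every 2 seconds, 100 per session.
session = PoliteSession(min_interval_s=2.0, max_requests=100)
```

A scraper would call `session.acquire()` right before each HTTP request, whichever HTTP client it uses.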
4. Automate robots.txt Recognition
If the site owner disallows scraping specific data on their website, it's best to respect that. Always review a website's robots.txt file, or build a robots.txt check into your web scraping script.
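As a rough sketch of such a check, Python's standard library includes `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `my-scraper` user agent below are made up for illustration; a real script would load the file from the target site, for example with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
for url in ("https://example.com/listings", "https://example.com/private/page"):
    print(url, is_allowed(parser, "my-scraper", url))
```

Running the check before every request lets the scraper skip disallowed paths automatically instead of relying on a one-time manual review.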
5. Review the Website's Terms of Use
Terms of Use agreements are enforceable. Bypassing these agreements can amount to a breach of contract.
If the owner asks you to agree to the terms before accessing the data, it's better to postpone the project until you get consent.
6. Scrape Only Factual Content
Facts are non-copyrightable. If you scrape factual data only, you are likely safe from copyright claims.
⚠️ Warning Databases are legally protected in some jurisdictions, such as the European Union. It's crucial to know where the website is based and check the regulations that apply there.
7. Ensure the Project Aims at a Transformative End Result
Scraping for scientific research can qualify for an exception to database protection. For other purposes, make sure your end product is highly transformative and does not compete with the source.
Make sure to measure the societal benefits and the welfare of the concerned parties.
You can avoid most legal gray areas in web scraping so long as you know what you are doing. Following the guidelines above can save you from most of the legal troubles you might face.
Also, it's best to study the laws of the jurisdictions relevant to your project and target data, or consult a lawyer.
Yes, as long as you follow their robots.txt file and Terms of Service. You must also consider the privacy laws of different jurisdictions since TikTok is used globally.
Scrapers can encounter legal issues coming from indiscriminate web scraping. Site owners can face fraud, theft of confidential data, and overloaded servers.
People scrape websites for scientific or market research and lead generation. Developers also use scraped data to train AI models.
Twitter's Terms of Service prohibit web scraping, so scraping the site can be a breach of contract. Even if the risk of lawsuits is low, be cautious and seek legal advice before starting your project.