Web scraping and web crawling often appear together, and the two terms are sometimes mistakenly used interchangeably, but they have some key differences. Not all web crawling is web scraping, and vice versa, even though the two are fairly similar.
This article explains the differences between web scraping and web crawling, how they’re used together, and why web scraping is sometimes viewed in a negative light.
What is web scraping?
Web scraping refers to the process of extracting structured data from a website, typically to store it in a database for later reference. The target is usually a specific kind of data, such as keywords, contact information, or price lists. Web scrapers are often written in Python for a specific purpose, but many data scraping tools are also available as online services.
For example, say you own an ecommerce retail website and want to quickly check product prices on competing websites, so you can adjust your own prices to match the market. By pointing a web scraper at your competitors’ product pages, you can build a database of prices and keep it updated throughout the day.
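As a rough illustration of the price-checking idea, here is a minimal sketch using only Python’s standard library. The sample HTML, the `price` class name, and the names `PriceParser` and `extract_prices` are assumptions made up for this example. A real competitor page will use different markup, and the page would have to be downloaded first.

```python
from html.parser import HTMLParser

# Hypothetical markup: we assume each price sits in an element with class "price".
SAMPLE_HTML = """
<ul>
  <li><span class="name">Widget</span> <span class="price">$19.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">$1,250.00</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collects the text of every element whose class list includes "price"."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current price element.
        self._in_price = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_price and text:
            # Drop the currency symbol and thousands separators before converting.
            self.prices.append(float(text.lstrip("$").replace(",", "")))


def extract_prices(html):
    """Return all prices found in the given HTML string as floats."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


if __name__ == "__main__":
    print(extract_prices(SAMPLE_HTML))
```

In practice, the extracted values would be written to a database along with a timestamp, and the scraper would be re-run on a schedule.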
You could also track flight ticket prices from competing airlines throughout the day, or count how many times a competitor’s website uses a specific keyword in its blog articles. There is almost no limit to the kind of data you can scrape, as long as the information is already publicly available.
Web scraping can involve web crawling, but not always. Instead of discovering pages by following links, a web scraper can be given a specific list of URLs, including individual pages, which makes it faster and more efficient.
What is web crawling?
Web crawlers essentially build maps and indexes of the web, and they are what powers search engines such as Google and Bing. Starting from a list of seed URLs, a web crawler follows the links on each page to discover new pages, gradually mapping out the web. In their most basic form, web crawlers simply parse a page’s HTML and check where its hyperlinks lead.
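The seed-and-follow process can be sketched in a few lines of Python. This is a simplified illustration, not how any particular search engine works. The `crawl` function, the `fetch` parameter, and the in-memory `FAKE_SITE` are all assumptions for the example; the fake site stands in for real HTTP requests so the sketch runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Hypothetical three-page site used in place of real HTTP requests.
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/">Home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl from the seed URLs.

    `fetch` takes a URL and returns its HTML, or None if the page is unavailable.
    Returns a dict mapping each visited URL to the links found on it.
    """
    index = {}
    queue = deque(seeds)
    seen = set(seeds)
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        # Resolve relative links against the current page and drop #fragments.
        links = [urldefrag(urljoin(url, href)).url for href in parser.links]
        index[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


if __name__ == "__main__":
    for page, links in crawl(["https://example.com/"], FAKE_SITE.get).items():
        print(page, "->", links)
```

The `seen` set keeps the crawler from revisiting pages when links loop back on themselves, and `max_pages` is a simple safety cap.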
Technically speaking, web crawlers do some scraping, but for a different purpose than commercial scraping. Web crawlers don’t extract targeted information. They just build indexes of what’s there, so search engines can give accurate search results.
To summarize, the analogy I like to use is dental work. Web crawling is the equivalent of an X-ray that gives a map of your teeth, and web scraping is a particular dental procedure, like… plaque scraping.
Why do web scrapers get a bad rap?
Web scrapers are a somewhat controversial subject, and they fall into a “grey” area of morality. Scraping publicly available data isn’t illegal per se, as the 9th Circuit indicated when LinkedIn tried to use the CFAA (Computer Fraud and Abuse Act) against HiQ, a data analysis company that was scraping publicly available profiles from LinkedIn.
However, the legality of web scraping is still developing as courts continue to decide where to draw the lines. For example, the 11th Circuit ruled in Compulife v. Newman that “a database may contain trade secret information even though the database contents can be accessed through a publicly available website”.
The 11th Circuit’s Compulife v. Newman ruling turned on the particular facts of that case, but it’s an example of why web scrapers shouldn’t feel entirely safe in their activities just because scraping public information doesn’t violate the CFAA.
Webmasters also tend to dislike web scrapers, because web scrapers often:
- Disregard robots.txt (instructions for web crawlers as defined by the webmaster)
- Identify themselves as a human-controlled browser
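A scraper that wants to avoid both complaints can check robots.txt before fetching a page and send an honest User-Agent string. Below is a minimal sketch using Python’s standard `urllib.robotparser` module. The robots.txt content, the `example-scraper` name, and the helper functions `build_robots` and `is_allowed` are assumptions for illustration. The rules are parsed from a string here so the example runs offline; a real scraper would download the site’s actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a webmaster might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# An honest, descriptive User-Agent (hypothetical name) instead of a browser string.
USER_AGENT = "example-scraper"


def build_robots(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent=USER_AGENT):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots = build_robots(ROBOTS_TXT)
    for url in ("https://example.com/products", "https://example.com/private/data"):
        print(url, "allowed" if is_allowed(robots, url) else "disallowed")
```

The same `USER_AGENT` value would then be sent in the User-Agent header of each request, so the webmaster can see who is visiting.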
Web scrapers can also put a lot of strain on a website’s resources. You know how some websites with limited server resources can crash when thousands of people are browsing them normally? Now imagine thousands of web scrapers hitting the website, each sending requests much faster than a human normally would.
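One common way to reduce that strain is to enforce a minimum delay between requests. The `Throttle` class below is a hypothetical helper written for this article, not part of any library, and the two-second interval is an arbitrary assumption; a suitable delay depends on the site being scraped.

```python
import time


class Throttle:
    """Makes sure at least `min_interval` seconds pass between requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep can be swapped out, which makes the class easy to test.
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until it is polite to send the next request."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(min_interval=2.0)
    for url in ("https://example.com/a", "https://example.com/b"):
        throttle.wait()  # pauses before the second URL
        print("would fetch", url)
```

Calling `wait()` before every request caps the scraper at one request per interval, no matter how quickly the rest of the code runs.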
At the end of the day, it’s important to be mindful of how your web scraping activities can have an impact on a website’s resources and ability to load pages for normal users.