What is Web Scraping?

By | January 7, 2017

The World Wide Web (WWW) is large, complex, and ever-evolving. With so much data available, how does one get to the relevant piece of information? This is where web scraping comes into the picture.

Web scraping (also referred to as screen scraping, web harvesting, or web data extraction) is a software technique for collecting information from websites automatically. Generally, such programs simulate human exploration of the World Wide Web: they request HTML pages, parse them, transform the page content, and store it in a file or database. A web scraper, then, is a software program that extracts data from a website. It is also known as web scraping software, a web scraping tool, a screen scraper, a web harvester, or a web data extractor.
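
The request, parse, transform, and store steps described above can be sketched in Python using only the standard library. This is a minimal illustration, not a production scraper: the names fetch_html, TableCellParser, scrape_to_csv, and SAMPLE_HTML are hypothetical, and the sample page is an inline string so the example runs without network access.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url: str) -> str:
    """Request step: download a page and return its HTML (needs network access)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TableCellParser(HTMLParser):
    """Parse step: collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def scrape_to_csv(html: str) -> str:
    """Transform and store steps: turn table rows into CSV text."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer).writerows(parser.rows)
    return buffer.getvalue()


# Hypothetical sample page, standing in for the result of fetch_html(url).
SAMPLE_HTML = """
<table>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

csv_text = scrape_to_csv(SAMPLE_HTML)
print(csv_text)
```

In real use, the string passed to scrape_to_csv would come from fetch_html, and the CSV text would be written to a file or loaded into a database.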

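In principle, any web content that can be viewed in a browser can also be scraped, although content that is loaded dynamically or placed behind a login usually takes more effort.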

Web scraping is closely associated with web indexing, in which a web crawler or bot indexes the information on web pages; this is the method used by most search engines. Web scraping, by contrast, concentrates on converting unstructured website data into structured data that can be stored in a database or file and analyzed later.

Web Scraping Techniques:

There are many ways to scrape information from the web, including:

  • Manual Human Copy and Paste
  • Use of an API to extract data – preferred over web scraping when one is provided
  • Regular Expression (RegEx) matching and Text Grepping
  • HTTP & Socket Programming
  • HTML Parsing
  • DOM Parsing
  • Web Scraping tools
  • Vertical aggregation platforms
  • Semantic annotation recognizing
  • Web Page Analyzers
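
To make two of the techniques listed above concrete, the sketch below extracts link targets from the same HTML snippet twice: once with regular expression matching and once with HTML parsing. The snippet, the pattern, and the LinkParser class are assumptions made up for illustration. The pattern deliberately matches only double-quoted attributes, to show how a simple regex can miss markup that a parser handles.

```python
import re
from html.parser import HTMLParser

# Hypothetical snippet: the second link uses single-quoted attributes.
SAMPLE_HTML = """<p><a href="/jobs">Jobs</a> and <a class='x' href='/news'>News</a></p>"""

# Technique 1: regular expression matching (double-quoted href values only).
HREF_PATTERN = re.compile(r'href="([^"]*)"')
regex_links = HREF_PATTERN.findall(SAMPLE_HTML)


# Technique 2: HTML parsing, which reads attributes regardless of quote style.
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


parser = LinkParser()
parser.feed(SAMPLE_HTML)
parsed_links = parser.links

print(regex_links)   # the regex finds only the double-quoted link
print(parsed_links)  # the parser finds both links
```

A regex is quick to write for small, predictable pages, while a parser tends to be more robust when the markup varies.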

Web scraping can be implemented using a variety of programming languages and tools, such as Google Docs, Python, Perl, PHP, and .NET. Python is a very popular language for web scraping because it is open source, easy to learn, and has a readable syntax. It also has a wide variety of libraries for web scraping, such as requests, lxml, BeautifulSoup, urllib, and Scrapy.

Uses of Web Scraping:

  • Online Price Scraping and Comparison
  • Contact Scraping for Lead Generation
  • Weather Data monitoring
  • Website change detection
  • Academic, Marketing or Scientific Research
  • Web Mashup
  • Web Data Integration
  • Scraping eCommerce listings
  • Collecting business or product reviews
  • Scraping people's profiles from social networks to track online reputation
  • Scraping Search Engine Result Pages (SERPs) for search engine optimization (SEO) purposes
  • Scraping news websites
  • Scraping real estate websites
  • Scraping job websites to create central job boards
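
As an illustration of one use from the list above, website change detection, the sketch below compares fingerprints of a page taken at two points in time. It is one simple approach among several, and the function names and sample pages are hypothetical. Whitespace is normalized before hashing so that purely cosmetic reformatting is not reported as a change.

```python
import hashlib


def content_fingerprint(html: str) -> str:
    """Return a SHA-256 digest of the page with runs of whitespace collapsed."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, current_html: str) -> bool:
    """Report whether the current page differs from the stored fingerprint."""
    return content_fingerprint(current_html) != previous_fingerprint


# Hypothetical snapshots of the same page.
old_page = "<h1>Price: $10</h1>"
reformatted_page = "<h1>Price:   $10</h1>\n"  # same content, extra whitespace
new_page = "<h1>Price: $12</h1>"              # the price has changed

baseline = content_fingerprint(old_page)
print(has_changed(baseline, reformatted_page))
print(has_changed(baseline, new_page))
```

A scraper run on a schedule could store the baseline fingerprint and send a notification whenever has_changed returns True.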

It is quite natural to ask, “Is web scraping legal?” In general, the answer is yes: extracting data from public websites is very common, and web scraping is simply a way of having a computer read a website automatically. From a technical standpoint, an automated program viewing a website does much the same thing as a human viewing it. However, web scraping may be against the terms of use of some websites. And while duplicating original content will in many cases be illegal, duplicating facts is permissible under United States law.

