Web scraping, also called web data mining or web harvesting, is the process of building an agent that can automatically download, parse, extract and organize useful information from the web.
Advantages of Web Scraping:
The uses of web scraping are as varied as the uses of the World Wide Web itself. Just like a human, a web scraper can order food online, scan a shopping website on your behalf, or buy match tickets the moment they become available. These are some of the most common reasons for using web scraping.
Steps involved in web scraping:
Step 1: Downloading Contents from Web Pages
In this step, a web scraper will download the requested contents from multiple web pages.
Step 2: Extracting Data
The data on websites is HTML and mostly unstructured. Hence, in this step, the web scraper parses the downloaded content and extracts structured data from it.
Step 3: Storing the Data
Here, the web scraper stores the extracted data in a format such as CSV or JSON, or in a database.
Step 4: Analyzing the Data
After all these steps complete successfully, the web scraper analyzes the data thus obtained.
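The four steps above can be sketched end-to-end in Python. This is a minimal illustration, not a real scraper: the hardcoded HTML snippet stands in for an actual download (Step 1), and the regex-based extraction is a stand-in for a proper HTML parser.

```python
import csv
import io
import re

# Step 1: download -- stubbed here with a literal HTML snippet
# (a real scraper would fetch this with an HTTP library)
html = '<ul><li>Book A: 10</li><li>Book B: 25</li></ul>'

# Step 2: extract structured data (a regex stands in for a real parser)
rows = re.findall(r'<li>(.+?): (\d+)</li>', html)

# Step 3: store the data, here as CSV in an in-memory buffer
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['title', 'price'])
writer.writerows(rows)

# Step 4: analyze -- e.g. compute the average price
avg = sum(int(price) for _, price in rows) / len(rows)
print(avg)
```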
Why Python for Web Scraping?
Python is a popular language for implementing web scraping. It is also used for other useful projects related to cyber security, penetration testing and digital forensics. Basic web scraping can even be performed with Python's standard library alone, without any third-party tools.
Python Modules for Web Scraping:
Instead of manually saving data from websites, web scraping software automatically loads and extracts data from multiple websites according to our requirements.
Here are the few useful Python libraries for web scraping:
1) Requests:
It is a simple Python web scraping library: an efficient HTTP library for accessing web pages. With Requests, we can fetch the raw HTML of a web page, which can then be parsed to retrieve the data. Instead of creating a connection or a pool manually, you directly GET a URL's response.
2) Urllib3:
It is another Python library for retrieving data from URLs, similar to the Requests library. It handles connection pooling and thread safety for you through a PoolManager object.
3) BeautifulSoup:
BeautifulSoup is used for extracting data points from loaded pages. It is quite robust and handles malformed markup gracefully: even if a page does not validate as proper HTML, Beautiful Soup can usually still parse it into a usable tree.
The official docs are comprehensive, easy to read and full of examples, so, just like Requests, the library is very beginner-friendly.
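To see that robustness in practice, here is a small sketch (the fragment and URL below are invented for illustration) in which Beautiful Soup parses markup that is not valid HTML:

```python
from bs4 import BeautifulSoup

# A fragment with no <html>/<body> wrapper and an unclosed <a> tag
broken = '<p>Read <a href="https://example.com">this page'
soup = BeautifulSoup(broken, 'html.parser')

# The link and its text are still recoverable from the parse tree
print(soup.a['href'])
print(soup.a.get_text())
```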
4) Selenium:
It is an open-source automated testing suite for web applications across different browsers and platforms. It is not a single tool but a suite of software, with Selenium bindings for Python, Java, C#, Ruby and JavaScript.
The Selenium Python bindings provide a convenient API for driving browsers such as Firefox, IE and Chrome through Selenium WebDriver.
5) lxml:
lxml is similar to Beautiful Soup in that it is used for parsing scraped data. It is the most feature-rich Python library for processing both XML and HTML, and it is also very fast and memory-efficient. Beautiful Soup can even use it as a parser.
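As a quick sketch of lxml on its own (the HTML snippet below is invented for illustration), you can parse markup into an element tree and query it with XPath:

```python
from lxml import html

# Parse a small HTML snippet into an element tree
tree = html.fromstring('<div><h1>Title</h1><p class="intro">Hello</p></div>')

# Query the tree with XPath expressions
print(tree.xpath('//h1/text()'))
print(tree.xpath('//p[@class="intro"]/text()'))
```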
(Note: The above Python packages can be installed with the `pip` command, e.g. `pip install requests urllib3 beautifulsoup4 selenium lxml`, inside an activated Python virtual environment.)
Exercise 1: Scraping using the requests module:
import requests

# Fetch the page and inspect the first 200 characters of its HTML
r = requests.get('https://authoraditiagarwal.com/')
r.text[:200]
Output:
'<!DOCTYPE html>\n<html lang="en-US"\n\titemscope\n\titemtype="http://schema.org/WebSite" \n\tprefix="og: http://ogp.me/ns#"\n>\n<head>\n\t<meta charset="UTF-8" />\n\t<meta http-equiv="X-UA-Compatible" content="IE'
Exercise 2: Scraping using Urllib3 and BeautifulSoup:
import urllib3
from bs4 import BeautifulSoup

# Create a PoolManager and fetch the page
http = urllib3.PoolManager()
r = http.request('GET', 'https://authoraditiagarwal.com')

# Parse the response body and print the page title
soup = BeautifulSoup(r.data, 'lxml')
print(soup.title)
print(soup.title.text)
<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal
Exercise 3: Web scraping using Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Path to the ChromeDriver executable on your machine
path = r'C:\Users\test_user\Desktop\Chromedriver'
browser = webdriver.Chrome(service=Service(path))
browser.get('https://authoraditiagarwal.com/leadershipmanagement')
browser.find_element(By.XPATH, '/html/body').click()
You can check the browser window, controlled by the Python script, for the output.
Scraping without getting blocked:
Web scraping can be difficult because many popular sites actively try to prevent developers from scraping them, using a variety of techniques such as IP address detection, HTTP request header checking, CAPTCHAs, JavaScript checks, and more. To scrape without getting blocked, we can follow the methods below:
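For example, one common method is to send a browser-like User-Agent header and rate-limit your own requests. Here is a minimal sketch (the header string, delay value and helper name `polite_get` are illustrative choices, not part of any standard):

```python
import time

import requests

# A browser-like User-Agent header (illustrative value)
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def polite_get(url, delay=2.0):
    """Fetch a URL with browser-like headers, then pause briefly
    so we do not hammer the server with rapid-fire requests."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    time.sleep(delay)
    return response
```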