
Web Scraping

Web scraping, also called web data mining or web harvesting, is the process of constructing an agent which can extract, parse, download and organize useful information from the web automatically.

 

Advantages of Web Scraping:

   Web scraping has as many uses as the World Wide Web itself. A web scraper can do much of what a person does in a browser, such as ordering food online, scanning shopping websites for you, or buying match tickets the moment they become available.

 

Some of the important uses of web scraping are discussed here:

  • E-commerce Websites − Web scrapers can collect data, especially the prices of a specific product, from various e-commerce websites for comparison.
  • Content Aggregators − Web scraping is widely used by content aggregators, such as news aggregators and job aggregators, to provide updated data to their users.
  • Marketing and Sales Campaigns − Web scrapers can be used to gather data such as email addresses and phone numbers for sales and marketing campaigns.
  • Search Engine Optimization (SEO) − Web scraping is widely used by SEO tools like SEMRush and Majestic to tell businesses how they rank for the search keywords that matter to them.
  • Data for Machine Learning Projects − Machine learning projects often rely on web scraping to retrieve their data.
  • Data for Research − Researchers can save time by using this automated process to collect useful data for their research work.

 

Steps involved in web scraping:

Step 1: Downloading Contents from Web Pages

In this step, a web scraper will download the requested contents from multiple web pages.

Step 2: Extracting Data

The data on websites is in HTML and is mostly unstructured. Hence, in this step, the web scraper will parse the downloaded contents and extract structured data from them.

Step 3: Storing the Data

Here, the web scraper will store the extracted data in a format such as CSV or JSON, or in a database.
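
As a rough illustration of the storage step, the sketch below saves a few rows as JSON and in a database using only Python's standard library. The rows, the table name `prices`, and its columns are hypothetical and stand in for whatever the extraction step produced; an in-memory SQLite database is assumed so that nothing is written to disk.

```python
import json
import sqlite3

# Hypothetical rows, standing in for data extracted in the previous step.
rows = [
    {"product": "Pen", "price": 1.5},
    {"product": "Notebook", "price": 4.25},
]

# JSON: one call serializes the whole list.
json_text = json.dumps(rows, indent=2)

# Database: an in-memory SQLite database (pass a file path to persist it).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE prices (product TEXT, price REAL)")
connection.executemany(
    "INSERT INTO prices (product, price) VALUES (:product, :price)", rows
)
connection.commit()

# Read the rows back to confirm they were stored.
stored = connection.execute(
    "SELECT product, price FROM prices ORDER BY price"
).fetchall()
print(json_text)
print(stored)
```

A CSV file could be written in the same way with the standard `csv` module.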

Step 4: Analyzing the Data

After all these steps are successfully completed, the web scraper will analyze the data thus obtained.
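
The steps above can be sketched end to end with Python's standard library. This is a minimal illustration rather than a production scraper: the sample HTML, the `product` and `price` class names, and the helper names (`PriceParser`, `extract`, `store_csv`, `analyze`) are all assumptions made up for this example, and the download step is replaced by an inline string so that the sketch runs offline.

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 (download) is stubbed: a hypothetical page held in a string.
SAMPLE_HTML = """
<ul>
  <li><span class="product">Pen</span><span class="price">1.50</span></li>
  <li><span class="product">Notebook</span><span class="price">4.25</span></li>
  <li><span class="product">Bag</span><span class="price">19.00</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collects (product, price) pairs from <span class="product|price"> tags."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows: {"product": str, "price": float}
        self._field = None    # which field the current <span> holds, if any
        self._current = {}    # row being assembled

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("product", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and len(self._current) == 2:
            self._current["price"] = float(self._current["price"])
            self.rows.append(self._current)
            self._current = {}


def extract(html):
    """Step 2: turn raw HTML into a list of structured rows."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def store_csv(rows):
    """Step 3: serialize the rows as CSV text (could be written to a file)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["product", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def analyze(rows):
    """Step 4: a trivial analysis, here the cheapest product."""
    return min(rows, key=lambda row: row["price"])


rows = extract(SAMPLE_HTML)
csv_text = store_csv(rows)
cheapest = analyze(rows)
print(csv_text)
print("Cheapest:", cheapest["product"], cheapest["price"])
```

In a real scraper, `SAMPLE_HTML` would come from an HTTP request, and the parsing logic would have to match the markup of the target site.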

 

Why Python for Web Scraping?

   Python is a popular language for implementing web scrapers. It is also used for other projects related to cyber security, penetration testing, and digital forensics. Using only Python's standard library, basic web scraping can be performed without any third-party tool.
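
As a rough illustration of scraping without third-party tools, the sketch below pairs `urllib.request` with `html.parser`, both from the standard library, to read a page title. The helper names `TitleParser`, `parse_title`, and `fetch_title`, as well as the `example-scraper/0.1` User-Agent string, are hypothetical. `fetch_title` performs a real network request, so it is defined but not called here.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Records the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def parse_title(html):
    """Return the stripped <title> text, or None if the page has none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() if parser.title is not None else None


def fetch_title(url, timeout=10):
    """Download a page with urllib and return its title (network I/O)."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return parse_title(html)


print(parse_title("<html><head><title> Demo page </title></head></html>"))
```

The libraries described in the rest of this article do the same job with far less code and cope better with real-world pages.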

 

Python Modules for Web Scraping:

   Instead of manually saving the data from websites, the web scraping software will automatically load and extract data from multiple websites as per our requirement.

Here are a few useful Python libraries for web scraping:

1) Requests:

Requests is a simple and efficient HTTP library for Python that is widely used in web scraping to access web pages. With Requests, we can get the raw HTML of a web page, which can then be parsed to retrieve the data. Instead of creating a connection or a pool yourself, you directly GET a URL and receive a response.

2) Urllib3:

urllib3 is another Python library that can be used for retrieving data from URLs, similar to the Requests library. It handles connection pooling and thread safety for you through a PoolManager object.

3) BeautifulSoup:

BeautifulSoup is used for extracting data points from the pages that have been loaded. Beautiful Soup is quite robust and handles malformed markup nicely. In other words, if you have a page that does not validate as proper HTML but you know for a fact that it is an HTML page, Beautiful Soup can still parse it.

The official docs are comprehensive, easy to read, and full of examples. So, just like the Requests docs, they are really beginner-friendly.

4) Selenium:

Selenium is an open-source automated testing suite for web applications across different browsers and platforms. It is not a single tool but a suite of software. There are Selenium bindings for Python, Java, C#, Ruby and JavaScript.

The Selenium Python bindings provide a convenient API to access Selenium WebDrivers for browsers such as Firefox, Internet Explorer, and Chrome.

5) lxml:

lxml is similar to Beautiful Soup in that it is used for scraping data. It is one of the most feature-rich Python libraries for processing both XML and HTML. It is also really fast and memory efficient. Beautiful Soup also supports it as a parser.

(Note: The above Python packages can be installed using the pip command in an activated Python virtual environment.)

 

Exercise 1: Scraping using the requests module:

import requests

# Fetch the page; r.text holds the decoded HTML
r = requests.get('https://authoraditiagarwal.com/')
# In an interactive session, show the first 200 characters
r.text[:200]

Output:

'<!DOCTYPE html>\n<html lang="en-US"\n\titemscope\n\titemtype="http://schema.org/WebSite" \n\tprefix="og: http://ogp.me/ns#">\n<head>\n\t<meta charset="UTF-8" />\n\t<meta http-equiv="X-UA-Compatible" content="IE'

 

Exercise 2: Scraping using Urllib3 and BeautifulSoup:

import urllib3
from bs4 import BeautifulSoup

# A PoolManager handles connection pooling for all requests
http = urllib3.PoolManager()
r = http.request('GET', 'https://authoraditiagarwal.com')
# Parse the raw response bytes with the lxml parser
soup = BeautifulSoup(r.data, 'lxml')
print(soup.title)       # the whole <title> element
print(soup.title.text)  # only its text

Output:

<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal

 

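Exercise 3: Web scraping using Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Raw string, so the Windows path needs only single backslashes
path = r'C:\Users\test_user\Desktop\Chromedriver'
# Selenium 4 takes the driver path through a Service object
browser = webdriver.Chrome(service=Service(path))
browser.get('https://authoraditiagarwal.com/leadershipmanagement')
# find_element(By.XPATH, ...) replaces the removed find_element_by_xpath
browser.find_element(By.XPATH, '/html/body').click()

You can then check the browser window, which is controlled by the Python script, for the result.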

 

Scraping without getting blocked:

   Web scraping can be difficult, particularly because many popular sites actively try to prevent developers from scraping them, using a variety of techniques such as IP address detection, HTTP request header checking, CAPTCHAs, JavaScript checks, and more. To avoid getting blocked while scraping, we can follow the methods below:

  • Make requests through proxies and rotate them as needed.
  • Rotate User-Agents and the corresponding HTTP request headers between requests.
  • Use a headless browser, driven by a tool like Selenium, Puppeteer, or Playwright.
  • Set random intervals between your requests.
  • Check whether the website is changing layouts.
  • Avoid scraping data behind a login.
  • Use CAPTCHA-solving services.
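
Two of the measures listed above, rotating User-Agent headers and spacing requests at random intervals, can be sketched with the standard library alone. The User-Agent strings, the delay bounds, and the helper names (`build_request`, `random_delay`, `polite_fetch`) are illustrative assumptions, not recommended values. `polite_fetch` performs real network requests, so it is defined but not called here.

```python
import random
import time
from urllib.request import Request, urlopen

# Example User-Agent strings (placeholders chosen for illustration only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]


def build_request(url):
    """Create a request whose User-Agent is picked at random from the pool."""
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
    return Request(url, headers=headers)


def random_delay(low=1.0, high=3.0):
    """Return a random pause length in seconds between low and high."""
    return random.uniform(low, high)


def polite_fetch(urls, timeout=10):
    """Fetch each URL in turn, sleeping a random interval between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index:  # no need to wait before the first request
            time.sleep(random_delay())
        with urlopen(build_request(url), timeout=timeout) as response:
            pages[url] = response.read()
    return pages


# Build (but do not send) a request to inspect the chosen header.
request = build_request("https://example.com/")
print(request.get_header("User-agent"))
```

The same idea carries over to Requests or urllib3: pass a freshly chosen headers dictionary with each call and sleep between calls.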