
Python web scraping made easy #1


I've done a lot of web scraping and I'm still doing it. I've been building web scrapers since I started my first job in 2015. I've written many web spiders and parsers and contributed to crawling architectures at my past jobs, so I had plenty of time to try and fail at scraping public and protected data.

Thank you Akamai, Cloudflare, PerimeterX, and other anti-bot services for letting me learn how to improve my scripts and bypass different bot protections; it was a fun journey.

What is web scraping?

The term web scraping refers to the process of using bots to extract data from websites. As a short example, you can write a script that extracts new posts from the Hacker News website.

Why should you use Python for web scraping?

I've written web scrapers in Python, Go, and JS. For me, Python was the go-to language because it has an easy learning curve. A big plus is also that the community has developed a lot of useful libraries and tools for web scraping.

What should I learn first?

CSS selectors

Simple: CSS selectors and how the HTML DOM works. You can practice different CSS selectors in your browser with the inspect window open. Press cmd+f or ctrl+f and search for a DOM element using a CSS selector.

Using a CSS selector to select the div element that has w3-example as its class

HTTP GET and POST

Now you can advance to some basic knowledge of HTTP requests: how a web request works, what verbs can be used, and what data is sent with each verb. Mostly, you will use the GET and POST methods. So, learn how to do a basic request in Python (GET / POST).

Don’t forget to make HTTPBIN.ORG your best friend in your practice period. You’ll thank me later.

Python GET request example

>>> import requests
>>> response = requests.get('http://httpbin.org/get')
>>> response.json()
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.25.1'}, 'origin': 'your-ip', 'url': 'http://httpbin.org/get'}

Python POST request example

>>> import requests
>>> response = requests.post('http://httpbin.org/post', data={'user': 'test'})
>>> response.json()
{'args': {}, 'data': '', 'files': {}, 'form': {'user': 'test'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Content-Length': '9', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.25.1'}, 'json': None, 'origin': 'your-ip', 'url': 'http://httpbin.org/post'}

Short note:

You'll have to learn the difference between a POST with JSON data and one with form data. The example above was sent as form data. To send the data as a JSON object, either use the requests shortcut json={'user': 'test'}, or encode it as a string with json.dumps, as in data=json.dumps({'user': 'test'}) (and set the Content-Type header to application/json yourself).
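To see the difference without touching the network, you can inspect what requests would actually put on the wire by preparing the request locally. This is just a sketch for comparison; httpbin.org is only used here as a placeholder URL and is never contacted:

```python
import requests

# Form-encoded POST: data= produces application/x-www-form-urlencoded
form_req = requests.Request(
    'POST', 'http://httpbin.org/post', data={'user': 'test'}
).prepare()
print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_req.body)                     # user=test

# JSON POST: json= serializes the dict and sets the header for you
json_req = requests.Request(
    'POST', 'http://httpbin.org/post', json={'user': 'test'}
).prepare()
print(json_req.headers['Content-Type'])  # application/json
print(json_req.body)                     # b'{"user": "test"}'
```

Notice that the body and the Content-Type header both change; many APIs will reject a request whose header doesn't match its body encoding.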

The lxml library

The lxml library is the core library for XML and HTML parsing. Yeah, you can skip it and jump directly to BeautifulSoup or another mighty framework for web scraping. I kid, learn the basics first!

A scraper’s flow should be like this:

  • request a page
  • process its HTML
  • navigate through the parsed HTML DOM
  • parse the required data
  • transform the data
  • save it
  • go to the next page or whatever you want to do next

The lxml library will be used to process the HTML, navigate through the DOM's elements, and extract the desired ones using CSS selectors.

Parsing HN posts with Python and lxml

You'll have to add new libraries to your Python project. (Keep in mind that HN's markup can change over time, so the selectors below may need updating.)

pip install requests lxml cssselect

import requests
from lxml import html

# Hacker News example
news = []  # empty list to add parsed data
response = requests.get('https://news.ycombinator.com/')  # HTTP GET request

dom = html.fromstring(response.text)  # parsing HTML from previous request
table_el = dom.cssselect('table.itemlist')[0]  # using CSS selector to select HTML table from DOM
title_els = table_el.cssselect('td.title:last-of-type')  # using CSS selector to select each post element from the table parsed before

# Parsing each DOM element selected before
for title_el in title_els:
    news.append({
        'title': title_el.text_content(),
        'url': title_el.cssselect('a')[0].get('href')
    })
print(news)
[{'title': 'How to Make the Universe Think for Us (quantamagazine.org)', 'url': 'https://www.quantamagazine.org/how-to-make-the-universe-think-for-us-20220531/'}, {'title': 'Show HN: An open source alternative to Evernote (Self Hosted) (github.com/git-noter)',...

Conclusion

Do some practice, consolidate your knowledge, and wait for the next tutorial in this series.
