Grab all titles from a website using Python and requests (or any programming language) [outdated]

In this tutorial, I will use an API for requests that takes over the work of setting proxies and of loading and parsing the HTML DOM.

I have made a simple API for making requests and parsing responses: https://urlworker.xyz/. You can also read my previous post about it: UrlWorker: an easy way to make requests and parse HTML.

I’m using Python for this tutorial. You are free to use any programming language you know.

Python imports

Let’s get to work!

import requests
import json

First, import the necessary libraries.

API_KEY = '__YOUR_API_KEY__'
API_URL = 'https://api.urlworker.xyz/{key}'.format(key=API_KEY)
URL = 'https://techwetrust.com/'

Here, API_KEY is the key obtained from the UrlWorker API, API_URL is the endpoint used to make requests, and URL is the website we want to crawl and parse.

API config

Now, make a config dictionary using the following structure for our API settings.

config = {
  "method": "get",
  "url": URL,
  "headers": {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
  },
  "timeout": 10,
  "css_selectors": [
    {
      # https://www.w3schools.com/cssref/css_selectors.asp
      "selector": "h1",
      "text": True
    },
  ]
}

As you can see, we specify the HTTP method the API should use for the request, the URL, a User-Agent header so the request identifies itself politely, a timeout so the request does not hang, and a CSS selector.

As quick info, a CSS selector is a rule that identifies HTML elements in the HTML DOM.

For example, the h1 selector selects all h1 heading elements in the HTML DOM. Most blog templates use h1 or h2 for post titles.
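To see what "selecting all h1 elements" means without calling any API, here is an illustrative sketch using only Python's built-in html.parser module (the sample HTML is made up for demonstration):

```python
from html.parser import HTMLParser

class H1Extractor(HTMLParser):
    """Collect the text content of every <h1> element."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.in_h1 = True
            self.titles.append('')

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        # Only accumulate text that appears inside an <h1>
        if self.in_h1:
            self.titles[-1] += data

html = '<div><h1> First post </h1><p>intro</p><h1> Second post </h1></div>'
parser = H1Extractor()
parser.feed(html)
titles = [t.strip() for t in parser.titles]
# titles is now ['First post', 'Second post']
```

This is exactly the kind of work the API's css_selectors option does for us server-side.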

# call the API making an HTTP request
response = requests.post(API_URL, data=json.dumps(config))

# load the received JSON into a dictionary
data = response.json()
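The response may not always have the shape we expect (for example, if the request failed). A small helper function (my own, hypothetical, not part of the UrlWorker API) makes the lookup safe:

```python
def extract_results(data, index=0):
    # Return the results list for the selector at `index`,
    # or an empty list if the key or index is missing.
    selectors = data.get('css_selectors', [])
    if index < len(selectors):
        return selectors[index].get('results', [])
    return []
```

With this, extract_results(data) replaces the raw data['css_selectors'][0]['results'] lookup and never raises a KeyError or IndexError.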

The following line will print the parsed titles from the requested URL:

data['css_selectors'][0]['results']
['\r\n                    How to use SSH Key-Based Authentication on Linux                ',
 '\r\n                    How to easily understand Unix jobs management (ctrl+z, bg, jobs, fg)                ',
 '\r\n                    How to use threads in Python 3 [the easy way]                ',
 '\r\n                    How to edit Mac OS hosts file?                ',
 '\r\n                    How to send emails through a Docker container using Flask API?                ',
 '\r\n                    How to turn on “Less secure app access” on used Google Account?                ',
 '\r\n                    How to run WordPress with docker on Linux/macOS/Windows?                ',
 '\r\n                    How to pass visitors real IP to WordPress from Nginx reverse proxy?                ',
 '\r\n                    How to run Nginx with docker-compose ?                ',
 '\r\n                    How to run Nginx with Docker?                ']

Final touches

As you can observe, each title needs to be cleaned because it contains leading and trailing spaces and newlines; you can use .strip() for that.

titles = []
for element in data['css_selectors'][0]['results']:
  title = element.strip()
  titles.append(title)
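The loop above can also be condensed into a single list comprehension, which is the more idiomatic Python form (the sample raw values below mimic the API output shown earlier):

```python
# sample raw results as returned by the API (whitespace included)
raw = ['\r\n    How to run Nginx with Docker?    ',
       '\r\n    How to run Nginx with docker-compose ?    ']

# strip surrounding whitespace from every element in one pass
titles = [element.strip() for element in raw]
# titles is now ['How to run Nginx with Docker?', 'How to run Nginx with docker-compose ?']
```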

If you print the titles variable, it will contain clean text:

print(titles)
['How to use SSH Key-Based Authentication on Linux',
 'How to easily understand Unix jobs management (ctrl+z, bg, jobs, fg)',
 'How to use threads in Python 3 [the easy way]',
 'How to edit Mac OS hosts file?',
 'How to send emails through a Docker container using Flask API?',
 'How to turn on “Less secure app access” on used Google Account?',
 'How to run WordPress with docker on Linux/macOS/Windows?',
 'How to pass visitors real IP to WordPress from Nginx reverse proxy?',
 'How to run Nginx with docker-compose ?',
 'How to run Nginx with Docker?']
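If you want to keep the cleaned titles for later use, a quick sketch (the save_titles helper is my own, not from any library) writes them to disk as JSON:

```python
import json

def save_titles(titles, path):
    # Write the cleaned titles to disk as a JSON array.
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(titles, f, ensure_ascii=False, indent=2)
```

Reading the file back with json.load() returns the same list of titles.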

In conclusion, you can parse a webpage with basic knowledge of programming.

The whole code is available in this Google Colab file.
