Web Scraping with Python and Beautiful Soup

A quote a day keeps the unmotivated-ness away

I recently built my first web scraper and it worked! It was so fun to learn how to do, and even more fun to build. In this post I’ll go step by step through how I built mine, in hopes of teaching you to build your own.

Let’s start off by making sure you have Python installed on your machine. If you do not, please see here.

Quick note: I’m running everything on macOS and will try to post the relevant Windows commands throughout this post!

We will be building a web scraper to get a new quote of the day every day, so let’s get started!

Once you have Python installed on your machine, let’s open a new project. Go to where you’d like this project to live and create a new folder. To do this I’ll be using the CLI commands:

mkdir quotescraper
cd quotescraper

The above code will create a new folder called quotescraper and move us into that directory. From there we’ll create the files needed for this project:

touch webscraper.py
touch todaysquote.txt

On Windows these commands are:

echo. >webscraper.py 
echo. >todaysquote.txt

Next up: virtual environments! Still in the command line (and in our project folder) we’re going to create a virtual environment to build the rest of our script. To learn more about virtual environments check out this article.

If you’re using Python 3, the venv module used in the commands below is already built in, so there is nothing extra to install for it. If you would rather work with the separate virtualenv package, install it using the pip command:

pip install virtualenv OR pip3 install virtualenv

Once installed, let’s create and activate an environment. On macOS:

python3 -m venv env 
source env/bin/activate

On Windows:

python -m venv env
env\Scripts\activate
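
If you want to double-check that the environment is really active before installing anything, here is a small sketch you can run with Python. It assumes the environment was created with python -m venv as above; the function name in_virtualenv is my own and not part of any library.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
```

If this prints False, the activate command probably has not been run in the current terminal window.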

Next we are going to install the packages needed for this project. The time and datetime modules we’ll import later ship with Python, so they don’t need to be installed.

pip install beautifulsoup4
pip install lxml
pip install requests

Open your favorite code editor (mine is VS Code), import the packages, and let’s get coding!

from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time

In order to scrape we need a URL. Let’s go grab it and set it to a variable named url.

url = 'https://www.brainyquote.com/quote_of_the_day'

Next we need to use the requests package to actually get the url and its data.

html_text = requests.get(url).text
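
The one-liner above works when everything goes right. As a slightly more defensive sketch (not what the rest of this post uses), the fetch can be wrapped in a small function that sets a timeout and raises an error on a bad HTTP status. The name fetch_html and the get parameter are my own additions; get only exists so the function can be tried without touching the network.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download a page and return its HTML as a string."""
    # `get` is a hypothetical injection point: pass a stand-in function
    # to try this out without making a real network request.
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text
```

Calling fetch_html(url) would then take the place of requests.get(url).text.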

We use the .text attribute to get the page’s HTML as a string. Without it we would be handing Beautiful Soup a Response object rather than markup, and an error may occur when trying to run our script. We now need to create an instance of Beautiful Soup.

soup = BeautifulSoup(html_text, 'lxml')

Beautiful Soup is a Python package used to pull data from HTML and XML files. We pass our html_text variable as the first argument and 'lxml' as the second, which tells Beautiful Soup which parser to use to process the data received.

Head over to the site we are scraping and click on the “Quote of the Day” link.

Scroll down to the quote of the day rectangle, highlight the quote, and inspect the element.

You’ll be taken to the exact spot in the HTML where this quote lives. We need to note the parent element, in this case a div and its class name.

This is what we’ll use to get the quote’s text! I’m going to create a variable named quote and set it to the soup.find() function as shown below.

quote = soup.find('div', class_='clearfix').a.text

Make sure to write class_ with a trailing underscore when passing the class argument. class is a reserved word in Python, so leaving the underscore off causes a syntax error. We use dot notation to gain access to the a tag of this div, then use .text to gain access to the actual text of the quote. Very cool, right!?
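
To see the find() call in isolation, here is a small sketch that runs against a made-up snippet of HTML instead of the live site. The markup in SAMPLE_HTML and the function name extract_quote are hypothetical and only there for illustration; the real page’s HTML will look different. It also uses Python’s built-in 'html.parser' so it runs even without lxml, and it checks for None because find() returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page will differ.
SAMPLE_HTML = """
<div class="clearfix">
  <a href="/quotes/example">Stay curious.</a>
</div>
"""


def extract_quote(html_text):
    """Return the text of the first link inside div.clearfix, or None."""
    # 'html.parser' ships with Python; the post itself uses 'lxml'.
    soup = BeautifulSoup(html_text, "html.parser")
    container = soup.find("div", class_="clearfix")
    # find() returns None when nothing matches, so check before going deeper.
    if container is None or container.a is None:
        return None
    return container.a.get_text(strip=True)


print(extract_quote(SAMPLE_HTML))
```

The None check matters for a scraper that runs every day: if the site changes its layout, the script can report that instead of crashing on .a.text.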

Now we have the basic workings of our web scraper! Next we’re going to save the quote in our todaysquote.txt file. Make sure that this file lives in the same directory as your script file to make life easier. We’re going to use the datetime package and the file writing functionalities of Python to do so. Get excited!

Create a variable named today and set it to today’s date. We’ll also format the date to show as MM-DD-YYYY.

today = datetime.today().strftime('%m-%d-%Y')
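
The format string decides what the date looks like: '%m-%d-%Y' puts dashes between the parts, while '%m/%d/%Y' would use slashes. Here is a quick sketch with a fixed date so the result is easy to check by eye. The function name format_date is my own.

```python
from datetime import datetime


def format_date(moment):
    """Format a datetime as MM-DD-YYYY using the '%m-%d-%Y' pattern."""
    return moment.strftime("%m-%d-%Y")


# March 7, 2021 as a fixed example date.
example = datetime(2021, 3, 7)
print(format_date(example))           # 03-07-2021
print(example.strftime("%m/%d/%Y"))   # 03/07/2021
```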

Time to use the file writing functions of Python. Create a variable named quote_file and set it to the open() function. We’re going to pass the name of the .txt file we created and use 'w' to tell the open() function that we will be writing to the text file, replacing whatever was in it before. You can also use 'a' to append to the file instead if you’d like to keep earlier quotes.

quote_file = open('todaysquote.txt', 'w')
quote_file.write(f'Today is: {today} \n Quote of the day is: {quote}')
quote_file.close()

We use the write and close methods to write the contents of our file, then close it. I will also add a print statement to let me know that the quote has been saved. We use an f-string in the write call to format the string saved to the text file. It allows us to use our variables in our strings! Learn more about it here.
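
A common alternative to calling close() yourself is a with block, which closes the file automatically even if something goes wrong while writing. This is a sketch of that variation, not what the rest of the post uses. The function name save_quote and the sample values are my own, and it writes to a throwaway folder so your real todaysquote.txt is left alone.

```python
import os
import tempfile


def save_quote(path, today, quote):
    """Write the date and quote to `path`, replacing any earlier contents."""
    # The with block closes the file automatically, even if write() raises.
    with open(path, "w", encoding="utf-8") as quote_file:
        quote_file.write(f"Today is: {today}\nQuote of the day is: {quote}\n")


# Try it on a temporary file with made-up values.
demo_path = os.path.join(tempfile.mkdtemp(), "todaysquote.txt")
save_quote(demo_path, "03-07-2021", "Stay curious.")
with open(demo_path, encoding="utf-8") as saved:
    print(saved.read())
```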

The last step is to put all the above code into a function and control how often this function runs using the if __name__ == '__main__' condition in Python.

def find_quote():
    url = 'https://www.brainyquote.com/quote_of_the_day'
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, 'lxml')
    quote = soup.find('div', class_='clearfix').a.text
    today = datetime.today().strftime('%m-%d-%Y')
    quote_file = open('todaysquote.txt', 'w')
    quote_file.write(f'Today is: {today} \n Quote of the day is: {quote}')
    quote_file.close()
    print('Quote saved! \n')

if __name__ == '__main__':
    while True:
        find_quote()
        waiting_time = 100
        print('Will run tomorrow')
        # 100 * 864 = 86,400 seconds, which is one day
        time.sleep(waiting_time * 864)

To learn more about if __name__ == '__main__' check out this great article here.

We use the time module to put the function to sleep and run after one day. There are 86,400 seconds in a day, so that’s why we multiply our waiting_time of 100 by 864. You can modify this however you want! The code for the finished script can be found below:

from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time

def find_quote():
    # Download the page and parse its HTML
    url = 'https://www.brainyquote.com/quote_of_the_day'
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, 'lxml')
    # Grab the text of the link inside the quote's parent div
    quote = soup.find('div', class_='clearfix').a.text
    today = datetime.today().strftime('%m-%d-%Y')
    # Save the date and quote, overwriting yesterday's entry
    quote_file = open('todaysquote.txt', 'w')
    quote_file.write(f'Today is: {today} \n Quote of the day is: {quote}')
    quote_file.close()
    print('Quote saved! \n')

if __name__ == '__main__':
    while True:
        find_quote()
        waiting_time = 100
        print('Will run tomorrow')
        # 100 * 864 = 86,400 seconds, which is one day
        time.sleep(waiting_time * 864)
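
The script sleeps for a fixed 86,400 seconds after each run. Here is a small sketch of the arithmetic behind that number, plus one possible variation (not what the script above does) that works out how long to sleep until a chosen hour of the day. The names SECONDS_PER_DAY and seconds_until are my own.

```python
from datetime import datetime, timedelta

# One day expressed in seconds: 24 * 60 * 60 = 86,400.
SECONDS_PER_DAY = int(timedelta(days=1).total_seconds())


def seconds_until(hour, now):
    """Seconds from `now` until the next time the clock reads `hour`:00."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        # That time has already passed today, so aim for tomorrow.
        target += timedelta(days=1)
    return (target - now).total_seconds()


print(SECONDS_PER_DAY)
# From 9:30 AM, the next 8:00 AM is 22.5 hours away.
print(seconds_until(8, datetime(2021, 3, 7, 9, 30)))
```

With this variation, time.sleep(seconds_until(8, datetime.now())) would wake the script at about 8 AM each day instead of exactly 24 hours after the previous run.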

To run the script, go back to the directory of this project in your command line and type the following:

python3 webscraper.py

You should then be able to open the text file and see the quote of the day!

#command on macOS
open todaysquote.txt
#command on Windows
todaysquote.t

And that’s it! Web scraper complete. As this is the first web scraper I’ve ever built, I’m always open to learning about different ways to build the next one, so if you have any tips please let me know! I hope this tutorial was useful to you and that you learned something new :)

Thanks for reading! Check out my podcast, The Press Pod, for discussions on tech, web development, productivity, and so much more!
