Using AI for Webscraping – St. Louis Area Digital Humanities Network

Contributed by Montaque Reynolds, Saint Louis University

I wondered if I could use ChatGPT to develop a web scraper for genius.com, something I had been struggling with for some time. It turns out that I could. Here is the process and the resulting code. This walkthrough demonstrates the iterative process of co-writing code with an AI.

Prompt

how do I scrape https://genius.com in python

ChatGPT

To scrape data from https://genius.com in Python, you can use web scraping libraries such as Beautiful Soup and Requests. Here’s an example of how you can scrape the lyrics of a song from Genius.com:





import requests
from bs4 import BeautifulSoup

# Replace the URL with the Genius.com URL of the song you want to scrape
url = 'https://genius.com/Billie-eilish-bad-guy-lyrics'

# Make a GET request to the URL
response = requests.get(url)

# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')

# Find the div element that contains the lyrics
lyrics_div = soup.find('div', class_='lyrics')

# Get the text of the lyrics
lyrics = lyrics_div.get_text()

# Print the lyrics
print(lyrics)

This code will scrape the lyrics of the song “bad guy” by Billie Eilish from Genius.com and print them to the console. Note that scraping websites may violate their terms of service, so use this code at your own risk.

Prompt

That did not work

ChatGPT

I apologize if the previous code did not work for you. There can be several reasons why the code did not work, such as changes in the website’s HTML structure or blocking of the requests by the website. Here is an updated code that may work for you:





import requests
from bs4 import BeautifulSoup

# Set the URL of the page you want to scrape
url = "https://genius.com/Billie-eilish-bad-guy-lyrics"

# Set the headers to avoid blocking of requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}

# Send a GET request to the URL with headers
response = requests.get(url, headers=headers)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, "html.parser")

# Find the lyrics div tag
lyrics_div = soup.find("div", class_="lyrics")

# Extract the text from the lyrics div tag
lyrics = lyrics_div.get_text()

# Print the lyrics
print(lyrics)

This updated code sets the User-Agent header to avoid blocking of requests by the website. If the website still blocks your requests, you may need to use a proxy server or consider using the Genius API instead of web scraping.

Prompt

I do have a genius api

ChatGPT

Great! If you have a Genius API access token, you can use it to access data from Genius.com programmatically. Here is an example that uses the lyricsgenius library to get the lyrics of a song:





import lyricsgenius

# Set the access token for the Genius API
access_token = "<your_access_token>"

# Create a Genius object with the access token
genius = lyricsgenius.Genius(access_token)

# Set the name of the artist and the title of the song you want to get the lyrics for
artist_name = "Billie Eilish"
song_title = "bad guy"

# Search for the song on Genius.com
song = genius.search_song(song_title, artist_name)

# Get the lyrics of the song
lyrics = song.lyrics

# Print the lyrics
print(lyrics)

In this code, you first need to set your access token for the Genius API. Then, you create a Genius object with the access token and search for the song you want to get the lyrics for using the search_song method. Finally, you can get the lyrics of the song using the lyrics attribute of the Song object returned by the search_song method.
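
Rather than pasting the token into the script, one option is to read it from an environment variable. The sketch below is a minimal illustration using only the standard library; the variable name GENIUS_ACCESS_TOKEN and the helper load_token are assumptions chosen for this example, not something the Genius API or lyricsgenius prescribes.

```python
import os


def load_token(var_name="GENIUS_ACCESS_TOKEN"):
    """Return the token stored in the environment variable, or raise a clear error."""
    # Hypothetical variable name; use whatever name suits your setup
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return token
```

The returned string could then be passed to lyricsgenius.Genius(...) in place of the hard-coded placeholder.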

Prompt

Perfect. Now how do I save it to a csv file?

ChatGPT

To save the lyrics to a CSV file, you can use the csv module in Python. Here’s an example of how you can modify the previous code to save the lyrics to a CSV file:





import csv
import lyricsgenius

# Set the access token for the Genius API
access_token = "<your_access_token>"

# Create a Genius object with the access token
genius = lyricsgenius.Genius(access_token)

# Set the name of the artist and the title of the song you want to get the lyrics for
artist_name = "Billie Eilish"
song_title = "bad guy"

# Search for the song on Genius.com
song = genius.search_song(song_title, artist_name)

# Get the lyrics of the song
lyrics = song.lyrics

# Set the filename of the CSV file
filename = "lyrics.csv"

# Open the CSV file in write mode
with open(filename, "w", newline="") as file:
    writer = csv.writer(file)

    # Write the headers to the CSV file
    writer.writerow(["Artist", "Song Title", "Lyrics"])

    # Write the data to the CSV file
    writer.writerow([artist_name, song_title, lyrics])
    
print("Lyrics saved to", filename)

In this code, after getting the lyrics of the song, we set the filename of the CSV file and open it in write mode using the with statement. Then, we create a csv.writer object and write the headers to the CSV file using the writerow method. Finally, we write the data (i.e., the artist name, song title, and lyrics) to the CSV file using the writerow method. The newline="" argument in the open call lets the csv module control line endings itself, which avoids extra blank rows on some platforms (notably Windows).
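
Lyrics contain line breaks, so it can be reassuring to see how the csv module handles them. The following sketch writes one multi-line value to an in-memory buffer and reads it back; io.StringIO stands in for the file, and the sample text is made up.

```python
import csv
import io

# Made-up sample data; the lyrics value spans two lines
row = ["Example Artist", "Example Song", "first line\nsecond line"]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Artist", "Song Title", "Lyrics"])
writer.writerow(row)

# The writer wraps the multi-line field in quotes so it stays one record
buffer.seek(0)
rows = list(csv.reader(buffer))
print(len(rows))  # header row plus one data row
```

Reading the file back with csv.reader (or a spreadsheet program) therefore yields one row per song, even though each lyrics cell spans many lines.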

After running the code, the lyrics will be saved to the specified CSV file, and a message will be printed to the console indicating the filename where the lyrics are saved.

Prompt

Great. Now I would like to use it to scrape multiple songs.

ChatGPT

To scrape multiple songs using the Genius API, you can use a loop to iterate over a list of songs and write their lyrics to a CSV file. Here’s an example code that shows how you can modify the previous code to scrape the lyrics of multiple songs and save them to a CSV file:





import csv
import lyricsgenius

# Set the access token for the Genius API
access_token = "<your_access_token>"

# Create a Genius object with the access token
genius = lyricsgenius.Genius(access_token)

# Set the list of songs you want to get the lyrics for
songs = [
    {"artist": "Billie Eilish", "title": "bad guy"},
    {"artist": "The Weeknd", "title": "Blinding Lights"},
    {"artist": "Dua Lipa", "title": "Levitating"},
    # Add more songs here...
]

# Set the filename of the CSV file
filename = "lyrics.csv"

# Open the CSV file in write mode
with open(filename, "w", newline="") as file:
    writer = csv.writer(file)

    # Write the headers to the CSV file
    writer.writerow(["Artist", "Song Title", "Lyrics"])

    # Loop over the list of songs and write their lyrics to the CSV file
    for song in songs:
        # Search for the song on Genius.com
        search_results = genius.search_song(song["title"], song["artist"])

        # Check if the song was found
        if search_results is None:
            print(f"{song['title']} by {song['artist']} not found on Genius.com")
        else:
            # Get the lyrics of the song
            lyrics = search_results.lyrics

            # Write the data to the CSV file
            writer.writerow([song["artist"], song["title"], lyrics])

# Print a message indicating where the lyrics are saved
print(f"Lyrics saved to {filename}")

In this code, we first set the list of songs we want to get the lyrics for. Each song is represented as a dictionary with keys for the artist and title of the song. Then, we use a loop to iterate over the list of songs and write their lyrics to a CSV file. Inside the loop, we search for the song on Genius.com using the search_song method, and if the song is found, we get its lyrics and write them to the CSV file.

Note that this code assumes that the search_song method returns the first result for each song. If there are multiple search results for a song, you may need to modify the code to select the correct result.
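
One way to guard against a wrong match is to compare what came back with what was requested. The sketch below shows only the comparison step, using plain strings; the helper names normalize and is_expected_match are hypothetical, and how the returned title and artist are read from the search result is left out because it depends on the library version.

```python
def normalize(text):
    """Lower-case and collapse whitespace so minor formatting differences are ignored."""
    return " ".join(text.lower().split())


def is_expected_match(wanted_title, wanted_artist, found_title, found_artist):
    """Return True when the found song looks like the one that was requested."""
    return (normalize(wanted_title) == normalize(found_title)
            and normalize(wanted_artist) in normalize(found_artist))
```

A result that fails this check could be skipped or logged for manual review instead of being written to the CSV file.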

Prompt

Can you make it skip objects that are not found?

ChatGPT

Sure, you can modify the previous code to skip songs that are not found on Genius.com by adding a condition to check whether the search results are None. Here’s an example of how you can modify the code:





import csv
import lyricsgenius

# Set the access token for the Genius API
access_token = "<your_access_token>"

# Create a Genius object with the access token
genius = lyricsgenius.Genius(access_token)

# Set the list of songs you want to get the lyrics for
songs = [
    {"artist": "Billie Eilish", "title": "bad guy"},
    {"artist": "The Weeknd", "title": "Blinding Lights"},
    {"artist": "Dua Lipa", "title": "Levitating"},
    # Add more songs here...
]

# Set the filename of the CSV file
filename = "lyrics.csv"

# Open the CSV file in write mode
with open(filename, "w", newline="") as file:
    writer = csv.writer(file)

    # Write the headers to the CSV file
    writer.writerow(["Artist", "Song Title", "Lyrics"])

    # Loop over the list of songs and write their lyrics to the CSV file
    for song in songs:
        # Search for the song on Genius.com
        search_results = genius.search_song(song["title"], song["artist"])

        # Check if the song was found
        if search_results is None:
            print(f"{song['title']} by {song['artist']} not found on Genius.com")
        else:
            # Get the lyrics of the song
            lyrics = search_results.lyrics

            # Write the data to the CSV file
            writer.writerow([song["artist"], song["title"], lyrics])

# Print a message indicating where the lyrics are saved
print(f"Lyrics saved to {filename}")

In this code, we check whether the search_song method returns None using the is operator. If the song is not found, we print a message to the console indicating that the song was not found and skip to the next song in the loop. If the song is found, we get its lyrics and write them to the CSV file as before.
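
If the list of songs is long, it can help to collect the misses rather than only printing them. The sketch below illustrates the skip-and-continue pattern with a stand-in lookup; fake_search and KNOWN_LYRICS are hypothetical placeholders for the real API call, so the example runs without a token or network access.

```python
# Stand-in data: maps (title, artist) to lyrics; a real run would call the API instead
KNOWN_LYRICS = {("bad guy", "Billie Eilish"): "sample lyrics"}


def fake_search(title, artist):
    """Hypothetical replacement for the API call; returns None when nothing is found."""
    return KNOWN_LYRICS.get((title, artist))


songs = [
    {"artist": "Billie Eilish", "title": "bad guy"},
    {"artist": "Nobody", "title": "No Such Song"},
]

found, missing = [], []
for song in songs:
    lyrics = fake_search(song["title"], song["artist"])
    if lyrics is None:
        missing.append(song)  # remember the miss and move on to the next song
        continue
    found.append((song["artist"], song["title"], lyrics))

print(f"{len(found)} found, {len(missing)} missing")
```

The missing list can be written to its own file at the end, which makes it easier to correct spellings and retry just those songs.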

Prompt

I’m getting a read timeout error

ChatGPT

A read timeout error can occur when the Genius API takes too long to respond to a request. To handle this error, you can add a try-except block around the code that calls the Genius API and retry the request a certain number of times before giving up. Here’s an example of how you can modify the code to handle read timeout errors:



import csv
import lyricsgenius
import time

# Set the access token for the Genius API
access_token = "<your_access_token>"

# Create a Genius object with the access token
genius = lyricsgenius.Genius(access_token)

# Set the list of songs you want to get the lyrics for
songs = [
    {"artist": "Billie Eilish", "title": "bad guy"},
    {"artist": "The Weeknd", "title": "Blinding Lights"},
    {"artist": "Dua Lipa", "title": "Levitating"},
    # Add more songs here...
]

# Set the filename of the CSV file
filename = "lyrics.csv"

# Set the maximum number of retries
max_retries = 3

# Open the CSV file in write mode
with open(filename, "w", newline="") as file:
    writer = csv.writer(file)

    # Write the headers to the CSV file
    writer.writerow(["Artist", "Song Title", "Lyrics"])

    # Loop over the list of songs and write their lyrics to the CSV file
    for song in songs:
        # Initialize the retry counter
        retries = 0

        # Loop until the request succeeds or the maximum number of retries is reached
        while retries < max_retries:
            try:
                # Search for the song on Genius.com
                search_results = genius.search_song(song["title"], song["artist"])

                # Check if the song was found
                if search_results is None:
                    print(f"{song['title']} by {song['artist']} not found on Genius.com")
                else:
                    # Get the lyrics of the song
                    lyrics = search_results.lyrics

                    # Write the data to the CSV file
                    writer.writerow([song["artist"], song["title"], lyrics])

                # Exit the retry loop if the request was successful
                break
            except Exception as e:
                # Print the error message
                print(f"An error occurred while getting lyrics for {song['title']} by {song['artist']}: {e}")

                # Increment the retry counter
                retries += 1

                # Wait for a few seconds before retrying
                time.sleep(5)

# Print a message indicating where the lyrics are saved
print(f"Lyrics saved to {filename}")

In this code, we add a try-except block around the code that calls the Genius API, and retry the request a certain number of times before giving up. The maximum number of retries is set using the max_retries variable. If an error occurs while getting the lyrics for a song, we print the error message, increment the retry counter, and wait for a few seconds before retrying the request. Once the request succeeds, we exit the retry loop and write the lyrics to the CSV file as before.
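
The retry logic can also be pulled out into a small helper, which makes it easier to see what happens when every attempt fails (the loop above simply moves on to the next song). This is a minimal sketch, not the library's own mechanism: call_with_retries, its parameters, and the flaky stand-in function are all hypothetical.

```python
import time


def call_with_retries(func, max_retries=3, delay=5.0):
    """Call func(); on an exception wait `delay` seconds and retry, up to max_retries attempts."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception as error:  # broad on purpose, mirroring the loop above
            last_error = error
            print(f"Attempt {attempt} failed: {error}")
            if attempt < max_retries:
                time.sleep(delay)
    # Every attempt failed: raise so the caller can decide what to do
    raise RuntimeError(f"Gave up after {max_retries} attempts") from last_error


# Stand-in for a request that times out twice before succeeding
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("read timed out")
    return "ok"


result = call_with_retries(flaky, delay=0)
```

In the scraper, func would be a small function or lambda wrapping the search_song call, and the caller could catch the final RuntimeError to record the song as failed.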

Prompt

Perfect. Now how do I get it to read the requested songs from a csv file?

ChatGPT

To read the list of songs from a CSV file, you can use Python’s built-in csv module. Here’s an example of how you can modify the code to read the list of songs from a CSV file:




import csv
import lyricsgenius
import time

# Set the access token for the Genius API
access_token = "<your_access_token>"

# Create a Genius object with the access token
genius = lyricsgenius.Genius(access_token)

# Set the filename of the input CSV file
input_filename = "input_songs.csv"

# Set the filename of the output CSV file
output_filename = "output_lyrics.csv"

# Set the maximum number of retries
max_retries = 3

# Open the input CSV file in read mode
with open(input_filename, "r", newline="") as input_file:
    # Create a CSV reader object
    reader = csv.reader(input_file)

    # Skip the header row
    next(reader)

    # Loop over the rows in the input CSV file
    songs = []
    for row in reader:
        # Parse the artist and song title from the row
        artist, title = row

        # Add the song to the list of songs
        songs.append({"artist": artist, "title": title})

# Open the output CSV file in write mode
with open(output_filename, "w", newline="") as output_file:
    writer = csv.writer(output_file)

    # Write the headers to the output CSV file
    writer.writerow(["Artist", "Song Title", "Lyrics"])

    # Loop over the list of songs and write their lyrics to the output CSV file
    for song in songs:
        # Initialize the retry counter
        retries = 0

        # Loop until the request succeeds or the maximum number of retries is reached
        while retries < max_retries:
            try:
                # Search for the song on Genius.com
                search_results = genius.search_song(song["title"], song["artist"])

                # Check if the song was found
                if search_results is None:
                    print(f"{song['title']} by {song['artist']} not found on Genius.com")
                else:
                    # Get the lyrics of the song
                    lyrics = search_results.lyrics

                    # Write the data to the output CSV file
                    writer.writerow([song["artist"], song["title"], lyrics])

                # Exit the retry loop if the request was successful
                break
            except Exception as e:
                # Print the error message
                print(f"An error occurred while getting lyrics for {song['title']} by {song['artist']}: {e}")

                # Increment the retry counter
                retries += 1

                # Wait for a few seconds before retrying
                time.sleep(5)

# Print a message indicating where the lyrics are saved
print(f"Lyrics saved to {output_filename}")

In this code, we open the input CSV file using the csv.reader object and skip the header row. Then, we loop over the rows in the input CSV file and parse the artist and song title from each row. We add each song to the songs list as a dictionary with keys “artist” and “title”. Finally, we open the output CSV file using the csv.writer object, loop over the list of songs, and write their lyrics to the output CSV file as before.

Prompt

Should the headings of the input csv be artist and title?

ChatGPT

Yes, that works. The code skips the first row and reads each remaining row by position, so the first column must hold the artist and the second the title; labelling them “artist” and “title” keeps the file self-describing.

For example, your input CSV file should look something like this:



artist,title
The Beatles,Hey Jude
Queen,Bohemian Rhapsody
Led Zeppelin,Stairway to Heaven

The code I provided assumes that the first row of the input CSV file contains the column headings, so it skips over this row by calling next(reader) before the for loop.
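
If matching on the header names themselves is preferred, csv.DictReader is one option. The following is a minimal sketch, using an in-memory string in place of input_songs.csv; with it the column order no longer matters.

```python
import csv
import io

# Sample input; the columns are deliberately in a different order
sample = "title,artist\nHey Jude,The Beatles\nBohemian Rhapsody,Queen\n"

songs = []
reader = csv.DictReader(io.StringIO(sample))
for row in reader:
    # Look fields up by header name and strip stray spaces around the values
    songs.append({"artist": row["artist"].strip(), "title": row["title"].strip()})

print(songs)
```

With this approach a misspelled header raises a KeyError right away instead of silently swapping artist and title.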

Prompt

Can you make a copy of this conversation?

ChatGPT

As an AI language model, I do not have the capability to copy the conversation as a whole. However, you can easily copy and save the conversation by selecting all the text and using the copy/paste function in your browser or text editor.

Prompt

There is some extra text in some of the lyrics that does not belong

ChatGPT

It’s possible that some of the lyrics retrieved from the Genius API include additional text that is not part of the song lyrics, such as song credits or annotations. One way to remove this extra text is to use regular expressions to identify and remove certain patterns of text.

Here’s an example of how you can modify the code to remove any text in the lyrics that is enclosed in square brackets:



import csv
import lyricsgenius
import re
import time

# Set the access token for the Genius API
access_token = "<your_access_token>"

# Create a Genius object with the access token
genius = lyricsgenius.Genius(access_token)

# Set the filename of the input CSV file
input_filename = "input_songs.csv"

# Set the filename of the output CSV file
output_filename = "output_lyrics.csv"

# Set the maximum number of retries
max_retries = 3

# Open the input CSV file in read mode
with open(input_filename, "r", newline="") as input_file:
    # Create a CSV reader object
    reader = csv.reader(input_file)

    # Skip the header row
    next(reader)

    # Loop over the rows in the input CSV file
    songs = []
    for row in reader:
        # Parse the artist and song title from the row
        artist, title = row

        # Add the song to the list of songs
        songs.append({"artist": artist, "title": title})

# Open the output CSV file in write mode
with open(output_filename, "w", newline="") as output_file:
    writer = csv.writer(output_file)

    # Write the headers to the output CSV file
    writer.writerow(["Artist", "Song Title", "Lyrics"])

    # Loop over the list of songs and write their lyrics to the output CSV file
    for song in songs:
        # Initialize the retry counter
        retries = 0

        # Loop until the request succeeds or the maximum number of retries is reached
        while retries < max_retries:
            try:
                # Search for the song on Genius.com
                search_results = genius.search_song(song["title"], song["artist"])

                # Check if the song was found
                if search_results is None:
                    print(f"{song['title']} by {song['artist']} not found on Genius.com")
                else:
                    # Get the lyrics of the song
                    lyrics = search_results.lyrics

                    # Remove any text in square brackets from the lyrics
                    lyrics = re.sub(r"\[(.*?)\]", "", lyrics)

                    # Write the data to the output CSV file
                    writer.writerow([song["artist"], song["title"], lyrics])

                # Exit the retry loop if the request was successful
                break
            except Exception as e:
                # Print the error message
                print(f"An error occurred while getting lyrics for {song['title']} by {song['artist']}: {e}")

                # Increment the retry counter
                retries += 1

                # Wait for a few seconds before retrying
                time.sleep(5)

# Print a message indicating where the lyrics are saved
print(f"Lyrics saved to {output_filename}")

In this modified code, we import the re module to use regular expressions to remove text in square brackets from the lyrics. We then use the re.sub function to replace any text in square brackets with an empty string. This removes any text enclosed in square brackets from the lyrics.
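
To see the effect of this substitution in isolation, here is a small sketch run on a made-up lyric string. The extra step that collapses the blank lines left behind is an addition for illustration, not part of the code above.

```python
import re

# Made-up sample text with two bracketed section headers
sample = "[Verse 1]\nFirst line\nSecond line\n\n[Chorus]\nHook line\n"

# Remove bracketed section headers such as [Verse 1] or [Chorus]
without_headers = re.sub(r"\[.*?\]", "", sample)

# Collapse runs of blank lines left behind and trim the ends
cleaned = re.sub(r"\n{2,}", "\n", without_headers).strip()
print(cleaned)
```

The non-greedy .*? matters here: a greedy .* would match from the first opening bracket to the last closing bracket on the same line.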

Prompt

Okay that works, but there is also additional stuff before the [Verse 1]

ChatGPT

It’s possible that the lyrics retrieved from the Genius API include additional text before the actual lyrics begin, such as song credits or annotations. One way to remove this extra text is to use regular expressions to identify and remove certain patterns of text.

Here’s an example of how you can modify the code to remove any text in the lyrics that appears before the first occurrence of a section header (e.g., [Verse 1], [Chorus], etc.):




import csv
import lyricsgenius
import re
import time

# Set the access token for the Genius API
access_token = "<your_access_token>"

# Create a Genius object with the access token
genius = lyricsgenius.Genius(access_token)

# Set the filename of the input CSV file
input_filename = "input_songs.csv"

# Set the filename of the output CSV file
output_filename = "output_lyrics.csv"

# Set the maximum number of retries
max_retries = 3

# Open the input CSV file in read mode
with open(input_filename, "r", newline="") as input_file:
    # Create a CSV reader object
    reader = csv.reader(input_file)

    # Skip the header row
    next(reader)

    # Loop over the rows in the input CSV file
    songs = []
    for row in reader:
        # Parse the artist and song title from the row
        artist, title = row

        # Add the song to the list of songs
        songs.append({"artist": artist, "title": title})

# Open the output CSV file in write mode
with open(output_filename, "w", newline="") as output_file:
    writer = csv.writer(output_file)

    # Write the headers to the output CSV file
    writer.writerow(["Artist", "Song Title", "Lyrics"])

    # Loop over the list of songs and write their lyrics to the output CSV file
    for song in songs:
        # Initialize the retry counter
        retries = 0

        # Loop until the request succeeds or the maximum number of retries is reached
        while retries < max_retries:
            try:
                # Search for the song on Genius.com
                search_results = genius.search_song(song["title"], song["artist"])

                # Check if the song was found
                if search_results is None:
                    print(f"{song['title']} by {song['artist']} not found on Genius.com")
                else:
                    # Get the lyrics of the song
                    lyrics = search_results.lyrics

                    # Remove any text in square brackets from the lyrics
                    lyrics = re.sub(r"\[(.*?)\]", "", lyrics)

                    # Remove any text before the first section header
                    section_header_pattern = r"\[.*?\]"
                    match = re.search(section_header_pattern, lyrics)
                    if match is not None:
                        lyrics = lyrics[match.start():]

                    # Write the data to the output CSV file
                    writer.writerow([song["artist"], song["title"], lyrics])

                # Exit the retry loop if the request was successful
                break
            except Exception as e:
                # Print the error message
                print(f"An error occurred while getting lyrics for {song['title']} by {song['artist']}: {e}")

                # Increment the retry counter
                retries += 1

                # Wait for a few seconds before retrying
                time.sleep(5)

# Print a message indicating where the lyrics are saved
print(f"Lyrics saved to {output_filename}")

In this modified code, we use the re.search function to find the first occurrence of a section header (e.g., [Verse 1], `[Chorus

Prompt

Okay, getting there. I also want to remove the text at the beginning of the line that includes “12 ContributorsOne of Them Girls Lyrics”

ChatGPT

You can modify the regular expression to remove any text that occurs before the first section header and also before the line that includes “12 ContributorsOne of Them Girls Lyrics”. Here’s how you can modify the code to do this:




import csv
import lyricsgenius
import re
import time

# Set the access token for the Genius API
access_token = "<your_access_token>"

# Create a Genius object with the access token
genius = lyricsgenius.Genius(access_token)

# Set the filename of the input CSV file
input_filename = "input_songs.csv"

# Set the filename of the output CSV file
output_filename = "output_lyrics.csv"

# Set the maximum number of retries
max_retries = 3

# Open the input CSV file in read mode
with open(input_filename, "r", newline="") as input_file:
    # Create a CSV reader object
    reader = csv.reader(input_file)

    # Skip the header row
    next(reader)

    # Loop over the rows in the input CSV file
    songs = []
    for row in reader:
        # Parse the artist and song title from the row
        artist, title = row

        # Add the song to the list of songs
        songs.append({"artist": artist, "title": title})

# Open the output CSV file in write mode
with open(output_filename, "w", newline="") as output_file:
    writer = csv.writer(output_file)

    # Write the headers to the output CSV file
    writer.writerow(["Artist", "Song Title", "Lyrics"])

    # Loop over the list of songs and write their lyrics to the output CSV file
    for song in songs:
        # Initialize the retry counter
        retries = 0

        # Loop until the request succeeds or the maximum number of retries is reached
        while retries < max_retries:
            try:
                # Search for the song on Genius.com
                search_results = genius.search_song(song["title"], song["artist"])

                # Check if the song was found
                if search_results is None:
                    print(f"{song['title']} by {song['artist']} not found on Genius.com")
                else:
                    # Get the lyrics of the song
                    lyrics = search_results.lyrics

                    # Remove any text in square brackets from the lyrics
                    lyrics = re.sub(r"\[(.*?)\]", "", lyrics)

                    # Remove any text before the first section header
                    section_header_pattern = r"\[.*?\]"
                    match = re.search(section_header_pattern, lyrics)
                    if match is not None:
                        lyrics = lyrics[match.start():]

                    # Remove any text before the line that includes "12 ContributorsOne of Them Girls Lyrics"
                    contributors_pattern = r"^.*12 ContributorsOne of Them Girls Lyrics\n"
                    match = re.search(contributors_pattern, lyrics)
                    if match is not None:
                        lyrics = lyrics[match.end():]

                    # Write the data to the output CSV file
                    writer.writerow([song["artist"], song["title"], lyrics])

                # Exit the retry loop if the request was successful
                break
            except Exception as e:
                # Print the error message
                print(f"An error occurred while getting lyrics for {song['title']} by {song['artist']}: {e}")

                # Increment the retry counter
                retries += 1

                # Wait for a few seconds before retrying
                time.sleep(5)

# Print a message indicating where the lyrics are saved
print(f"Lyrics saved to {output_filename}")

In this modified code, we use the re.search function to find the line that includes “12

Prompt

It is a pattern that begins with a number and ends with “Lyrics”

ChatGPT

In that case, you can modify the regular expression to match any line that begins with a number and ends with “Lyrics”. Here’s how you can modify the code to do this:




import csv
import lyricsgenius
import re
import time

# Set the access token for the Genius API
access_token = "<your_access_token>"

# Create a Genius object with the access token
genius = lyricsgenius.Genius(access_token)

# Set the filename of the input CSV file
input_filename = "input_songs.csv"

# Set the filename of the output CSV file
output_filename = "output_lyrics.csv"

# Set the maximum number of retries
max_retries = 3

# Open the input CSV file in read mode
with open(input_filename, "r", newline="") as input_file:
    # Create a CSV reader object
    reader = csv.reader(input_file)

    # Skip the header row
    next(reader)

    # Loop over the rows in the input CSV file
    songs = []
    for row in reader:
        # Parse the artist and song title from the row
        artist, title = row

        # Add the song to the list of songs
        songs.append({"artist": artist, "title": title})

# Open the output CSV file in write mode
with open(output_filename, "w", newline="") as output_file:
    writer = csv.writer(output_file)

    # Write the headers to the output CSV file
    writer.writerow(["Artist", "Song Title", "Lyrics"])

    # Loop over the list of songs and write their lyrics to the output CSV file
    for song in songs:
        # Initialize the retry counter
        retries = 0

        # Loop until the request succeeds or the maximum number of retries is reached
        while retries < max_retries:
            try:
                # Search for the song on Genius.com
                search_results = genius.search_song(song["title"], song["artist"])

                # Check if the song was found
                if search_results is None:
                    print(f"{song['title']} by {song['artist']} not found on Genius.com")
                else:
                    # Get the lyrics of the song
                    lyrics = search_results.lyrics

                    # Remove any text in square brackets from the lyrics
                    lyrics = re.sub(r"\[(.*?)\]", "", lyrics)

                    # Remove any text before the first section header
                    section_header_pattern = r"\[.*?\]"
                    match = re.search(section_header_pattern, lyrics)
                    if match is not None:
                        lyrics = lyrics[match.start():]

                    # Remove any text before the line that includes a number and "Lyrics"
                    contributors_pattern = r"^.*\d+.*Lyrics\n"
                    match = re.search(contributors_pattern, lyrics)
                    if match is not None:
                        lyrics = lyrics[match.end():]

                    # Write the data to the output CSV file
                    writer.writerow([song["artist"], song["title"], lyrics])

                # Exit the retry loop if the request was successful
                break
            except Exception as e:
                # Print the error message
                print(f"An error occurred while getting lyrics for {song['title']} by {song['artist']}: {e}")

                # Increment the retry counter
                retries += 1

                # Wait for a few seconds before retrying
                time.sleep(5)

# Print a message indicating where the lyrics are saved
print(f"Lyrics saved to {output_filename}")

In this modified code, we use the regular expression r"^.*\d+.*Lyrics\n" to match a first line that contains a number and ends with “Lyrics”.

Prompt

That worked really well. Thank you.

ChatGPT

You’re welcome! I’m glad it worked for you. If you have any more questions or issues, feel free to ask.

Prompt

could you explain the regular expression in more detail?

ChatGPT

Sure, I’d be happy to explain the regular expression in more detail.

The regular expression r"^.*\d+.*Lyrics\n" has the following parts:

^: This matches the beginning of the string.
.*: This matches any character zero or more times. The . matches any character except a newline, and the * means “zero or more times”. So, .* matches any sequence of characters (including no characters at all).
\d+: This matches one or more digits. The \d is a shorthand character class that matches any digit, and the + means “one or more times”.
.*: This matches any character zero or more times, just like the previous .*.
Lyrics\n: This matches the literal string “Lyrics” followed by a newline character.

So, the regular expression as a whole matches any line that begins with any sequence of characters, followed by one or more digits, followed by any sequence of characters, followed by the string “Lyrics” and a newline character. This pattern matches lines like:


12 ContributorsOne of Them Girls Lyrics
7 ContributorsBoyfriend (with Social House) Lyrics
2 ContributorsDon't Start Now Lyrics

By using this regular expression to remove the text before the line that includes a number and “Lyrics”, we can extract only the lyrics themselves from the Genius.com webpage.
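
When combining the cleanup steps, their order matters: once bracketed headers have been removed, a later search for the first bracketed header has nothing left to find. The sketch below applies the header-line pattern first and the bracket removal second, on a made-up string. clean_lyrics is a hypothetical helper name, and anchoring the pattern on a leading number (^\d+) is an assumption based on the example lines above.

```python
import re


def clean_lyrics(raw):
    """Strip the leading 'N Contributors... Lyrics' line, then bracketed section headers."""
    # 1. Drop the first line if it starts with a number and ends with "Lyrics"
    text = re.sub(r"^\d+.*Lyrics\n", "", raw, count=1)
    # 2. Remove section headers such as [Verse 1]
    text = re.sub(r"\[.*?\]", "", text)
    # 3. Tidy up leftover blank lines and surrounding whitespace
    return re.sub(r"\n{2,}", "\n", text).strip()


raw = "12 ContributorsOne of Them Girls Lyrics\n[Verse 1]\nFirst line\n\n[Chorus]\nHook line\n"
print(clean_lyrics(raw))
```

Because no re.MULTILINE flag is passed, ^ only matches at the very start of the text, so a lyric line further down that happens to start with a number and end in “Lyrics” is left alone.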