Bug Report and Fix in Web Q&A Tutorial

Hi, just to report a bug in the “crawler” function of the Web Q&A Tutorial that comes up when it crawls a link that has a question mark.
ChatGPT helped me fix it, and here’s the whole thing:

The error you’re encountering is caused by characters in the filename that your file system does not allow. In particular, the question mark `?` is not a valid filename character on Windows (and is problematic on several other file systems). Your updated code now checks for question marks in URLs and skips them, which is a correct step toward not treating URLs with query parameters as filenames. However, the error suggests the code is not correctly sanitizing or skipping those URLs at the point where it writes files.
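As a quick illustration (hypothetical filename):

```python
# Hypothetical reproduction: '?' is reserved in Windows filenames, so this
# open() call raises OSError there (e.g. [Errno 22] Invalid argument).
# Most Unix file systems only forbid '/' and the NUL byte, so the same
# call would succeed on Linux or macOS.
with open("page?id=1.txt", "w") as f:
    f.write("example")
```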

To address the error and ensure that filenames are valid:

  1. Sanitize filenames: Remove or replace characters that are not allowed in filenames when you create the filename from the URL, not only when deciding whether a URL should be skipped. This includes not only the question mark `?` but also slashes `/`, colons `:`, asterisks `*`, and so on.
  2. Correct the file-naming logic: The logic you use to create filenames from URLs (`url[8:].replace("/", "_")`) is intended to strip the protocol (e.g., `https://`) and replace slashes with underscores, but it does not account for other invalid characters (and `url[8:]` assumes an `https://` prefix; for an `http://` URL it would also drop the first character of the host). Expand this logic to remove or replace every character that is invalid in filenames.

Here is an updated approach to sanitizing filenames that you could use:

```python
import re

def sanitize_filename(url):
    # Remove the scheme (http:// or https://)
    sanitized = re.sub(r'https?://', '', url)
    # Replace any character that is not a letter, digit, hyphen,
    # underscore, or dot with an underscore
    sanitized = re.sub(r'[^a-zA-Z0-9\-_\.]', '_', sanitized)
    return sanitized
```

And then use this function when creating the filename:

```python
# Inside your crawl function, when saving text to a file
filename = 'text/' + local_domain + '/' + sanitize_filename(url) + ".txt"
with open(filename, "w", encoding="UTF-8") as f:
    ...  # your existing file-writing code here
```

This approach removes the protocol (`http://` or `https://`) and replaces any character that is not a letter, number, hyphen, underscore, or dot with an underscore, making sure the filename is valid on most file systems.
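For example, with a hypothetical URL:

```python
print(sanitize_filename("https://example.com/docs/page?ref=home"))
# -> example.com_docs_page_ref_home
```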

Remember, the logic for skipping URLs with query parameters (`?` in the URL) or fragments (`#`) is separate from sanitizing filenames. Make sure both are implemented correctly, so that you skip the URLs you intend to skip and create valid filenames for the URLs you do process.
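A minimal sketch of such a skip check, using urlparse from the standard library (the helper name is my own, not from the tutorial):

```python
from urllib.parse import urlparse

def should_skip(url):
    # Skip any URL that carries a query string ("?...") or a fragment ("#...")
    parsed = urlparse(url)
    return bool(parsed.query or parsed.fragment)

# Inside the crawl loop, before queueing or saving a URL:
# if should_skip(url):
#     continue
```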

Here’s another one that comes up in the pandas code fragment when running it in a Jupyter notebook. First the error, then the fix.

```
PermissionError                            Traceback (most recent call last)
Cell In[36], line 10
      6 # Get all the text files in the text directory
      7 for file in os.listdir("text/" + domain + "/"):
      8
      9     # Open the file and read the text
---> 10     with open("text/" + domain + "/" + file, "r", encoding="UTF-8") as f:
     11         text = f.read()
     13     # Omit the first 11 lines and the last 4 lines, then replace -, _, and #update with spaces.

File ~\somePath\interactiveshell.py:282, in _modified_open(file, *args, **kwargs)
    275 if file in {0, 1, 2}:
    276     raise ValueError(
    277         f"IPython won't let you open fd={file} by default "
    278         "as it is likely to crash IPython. If you know what you are doing, "
    279         "you can use builtins' open."
    280     )
--> 282 return io_open(file, *args, **kwargs)

PermissionError: [Errno 13] Permission denied: 'text/openai.com/.ipynb_checkpoints'
```

Fix:

To skip files (or directories) that start with a dot, such as .ipynb_checkpoints (a directory Jupyter uses to store checkpoint files), you can add a simple condition inside your loop that checks the filename before opening and reading the file.

Here’s how you can adjust your loop:

```python
import os
import pandas as pd

# Create a list to store the text files
texts = []

# Define a function to remove newlines
def remove_newlines(text):
    return text.replace('\n', ' ').replace('\r', '')

# Get all the text files in the text directory, skipping files starting with a dot
for file in os.listdir("text/" + domain + "/"):
    if file.startswith('.'):  # Skip any files starting with a dot
        continue

    # Open the file and read the text
    with open("text/" + domain + "/" + file, "r", encoding="UTF-8") as f:
        text = f.read()

    # Omit the first 11 lines and the last 4 lines, then replace -, _, and #update with spaces
    processed_text = text.split('\n')[11:-4]  # assumes you want to skip lines by index
    processed_text = ' '.join(processed_text).replace('-', ' ').replace('_', ' ').replace('#update', ' ')
    texts.append((file.replace('-', ' ').replace('_', ' ').replace('#update', '').replace('.txt', ''), processed_text))

# Create a dataframe from the list of texts
df = pd.DataFrame(texts, columns=['fname', 'text'])

# Set the text column to the filename plus the raw text with the newlines removed
df['text'] = df['fname'] + ". " + df['text'].apply(remove_newlines)

# Make sure the output directory exists before writing the CSV
os.makedirs('processed', exist_ok=True)
df.to_csv('processed/scraped.csv', index=False)
df.head()
```

Key Changes and Notes:

  1. Skip files starting with a dot: Added the condition `if file.startswith('.'): continue` to skip any files or directories whose names start with a dot.
  2. Processing text: The placeholder function `remove_newlines` demonstrates one way to remove newlines from your text; it replaces newline characters (`\n`) with spaces and drops carriage returns (`\r`). Adjust it to your exact requirements.
  3. Filename processing: When appending filenames to the `texts` list, `.txt` is also stripped from the filename before it is used in the text column. Adjust the replacements as needed based on the actual structure of your filenames.
  4. Indexing for line skipping: The line `processed_text = text.split('\n')[11:-4]` assumes you want to drop the first 11 lines and the last 4 lines of the file content after splitting on newlines. Adjust these numbers as needed (see the small example after this list).
  5. DataFrame column update: The line that updates the `df['text']` column now correctly applies `remove_newlines` to the text.
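As a toy illustration of that slice (not tutorial code):

```python
lines = [f"line {i}" for i in range(20)]
print(lines[11:-4])
# -> ['line 11', 'line 12', 'line 13', 'line 14', 'line 15']
```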

This adjusted code should avoid attempting to open special directories like .ipynb_checkpoints and properly handle the text processing as described.
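Since the underlying problem is that .ipynb_checkpoints is a directory, a slightly more robust variant of the same guard is to skip anything that is not a regular file. A sketch, using a stand-in value for the tutorial’s `domain` variable:

```python
import os

domain = "openai.com"  # stand-in; in the tutorial this comes from the crawl setup
directory = "text/" + domain + "/"

for file in os.listdir(directory):
    # Skip directories and anything else that is not a regular file;
    # this also covers .ipynb_checkpoints and other hidden entries
    if not os.path.isfile(os.path.join(directory, file)):
        continue
    # ... open and process the file as above ...
```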