Reddit is a powerful source of NLP data

Color me impressed.

"""
write a python function that accepts a subreddit name and an integer and returns the top n posts from that subreddit
"""

def get_top_n_posts(subreddit, n):
    """
    returns the top n posts from a given subreddit
    """
    url = 'https://www.reddit.com/r/{}/hot.json'.format(subreddit)
    headers = {'User-agent': 'Bleep blorp bot 0.1'}
    params = {'limit': n}
    res = requests.get(url, headers=headers, params=params)
    if res.status_code == 200:
        return res.json()
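
For anyone running this, note that the snippet assumes requests has been imported, and that Reddit's listing JSON nests the posts under data → children. A quick usage sketch (the subreddit name is just an example):

import requests  # the snippet above assumes this import

posts = get_top_n_posts('learnpython', 5)
if posts:
    for child in posts['data']['children']:
        # each child wraps a post's fields under its own 'data' key
        print(child['data']['title'])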

However, it did not quite understand a more specific function:

"""
write a python function that accepts a reddit permalink and returns the comments.
"""

import requests
import json

def get_comments(url):
    """
    accepts a reddit permalink and returns the comments
    """
    # get the comments from the url
    comments = requests.get(url)
    # convert the comments to a json object
    comments = comments.json()
    # return the comments
    return comments

def main():
    """
    main function
    """
    # get the comments
    comments = get_comments("https://www.reddit.com/r/funny/comments/3g1jfi/buttons/")
    # print the comments
    print(comments)

if __name__ == "__main__":
    main()

Instead it should have done something like replacing the trailing / with .json and prepending https://www.reddit.com if the permalink is relative (e.g. https://www.reddit.com/r/Anxiety/comments/pgo3k8/professor_picked_on_me_to_answer_a_question_i_did.json), which seems to work well enough.
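
For reference, here's a minimal sketch of that fix (get_comments_json is my own name for it, not Codex output). The .json endpoint returns a two-element array, with the comment listing second:

import requests

def get_comments_json(permalink):
    """
    accepts a reddit permalink and returns the parsed comment listing
    """
    # prepend the host if we were given a relative permalink
    if permalink.startswith('/'):
        permalink = 'https://www.reddit.com' + permalink
    # replace any trailing / with .json
    url = permalink.rstrip('/') + '.json'
    headers = {'User-agent': 'Bleep blorp bot 0.1'}
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        # index 0 is the post itself, index 1 is the comment tree
        return res.json()[1]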

As for why I’m trawling the anxiety subreddit, it’s because I’m curious about building a fine-tuning dataset specifically for mental health. Yes, I know this is a high-stakes arena, but I am working with a clinical researcher on the topic, and we will be going through all the proper gatekeeping channels for approval and rigorous testing. This is more a scientific and humanitarian endeavor than a commercial one.

Anyways, I was curious about the “bleep blorp bot” user agent, which was pretty distinct and led to this GitHub repo: GitHub - cheekeet86/project_3: Games Reddits Classifier (General Assembly SG Data Science Immersive Batch 9)

So it seems like Codex got the idea for this header from that repo, but the function itself does not appear anywhere in it.

I don’t think you understand the point of the robots.txt file. It’s for search engines, which this is not; robots.txt merely tells a crawler which endpoints to disregard to avoid wasteful indexing.

https://www.cloudflare.com/learning/bots/what-is-robots.txt/
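
For completeness, if you do want a script to honor robots.txt anyway, Python’s standard library can check it for you; a minimal sketch with urllib.robotparser:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.reddit.com/robots.txt')
rp.read()
# True if this user agent may fetch the URL under the site's rules
print(rp.can_fetch('Bleep blorp bot 0.1', 'https://www.reddit.com/r/python/hot.json'))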

Furthermore, Reddit documents its own API quite extensively: reddit.com: api documentation

Meh, I’m not remotely concerned. If they didn’t want anonymous people downloading JSON, they would disable it. It’s not as though anyone is exploiting or hacking their platform. There are literally thousands of tools, techniques, settings, and utilities out there for hardening web resources.

Going through Reddit data is a brilliant idea :heart_eyes:. Here’s the prompt I ran and the results I got:

"""
write a python function that:
1) chooses a random subreddit
2) gets the top 100 posts from that subreddit
3) parses the text and extracts any links
"""

import praw
import re
import random

def get_random_subreddit():
    """
    returns a random subreddit
    """
    return random.choice(['python', 'learnpython', 'programming', 'datascience', 'cscareerquestions'])

def get_top_posts(subreddit):
    """
    returns a list of the top 100 posts from the given subreddit
    """
    r = praw.Reddit(user_agent='my_cool_application')
    submissions = r.get_subreddit(subreddit).get_top(limit=100)
    return [x for x in submissions]

def parse_text(posts):
    """
    returns a list of links from the top 100 posts
    """
    links = []
    for post in posts:
        text = post.selftext.encode('ascii', 'ignore')
        links += re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    return links

def get_random_link(links):
    """
    returns a random link from the list of links
    """
    return random.choice(links)

if __name__ == '__main__':
    subreddit = get_random_subreddit()
    posts = get_top_posts(subreddit)
    links = parse_text(posts)
    print get_random_link(links)
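
Worth noting: that output targets the long-deprecated PRAW 3 interface (get_subreddit / get_top) and Python 2’s print statement. On current PRAW (7.x) the same idea would look roughly like the sketch below; client_id and client_secret are placeholders for your own script app’s credentials:

import praw
import random
import re

# crude URL pattern, same spirit as the generated regex
URL_RE = re.compile(r'https?://\S+')

def get_links(subreddit_name, limit=100):
    reddit = praw.Reddit(
        client_id='YOUR_CLIENT_ID',          # placeholder
        client_secret='YOUR_CLIENT_SECRET',  # placeholder
        user_agent='my_cool_application',
    )
    links = []
    # top() replaces the old get_subreddit().get_top() chain
    for submission in reddit.subreddit(subreddit_name).top(limit=limit):
        links += URL_RE.findall(submission.selftext)
    return links

if __name__ == '__main__':
    links = get_links(random.choice(['python', 'learnpython', 'programming']))
    if links:
        print(random.choice(links))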

One of my earliest videos was about using backlinks from Reddit data to find “niche” audiences.

Great examples!

Interesting discussion regarding the usage. I don’t know if it’s feasible for our tool to decide which usage is legitimate and which is not: there will be tons of edge cases, and external knowledge is needed to ascertain the validity of such operations. In the ideal case, a warning might be appropriate, so the person writing the code can decide how to proceed.

That’s pretty damn cool.