Scraping Reddit data

Scraping Reddit data

As its name suggests PRAW is a Python wrapper for the Reddit API, which enables you to scrape data from  subreddits, create a bot and much more.

In this article, we will learn how to use PRAW to scrape posts from different subreddits as well as how to get comments from a specific post.

Getting Started

PRAW can be installed using pip or conda:

pip install praw
or
conda install praw

Now PRAW can be imported by writting:

import praw

Before PRAW can be used to scrape data we need to authenticate ourselves. For this we need to create a Reddit instance and provide it with a  client_id , client_secret and a user_agent .

reddit = praw.Reddit(client_id='my_client_id', client_secret='my_client_secret', user_agent='my_user_agent')

To get the authentication information we need to create a reddit app by navigating to this page and clicking create app or create another app.

Reddit Application
Figure 1: Create a Reddit Application

This will open a form where you need to fill in a name, description and redirect uri. For the redirect uri you should choose http://localhost:8080 as described in the excellent PRAW documentation.

Create a new Reddit Application
Figure 2: Create new Reddit Application

After pressing create app a new application will appear. Here you can find the authentication information needed to create the praw.Reddit instance.

Get Authentication Information
Figure 3: Authentication information

Get subreddit data

Now that we have a praw.Reddit  instance we can access all available  functions and use it, to for  example get the 10 “hottest” posts from the Machine Learning subreddit.

# get 10 hot posts from the MachineLearning subreddit
hot_posts = reddit.subreddit('MachineLearning').hot(limit=10)
for post in hot_posts:
    print(post.title)

Output:

[D] What is the best ML paper you read in 2018 and why?
[D] Machine Learning - WAYR (What Are You Reading) - Week 53
[R] A Geometric Theory of Higher-Order Automatic Differentiation
UC Berkeley and Berkeley AI Research published all materials of CS 188: Introduction to Artificial Intelligence, Fall 2018
[Research] Accurate, Data-Efficient, Unconstrained Text Recognition with Convolutional Neural Networks
...

We can also get the 10 “hottest” posts of all subreddits combined by specifying “all” as the subreddit name.

# get hottest posts from all subreddits
hot_posts = reddit.subreddit('all').hot(limit=10)
for post in hot_posts:
    print(post.title)

Output:

I've been lying to my wife about film plots for years.
I don’t care if this gets downvoted into oblivion! I DID IT REDDIT!!
I’ve had enough of your shit, Karen
Stranger Things 3: Coming July 4th, 2019
...

This variable can be iterated over and features including the post title, id  and url can be extracted and saved into an .csv file.

import pandas as pd
posts = []
ml_subreddit = reddit.subreddit('MachineLearning')
for post in ml_subreddit.hot(limit=10):
    posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
posts = pd.DataFrame(posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
print(posts)
Picture of Hottest ML Posts
Figure 4: Hottest ML posts

General information about the subreddit can be obtained using the .description function on the subreddit object.

# get MachineLearning subreddit data
ml_subreddit = reddit.subreddit('MachineLearning')

print(ml_subreddit.description)

Output:

**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
--------
+[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
--------
+[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
--------
+[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
--------
+[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ANews)
--------
...

Get comments from a specific post

You can get the comments for a post/submission by creating/obtaining a Submission object and looping through the comments attribute. To get a post/submission we can either iterate through the submissions of a subreddit or specify a specific submission using reddit.submission and passing it the submission url or id.

submission = reddit.submission(url="https://www.reddit.com/r/MapPorn/comments/a3p0uq/an_image_of_gps_tracking_of_multiple_wolves_in/")
# or 
submission = reddit.submission(id="a3p0uq")

To get the top-level comments we only need to iterate over submission.comments.

for top_level_comment in submission.comments:
    print(top_level_comment.body)

This will work for some submission, but for others that have more comments this code will throw an AttributeError saying:

AttributeError: 'MoreComments' object has no attribute 'body'

These MoreComments object represent the “load more comments” and “continue  this thread” links encountered on the websites, as described in more detail in the comment documentation.

There get rid of the MoreComments objects, we can check the datatype of each comment before printing the body.

from praw.models import MoreComments
for top_level_comment in submission.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    print(top_level_comment.body)

But Praw already provides a method called replace_more , which replaces or removes the MoreComments. The method takes an argument called limit, which when set to 0 will remove all MoreComments.

submission.comments.replace_more(limit=0)
for top_level_comment in submission.comments:
    print(top_level_comment.body)

Both of the above code blocks successfully iterate over all the top-level comments and print their body. The output can be seen below.

Source: [https://www.facebook.com/VoyageursWolfProject/](https://www.facebook.com/VoyageursWolfProject/)
I thought this was a shit post made in paint before I read the title
Wow, that’s very cool.  To think how keen their senses must be to recognize and avoid each other and their territories.  Plus, I like to think that there’s one from the white colored clan who just goes way into the other territories because, well, he’s a badass.
That’s really cool. The edges are surprisingly defined.
...

However, the comment section can be arbitrarily deep and most of the time we surely also want to get the comments of the comments. CommentForest provides the .list method, which can be used for getting all comments inside the comment section.

submission.comments.replace_more(limit=0)
for comment in submission.comments.list():
    print(comment.body)

The above code will first of output all the top-level comments, followed by the second-level comments and so on until there are no comments left.

Conclusion

Praw is a Python wrapper for the Reddit API, which enables us to use the Reddit API with a clean Python interface. The API can be used for webscraping, creating a bot as well as many  others.

This article covered authentication, getting posts from a subreddit and getting comments. To learn more about the API I suggest to take a look at their excellent documentation.

If you liked this article consider subscribing on my Youtube Channel and following me on social media.

The code covered in this article is available as a Github Repository.

If you have any questions, recommendations or critiques, I can be reached via Twitter or the comment section.