ReReddit: A script to capture real-time data and updates from Reddit

Overview:

The refreddit.py script uses the PRAW library (Python Reddit API Wrapper; install it first) to connect to the Reddit API (you will need to obtain your own Reddit developer credentials). The process runs an endless data collection loop until you manually kill it, so it is well suited to set-and-forget deployment on a Raspberry Pi or similar low-maintenance, seldom-accessed, always-on Linux system (you may want a crontab entry to restart it on reboot).
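For reference, a minimal PRAW connection sketch; the credential values and subreddit name below are placeholders, not the script's actual configuration:

```python
import praw

# Placeholder credentials -- register a "script" app at
# https://www.reddit.com/prefs/apps to obtain your own.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="ReReddit data collector (by /u/your_username)",
)

# Read-only access is enough to pull the newest posts from a subreddit.
for submission in reddit.subreddit("AskScience").new(limit=10):
    print(submission.id, submission.title)
```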

Behavior:

Initially, ReReddit collects up to the 1,000 most recent posts, and their associated comments, available from the subreddits you choose; this can be time consuming due to API rate limits. After those data are captured, it regularly loops back to check for new posts. It also systematically checks previously captured posts for updates, so the output will include new comments, updated text of earlier posts (saved as separate text files), and whether a post was removed by moderators or deleted by the original poster. The update process prioritizes newer posts, identifies post edits and removals, and captures new comment activity whenever the comment count in a post's metadata increases.
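As an illustration, here is a simplified sketch of that kind of refresh check using standard PRAW attributes; the function and its reporting are hypothetical, and the actual logic in refreddit.py differs in detail:

```python
def check_for_updates(reddit, post_id, last_num_comments):
    """Re-fetch a post and flag edits, removals, and new comment activity."""
    submission = reddit.submission(id=post_id)
    if submission.selftext in ("[removed]", "[deleted]"):
        print(f"{post_id}: removed by moderators or deleted by the poster")
    elif submission.edited:  # False, or the epoch timestamp of the edit
        print(f"{post_id}: text was edited at {submission.edited}")
    if submission.num_comments > last_num_comments:
        print(f"{post_id}: new comments to collect")
        submission.comments.replace_more(limit=None)  # expand the comment tree
    return submission.num_comments
```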

Data format:

Output subfolders are structured as: (1) subreddit name, (2) post capture date in YYYYMMDD format, (3) post ID, and (4) comment ID, joined with the parent comment's ID if the comment was a reply (flattening the comment tree while preserving its network edge structure as folder names). Text data are stored in .txt files within the post and comment subfolders; the file name is the epoch timestamp of the post/comment itself, not of collection. Metadata are stored separately in .csv files within these folders (e.g., data capture timestamps, upvote score, number of comments). Metadata are appended every time a post is checked for updates, so you can see the progression of post engagement.
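To make the layout concrete, here is a hypothetical sketch of the path structure and the metadata-append step; the directory names, the meta.csv file name, and the column order are illustrative assumptions, not necessarily what refreddit.py writes:

```python
import csv
import os
import time

# Hypothetical layout matching the structure described above:
#   output/AskScience/20240115/abc123/1705312345.txt    post text
#   output/AskScience/20240115/abc123/meta.csv          post metadata
#   output/AskScience/20240115/abc123/def4_abc123/      comment "def4"
#                                                       replying to "abc123"
def append_metadata(post_dir, score, num_comments):
    """Append one metadata row per update pass so engagement can be tracked."""
    os.makedirs(post_dir, exist_ok=True)
    with open(os.path.join(post_dir, "meta.csv"), "a", newline="") as f:
        csv.writer(f).writerow([int(time.time()), score, num_comments])

append_metadata("output/AskScience/20240115/abc123", score=42, num_comments=7)
```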

Note: Because the data consist of many small files, a file system formatted with a smaller cluster allocation size is ideal so that the output doesn't take up unnecessary disk space. For example, if you are backing up a large amount of data to a dedicated drive, you might format it as NTFS with a 512-byte cluster allocation size (this option is standard but not the default when formatting in Windows). In a recent use case, this reduced overall storage size by roughly 5:1. However, this configuration may result in disk fragmentation and slower I/O.
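For reference, formatting a dedicated drive this way from an elevated Windows command prompt looks like the following (the drive letter is a placeholder):

```
format E: /FS:NTFS /A:512 /Q
```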

Other technical notes:

The Reddit API is rate limited, so data refresh frequency is inversely related to the number of subreddits you include; selecting many highly active subreddits will slow the refresh rate.
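As a back-of-envelope illustration, assuming a budget of roughly 60 API requests per minute (check Reddit's currently published limits), the time per refresh sweep scales linearly with the number of subreddits:

```python
def estimated_sweep_minutes(n_subreddits, calls_per_subreddit,
                            budget_per_minute=60):
    """Back-of-envelope sweep time: total API calls divided by the rate budget."""
    return n_subreddits * calls_per_subreddit / budget_per_minute

# e.g., 20 subreddits at ~30 calls each per sweep -> about 10 minutes per sweep
print(estimated_sweep_minutes(20, 30))
```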

The process relies on reviewing the metadata output to know what new data need collecting. If you remove files or folders from the established output directories, the script will assume those data don't exist and attempt to re-collect them. Posts older than 180 days are "locked" by Reddit, so those may be safe to move (but only if they precede the 1,000 most recent posts; there is some ambiguity around this in the Reddit API documentation). The current refresh settings generally ignore posts older than 30 days, an arbitrary cutoff chosen to reduce API call overhead. Change the refresh settings and relocate output data as appropriate for your use case.
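A minimal sketch of such an age cutoff, using PRAW's created_utc post attribute; the constant and function name are illustrative, and the actual setting lives in refreddit.py:

```python
import time

THIRTY_DAYS = 30 * 24 * 60 * 60  # refresh cutoff, in seconds

def should_refresh(created_utc, now=None):
    """Return True if a post is still inside the 30-day refresh window."""
    now = time.time() if now is None else now
    return (now - created_utc) <= THIRTY_DAYS
```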

This implementation includes an optional dependency on Sentry's Raven library (the legacy Sentry Python client) so that errors can be re-broadcast to a Slack channel or another notification platform. Those dependencies are commented out in the code.
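If you re-enable that integration, legacy Raven usage looks roughly like this; the DSN is a placeholder and the loop is a stand-in:

```python
from raven import Client  # legacy Sentry client: pip install raven

# Placeholder DSN -- substitute your own Sentry project's DSN.
sentry = Client("https://examplekey@sentry.io/12345")

def run_collection_loop():
    raise RuntimeError("stand-in for the main collection loop")

try:
    run_collection_loop()
except Exception:
    sentry.captureException()  # re-broadcast the traceback to Sentry
    raise
```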
