Developing a Custom Substack Front-end

Part 1: Developing a Substack client to fetch posts and comments

Dec 10, 2022

Code: github.com/matthagy/substack_client

I’m a big Substack fan who pays for numerous subscriptions, and I particularly enjoy the thoughtful and engaging discussions in the comment sections. Yet I would like more functionality, including comment search. Hence, I’ve set about building my own custom Substack interface.

I’m currently working on cleaning up the code and releasing it as open source so that other people can also access this functionality. This is the first post in a series that will provide an overview of the components as they’re released.

Custom interface features

As a first demonstration, the custom interface lets me search my previous comments.

Searching my previous comments for the term “inflation”

You’ll see that this custom interface also lets us add tags to comments, which can be useful for future searching. It also shows the distribution of tags for a given search, which can summarize the content of, say, a specific user’s comments.

Here’s the tag distribution of the nearly two hundred thousand Slow Boring comments, although I’ve only tagged a minuscule fraction of them.

An overview of all comments from Slow Boring

You can explore some functionality in an earlier prototype that just includes my own comments at matthagy.github.io/substack_comments. (Source code)

I’ve found these searching and tagging features quite useful, and I now hope to share this functionality with other people by releasing the code as open source. This post explains how we fetch posts and comments using a Python Substack client.

Developing a Substack client

Unfortunately, Substack doesn’t yet offer an official API. Thankfully, some reverse engineering shows that the Substack front-end is powered by a RESTful JSON API that we can leverage for fetching content.

Network requests made by a Substack website as shown in Chrome Developer Tools

Two GET endpoints are of interest for fetching content.

archive - Returns the metadata for the posts of a site
comments - Returns the comments for a given post

The archive endpoint is paginated, returning 12 posts at a time, newest first, starting from a given offset. Through a series of calls with increasing offsets we can fetch the metadata for all posts. The metadata for each post includes fields such as title, description, author by-lines, and many more. Of particular interest is the field ‘id’, which uniquely identifies the post.

Here are a few of the fields of an example metadata entry.

{
  "id": 88072060,
  "publication_id": 159185,
  "title": "Friday Thread",
  "type": "thread",
  "slug": "thursday-thread-940",
  "post_date": "2022-12-09T22:00:56.679Z",
  "audience": "only_paid",
  "canonical_url": "https://www.slowboring.com/p/thursday-thread-940",
  "reactions": {
    "\u2764": 4
  },
  "description": "Sinema leaving the Democratic Party.",
  "reaction": true,
  "comment_count": 31,
  ...
}
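
The pagination described above can be sketched as follows. This is only a minimal illustration, not the code from substack_client: `fetch_page` is a hypothetical callable standing in for one call to the archive endpoint at a given offset, and the sketch assumes that an empty page signals the end of the archive. A fake in-memory archive is used so the example runs without touching the network.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 12  # page size reported above for the archive endpoint


def fetch_all_posts(fetch_page: Callable[[int], List[Dict]]) -> List[Dict]:
    """Collect post metadata by walking the archive one page at a time.

    `fetch_page(offset)` is a hypothetical callable that returns the list of
    post metadata dicts starting at `offset`, newest first.
    """
    posts: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset)
        if not page:
            break  # assumption: an empty page means we are past the oldest post
        posts.extend(page)
        offset += len(page)
    return posts


# Stand-in for the real endpoint: 30 fake posts served 12 at a time.
_fake_archive = [{"id": i, "title": f"Post {i}"} for i in range(30)]


def fake_fetch_page(offset: int) -> List[Dict]:
    return _fake_archive[offset:offset + PAGE_SIZE]


all_posts = fetch_all_posts(fake_fetch_page)
print(len(all_posts))
```

Passing the page fetcher in as a function keeps the paging logic separate from the HTTP details, which makes it easy to try out against fake data like this.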

We can then fetch the comments for a given post using the comments endpoint. The result is a list of complex and nested JSON objects representing all comments for the post. They are arranged in a hierarchy with each comment containing its children.

To show a few fields…

{
  "id": 11045666,
  "body": "The text of the comment...",
  "post_id": 88072060,
  "user_id": 3094604,
  "type": "comment",
  "date": "2022-12-10T03:27:25.952Z",
  "edited_at": null,
  "name": "Matt Hagy",
  "reactions": {
    "\u2764": 8
  },
  "reaction": "\u2764",
  "children": [{"id": ... }, ...],
  ...
}
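
Because each comment carries its replies in ‘children’, searching or counting comments usually starts by flattening the tree. Here is a small sketch of one way to do that; the `depth` field is something this example adds for illustration and is not part of the API response, and the sample data is made up.

```python
from typing import Dict, Iterator, List


def iter_comments(comments: List[Dict], depth: int = 0) -> Iterator[Dict]:
    """Yield every comment in the tree, parents before their children."""
    for comment in comments:
        flat = {k: v for k, v in comment.items() if k != "children"}
        flat["depth"] = depth  # added by this sketch, not an API field
        yield flat
        # Recurse into replies; tolerate a missing or null "children" field.
        yield from iter_comments(comment.get("children") or [], depth + 1)


# Made-up sample with the same nesting shape as the response shown above.
sample = [
    {"id": 1, "body": "Top-level", "children": [
        {"id": 2, "body": "Reply", "children": [
            {"id": 3, "body": "Nested reply", "children": []},
        ]},
    ]},
    {"id": 4, "body": "Another top-level", "children": []},
]

for c in iter_comments(sample):
    print("  " * c["depth"] + c["body"])
```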

Using these two endpoints we can fetch all comments for all posts on a given Substack site. In order to call them, we need to develop some code that mimics the requests made by a web browser. Notably, we need to include browser cookies that provide user authentication in order to access paid-only content.
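
To illustrate what mimicking a browser request involves, here is a rough sketch that attaches cookies to a GET request using only the Python standard library. The URL and the cookie name and value are placeholders, not the real Substack endpoint or session cookie, and this is not necessarily how substack_client builds its requests.

```python
from typing import Dict
from urllib.request import Request


def build_request(url: str, cookies: Dict[str, str]) -> Request:
    """Build a GET request that sends the given cookies, as a browser would."""
    # A Cookie header is a "; "-separated list of name=value pairs.
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"Cookie": cookie_header, "Accept": "application/json"})


# Placeholder URL and cookie, for illustration only.
request = build_request(
    "https://example.com/api/placeholder",
    {"session": "placeholder-value"},
)
print(request.get_header("Cookie"))
# Passing `request` to urllib.request.urlopen would perform the actual GET.
```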

An overview of the library

The functionality for calling these endpoints using browser cookies from Chrome is packaged up in the Python library substack_client. You can check out example.py for a demonstration of this code.

The factory method create includes logic for fetching the relevant browser cookies from Chrome on Linux. The created SubstackClient instance has the following methods.

get_new_posts - Calls the archive endpoint to fetch metadata for the newest posts.
get_http - Generic HTTP GET that is used to fetch the HTML contents for a post.
get_comments - Calls the comments endpoint to fetch comments for a post.

Additionally, the client.py module provides code for maintaining a local mirror of post and comment content for a Substack site. Content is stored in files so that we can efficiently access all comments without needing to make redundant requests to the site.

fetch_new_posts - Maintains a file containing post metadata and efficiently fetches new posts from the archive endpoint for addition.
fetch_post_contents - Mirrors the HTML content of posts within a directory.
CommentFetcher - Maintains a directory that contains the comments for each post. Includes methods for fetching the comments of new posts and refreshing comments for previously fetched posts.
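
To give a feel for the mirroring idea, here is a simplified sketch of keeping a local file of post metadata up to date. It is not the library’s implementation: `update_post_mirror` and `fetch_page` are hypothetical names, the file is assumed to hold a JSON list ordered newest first, and paging stops as soon as a page contains a post we already have.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def update_post_mirror(path: Path, fetch_page: Callable[[int], List[Dict]]) -> List[Dict]:
    """Add posts newer than anything in the local file; return the full list."""
    known = json.loads(path.read_text()) if path.exists() else []
    known_ids = {post["id"] for post in known}
    new_posts: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset)
        unseen = [post for post in page if post["id"] not in known_ids]
        new_posts.extend(unseen)
        # Stop at the end of the archive or once a page overlaps the mirror.
        if not page or len(unseen) < len(page):
            break
        offset += len(page)
    merged = new_posts + known  # newest first, matching the archive order
    path.write_text(json.dumps(merged))
    return merged


# Fake archive with ids 5..1 (newest first), served two posts per page.
archive = [{"id": i} for i in range(5, 0, -1)]


def fake_fetch_page(offset: int) -> List[Dict]:
    return archive[offset:offset + 2]


with tempfile.TemporaryDirectory() as tmp:
    mirror = Path(tmp) / "posts.json"
    first = update_post_mirror(mirror, fake_fetch_page)   # fetches everything
    archive.insert(0, {"id": 6})                          # a new post appears
    second = update_post_mirror(mirror, fake_fetch_page)  # fetches one page only
    print([p["id"] for p in second])
```

Stopping at the first overlapping page is what keeps repeat runs cheap: after the initial full fetch, a typical update only needs a single request.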

Now that we can fetch the posts and comments for a Substack site, we can start serving this content in our own custom front-end. That will be the subject of the next post in this series.

Matt's Blog
