Interactively exploring Reddit posts using basic Scala in your browser
This article continues our journey through Scala. In these exercises, you’ll get to analyze actual Reddit posts using snippets of basic Scala code that you write and run in your browser. You may even discover some surprising behavior of Redditors.
This is part three of our tour through Scala. If you’re just arriving and would like to start at the beginning, check out Quickly learning the basics of Scala through Structure and Interpretation of Computer Programs examples.
Today we’re going to do some novel programming exercises based around actual Reddit data. I’ve prepared a random sample of roughly ten thousand posts from the month of October 2018 for us to interactively explore by writing basic Scala in widgets within this article.
Here’s a preview of some of the Scala we’ll be writing to analyze Reddit posts.
Let’s dive right in and figure out how to access this data, run computations on it, and thereby analyze Redditors by writing and running Scala code directly in our browser.
The following code fetches the Reddit post data from a web server and shows the first few lines of the Reddit data. There are some new Scala concepts in the code that we won’t need to understand today since we’ll just be using the code to fetch the Reddit data. Instead, we’ll focus on the code we write to explore the data.
Aside: This is a generally cool feature of programming: we can commonly use functions in libraries developed by other programmers to perform computation for us, even when we don’t understand the details of the code in the library. Professional software engineers commonly use libraries in their work specifically so they don’t have to learn the details of how to solve a programming problem that someone else has already solved.
Hit “Run” on the following widget to run the code and see the results.
(Note, you can click “Edit on ScalaFiddle” on any ScalaFiddle widget in this article if you find the ScalaFiddle widget too horizontally compressed in the embedded form. In such cases, you’ll find the presentation on the ScalaFiddle website to be more readable.)
Pretty cool how easily we can access this interesting data, right? Now let’s figure out how to explore the data and compute novel results that provide interesting insights into Redditors.
We receive the data as one very long string. You can modify the Scala code in the previous ScalaFiddle widget to call .length on the string and print the number of characters. That is, we’ll modify our anonymous function at the bottom of the code to have the following form.
Run that and you’ll see that the data string consists of just under a million characters. Hence, we have some serious data to process here.
Let’s look at the sample again to see how we’ll want to process this long string.
MarioAI,nicolasrene,recreation of Mario facing left since i havent got the original picture not the real thing but he did face left once in 1-2...,1
u_seksualios,seksualios,Seksuali gundanti ištvirkėlė Alektra Blue juodais drabužiais,1
Cuphead,[deleted],Just finished Cuphead in an hour and fifteen minutes.,0
videos,lonemonk,Trump On The Traps - Calvin Dick (2017),1
CryptoCurrency,Pseudoname87,Why does binance show a different price than other sites?,1
This text data is in comma-separated value (CSV) format. Conventionally such data is shown as a table.
We can see that each line of the file corresponds to a single post. Hence, first, we’ll want to break this string up by lines to access the individual line for each post. We can split a string into substrings, based on a specific splitting character, using the string method split. To split into lines, we’ll use the newline character "\n" as the splitting character.
Lastly, we’ll convert the return value of split to a list using toList and show the number of lines. You can go back to the ScalaFiddle widget and modify our processing code to be as follows.
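As a minimal sketch of this step (not necessarily the exact code in the widget), here is the split-and-count logic applied to a short stand-in string built from three of the sample lines above. The names LineCountSketch, sample, and toLines are hypothetical and exist only for this illustration.

```scala
object LineCountSketch {
  // A tiny stand-in for the real data string: three posts, one per line.
  val sample: String =
    "Cuphead,[deleted],Just finished Cuphead in an hour and fifteen minutes.,0\n" +
    "videos,lonemonk,Trump On The Traps - Calvin Dick (2017),1\n" +
    "CryptoCurrency,Pseudoname87,Why does binance show a different price than other sites?,1"

  // Split on the newline character, then convert the resulting Array to a List.
  def toLines(data: String): List[String] = data.split("\n").toList

  def main(args: Array[String]): Unit =
    println(toLines(sample).length)
}
```

With the real data string in place of sample, the same two calls give the number of posts.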
The results show that we have roughly 11,000 Reddit posts in this data.
We still have some more work to do in preparing our data for programmatic exploration because each Reddit submission contains multiple attributes. Going back to the sample, we see that each line is a comma-separated string of values for each post. In order, the attributes are:
subreddit: Subreddit to which post was submitted
author: The user account who posted the submission
title: The title of the post
score: The voting score of the submission at the time the data was pulled
We’ll again use String.split to access the individual fields of each post, this time using "," as the splitting character.
We want a way to organize together all of the different attributes of a single post, and for that we can create a Scala class. At a high level, a class just allows us to organize together related data elements like the different attributes of a Reddit post. Here’s how we define a class for this data.
Here’s an example of how we can use the Post class. Modify the code as you like to get a feel for our Post data type.
We’ll cover classes in more detail in the future. For now, it’s good enough to know that this class defines a new data type to represent Reddit posts, with each post having four attributes: subreddit, author, title, and score.
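For reference, a class with these four attributes could look like the following sketch. Writing it as a case class is an assumption on my part (it gives us construction without new, printing, and equality for free), and the wrapper object PostClassSketch is a hypothetical name. The example post reuses one of the sample lines shown earlier.

```scala
object PostClassSketch {
  // Assumed definition: one field per CSV column, in the same order.
  case class Post(subreddit: String, author: String, title: String, score: Int)

  // Build a Post from the values of one sample line.
  val example: Post =
    Post("Cuphead", "[deleted]", "Just finished Cuphead in an hour and fifteen minutes.", 0)

  def main(args: Array[String]): Unit = {
    println(example.subreddit) // access a single attribute by name
    println(example.score + 1) // score is an Int, so arithmetic works directly
    println(example)           // case classes print all of their fields
  }
}
```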
Let’s now modify our processing code to parse our raw data into a list of posts. I.e., create a List[Post].
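One way to sketch this parsing step is shown below. It is an illustration under a few assumptions rather than the article’s exact code: Post is assumed to be a case class, parseLine and parsePosts are hypothetical helper names, and any line that does not split into exactly four fields (for example, a title that itself contains a comma) is simply skipped.

```scala
import scala.util.Try

object ParsePostsSketch {
  case class Post(subreddit: String, author: String, title: String, score: Int)

  // Parse one CSV line. Returns None when the line does not have exactly
  // four comma-separated fields or when the score is not an integer.
  def parseLine(line: String): Option[Post] =
    line.split(",") match {
      case Array(subreddit, author, title, score) =>
        Try(score.trim.toInt).toOption.map(s => Post(subreddit, author, title, s))
      case _ => None
    }

  // Split the raw data into lines and keep only the lines that parse.
  def parsePosts(data: String): List[Post] =
    data.split("\n").toList.flatMap(line => parseLine(line))

  def main(args: Array[String]): Unit = {
    val data = "Cuphead,[deleted],Just finished Cuphead in an hour and fifteen minutes.,0\n" +
      "videos,lonemonk,Trump On The Traps - Calvin Dick (2017),1"
    parsePosts(data).foreach(println)
  }
}
```

Skipping malformed lines keeps the sketch short; a real CSV parser would handle quoted fields and embedded commas properly.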
Now we have a list of posts in val posts: List[Post]. So let’s start exploring this Reddit post data!
First, let’s count the number of posts for a single subreddit. You can add the following code to our processor function in the previous ScalaFiddle widget.
Here we’re using the filter method of List to create a new list that just contains the posts that are from the subreddit “AskReddit”. In general, we can use filter with any function we write to select elements of a List and thereby create a new List that just contains those elements of interest.
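Here is a small, self-contained sketch of counting with filter. The three posts are made up for illustration only, and countFor is a hypothetical helper name; in the widget you would run the same filter call on the parsed posts list.

```scala
object FilterSketch {
  case class Post(subreddit: String, author: String, title: String, score: Int)

  // Made-up posts for illustration; the real list comes from parsing the data.
  val posts: List[Post] = List(
    Post("AskReddit", "user_a", "First question", 1),
    Post("videos", "user_b", "Some video", 3),
    Post("AskReddit", "user_c", "Second question", 2)
  )

  // Keep only the posts whose subreddit matches, then count them.
  def countFor(posts: List[Post], subreddit: String): Int =
    posts.filter(post => post.subreddit == subreddit).length

  def main(args: Array[String]): Unit =
    println(countFor(posts, "AskReddit"))
}
```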
Feel free to modify the code to count the number of posts in our sample for any subreddit you’re interested in.
Building off that exercise, let’s compute the number of posts for every single subreddit. In this computation, we’ll build a data structure that associates each subreddit with its post count. This will require us to introduce a new Scala data structure called Map.
Map[K, V] is an association from keys of type K to values of type V. In this exercise, we’ll be building a Map[String, Int] to associate each subreddit name with its count of posts. Here are some examples of how we can work with such a map.
Can you add code to create fourthMap that adds the key "new key" with a value of 12 to our val map: Map[String, Int]?
If you’re familiar with other programming languages, you may be surprised to see that Map.updated returns another new Map. In general, Scala encourages us to use immutable data structures and therefore we avoid modifying anything in place. Instead, we create new data structures to represent the results of any change. Scala does some really clever things behind the scenes to make such updates efficient in both processing time and memory usage.
Note that Map[K, V].get(key) returns an Option[V]. Option is used to account for the fact that it’s possible we don’t have a value for the given key. Option is a general data type that can take two forms, Some(value) and None. Some(value) denotes that we do have a value for the key, and we can access that value through Some.get, whereas None denotes that there was no value for the key.
Option has a useful method getOrElse(alternative). When called on Some, it returns the value contained in Some and ignores alternative, whereas None.getOrElse(alternative) returns the value of alternative. We’ll use this method in our code to replace None with zero when we fetch the count for a subreddit that we don’t yet have a count for.
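A short sketch of getOrElse in this role follows. The counts map and the helper countOf are hypothetical names used only for this illustration.

```scala
object OptionSketch {
  // Made-up counts for illustration.
  val counts: Map[String, Int] = Map("AskReddit" -> 2)

  // Look up a count, falling back to zero when the key is absent.
  def countOf(subreddit: String): Int =
    counts.get(subreddit).getOrElse(0)

  def main(args: Array[String]): Unit = {
    println(countOf("AskReddit")) // key present: the stored value
    println(countOf("videos"))    // key absent: the alternative, zero
  }
}
```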
Using Map, here is how we can compute the number of posts for each subreddit.
Note how we’re using foldLeft, a variant of fold, to process each post and increment the count for the corresponding subreddit in the map. You may recall that fold is used to aggregate all values in a list down to a single value. In this case, the final value is a Map[String, Int] that associates subreddits with their count of posts.
A quick refresher on fold functions: We call foldLeft with our initial value of aggregation and our folding function. fold calls our folding function for every element in the list. In each call, the folding function also receives the current value of the aggregation. Our function returns the updated value of the aggregate that incorporates the list element. The fold functions return the final value of aggregation after processing every element in the list using our folding function.
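Putting these pieces together, the counting fold can be sketched as follows. This is one way to write it under the assumptions already used in this article (a Post class with a subreddit field); the names SubredditCountSketch and countBySubreddit are hypothetical.

```scala
object SubredditCountSketch {
  case class Post(subreddit: String, author: String, title: String, score: Int)

  // Start from an empty map and, for each post, store the previous count
  // for its subreddit (or zero if there is none yet) plus one.
  def countBySubreddit(posts: List[Post]): Map[String, Int] =
    posts.foldLeft(Map.empty[String, Int]) { (counts, post) =>
      val current = counts.get(post.subreddit).getOrElse(0)
      counts.updated(post.subreddit, current + 1)
    }

  def main(args: Array[String]): Unit = {
    // Made-up posts for illustration only.
    val posts = List(
      Post("AskReddit", "user_a", "First question", 1),
      Post("videos", "user_b", "Some video", 3),
      Post("AskReddit", "user_c", "Second question", 2)
    )
    println(countBySubreddit(posts))
  }
}
```

Swapping post.subreddit for post.author in both places is the only change needed for the per-user exercise.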
Spend some time reviewing this code to see if you can reason through how we’re computing the number of posts for each subreddit. As an exercise, can you modify this example code to instead count the number of posts for each user?
While the current results are nice, we’re more interested in knowing the results for the top subreddits; i.e., those with the most posts in this sample of Reddit posts. To that end, we’ll need a way to order the subreddits by the number of posts so that we can select the top few subreddits to show. In computer science terminology, such a process is referred to as sorting.
You can add the following code to the preceding example to sort the subreddits by post count and then show the top 10 in this sample of Reddit posts.
There are a few new things going on here. First, we’re converting subredditCount from Map[String, Int] to List[(String, Int)] using the Map.toList method. This introduces a new concept called tuples, in that (String, Int) is the type for a length-two tuple where the first element is a string and the second element is an integer.
Tuples are a general data type in Scala that can be used to represent fixed-length collections of elements whereby each position has a fixed type. E.g., (String, String, String, Int) is a length-four tuple. We could’ve used this tuple type instead of the class Post to represent the data in a single Reddit post.
In general, classes are a more legible way to group together related elements. Tuples can be useful in some cases, particularly in cases where we want to write generic algorithms that use placeholder types. This is the case in wanting a general method to convert a Map[K, V] into a list of associated pairs, List[(K, V)].
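To make tuples a little more concrete, here is a brief sketch. The object name TupleSketch and the value names are hypothetical; the values themselves are taken from the sample line and the results shown in this article.

```scala
object TupleSketch {
  // A length-two tuple: a subreddit name paired with a count.
  val pair: (String, Int) = ("AskReddit", 254)

  // A length-four tuple holding the same data a Post would hold.
  val postTuple: (String, String, String, Int) =
    ("Cuphead", "[deleted]", "Just finished Cuphead in an hour and fifteen minutes.", 0)

  // Map.toList turns each key/value association into a (key, value) tuple.
  val pairs: List[(String, Int)] = Map("AskReddit" -> 254).toList

  def main(args: Array[String]): Unit = {
    println(pair._1)      // first element of the pair
    println(pair._2)      // second element of the pair
    println(postTuple._4) // positions are numbered from 1
    println(pairs)
  }
}
```

Compare postTuple._4 with post.score to see why a class with named fields is usually easier to read.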
Next, we’re using the method sortBy to sort our list of tuples. The method takes a function that computes a ranking score for each element of the list. The elements of the list are sorted by rank and a new list is returned by sortBy in which the elements are ordered. You can see that our ranking function just fetches the count for each subreddit by accessing the second element of the tuple, t._2.
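The sorting step can be sketched like this. sortBy orders from lowest to highest rank, so this sketch reverses the sorted list to put the largest counts first; whether the widget code reverses the list or negates the count is an assumption here, and TopSubredditsSketch and top are hypothetical names.

```scala
object TopSubredditsSketch {
  // Convert the map to a list of (subreddit, count) tuples, sort by the
  // count in ascending order, reverse so the largest comes first, and
  // keep the first n entries.
  def top(counts: Map[String, Int], n: Int): List[(String, Int)] =
    counts.toList.sortBy(t => t._2).reverse.take(n)

  def main(args: Array[String]): Unit = {
    // A few of the counts reported in this article.
    val subredditCount = Map("videos" -> 64, "AskReddit" -> 254, "The_Donald" -> 84)
    top(subredditCount, 2).foreach(println)
  }
}
```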
The results for our sample of Reddit posts are as follows.
(AskReddit,254)
(AutoNewspaper,214)
(The_Donald,84)
(CryptoCurrency,71)
(SteamTradingCards,69)
(RocketLeagueExchange,65)
(newsbotbot,65)
(videos,64)
(GlobalOffensiveTrade,59)
(PewdiepieSubmissions,58)
In thinking about the numbers, we should remember that this is a small random sample of all Reddit posts, so the counts are going to be much smaller than the full number of posts. Our sample is a 0.1% sample of all Reddit posts in October 2018, so we could multiply these numbers by 1000 to estimate the total number of posts for each subreddit for this month.
With my passing familiarity with Reddit, I’d say these results seem consistent with my intuition about popular subreddits like “AskReddit”. What do you think?
Can you modify your earlier exercise code that computes the number of posts per author so that the results are sorted? Who are the top authors in this sample of Reddit posts?
Next, let’s see if our sample includes any posts with Scala in the title. You can add the following snippet to the previous ScalaFiddle widget to answer this question.
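A keyword search of this kind can be sketched as below. Matching case-insensitively is my assumption, the two matching titles are invented purely to illustrate the heuristic, and titlesContaining is a hypothetical helper name.

```scala
object TitleSearchSketch {
  case class Post(subreddit: String, author: String, title: String, score: Int)

  // Keep the posts whose title contains the keyword, ignoring case.
  def titlesContaining(posts: List[Post], keyword: String): List[Post] =
    posts.filter(post => post.title.toLowerCase.contains(keyword.toLowerCase))

  // Invented titles, for illustration only.
  val sample: List[Post] = List(
    Post("programming", "user_a", "Learning Scala in the browser", 5),
    Post("news", "user_b", "Tensions escalating again", 2),
    Post("videos", "user_c", "Some video", 1)
  )

  def main(args: Array[String]): Unit =
    titlesContaining(sample, "Scala").foreach(post => println(post.title))
}
```

Note that a plain substring match also finds words that merely contain the keyword, which is worth keeping in mind when reading the results.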
What do you think about these posts? Is every one of them about Scala or is there a deficiency in using this heuristic to programmatically identify relevant posts? We’ll consider more sophisticated ways to analyze posts soon.
What other words are interesting to you? Modify the code as you’d like to look for other posts that have certain keywords. In many ways, we’re building a simple, custom search engine to find posts relevant to our interests across this small sample.
Note, I myself have discovered a non-trivial amount of obscene language. As an exercise, you could write some Scala code to find posts that contain swear words. I’m not including example code for this because I don’t want to have a list of curse words on my blog. :)
In general, we’d be interested in computing the frequency of different words in post titles across each subreddit. Here’s some moderately sophisticated code that accomplishes such an analysis.
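The sketch below is one possible shape for such an analysis, not necessarily the code in the widget. It assumes the Post class used throughout, splits titles into words on anything that is not an ASCII letter (so non-English titles are handled poorly), and uses the hypothetical helper names words, wordCounts, and topWords.

```scala
object WordFrequencySketch {
  case class Post(subreddit: String, author: String, title: String, score: Int)

  // Lowercase a title and split it into words on runs of non-letter characters.
  def words(title: String): List[String] =
    title.toLowerCase.split("[^a-z]+").toList.filter(word => word.nonEmpty)

  // For each subreddit, count how often each word appears across its titles.
  def wordCounts(posts: List[Post]): Map[String, Map[String, Int]] =
    posts.groupBy(post => post.subreddit).map { case (subreddit, subredditPosts) =>
      val counts = subredditPosts
        .flatMap(post => words(post.title))
        .foldLeft(Map.empty[String, Int]) { (acc, word) =>
          acc.updated(word, acc.getOrElse(word, 0) + 1)
        }
      subreddit -> counts
    }

  // The most frequent words for one subreddit, highest count first.
  def topWords(counts: Map[String, Int], n: Int): List[(String, Int)] =
    counts.toList.sortBy { case (_, count) => -count }.take(n)

  def main(args: Array[String]): Unit = {
    val posts = List(
      Post("Cuphead", "[deleted]", "Just finished Cuphead in an hour and fifteen minutes.", 0),
      Post("CryptoCurrency", "Pseudoname87", "Why does binance show a different price than other sites?", 1)
    )
    wordCounts(posts).foreach { case (subreddit, counts) =>
      println(subreddit + ": " + topWords(counts, 3))
    }
  }
}
```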
Take some time and think through this example. It’s a good review of many of the concepts we’ve considered so far in current and past articles. As a reminder, you can click “Edit on ScalaFiddle” on the widget to open the example in a separate window that doesn’t have the horizontal compression of the widget, which makes the code easier to read.
The example code includes some concepts we haven’t yet explored. If you’d like, you could explore these concepts on your own — ahead of our shared journey through Scala — using resources at scala-lang.org. In general, that website is a great place to learn about Scala concepts. And, of course, a general Google search can also turn up some useful resources, including StackOverflow questions and answers.
One thing I’d like to explain at present is how this code example uses pattern matching in two places to deconstruct data structures. In these cases, we’re accessing the elements of a tuple through pattern matching deconstruction. Here’s an isolated example of deconstructing tuples in a function.
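As a minimal sketch of the idea, with hypothetical names describe and describePlain, a function that takes a tuple can name the tuple’s elements directly by writing its body as a case clause:

```scala
object TupleMatchSketch {
  // A function value defined with a case clause: the (String, Int) argument
  // is taken apart into subreddit and count as it is received.
  val describe: ((String, Int)) => String = {
    case (subreddit, count) => subreddit + " has " + count + " posts"
  }

  // The same logic written with positional accessors, for comparison.
  val describePlain: ((String, Int)) => String =
    t => t._1 + " has " + t._2 + " posts"

  def main(args: Array[String]): Unit = {
    println(describe(("AskReddit", 254)))
    // The same case syntax works inline as an anonymous function.
    List(("videos", 64)).map { case (subreddit, count) => subreddit + ": " + count }.foreach(println)
  }
}
```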
You can see we’ve defined the function in a non-standard fashion. In general, we’d primarily use this pattern in anonymous functions, which is what we’ve done in the word frequency example.
In addition to analyzing the Scala code, what do you think of these results? Do the word frequency results seem appropriate given your knowledge of Reddit? Are there any surprising results in highly frequent words for certain subreddits?
You can modify and extend these examples however you’d like to compute anything that interests you about Reddit posts. Here are some ideas for things you might want to compute for this sample of Reddit posts.
What posts have the highest score? (Example from intro of this article)
Which Redditors have the highest average scores?
Who are the most prolific-posting Redditors in each subreddit?
What words are frequent in high scoring posts? Versus what words are frequent in low scoring posts?
Which words have a generally low frequency across all posts, but a high frequency in specific subreddits? I.e., what words are uniquely characteristic of a given subreddit? We can quantify this by taking the ratio of subredditWordFrequency/generalWordFrequency for each subreddit/word pair and looking for high ratios. (This one is a good challenge to further develop your Scala proficiency.)
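If you’d like a starting point for that last idea, the ratio itself can be sketched as follows. This is only one possible reading of the formula: frequency here means a word’s count divided by the total number of words, the inputs are plain word lists, and frequencies and ratios are hypothetical helper names. Extracting the words per subreddit is left as part of the exercise.

```scala
object DistinctiveWordsSketch {
  // Relative frequency of each word: its count divided by the total word count.
  def frequencies(words: List[String]): Map[String, Double] = {
    val total = words.length.toDouble
    words.groupBy(identity).map { case (word, occurrences) =>
      word -> occurrences.length / total
    }
  }

  // subredditWordFrequency / generalWordFrequency for every word seen in the
  // subreddit. Words missing from the general list are skipped so that we
  // never divide by a frequency we do not have.
  def ratios(subredditWords: List[String], allWords: List[String]): Map[String, Double] = {
    val general = frequencies(allWords)
    frequencies(subredditWords).collect {
      case (word, freq) if general.contains(word) => word -> freq / general(word)
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up word lists for illustration.
    val allWords = List("the", "the", "the", "cuphead")
    val subredditWords = List("the", "cuphead")
    ratios(subredditWords, allWords).foreach(println)
  }
}
```

A ratio well above 1 means the word is more common in the subreddit than across all posts.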
I hope you’re enjoying applying Scala to analyze Reddit and learn a bit about Redditors. In the future, I’ll be showing you how to perform this analysis on the full set of Reddit posts for the month of October 2018. There are 11,306,843 posts in this month so we’ll need to learn how to apply Scala using the “big data” technology Spark.
You’ll be surprised to see how the code we write in these “big data” exercises is no more complicated than the code we’ve written in today’s exercises. Processing “big data” can be just as easy as small data when we use powerful technologies like Spark and Scala.
And thank you for working through another series of Scala exercises with me. I hope this one has been particularly fun because we’re getting to learn about real-world data. I’ll do what I can to create more exercises with this style.
I’d like to thank pushshift.io for hosting Reddit data dumps. This is a really interesting source of data and it was easy for me to take a small sample from the full dataset for October 2018 for these examples and exercises. You can download this data yourself and write your own Scala code to start performing more sophisticated analyses.