Introducing scalaps: Scala-inspired data structures for Python
A functional, object-oriented approach for working with sequences and collections, similar to Spark RDDs and Java Streams. I hope you find that it simplifies your code by providing a plethora of common algorithms for working with sequences and collections.
I’ve found that working on collections of elements by applying functions through well-defined algorithms (e.g., map, filter, and reduce) greatly simplifies my code and removes many sources of errors. Therefore, I was delighted to discover that Scala pushes this to the next level by providing a plethora of built-in algorithms on data structures. These concepts share some similarities with Spark RDDs and Java Streams, but I find the Scala approach simpler and more elegant.
As I return to data analysis and machine learning with Python, I’ve found it helpful to port these concepts to Python in a new library, scalaps. You can find the code at github.com/matthagy/scalaps.
In this article, we’ll walk through a few examples of how scalaps can simplify our code. Let’s start with this basic, contrived example.
ScSeq is a wrapper around any sequence, and it provides numerous methods for operating on its input sequence. Many of these methods return another ScSeq instance.
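To make the wrapper idea concrete, here is a minimal sketch of the pattern in plain Python. The class MiniSeq and its methods are hypothetical names invented for this illustration; this is not the scalaps implementation, only a toy showing how methods that return another wrapper allow calls to be chained.

```python
from typing import Callable, Iterable, Iterator


class MiniSeq:
    """Hypothetical toy wrapper for illustration; not the scalaps ScSeq."""

    def __init__(self, source: Iterable):
        self._source = source

    def __iter__(self) -> Iterator:
        return iter(self._source)

    def map(self, func: Callable) -> "MiniSeq":
        # Wrap a lazy generator so further calls can be chained.
        return MiniSeq(func(x) for x in self._source)

    def filter(self, pred: Callable) -> "MiniSeq":
        # Also lazy: nothing is evaluated until a sink consumes it.
        return MiniSeq(x for x in self._source if pred(x))

    def to_list(self) -> list:
        # Sink: realizes the whole chain into a concrete list.
        return list(self._source)


squares_of_evens = (MiniSeq(range(10))
                    .filter(lambda x: x % 2 == 0)
                    .map(lambda x: x * x)
                    .to_list())
print(squares_of_evens)
```

Each intermediate method hands back a new wrapper, which is what lets the pipeline read top to bottom as a sequence of steps.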
Rather than analyze the contrived example, let’s walk through a more realistic example of analyzing Reddit posts using scalaps. For background, we have a sample of Reddit posts in the following CSV format.
We start by accessing the data and parsing it.
You can see the following ScSeq methods used in this example.
map(func): map a function across the current sequence to create another sequence
to_frozen_list(): create a frozen, realized list of the current sequence, as implemented in ScFrozenList
take(n): return a sequence that will have at most the first n elements of the current sequence
for_each(func): call the function on every element of the sequence in order
Here’s the equivalent conventional Python for these operations.
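As a rough sketch of what such conventional Python could look like, the snippet below parses rows, realizes them into an immutable collection, and prints the first few. The Post fields, the parse_post helper, and the inline sample data are all assumptions made for illustration; the real CSV columns may differ.

```python
import csv
import io
from itertools import islice
from typing import NamedTuple


class Post(NamedTuple):
    # Hypothetical fields; the real CSV columns may differ.
    subreddit: str
    title: str
    score: int


# Made-up sample standing in for the Reddit CSV file.
SAMPLE_CSV = """subreddit,title,score
python,Generators explained,120
scala,Why I like pattern matching,85
python,Type hints in practice,60
"""


def parse_post(row: dict) -> Post:
    # Hypothetical parser: convert one CSV row into a Post.
    return Post(row["subreddit"], row["title"], int(row["score"]))


# Roughly map + to_frozen_list: parse every row into an immutable tuple.
posts = tuple(parse_post(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Roughly take(2) + for_each(print): print at most the first two posts.
for post in islice(posts, 2):
    print(post)
```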
This is perfectly reasonable Python, and we haven’t yet seen the strength of scalaps.
Next, let’s look at counting the number of elements that match a criterion.
This introduces two more ScSeq methods.
filter(func): select elements that match a criterion
count(): count the number of elements in the sequence
Note that filter is lazy. It doesn’t evaluate to a realized collection; instead, it produces a lazily computed sequence. The same is true for other methods such as map. In fact, an entire ScSeq can be a lazy sequence when sourced from an appropriate lazy source, e.g., reading lines from a file.
In contrast, count is a sink. It realizes each element of the sequence through a chain of operations starting from the source. count is a constant-memory sink and can, therefore, consume massive lazy sequences. Other sinks include to_frozen_list, which realizes the sequence into an immutable list of type ScFrozenList.
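The difference between a lazy step and a sink can be sketched with Python's built-in lazy filter and a hand-written counting function. This is an analogy in plain Python rather than scalaps code; the count helper and the evaluated list are names invented for this example.

```python
from typing import Iterable

evaluated = []  # records which elements the predicate actually examined


def is_even(x: int) -> bool:
    evaluated.append(x)
    return x % 2 == 0


def count(items: Iterable) -> int:
    # Constant-memory sink: consumes one element at a time and keeps only a tally.
    return sum(1 for _ in items)


# Building the lazy pipeline does not call the predicate at all.
lazy_evens = filter(is_even, range(10))
nothing_ran_yet = (evaluated == [])

# The sink pulls every element through the chain, starting from the source.
n_evens = count(lazy_evens)
print(nothing_ran_yet, n_evens, len(evaluated))
```

Because the sink never stores the elements, the same approach works for sources far too large to hold in memory.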
Once an ScSeq has been realized, it cannot run again. Instead, we can reconstruct the sequence to realize it again. It can be useful to have functions that build a sequence from passed-in source(s) so that we can easily reconstruct the sequence as needed. E.g.,
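Plain Python generators behave in a similar one-shot way, so the builder-function pattern can be sketched without scalaps. The function build_long_words and the sample lines are hypothetical, chosen only to illustrate rebuilding a pipeline from its source.

```python
from typing import Iterable, Iterator


def build_long_words(lines: Iterable[str]) -> Iterator[str]:
    # Returns a fresh lazy pipeline every time it is called.
    return (word for line in lines for word in line.split() if len(word) > 4)


lines = ["scala inspired data structures", "for python code"]

pipeline = build_long_words(lines)
first_count = sum(1 for _ in pipeline)      # consumes the pipeline
exhausted_count = sum(1 for _ in pipeline)  # nothing left the second time

# Rebuilding from the source gives a new pipeline that can be realized again.
second_count = sum(1 for _ in build_long_words(lines))
print(first_count, exhausted_count, second_count)
```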
Returning to the Reddit post example, let’s compute the most popular subreddits in this sample of posts. This is accomplished with the following code.
This introduces a few new scalaps concepts. First, we’re passing a string to map. This is interpreted as “select the attribute of that name for each element”. Similarly, integers are interpreted as integer item lookups in a collection.
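One way such a shorthand could be implemented is with the standard library's attrgetter and itemgetter. The as_callable helper below is a hypothetical name for this sketch, and this is not necessarily how scalaps does it internally.

```python
from operator import attrgetter, itemgetter
from typing import NamedTuple


class Post(NamedTuple):
    # Hypothetical record type for illustration.
    subreddit: str
    title: str


def as_callable(key):
    """Hypothetical helper: turn a shorthand key into a function."""
    if callable(key):
        return key
    if isinstance(key, str):
        return attrgetter(key)   # "subreddit" -> lambda x: x.subreddit
    if isinstance(key, int):
        return itemgetter(key)   # 1 -> lambda x: x[1]
    raise TypeError(f"unsupported key: {key!r}")


posts = [Post("python", "Generators explained"), Post("scala", "Pattern matching")]
subreddits = list(map(as_callable("subreddit"), posts))

pairs = [("python", 2), ("scala", 1)]
counts = list(map(as_callable(1), pairs))
print(subreddits, counts)
```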
Next, we use the method value_counts(). This sink computes an ScDict in which each key is an element from the sequence and the value is the number of times the key occurred. ScDict is an augmented Python dictionary that includes functionality such as returning ScSeq instances for keys(), values(), and items(). In the example, we use the items() method to generate a sequence of key/value tuples.
The sequence is then sorted into an ScList using sort_by(key). Note that we’re using the integer 1 as the key so as to select the second element, the count, of each tuple. Hence, the list is now sorted by the number of posts.
reverse() is used to generate an ScSeq in reversed order so that the subreddits are ordered by descending post count. take(n) is used with n=5 to select the five most popular subreddits. Lastly, they’re printed through for_each(print).
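For comparison, here is one possible conventional Python version of the same steps, using collections.Counter in place of value_counts(). The subreddit names are made-up sample data, and the sketch takes the top two rather than the top five to keep the example small.

```python
from collections import Counter
from itertools import islice

# Made-up sample of subreddit names, one per post.
subreddits = ["python", "scala", "python", "java", "python", "scala"]

counts = Counter(subreddits)                             # like value_counts()
by_count = sorted(counts.items(), key=lambda kv: kv[1])  # like sort_by(1)
top = list(islice(reversed(by_count), 2))                # like reverse().take(2)

for name, n in top:                                      # like for_each(print)
    print(name, n)

# Counter also offers a shortcut for the sort/reverse/take steps.
shortcut = counts.most_common(2)
```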
I find this to be a more elegant description of this algorithm than the comparable Python. Do you agree? If not, in your opinion, what would the comparable Python be and why is it more elegant? Let me know in the comment section below.
Lastly, let me leave you with a more sophisticated use of scalaps that computes the frequency of title words in each subreddit.
I won’t explain the full example. Instead, see if you can reason through what the code is doing based upon the naming of the methods and the names of the functions used with them.
I will point out two interesting methods.
flat_map(func): takes a function that returns a sequence for each element in the original sequence. Each returned sequence is expanded, in order, within a returned ScSeq.
group_by(func): construct an ScDict where the keys are computed by func and each value is an ScList with all the elements that have that same key.
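The behavior of these two methods can be approximated in plain Python as follows. The flat_map and group_by functions here are stand-alone sketches written for this illustration, not the scalaps methods, and the titles are made-up sample data.

```python
from collections import defaultdict
from itertools import chain
from typing import Callable, Iterable, Iterator


def flat_map(func: Callable, items: Iterable) -> Iterator:
    # Apply func to each element and splice the resulting sequences together, in order.
    return chain.from_iterable(map(func, items))


def group_by(key_func: Callable, items: Iterable) -> dict:
    # Collect elements into lists keyed by key_func(element).
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    return dict(groups)


titles = ["lazy sequences", "lazy sinks"]

# Each title yields a list of words; flat_map flattens them into one sequence.
words = list(flat_map(str.split, titles))

# Group the words by their first letter.
by_initial = group_by(lambda word: word[0], words)
print(words)
print(by_initial)
```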
In reading through this example, what are your thoughts on the legibility of the Python code with scalaps? I personally find this approach easier to reason through relative to conventional Python. Further, I’ve used such a style in Scala, Java, and (Py)Spark so as to structure my code as applying functions to collections using built-in algorithms. I’ve come to find this approach simpler, more legible, and less error-prone relative to conventional imperative programming.
Thanks for considering scalaps! I hope it can help simplify your code, improve the readability, and eliminate errors. Let me know what you think in the comment section below.
Again, you can find the code at https://github.com/matthagy/scalaps. It is a nascent library and very much a work in progress. E.g., it needs tests, and I’ll develop them once I get some feedback on the API. PRs are also welcome.
These examples were derived from a Scala learning resource, Interactively exploring Reddit posts using basic Scala in your browser. Check it out if you’d like to learn more about the elegant and powerful programming language, Scala.