An introduction to Spark GraphFrame with examples analyzing the Wikipedia link graph
The Spark GraphFrame is an incredibly powerful tool for performing distributed computations with large graphical data. This article…
The Spark GraphFrame is an incredibly powerful tool for performing distributed computations with large graphical data. This article introduces the GraphFrame abstraction and shows how it can be leveraged to analyze the graph formed by the links between Wikipedia articles.
The Spark GraphFrame is a powerful abstraction for processing large graphs using distributed computing. It provides a plethora of common graph algorithms including label propagation and PageRank. Further, it provides the foundations for implementing complex graph algorithms, including a robust implementation of the Pregel paradigm for graph processing. Anyone who’s interested in working with large graphs should learn how to apply this powerful tool.
In this article, I’ll introduce you to the basic of GraphFrame and demonstrate how to use this tool through several examples. These examples consider the link graph between Wikipedia articles and I demonstrate how to analyze this graph by leveraging the GraphFrame abstraction.
Note that the examples in this post build off some more elementary Spark concepts such a DataFrames. Additionally, it uses basic Scala code to demonstrate algorithms. Only small and simple examples are shown so that one doesn’t need to be well-familiar with these concepts to learn about the power of the GraphFrame abstraction.
Let’s dive right in and consider how to create a GraphFrame. I’ll start by introducing the Wikipedia link data in a basic Spark RDD. Each element of this RDD is a single Wikipedia article page represented with the following Scala class.
Note that each link is the title of another page. I.e., each page knows all the other pages that it links to by title.
We begin with a single RDD pages
that contains 19,093,693 Wikipedia article pages. From that, we generate two Spark DataFrames, one consisting of the vertices (i.e., page nodes) and the other consisting of the directed edges.
Note that there are 206,181,091 directed edges in this graph.
Next, we create a GraphFrame using these two DataFrames.
And that’s all we have to do to access this powerful abstraction. We can now start using some of the builtin graph algorithms to analyze the Wikipedia link graph.
Let start by computing something simple: the Wikipedia pages with the largest number of outbound links. To this end, we can use the GraphFrame method outDegrees
, which is a computed DataFrame that corresponds to the number of outbound edges for each vertex. Since it’s a DataFrame, we can use the orderBy
method and limit
to select the top 10.
This gives the following results.
Interestingly, we can see that many special “Wikipedia:”-prefix pages have the highest number of outbound links.
Next, let’s consider the number of inbound edges for each page and find the top 10 linked-to pages using the corresponding GraphFrame method, inDegrees
.
We can see that locations, including countries and cities, are among the most heavily linked to pages.
Now let’s explore some more complex graph computations. Let’s consider the most heavily-linked to article, “United States”, and find the other articles in the link graph that are furthest away from this article in terms of the number of links you have to follow to arrive at “United States”.
To this end, we can use the GraphFrame method, shortestPaths
, which takes a collection of landmark vertices and returns the path length for every vertex in the graph to every landmark vertex.
It is interesting that there are only three articles that are a full 10 links removed from the “United States” article in terms of shortest path length.
Let’s go one step further and compute the number of pages at each path length. I.e., how many Wikipedia articles are 1 link removed from “United States”, how many are 2 links removed, etc.
(Note, that I’m using the built-in display
function of the Databricks notebook. I’ll talk more about the joy of using Databricks in a future post.)
Here we can see that there are millions of pages that are between 2 and 4 links removed from the “United States” article. There are also a vanishingly small number of articles that are six or more links removed. Further, the -1 path length denotes articles that aren’t at all connected to the “United States” article at all and there are roughly a million pages that meet this criterion.
Lastly, no demonstration of GraphFrame would be complete without showing how easily it can be used to perform the PageRank algorithm.
The page rank of the top 10 articles is:
It may not be surprising that “United States” has the highest vertex rank as we know it’s the most heavily linked-to article.
I hope these examples have helped to convince you that GraphFrame is a powerful abstraction for performing distributed computation on large graphs. There are additional more advanced concepts that I hope to share in future articles, including how to implement custom graph algorithms using the Pregel compute paradigm with GraphFrame.
Update: If you’ve found the Scala examples in this article interesting and would like to learn more about this powerful programming language you can check out my article, Quickly learning the basics of Scala through Structure and Interpretation of Computer Programs examples (Part 1).