## Connectivity in Directed Graphs

Since connectivity in undirected graphs is so boring, let's consider directed graphs instead. The important difference is that paths are no longer reversible, so the nice picture of “connected components” above no longer applies.

More formally, if we let $a \rightsquigarrow b$ mean “there is a path from $a$ to $b$”, then $\rightsquigarrow$ is still reflexive and transitive, but no longer necessarily symmetric. Thus, it no longer partitions the nodes into well-defined “equivalence classes”.

But we can fix this: If we let $a \sim b$ mean “$a \rightsquigarrow b$ and $b \rightsquigarrow a$”, i.e., “there is a path from $a$ to $b$ and also a path from $b$ to $a$”, then it's easy to verify that $\sim$ is reflexive, symmetric and transitive, hence it's an equivalence relation!

If $a \sim b$, then we say $a$ and $b$ are strongly connected. A strongly connected component, or SCC, is a maximal set of nodes that is pairwise strongly connected. In other words, the SCCs are the equivalence classes under $\sim$. SCCs are the best analogue of connected components in undirected graphs.

The following offers a picture of the strongly connected components of an example directed graph:

Note that this time, there could be edges from one SCC to another. This is more exciting than before!

## Cycles and DAGs

A cycle is a nontrivial path from $a$ back to itself. We say a graph is acyclic if it has no cycles. A directed graph that is acyclic is called, well, a directed acyclic graph, or DAG.

Here's an example of a DAG:

One nice thing about DAGs is that you can serialize the nodes, i.e., find a total order of the nodes such that every edge points from a node to a later node in the order. This is called a topological sort, or toposort, of a graph. You probably already learned how to compute a topological sort in linear time.
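If you need a refresher, here's a minimal sketch in Python of one standard linear-time method, Kahn's algorithm (repeatedly peel off nodes with no incoming edges); the function name and the adjacency-list input format are just illustrative choices:

```python
from collections import deque

def toposort(n, adj):
    # Kahn's algorithm: repeatedly remove nodes with in-degree 0.
    indeg = [0] * n
    for i in range(n):
        for j in adj[i]:
            indeg[j] += 1
    queue = deque(i for i in range(n) if indeg[i] == 0)
    order = []
    while queue:
        i = queue.popleft()
        order.append(i)
        for j in adj[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                queue.append(j)
    # if the graph has a cycle, fewer than n nodes come out
    return order if len(order) == n else None
```

Each node and each edge is processed a constant number of times, so this runs in $O(n + e)$ time.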

Now, let's go back to our previous example:

suppose we “shrink” each SCC into a single node, and for every edge $a \rightarrow b$ connecting two nodes from different SCCs, we add an edge $\mathrm{SCC}_a \rightarrow \mathrm{SCC}_b$, where $\mathrm{SCC}_x$ is the node of the SCC containing $x$. We then obtain a new, smaller graph. My question is: can there be a cycle in a graph formed this way?

It turns out that there can't be any cycle in such a graph! The simple reason is that if there were a cycle, then the SCCs in that cycle could be compressed further into a single SCC, contradicting the fact that they're maximal. Thus, a cycle cannot exist.

Hence, if we shrink the SCCs into a single node, then we get a DAG, which we'll just call the DAG of SCCs. (You'll also see this called the condensation of the graph. Since $\sim$ is an equivalence relation, you may even see it called as “the graph modulo $\sim$”.)

Note that this is already more interesting and has more structure than in the undirected case, where shrinking the connected components results in just a graph with no edges!

## Depth-first search revisited (directed version)

Let's discuss how DFS works in the case of directed graphs. It turns out that the DFS forest in a directed graph provides a similar set of useful information as in the undirected case.

For instance, let's look at what the “DFS forest” might look like in a directed graph:

Note that there are still black and blue edges, representing tree and back edges. However, it looks like there are two new types of edges! It seems that the DFS on a directed graph classifies the edges into one of four types this time:

1. The black edges are the tree edges, which are the edges genuinely traversed by the DFS, i.e., $i \rightarrow j$ is a tree edge if the first time $j$ is visited is through $i$.
2. The blue edges are the back edges, which are edges that point to an ancestor of a node in the DFS forest.
3. The green edges are the forward edges, which are edges that point to a descendant of a node in the DFS forest.
4. The red edges are the cross edges, which are edges that point to neither an ancestor nor a descendant of a node in the DFS forest.

It's sometimes convenient to consider a tree edge as a type of forward edge as well, although there are forward edges that are not tree edges (unlike in the undirected case).

Now, let's look at how a DFS procedure could identify these edges:

```
function DFS(i):
    // perform a DFS starting at node i,
    // classifying each edge along the way
    start_time[i] = time++

    for j in adj[i]:
        if start_time[j] == -1:
            mark (i, j) as a tree edge
            DFS(j)
        else if finish_time[j] == -1:
            mark (i, j) as a back edge
        else if finish_time[j] > start_time[i]:
            mark (i, j) as a forward edge
        else:
            mark (i, j) as a cross edge

    finish_time[i] = time++

function DFS_all():
    time = 0
    for i = 0..n-1:
        start_time[i] = -1
        finish_time[i] = -1

    for i = 0..n-1:
        if start_time[i] == -1:
            DFS(i)
```


Notice how we made use of the values ${\mathrm{start\_time}}[i]$ and ${\mathrm{finish\_time}}[i]$ to distinguish between back, forward and cross edges. In particular, assume start_time[j] != -1. This means that we have already started visiting node $j$. Thus, when scanning the neighbors $j$ of $i$ above,

• If finish_time[j] == -1, then $j$ is being visited while we're on $i$, which means $j$ must be an ancestor of $i$ in the DFS forest.
• If finish_time[j] != -1 and finish_time[j] > start_time[i], then $j$'s visitation is already finished, but only after $i$'s visitation began, so it means $j$ must be a descendant of $i$.
• If finish_time[j] != -1 and finish_time[j] < start_time[i], then $j$'s visitation finished even before we started visiting $i$, hence $j$ is neither an ancestor nor a descendant of $i$.

The running time is still $O(n + e)$, but along the way, we've again obtained useful information about our directed graph!

As with the undirected case, note that the implementation above is written recursively. On large graphs, stack overflow might be a concern, especially if the stack size is limited. In those cases, you might want to implement an iterative version.
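To make the classification concrete, here's a minimal runnable sketch of the same procedure in Python (the `classify_edges` name and the adjacency-list input are illustrative; the recursion-limit bump is there because of the deep recursion just mentioned):

```python
import sys

def classify_edges(n, adj):
    # Runs a DFS over the whole graph and labels each edge (i, j)
    # as one of 'tree', 'back', 'forward' or 'cross'.
    sys.setrecursionlimit(10**6)
    start = [-1] * n
    finish = [-1] * n
    label = {}
    time = 0

    def dfs(i):
        nonlocal time
        start[i] = time; time += 1
        for j in adj[i]:
            if start[j] == -1:
                label[(i, j)] = 'tree'
                dfs(j)
            elif finish[j] == -1:
                label[(i, j)] = 'back'     # j is an ancestor, still open
            elif finish[j] > start[i]:
                label[(i, j)] = 'forward'  # j is a finished descendant
            else:
                label[(i, j)] = 'cross'    # j finished before i started
        finish[i] = time; time += 1

    for i in range(n):
        if start[i] == -1:
            dfs(i)
    return label
```

For example, on the graph with edges $0 \rightarrow 1$, $1 \rightarrow 2$, $2 \rightarrow 0$, $0 \rightarrow 2$, $3 \rightarrow 1$, this labels $2 \rightarrow 0$ a back edge, $0 \rightarrow 2$ a forward edge, and $3 \rightarrow 1$ a cross edge.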

## Computing strongly connected components

So far we've discussed what SCCs are, along with some of their properties. But we haven't explained how to compute them yet. Unlike in the undirected case, a naïve search won't work here. (Why?)

We will discuss two algorithms. It's instructive to learn both, and then possibly choose your preferred algorithm later on.

Note that these sections go into quite some detail in proving the correctness of the algorithms, so if you're not comfortable with proofs yet, you might want to skip them on a first reading and just learn how the algorithms work.

### Kosaraju's algorithm

Here, we describe Kosaraju's algorithm which computes the strongly connected components of a directed graph.

Here are the steps of Kosaraju's algorithm:

1. Mark all vertices as not visited.
2. Perform a DFS on the whole graph (in an arbitrary order). Take note of the finishing times of the nodes.
3. Reverse the directions of the edges.
4. Again, mark all vertices as not visited.
5. Perform another DFS traversal, this time in decreasing order of finishing time (as calculated earlier). Every time we start a new top-level DFS traversal DFS(i), all the nodes visited in that run constitute a strongly connected component.

Since we already know how to perform a DFS, this is easy to implement! Also, this runs in $O(n + e)$ time. You might wonder why not $O(n \log n + e)$ since we need to sort the nodes in decreasing order of finishing time, but there's actually no need to do that, since we can simply push every node that we just finished visiting onto a stack. Then, in the next phase, we simply pop from the stack to determine the order. This works because the nodes that are finished last will be the ones popped first from the stack! In fact, using this stack, we don't really even need to store the finishing times.

Here's pseudocode for the algorithm:

```
function DFS(i, adj, group):
    // perform a DFS starting at node i,
    // then push all visited nodes to the vector 'group'
    visited[i] = true
    for j in adj[i]:
        if not visited[j]:
            DFS(j, adj, group)

    group.push(i)

function SCC_Kosaraju():
    // mark all nodes as unvisited
    for i = 0..n-1:
        visited[i] = false

    // first pass: collect the nodes in increasing order of finishing time
    stack = new empty vector
    for i = 0..n-1:
        if not visited[i]:
            DFS(i, adj, stack)

    // create the reversal graph
    jda = new array of n empty vectors
    for i = 0..n-1:
        for j in adj[i]:
            jda[j].push(i)

    // reinitialize visited
    for i = 0..n-1:
        visited[i] = false

    // do DFS again on jda, in decreasing order of finishing time
    sccs = new empty vector of vectors
    while not stack.empty():
        i = stack.pop()
        if not visited[i]:
            scc = new empty vector
            DFS(i, jda, scc)
            sccs.push(scc)
    return sccs
```

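For concreteness, here's a minimal runnable translation of this into Python (names like `scc_kosaraju` and `jda` for the reversal graph are just illustrative; the recursion-limit bump guards against deep DFS recursion):

```python
import sys

def scc_kosaraju(n, adj):
    # Returns the SCCs of the graph as a list of lists of nodes.
    sys.setrecursionlimit(10**6)
    visited = [False] * n

    def dfs(i, graph, group):
        visited[i] = True
        for j in graph[i]:
            if not visited[j]:
                dfs(j, graph, group)
        group.append(i)   # appended in order of finishing time

    # phase 1: DFS on the original graph, collecting the finish order
    stack = []
    for i in range(n):
        if not visited[i]:
            dfs(i, adj, stack)

    # build the reversal graph
    jda = [[] for _ in range(n)]
    for i in range(n):
        for j in adj[i]:
            jda[j].append(i)

    # phase 2: DFS on the reversal graph in decreasing finish time
    visited = [False] * n
    sccs = []
    while stack:
        i = stack.pop()
        if not visited[i]:
            scc = []
            dfs(i, jda, scc)
            sccs.append(scc)
    return sccs
```

For example, on the graph with edges $0 \rightarrow 1$, $1 \rightarrow 2$, $2 \rightarrow 0$, $2 \rightarrow 3$, $3 \rightarrow 4$, $4 \rightarrow 5$, $5 \rightarrow 4$, this returns the three SCCs $\{0,1,2\}$, $\{3\}$ and $\{4,5\}$.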

Now, why does it work? We'll provide a proof here, but you may want to skip it first and spend some time thinking about it yourself.

The following is a rough proof of why this algorithm is correct. It relies on the fact that the SCCs of a graph are the same as the SCCs of its reversal graph (this can be proved very easily). Let us denote by $G^R$ the graph $G$ with all its edges reversed. ($G^R$ is also called the transpose of $G$.)

Claim. Kosaraju's algorithm correctly computes the strongly connected components of a graph $G$.

Proof. In order to prove that the algorithm is correct, we only need to ensure that in phase two, whenever we start a top-level DFS in $G^R$, we do it in such an order that all the reachable nodes that belong to a different SCC have already been visited. This is sufficient for the following reason. If it is true, then the first time we visit a node in some SCC $B$ is when we actually start the DFS on a node in $B$, not on a node in a different SCC that can reach $B$ (otherwise, this would contradict the statement). Once we start a DFS in $B$, all nodes in $B$ will be visited by this DFS (because they are reachable from each other), and only those in $B$ will be visited, because all the nodes in the other SCCs reachable from $B$ have already been visited (again according to the statement). Therefore, whenever we start a new DFS, we visit exactly the nodes of one SCC.

Now, consider two distinct SCCs of $G$, say $A$ and $B$, and suppose there is a path from some node in $A$ to some node in $B$. Since $A$ and $B$ are SCCs, it follows that there is no path from any node in $B$ to any node in $A$. Now, during the first phase, where we are performing the DFS in an arbitrary order, there are two cases:

1. A node in $A$, say $a$, is discovered before any node in $B$. In this case, the DFS will be able to traverse the path from $A$ to $B$ and visit all nodes in $B$ before $a$ itself finishes expanding. Therefore, the finishing time of $a$ is greater than the finishing time of any node in $B$.
2. A node in $B$, say $b$, is discovered before any node in $A$. In this case, the DFS will finish visiting all nodes in $B$ before it ever reaches any node in $A$, because there is no path from $B$ to $A$. Therefore, all nodes in $A$ have a greater finishing time than all nodes in $B$.

The two cases are illustrated below:

What this shows is that regardless of the order we visit the nodes for the DFS, as long as there exists a path from component $A$ to $B$, there will always be a node in $A$ that has a greater finishing time than all nodes in $B$. More generally, if $A_1, A_2, \ldots, A_k$ are all the SCCs that can reach $B$, then each of those components contains a node with a greater finishing time than all nodes in $B$.

In the reversal graph $G^R$, $A_1 \ldots A_k$ are precisely the SCCs that $B$ can reach, and therefore none of the $A_i$ can reach $B$. And since in the second phase we are performing the DFS in decreasing order of finishing times, it follows that we will have done a DFS on each $A_i$ before we start one on $B$, and thus, all nodes in $A_1 \ldots A_k$ will have been visited before any node in $B$. Therefore, once we start visiting a node in $B$, all the nodes reachable from it that belong to a different SCC have already been visited. This is exactly what we wanted to prove.

As a side note, the main idea in this proof can be repurposed to prove that we can get a topological sort of the DAG by ordering the nodes by decreasing finishing time:

Claim. A topological sort of a DAG is obtained by performing a DFS on it and ordering the nodes in decreasing finishing time.

Proof. Note that the main idea in the previous proof is to show that for two SCCs $A$ and $B$, if there is a path from some node in $A$ to some node in $B$, then there is a node in $A$ with a greater finishing time than all nodes in $B$.

But the SCCs of a DAG consist of single nodes, thus $A$ has exactly one element, say $a$, and $B$ has exactly one element, say $b$, so it simply says that if there is a path from $a$ to $b$, then $a$ has a greater finishing time than $b$. In particular, paths of length $1$, i.e., single edges, point from a node to another node with a lower finishing time, hence ordering the nodes in decreasing finishing time results in a valid topological sort!

As in Kosaraju's algorithm, you can construct the toposort without computing the finishing times explicitly by pushing the just-finished nodes onto a stack, and then reversing the stack in the end.
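In code, the stack trick looks something like the following (a Python sketch; the input is assumed to be a DAG given as an adjacency list, and the function name is illustrative):

```python
def toposort_dfs(n, adj):
    # Reversed DFS postorder of a DAG is a valid topological sort.
    visited = [False] * n
    order = []

    def dfs(i):
        visited[i] = True
        for j in adj[i]:
            if not visited[j]:
                dfs(j)
        order.append(i)   # i just finished: "push it onto the stack"

    for i in range(n):
        if not visited[i]:
            dfs(i)
    order.reverse()       # decreasing finishing time
    return order
```

Note that no finishing times are stored anywhere; the append order plays that role.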

### Tarjan's SCC algorithm

Here, we describe another algorithm called Tarjan's SCC algorithm. Tarjan's SCC algorithm also uses a DFS, but unlike Kosaraju's algorithm, it needs only one DFS traversal. It works by augmenting the DFS procedure with additional bookkeeping data that's enough to identify the SCCs.

This algorithm uses the ${\mathrm{disc}}$ and ${\mathrm{low}}$ arrays, just like in our algorithm for bridges and articulation points!

Here's the pseudocode:

```
function DFS(i):
    disc[i] = low[i] = time++

    stack.push(i)
    instack[i] = true
    for j in adj[i]:
        if disc[j] == -1:
            DFS(j)
            low[i] = min(low[i], low[j])
        else if instack[j]:
            low[i] = min(low[i], disc[j])

    if low[i] == disc[i]:
        get_scc(i)

function get_scc(i):
    // pop the stack until you pop i, and collect those nodes as an SCC
    scc = new empty vector
    do:
        j = stack.pop()
        instack[j] = false
        scc.push(j)
    while j != i
    sccs.push(scc)

function SCC_Tarjan():
    stack = new empty vector
    sccs = new empty vector of vectors
    time = 0
    for i = 0..n-1:
        disc[i] = -1
        instack[i] = false

    for i = 0..n-1:
        if disc[i] == -1:
            DFS(i)
    return sccs
```

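Here's a minimal runnable Python sketch of the same algorithm (illustrative names; a closure stands in for the global arrays of the pseudocode):

```python
import sys

def scc_tarjan(n, adj):
    # Returns the SCCs as a list of lists of nodes, in one DFS pass.
    sys.setrecursionlimit(10**6)
    disc = [-1] * n
    low = [0] * n
    instack = [False] * n
    stack = []
    sccs = []
    time = 0

    def dfs(i):
        nonlocal time
        disc[i] = low[i] = time; time += 1
        stack.append(i)
        instack[i] = True
        for j in adj[i]:
            if disc[j] == -1:
                dfs(j)
                low[i] = min(low[i], low[j])
            elif instack[j]:
                low[i] = min(low[i], disc[j])
        if low[i] == disc[i]:          # i is the head of its SCC
            scc = []
            while True:                # pop the stack until i comes off
                j = stack.pop()
                instack[j] = False
                scc.append(j)
                if j == i:
                    break
            sccs.append(scc)

    for i in range(n):
        if disc[i] == -1:
            dfs(i)
    return sccs
```

On the same example graph as before (edges $0 \rightarrow 1$, $1 \rightarrow 2$, $2 \rightarrow 0$, $2 \rightarrow 3$, $3 \rightarrow 4$, $4 \rightarrow 5$, $5 \rightarrow 4$), this again finds the SCCs $\{0,1,2\}$, $\{3\}$ and $\{4,5\}$.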

Let's try to explain how this works. First, we describe the following property of the DFS forest.

Claim 2. The nodes of any SCC form a rooted subtree in the DFS forest. (Note that “subtree” means something slightly different here. This doesn't mean that all nodes down to the leaves are part of the SCC. It means that if you consider only the nodes of some SCC and ignore the rest (including possibly some nodes below them), then you get a rooted tree.)

A simple proof sketch is as follows. Here, we define the head of an SCC as the node that's discovered first among all nodes in the SCC:

Proof. Consider two nodes $a$ and $b$ from a single SCC such that $a$ is an ancestor of $b$ in the DFS forest. Then every node on the tree path between them must belong to the same SCC. This is because for every node $c$ on the path, $a$ reaches $c$, $c$ reaches $b$, and $b$ reaches $a$ (since $a$ and $b$ are in the same SCC), so $c$ must also belong to the same SCC as $a$ and $b$.

Next, let $h$ be the head of an SCC. Then every other node in the SCC is a descendant of $h$ in the forest, because they are all reachable from $h$, and $h$ is visited earliest among them. Combining this with the above, it follows that the SCC forms a rooted tree in the forest, with $h$ as the root.

Note that this also proves that the head is the root of that tree, and that the head is also the last to finish expanding among all nodes in an SCC.

Let's now describe ${\mathrm{disc}}[i]$ and ${\mathrm{low}}[i]$. They are defined similarly as before:

• Let ${\mathrm{disc}}[i]$ be the discovery time of $i$.
• Let ${\mathrm{low}}[i]$ be the lowest discovery time of any ancestor of $i$ that is reachable from any descendant of $i$ with a single back edge. If there are no such back edges, we say ${\mathrm{low}}[i] = {\mathrm{disc}}[i]$.

It's worth mentioning that ${\mathrm{low}}[i]$ ignores forward edges and cross edges to already-popped nodes. (The pseudocode above does consider a cross edge to a node $j$ that is still on the stack; such a $j$ turns out to belong to the same SCC as $i$, so this doesn't affect whether ${\mathrm{low}}[i] = {\mathrm{disc}}[i]$ holds.)

Using the ever-useful ${\mathrm{low}}[i]$ and ${\mathrm{disc}}[i]$, we can identify whether a node is a head or not.

Claim. A node $i$ is a head of some SCC if and only if ${\mathrm{low}}[i] = {\mathrm{disc}}[i]$.

Here's a rough proof:

Proof. By the definition of ${\mathrm{low}}$, we find that ${\mathrm{low}}[i] \le {\mathrm{disc}}[i]$.

Thus, the claim is equivalent to saying that a node $i$ is not a head of some SCC if and only if ${\mathrm{low}}[i] < {\mathrm{disc}}[i]$.

Now, note that ${\mathrm{low}}[i] < {\mathrm{disc}}[i]$ happens if and only if there is a node $j$ reachable from $i$ that is an ancestor of $i$ in the tree (using only tree edges and back edges). But for such a $j$, $i$ reaches $j$ and $j$ reaches $i$, so $j$ is another member of the SCC containing $i$ that has been discovered earlier. Therefore, $i$ is not a head if and only if such a $j$ exists, if and only if ${\mathrm{low}}[i] < {\mathrm{disc}}[i]$, which is what we wanted to prove.

Now, we're calculating ${\mathrm{low}}[i]$ and ${\mathrm{disc}}[i]$ on the fly as we perform the DFS traversal, and since the head is the last to finish expanding, we can collect all the members of the SCC containing that head right at that moment. Conveniently enough, it turns out that the members of this SCC are all on top of the stack!

To see this, note that whenever we visit a node, we simply push it onto the stack. But when a node finishes expanding, we don't necessarily pop it from the stack. We only pop the stack whenever we finish expanding a head $h$, and we keep popping until $h$ is popped.

Now, due to Claim 2 above, and the fact that we only pop when we finish expanding a head, we are guaranteed that for every two nodes $i$ and $j$ in the stack belonging to a single SCC, all nodes between them in the stack also belong to the same SCC, therefore all nodes in a single SCC in the stack are found in contiguous locations. Also, when we finish expanding a head $h$, all other nodes in the SCC of $h$ are still in the stack. Therefore, whenever we pop from the stack, all the nodes popped belong to a single SCC!

After the traversal, we would have computed all the SCCs of the graph.

Clearly, the time complexity is $O(n + e)$. However, although the time complexity is the same as Kosaraju's algorithm, Tarjan's algorithm can still be seen as somewhat of an improvement over Kosaraju's algorithm in a few ways:

• Only one DFS is required.
• We don't have to build the reversal graph (which could be memory-intensive).
• The correctness of the algorithm is more easily seen (at least to some; honestly, it took me quite some time to understand it myself).
• It's usually faster in practice.

### Which algorithm to use?

Now that you know both algorithms, which one should you use? Well, it's really up to you. For me, the choice is essentially between an easy-to-implement solution and a slightly faster one. I usually choose Kosaraju's algorithm since it's easier to understand (and remember), although I know a lot of others who prefer Tarjan's. In fact, it seems I'm in the minority. So it's up to you if you want to go mainstream or hipster.

## DAG of SCCs

Finally, it's worth mentioning how we can construct the DAG of SCCs. Once we can compute the SCCs using either of the algorithms above, constructing the DAG is straightforward. At a high level, the steps are:

1. Compute the SCCs of the graph.
2. “Shrink” the SCCs into single nodes.
3. Remove the self-loops and duplicate edges.
4. The resulting graph is the DAG of SCCs.

The “shrinking” part might sound vague, but a simple “engineer” approach can be used. Let's restate the steps above, expanding that part:

1. Compute the SCCs of the graph. Suppose there are $k$ SCCs.
2. Compute the array $f$, where $f[i]$ represents the index of the SCC containing $i$.
3. Construct a new graph with $k$ nodes and initially $0$ edges.
4. For every edge $a \rightarrow b$ in the original graph, if $f[a] \not= f[b]$, then add the edge $f[a] \rightarrow f[b]$ in the new graph (if it doesn't exist already).
5. The new graph is now the DAG of SCCs.
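These expanded steps can be sketched in Python as follows (the `condensation` name is illustrative; `sccs` is assumed to come from any SCC algorithm, and sets are used to drop duplicate edges):

```python
def condensation(n, adj, sccs):
    # Builds the DAG of SCCs from the list of SCCs of the graph.
    k = len(sccs)
    f = [0] * n                      # f[i] = index of the SCC containing i
    for idx, scc in enumerate(sccs):
        for i in scc:
            f[i] = idx
    dag = [set() for _ in range(k)]
    for a in range(n):
        for b in adj[a]:
            if f[a] != f[b]:         # skip self-loops in the new graph
                dag[f[a]].add(f[b])  # the set drops duplicate edges
    return f, [sorted(s) for s in dag]
```

With hash sets, the duplicate check is expected constant time per edge, so the whole construction keeps the $O(n + e)$ bound mentioned below.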

Congratulations! You may now use the DAG of SCCs (if you need it).

Note that this still runs in $O(n + e)$ time.