The personal site of Remco Vermeulen

CodeQL Path Graphs

Those working with GitHub Code Scanning undoubtedly have encountered the show paths functionality available on some alerts produced by CodeQL. The show paths functionality provides developers with a visualization of the data flow showing the path from the source of untrusted data to the sink – the location of the alert – to help a developer understand how the security issue materializes. The path helps with identifying the location where a validation or contextual encoding step is missing. These visualizations are made possible by path queries and path graphs.

CodeQL path queries are queries that can be constructed from regular problem queries by changing its metadata – by providing a query predicate – and by changing the select clause’s structure to adhere to a predefined format. The CodeQL documentation provides the following boilerplate query that includes the necessary ingredients for creating a path query with a data flow path graph.

/**
* @kind path-problem
*/

import <language>
// For some languages you need to import the Dataflow module
// import semmle.code.<lang>.dataflow.Dataflow
import DataFlow::PathGraph

from SomeConfiguration config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "Some alert message"
*/

Looking at this template we can see that:

While all these changes are required, the magic of constructing the path graph is happening in the module DataFlow::PathGraph.

Looking at the Java implementation of the DataFlow::PathGraph reveals three predicates: edges, nodes, and subpaths annotated with the annotation query. According to the QL language specification the queries in a QL module, our query file defines an implicit query module, are the select clause or predicates annotated with query that are in scope of our QL module. With queries, we refer to the output of a QL program. To add to the confusion, it is common to refer to a QL program as a query. Long story short, the output of our query can be defined by a select clause or a predicate annotated with query that is in scope of our query module.

Of the three query predicates defined in DataFlow::PathGraph, the edges query predicate is the only required predicate to construct a path graph. The other predicates provide means to tweak the result of the path graph and are not further considered in this post.

In the next snippet you can find the implementation of the edges predicate as found in the DataFlow::PathGraph module.

/** Holds if `(a,b)` is an edge in the graph of data flow path explanations. */
query predicate edges(PathNode a, PathNode b) { a.getASuccessor() = b }

While this feels anticlimactic, these are exactly the tuples we want for a data flow path graph. Each node is part of the computed data flow graph, and we want to find all the edges between those nodes, so we can construct a path from the source to the sink by using these edges.

So now we have seen all the ingredients for a path query: the metadata, the edges query predicate, and the select clause format. Now it is time to create our own path query with a custom path graph.

For multiple reasons it is interesting to know what functions are reachable, through calls, from some given function. To determine the answer we need to compute a call graph. The following predicate is a basic implementation, for the C/C++ language, that looks for calls that are part of the caller and that call the callee.

predicate calls(Function from_, Function to) {
    exists(Call call | call.getEnclosingFunction() = from_ |
        call.getTarget() = to
    )
}

By applying the transitive closure we can construct the call graph as follows:

predicate calls(Function from_, Function to) {
    exists(Call call | call.getEnclosingFunction() = from_ |
        call.getTarget() = to
    )
}

from Function source, Function reachable
where calls+(source, reachable)
select source, reachable

Viewing the raw results or rewriting the query to a problem query gets us the results we are looking for, but we have to manually connect the dots between rows or alerts.

/**
* @kind problem
*/

predicate calls(Function from_, Function to) {
    exists(Call call | call.getEnclosingFunction() = from_ |
        call.getTarget() = to
    )
}

from Function source, Function reachable
where calls+(source, reachable)
select reachable, "This function is reachable from $@", source, source.getName()

To resolve that we turn this query into a path query by following the steps based on the requirements we identified earlier.

/**
* @kind path-problem
*/
import cpp

predicate calls(Function from_, Function to) {
    exists(Call call | call.getEnclosingFunction() = from_ |
        call.getTarget() = to)
}

query predicate edges(Function caller, Function callee) {
    calls(caller, callee)
}

from Function source, Function reachable
where calls+(source, reachable)
select reachable, source, reachable, "This function is reachable from $@", source, source.getName()

The results of this query will now show a path from the source to each reachable function in the CodeQL Results View part of the Visual Studio Code CodeQL extension. However, there is still room for improvement. I would like to know how each function reaches another function. In other words, I would like to include the function call expressions in our path. To achieve this we need to create our own path graph node that can represent both a function and a function call expression.

The way to represent both a function and a function call is with an algebraic datatype. The short version of algebraic data types is that it allows use to create a type that represents new values that are neither primitive values nor types that are database entities. These values are created by taking the sum of all the values produced by the algebraic datatype’s branches. Let’s dive right into the code to see how we can define a new type to represent our path graph node.

module CallGraph {
    newtype TPathNode =
        TFunction(Function f) or
        TCall(Call c)
}

The TPathNode type represents all the functions and calls in a program. However, in its current form we cannot use it in a select clause like select any(CallGraph::TPathNode n), because it is missing a toString member predicate. This is needed to be able to print the results.

The way to use algebraic data types is discussed in the CodeQL documentation. In our case it suffices to create a class that extends from our algebraic type, so we can define member predicates like toString.

module CallGraph {
    newtype TPathNode =
        TFunction(Function f) or
        TCall(Call c)

    class PathNode extends TPathNode {

        Function asFunction() {
            this = TFunction(result)
        }

        Call asCall() {
            this = TCall(result)
        }

        string toString() {
            result = this.asFunction().toString()
            or
            result = this.asCall().toString()
        }
}

The PathNode class includes two convenience member predicates asFunction and asCall, for easy access to the underlying types, in addition to the required toString member predicate. With the following select clause we can now see the values of our new PathNode type select any(CallGraph::PathNode n).

The next step is updating our edges predicate to use our new PathNode. In the CallGraph module we add the edges predicate. However, the edges predicate can no longer use the calls predicate because that only reasons about functions. To resolve this we follow the Dataflow::Pathgraph module’s implementation and add a getASuccessor member predicate that will be used by the edges predicate.

class PathNode extends TPathNode {
    ...
    PathNode getASuccessor() {
        this.asFunction() = result.asCall().getEnclosingFunction()
        or
        this.asCall().getTarget() = result.asFunction()
    }
}
query predicate edges(PathNode pred, PathNode succ) {
    pred.getASuccessor() = succ
}

If we now combine all this with our previous query we get:

/**
* @kind path-problem
*/
import cpp

module CallGraph {
    newtype TPathNode =
        TFunction(Function f) or
        TCall(Call c)

    class PathNode extends TPathNode {

        Function asFunction() {
            this = TFunction(result)
        }

        Call asCall() {
            this = TCall(result)
        }

        string toString() {
            result = this.asFunction().toString()
            or
            result = this.asCall().toString()
        }

        PathNode getASuccessor() {
            this.asFunction() = result.asCall().getEnclosingFunction()
            or
            this.asCall().getTarget() = result.asFunction()
        }
    }

    query predicate edges(PathNode pred, PathNode succ) {
            pred.getASuccessor() = succ
    }
}

import CallGraph

predicate calls(Function from_, Function to) {
        exists(Call call | call.getEnclosingFunction() = from_ |
            call.getTarget() = to
        )
}

from PathNode source, PathNode reachable
where calls+(source.asFunction(), reachable.asFunction())
select reachable.asFunction(), source, reachable, "This function is reachable from $@", source, source.asFunction().getName()

When you execute this query you will see the paths from caller to callee, including the calls. Unfortunate we are not presented with links to the location of the path nodes but with the text [no location]. Our PathNode didn’t provide any location information that the CodeQL Query Results view can use. An in-depth treatment on how to provide location information can be found in the CodeQL documentation. Because our types already have this information we can make that information available in our PathNode by implementing the getLocation member predicate.

Location getLocation() {
    result = this.asFunction().getLocation()
    or
    result = this.asCall().getLocation()
}

Re-running the query with the getLocation predicate reveals the final result we were looking for, a path from functions to reachable functions and how they are reached.

With that we have reached the end and hopefully a better understanding of path graphs and how you can use them for your own queries. The following are examples of custom path graphs that help developers better understand the results of queries: