The personal site of Remco Vermeulen

Security code reviewing with CodeQL

In the post “Scaling application security with codified security knowledge” I discussed how codifying security knowledge acquired during manual security code reviews can help with scaling application security. In this post I would like to show how one can use CodeQL in a security code review and codify gaps in the security knowledge uncovered during the review. I do so by looking at the similarities between the security code review process and the security query writing process, and at how each can improve the other.

The security code review process is a fundamentally creative process and each reviewer will use a different approach that suits their needs and goals. For an in-depth treatment of security code reviews I suggest reading chapter 4 of The Art of Software Security Assessment. During a security code review you trace the flow of execution, and when tracing code you can follow one of two directions. You either start at the entry points of an application, its attack surface, and follow the flow of data to see where you end up, or you start at potentially vulnerable patterns and trace back to the application’s entry points.

The forward direction approach closely matches how many of the standard CodeQL security queries uncover security issues. The analysis implemented by these queries starts with program elements that are marked as sources of untrusted data and follows the use of that data using data flow analysis to determine if the data is used in a security-sensitive pattern called a sink. The sources are entry points and the sinks are security-sensitive patterns. As in a security code review, a query tries to determine a connection, a path, between a source and a sink. Along the way, if any validation or contextual encoding is encountered, the data flow is inhibited and the query no longer considers that path as a possible connection. In CodeQL parlance encoding functions and validation patterns are called sanitizers and barriers. Sanitizers work directly on the data being tracked, for example by HTML encoding it, and barriers guard insecure use of data, typically with an if statement that validates it is safe to use.
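
To make the source, sink, and sanitizer roles concrete, here is a minimal sketch of a forward-direction query. It assumes the class-based configuration API used in the rest of this post (newer CodeQL releases deprecate it in favour of a module-based API, so it may need adapting), and the configuration name and the choice of sanitizer are purely illustrative. It is not the standard XSS query.

```ql
/**
 * @kind path-problem
 */
import java
import semmle.code.java.dataflow.FlowSources
import semmle.code.java.dataflow.TaintTracking
import semmle.code.java.security.XSS
import DataFlow::PathGraph

// Illustrative configuration; the name is an assumption, not part of the standard library.
class SketchXssConfiguration extends TaintTracking::Configuration {
    SketchXssConfiguration() {
        this = "SketchXssConfiguration"
    }

    // Sources: the entry points, modelled as remote flow sources.
    override predicate isSource(DataFlow::Node node) {
        node instanceof RemoteFlowSource
    }

    // Sinks: the security-sensitive patterns, here places that write raw HTML.
    override predicate isSink(DataFlow::Node node) {
        node instanceof XssSink
    }

    // Sanitizer (an assumption for this sketch): stop tracking values of
    // numeric or boolean type, since they cannot carry markup.
    override predicate isSanitizer(DataFlow::Node node) {
        node.getType() instanceof PrimitiveType
        or
        node.getType() instanceof BoxedType
    }
}

from SketchXssConfiguration c, DataFlow::PathNode source, DataFlow::PathNode sink
where c.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "Untrusted data from $@ reaches an XSS sink.",
    source.getNode(), "this source"
```

Each result is a path a reviewer could have traced by hand: it starts at an entry point, ends at a sensitive pattern, and contains no step that the sanitizer predicate blocks.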

The similarity between the security code review process and security queries means that knowledge already encoded in the queries can be used during a security code review, and any gaps found during the review can be codified to further expand the security knowledge available to queries.

To use this codified knowledge you first run the security queries as is and review the results. The next step, following the forward direction, is determining if the currently codified knowledge sufficiently captures the entry points of the application. Security knowledge that describes patterns of interest is defined in CodeQL through models. Models allow a query to reason about program elements in terms of concepts. An example concept is the remote flow source, which for the Java language is described by the CodeQL class RemoteFlowSource. The RemoteFlowSource class represents all the elements in a program that introduce a value that cannot be trusted. Models fill in the concrete details for each concept. For example, a model of the Spring Framework can state that the parameters of methods annotated with @GetMapping are considered sources of untrusted data. The standard library of CodeQL contains an ever-increasing set of these models, but they are limited to frameworks that publish their source code or their interfaces. Gaps in the modelling can be identified and resolved during a security code review.

The following CodeQL query can be used to determine if you are missing any remote flow sources when targeting the Java language. For other supported languages you can write a similar query.

import java
import semmle.code.java.dataflow.FlowSources

select any(RemoteFlowSource s)
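
A small variant of this query can make the results easier to review. The sketch below assumes that every RemoteFlowSource provides a description through its getSourceType() predicate, which lets you sort the results by the kind of entry point and spot frameworks that are absent from the list.

```ql
import java
import semmle.code.java.dataflow.FlowSources

// List every remote flow source together with a description of its kind.
from RemoteFlowSource s
select s, s.getSourceType()
```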

Many of the RemoteFlowSource implementations can be found in the FlowSources module in case you would like to have examples to start writing your own implementations.
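
As an illustration of what such an implementation can look like, the following sketch models the parameters of methods carrying a custom annotation as untrusted. The annotation com.example.rpc.RpcEndpoint and the class name are hypothetical and stand in for whatever unmodelled framework you encounter during the review.

```ql
import java
import semmle.code.java.dataflow.FlowSources

/**
 * Hypothetical model: parameters of methods annotated with
 * `@com.example.rpc.RpcEndpoint` are considered sources of untrusted data.
 */
class RpcEndpointParameterSource extends RemoteFlowSource {
    RpcEndpointParameterSource() {
        exists(Method m |
            // The annotation name is an assumption made for this example.
            m.getAnAnnotation().getType().hasQualifiedName("com.example.rpc", "RpcEndpoint") and
            this.asParameter() = m.getAParameter()
        )
    }

    override string getSourceType() {
        result = "hypothetical RPC endpoint parameter"
    }
}

// Check that the new model picks up the expected entry points.
select any(RpcEndpointParameterSource s)
```

Because the class extends RemoteFlowSource, queries that rely on that concept should pick up the new entry points once the class is part of the libraries they import.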

Having good coverage of untrusted sources helps the existing queries relying on the RemoteFlowSource class, but it also helps with finding missing sinks, the security-sensitive patterns. As with our remote flow sources, we would like to determine if we have good coverage of our sinks. Did we consider all the security-sensitive patterns that can be reached by untrusted data?

This is a question of interest to both security code reviewers and CodeQL query writers, which means that security code reviewers can benefit from the debugging functionality available to CodeQL query writers. The query UntrustedDataToExternalAPI is such a debugging tool, and it is available for all the supported languages. This query follows untrusted data and lists its use in unknown external API calls, that is, functions the program uses to interact with its environment. They are unknown in the sense that there are no models describing them. A good example of its effectiveness is described in the blog post Pre-Auth RCE with CodeQL in Under 20 Minutes. A critical part of the query’s success is good coverage of sources, and that is why you should first look at the completeness of the values captured by the RemoteFlowSource class.

What about the backward direction approach? Can CodeQL support a reviewer who would like to start with the security-sensitive patterns? Where sources are captured by a few concepts, like the concept of a remote flow source, sinks are captured by a profusion of sink concepts. Each sink concept is specific to the vulnerability you would like to find. If you are looking for cross-site scripting (XSS) issues, the concept will be about patterns that write raw HTML content. When you are looking for path traversal issues, the concept used will reason about I/O operations accepting a path. Listing all these candidate program elements in your target requires a query that lists all these different sinks. Unfortunately, this is currently a cumbersome query to write.

Each security query relying on data flow uses a configuration to make the global data flow analysis that is performed tractable. The analysis is global in the sense that it tracks data across function boundaries. A configuration requires a description of the sources and sinks to reduce the set of candidate program elements to consider. Most configurations will use the RemoteFlowSource class to describe the sources to be considered, but use configuration-specific sinks. The class XssSink in the XSS module, for example, can be used to find all the patterns susceptible to XSS. The following query uses this concept in the same way as the query used to find all the untrusted entry points of the application.

import java
import semmle.code.java.security.XSS

select any(XssSink s)

To enumerate all the interesting patterns for the backward direction we have to import all the known concepts and use them in a query like:

import java
import semmle.code.java.dataflow.DataFlow
import semmle.code.java.security.XSS
import semmle.code.java.security.UrlRedirect
// Imports of other sinks
// ...

from DataFlow::Node sink, string type
where sink instanceof XssSink and type = "XSS sink"
    or
    sink instanceof UrlRedirectSink and type = "URL redirect sink"
    // type checks of other imported sinks
    // or
    // ...
select sink, type

As in the case of sources, it would be interesting to know where information that can reach the sinks originates from. We would be interested in the backward flow of data. One way to find these flows is to define a configuration that flips it into a forward flow problem by using a weak source, that is, a source that describes a broad set of program elements, together with the sinks we are interested in. The following query captures that.

/**
* @kind path-problem
*/
import java
import semmle.code.java.dataflow.TaintTracking
import semmle.code.java.security.XSS
import semmle.code.java.security.UrlRedirect
// Imports of other sinks
// ...
import DataFlow::PathGraph

predicate isCandidate(DataFlow::Node node, string type) {
    node instanceof XssSink and type = "XSS sink"
    or
    node instanceof UrlRedirectSink and type = "URL redirect sink"
    // type checks of other imported sinks
    // or
    // ...
}

class WeakSourceConfiguration extends TaintTracking::Configuration {
    WeakSourceConfiguration() {
        this = "WeakSourceConfiguration"
    }

    override predicate isSource(DataFlow::Node node) {
        any()
    }

    override predicate isSink(DataFlow::Node node) {
        isCandidate(node, _)
    }
}

from WeakSourceConfiguration c, DataFlow::PathNode source, DataFlow::PathNode sink, string type
where c.hasFlowPath(source, sink) and isCandidate(sink.getNode(), type)
select sink, source, sink, "Flow between weak source and sink of type " + type + "."

While this query works, it will probably not perform well on larger code bases because every program element is treated as a source. A better solution is available to CodeQL query writers that can also help us in exploring these backward flows.

When writing a configuration, the analysis may encounter program elements where it does not know how to continue. A common example is a tracked value that is passed to a library for which we do not have a model and whose implementation we cannot investigate. Each data flow or taint tracking configuration builds a graph to determine if sources can reach sinks. For debugging purposes it is possible to build a partial graph to determine where the graph cannot be extended. This debugging technique is documented in the CodeQL documentation on Debugging data flow queries using partial flow.
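
The following is a minimal sketch of such a partial flow query, written against the same class-based configuration API as the queries above. It assumes that this API exposes the explorationLimit and hasPartialFlow predicates together with the DataFlow::PartialPathGraph module, as described in that documentation for this API; newer CodeQL releases provide partial flow through a different, module-based interface. The configuration name and the limit of 5 are illustrative.

```ql
/**
 * @kind path-problem
 */
import java
import semmle.code.java.dataflow.FlowSources
import semmle.code.java.dataflow.TaintTracking
import DataFlow::PartialPathGraph

// Illustrative configuration used only for exploration.
class ExploreConfiguration extends TaintTracking::Configuration {
    ExploreConfiguration() {
        this = "ExploreConfiguration"
    }

    override predicate isSource(DataFlow::Node node) {
        node instanceof RemoteFlowSource
    }

    // Assumption: partial flow exploration does not depend on sinks,
    // so none are defined here.
    override predicate isSink(DataFlow::Node node) {
        none()
    }

    // Bound the exploration to a small number of steps from each source.
    override int explorationLimit() {
        result = 5
    }
}

from ExploreConfiguration c, DataFlow::PartialPathNode source, DataFlow::PartialPathNode node, int dist
where c.hasPartialFlow(source, node, dist)
select node.getNode(), source, node,
    "Partial flow reached this node at distance " + dist.toString() + "."
```

Nodes where the paths stop unexpectedly are candidates for missing models or additional flow steps. On a real code base it usually helps to narrow isSource to the single entry point you are investigating, since exploring from every remote flow source can produce a very large result set.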

Applying the partial path graph technique is also a good next step after you have executed the query UntrustedDataToExternalAPI in the forward direction approach, to determine whether the analysis is missing any flow steps that would uncover new paths to sinks.

To conclude, CodeQL can be a useful tool when performing a security code review. Existing security knowledge can help you enumerate the attack surface and find interesting security-sensitive patterns in a code base. Debugging facilities can help you find gaps in the existing security knowledge and help in finding new vulnerabilities. When these gaps are identified and codified, the new security knowledge can be used to uncover new variants in the same code base or in other code bases. This iterative process leads to security knowledge that is continually reused and improved. In other words, we are scaling application security by scaling security knowledge.

So the next time you are performing a security code review, try to see if CodeQL can help by following these five steps, and maybe you will find a vulnerability in 20 minutes.

  1. Run the security queries and analyze the results.
  2. Enumerate the attack surface using the RemoteFlowSource class and add any missing entry points.
  3. Run the UntrustedDataToExternalAPI query to uncover interesting sinks.
  4. Use partial flow graphs to uncover missing paths between sources and sinks.
  5. Re-run the security queries if you added new sources, sinks, or additional data flow steps as a result of the previous steps.