CodeQL for Beginners

Introduction to CodeQL

CodeQL is a semantic code analysis engine originally developed by Semmle, which GitHub acquired in 2019. It lets developers query code as though it were data: just as you might query a database to find specific records, you can use CodeQL to find specific patterns in your code.

For instance, suppose you want to find every if statement in your codebase that doesn't have an else branch. Done by hand or with text search, this is tedious and error-prone; with CodeQL, a short query returns every such instance.

Here's a simple example of what such a query looks like, written against the CodeQL library for Java:

import java

from IfStmt ifStmt
where not exists(ifStmt.getElse())
select ifStmt, "This 'if' statement doesn't have an 'else' branch."

This query considers every if statement (IfStmt) in the codebase, keeps those for which no else branch exists (not exists(ifStmt.getElse())), and selects them along with a message.
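To make the "code as data" idea concrete without installing anything, here is a minimal sketch of the same search written in plain Python with the standard ast module. It is an analogy, not CodeQL: the function name find_ifs_without_else and the sample source are made up for illustration.

```python
import ast


def find_ifs_without_else(source: str) -> list[int]:
    """Return the line numbers of `if` statements whose else part is empty.

    An `elif` is represented as a nested `if`, so an `elif` without a
    final `else` is reported at its own line.
    """
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.If) and not node.orelse
    )


SAMPLE = """\
if a:
    pass
if b:
    pass
else:
    pass
"""

# Only the first `if` (line 1) lacks an else branch.
print(find_ifs_without_else(SAMPLE))
```

The parsed tree plays the role of the database, and the generator expression plays the role of the from, where, and select clauses.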

The primary use of CodeQL, however, is not merely to find syntactical patterns like in the example above but to identify more complex, semantic patterns that can highlight potential security vulnerabilities. In fact, it's one of the most powerful tools currently available for semantic code analysis in the context of security.

The Importance of Security Testing

In today's era, where software forms the backbone of numerous critical systems, ensuring software security has never been more important. Security vulnerabilities in code not only pose a risk to data privacy, but they can also lead to financial losses and damage to an organization's reputation. Security testing forms the first line of defense against such vulnerabilities.

Semantic code analysis tools like CodeQL allow developers to analyze their code from a security standpoint. These tools can uncover complex security vulnerabilities by analyzing the meaning of code, which is something that traditional syntactic code scanners may miss. For instance, CodeQL can help you find SQL injection vulnerabilities, cross-site scripting (XSS) vulnerabilities, and more, even in a large and complex codebase.

As you work through this guide, you will progress from simple syntactic queries to data flow analyses that can find vulnerabilities such as these in your own codebase.

What is Code Analysis?

Code analysis, in the sense used here (static analysis), is a method of finding defects by examining source code without running the program. It works by checking the code against one or more sets of coding rules. Code analysis is an important aspect of software development: it improves software quality and speeds up development by catching bugs at an early stage.

Code analysis can be done both manually and automatically. Manual code reviews are time-consuming and can be error-prone. Automated code analysis, on the other hand, offers an efficient and reliable alternative. CodeQL is an example of an automated code analysis tool.

Note: CodeQL is particularly powerful because it performs semantic code analysis, which goes a step beyond mere syntactic analysis to understand the 'meaning' of code.

How Code Analysis Improves Software Security

Code analysis plays a pivotal role in improving software security. By identifying potential vulnerabilities at the coding stage, it can help prevent security breaches that might occur when the software is in use. Some ways in which code analysis enhances software security include:

  1. Early Bug Detection: Code analysis can uncover bugs and vulnerabilities early in the development process, even before the testing phase. This allows developers to fix problems before they can be exploited in a live environment.

  2. Automation: Automated code analysis tools like CodeQL can scan large codebases quickly and apply the same checks consistently to every file, which manual review rarely achieves.

  3. Coding Standards Compliance: Code analysis can check that the code complies with standard coding practices. Following these practices can prevent a number of common security issues.

  4. In-depth Analysis: Tools like CodeQL allow for deep, semantic analysis of code. This means they can find complex vulnerabilities that may be missed by simple syntactic analysis.

Setup

In this chapter, we'll go through the process of setting up your CodeQL environment. We'll cover the steps to download and install CodeQL, introduce you to the CodeQL command-line interface (CLI), and discuss setting up CodeQL for various Integrated Development Environments (IDEs).

Downloading and Installing CodeQL

Before you start with CodeQL, you need to download and install it. Here's how you can do it:

  1. Download the CodeQL CLI: The CodeQL CLI can be downloaded from the releases page of GitHub's codeql-cli-binaries repository. Make sure to choose the archive that matches your operating system.

  2. Unpack the archive: Once you've downloaded the archive, unpack it to a location of your choice.

  3. Add CodeQL to your PATH: After unpacking, add the directory that contains the codeql executable to your system's PATH environment variable. This will allow you to run CodeQL commands from anywhere.

Here's an example of how you can add CodeQL to your PATH on a Unix-like system:

export PATH="$PATH:/path/to/codeql"

And here's how you can do it on Windows, using PowerShell:

$env:Path += ";C:\path\to\codeql"

Replace /path/to/codeql and C:\path\to\codeql with the directory that contains the codeql executable on your system. Both commands affect only the current shell session; add the line to your shell profile to make the change permanent.
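A quick way to confirm that the PATH change took effect is to ask the operating system where it would find the executable. The following is a minimal sketch using only the Python standard library; the helper name find_codeql is an illustrative choice.

```python
import shutil


def find_codeql(executable: str = "codeql"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


location = find_codeql()
if location is None:
    print("codeql was not found on PATH; re-check the PATH step above")
else:
    print(f"codeql found at {location}")
```

Run it from a new terminal session so that it sees the same environment your shell will use.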

CodeQL CLI

The CodeQL command-line interface (CLI) is a powerful tool that allows you to run CodeQL queries, create databases for analysis, and perform a variety of other tasks.

Some basic commands you might find useful include:

  • codeql database create: This command creates a new CodeQL database. You can specify the language of the database with the --language option.

  • codeql query run: This command runs a CodeQL query on a database.

Here's an example of how you might use these commands:

# Create a new JavaScript database
codeql database create my-js-database --language=javascript --source-root ./my-js-project

# Run a query on the database
codeql query run ./my-query.ql --database my-js-database

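In the example above, replace ./my-js-project with the path to your JavaScript project, and ./my-query.ql with the path to your CodeQL query. The query file must live in a CodeQL pack (a directory with a qlpack.yml file that declares the language library as a dependency) so that imports such as import javascript can be resolved.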
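If you want to drive these two commands from a script, one option is to build them as argument lists and hand them to subprocess. The sketch below only constructs and prints the commands; it uses no flags beyond those shown above, and the helper names are made up for illustration.

```python
import shlex


def database_create_cmd(database: str, language: str, source_root: str) -> list[str]:
    """Build the argument list for `codeql database create`."""
    return [
        "codeql", "database", "create", database,
        f"--language={language}",
        "--source-root", source_root,
    ]


def query_run_cmd(query: str, database: str) -> list[str]:
    """Build the argument list for `codeql query run`."""
    return ["codeql", "query", "run", query, "--database", database]


commands = [
    database_create_cmd("my-js-database", "javascript", "./my-js-project"),
    query_run_cmd("./my-query.ql", "my-js-database"),
]

for cmd in commands:
    # Print a shell-safe rendering of each command. To execute it instead,
    # pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Passing a list rather than a single string avoids shell quoting problems when paths contain spaces.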

Setting Up CodeQL for Various IDEs

While you can run CodeQL queries from the command line, you might find it more convenient to work in an editor. GitHub maintains an official CodeQL extension for Visual Studio Code, which is the integration covered here.

Visual Studio Code

For instance, if you're using Visual Studio Code, you can install the CodeQL for Visual Studio Code extension. This extension provides CodeQL syntax highlighting, query help, and database management.

To install the extension, open Visual Studio Code and follow these steps:

  1. Click on the Extensions view icon on the Sidebar (or press Ctrl+Shift+X).

  2. Search for "CodeQL".

  3. Click on Install.

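Once installed, you can add a CodeQL database from the Databases view of the extension, for example by choosing a database folder. With a database selected, open a .ql file and run it with the CodeQL run-query command from the Command Palette or the editor's context menu. The exact command labels vary between extension versions.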

CodeQL Queries

The Structure of CodeQL Queries

A typical CodeQL query has the following components:

  1. Import Statements: After any metadata, a query begins with import statements. These statements import the CodeQL libraries for the specific language you're analyzing.

  2. From-Where-Select Blocks: CodeQL queries retrieve data using from-where-select blocks.

    • The from clause defines a variable with a specific type.

    • The where clause sets a condition that the data needs to satisfy.

    • The select clause decides the final output of the query.

  3. Query Metadata: Query metadata, enclosed in a comment block at the beginning of the query file, provides information about the query. It can include the purpose of the query, its author, and more.

Here's an example of a simple CodeQL query to find all Python functions named execute:

import python

from Function f
where f.getName() = "execute"
select f

In this query, we import the CodeQL library for Python with import python. Then, we define a variable f of type Function, and in the where clause, we set a condition that the function's name should be "execute". Finally, we select the function.

Understanding CodeQL Libraries

CodeQL has libraries for different programming languages. These libraries contain classes that represent various elements of the code you're analyzing, and predicates that represent the properties and relations of these elements.

For instance, the Python library includes classes such as Function for Python functions, Class for Python classes, and Module for Python modules. These classes have associated predicates. For example, the Function class has predicates like getName() to get the function's name and getArg(int index) to get the parameter at a given position.

Here's an example query that uses the Python library to find all calls to a function named execute:

import python

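from Call c
where
  c.getFunc().(Name).getId() = "execute" or
  c.getFunc().(Attribute).getName() = "execute"
select c

In this query, we define a variable c of type Call, which represents a function call. In the where clause, we set a condition that the function being called should be named "execute". The expression being called is returned by getFunc(), and the two casts cover the two forms it can take: a plain name, as in execute(...), or an attribute access, as in cursor.execute(...). Finally, we select the call.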
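A call's target can be a plain name (execute(...)) or an attribute access (cursor.execute(...)), and any query about calls has to decide which of these shapes it matches. The distinction is easy to see with Python's own ast module. This is an illustrative sketch rather than CodeQL, and calls_named is a made-up helper.

```python
import ast


def calls_named(source: str, name: str) -> list[int]:
    """Return line numbers of calls whose target is `name` or `<expr>.name`."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id == name:
            lines.append(node.lineno)      # execute(...)
        elif isinstance(func, ast.Attribute) and func.attr == name:
            lines.append(node.lineno)      # something.execute(...)
    return sorted(lines)


SAMPLE = """\
execute("a")
cursor.execute("b")
db.session.execute("c")
run("d")
"""

print(calls_named(SAMPLE, "execute"))
```

Matching by name alone says nothing about which function is actually called; that is the gap that semantic analysis fills.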

Querying for Vulnerabilities

CodeQL is a powerful tool for finding vulnerabilities in code. It can help you find patterns in your code that could lead to security vulnerabilities.

For instance, consider the following Python code:

@app.route('/api/data')
def api_data():
    param = request.args.get('param', '')
    results = db.session.execute('SELECT * FROM data WHERE name = %s' % param)
    ...

This code is vulnerable to SQL Injection because it directly includes a user-supplied parameter (param) in a SQL query.
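The effect is easy to reproduce in isolation. The sketch below uses the standard sqlite3 module with an in-memory database instead of the web framework above; the table contents and function names are made up for illustration. It contrasts string formatting with a parameterized query.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    """Create an in-memory database with two rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO data VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn: sqlite3.Connection, param: str) -> list:
    # Vulnerable: the parameter becomes part of the SQL text itself.
    return conn.execute("SELECT * FROM data WHERE name = '%s'" % param).fetchall()


def lookup_safe(conn: sqlite3.Connection, param: str) -> list:
    # Safe: the parameter is passed separately and is never parsed as SQL.
    return conn.execute("SELECT * FROM data WHERE name = ?", (param,)).fetchall()


conn = setup()
payload = "x' OR '1'='1"

print(len(lookup_unsafe(conn, payload)))  # every row is returned
print(len(lookup_safe(conn, payload)))    # no row has that literal name
```

The unsafe version returns the whole table because the payload rewrites the WHERE clause; the parameterized version treats the payload as an ordinary string.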

CodeQL ships a standard query for this vulnerability class (py/sql-injection). In recent releases of the Python libraries, a query built on the same library looks like this:

/**
 * @name SQL query built from user-controlled sources
 * @kind path-problem
 * @id py/example-sql-injection
 */

import python
import semmle.python.security.dataflow.SqlInjectionQuery
import SqlInjectionFlow::PathGraph

from SqlInjectionFlow::PathNode source, SqlInjectionFlow::PathNode sink
where SqlInjectionFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "This code may be vulnerable to SQL Injection."

The comment block at the top is the query metadata; @kind path-problem tells CodeQL that the results include data flow paths.

The SqlInjectionQuery library supplies the three ingredients of the analysis:

  1. Sources: places where user-controlled data enters the program, such as request.args.get('param', '') in the handler above.

  2. Sinks: arguments that are executed as SQL by a database library that CodeQL models.

  3. Taint steps: operations that propagate user-controlled data, such as the % string formatting that builds the query text.

SqlInjectionFlow::flowPath(source, sink) holds when tainted data can travel from a source to a sink. Finally, we select the sink, together with the path that leads to it, and output a warning message.

Module names in the standard libraries have changed between CodeQL releases, so check them against the version you have installed.

CodeQL for Programming Languages

While CodeQL is language-agnostic in its core principles, different programming languages have unique features and vulnerabilities. In this chapter, we will explore how CodeQL is used with different programming languages, focusing on JavaScript, Python, and Java.

CodeQL for JavaScript

JavaScript is often used in web applications, which are prime targets for security exploits. Let's write a CodeQL query to identify a common security vulnerability in JavaScript: Cross-Site Scripting (XSS).

XSS happens when untrusted input is directly included in output that gets rendered in a user's browser. For example, consider the following piece of Node.js code using the Express framework:

app.get('/sayHello', (req, res) => {
  res.send('Hello, ' + req.query.name + '!');
});

The name query parameter is directly included in the response sent to the client. If it contains HTML markup such as a script element, the browser executes the embedded JavaScript in the context of the page.
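The underlying problem is independent of the framework: text from the user is concatenated into HTML without being escaped. As a language-neutral illustration, here is a minimal Python sketch using the standard html module; the function names are made up and no web server is involved.

```python
import html


def say_hello_unsafe(name: str) -> str:
    # The user-supplied value is inserted into the page as-is.
    return "Hello, " + name + "!"


def say_hello_escaped(name: str) -> str:
    # html.escape turns markup characters into harmless character references.
    return "Hello, " + html.escape(name) + "!"


payload = "<script>alert(1)</script>"

print(say_hello_unsafe(payload))
print(say_hello_escaped(payload))
```

In data flow terms, the escaping call acts as a sanitizer: a value that has passed through it is no longer dangerous in an HTML context.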

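Here's a CodeQL query that detects similar issues. It relies on the reflected XSS analysis that ships with the CodeQL library for JavaScript; the module names below are those used by recent releases and may differ in the version you have installed:

import javascript
import semmle.javascript.security.dataflow.ReflectedXssQuery
import ReflectedXssFlow::PathGraph

from ReflectedXssFlow::PathNode source, ReflectedXssFlow::PathNode sink
where ReflectedXssFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "This code may be vulnerable to Cross-Site Scripting (XSS)."

This query uses the data flow library to follow data from sources (user input such as req.query.name) to sinks (where the data gets used in a potentially unsafe way, such as the argument of res.send()). A result is reported only when there is no sanitizer (code that cleans the input, for example by HTML-escaping it) on the data flow path from the source to the sink.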

CodeQL for Python

Python is widely used in web and network applications, data analysis, and more. We'll focus on a common security issue in web applications: Open Redirect.

Open Redirect vulnerabilities occur when an application incorporates user-controllable data into the target of a redirection in an unsafe way. Consider the following Python code using the Flask framework:

@app.route('/redirect')
def redirect_user():  # not named "redirect", which would shadow Flask's redirect()
    target = request.args.get('target', '/')
    return redirect(target, code=302)

In this example, the application redirects the user to the URL they specified in the target parameter. An attacker could use this to redirect the user to a phishing page.
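One common mitigation is to accept only targets that stay on the same site. The following is a minimal sketch of such a check using the standard urllib.parse module. The helper name is_safe_redirect_target is hypothetical, and the check is deliberately conservative rather than a complete URL validator.

```python
from urllib.parse import urlsplit


def is_safe_redirect_target(target: str) -> bool:
    """Accept only same-site absolute paths such as '/account'."""
    # Must be a path on this site: one leading slash, not two
    # (a leading '//' is a scheme-relative URL pointing at another host).
    if not target.startswith("/") or target.startswith("//"):
        return False
    # Some browsers treat backslashes like slashes and drop control
    # characters, so reject both outright.
    if "\\" in target or any(ord(ch) < 0x20 for ch in target):
        return False
    parts = urlsplit(target)
    return parts.scheme == "" and parts.netloc == ""


for candidate in ["/account", "/next?page=2", "https://evil.example/", "//evil.example/"]:
    print(candidate, is_safe_redirect_target(candidate))
```

A handler would fall back to a fixed default such as "/" whenever the check fails.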

Here's a CodeQL query that finds similar issues in a Python codebase. It uses the URL redirection analysis that ships with recent releases of the CodeQL library for Python:

import python
import semmle.python.security.dataflow.UrlRedirectQuery
import UrlRedirectFlow::PathGraph

from UrlRedirectFlow::PathNode source, UrlRedirectFlow::PathNode sink
where UrlRedirectFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "This code may be vulnerable to Open Redirect."

This query relies on the library's models of web frameworks such as Flask to identify redirect calls. It then uses data flow analysis to check whether tainted (that is, user-controlled) data can flow into the URL argument of the redirect call.

CodeQL for Java

Java is commonly used in web applications, server-side applications, and Android apps. We'll write a CodeQL query to detect a frequent security vulnerability: SQL Injection.

SQL Injection happens when untrusted input is included directly in a SQL query. Consider the following Java code:

String query = "SELECT * FROM users WHERE name = '" + userName + "'";
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(query);

Here, userName is included directly in the SQL query. If it includes SQL code, this code gets executed in the database.

The following CodeQL query identifies similar issues. It is written against the modular data flow API used by recent CodeQL releases:

import java
import semmle.code.java.dataflow.TaintTracking
import semmle.code.java.dataflow.FlowSources
import semmle.code.java.security.QueryInjection

module SqlInjectionConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) { source instanceof RemoteFlowSource }

  predicate isSink(DataFlow::Node sink) { sink instanceof QueryInjectionSink }
}

module SqlInjectionFlow = TaintTracking::Global<SqlInjectionConfig>;

from DataFlow::Node source, DataFlow::Node sink
where SqlInjectionFlow::flow(source, sink)
select sink, "This code may be vulnerable to SQL Injection."

This query uses taint tracking to find flows from remote user input (RemoteFlowSource, the source) to a SQL injection sink (QueryInjectionSink, where a string is executed as a query). Both classes are defined in the CodeQL library for Java.

Additional CodeQL Techniques

After understanding how to write basic CodeQL queries for different programming languages, it's time to deepen your understanding of CodeQL. In this chapter, we will discuss more advanced techniques to write CodeQL queries, such as using path queries for detailed analysis and incorporating control flow analysis.

Path Queries

While most CodeQL queries simply identify problematic code patterns, sometimes you need more detailed information. Path queries provide more context by showing the data flow path from a source (where data comes from) to a sink (where it ends up). They are especially useful for understanding how a vulnerability arises from the propagation of tainted data.

For example, consider a SQL Injection vulnerability in a Java application. The following path query, written against the modular data flow API used by recent CodeQL releases, can help us identify how tainted data flows from a source to a sink:

/**
 * @kind path-problem
 */

import java
import semmle.code.java.dataflow.TaintTracking
import semmle.code.java.dataflow.FlowSources
import semmle.code.java.security.QueryInjection

module SqlInjectionConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) { source instanceof RemoteFlowSource }

  predicate isSink(DataFlow::Node sink) { sink instanceof QueryInjectionSink }
}

module SqlInjectionFlow = TaintTracking::Global<SqlInjectionConfig>;

import SqlInjectionFlow::PathGraph

from SqlInjectionFlow::PathNode source, SqlInjectionFlow::PathNode sink
where SqlInjectionFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "This code may be vulnerable to SQL Injection."

In this query, we define a data flow configuration that identifies tainted data flowing from user input to a SQL Injection sink. The flowPath(source, sink) predicate checks for the existence of such a data flow path. Three things turn this into a path query: the @kind path-problem metadata, the import of the PathGraph module, and the use of PathNode values in the select clause. The query then outputs not only the sink, but also the source and the entire data flow path, providing more context about the vulnerability.
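Conceptually, a path query answers a graph reachability question: is there a chain of data flow steps from a source to a sink, and if so, what is it? The toy sketch below models that idea in plain Python with a breadth-first search. The node labels are made up for illustration, and CodeQL's real analysis is considerably more sophisticated.

```python
from collections import deque


def find_flow_path(edges: dict, source: str, sink: str):
    """Return the shortest path from source to sink, or None if unreachable."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == sink:
            return path
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Each edge means "data can flow from the key to the listed nodes".
FLOW = {
    "request.getParameter": ["userName"],
    "userName": ["query"],
    "query": ["statement.executeQuery"],
    "config.getProperty": ["timeout"],
}

print(find_flow_path(FLOW, "request.getParameter", "statement.executeQuery"))
print(find_flow_path(FLOW, "config.getProperty", "statement.executeQuery"))
```

The first call returns the full chain of steps, which is the extra context a path query reports; the second returns None because configuration data never reaches the query.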

Control Flow Analysis

Control flow analysis allows you to track the execution path through a program. This is useful when you want to understand the order in which statements and expressions are evaluated. CodeQL provides classes and predicates for control flow analysis in its standard libraries.

Here's an example of a control flow analysis query for a Java codebase, which finds places where a null check is performed after a variable has already been dereferenced:

import java
import semmle.code.java.controlflow.Dominance

from LocalScopeVariable v, MethodCall call, EqualityTest check
where
  call.getQualifier() = v.getAnAccess() and
  check.getAnOperand() = v.getAnAccess() and
  check.getAnOperand() instanceof NullLiteral and
  strictlyDominates(call.getControlFlowNode(), check.getControlFlowNode())
select call, "This call dereferences a variable that is only checked for null $@.", check, "later"

In this query, call is a method call whose qualifier is an access to the variable v, and check is an == or != comparison between v and null. The strictlyDominates() predicate holds when every execution path that reaches the null check has already passed through the call, which means the variable was dereferenced before it was tested. This is a simplified sketch: it does not account for v being reassigned between the call and the check, and class names such as MethodCall are those used by recent versions of the Java library.
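Dominance is the key notion behind "evaluated before": node A dominates node B if every path from the entry of the control flow graph to B passes through A. The following toy sketch computes dominator sets with the classic iterative algorithm. The graph and node names are made up for illustration and do not come from CodeQL.

```python
def compute_dominators(cfg: dict, entry: str) -> dict:
    """Map each node to the set of nodes that dominate it (including itself)."""
    nodes = set(cfg)
    for successors in cfg.values():
        nodes.update(successors)

    # Build the predecessor relation from the successor lists.
    preds = {n: set() for n in nodes}
    for n, successors in cfg.items():
        for s in successors:
            preds[s].add(n)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


# entry -> use -> check -> (then -> exit | exit)
CFG = {
    "entry": ["use"],
    "use": ["check"],
    "check": ["then", "exit"],
    "then": ["exit"],
    "exit": [],
}

dom = compute_dominators(CFG, "entry")
print("use" in dom["check"])   # the use always happens before the check
print("then" in dom["exit"])   # the then-branch can be skipped
```

Here "use" dominates "check", which is the situation a used-before-checked query looks for, while "then" does not dominate "exit" because the branch may be bypassed.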

Advanced Libraries and Class Definitions

As your needs get more complex, you will start to define your own classes and predicates in CodeQL. You can also make use of advanced CodeQL libraries that define classes and predicates for common code patterns and vulnerabilities.

For instance, CodeQL provides libraries for working with various web frameworks (like Express.js for JavaScript and Django for Python), identifying standard sources of user input and sinks of potential vulnerabilities, and tracking the flow of data and control in a program.

Here's an example of a custom class definition in a CodeQL query:

import java

class PublicMutableField extends Field {
  PublicMutableField() {
    this.isPublic() and
    not this.isFinal() and
    not this.isStatic()
  }
}

from PublicMutableField field
select field, "This field is public and mutable."

This query defines a new class PublicMutableField for public, non-final, non-static fields, which can be unsafe because they can be accessed and modified from anywhere. It then finds all instances of this class in a Java codebase.

Query Optimizations

As your CodeQL queries get more complex, they may also get slower. There are various ways to optimize CodeQL queries for better performance.

One of the most effective ways to speed up a CodeQL query is to limit the number of possibilities it needs to consider. Make sure every variable declared in the from clause is constrained in the where clause: a variable left unconstrained forces the evaluator to pair each result with every value of that variable's type (a Cartesian product).

Also, it's recommended to use specific types as much as possible. For instance, if you know that a variable represents a string, use StrConst instead of Expr. The more specific the type, the faster CodeQL can find instances of it.

Here's an example that applies this advice to a Python codebase:

import python

from Call call, Attribute callee, StrConst sql
where
  callee = call.getFunc() and
  callee.getName() = "execute" and
  sql = call.getArg(0)
select call, "Call to execute() with a constant SQL string."

In this query, callee is declared as Attribute rather than the more general Expr, and sql as StrConst rather than Expr. Each variable ranges over a small set of values, and every variable is tied to call in the where clause, so the query never builds a Cartesian product.

Analyzing Real-World Vulnerabilities with CodeQL

Understanding how vulnerabilities exist in real-world code is a key step towards effective security testing. In this chapter, we're going to take a closer look at a few real-world vulnerabilities and see how we can utilize CodeQL to identify these vulnerabilities in a codebase.

CVE-2017-5638: Apache Struts Command Injection Vulnerability

One of the most impactful vulnerabilities in recent memory was a command injection vulnerability in the Apache Struts web application framework, identified by CVE-2017-5638. The vulnerability existed in the way Struts processed Content-Type headers in an HTTP request.

The affected code path began with a check similar to the following (a simplified illustration, not the actual Struts source):

String contentType = request.getContentType();
if(contentType != null && contentType.indexOf("multipart") > -1){
    // process request...
}

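The check itself is harmless. The problem was what happened when parsing of a multipart request failed: Struts built an error message that included the raw Content-Type value and passed it to its message lookup code, which evaluated any OGNL expression embedded in the text (OGNL is the expression language Struts uses for accessing and manipulating data). By sending a crafted Content-Type header, an attacker could run arbitrary commands on the server.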
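The general hazard is that user input ends up inside text that is later evaluated as an expression or template. The toy sketch below reproduces that shape in Python with a tiny placeholder expander. It is not OGNL and not Struts code; the render function, the %{name} syntax as used here, and the context values are all made up for illustration.

```python
import re

CONTEXT = {"app_name": "demo", "db_password": "hunter2"}


def render(template: str, context: dict) -> str:
    """Toy evaluator: replace each %{name} with the matching context value."""
    return re.sub(
        r"%\{(\w+)\}",
        lambda match: str(context.get(match.group(1), "")),
        template,
    )


def error_message_unsafe(content_type: str) -> str:
    # User input becomes part of the template, so it gets evaluated.
    return render("Unsupported content type: " + content_type, CONTEXT)


def error_message_safe(content_type: str) -> str:
    # The template is a constant; user input is appended after evaluation.
    return render("Unsupported content type in %{app_name}: ", CONTEXT) + content_type


header = "%{db_password}"

print(error_message_unsafe(header))
print(error_message_safe(header))
```

The unsafe version leaks a value from the evaluation context; with a full expression language in place of this placeholder lookup, the same mistake leads to code execution.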

A CodeQL query that can help detect this type of vulnerability is shown below. It matches the source and the sink by method name, which is a simplification, and it is written against the modular data flow API used by recent CodeQL releases:

/**
 * @kind path-problem
 */

import java
import semmle.code.java.dataflow.TaintTracking

module StrutsContentTypeConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) {
    source.asExpr().(MethodCall).getMethod().hasName("getContentType")
  }

  predicate isSink(DataFlow::Node sink) {
    exists(MethodCall call |
      call.getMethod().hasName("findText") and
      sink.asExpr() = call.getAnArgument()
    )
  }
}

module StrutsContentTypeFlow = TaintTracking::Global<StrutsContentTypeConfig>;

import StrutsContentTypeFlow::PathGraph

from StrutsContentTypeFlow::PathNode source, StrutsContentTypeFlow::PathNode sink
where StrutsContentTypeFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Potential Apache Struts Command Injection vulnerability."

This query uses the TaintTracking library to find data flows from the getContentType() method (the source) to the arguments of findText(), the Struts message lookup method through which the error message was evaluated (the sink). If there is a data flow path, this could indicate a potential injection vulnerability and deserves manual review.

CVE-2018-11776: Apache Struts Remote Code Execution Vulnerability

Another critical vulnerability in Apache Struts was identified by CVE-2018-11776. It allowed remote code execution through the use of a specially crafted URL. The root cause was insufficient validation of untrusted input: under certain configurations, the namespace portion of the request URL could be evaluated as an OGNL expression.

A simplified illustration of this kind of pattern (not the actual Struts source) is as follows:

String actionName = getActionMappingName(someUserInput);
// actionName used later without proper validation

A potential CodeQL query to detect this kind of pattern is shown below. The method names follow the simplified illustration above rather than the real Struts code, and the query is written against the modular data flow API used by recent CodeQL releases:

/**
 * @kind path-problem
 */

import java
import semmle.code.java.dataflow.TaintTracking

module StrutsActionNameConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) {
    source.asExpr().(MethodCall).getMethod().hasName("getActionMappingName")
  }

  predicate isSink(DataFlow::Node sink) {
    exists(MethodCall call |
      call.getMethod().hasName(["addActionError", "addActionMessage", "addFieldError"]) and
      sink.asExpr() = call.getArgument(0)
    )
  }
}

module StrutsActionNameFlow = TaintTracking::Global<StrutsActionNameConfig>;

import StrutsActionNameFlow::PathGraph

from StrutsActionNameFlow::PathNode source, StrutsActionNameFlow::PathNode sink
where StrutsActionNameFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Potential Apache Struts Remote Code Execution vulnerability."

This query uses taint tracking to find data flows from getActionMappingName() to the first argument of methods like addActionError(), addActionMessage(), and addFieldError(). Note that the sink is the argument of the call, not the call itself. A reported path means the value reaches one of these methods without passing through a recognized validation step, which is the kind of flow that can lead to remote code execution when the receiving method evaluates its input.

Incorporating CodeQL into Security Testing Practices

So far, you have learned how to use CodeQL to identify vulnerabilities in your code. However, to make the most of it, you should incorporate CodeQL into your regular security testing practices. This chapter outlines strategies to embed CodeQL in your organization's security practices, allowing for continuous security testing.

Security Testing in CI/CD Pipelines

In modern software development, continuous integration and continuous deployment (CI/CD) pipelines are crucial. CodeQL can be integrated into these pipelines to automatically perform security analysis on your codebase whenever changes are made.

For example, GitHub offers a CodeQL GitHub Action that you can use in your GitHub workflows. It automatically scans your codebase whenever you push changes or make a pull request.

Here's a sample workflow configuration for a JavaScript project:

name: "CodeQL"

on:
  push:
    branches: [ main ]
  pull_request:
    # The branches below must be a subset of the branches above
    branches: [ main ]

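jobs:
  analyze:
    name: Analyze
    runs-on: ubuntu-latest
    permissions:
      # Required to upload the analysis results to code scanning
      security-events: write
      contents: read

    steps:
    - name: Checkout repository
      uses: actions/checkout@v4

    - name: Initialize CodeQL
      uses: github/codeql-action/init@v3
      with:
        languages: "javascript"

    - name: Analyze
      uses: github/codeql-action/analyze@v3

In this workflow, the github/codeql-action/init step sets up the CodeQL tooling for the languages you specify. The github/codeql-action/analyze step then builds the CodeQL database, runs the queries, and uploads the results to GitHub code scanning, which is why the job needs the security-events: write permission. Older major versions of these actions have been retired, so pin each action to a currently supported major version.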

Code Review and Bug Bounties

In addition to automated security testing, CodeQL can also be used in manual security testing practices, such as code reviews and bug bounty programs.

During code reviews, you can use CodeQL queries to look for problematic code patterns related to the changes being reviewed. This can make your code reviews more effective and help educate your developers about secure coding practices.

In bug bounty programs, you can provide CodeQL as a tool for bounty hunters. If they can formulate a CodeQL query that finds a bug, they can submit both the bug and the query. This way, you not only get the bug report, but also a way to prevent similar bugs in the future.

Training and Awareness

Finally, CodeQL can be a great tool for security training and awareness. By teaching your developers how to use CodeQL, you can make them more aware of secure coding practices and how vulnerabilities arise in code.

For example, you can organize internal workshops where developers write CodeQL queries to find real-world vulnerabilities. You can also use the queries provided by CodeQL's standard libraries as examples to illustrate common vulnerabilities.

Creating a Secure Development Lifecycle

Incorporating CodeQL into your CI/CD pipelines is just one part of creating a Secure Development Lifecycle (SDL). An SDL incorporates security practices into every stage of your development process, from design and development to testing and deployment.

In the design and development stage, you can use CodeQL to enforce secure coding standards. For example, you can write CodeQL queries that find violations of these standards, and use them to automatically comment on pull requests when violations are detected.

In the testing stage, you can use CodeQL as part of your security testing suite. By integrating CodeQL into your testing frameworks, you can automatically detect vulnerabilities in your codebase. For example, you could set up a nightly build that runs a suite of CodeQL queries against your codebase and alerts you to any new potential vulnerabilities.
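A nightly job of this kind typically writes its results to a SARIF file (for example with codeql database analyze and its --format=sarif-latest and --output options) and then decides whether to raise an alert. The sketch below shows one possible way to summarize such a file in Python. It reads only the runs, results, and ruleId fields of the SARIF format; the sample data, rule selection, and function names are made up for illustration.

```python
from collections import Counter


def count_results_by_rule(sarif: dict) -> Counter:
    """Count SARIF results per rule ID across all runs."""
    counts = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            counts[result.get("ruleId", "<unknown>")] += 1
    return counts


def should_alert(sarif: dict, blocking_rules: set) -> bool:
    """Alert when any result belongs to one of the blocking rules."""
    counts = count_results_by_rule(sarif)
    return any(counts[rule] > 0 for rule in blocking_rules)


# In a real job this dictionary would come from json.load() on the SARIF file.
SAMPLE = {
    "version": "2.1.0",
    "runs": [
        {
            "results": [
                {"ruleId": "js/reflected-xss", "message": {"text": "XSS in /sayHello"}},
                {"ruleId": "js/reflected-xss", "message": {"text": "XSS in /search"}},
                {"ruleId": "js/sql-injection", "message": {"text": "SQL injection in /api/data"}},
            ]
        }
    ],
}

print(count_results_by_rule(SAMPLE).most_common())
print(should_alert(SAMPLE, {"js/sql-injection"}))
```

Keeping the alerting rule in a small script like this makes the policy explicit and easy to review alongside the queries themselves.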

In the deployment stage, you can use CodeQL to help with incident response. If a vulnerability is discovered in your application, you can use CodeQL to investigate the root cause and to find any other instances of the same vulnerability in your codebase.

Writing Custom Queries for Your Codebase

While CodeQL comes with a set of standard queries for common vulnerabilities, you will get the most benefit from CodeQL by writing custom queries that are specific to your codebase. This can help you find vulnerabilities that are specific to the technologies and coding patterns you use.

For example, if your application uses a custom web framework, you could write a CodeQL query that finds instances where user input is not properly sanitized before being used in a SQL query. This can help you find potential SQL injection vulnerabilities that would not be caught by the standard queries.

You can also use custom queries to enforce secure coding standards. For example, you could write a query that finds instances where secure coding standards are not followed, such as using eval() in JavaScript, or not validating certificates in SSL connections.

Training Developers and Security Teams

By teaching your developers and security teams how to use CodeQL, you can enhance their understanding of security vulnerabilities and how to prevent them. This can help reduce the number of vulnerabilities that are introduced into your codebase.

You can offer training sessions where developers and security teams learn how to write and use CodeQL queries. You can also encourage them to contribute to the CodeQL community by sharing their queries and collaborating with others to improve the security of open-source software.
