Comment: per svoboda

Regular expressions (regex) are widely used to match strings of text, so they appear in many applications that search through text; the *NIX grep utility is a notable example. A programmer may want similar functionality for searching through log files, for instance.

Java's regular expression facilities are wide-ranging and powerful. When untrusted input is incorporated into a regular expression, an attacker can modify the original expression so that the resulting pattern matches far more widely than intended, possibly exposing much more information than the programmer meant to reveal.

One method of preventing this vulnerability is to parse out the sensitive information before matching and then run the user-supplied regex against the remaining text. However, if the log format changes without a corresponding change in the class, sensitive information may be exposed. Furthermore, depending on how well encapsulated the search keywords are, a malicious user may be able to retrieve a list of all the keywords; if there are many keywords, this may cause a denial of service.

The primary means of preventing this vulnerability is to sanitize any regular expression string that comes from untrusted input. One option is to whitelist certain characters (such as letters and digits) before passing the user-supplied string to the regex parser. Blacklisting individual operators is likely to be difficult because of the variability of the regex language.

Additionally, the programmer could look for ways to avoid building regular expressions from untrusted input, or provide only a very limited subset of regular expression functionality to the user.
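
One way to offer only a very limited subset of functionality is to treat the user's input as a literal prefix rather than as a regex. The following is a minimal sketch under that assumption, using the standard Pattern.quote() method; the class name LiteralSearch and the inlined sample log are hypothetical and exist only for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original example.
final class LiteralSearch {
    // Sample data in the same "keyword,number,number" layout as the log above.
    private static final String LOG =
        "Alice,1267773881,2147651408\n"
      + "Bono,1267774881,2147351708\n";

    static Set<String> suggestSearches(String search) {
        Set<String> searches = new LinkedHashSet<>();

        // Pattern.quote() makes every character of the input match literally,
        // so metacharacters such as ( ) | . * lose their special meaning.
        String regex = "^(" + Pattern.quote(search) + ".*),[0-9]+?,[0-9]+?$";
        Pattern keywordPattern = Pattern.compile(regex, Pattern.MULTILINE);

        Matcher logMatcher = keywordPattern.matcher(LOG);
        while (logMatcher.find()) {
            searches.add(logMatcher.group(1));
        }
        return searches;
    }
}
```

With this approach the user can no longer use any regex operators at all, which may or may not be acceptable depending on the application.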

...

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class ExploitableLog {
    private static final StringBuilder logBuffer = new StringBuilder();
    private static String log = logBuffer.toString();
    
    public static Set<String> suggestSearches(String search) {
        Set<String> searches = new HashSet<String>();
        
        // Construct regex from user string
        String regex = "^(" + search + ".*),[0-9]+?,[0-9]+?$";
        int flags = Pattern.MULTILINE;
        Pattern keywordPattern = Pattern.compile(regex, flags);
        
        // Match regex
        Matcher logMatcher = keywordPattern.matcher(log);
        while (logMatcher.find()) {
            String found = logMatcher.group(1);
            searches.add(found);
        }
        
        return searches;
    }
    
    private static void append(CharSequence str) {
        logBuffer.append(str);
        log = logBuffer.toString(); //update log string on append
    }

    static {
        // In practice this data would come from a file; it is inlined here
        // as a string for illustrative purposes
        append("Alice,1267773881,2147651408\n");
        append("Bono,1267774881,2147351708\n");
        append("Charles,1267775881,1175523058\n");
        append("Cecilia,1267773222,291232332\n");
    }
}
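
To illustrate how such a pattern can be widened, the sketch below rebuilds the same regex construction in a small standalone class. The class name InjectionDemo and the hostile input `.*)|(` are assumptions chosen for illustration; they do not come from the original text.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: mirrors the regex construction used in suggestSearches().
final class InjectionDemo {
    static Set<String> match(String search, String log) {
        Set<String> found = new LinkedHashSet<>();

        // The untrusted string is concatenated directly into the pattern
        String regex = "^(" + search + ".*),[0-9]+?,[0-9]+?$";
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(log);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }
}
```

A benign input such as `A` yields only the keyword `Alice`. The input `.*)|(` closes the capturing group early and adds an alternative, turning the pattern into `^(.*)|(.*),[0-9]+?,[0-9]+?$`, so group 1 captures each log line in full, including the numeric fields.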

Compliant Solution

One partially compliant solution is to parse out the sensitive information before matching and then run the user-supplied regex against the remaining text. However, if the log format changes without a corresponding change in the class, sensitive information may be exposed. Furthermore, depending on how well encapsulated the search keywords are, a malicious user may be able to retrieve a list of all the keywords; if there are many keywords, this may cause a denial of service.
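
The approach just described might look roughly like the following sketch. It assumes each log line has the form keyword,number,number and that only the first field is safe to expose; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch: strip sensitive fields first, then match.
final class KeywordOnlySearch {
    // Assumption: the keyword is everything before the first comma.
    static List<String> extractKeywords(String log) {
        List<String> keywords = new ArrayList<>();
        for (String line : log.split("\n")) {
            int comma = line.indexOf(',');
            if (comma > 0) {
                keywords.add(line.substring(0, comma));
            }
        }
        return keywords;
    }

    static Set<String> suggestSearches(String search, String log) {
        Set<String> searches = new LinkedHashSet<>();
        // The user regex is still compiled as-is; a malformed expression
        // throws PatternSyntaxException, which the caller must handle.
        Pattern userPattern = Pattern.compile(search);
        for (String keyword : extractKeywords(log)) {
            // lookingAt() requires the match to start at the beginning
            if (userPattern.matcher(keyword).lookingAt()) {
                searches.add(keyword);
            }
        }
        return searches;
    }
}
```

The numeric fields never reach the matcher, but an input such as `.*` still returns every keyword, which is the weakness noted above.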

...

This compliant solution filters non-alphanumeric characters out of the search string using Java's Character.isLetterOrDigit(). Doing so removes the grouping parentheses and the OR operator, which are what make the injection possible.
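
A minimal sketch of such a filter follows. The class name SearchSanitizer is hypothetical, and the sketch keeps only letters and digits; a real application might deliberately allow a few more characters.

```java
// Hypothetical helper illustrating whitelist filtering of a search string.
final class SearchSanitizer {
    static String filter(String search) {
        StringBuilder sb = new StringBuilder(search.length());
        for (int i = 0; i < search.length(); i++) {
            char ch = search.charAt(i);
            // Keep only letters and digits; drop every other character,
            // including regex metacharacters such as ( ) | . * ?
            if (Character.isLetterOrDigit(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

The filtered string would then be used to build the regex in place of the raw input. For example, `Al(i|c)e` is reduced to `Alice`, and `.*)|(` is reduced to the empty string.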

...