Comment: intro changes for now
Regular expressions (regexes) are commonly used to match strings of text, so they appear in applications that must search through text. A notable example is the POSIX grep utility. A programmer may want this kind of functionality for searching through log files, for example.

Java's regular expression facilities are wide ranging and powerful. That power can be misused: an unwanted modification of the original regular expression string can form a pattern that matches too widely, so that far too much information is matched or matches occur when they are not expected.

The primary means of preventing this vulnerability is to sanitize any regular expression string that comes from untrusted input. Whitelisting certain characters (such as letters and digits) before passing the user-supplied string to the regex parser is a common strategy. Blacklisting individual operators is difficult because of the variability of the regex language, so whitelisting is preferred over blacklisting.

Additionally, the programmer could avoid building regular expressions from untrusted input altogether, or provide only a very limited subset of regular expression functionality to the user.

Constructs and properties of Java regular expressions to watch out for include:

Regular expressions are widely used to match strings of text. For example, the POSIX {{grep}} utility supports regular expressions for finding patterns in the specified text. For introductory information on regular expressions, see the Java Tutorials \[[Tutorials 08|AA. Java References#Tutorials 08]\]. The {{java.util.regex}} package provides the {{Pattern}} class, which encapsulates a compiled representation of a regular expression, and the {{Matcher}} class, an engine that interprets a {{Pattern}} and uses it to perform matching operations on a {{CharSequence}}.
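As a minimal sketch of how the two classes cooperate, the following example compiles a pattern once and uses a matcher to collect every run of digits in a string. The class name, the helper name, and the sample input are illustrative assumptions, not part of the original guideline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBasics {
    // Returns every run of digits found in the input, in order of appearance.
    static List<String> findNumbers(CharSequence input) {
        Pattern pattern = Pattern.compile("\\d+");  // compiled representation of the regex
        Matcher matcher = pattern.matcher(input);   // matching engine bound to this input
        List<String> found = new ArrayList<>();
        while (matcher.find()) {                    // advance to the next match, if any
            found.add(matcher.group());             // text of the current match
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findNumbers("error 404 at line 17")); // prints [404, 17]
    }
}
```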

The powerful regular expression (regex) facilities must be protected from misuse. An attacker may supply malicious input that modifies the original regular expression so that the regex no longer complies with the program's specification. This attack vector, referred to as regex injection, might affect control flow, cause information leaks, or result in denial-of-service (DoS) vulnerabilities.

Certain constructs and properties of Java regular expressions are susceptible to exploitation:

  • Matching flags: Untrusted input may use matching flags in non-capturing groups to override matching options that may or may not have been passed to the Pattern.compile() method.
  • Greediness: An untrusted input may attempt to inject a regex that changes the original regex so that it tries to match as much of the string as possible, exposing sensitive information.
  • Grouping: The programmer can enclose parts of a regular expression in parentheses to perform some common action on the group. An attacker may be able to change the groupings by supplying untrusted input, leading to the security weaknesses described earlier.
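The first two constructs can be observed with a short sketch. The sample input string and the helper `firstMatch` are hypothetical and exist only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexConstructs {
    // Returns the first match of regex in input, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String data = "key=abc;SECRET=xyz;"; // hypothetical input

        // Matching flags: the i flag inside a non-capturing group enables
        // case-insensitive matching although no flag was given to Pattern.compile().
        System.out.println(firstMatch("secret", data));       // prints null
        System.out.println(firstMatch("(?i:secret)", data));  // prints SECRET

        // Greediness: the greedy quantifier runs to the last ';',
        // while the reluctant one stops at the first ';'.
        System.out.println(firstMatch("key=.*;", data));      // prints key=abc;SECRET=xyz;
        System.out.println(firstMatch("key=.*?;", data));     // prints key=abc;
    }
}
```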

Untrusted input should be sanitized before use to prevent regex injection. When the user must specify a regex as input, care must be taken to ensure that the original regex cannot be modified without restriction. Whitelisting characters (such as letters and digits) before delivering the user-supplied string to the regex parser is a good input validation strategy. However, when the user is allowed to enter regexes, the whitelist may need to permit certain dangerous characters. Such inputs should not be used to build a security-sensitive dynamic regex. A programmer must provide only a very limited subset of regular expression functionality to the user to minimize any chance of misuse.

Noncompliant Code Example

...

which grabs the entire log line rather than just the old keywords. The first closing parenthesis of the malicious search string defeats the grouping protection, and the OR operator then allows injection of an arbitrary regex. The resulting regex reveals all IPs and timestamps of past searches.
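The mechanism can be reproduced with a small sketch. It is not the elided example: the log line format (timestamp, IP address, keyword), the regex, and the method `searchKeyword` are all assumptions chosen to show how a closing parenthesis followed by the OR operator escapes the intended group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexInjectionDemo {
    // Hypothetical search routine: intended to return only the keyword at the
    // end of a log line, provided the keyword starts with the search text.
    static String searchKeyword(String logLine, String search) {
        // Vulnerable: untrusted text is concatenated directly into the regex.
        Pattern p = Pattern.compile("(" + search + "\\w*)$");
        Matcher m = p.matcher(logLine);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String logLine = "2009-06-01T10:47:03 10.0.0.1 regex"; // assumed format

        // Benign search: only the keyword is returned.
        System.out.println(searchKeyword(logLine, "reg"));      // prints regex

        // Malicious search: the regex becomes (.*)|(.*\w*)$ and the first
        // alternative matches the whole line, timestamp and IP included.
        System.out.println(searchKeyword(logLine, ".*)|(.*"));
    }
}
```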

Compliant Solution

One method of preventing this vulnerability is to filter out the sensitive information before matching and then run the user-supplied regex against the remaining non-sensitive information. However, if the log format changes without a corresponding change in the class, sensitive information may be exposed. Furthermore, depending on how well encapsulated the search keywords are, a malicious user may be able to retrieve a list of all the keywords. (If there are many keywords, this may cause a denial of service.)

This compliant solution filters out non-alphanumeric characters from the search string using Java's Character.isLetterOrDigit(). The filter removes the grouping parentheses and the OR operator that trigger the injection.
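A minimal sketch of such a filter is shown here. It is an illustration under assumptions rather than the solution's actual code: the class and method names are hypothetical, and only the use of Character.isLetterOrDigit() comes from the description above.

```java
class SearchSanitizer {
    // Keeps only letters and digits, so parentheses, '|', '.', '*' and every
    // other regex metacharacter are dropped before the string reaches the parser.
    static String filterAlphanumeric(String search) {
        StringBuilder sb = new StringBuilder(search.length());
        for (int i = 0; i < search.length(); i++) {
            char ch = search.charAt(i);
            if (Character.isLetterOrDigit(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(filterAlphanumeric("abc)|(.*")); // prints abc
    }
}
```

Filtering changes what the user searched for instead of rejecting the request, so a caller may prefer to report an error when the filtered string differs from the original input.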

...