Constrain, Reject, and Sanitize Input

The preferred approach to validating input is to constrain what you allow from the beginning. It is much easier to validate data against known valid types, patterns, and ranges than it is to validate data by looking for known bad characters. When you design your application, you know what your application expects. The range of valid data is generally a much smaller and better-defined set than the range of potentially malicious input. However, for defense in depth you may also want to reject known bad input and then sanitize the input.

To create an effective input validation strategy, be aware of the following approaches and their tradeoffs:

Constrain input.
Validate data for type, length, format, and range.
Reject known bad input.
Sanitize input.

Constrain Input

Constraining input is about allowing good data. This is the preferred approach. The idea here is to define a filter of acceptable input by using type, length, format, and range. Define what is acceptable input for your application fields and enforce it. Reject everything else as bad data.

Constraining input may involve setting character sets on the server so that you can establish the canonical form of the input in a localized way.
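As a minimal sketch of why canonical form matters, the following reduces input to one representation before the allow filter runs. It assumes Unicode NFKC normalization as the canonicalization step, which is only one possible choice and is not prescribed by the text above. The user-name field, the pattern, and the function names are hypothetical.

```python
import re
import unicodedata

# Hypothetical allow filter for a user-name field:
# ASCII letters and digits only, 1 to 20 characters.
USERNAME_PATTERN = re.compile(r"[A-Za-z0-9]{1,20}")

def canonicalize(raw: str) -> str:
    """Reduce equivalent representations to a single form before validating."""
    # Assumption: NFKC is the chosen canonical form for this application.
    return unicodedata.normalize("NFKC", raw).strip()

def constrain_username(raw: str) -> str:
    """Return the canonical value, or raise ValueError if it is not allowed."""
    value = canonicalize(raw)
    if not USERNAME_PATTERN.fullmatch(value):
        raise ValueError("user name contains characters outside the allowed set")
    return value
```

Validating after canonicalization means the filter sees the same value that later code will process, so a full-width spelling of a name and its plain ASCII spelling are judged identically.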

Validate Data for Type, Length, Format, and Range

Use strong type checking on input data wherever possible, both in the classes used to manipulate and process the input data and in data access routines. For example, use parameterized stored procedures for data access to benefit from strong type checking of input fields.
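The sketch below illustrates the same idea with a parameterized query rather than a stored procedure, because Python's built-in sqlite3 module does not offer stored procedures. The orders table, its columns, and the find_orders function are hypothetical.

```python
import sqlite3

def find_orders(conn: sqlite3.Connection, customer_id_raw: str) -> list:
    """Look up orders for a customer ID supplied as untrusted text."""
    # Type check: int() raises ValueError for text that does not parse as an integer.
    customer_id = int(customer_id_raw)
    # The ? placeholder binds the value as data; it is never spliced into the SQL text.
    cursor = conn.execute(
        "SELECT id, item FROM orders WHERE customer_id = ?", (customer_id,)
    )
    return cursor.fetchall()

# Hypothetical in-memory table used only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT)"
)
conn.execute("INSERT INTO orders (customer_id, item) VALUES (?, ?)", (7, "widget"))
```

With this shape, an input such as "7 OR 1=1" fails the type check before any query runs, and a value that does pass is still bound as a parameter.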

String fields should also be length checked and in many cases checked for appropriate format. For example, ZIP codes, Social Security numbers, and so on have well-defined formats that can be validated using regular expressions. Thorough checking is not only good programming practice; it also makes it more difficult for an attacker to exploit your code. The attacker may get through your type check, but the length check may make executing a favorite attack more difficult.
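A length and format check for a ZIP code might look like the following sketch. It assumes the five-digit form with an optional four-digit extension; the constant and function names are hypothetical.

```python
import re

# Assumed format: five digits, optionally followed by a hyphen and four digits.
ZIP_PATTERN = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")
MAX_ZIP_LENGTH = 10  # longest value the pattern can match

def validate_zip(raw: str) -> str:
    """Return the ZIP code unchanged, or raise ValueError if it is not valid."""
    # Length check first, so oversized input is rejected cheaply.
    if len(raw) > MAX_ZIP_LENGTH:
        raise ValueError("ZIP code is too long")
    # fullmatch requires the whole string to fit the pattern, not just a prefix.
    if not ZIP_PATTERN.fullmatch(raw):
        raise ValueError("ZIP code is not in the expected format")
    return raw
```

The character class [0-9] is used instead of \d because \d also matches non-ASCII digits in Python string patterns, which would widen the allowed set.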

Reject Known Bad Input

Deny "bad" data, but do not rely completely on this approach. It is generally less effective than the "allow" approach described earlier, and it is best used in combination with it. Denying bad data assumes that your application knows all the variations of malicious input. Remember that there are multiple ways to represent characters. This is another reason why "allow" is the preferred approach.

While useful for applications that are already deployed and when you cannot afford to make significant changes, the "deny" approach is not as robust as the "allow" approach because bad data, such as patterns that can be used to identify common attacks, do not remain constant. Valid data remains constant while the range of bad data may change over time.
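The weakness of the "deny" approach can be seen in a small sketch. The deny list below is hypothetical and deliberately incomplete, and URL encoding is only one of many alternative representations an attacker might use.

```python
from urllib.parse import unquote

# Hypothetical deny list; any such list is necessarily incomplete.
DENY_LIST = ("<script", "javascript:")

def passes_deny_list(raw: str) -> bool:
    """Return True when none of the known bad patterns appear in the input."""
    lowered = raw.lower()
    return not any(bad in lowered for bad in DENY_LIST)

plain = "<script>alert(1)</script>"
encoded = "%3Cscript%3Ealert(1)%3C/script%3E"  # the same payload, URL-encoded

plain_passes = passes_deny_list(plain)      # caught by the list
encoded_passes = passes_deny_list(encoded)  # slips past the list
decodes_to_plain = unquote(encoded) == plain
```

The encoded form passes the filter even though it decodes to the rejected payload, so any later component that decodes the value receives the script intact.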

Sanitize Input

Sanitizing is about making potentially malicious data safe. It can be helpful when the range of input that is allowed cannot guarantee that the input is safe. This includes anything from stripping a null from the end of a user-supplied string to escaping out values so they are treated as literals.

Another common example of sanitizing input in Web applications is using URL encoding or HTML encoding to wrap data and treat it as literal text rather than executable script. HtmlEncode methods escape out HTML characters, and UrlEncode methods encode a URL so that it is a valid URI request.
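As an illustration, Python's standard library provides html.escape and urllib.parse.quote, which are used here as stand-ins for the HtmlEncode and UrlEncode methods mentioned above. The render_comment and search_link functions and the /search path are hypothetical.

```python
import html
from urllib.parse import quote

def render_comment(comment: str) -> str:
    """Wrap untrusted text in a paragraph, encoded so it displays as literal text."""
    # html.escape converts &, <, >, and both quote characters to entities.
    return "<p>" + html.escape(comment) + "</p>"

def search_link(term: str) -> str:
    """Build a query string in which the untrusted term cannot add parameters."""
    # safe="" percent-encodes every reserved character, including / and &.
    return "/search?q=" + quote(term, safe="")
```

Encoding is applied at the point of output, so the stored value stays unchanged and each output context can apply the encoding it needs.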

In Practice

The following are examples applied to common input fields, using the preceding approaches:

Last Name field. This is a good example where constraining input is appropriate. In this case, you might allow string data in the range ASCII A-Z and a-z, and also hyphens and curly apostrophes (curly apostrophes have no significance to SQL) to handle names such as O'Dell. You would also limit the length to your longest expected value.
Quantity field. This is another case where constraining input works well. In this example, you might use a simple type and range restriction. For example, the input data may need to be an integer between 0 and 1000.
Free-text field. Examples include comment fields on discussion boards. In this case, you might allow letters and spaces, and also common characters such as apostrophes, commas, and hyphens. The set that is allowed does not include less than and greater than signs, brackets, and braces.
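The Last Name and Quantity rules above could be sketched as follows. The 40-character limit and the function names are assumptions; the allowed character set and the 0 to 1000 range follow the descriptions in the list.

```python
import re

# ASCII letters, hyphens, and the curly apostrophe (U+2019).
# The 40-character cap is a hypothetical "longest expected value".
LAST_NAME_PATTERN = re.compile(r"[A-Za-z\u2019-]{1,40}")

def validate_last_name(raw: str) -> str:
    """Return the name unchanged, or raise ValueError if it is not allowed."""
    if not LAST_NAME_PATTERN.fullmatch(raw):
        raise ValueError("last name contains characters outside the allowed set")
    return raw

def validate_quantity(raw: str) -> int:
    """Return the quantity as an int, or raise ValueError if it is not allowed."""
    # Type restriction: one to four ASCII digits, with no sign or whitespace.
    if not re.fullmatch(r"[0-9]{1,4}", raw):
        raise ValueError("quantity must be a whole number")
    value = int(raw)
    # Range restriction: 0 through 1000 inclusive.
    if not 0 <= value <= 1000:
        raise ValueError("quantity must be between 0 and 1000")
    return value
```

Both functions reject everything that does not match, so a straight apostrophe or a negative sign never reaches the rest of the application.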

Some applications might allow users to mark up their text using a finite set of script characters, such as bold "<b>", italic "<i>", or even include a link to their favorite URL. In the case of a URL, your validation should encode the value so that it is treated as a URL.
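One way to permit a small set of tags is to encode everything first and then restore only the exact tags you allow, as in the sketch below. The allowed tag list, the function names, and the restriction of links to http and https are assumptions made for illustration.

```python
import html
from urllib.parse import urlsplit

ALLOWED_TAGS = ("b", "i")            # hypothetical set of permitted tags
ALLOWED_SCHEMES = ("http", "https")  # assumption: only Web links are accepted

def render_limited_markup(text: str) -> str:
    """Encode all markup, then restore only bare <b> and <i> tags."""
    escaped = html.escape(text)
    for tag in ALLOWED_TAGS:
        # Only the exact attribute-free forms are restored.
        escaped = escaped.replace(f"&lt;{tag}&gt;", f"<{tag}>")
        escaped = escaped.replace(f"&lt;/{tag}&gt;", f"</{tag}>")
    return escaped

def render_link(url: str) -> str:
    """Return an anchor element for a user-supplied URL, or raise ValueError."""
    if urlsplit(url).scheme.lower() not in ALLOWED_SCHEMES:
        raise ValueError("URL scheme is not allowed")
    # Encode the value so it stays inside the href attribute.
    return '<a href="' + html.escape(url, quote=True) + '">link</a>'
```

This sketch does not check that restored tags are balanced; a production filter would need to handle unclosed tags as well.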

In an ideal scenario, an application checks for acceptable input for each field or entry point. However, if you have an existing Web application that does not validate user input, you need a stopgap approach to mitigate risk until you can improve your application's input validation strategy. Neither of the following approaches ensures safe handling of input, because that depends on where the input comes from and how it is used in your application. Both are nevertheless in use today as quick fixes for short-term security improvement:

HTML-encoding and URL-encoding user input when writing back to the client. In this case, the assumption is that no input is treated as HTML and all output is written back in a protected form. This is sanitization in action.
Rejecting malicious script characters. This is a case of rejecting known bad input. In this case, a configurable set of malicious characters is used to reject the input. As described earlier, the problem with this approach is that bad data is a matter of context.