What to Do
Applications should assume that all of their input is malicious, and take action accordingly. Input should be validated and either rejected or sanitized immediately, carefully quarantined during use, and encoded appropriately on output.
Why
Malicious input is the single largest cause of vulnerabilities in web applications and, in the most general sense, is the root cause of almost every issue. The only way to ensure safety is a defense-in-depth, default-deny policy that starts from the fundamental supposition that all input is malicious until proven otherwise. For example, if you call an external Web service that returns strings, how do you know that malicious commands are not present? Likewise, if several applications write to a shared database, how do you know that the data you read from it is safe?
When
All applications should assume that all their input is malicious.
How
Getting input validation right is tricky; there's a reason that it's the number one security problem for web applications. Approached systematically, however, it is not a hard problem to solve. Follow these steps:
1. Determine all inputs
The first step is to determine all the things in the application which can be controlled by the user. There are some surprises here -- a lot of the variables in a normal HTTP server environment are actually taken from the user's request, so make sure you know exactly where everything is coming from. It's a good idea to leave a brief comment in the code where the input comes in mentioning where it comes from (if it isn't obvious from context), the expected format, and where it's validated (again, if it isn't obvious).
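As a rough illustration of this inventory step, the sketch below separates the client-controlled entries of a WSGI-style request environment from the server-controlled ones, and shows the kind of source/format/validation comment described above. The helper names (`is_client_controlled`, `list_inputs`) and the sample environment are hypothetical, and the key list is illustrative rather than exhaustive.

```python
# Hypothetical helper: find the entries of a WSGI-style environ dict that
# are taken from the user's request. The key list is illustrative only.

# Set by the client on every request (request line and body metadata).
CLIENT_CONTROLLED = {
    "REQUEST_METHOD", "PATH_INFO", "QUERY_STRING",
    "CONTENT_TYPE", "CONTENT_LENGTH",
}


def is_client_controlled(key: str) -> bool:
    """Return True if the environ value comes from the user's request."""
    # Every HTTP_* variable is a request header, including HTTP_HOST,
    # HTTP_REFERER, HTTP_COOKIE and HTTP_USER_AGENT.
    return key in CLIENT_CONTROLLED or key.startswith("HTTP_")


def list_inputs(environ: dict) -> list:
    """Return the sorted names of all client-controlled environ entries."""
    return sorted(k for k in environ if is_client_controlled(k))


# Sample environment, made up for this example.
environ = {
    "REQUEST_METHOD": "GET",
    "QUERY_STRING": "zip=12345",
    "HTTP_HOST": "example.test",
    "HTTP_USER_AGENT": "anything the client likes",
    "SERVER_SOFTWARE": "demo/1.0",  # set by the server, not the client
}

# Source: raw query string from the request. Expected format: "zip=" followed
# by 5 or 9 digits. Validated in: the central validation module (assumed to
# exist elsewhere) before any use.
raw_query = environ["QUERY_STRING"]

print(list_inputs(environ))
```

Note that `HTTP_HOST` shows up in the list: it looks like server configuration but is copied from a request header.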
2. Determine all trusted data stores
Every application has at least one data store, and usually several. It's important to know when a data store can be trusted. The guideline here is simple: if the system in question is the only thing that writes to the data store, then you can rely on the semantics enforced by your input validation routines to apply to all data found in the store. If other applications write to the data store, then you can't. While it is possible to check the semantics of every validation routine in every other system that accesses the data store, it's simpler and safer to assume that the data store is untrusted, and treat it as a potential source of malicious data, validating all input from it as you would any other input.
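A minimal sketch of treating a shared store as untrusted, assuming an in-memory SQLite table named `addresses` that some other application also writes to. The table, the column, and the `read_valid_zips` helper are all invented for this example; the point is only that rows read back go through the same validator as any other input.

```python
import re
import sqlite3

# Strict format: exactly 5 or exactly 9 ASCII digits.
ZIP_RE = re.compile(r"[0-9]{5}(?:[0-9]{4})?")


def is_valid_zip(value) -> bool:
    """Return True only for a string of exactly 5 or 9 ASCII digits."""
    return isinstance(value, str) and ZIP_RE.fullmatch(value) is not None


def read_valid_zips(conn):
    """Read zip codes from the shared table, re-validating every row."""
    valid, rejected = [], []
    for (value,) in conn.execute("SELECT zip FROM addresses ORDER BY id"):
        (valid if is_valid_zip(value) else rejected).append(value)
    return valid, rejected


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (id INTEGER PRIMARY KEY, zip TEXT)")
# Simulate another application writing unvalidated data to the same store.
conn.executemany(
    "INSERT INTO addresses (zip) VALUES (?)",
    [("12345",), ("<script>alert(1)</script>",), ("123456789",)],
)

valid, rejected = read_valid_zips(conn)
print("valid:", valid)
print("rejected:", rejected)
```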
3. Determine all crossover points
Crossover points are one of the places where malicious input becomes a bug. They're not necessarily places where output occurs; in large applications they'll often occur many layers further in than that. A crossover point is anywhere that user input is included textually in some larger body of command text, or where a security-relevant decision is made based on it. A good example of a crossover point is a dynamic SQL query. The risk here is of the user input crossing over into the associated command data, allowing an attacker to execute commands. XPath and other XML injections are another example. The worst case is when user input is evaluated by a language's built-in "eval" command or something similar -- these commands should never be used, even with values that look safe, because of the risks involved.
Once the crossover points are found, all inputs should be traced back to make sure that they've been validated appropriately beforehand, and a comment again stating the format, source, and validation point should be made. All crossover points have, depending on the technology involved, different sets of safe characters. Using the allowlist approach described below, the safe set of characters for that crossover point should be compared against what the validator will allow through; the allowed characters must be a subset of the safe ones.
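The subset requirement can be checked mechanically. The sketch below assumes a hypothetical validator allowlist and two made-up crossover points, `crossover_a` and `crossover_b`; the character sets are placeholders and would have to be derived from the documentation of the technology actually involved.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)

# Hypothetical allowlist enforced by a validator for one input field.
VALIDATOR_ALLOWED = ALNUM | {"_", "-"}

# Placeholder safe sets for two invented crossover points. Real sets depend
# on the technology at each crossover and are NOT given by this example.
SAFE_AT_CROSSOVER = {
    "crossover_a": ALNUM | {"_", "-", " ", "."},
    "crossover_b": ALNUM | {"_"},
}


def unsafe_characters(allowed: set, safe: set) -> set:
    """Return characters the validator allows but the crossover does not."""
    return allowed - safe


for name, safe in SAFE_AT_CROSSOVER.items():
    leaked = unsafe_characters(VALIDATOR_ALLOWED, safe)
    if leaked:
        print(name, "validator too permissive:", sorted(leaked))
    else:
        print(name, "ok: allowed characters are a subset of the safe set")
```

An empty result means the validator's allowlist is a subset of the safe set for that crossover point; anything else names the characters that need to be removed from the allowlist or encoded before use.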
Whenever possible, steps should be taken to remove crossover points entirely. Switching from dynamic SQL to stored procedures with bound parameters removes an entire category of crossover points from the system, and greatly reduces exposure to an entire class of attacks. Similar things can be done with other types of crossovers.
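To make the difference concrete, here is a small sketch using Python's `sqlite3` module. SQLite has no stored procedures, so a parameterized query with a bound `?` placeholder stands in for the bound-parameter technique; the `users` table and both function names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "user")],
)


def find_roles_unsafe(conn, name):
    """Dynamic SQL: a crossover point. Shown only as the thing to avoid."""
    # User input is pasted into the command text, so it can change the query.
    query = "SELECT role FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def find_roles(conn, name):
    """Bound parameter: the input is passed as data, never as command text."""
    rows = conn.execute("SELECT role FROM users WHERE name = ?", (name,))
    return [row[0] for row in rows]


attack = "nobody' OR '1'='1"

# The dynamic query is rewritten by the input and matches every row.
print("dynamic SQL:", sorted(find_roles_unsafe(conn, attack)))
# The bound parameter is compared literally, so nothing matches.
print("bound parameter:", find_roles(conn, attack))
```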
4. Determine all outputs
The last point of concern is the list of outputs from the system. This will likely have a certain amount of overlap with the list of crossover points, which is fine. Again, we need to determine the allowable format for each output, and look at where the incoming data is being validated. If there's any question of whether the data may contain dangerous characters, it should be encoded in a manner appropriate to the specific output. There are more output contexts than one might think; the contents of HTML attributes, the tags themselves, free text between the tags, and JavaScript strings all have different safe sets of characters (and a different encoding, in the last case). Comments on the input source, format, validation point, and encoding point are also useful here.
5. Build a centralized validation module
One of the biggest dangers of implementing input validation is inconsistent validation; an attack may be caught on one data path, but not on another. An attacker will try all of them, however. The way to solve this problem is to have a single point of responsibility for input validation. Where this is depends on the design. If every piece of input is an object, then it may be appropriate to have the object's constructors and setters perform the validation for that object's input. In a less strictly OO system, a single module with methods for each different input format may be more appropriate.
Whichever method is chosen, the input validation routine for a specific data type should be as strict as possible. For example, when validating a US zip code, allow either 5 or 9 digits, and nothing else. If you're dealing with international postal codes, either validate them separately with a looser format that also allows letters, or, if you need to ensure a higher level of integrity, build a more complex validator that understands the postal codes of each nation.
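Both designs can be sketched in a few lines. The module-level functions below are the "single module" variant and the `ZipCode` class is the object variant, where the constructor performs the validation. All names are hypothetical, and the looser international pattern (2 to 10 letters, digits, spaces, or hyphens) is an assumption made for the example, not a standard.

```python
import re


class ValidationError(ValueError):
    """Raised when input does not match its expected format."""


# Strict: exactly 5 or exactly 9 ASCII digits, and nothing else.
_US_ZIP = re.compile(r"[0-9]{5}(?:[0-9]{4})?")
# Looser, assumed format for international codes: starts with a letter or
# digit, 2 to 10 characters in total, letters, digits, spaces, hyphens.
_INTL_POSTAL = re.compile(r"[A-Za-z0-9][A-Za-z0-9 -]{1,9}")


def validate_us_zip(value) -> str:
    """Return value unchanged if it is a valid US zip code, else raise."""
    # fullmatch rejects leading or trailing junk, including a newline.
    if not isinstance(value, str) or _US_ZIP.fullmatch(value) is None:
        raise ValidationError("expected 5 or 9 digits")
    return value


def validate_postal_code(value) -> str:
    """Return value unchanged if it fits the looser international format."""
    if not isinstance(value, str) or _INTL_POSTAL.fullmatch(value) is None:
        raise ValidationError("expected 2-10 letters, digits, spaces, hyphens")
    return value


class ZipCode:
    """Object variant: an instance cannot exist unless its input validated."""

    def __init__(self, value):
        self._value = validate_us_zip(value)

    @property
    def value(self) -> str:
        return self._value


print(ZipCode("12345").value)
print(validate_postal_code("SW1A 1AA"))
```

Because every caller goes through the same functions, tightening a format later means changing it in one place.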
6. Build a centralized encoding module
In an ideal world, all encoding would be done by the same libraries that are used to create output. While many HTML control libraries attempt this, none of them take the allowlist approach. Instead, they try to guess which characters might be harmful, and such a list is inevitably incomplete. Unless you want to build a new output library (which might be an option on a large enough application), you should build a set of data encoders, one for each output context you have. These encoders should be used as close as possible to the actual point of output; this minimizes the chance of an alternate data path skipping the encoding, and ensures that the developer knows exactly what context the output is being used in. Avoid the temptation to store encoded data, because even if it is initially only used in the context you encoded it for, this may change over time.
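One possible shape for such a module, shown as a sketch rather than a vetted library: each encoder passes through only ASCII letters and digits and encodes every other character for its context. The function names, the choice of safe set, and the `render_greeting` template are assumptions made for the example.

```python
import string

# Allowlist: only these characters pass through unencoded in any context.
_SAFE = frozenset(string.ascii_letters + string.digits)


def encode_html(value: str) -> str:
    """Encode for HTML text or for a quoted HTML attribute value."""
    # Everything outside the allowlist becomes a numeric character reference.
    return "".join(
        ch if ch in _SAFE else "&#x{:X};".format(ord(ch)) for ch in value
    )


def encode_js_string(value: str) -> str:
    """Encode for use inside a quoted JavaScript string literal."""
    out = []
    for ch in value:
        if ch in _SAFE:
            out.append(ch)
            continue
        # Emit UTF-16 code units, so characters outside the Basic
        # Multilingual Plane become a surrogate pair of \uXXXX escapes.
        units = ch.encode("utf-16-be", "surrogatepass")
        for i in range(0, len(units), 2):
            out.append("\\u{:02X}{:02X}".format(units[i], units[i + 1]))
    return "".join(out)


def render_greeting(name: str) -> str:
    """Encode at the point of output, once per context the value lands in."""
    return '<p title="{0}">Hello {0}</p><script>var n = "{1}";</script>'.format(
        encode_html(name), encode_js_string(name)
    )


# The raw value is what gets stored; encoding happens only when rendering.
raw_name = "<img src=x onerror=alert(1)>"
print(render_greeting(raw_name))
```

Encoding every character outside the allowlist produces verbose output, but it avoids having to guess which characters are dangerous in each context.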
7. Ensure that all paths through the system preserve validation
Once the validation system is complete, all the paths that data takes through the system should be checked to ensure that they preserve the validation properties that are expected. Input which is sent round-trip through a client or another system must be re-validated, unless a cryptographic signature is used to ensure that it has not been tampered with. Validation which occurs on an untrusted system must also be repeated. Client-side validation in JavaScript is a nice UI touch, but as a security measure it is trivially circumvented.
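The tamper check for round-tripped values can be sketched with Python's standard `hmac` module. Here an HMAC-SHA256 tag (a keyed message authentication code) stands in for the cryptographic signature; the token layout, the function names, and the hard-coded demo key are assumptions for the example, and a real key would come from a secret store.

```python
import hashlib
import hmac

# Assumption: demo key only. A real key is loaded from a secret store and
# never appears in source code.
SECRET_KEY = b"demo-key-not-for-production"


def _mac(value: str) -> str:
    """Return the hex HMAC-SHA256 tag for value under SECRET_KEY."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def sign(value: str) -> str:
    """Return a token of the form '<hex tag>:<value>' to send to the client."""
    return _mac(value) + ":" + value


def verify(token: str):
    """Return the original value, or None if the token fails the check."""
    supplied, sep, value = token.partition(":")
    if not sep:
        return None
    try:
        expected = _mac(value).encode("ascii")
        supplied_bytes = supplied.encode("ascii")
    except UnicodeEncodeError:
        return None
    # compare_digest avoids leaking the match position through timing.
    if hmac.compare_digest(supplied_bytes, expected):
        return value
    return None


token = sign("price=10")
tampered = token.replace("price=10", "price=1")

print(verify(token))     # the untouched token yields the original value
print(verify(tampered))  # a modified value no longer matches its tag
```

A value that fails `verify` should be handled like any other untrusted input: rejected or sent back through full validation.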