Generally, the term input handling is used to describe functions like validation, sanitization, filtering, encoding and/or decoding of input data. Applications receive input from various sources including human users, software agents (browsers), and network/peripheral devices, to name a few. In the case of web applications, input can be transferred in various formats (name-value pairs, JSON, SOAP, etc.) and obtained via URL query strings, POST data, HTTP headers, cookies, etc. Non-web application input can be obtained via application variables, environment variables, the registry, configuration files, etc. Regardless of the data format or source/location of the input, all input should be considered untrusted and potentially malicious. Applications which process untrusted input may become vulnerable to attacks such as Buffer Overflows, SQL Injection, OS Commanding and Denial of Service, to name a few.
Improper Input Validation
One of the key aspects of input handling is validating that the input satisfies certain criteria. For proper validation, it is important to identify the form and type of data that is acceptable and expected by the application. Defining an expected format and usage of each instance of untrusted input is required to accurately define restrictions.
Validation can include checks for type safety and correct syntax. String input can be checked for length (minimum and maximum number of characters) and character set, while numeric input types like integers and decimals can be validated against acceptable upper and lower bounds. When combining input from multiple sources, validation should be performed on the concatenated result and not just against the individual data elements. This practice helps avoid situations where validation succeeds on individual data items but would fail on the combined set from all the sources.
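The checks described above can be sketched as follows. This is a minimal illustration, not a complete validator: the function names, the allowed character set, and every limit (3 to 20 characters, 1 to 100, 32 characters combined) are assumptions chosen for the example.

```python
import re

USERNAME_CHARS = re.compile(r"[A-Za-z0-9_]+")  # hypothetical allowed character set


def validate_username(value, min_len=3, max_len=20):
    """Check type, length bounds, and character set of a string input."""
    if not isinstance(value, str):
        return False
    if not (min_len <= len(value) <= max_len):
        return False
    return USERNAME_CHARS.fullmatch(value) is not None


def validate_quantity(value, low=1, high=100):
    """Parse a numeric input and check it against lower and upper bounds."""
    try:
        number = int(value)
    except (TypeError, ValueError):
        return False
    return low <= number <= high


def validate_combined(parts, max_total=32):
    """Re-validate after concatenation: parts that pass alone may fail together."""
    return len("".join(parts)) <= max_total
```

Two 20-character names each pass validate_username, yet validate_combined rejects their 40-character concatenation, which is the situation the combined check is meant to catch.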
Client-side vs Server-side validation
Improper Input Sanitization and Filtering
Sanitization of input deals with transforming input to an acceptable form, whereas filtering deals with blocking or allowing all or part of the input that is deemed unacceptable or acceptable, respectively. Sanitization and filtering are typically implemented in addition to input validation.
Weak sanitization and/or filtering can allow an attacker to evade such mechanisms and supply malformed and/or malicious input to the application. The "attacks" section of this document describes SQL Injection and Buffer Overflow attacks, which are a direct effect of missing or weak filtering/sanitization.
Input sanitization can be performed by transforming input from its original form to an acceptable form via encoding or decoding. Common encoding methods used in web applications include the HTML entity encoding and URL encoding schemes. HTML entity encoding replaces the literal representation of certain meta-characters with their corresponding character entity references. Character references for HTML entities are pre-defined and have the format &name; where "name" is a case-sensitive alphanumeric string. A common example of HTML entity encoding is where "<" is encoded as &lt; and ">" is encoded as &gt;. Refer to "Character encodings in HTML" in the reference list for more information on character encodings. URL encoding applies to parameters and their associated values that are transmitted as part of HTTP query strings. Characters that are not permitted in URLs are represented by the bytes of their character encoding (typically UTF-8), where each byte is written in hexadecimal as "%HH". For example, "<" is URL-encoded as "%3C" and "ÿ" is URL-encoded as "%C3%BF".
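Both encodings can be reproduced with the Python standard library. The sketch below is only an illustration of the two schemes; the sample strings are arbitrary.

```python
import html
from urllib.parse import quote, unquote

# HTML entity encoding: meta-characters become character entity references.
encoded_html = html.escape('<b>"hi"</b>')
print(encoded_html)  # &lt;b&gt;&quot;hi&quot;&lt;/b&gt;

# URL (percent) encoding: each byte of the UTF-8 form is written as %HH.
encoded_lt = quote("<")
encoded_y = quote("\u00ff")  # U+00FF is the character "ÿ"
print(encoded_lt)  # %3C
print(encoded_y)   # %C3%BF

# Decoding reverses the transformation.
decoded_y = unquote(encoded_y)
```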
There are many ways in which input can be presented to an application. With web applications and browsers supporting more than one character encoding, it has become commonplace for attackers to try to exploit inherent weaknesses in encoding and decoding routines. Applications requiring internationalization are good candidates for input sanitization. One of the common forms of representing international characters is Unicode. Unicode transformations use the UCS (Universal Character Set), which consists of a large set of characters covering the symbols of almost all the languages in the world. The table below, taken from , shows a set of samples with different characters from UCS that are visually similar in representation to ASCII characters "s", "o", "u" and "p". From the most novice personal computer user to the most seasoned security expert, rarely does an individual inspect every character within a Unicode string to confirm its validity. Such misrepresentation of characters enables attackers to spoof expected values by replacing them with visually or semantically similar characters from the UCS.
Note that although the characters have a similar visual representation, they all carry a different hexadecimal code that uniquely maps to UCS. Additional information on character encoding types and output handling can be found at .
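The difference in code points is easy to inspect programmatically. The following sketch is an illustration only: the helper names are hypothetical, and the spoofed string substitutes a single Cyrillic letter chosen for the example.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


def non_ascii_chars(text):
    """List the characters that fall outside the ASCII range."""
    return [ch for ch in text if ord(ch) > 0x7F]


genuine = "soup"
spoofed = "s\u043eup"  # second character is CYRILLIC SMALL LETTER O

print(genuine == spoofed)        # False, although the strings look alike
print(describe(spoofed)[1][1:])  # ('U+043E', 'CYRILLIC SMALL LETTER O')
print(len(non_ascii_chars(spoofed)))  # 1
```

A check like non_ascii_chars only suits fields that are expected to be pure ASCII; internationalized fields need a policy based on the scripts they legitimately accept.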
Canonicalization is another important aspect of input sanitization. Canonicalization deals with converting data with various possible representations into a standard "canonical" representation deemed acceptable by the application. One of the most commonly known applications of canonicalization is "Path Canonicalization", where file and directory paths on computer file systems or web servers (URL) are canonicalized to enforce access restrictions. Failure of such canonicalization mechanisms can lead to directory traversal or path traversal attacks. The concept of canonicalization is widely applicable and applies equally well to Unicode and XML processing routines.
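Path canonicalization can be sketched as below: the requested path is resolved to its canonical absolute form first, and only then compared against the permitted directory. The function name resolve_under and the directory used in the usage note are hypothetical.

```python
import os


def resolve_under(base_dir, user_path):
    """Canonicalize user_path relative to base_dir and refuse anything outside it."""
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, user_path))
    # Compare canonical forms, so "..", symbolic links, and absolute paths
    # cannot step outside the base directory unnoticed.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError("path escapes the permitted directory")
    return candidate
```

With a base of /srv/app/templates, a request for tmp1.txt resolves inside the directory, while ../../../etc/passwd or an absolute path such as /etc/passwd raises ValueError.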
The first major Unicode vulnerability was documented against Microsoft Internet Information Server (IIS) in October 2000. This vulnerability allowed attackers to encode "/", "\" and "." characters to appear as their Unicode counterparts and bypass the security mechanisms within IIS that block directory traversal. In another example, a vulnerability discovered in Google perfectly illustrates the significance of character encoding. The vulnerability in this example exploits a lack of consistency in character encoding schemes across the application. While expecting UTF-8 encoded characters, the application fails to sanitize and transform input supplied in the form of UTF-7, leading to a Cross-site scripting attack. Additional examples can be found at  and . As mentioned earlier, applications that are internationalized need to support multiple languages that cannot be represented using the common ISO-8859-1 (Latin-1) character encoding. Languages like Chinese and Japanese use thousands of characters and are therefore represented using variable-width encoding schemes. Improperly handled mapping and encoding of such international characters can also lead to canonicalization attacks.
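One way to reduce this class of problem is to decode input strictly and reduce it to a canonical form before any filter runs. The sketch below assumes the application expects UTF-8 and performs a single decoding pass; the function names are hypothetical, and NFKC normalization is one possible choice of canonical form, not the only one.

```python
import unicodedata
from urllib.parse import unquote_to_bytes


def canonicalize(raw):
    """Percent-decode, strictly decode as UTF-8, then apply NFKC normalization."""
    data = unquote_to_bytes(raw)
    text = data.decode("utf-8")  # strict mode raises UnicodeDecodeError on overlong forms
    return unicodedata.normalize("NFKC", text)


def is_safe_segment(raw):
    """Run the traversal check on the canonical form, not on the raw input."""
    try:
        text = canonicalize(raw)
    except UnicodeDecodeError:
        return False
    return ".." not in text and "/" not in text and "\\" not in text
```

Here "%c0%af", an overlong two-byte form of "/", is rejected by the strict decoder, and the fullwidth sequence "%EF%BC%8E%EF%BC%8E%EF%BC%8F" normalizes to "../" and is rejected by the check. Input that has been percent-encoded twice would need a policy of its own, since this sketch decodes only once.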
Based on input and output handling requirements, applications should identify acceptable character sets and implement custom sanitization routines to process and transform data specific to their needs. Additional information on outputting data in international applications can be found at .
Input Filtering is a decision-making process that leads either to the acceptance or the rejection of input based on predefined criteria. In its most basic form, input filtering deals with matching or comparing an input data stream with a predefined set of characters to determine acceptability. Acceptable input is passed forward for processing and unwanted characters are blocked, thus preventing the application from processing unrecognized and potentially malicious input. There are two major approaches to input filtering:
- Whitelist - Allowing only the known good characters. E.g. a-z,A-Z,0-9 are known good characters in the whitelist and are hence accepted by the filter
- Blacklist - Allowing anything except the known bad characters. E.g. <,/,> are known bad characters in the blacklist and are hence blocked by the filter
There are advantages and disadvantages to both approaches. Blacklist based filtering is widely used as it is fairly easy to implement, but it offers protection only from known threats. A blacklist can be evaded because the filter only blocks known bad characters; an attacker can craft an attack that avoids those specific characters. Researchers have demonstrated several ways of evading blacklist based filtering approaches. The XSS cheat sheet and SQL cheat sheet are classic examples of how filter evasion techniques can be used against blacklist based approaches. Both Mitre and NVD host several advisories describing vulnerabilities due to poor blacklist filtering implementations.
Whitelist based filtering is often more difficult to implement properly. Although proven efficient with virus and malware protection techniques, it can be difficult to compile a list of all good input that a system can accept.
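The difference between the two approaches can be shown with the character sets from the list above. This is a minimal sketch; the function names and the sample payload are illustrative.

```python
import string

WHITELIST = set(string.ascii_letters + string.digits)  # a-z, A-Z, 0-9
BLACKLIST = set("</>")


def passes_whitelist(text):
    """Accept only if every character is known good."""
    return all(ch in WHITELIST for ch in text)


def passes_blacklist(text):
    """Accept unless a known bad character is present."""
    return not any(ch in BLACKLIST for ch in text)


payload = '" onmouseover="alert(1)'  # contains none of the blacklisted characters
print(passes_blacklist(payload))  # True: the blacklist misses it
print(passes_whitelist(payload))  # False: quotes, spaces, "=" and parentheses are not listed
```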
Input validation, sanitization and filtering requirements apply equally to elements beyond web application code. Web application infrastructure components like web servers and proxies that handle web application requests and responses have been shown to be vulnerable to attacks caused by weak input validation of HTTP request/response headers. Examples include HTTP Response Splitting and HTTP Request Smuggling.
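As one small illustration for headers: HTTP Response Splitting depends on carriage return and line feed characters reaching a header value, so a value built from untrusted input can be checked before it is emitted. The function name below is hypothetical, and real frameworks often perform a similar check themselves.

```python
def safe_header_value(value):
    """Reject values containing CR or LF, which would terminate the header line."""
    if "\r" in value or "\n" in value:
        raise ValueError("header value contains a line break")
    return value
```

A value such as "/home" is returned unchanged, while "/home\r\nSet-Cookie: x=1" raises ValueError instead of injecting a second header.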
A common approach to performing input filtering, validation and sanitization is through the use of regular expressions (regex). Regular expressions provide a concise and flexible means of identifying patterns in a given data set. Many ready-made regular expressions that deal with common input/output related attacks such as SQL Injection, OS Commanding and Cross-Site Scripting are available on the Internet. While these regular expressions may be simple to copy into an application, it is important for developers using them to verify that they match the requirements of their own expected input streams.
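A short sketch of regex-based validation follows. The postal code format is a hypothetical requirement chosen for the example; the point is that the pattern must cover the whole input, which is also where copied patterns tend to go subtly wrong.

```python
import re

ZIP_PATTERN = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")  # hypothetical expected format


def is_valid_zip(value):
    """Accept the value only if the entire string matches the pattern."""
    return ZIP_PATTERN.fullmatch(value) is not None


# Pitfall: "$" also matches just before a trailing newline, so an anchored
# re.match accepts input that fullmatch rejects.
loose = re.match(r"^[0-9]{5}$", "12345\n") is not None
strict = is_valid_zip("12345\n")
print(loose, strict)  # True False
```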
Commercial companies like Microsoft and open source communities like OWASP have ongoing efforts to provide protection tools against some of the common attacks mentioned above. Microsoft's Anti Cross-Site Scripting Library not only guides its users and developers in putting measures in place to thwart cross-site scripting attacks, but also provides insight into alternatives for proper input and output encoding where its library routines may not apply. OWASP's ESAPI project provides guidelines and primary defenses against SQL Injection attacks. It also provides details on database specific SQL escaping requirements to help escape/encode user input before concatenating it with a SQL query. SQL escaping, as advocated in ESAPI, uses DBMS character escaping schemes to convert input so that the SQL engine treats it as data instead of code.
Common examples of attacks due to Improper Input Handling
In the following buffer overflow example, the length of the source variable input is not validated before being copied to the destination dest_buffer. The weakness is exploited when the size of input (source) exceeds the size of dest_buffer (destination), causing data to be written past the end of the destination variable in memory.
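#include <stdio.h>
#include <string.h>
void bad_function(char *input) {
    char dest_buffer[32];        /* buffer size chosen for illustration */
    strcpy(dest_buffer, input);  /* unchecked copy: overflows when input is too long */
    printf("The first command line argument is %s.\n", dest_buffer);
}
int main(int argc, char *argv[]) {
    if (argc > 1) bad_function(argv[1]);
    else printf("No command line argument was given.\n");
    return 0;
}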
See  for more on this and similar attacks.
The sample code below shows a SQL query used by a web application authentication form.
' Untrusted input is concatenated directly into the query text
SQLCommand = "SELECT Username FROM Users WHERE Username = '"
SQLCommand = SQLCommand & strUsername
SQLCommand = SQLCommand & "' AND Password = '"
SQLCommand = SQLCommand & strPassword
SQLCommand = SQLCommand & "'"
strAuthCheck = GetQueryResult(SQLCommand)
In this code, the developer combines the input from the user, strUsername and strPassword, with the existing SQL statement's structure. Suppose an attacker submits a login and password that look like the following:
Login: foo
Password: bar' OR ''='
The SQL command string built from this input would be as follows:
SELECT Username FROM Users WHERE Username = 'foo'
AND Password = 'bar' OR ''=''
This query will return all rows from the Users table, regardless of whether "foo" is a real user name or "bar" is a legitimate password. This is due to the OR condition appended to the WHERE clause. The comparison ''='' will always return a "true" result, making the overall WHERE clause evaluate to true for all rows in the table. If this is used for authentication purposes, the attacker will often be logged in as the first or last user in the Users table.
See  for more information on this and other variants of SQL Injection attack
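The attack and one common defense, parameterized queries, can be reproduced with an in-memory SQLite database. This is a sketch under stated assumptions: the table contents and function names are invented for the example, and passwords are stored in plain text only to mirror the sample query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (Username TEXT, Password TEXT)")
conn.executemany("INSERT INTO Users VALUES (?, ?)",
                 [("alice", "s3cret"), ("bob", "hunter2")])


def login_concatenated(username, password):
    """Vulnerable: input becomes part of the SQL text, as in the sample above."""
    query = ("SELECT Username FROM Users WHERE Username = '" + username +
             "' AND Password = '" + password + "'")
    return conn.execute(query).fetchall()


def login_parameterized(username, password):
    """Input is bound as data through placeholders and is never parsed as SQL."""
    query = "SELECT Username FROM Users WHERE Username = ? AND Password = ?"
    return conn.execute(query, (username, password)).fetchall()


attack = "bar' OR ''='"
print(len(login_concatenated("foo", attack)))  # 2: every row is returned
print(login_parameterized("foo", attack))      # []
```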
OS Commanding (command injection) is an attack technique used for unauthorized execution of operating system commands. Improperly handled input from the user is one of the common weaknesses that can be exploited to run unauthorized commands. Consider a web application exposing a function showInfo() that accepts parameters, including template, from the user and opens a file based on this input.
Due to improper or non-existent input handling, an attacker can change the template parameter value to trick the web application into executing the command /bin/ls or opening arbitrary files.
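Two complementary defenses can be sketched as follows: validating the template name against a strict pattern, and passing untrusted values to child processes as discrete arguments with no shell involved. The naming rule and both function names are assumptions for illustration; the child process is the Python interpreter itself so that the example runs anywhere.

```python
import re
import subprocess
import sys

TEMPLATE_NAME = re.compile(r"[A-Za-z0-9_-]+\.txt")  # hypothetical naming rule


def is_valid_template(name):
    """Accept only plain file names; '/', '|', and shell metacharacters fail the match."""
    return TEMPLATE_NAME.fullmatch(name) is not None


def echo_argument(value):
    """Pass the value as one argv element with no shell involved."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", value],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.rstrip("\n")
```

A value such as "/bin/ls|" fails is_valid_template, and a string like "x; echo injected" reaches the child process as a single literal argument instead of being interpreted as two commands.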
See  for more on this and other variants of OS commanding or Command Injection attack
References
Character encodings in HTML
Secure input and output handling
XSS Cheat Sheet
SQL Cheat Sheet
CVE at Mitre
National Vulnerability Database
CWE-20: Improper Input Validation
Microsoft IIS Extended Unicode Directory Traversal Vulnerability
Google XSS Vulnerability
Widescale Unicode Encoding Implementation Flaw Discovered
Unicode Left/Right Pointing Double Angle Quotation Mark
Variable width encoding schemes
Canonicalization, locale and Unicode
The Methodology and an application to fight against Unicode attacks
Improper Output Handling
HTTP Response Splitting
HTTP Request Smuggling
Cross Site Scripting
Microsoft Anti-Cross Site Scripting Library V3.0
HTML 5 "pattern" attribute