Thursday, October 22, 2009

Improper Input Handling

Improper input handling is one of the most common weaknesses identified across applications today. Poorly handled input is a leading cause behind critical vulnerabilities that exist in systems and applications.
Generally, the term input handing is used to describe functions like validation, sanitization, filtering, encoding and/or decoding of input data. Applications receive input from various sources including human users, software agents (browsers), and network/peripheral devices to name a few. In the case of web applications, input can be transferred in various formats (name value pairs, JSON, SOAP, etc...) and obtained via URL query strings, POST data, HTTP headers, Cookies, etc... Non-web application input can be obtained via application variables, environment variables, the registry, configuration files, etc... Regardless of the data format or source/location of the input, all input should be considered untrusted and potentially malicious. Applications which process untrusted input may become vulnerable to attacks such as Buffer Overflows, SQL Injection, OS Commanding, Denial of Service just to name a few.

Improper Input Validation

One of the key aspects of input handling is validating that the input satisfies a certain criteria. For proper validation, it is important to identify the form and type of data that is acceptable and expected by the application. Defining an expected format and usage of each instance of untrusted input is required to accurately define restrictions.

Validation can include checks for type safety and correct syntax. String input can be checked for length (min & max number of characters) and character set validation while numeric input types like integers and decimals can be validated against acceptable upper and lower bound of values. When combining input from multiple sources, validation should be performed during concatenation and not just against the individual data elements. This practice helps avoid situations where input validation may succeed when performed on individual data items but fails when done on a combined set from all the sources [11].

Client-side vs Server-side validation

A common mistake most developers make is to include validation routines in the client-side of an application using JavaScript functions as a sole means to perform bound checking. Validation routines are beneficial on the client side but are not intended to provide a security feature as all data accessible on the client side is modifiable by a malicious user or attacker. This is true of any client-side validation checks in JavaScript and VBScript or external browser plug-ins such as Flash, Java, or ActiveX. The HTML5 specification has added a new attribute "pattern" to the INPUT tag that enables developers to write regular expressions as part of the markup for performing validations [29]. This makes it even more convenient for developers to perform input validation on the client side without having to write any extra code. The risk from such a feature becomes significant when developers start using it as the only means of performing input validation for their applications. Relying on client-side validation alone in not a safe practice. It gives a false sense of security to many developers since client-side validations can easily be evaded by malicious entities. It is important to note that while client-side validation is great for UI and functional validation, it isn't a substitute for server-side validation. Performing validation on the server side ensures integrity of your validation controls. In addition, the server-side validation routine will always be effective irrespective of the state of JavaScript execution on the browser. As a best practice input validation should be performed both on the client side as well as on the server side.

Improper Input Sanitization and Filtering

Sanitization of input deals with transforming input to an acceptable form where as filtering deals with blocking/allowing all or part of input that is deemed unacceptable/acceptable respectively. Sanitization and filtering typically is implemented in addition to input validation.

Weak sanitization and/or filtering can lead an attacker to evade such mechanisms and supply malformed and/or malicious input to the application. The "attacks" section of this document describes SQL Injection and Buffer Overflow attacks which are a direct effect of missing or weak filtering/sanitization.

Input Sanitization

Input sanitization can be performed by transforming input from its original form to an acceptable form via encoding or decoding. Common encoding methods used in web applications include the HTML entity encoding and URL Encoding schemes. HTML entity encoding serves the need for encoding literal representations of certain meta-characters to their corresponding character entity references. Character references for HTML entities are pre-defined and have the format &name; where "name" is a case-sensitive alphanumeric string. A common example of HTML entity encoding is where "<" is encoded as < and ">" encoded as > . Refer to [1] for more information on character encodings. URL encoding applies to parameters and their associated values that are transmitted as part of HTTP query strings. Likewise, characters that are not permitted in URLs are represented using their Unicode Character Set code point value, where each byte is encoded in hexadecimal as "%HH". For example, "<" is URL-encoded as "%3C" and "ÿ" is URL-encoded as "%C3%BF".

There are many ways in which input can be presented to an application. With web applications and browsers supporting more than one character encoding types, it has become a common place for attackers to try and exploit inherent weaknesses in encoding and decoding routines. Applications requiring internationalization are a good candidate for input sanitization. One of the common forms of representing international characters is Unicode [18]. Unicode transformations use the UCS (Universal Character Set) which consist of a large set of characters to cover symbols of almost all the languages in the world. The table below, taken from [21], shows a set of samples with different characters from UCS that are visually similar in representation to ASCII characters "s", "o", "u" and "p". From the most novice personal computer user to the most seasoned security expert, rarely does an individual inspect every character within a Unicode string to confirm its validity. Such misrepresentation of characters enables attackers to spoof expected values by replacing them with visually or semantically similar characters from the UCS.

s ѕ Ѕ Ϩ
0073 FF53 0455 10BD FF33 0405 03E8
o ο о º ѻ
006F 03BF 043E FF4F 00BA FFB7 047B
u υ IJ
0075 2294 03C5 22C3 222A 0132 1E75
p р ƿ ρ ק Р
0070 0440 FF50 01BF 03C1 05E7 0420

Note that although the characters have a similar visual representation, they all carry a different hexadecimal code that uniquely maps to UCS. Additional information on character encoding types and output handling can be found at [22].


Canonicalization is another important aspect of input sanitization [20]. Canonicalization deals with converting data with various possible representations into a standard "canonical" representation deemed acceptable by the application. One of the most commonly known application of canonicalization is "Path Canonicalization" where file and directory paths on computer file systems or web servers (URL) are canonicalized to enforce access restrictions. Failure of such canonicalization mechanism can lead to directory traversal or path traversal attacks [24]. The concept of canonicalization is widely applicable and applies equally well to Unicode and XML processing routines.

The first major Unicode vulnerability was documented against Microsoft Internet Information Server (IIS) in October 2000 [12]. This vulnerability allowed attackers to encode "/", "\" and "." characters to appear as their Unicode counterparts and bypass the security mechanisms within IIS that block directory traversal. In another example, a vulnerability discovered in Google perfectly illustrates the significance of character encoding [13]. The vulnerability stated in this example exploits lack of consistency in character encoding schemes across the application. While expecting UTF-8 [14] encoded characters, the application fails to sanitize and transform input supplied in the form on UTF-7 [15] leading to a Cross-site scripting attack. Additional examples can be found at [16] and [17]. As mentioned earlier, applications that are internationalized have a need to support multiple languages that cannot be represented using common ISO-8859-1 (Latin-1) character encoding. Languages like Chinese, Japanese use thousands of characters and are therefore represented using variable-width encoding schemes [18]. Improperly handled mapping and encoding of such international characters can also lead to canonicalization attacks [19].

Based on input and output handling requirements, applications should identify acceptable character sets and implement custom sanitization routines to process and transform data specific to their needs. Additional information on outputting data in international applications can be found at [22].

Input Filtering

Input Filtering is a decision making process that leads either to the acceptance or the rejection of input based on predefined criteria. In its most basic form, input filtering deals with matching or comparing an input data stream with a predefined set of characters to determine acceptability. Acceptable input is passed forward for processing and unwanted characters are blocked thus preventing the application from processing unrecognized and potentially malicious input. There are two major approaches to input filtering [2]:

  • Whitelist - Allowing only the known good characters. E.g. a-z,A-Z,0-9 are known good characters in the whitelist and are hence accepted by the filter
  • Blacklist - Allowing anything except the known bad characters. E.g. <,/,> are known bad characters in the blacklist and are hence blocked by the filter

There are advantages and disadvantages to both approaches. Blacklist based filtering is widely used as it is fairly easy to implement, but offers protection only from known threats. Characters in a blacklist can be modeled to evade filtering as the filter only blocks known bad characters; an attacker can specially craft an attack to avoid those specific characters. Researchers have demonstrated several ways of evading blacklist based filtering approaches. The XSS cheat sheet [7] and SQL cheat sheet [8] are classic examples of how filter evasion techniques can be used against blacklist based approaches. Both Mitre [9] and NVD [10] host several advisories describing vulnerabilities due to poor blacklist filtering implementations.

Whitelist based filtering is often more difficult to implement properly. Although proven efficient with virus and malware protection techniques, it can be difficult to compile a list of all good input that a system can accept.

Input validation, sanitization and filtering requirements apply equally to elements beyond web application code. Web application infrastructure components like web servers and proxies that handle web application requests and responses have been shown to be vulnerable to attacks caused due to weak input validation of HTTP request/response headers. Some examples include HTTP Response Splitting [25], HTTP Request Smuggling [26], etc...

A common approach to perform input filtering, validation and sanitization is through the use of a regex (Regular Expressions) [23]. Regular Expressions provide a concise and flexible means of identifying patterns in a given data set. Many ready-made regular expressions that deal with common input/output related attacks such as SQL Injection [4], OS Commanding [5] and Cross-Site Scripting [27] are available on the Internet. While these regular expressions may be simple to copy into an application, it is important for developers using them to ensure they are evaluating the requirements for their expected input streams.

Commercial companies like Microsoft and open source communities like OWASP have ongoing efforts to provide protection tools against some of the common attacks mentioned above. Microsoft's Anti Cross-Site Scripting Library [28] not only guides its users and developers with putting measures in place to thwart cross-site scripting attacks, but also provides insight into alternatives for proper input and output encoding where its library routines may not apply. OWASP's ESAPI project [6] provides guidelines and primary defenses against SQL Injection attacks. It also provides details on database specific SQL escaping requirements to help escape/encode user input before concatenating it with a SQL query. SQL escaping, as advocated in EASPI, uses DBMS character escaping schemes to convert input that can be characterized by the SQL engine as data instead of code.

Common examples of attacks due to Improper Input Handling

Buffer Overflow

The length of the source variable input is not validated before being copied to the destination dest_buffer. The weakness is exploited when the size of input (source) exceeds the size of the dest_buffer(destination) causing an overflow of the destination variable's address in memory.

void bad_function(char *input)
char dest_buffer[32];
strcpy(dest_buffer, input);
printf("The first command line argument is %s.\n", dest_buffer);
int main(int argc, char *argv[])
if (argc > 1)
printf("No command line argument was given.\n");
return 0;

See [3] for more on this and similar attacks.

SQL Injection

The sample code below shows a SQL query used by a web application authentication form.

SQLCommand = "SELECT Username FROM Users WHERE Username = '"
SQLCommand = SQLComand & strUsername
SQLCommand = SQLComand & "' AND Password = '"
SQLCommand = SQLComand & strPassword
SQLCommand = SQLComand & "'"
strAuthCheck = GetQueryResult(SQLQuery)

In this code, the developer combines the input from the user, strUserName and strPassword, with the existing SQL statement's structure. Suppose an attacker submits a login and password that looks like the following:

Username: foo
Password: bar' OR ''='

The SQL command string built from this input would be as follows:

SELECT Username FROM Users WHERE Username = 'foo'
AND Password = 'bar' OR ''=''

This query will return all rows from the user's database, regardless of whether "foo" is a real user name or "bar" is a legitimate password. This is due to the OR statement appended to the WHERE clause. The comparison ''='' will always return a "true" result, making the overall WHERE clause evaluate to true for all rows in the table. If this is used for authentication purposes, the attacker will often be logged in as the first or last user in the Users table.

See [4] for more information on this and other variants of SQL Injection attack

OS Commanding

OS Commanding (command injection) is an attack technique used for unauthorized execution of operating system commands. Improperly handled input from the user is one of the common weaknesses that can be exploited to run unauthorized commands. Consider a web application exposing a function showInfo() that accepts parameters name and template from the user and opens a file based on this input

Example: http://example/cgi-bin/

Due to improper or non-existent input handling, by changing the template parameter value an attacker can trick the web application into executing the command /bin/ls or open arbitrary files.

Attack Example:


See [5] for more on this and other variants of OS commanding or Command Injection attack


Character encodings in HTML


Secure input and output handling


Buffer Overflow


SQL Injection


OS Commanding




XSS Cheat Sheet


SQL Cheat Sheet


CVE at Mitre


National Vulnerability Database


CWE-20: Improper Input Validation


Microsoft IIS Extended Unicode Directory Traversal Vulnerability


Google XSS Vulnerability






Widescale Unicode Encoding Implementation Flaw Discovered


Unicode Left/Right Pointing Double Angel Quotation Mark


Variable width encoding schemes


Canonicalization, locale and Unicode




The Methodology and an application to fight against Unicode attacks


Improper Output Handling


Regular Expressions


Path Traversal


HTTP Response Splitting


HTTP Request Smuggling


Cross Site Scripting


Microsoft Anti-Cross Site Scripting Library V3.0


HTML 5 "pattern" attribute


Original here.

No comments:

Post a Comment