Preventing Basic XSS Attacks

Thursday, April 2, 2009

Unlike most attacks, XSS or Cross Site Scripting works on the client side. The most basic form of XSS is to put some JavaScript in user-submitted content to steal the data in a user's cookie. Since most sites use cookies and sessions to identify visitors, the stolen data can then be used to impersonate that user—which is deeply troublesome when it's a typical user account, and downright disastrous if it's the administrative account. If you don't use cookies or session IDs on your site, your users aren't vulnerable, but you should still be aware of how this attack works.

Unlike MySQL injection attacks, XSS attacks are difficult to prevent. Yahoo!, eBay, Apple, and Microsoft have all been affected by XSS. Although the attack doesn't involve PHP, you can use PHP to strip user data in order to prevent attacks. To stop an XSS attack, you have to restrict and filter the data a user submits to your site. It is for this precise reason that most online bulletin boards don't allow the use of HTML tags in posts and instead replace them with custom tag formats such as [b] and [linkto].

Here's a simple script that illustrates how to prevent some of these attacks.

<?php
function transform_HTML($string, $length = null) {
    // Remove dead space.
    $string = trim($string);

    // Prevent potential Unicode codec problems.
    $string = utf8_decode($string);

    // HTMLize HTML-specific characters.
    $string = htmlentities($string, ENT_NOQUOTES);
    $string = str_replace("#", "&#35;", $string);
    $string = str_replace("%", "&#37;", $string);

    $length = intval($length);
    if ($length > 0) {
        $string = substr($string, 0, $length);
    }
    return $string;
}
?>


This function transforms HTML-specific characters into HTML literals. A browser renders any HTML run through this script as text with no markup. For example, consider this HTML string:
<STRONG>Bold Text</STRONG>


Normally, this HTML would render as follows:

Bold Text

However, when run through transform_HTML(), it renders as the original input. The reason is that the tag characters are HTML entities in the processed string. The resulting string from HTML() in plaintext looks like this:
&lt;STRONG&gt;Bold Text&lt;/STRONG&gt;


The essential piece of this function is the htmlentities() function call that transforms <, >, and & into their entity equivalents of &lt;, &gt;, and &amp;. Although this takes care of the most common attacks, experienced XSS hackers have another sneaky trick up their sleeve: Encoding their malicious scripts in hexadecimal or UTF-8 instead of normal ASCII text, hoping to circumvent your filters. They can send the code along as a GET variable in the URL, saying, "Hey, this is hexadecimal code, but could you run it for me anyway?" A hexadecimal example looks something like this:
<a href="http://host/a.php?variable=%22%3e %3c%53%43%52%49%50%54%3e%44
%6f%73%6f%6d%65%74%68%69%6e%67%6d%61%6c%69%63%69%6f%75%73%3c%2f%53%43%52
%49%50%54%3e">


But when the browser renders that information, it turns out to be:
<a href="http://host/a.php?variable="> <SCRIPT>Dosomethingmalicious</SCRIPT>


To prevent this, transform_HTML() takes the additional steps of converting # and % signs into their entity, shutting down hex attacks, and converting UTF-8–encoded data.

Finally, just in case someone tries to overload a string with a very long input, hoping to crash something, you can add an optional $length parameter to trim the string to the maximum length you specify.

Hope it helps.

Validating Email Addresses

Wednesday, April 1, 2009

The following script verifies that an email address mostly follows the rules outlined in RFC 2822. This won't prevent someone from entering a false (but RFCcompliant) email address such as dontbugmeman@nonexistentdomain.com, but it will catch some typos.

<?php
function is_email($email) {
 if (! preg_match( '/^[A-Za-z0-9!#$%&\'*+-/=?^_`{|}~]+@[A-Za-z0-9-]+(\.[AZa-z0-9-]+)+[A-Za-z]$/', $email)) {
            return false;
 } 
 else {
  return true;
 }
}
?>


This script uses regular expression to check whether the given email uses proper email characters (alphabetical, dots, dashes, slashes, and so on), an @ sign in the middle, and at least one dot-something on the end.

Hope it helps.