Writing CGI Scripts in Perl: Part 2: Decoding Client Data




Decoding Data

The client encodes the query string that contains the name/values pairs of user-supplied data. Our next task will be to decode that data and create our own hash to contain it.

The data is encoded because the GET method appends key/value pairs to URLs. To support this method, a client must make sure that nothing in that data is a character that is not allowed in a URL, or else the entire mechanism of the World Wide Web would grind to a smoldering halt. URLs must be written in US-ASCII plain text, so any characters outside that set, or that otherwise have special meanings in URLs, are encoded according to the content type declared in the enctype attribute of the form tag. The default encoding type is application/x-www-form-urlencoded. This content type demands that any character other than the basic 26 upper and lower case letters, the digits and the characters $ - _ . and + be represented by a three-character string: a % sign followed by the two-digit hexadecimal number of that character in the US-ASCII table. Each space is translated into a + symbol (so in practice a literal + in form data is itself encoded, as %2B).

You can read more about URLs and how they are encoded in RFC 1738 at http://www.rfc-editor.org.
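
The encoding rule can be tried out from the command line. The following is a minimal sketch of what a client does before sending a value, not code from any particular browser; the helper name url_encode is hypothetical, and the sketch assumes plain single-byte text.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one form value roughly the way a client
# does for application/x-www-form-urlencoded. Assumes single-byte text.
sub url_encode {
    my ($text) = @_;
    # Replace everything except letters, digits, - _ . and space
    # with a % sign and two hexadecimal digits.
    $text =~ s/([^A-Za-z0-9\-_. ])/sprintf("%%%02X", ord($1))/eg;
    # Each space becomes one + sign.
    $text =~ tr/ /+/;
    return $text;
}

print url_encode("midsummer night's dream"), "\n";
```

The apostrophe is character number 39, which is 27 in hexadecimal, so it comes out as %27.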

Understanding Hexadecimal Notation

To count in hexadecimal you need a notation for counting in base 16 rather than base 10. Look at how numbers are written in base ten: you start counting at 0 and write digits in the first column; when you reach ten you write a 1 in the second column and start over from 0 in the first column.
Zero 0
One 1
Two 2
Three 3
Four 4
Five 5
Six 6
Seven 7
Eight 8
Nine 9
Ten 1 0
Eleven 1 1

The base of a number is what you count up to before you indicate a digit in the second column. Trouble is we only have 0-9 for digits as our language is based on base 10 numbering (maybe because we have ten fingers). So when we're counting in base 16 we use the letters a-f as notation for the remaining digits we need. The number eleven, for example, would be written as B. Counting to seventeen in Base 16 looks like this:

Zero 0
One 1
Two 2
Three 3
Four 4
Five 5
Six 6
Seven 7
Eight 8
Nine 9
Ten A
Eleven B
Twelve C
Thirteen D
Fourteen E
Fifteen F
Sixteen 1 0
Seventeen 1 1

To find the hexadecimal notation for a given number, find the largest multiple of 16 that fits into it and then write the remainder using the table above. For instance, 72 divided by 16 is 4 (4 sixteens are 64), which leaves a remainder of 8. To write 72 in hexadecimal notation you would write 48, "four sixteens and eight remainder". Likewise, to write the number 255 in hexadecimal you would write FF: 15 sixteens are 240, so there is an F (fifteen) in the second column, which leaves a remainder of 15, another F, in the first column.
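
If you would rather let Perl do the arithmetic, here is a small sketch of the divide-and-remainder method for numbers from 0 to 255. The helper name to_hex_digits is hypothetical; sprintf and hex are Perl built-ins.

```perl
use strict;
use warnings;

# Hypothetical helper: split a number from 0 to 255 into two hex digits.
sub to_hex_digits {
    my ($number) = @_;
    my @digits = (0 .. 9, 'A' .. 'F');
    my $sixteens  = int($number / 16);   # goes in the second column
    my $remainder = $number % 16;        # goes in the first column
    return $digits[$sixteens] . $digits[$remainder];
}

print to_hex_digits(72), "\n";     # 4 sixteens, remainder 8
print to_hex_digits(255), "\n";    # 15 sixteens, remainder 15
print sprintf("%02X", 39), "\n";   # the built-in way to do the same thing
print hex("27"), "\n";             # and back from hexadecimal to decimal
```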

For instance, if your form contains a field called title and the user enters "midsummer night's dream", the query string sent by the client using application/x-www-form-urlencoded is:

title=midsummer+night%27s+dream



Perl can quickly translate all the + characters back to spaces with a simple translate operator tr:
$scalar =~ tr/+/ /;



It's difficult to see, but there's a space between the second pair of slashes. Perl can also translate the hexadecimal codes back for you in a single line:
$pair =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;



We still remember how we felt when we first saw this line. No, you don't really need to understand it, and this statement will probably be the most complex line of Perl you will ever use in the vast majority of scripts. Just type it in and go, as millions of people have gone before you.
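
To see the two decoding lines at work, here is a short sketch that applies them to the example query string. The wrapper name decode_pair is hypothetical; the two statements inside it are the ones shown above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two decoding statements.
sub decode_pair {
    my ($pair) = @_;
    $pair =~ tr/+/ /;                                             # + back to space
    $pair =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;   # %XX back to a character
    return $pair;
}

print decode_pair("title=midsummer+night%27s+dream"), "\n";
```

The order matters: the + signs are translated first, so that a %2B in the data is decoded to a literal + and is not turned into a space afterwards.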

Now that we have decoded the query string, all that remains to do is break our @PAIRS array up into names and values indexed into a hash, which we will do with a simple foreach loop.

foreach $pair (@PAIRS) {
    ($name, $value) = split /=/, $pair;
    $F{$name} = $value;
}




Now we can access our user data by the name of the HTML form field. For example, if our form contained an input field called "title" we could get the value of that element in our script by looking at the hash element $F{"title"}. You can see the complete subroutine at the end of this section, but first a closer look at the decoding line, and then a word about security.

$pair =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

Here's how it works for the curious. Often, the best way to decipher complex statements like this is to pick them apart from the inside out. It's really just one big fat regular expression of the substitution form: s///. The s tells Perl to substitute what comes between the first pair of slashes with what comes between the second. The =~ binding operator says to apply this substitution to whatever is in $pair and then store the result back in $pair.

To decipher this, look at what's between the first slashes, which is the pattern Perl is looking for: %([a-fA-F0-9][a-fA-F0-9]). A simple text match. Perl scans the line for any series of characters that begins with a % sign followed by two other characters. A character class, which is what the [ ] brackets indicate, defines what each of those two characters may be: a lower case letter between a and f, an upper case letter in the same range, or a digit between 0 and 9. This matches the text notation of a hexadecimal number used by the URL encoding scheme.

In between the second slashes is what Perl substitutes in the value when it finds a match. The special variable $1 is always set to whatever was matched between the first set of parentheses ( ) in the pattern, and the hex function converts that hexadecimal notation back to a regular decimal number. That number then becomes the argument of the pack function. In this case pack basically looks up that number in the ASCII table and returns the appropriate US-ASCII character. The e on the end is a modifier to the entire regular expression which forces Perl to evaluate the right hand side (the replacement text) as a Perl expression (so the pack and hex functions get executed), while the g modifier instructs Perl to make this match as many times as it can in the target value, and not stop after the first time.
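
The pieces can be run one at a time. This is a sketch for experimenting, with hypothetical helper names (hex_to_char, decode_first, decode_all); it also shows what the g modifier adds.

```perl
use strict;
use warnings;

# Hypothetical helper: the replacement expression, one step per line.
sub hex_to_char {
    my ($code) = @_;              # e.g. "27", what $1 holds after %27 matches
    my $number = hex($code);      # hexadecimal text to a number: 39
    return pack("C", $number);    # number to a single character: '
}

# With only the e modifier, the substitution stops after the first match.
sub decode_first {
    my ($text) = @_;
    $text =~ s/%([a-fA-F0-9][a-fA-F0-9])/hex_to_char($1)/e;
    return $text;
}

# With e and g, every match in the string is replaced.
sub decode_all {
    my ($text) = @_;
    $text =~ s/%([a-fA-F0-9][a-fA-F0-9])/hex_to_char($1)/eg;
    return $text;
}

print decode_first("a%20b%20c"), "\n";   # only the first %20 becomes a space
print decode_all("a%20b%20c"), "\n";     # both do
```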

There. You can do this!

Security

Don't ignore security for CGI scripts -- it's so easy to make scripts secure, while the danger of not doing so is very real. CGI scripts allow anybody on the Internet to run a program on your machine, and to send input to that program. If a malicious user sends input to your program and your program passes that information to the operating environment (by passing user data to the shell or another program) your CGI script could be used to do almost anything. If nobody has hit your scripts with these kinds of attacks yet, then you are relying only on safety in numbers.

The situation is made more onerous due to the fact that Linux (Unix) shells allow easy ways to combine commands, and Perl makes it almost trivial to send commands to other programs.

Security in CGI is a simple matter of checking user input for characters that have special meaning in shell commands and being careful when passing user input to external programs.
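
As an illustration of "being careful", here is a minimal sketch of one cautious way to hand user input to an external program: the list form of a piped open, which starts the program directly instead of going through the shell. The helper name safe_echo and the sample value are hypothetical, and the sketch assumes a Unix-like system with an echo program on the path.

```perl
use strict;
use warnings;

# Hypothetical helper: run echo with one argument and return its output.
# Because the command is given as a list, no shell is involved and the
# value is passed to echo as a single argument, special characters and all.
sub safe_echo {
    my ($value) = @_;
    open(my $fh, "-|", "echo", $value) or die "cannot run echo: $!";
    my $output = do { local $/; <$fh> };
    close($fh);
    return $output;
}

# A value with a second command tacked on. Handed to the shell as part of
# one string, e.g. system("echo $value"), the ; would end the first command
# and the shell would run the second one as well.
my $value = "notes.txt; echo INJECTED";
print safe_echo($value);
```

This does not replace checking the input, since the external program may itself treat some characters specially, but it removes the shell from the picture.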

The characters that have a special meaning are:

    & ; ` ' \ " | * ? ~ > < ^ ( ) { } [ ] $ and the newline character
 

The space character is not dangerous.

Testing for Insecure Characters

This line causes our script to abort if the user supplies data that contains any of those characters:

if ($value =~ /[&;`'\\"|*?~><^(){}\[\]\$\n]/) {
    print_head("die", 500, "Bad Data");
}



Some of these shell metacharacters are also metacharacters for regular expressions, so we escape those in our match with the \ (backslash) mark. We include this test after we have decoded the user input (as it cannot match these characters when they are represented in hexadecimal form) and after we have divided our query string into names and values (as the & symbol is used as the separator between pairs in the CGI protocol, which is arguably a bug, believe it or not).
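
Here is a small sketch for trying the test against a few sample values. The helper name has_shell_chars and the sample values are hypothetical, and the character class assumes the semicolon and backslash are among the characters to reject.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the value contains a character with
# special meaning to the shell, 0 otherwise.
sub has_shell_chars {
    my ($value) = @_;
    return $value =~ /[&;`'\\"|*?~><^(){}\[\]\$\n]/ ? 1 : 0;
}

foreach my $value ("midsummer night", "night's dream", "notes.txt; ls") {
    printf "%-20s %s\n", $value, has_shell_chars($value) ? "rejected" : "accepted";
}
```

Note that "night's dream" is rejected because of its apostrophe. That is the price of a strict check, and a reason you may decide to allow particular characters for particular fields.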

Of course, logic indicates you could test for these characters only when you were about to pass user data out of your script, rather than testing every user supplied value as we do. Our style is a teeny, tiny bit slower, but it's complete, and we'd rather not pit our wits against the imagination of every possible hacker in the world, nor do we want to have to remember all this security stuff every time we talk to the shell or an external program. You may have particular needs for some of these characters to be supplied by your user, and may choose to delete them from the check. That's OK, so long as you are aware of what you are doing, and why.

Testing for Secure Characters

Some CGI programmers argue that testing for legal data is more secure than testing for bad data. Which characters you choose to define as legal is up to you, but take a breath before you define an insecure character as legal!

Our users should have no reason to send the script anything other than the regular 26 upper and lower case letters of the English alphabet, the digits 0-9, spaces, tabs, a dash, or the period and @ symbol (for their email address). We test for good data this way:

if ($value !~ /^[a-zA-Z0-9 \t@.-]+$/) {
    print_head("die", 500, "bad data");
}



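The \t symbol stands for tab. A dash is treated as a literal character only when it comes first or last inside the brackets; anywhere else it defines a range between its two neighbours. The expression is preceded by the ^ anchor, which tells Perl to start from the beginning of the string, and is terminated by + (which tells Perl to match one or more of the characters in the brackets) and the $ symbol, which indicates the end of the string to Perl.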
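
The following sketch shows the "legal characters only" test as a reusable check, and why the position of the dash matters. The helper name is_clean is hypothetical. The sketch uses \z rather than $ at the end, a stricter choice of ours: $ also matches just before a trailing newline, while \z matches only at the very end of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 only if every character is on the
# list of legal characters, 0 otherwise.
sub is_clean {
    my ($value) = @_;
    return $value =~ /^[a-zA-Z0-9 \t@.-]+\z/ ? 1 : 0;
}

foreach my $value ('hayes@example.com', "midsummer night", "notes.txt; ls") {
    printf "%-20s %s\n", $value, is_clean($value) ? "accepted" : "rejected";
}

# With the dash between \t and @ the class contains a range that runs
# from tab to @ in the ASCII table, which takes in ; & | and many more.
my $range_version = qr/^[a-zA-Z0-9 \t-@]+$/;
print "the range version accepts a;b\n" if "a;b" =~ $range_version;
```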
It's a good idea to read more about security. One of the best places to start is http://www.w3.org/Security/Faq/www-security-faq.html, the WWW Security FAQ by Lincoln D. Stein, one of the true masters of the web.

Using Taint Mode

For the truly paranoid, Perl contains a built-in tool called "taint mode", which reports errors if your script passes unsecured data out of the Perl process. You run your script in taint mode using the -T flag, either from the command line or on the shebang line. In taint mode Perl treats all data coming from outside the process as suspect. The only way to untaint the data is to use a regular expression to match, and capture, the string you are looking for within that data. Thus, taint mode is only as good as your ability to write a regular expression that matches secure strings. For that reason we find taint mode to be overkill and not of much use for the majority of CGI programmers, though of course others would disagree.
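
For those who do want to try it, here is a minimal sketch of the untainting idiom: match the data against a pattern of legal characters and use the captured text, which Perl regards as clean. The helper name untaint and the choice of legal characters are assumptions for illustration. The code also runs without -T, in which case nothing is tainted to begin with.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured (and therefore untainted)
# copy of the value if it holds only legal characters, undef otherwise.
sub untaint {
    my ($value) = @_;
    if ($value =~ /^([a-zA-Z0-9 \t@.-]+)\z/) {
        return $1;    # text captured by a regular expression is untainted
    }
    return;
}

my $clean = untaint('hayes@example.com');
print defined $clean ? "ok: $clean\n" : "rejected\n";
```

To see taint mode in action, save this to a file and run it with perl -T, feeding untaint a value from the environment or from STDIN.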

The complete user decoding and security check subroutine

Our complete subroutine looks like this:
sub get_user_data {
    my ($name, $value, $pair, $form_data, @PAIRS);
    if ($ENV{'REQUEST_METHOD'} eq "GET") {
        $form_data = $ENV{'QUERY_STRING'};
    } elsif ($ENV{'REQUEST_METHOD'} eq "POST") {
        read(STDIN, $form_data, $ENV{'CONTENT_LENGTH'})
            or print_head("error", 500, "Read failed $!");
    } else {
        # 405 is the HTTP status code for "Method Not Allowed"
        print_head("error", 405, "Client is using unsupported method");
    }

    # Split into pairs first, so that an encoded & or = inside a value
    # cannot be mistaken for a separator.
    @PAIRS = split (/&/, $form_data);
    foreach $pair (@PAIRS) {
        ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        # Decode the name and the value separately.
        foreach ($name, $value) {
            tr/+/ /;
            s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        }
        # Test the decoded value of this pair before storing it.
        if ($value =~ /[&;`'\\"|*?~><^(){}\[\]\$\n]/) {
            print_head("die", 500, "Bad Data");
        }
        $F{$name} = $value;
    }
}



To use it with CGI scripts that require form input, we simply call it at the start of our script:
#!/usr/bin/perl

&get_user_data;
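
To experiment without a web server, here is a sketch of a cut-down variant that takes the raw string as an argument and returns a reference to a hash, instead of reading the environment and filling %F. The name parse_form and the sample query string are hypothetical, and the security checks are left out to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical variant of get_user_data for testing from the command line.
# No security checks here: add them before using the values for anything.
sub parse_form {
    my ($form_data) = @_;
    my %form;
    foreach my $pair (split /&/, $form_data) {
        next if $pair eq '';                       # skip empty pairs such as a&&b
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;         # a name with no = sign
        foreach ($name, $value) {
            tr/+/ /;
            s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        }
        $form{$name} = $value;
    }
    return \%form;
}

my $form = parse_form("title=midsummer+night%27s+dream&author=W%2ES");
print "$form->{title}\n";
print "$form->{author}\n";
```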



Conclusion

In this second article we have taken a close look at exactly how to safely get at the data a client is sending to a server from within your script (or from without it). All CGI scripts get data, manipulate it in some way and then send results back to the client. In the first article in this series we looked at that HTTP transaction itself and we also explored ways to send data back to the client. In this article we looked at how to retrieve data. In subsequent articles we may explore some of the things you can do with that data "in the middle", things like shopping cart systems, online databases and utilities of that sort.

About the author: SA Hayes is a writer, editor and tinkerer living in Northern California, for which he is grateful. When not working on computers Hayes spends his days building things and his nights driving away raccoons.
