Decoding Data
The client encodes the query string that contains the name/value pairs of
user-supplied data. Our next task is to decode that data and create our own
hash to contain it.
The data is encoded because the GET method appends key/value pairs to URLs.
To support this method a client must make sure that the data contains no
character that is not allowed in a URL, or else the entire mechanism of the
world wide web would grind to a smoldering halt. URLs must be represented in
US-ASCII plain text, so any characters that do not appear in that
specification, or that otherwise have special meanings in URLs, are encoded
according to the content type declared in the enctype attribute of the form
tag. The default encoding type is application/x-www-form-urlencoded.
This content type demands that any characters other than the basic 26 upper
and lower case letters, the digits and the characters $ - _ and . are
represented by a three character string beginning with a % sign and followed by
the hexadecimal notation of the number for that character in the US-ASCII
character table. Each space is translated into a + symbol, so a run of several
spaces becomes a run of several + symbols, and a literal + in the data is
itself written as %2B.
You can read more about URLs and how they are encoded in RFC 1738 at http://www.rfc-editor.org.
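To make the encoding rules concrete, here is a minimal sketch of what the client does before it sends the data. The helper name form_encode is hypothetical, and the set of characters left alone is an assumption taken from the description above; real browsers differ slightly in which punctuation they leave unencoded.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one value roughly the way a client encodes
# form data. The safe set below is an assumption based on the text above.
sub form_encode {
    my ($text) = @_;
    # Replace every character outside the safe set with % plus two hex digits.
    # Spaces are skipped here and handled on the next line.
    $text =~ s/([^A-Za-z0-9\$\-_. ])/sprintf("%%%02X", ord($1))/eg;
    # Each space becomes its own + sign.
    $text =~ tr/ /+/;
    return $text;
}

print form_encode("midsummer night's dream"), "\n";   # midsummer+night%27s+dream
print form_encode("1+1"), "\n";                       # 1%2B1
```

The apostrophe is character number 39, which is 27 in hexadecimal, hence %27.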
Understanding Hexadecimal Notation
To count in hexadecimal you need a method of notation for counting using base
16, rather than base 10. If you look at how numbers are represented using
digits in base ten, you start counting at 0 and indicate digits in the first
column of the notation. When you reach ten you indicate a single digit in the
second column and start over from 0 again in the first column.
| Zero | 0 |
| One | 1 |
| Two | 2 |
| Three | 3 |
| Four | 4 |
| Five | 5 |
| Six | 6 |
| Seven | 7 |
| Eight | 8 |
| Nine | 9 |
| Ten | 10 |
| Eleven | 11 |
The base of a number is what you count up to before you indicate a digit in
the second column. Trouble is we only have 0-9 for digits as our language is
based on base 10 numbering (maybe because we have ten fingers). So when we're
counting in base 16 we use the letters a-f as notation for the remaining
digits we need. The number eleven, for example, would be written as B.
Counting to seventeen in Base 16 looks like this:
| Zero | 0 |
| One | 1 |
| Two | 2 |
| Three | 3 |
| Four | 4 |
| Five | 5 |
| Six | 6 |
| Seven | 7 |
| Eight | 8 |
| Nine | 9 |
| Ten | A |
| Eleven | B |
| Twelve | C |
| Thirteen | D |
| Fourteen | E |
| Fifteen | F |
| Sixteen | 10 |
| Seventeen | 11 |
To find the hexadecimal notation for a given number you find the largest
multiple of 16 that fits into it and then indicate the remainder using the
table above. For instance, 72 divided by 16 is 4 (four sixteens are 64), which
leaves a remainder of 8. To write 72 in hexadecimal notation you would write
48, "four sixteens and eight remainder". Likewise to indicate the number 255 in
hexadecimal you would write FF, as 15 sixteens are 240, so there are 15
sixteens (F) in the second column, which leaves a remainder of 15 (F) in the
first column.
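The division-and-remainder method can be checked with a few lines of Perl. This is only a sketch: the helper name to_hex is hypothetical and it handles numbers from 0 to 255, which is all that URL encoding needs. Perl's built-in hex and sprintf functions do the same conversions directly.

```perl
use strict;
use warnings;

# Hypothetical helper: write a number from 0 to 255 as two hexadecimal digits.
sub to_hex {
    my ($number) = @_;
    my @digits    = (0 .. 9, 'A' .. 'F');
    my $sixteens  = int($number / 16);   # digit for the second column
    my $remainder = $number % 16;        # digit for the first column
    return $digits[$sixteens] . $digits[$remainder];
}

print to_hex(72), "\n";            # 48
print to_hex(255), "\n";           # FF
print hex("4A"), "\n";             # 74, going the other way with a built-in
print sprintf("%02X", 39), "\n";   # 27, the built-in equivalent of to_hex
```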
Here is the encoding at work. If your form contains a field called title and
the user enters "midsummer night's dream", the query string sent by the client
using application/x-www-form-urlencoded is:
title=midsummer+night%27s+dream
Perl can quickly translate all the + characters back to spaces with a simple
translate operator tr:
$scalar =~ tr/+/ /;
It's difficult to see, but there's a space between those second slashes. Perl
can translate the hexadecimal codes back for you in a single line:
$pair =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
We still remember how we felt when we first saw this line. No, you don't
really need to understand it, and this statement will probably be the most
mathematically complex line of Perl you will ever use in the vast majority of
scripts. Just type it in and go, as millions of people have gone before you.
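As a quick check, the two decoding lines can be run against the example query string from above. The variable name $pair follows the article; nothing else is assumed.

```perl
use strict;
use warnings;

my $pair = "title=midsummer+night%27s+dream";

$pair =~ tr/+/ /;                                             # + back to space
$pair =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;   # %27 back to '

print "$pair\n";   # title=midsummer night's dream
```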
Now that we have decoded the query string, all that remains to do is break
our @PAIRS array up into names and values indexed into a hash,
which we will do with a simple foreach loop.
foreach $pair (@PAIRS) {
    # The limit of 2 keeps any = signs that appear inside the value
    ($name, $value) = split /=/, $pair, 2;
    $F{$name} = $value;
}
Now we can access our user data by the name of the HTML form field. For example,
if our form contained an input field called "title" we could get the value of
that element in our script by reading $F{"title"}. You can see the complete
subroutine at the end of this section, but first a word about security.
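The whole journey from query string to hash can be sketched in one small function. The name parse_query and the sample author field are hypothetical. Note one deliberate variation: this sketch splits on & and = before decoding, so that an encoded & or = inside a value cannot be mistaken for a separator.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a raw query string into a hash of decoded fields.
sub parse_query {
    my ($query) = @_;
    my %F;
    foreach my $pair (split /&/, $query) {
        next if $pair eq '';                        # skip empty pairs
        my ($name, $value) = split /=/, $pair, 2;   # keep = signs in the value
        $value = '' unless defined $value;          # field sent with no value
        foreach ($name, $value) {
            tr/+/ /;
            s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        }
        $F{$name} = $value;
    }
    return %F;
}

my %F = parse_query("title=midsummer+night%27s+dream&author=W.+Shakespeare");
print $F{"title"}, "\n";    # midsummer night's dream
print $F{"author"}, "\n";   # W. Shakespeare
```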
$pair =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
Here's how it works for the curious. Often, the best way to decipher complex
statements like this is to pick them apart from the inside out. It's really
just one big fat regular expression of the substitution form: s///. The s
tells Perl to substitute what comes in between the first slash pair with what
comes between the second. The =~ binding operator says to apply this
substitution to whatever is in $pair and then fill $pair with the result.
To decipher this, look at what's between the first slashes, which is what
Perl is looking for in the pattern: %([a-fA-F0-9][a-fA-F0-9]). A
simple text match. Perl scans the line for any series of characters that
begins with a % sign followed by two other characters. A range of possible
matches defines those two other characters, which is what the [] brackets
indicate. Each of the two characters can be a lower case letter between a and
f, an upper case letter in the same range, or a digit between 0 and 9. This
matches the text notation of a hexadecimal number used by the URL encoding
scheme.
In between the second slashes is what Perl substitutes in the value when it
finds a match. The special variable $1 is always set to whatever was matched
between the first pair of parentheses ( ) in the pattern, and the hex
function converts that hexadecimal notation back to a regular decimal number.
That number then becomes the argument of the pack function. In this case pack
basically looks up that number in the ASCII table and returns the appropriate
US-ASCII character. The e on the end is a modifier to the entire regular
expression which forces Perl to evaluate the right hand side (the replacement
text) as a Perl expression (so the pack and hex functions get executed), while
the g modifier instructs Perl to make this match as many times as it can in
the target value, and not stop after the first time.
There. You can do this!
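For readers who like to poke at things, the following sketch takes one escape apart step by step and then shows what each modifier contributes. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# One escape, taken apart the way the substitution handles it.
my $code   = "27";                  # what $1 holds after matching %27
my $number = hex($code);            # hexadecimal 27 is decimal 39
my $char   = pack("C", $number);    # character number 39, an apostrophe
print "$code -> $number -> $char\n";   # 27 -> 39 -> '

my $encoded = "night%27s%20dream";

# With both modifiers: every escape is replaced by the character it names.
(my $decoded = $encoded) =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
print "$decoded\n";       # night's dream

# Without g: only the first escape is replaced.
(my $first_only = $encoded) =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/e;
print "$first_only\n";    # night's%20dream

# Without e: the replacement is treated as plain text, so pack and hex
# are never called and their names end up in the result.
(my $no_eval = $encoded) =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/g;
print "$no_eval\n";
```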
Security
Don't ignore security for CGI scripts -- it's so easy to make scripts secure,
while the danger of not doing so is very real. CGI scripts allow anybody on the
Internet to run a program on your machine, and to send input to that program. If
a malicious user sends input to your program and your program passes that
information to the operating environment (by passing user data to the shell or
another program) your CGI script could be used to do almost anything. If nobody
has hit your scripts with these kinds of attacks yet, then you are relying only
on safety in numbers.
The situation is made more onerous due to the fact that Linux (Unix) shells
allow easy ways to combine commands, and Perl makes it almost trivial to send
commands to other programs.
Security in CGI is a simple matter of checking user input for characters that
have special meaning in shell commands and being careful when passing user input
to external programs.
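One way of "being careful" deserves a quick illustration: Perl's list forms of system and open hand each argument straight to the external program, so no shell ever gets the chance to interpret the user's input. The sketch below is an assumption-laden demo, not part of the article's script: it uses the Perl interpreter itself (found through the $^X variable) as a stand-in for an external program, and the sample input is made up.

```perl
use strict;
use warnings;

# Input that would run a second command if a shell were to parse it.
my $user_input = "dream; echo INJECTED";

# List form of a piped open: each argument reaches the child program as-is.
# Here the child is perl itself, asked to print its first argument.
open(my $child, '-|', $^X, '-e', 'print $ARGV[0]', $user_input)
    or die "Cannot start child process: $!";
my $output = do { local $/; <$child> };   # read everything the child printed
close($child);

print "$output\n";   # dream; echo INJECTED
```

The semicolon arrives as ordinary text rather than as a command separator. This complements the input checks described here and does not replace them.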
The characters that have a special meaning are:
& ` ' " | * ? ~ > < ^ ( ) { } [ ] $
The space character is not dangerous.
Testing for Insecure Characters
This test causes our script to abort if the user supplies data that contains
any of those characters, or a newline:
if ($value =~ /[&`'\"|*?~><^(){}\[\]\$\n]/) {
    print_head("die", 500, "Bad Data");
}
Some of these shell meta characters are also meta characters for regular
expressions, so we escape those in our match with the \ mark. We include this
test after we have decoded the user input (as it cannot match these characters
when they are represented in hexadecimal form) and after we have divided our
query string into names and values (as the & symbol is used in the CGI
protocol, which is arguably a bug, believe it or not).
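Here is the same test wrapped in a small function so it can be tried on sample values. The function name has_shell_meta and the sample inputs are hypothetical; the character class assumes the backslash escapes described above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the value holds any of the shell
# meta characters listed above, or a newline, and 0 otherwise.
sub has_shell_meta {
    my ($value) = @_;
    return $value =~ /[&`'\"|*?~><^(){}\[\]\$\n]/ ? 1 : 0;
}

foreach my $value ("hamlet", "midsummer night's dream", "a|b", "cost\$5") {
    printf "%-26s %s\n", $value, has_shell_meta($value) ? "rejected" : "accepted";
}
```

Only "hamlet" is accepted; the apostrophe, the pipe and the dollar sign each trigger a rejection. One caution: the list does not include the semicolon or the backslash, which shells also treat specially, so a stricter check might add them.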
Of course, logic indicates you could test for these characters only when you
were about to pass user data out of your script, rather than testing every user
supplied value as we do. Our style is a teeny, tiny bit slower, but it's
complete, and we'd rather not pit our wits against the imagination of every
possible hacker in the world, nor do we want to have to remember all this
security stuff every time we talk to the shell or an external program. You may
have particular needs for some of these characters to be supplied by your user,
and may choose to delete them from the check. That's OK, so long as you are
aware of what you are doing, and why.
Testing for Secure Characters
Some CGI programmers argue that testing for legal data is more secure than
testing for bad data. What characters you choose to define as legal are up to
you, but take a breath before you define an insecure character as legal!
Our users should have no reason to send the script anything other than the
regular 26 upper and lower case characters of the English alphabet, the digits
0-9, tabs, a dash or the @ symbol (for their email address). We test for good
data this way:
if ($form_data !~ /^[a-zA-Z0-9\t\-\@]+$/) {
    print_head("die", 500, "bad data");
}
The \t symbol stands for tab, and the dash and the @ symbol are escaped with
the \ mark so that Perl reads them literally. The expression is preceded by
the ^ anchor, which tells Perl to start from the beginning of the string, and
is terminated by + (which tells Perl to match one or more of the items in the
range) and the $ symbol, which indicates the end of the string to Perl.
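The allow-list test can be tried the same way. The function name is_legal and the sample values are hypothetical; the pattern assumes the tab, dash and @ are written with backslashes as described above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 only if every character is a letter,
# a digit, a tab, a dash or an @ sign.
sub is_legal {
    my ($form_data) = @_;
    return $form_data =~ /^[a-zA-Z0-9\t\-\@]+$/ ? 1 : 0;
}

foreach my $value ('Hamlet-1603', 'jane@host', 'rm -rf', 'a|b') {
    printf "%-12s %s\n", $value, is_legal($value) ? "accepted" : "rejected";
}
```

The first two values are accepted and the last two rejected, the third because a space is not in the set. A full email address also contains a dot, which this set does not allow, so a form that collects one would need to add it. Note too that $ lets a single trailing newline through; Perl's \z anchor is stricter if that matters.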
It's a good idea to read more about security. One of the best places to start
is http://www.w3.org/Security/Faq/www-security-faq.html, the WWW
Security FAQ by Lincoln D. Stein, one of the true masters of the web.
Using Taint Mode
For the truly paranoid, Perl contains a built-in tool called "Taint" which
reports errors if your script passes unsecured data out of the Perl process.
You run your script in taint mode using the -T flag, either from the command
line or on the shebang line. In taint mode Perl treats all data coming from
outside the process as suspect. The only way to untaint the data is to use a
regular expression to match the string you are looking for within that data
and keep what the match captures. Thus, taint mode is only as good as your
ability to write a regular expression that matches secure strings. For that
reason we find taint to be overkill and not very useful for the majority of
CGI programmers, though of course others would disagree.
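For those who do want to try it, untainting looks roughly like this. The helper name untaint_title and the set of permitted characters are assumptions for illustration. The sketch runs with or without -T; under -T, the text captured in $1 is what Perl considers clean.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value if it contains only permitted
# characters, or nothing at all if it does not.
sub untaint_title {
    my ($raw) = @_;
    if ($raw =~ /^([a-zA-Z0-9 ]+)\z/) {
        return $1;    # text captured by a match counts as untainted under -T
    }
    return;           # anything else is refused
}

my $clean = untaint_title("midsummer nights dream");
print defined $clean ? "ok: $clean\n" : "refused\n";   # ok: midsummer nights dream

my $bad = untaint_title("dream`id`");
print defined $bad ? "ok: $bad\n" : "refused\n";       # refused
```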
The complete user decoding and security check subroutine
Our complete subroutine looks like this:
sub get_user_data {
    my ($name, $value, $pair, $form_data, @PAIRS);
    if ($ENV{'REQUEST_METHOD'} eq "GET") {
        $form_data = $ENV{'QUERY_STRING'};
    } elsif ($ENV{'REQUEST_METHOD'} eq "POST") {
        read(STDIN, $form_data, $ENV{'CONTENT_LENGTH'})
            or print_head("error", 500, "Read failed $!");
    } else {
        # 405 is the HTTP status for "Method Not Allowed"
        print_head("error", 405, "Client is using unsupported method");
    }
    # Split on & before decoding, so an encoded & cannot start a new pair
    @PAIRS = split (/&/, $form_data);
    foreach $pair (@PAIRS) {
        $pair =~ tr/+/ /;
        $pair =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        # The limit of 2 keeps any = signs that appear inside the value
        ($name, $value) = split /=/, $pair, 2;
        # Test the decoded value before storing it
        if ($value =~ /[&`'\"|*?~><^(){}\[\]\$\n]/) {
            print_head("die", 500, "Bad Data");
        }
        $F{$name} = $value;
    }
}
To use it with CGI scripts that require form input, we simply call it at the
start of our script
#!/usr/bin/perl
&get_user_data;
Conclusion
In this second article we have taken a close look at exactly how to safely get
at the data a client is sending to a server from within your script (or from
without it). All CGI scripts get data, manipulate it in some way and then send
results back to the client. In the first article in this series we looked at
that HTTP transaction itself and we also explored ways to send data back to the
client. In this article we looked at how to retrieve data. In subsequent
articles we may explore some of the things you can do with that data "in
the middle", things like shopping cart systems, online databases and
utilities of that sort.
About the author: SA Hayes is a writer, editor and tinkerer living in
Northern California, for which he is grateful. When not working on computers
Hayes spends his days building things and his nights driving away raccoons.