Writing CGI Scripts in Perl: Part 1: Getting Ready
by Simon Hayes, Linux.com
Monday July 23rd, 2001

 

Understanding the HTTP Transaction

HTTP is rather misleadingly named the Hypertext Transfer Protocol, since it is used all the time for much more than text. It is an "application layer protocol", which means that its purpose is to allow applications to communicate and pass information to each other over the network. For this reason HTTP is best understood as a conversation between a client and a server. That conversation is governed by certain rules: things that can be said and things that can't. How closely a client or server follows those rules determines how HTTP compliant it is said to be. The complete HTTP exchange is called a transaction, because the point is for the client and the server to exchange a resource.

Successful transactions have two parts, headers and bodies. The header is the conversation held between the client and server, while the body is the resource itself.

A simplified view of the header conversation might comprise multiple lines like this:

 


Open: 216.103.122.37 Thu 08 Jul 1999 02:23:41
Request: GET /myscript.cgi?ti=&au=king HTTP/1.0
Status: HTTP/1.0 200 OK
Close: Thu 08 Jul 1999 02:23:48

The first two header lines are sent by the client, the last two by the server.

All parts of these conversations are sent over the Internet in plain text, just like you see above. The conversation begins with the client identifying itself and giving the date of its request, followed by the actual request on a second line, which contains the URL of the resource the client wants, the method it wants to use for requesting the resource, and the version of HTTP it's using. HTTP supports many methods for transactions, but we'll only discuss two: GET and POST.

 

A port number can also be specified in the request. If no port is specified, then port 80 is assumed.

The server replies with its answer in the status line - first with the version of HTTP it will use, then a three digit numeric code to identify the type of answer it is sending, followed by a text string that explains that code. If everything goes well and the server has the requested resource available, the resource is sent as the body of the message. Finally the server closes the connection with a close line. The resource itself is always a stream of bytes, conforming to an Internet MIME type. The server puts a blank line between the last header line and the body of the transaction.
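The request line and the status line described above each have three parts, and a short Perl sketch can pull them apart. The subroutine names parse_request_line and parse_status_line are made up for this illustration and are not part of any library; the sample strings come from the exchange shown earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a request line such as "GET /myscript.cgi?ti=&au=king HTTP/1.0"
# into its method, URL and protocol version.
sub parse_request_line {
    my ($line) = @_;
    if ($line =~ m{^(\S+)\s+(\S+)\s+HTTP/(\d+\.\d+)\s*$}) {
        return (method => $1, url => $2, version => $3);
    }
    return;    # empty list when the line does not look like a request
}

# Split a status line such as "HTTP/1.0 200 OK" into the version,
# the three digit code and the explanatory text.
sub parse_status_line {
    my ($line) = @_;
    if ($line =~ m{^HTTP/(\d+\.\d+)\s+(\d{3})\s+(.*?)\s*$}) {
        return (version => $1, code => $2, reason => $3);
    }
    return;
}

my %request = parse_request_line("GET /myscript.cgi?ti=&au=king HTTP/1.0");
my %status  = parse_status_line("HTTP/1.0 200 OK");

print "method=$request{method} url=$request{url} version=$request{version}\n";
print "code=$status{code} reason=$status{reason}\n";
```

Both routines return an empty list when the line does not match, so a caller can test the result before using it.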

Of course, in reality it's a little more complicated than this. Other headers are sent in this transaction, and elaborate rules govern everything from what headers can be sent and when, to the exact number of spaces in a date string. Unless you are coding your own web server (or client) these details are not very important.

You can read the entire HTTP 1.0 specification at http://www.w3.org/Protocols/HTTP/HTTP2.html, though be aware that this specification is out of date, having originally been implemented in 1992. Most clients and servers today are switching to the HTTP 1.1 specification (http://www.rfc-editor.org/rfc/rfc2616.txt). HTTP 1.0 is a lot easier to understand than HTTP 1.1, so we recommend that you read that document first as an introduction if this is all new to you.

Meta Information Headers

In a sense all header lines describe the nature of the body of an HTTP transaction, and are therefore 'meta' information. But usually people refer to meta headers as those headers included in the transaction which are not a part of the basic mechanics of making the request, but rather provide additional information about the content sent in the body of the message.

Meta information headers can contain any information at all, and the exact collection you will see depends on your particular server and client pair. These header lines are placed into environmental variables (as we'll explore throughout this article), but some of the more important ones are detailed here.

The content-type Header

One of the most important meta headers a server sends to a client is the content-type. This header identifies the kind of content the server is sending in the response, and without this header some clients will be unable to display the content at all. In other words the content-type header describes what kind of content the stream of bytes that makes up the body of the transaction is supposed to be, whether an image, formatted text or whatever.

All content on the Internet is identified by MIME type. When a server sends HTML documents it is able to identify the MIME type all by itself by various means, usually file extension. But when you use a CGI script to output HTML you must tell the server explicitly the MIME type of the data you are about to send by printing this header line.

Thus simple output from a server will typically look something like this:

 


HTTP/1.0 200 OK
Content-type: text/html

<html><head><title>My page</title>
....
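The blank line between the headers and the body in the output above is what lets a client tell the two apart. As a rough sketch of that idea, the hypothetical subroutine split_response below (a name invented for this example) cuts a response at the first blank line. It is only an illustration, not how any particular browser does it.

```perl
use strict;
use warnings;

# Split raw response text at the first blank line: everything before it
# is header lines, everything after it is the body.
sub split_response {
    my ($raw) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    my @header_lines = split /\r?\n/, $head;
    return (\@header_lines, defined $body ? $body : "");
}

my $raw = "HTTP/1.0 200 OK\n"
        . "Content-type: text/html\n"
        . "\n"
        . "<html><head><title>My page</title>\n";

my ($headers, $body) = split_response($raw);
print "header: $_\n" for @$headers;
print "body starts with: ", substr($body, 0, 6), "\n";
```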

Why are Headers Useful for CGI scripts?

Every time a script is run in Linux (or Unix) it executes within something called a process, which is really just a fancy way of referring to a series of connected instructions in a CPU. A process runs within an environment, which is the name given to describe the properties of that process. Environmental variables are set by the operating system and contain all kinds of information made available to your process.

In addition, a Perl process has three important handles set automatically that are known as STDIN, STDOUT and STDERR ('standard in', 'standard out' and 'standard error'). Handles are just a way of describing an input/output (I/O) stream of bytes to or from a process. These handles refer to the place a Perl process looks for data, the default place it sends its data, and the default place it sends any errors.

When a Perl process is invoked by a web server the STDOUT is set automatically to the server. In other words, to output back to the web server you just print the information you want sent, and Perl sends it to the server by default. Of course, you can send output to other places and override these defaults, as you might need to when saving data to a file, for example.

When a server sees that the requested resource is a script to run it feeds information from the HTTP headers into that script's process. The server puts most of that information into environmental variables, except in the case of a POST request when information is also fed into the program's STDIN handle.

And that, in a nutshell, is the mechanism of CGI - how a client is able to pass information to a server, a server is able to pass that information along to a script and get a response back.
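To make that mechanism concrete, here is a minimal sketch of how a script might collect its input. REQUEST_METHOD, QUERY_STRING and CONTENT_LENGTH are standard CGI environment variables; the subroutine name read_form_data is an assumption made for this example, and the environment is passed in as a hash reference so the code can be tried from the command line without a server.

```perl
use strict;
use warnings;

# Return the raw form data for a request. A real CGI script would call
# this as read_form_data(\%ENV, \*STDIN).
sub read_form_data {
    my ($env, $in) = @_;
    my $method = $env->{REQUEST_METHOD} || "GET";

    if ($method eq "POST") {
        # POST data arrives on standard input; CONTENT_LENGTH says how much.
        my $length = $env->{CONTENT_LENGTH} || 0;
        my $data   = "";
        read($in, $data, $length) if $length > 0;
        return $data;
    }

    # GET data is whatever followed the ? in the URL.
    return defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : "";
}

# Simulated GET request, using the query string from the earlier example.
my %get_env = (REQUEST_METHOD => "GET", QUERY_STRING => "ti=&au=king");
print "GET data: ", read_form_data(\%get_env, \*STDIN), "\n";

# Simulated POST request, reading the body from an in-memory file handle.
my $post_body = "ti=&au=king";
open(my $fh, "<", \$post_body) or die "cannot open in-memory handle: $!";
my %post_env = (REQUEST_METHOD => "POST", CONTENT_LENGTH => length($post_body));
print "POST data: ", read_form_data(\%post_env, $fh), "\n";
close($fh);
```

The data is returned undecoded; splitting it into names and values is a separate step.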

Non-parsed Headers

Unlike static pages, CGI scripts always print out some HTTP header lines - at a minimum the content-type line. If your server supports it you also have the option to print out the entire HTTP header from a CGI script. The principal advantage of doing this is speed - if the server doesn't have to scan all the output and add headers, the resource gets returned to the client that much faster. In addition, different servers have different buffers and caching mechanisms, so if your script does server push countdowns or animations you may need to override that scanning of headers and send data to a client directly.

Returning your own complete headers with non-parsed-header scripts is a complex business which requires a fairly complete understanding of the HTTP protocol, and is a task made all the more difficult by the rigorous requirements of HTTP 1.1.

To find out more about non-parsed headers consult your server documentation.

Get Ready to Program

Your script is just a text file that contains Perl code. You can use any editor to write that code. You need to put your script where the server can find it, and different servers and different ISPs have various policies on what that means. Some ISPs require all CGI scripts to live in a special directory, while others allow you to simply name the file with a certain extension (often .cgi). Other servers identify CGI scripts by MIME type, believe it or not. Ask your ISP what their policy is. Of course if you are running your own server and Linux workstation you'll know where to put the files by consulting your server documentation.

There are a few things you need to be aware of when getting set up to program CGI that do not vary (too much) from server to server, and we'll address those here.

The shebang Line

When the server is instructed to execute the script and send the results back to the client, the code you write within the script is compiled by the Perl engine and executed. Therefore the first thing you have to do is indicate which engine you want to use to run your script. Do this with the "shebang" line. The very first line of your script must always be the path to the Perl engine (or interpreter) on your Unix system, preceded by a # sign and an exclamation point.

 

You might also see this called the magic line, or "a bit of magic". People say this because this method of finding the correct interpreter is actually a hack, done by somebody so long ago that it has become standard. Like many classic hacks it's almost impossible to understand or describe why and how it works, so people say it's magic instead, and, well, it basically is.

You can find the path to your Perl engine by typing "which perl" at the command prompt of your system. For instance, if you typed:

 


%which perl

and the system prints:

 


/usr/bin/perl

your shebang line - the very first line of the script - would be:

 


#!/usr/bin/perl

Setting Permissions

If you're creating your CGI script on a multi-user operating system (like Linux) be aware that your server also has certain privileges in the system. In other words, your server itself is like a user, and different users can run different programs based on the permissions their userid is allowed. In order to be run by the server your script will need its permissions set such that the server's userid and group have permission to execute it. Any directories the script lives within, and any files the script must access also need their permissions set. On Linux use the chmod command to make sure your script has execute permission for you and everybody else, including the server.

 


%chmod a+x my_script.cgi

%

Although all scripts require the execute permissions above, the exact permissions you may need will depend on the policies and settings of your ISP. Ask them; if they don't know or cannot answer your question, get a new ISP immediately!
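Permissions can also be set and inspected from within Perl using the built-in chmod and stat functions. The sketch below assumes a Unix-like system and uses a temporary file in place of my_script.cgi; permission_bits is a helper name invented for this example. Note that 0755 is one common setting, slightly more specific than the a+x shown above, and your ISP may want something different.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return the permission bits of a file as a number (for example 0755).
sub permission_bits {
    my ($path) = @_;
    my @info = stat($path) or die "cannot stat $path: $!";
    return $info[2] & 07777;
}

# Create a throwaway file standing in for my_script.cgi.
my ($tmp_fh, $script) = tempfile(UNLINK => 1);
print $tmp_fh "#!/usr/bin/perl\nprint \"cheese\";\n";
close($tmp_fh);

# 0755: owner may read, write and execute; group and others may read and execute.
chmod(0755, $script) or die "chmod failed: $!";

my $mode = permission_bits($script);
printf "permissions are now %04o\n", $mode;
print "everybody may execute it\n" if ($mode & 0111) == 0111;
```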

 

You may also see this described as "setting the bundle bit". Don't let that bother you.

Test your Script

To test to see if your script is currently working add this statement to your script:

 


#!/usr/bin/perl
print "cheese";
exit;

Save your file, set your permissions (if you haven't already) and type from the command prompt:

 


%perl my_script.cgi

If it prints cheese back at you, then your script is working. Of course, this will not tell you if your script is working as a CGI script; we'll show you how to test that in a moment.

 

All Perl statements end with the ; character. It's the first thing to look for if your script stops working.

Checking your Script's Syntax

Perl also has a built-in tool to check the syntax of your script, which can be very useful for finding errors as the script gets more complex. You use this tool by invoking perl with the -c command line flag:

 


%perl -c my_script.cgi

my_script.cgi syntax OK

Planning your Work

For most people the development of a script is a give-and-take process of testing and refinement. From the first line to the time you are done you could run your script hundreds of times to see what it outputs and track down bugs. When developing your scripts it is essential to examine their output, as that's what you're trying to do here, after all: output stuff.

As we said before, most CGI scripts have three distinct parts: understanding and securing user input, processing and sorting data, and sending back results. Each of these parts of a script has its own particular concerns and challenges; we find it useful to be conscious of these areas when creating our scripts. Planning your architecture should help you keep everything clear and simple in your head, and make sure your script covers all the bases.

In this article we'll cover setting up a functional CGI script that can understand what is coming in and send back results. We'll look in detail at what happens in the middle - processing and sorting data - in the next few articles.

Sending Back Results

The point of a CGI script is to send back a response to a client. Any checking of user input or processing of data is pointless if you can't do the last part of what a CGI script must do. The first thing we always do is make sure that our script is sending back results correctly. We'll do this by creating a couple of useful tools that will aid us as we work on the rest of the scripts in these articles.

Once you are sure your server can find and execute the script, and the script is functioning correctly, it's time to print some information out to the client. Add one line to your simple script to make sure the server can understand your output:

 


#!/usr/bin/perl

print "Content-type: text/html\n\n";
print "cheesetoast";
exit;

Now load up your script in a web browser by typing its URL into the location bar of your web browser.

Once you press Enter, it should print the exciting word "cheesetoast" into your web browser. Right on! You've just written and executed a CGI script.

The first print statement in our simple CGI script is the HTTP header line we mentioned before. This line identifies the output MIME type to the server and browser; your script will not work without it. Notice that it has two line breaks after the text, indicated by the meta characters for "new line" (\n). The extra line break tells the web server where the header information ends. Double-quoted strings in Perl are interpolated, meaning the string is scanned for characters that have special meanings to Perl, like variables and meta characters. If you wanted "\n" to be printed as a text value (not a new line) you would put the string in single quotes.
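The difference between the two kinds of quotes is easy to see by printing the same text both ways. This is a small sketch; the variable names are chosen only for the example.

```perl
use strict;
use warnings;

my $food = "cheesetoast";

my $double = "I like $food\n";   # the variable and the \n are both interpreted
my $single = 'I like $food\n';   # kept exactly as typed

print $double;
print $single, "\n";
print "double-quoted length: ", length($double), "\n";
print "single-quoted length: ", length($single), "\n";
```

In the double-quoted string the \n counts as one character (a new line); in the single-quoted string it stays as two characters, a backslash and an n.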

 

You can output any content type you like, but for the client to understand it must be a type the client is willing to accept. You can check those types by looking in the HTTP_ACCEPT environmental variable.
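As a rough sketch of such a check, the hypothetical subroutine client_accepts below (not part of any library) looks for a MIME type in an Accept value. It assumes a simple comma-separated list, understands the type/* and */* wildcards, and ignores quality values such as ;q=0.8. In a real script the value would come from $ENV{HTTP_ACCEPT}.

```perl
use strict;
use warnings;

# Report whether an Accept header value allows the given MIME type.
sub client_accepts {
    my ($accept, $wanted) = @_;
    my ($wanted_major) = split m{/}, $wanted;

    foreach my $raw (split /,/, $accept) {
        my $entry = lc($raw);
        $entry =~ s/;.*$//;           # drop parameters like ;q=0.8
        $entry =~ s/^\s+|\s+$//g;     # trim surrounding whitespace
        return 1 if $entry eq "*/*";
        return 1 if $entry eq lc($wanted);
        return 1 if $entry eq lc("$wanted_major/*");
    }
    return 0;
}

# A made-up Accept value for illustration.
my $accept = "text/html, image/gif, image/jpeg;q=0.8";
foreach my $type ("text/html", "image/png") {
    my $verdict = client_accepts($accept, $type) ? "accepted" : "not accepted";
    print "$type: $verdict\n";
}
```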

Creating a Header Subroutine

Since we'll have to send a similar header from all our CGI scripts it makes sense to put the output of that header into a subroutine. This way we can write the code to output the header once, and then just call the subroutine whenever we're ready to start sending data.

A user-created subroutine in Perl is defined with the sub statement and a block:

 


sub print_head {
	...
}

We can call this subroutine in a number of ways, although we'll only use two in our simple scripts:

 


print_head();
&print_head;

The subroutine to print the correct headers might look something like this:

 


sub print_head {
	print "Content-type: text/html\n\n";
	print "<html><head><title>My Output</title>\n";
	print "</head><body>\n";
}

Now, whenever we want to begin sending output we just call the subroutine:

 


&print_head;

from anywhere within the program and the block of code within the routine prints the start of our output back to the server.

 

Because Perl scans all double-quoted strings for special characters, when it finds a second double quote it thinks it has come to the end of the string to scan, so you need to escape the quote character. Escape special characters with a backslash before them.
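HTML attributes are the usual place this comes up in CGI output. The sketch below shows three ways to write the same line; the link text and file name are made up for the example.

```perl
use strict;
use warnings;

# Three ways to produce the same HTML containing double quotes.
my $escaped = "<a href=\"index.html\">home</a>\n";     # backslash before each quote
my $single  = '<a href="index.html">home</a>' . "\n";  # single quotes need no escapes
my $qq      = qq{<a href="index.html">home</a>\n};     # qq{} behaves like double quotes

print $escaped, $single, $qq;
print "all three are identical\n" if $escaped eq $single and $single eq $qq;
```

The qq{} form is handy when a string contains many quotes but still needs variables or \n to be interpreted.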

Using the Print Block

We can avoid typing all those print statements by putting the content to print within a print block:

 


sub print_head {
	print <<EOF;
Content-type: text/html

<html><head><title>My Output</title>
</head><body>
EOF
}

Perl prints everything between the first EOF and the last EOF. You can use any word in place of EOF, which is only a common acronym for 'end of file'. Notice the blank line after the Content-type header, which has the same effect as the extra \n we used before.

Passing Data to our Subroutine

Our subroutine would be more useful if we could customize its output; we want different pages to have different titles, for example. We can pass information to our subroutine by invoking it with arguments:

 

print_head("Search Results");

We access those arguments inside our subroutine by looking in the @_ array. In Perl arrays of this kind are keyed by item number, starting from zero. In this instance, @_ will contain only one item, which we'll read directly by looking at the array in a variable (or scalar in Perl-speak) context, and assigning that lookup to another scalar:

 

sub print_head {
	my $title = $_[0];
	....
}

The = operator tells Perl to assign the value of $_[0] (the first item in the @_ array) to the variable $title. We declare $title with my, which tells Perl to use this value of $title only within this block of code, freeing up the variable name to contain another value elsewhere in our script. Localizing scalars (or controlling their scope, as programmers say) is a good idea, particularly in CGI scripts. You'll be doing a lot of variable assignments with the same data in all kinds of places; if you don't scope your variables your scripts will end up littered with hundreds of different variable names that hold the same data.

Scalars are variables that hold a single value.
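A short sketch shows what that scoping buys you: the $title inside the subroutine and the $title outside it are two separate variables. The subroutine name page_title is made up for this example.

```perl
use strict;
use warnings;

my $title = "Home";              # visible in the rest of the file

sub page_title {
    my $title = $_[0];           # a separate $title, private to this block
    $title = uc($title);         # changes only the private copy
    return $title;
}

print page_title("cart"), "\n";  # the private copy, upper-cased
print $title, "\n";              # the outer $title is untouched
```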

So our header subroutine looks like this:

 

sub print_head {
	my $title = $_[0];
	print <<EOF;
Content-type: text/html

<html><head><title>$title</title>
</head><body>
EOF
}

Now we can call our subroutine and print out the correct headers from anywhere in our program with a customized title:

 

print_head("cart");

 

You can use other values for arguments when calling a subroutine: $newtitle = "cart"; print_head($newtitle);

Final Note on Subroutines

User-defined subroutines always return a value, like any other function. If no return value is specified, the value of the last expression evaluated in the subroutine is returned. This value can usually be read as either "true" or "false". Using the return keyword you can specify any value as the return value, allowing you to test whether your subroutine actually worked the way it was supposed to.
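Both behaviours can be seen in a small sketch. The subroutine names check_title and add_two are invented for this illustration.

```perl
use strict;
use warnings;

# Explicit return: false when no title was supplied, true otherwise.
sub check_title {
    my ($title) = @_;
    return 0 unless defined $title and length $title;
    return 1;
}

# No return statement: the value of the last expression evaluated
# (here, the addition) becomes the return value.
sub add_two {
    my ($number) = @_;
    $number + 2;
}

print "title ok\n"      if check_title("cart");
print "title missing\n" unless check_title("");
print "add_two(40) returned ", add_two(40), "\n";
```

Testing the return value in an if or unless, as in the last lines, is how a caller finds out whether the subroutine did its job.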

Sending Data to the Client

Recall that when a process is started three file handles are automatically created. The STDERR file handle is where Perl sends error explanations, and it is directed to your server's error log file. So, if your script crashes you will not be able to find out what went wrong without digging through your server log file.

Scripts also have user errors - perhaps your script cannot continue because the user did not enter their zip code, for example. A very useful tool is to modify our print_head function so that we can use it to test for script or user errors and send them back to the client (so we can see them easily). We do this by having our script output an additional header line into the HTTP transaction: the Status line.

Netscape and most other browsers will simply display a blank page unless you return a resource (which contains your error string) before the exit statement. For those clients you can return a formatted web page that includes your error string.

If you send formatted errors in response to client requests your users will be able to see what went wrong with your script's execution, and so will you. In the case of user errors, this is exactly what you want. In the case of script errors, it may not be. In either case you'll want to put some thought into your error strings: make them short but informative.

To modify our print_head subroutine we need to pass two additional arguments, the HTTP status code and a string describing the error. We separate these arguments with commas:

 


print_head("die", 400, "User Error");

In our subroutine we grab three values from the @_ array and assign them to local scalars.

 


sub print_head {
	my $title = $_[0];
	my $header_code = $_[1];
	my $header_string = $_[2];
	...
}
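Perl also lets you unpack all the arguments in one list assignment, which many programmers find easier to read than indexing @_ one item at a time. This is an alternative idiom rather than what the listing above does; describe_args is a hypothetical name used only to show the result.

```perl
use strict;
use warnings;

# The same three arguments, unpacked in a single list assignment.
sub describe_args {
    my ($title, $header_code, $header_string) = @_;
    return "title=$title code=$header_code string=$header_string";
}

print describe_args("cart", 200, "OK"), "\n";
```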

We'll want to test the header code to determine if the subroutine should return any content to this HTTP request. The HTTP specification says that any status code in the 200 range indicates a successful request, so we use a regular expression test to find out if $header_code begins with a 2.

 


if ($header_code =~ /^2/) {
...
 }

The if statement in Perl makes a truth test and asks "is this statement I'm being asked to evaluate true?" If true, the if operator executes the code within its block, and if false, it doesn't. In this case the statement is an expression which asks "does the value of the scalar $header_code contain a 2 at the beginning?" The =~ indicates that the test should be applied to the scalar $header_code, and the / / delineate the pattern to match. The ^ tells Perl to match said pattern only at the beginning of the value we're testing against. We'll talk more about regular expressions like this later on, as truth tests combined with expressions are at the heart of almost everything we do in CGI.
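To see the effect of the ^ anchor, the sketch below runs the same test over a handful of status codes. The subroutine name is_success is invented for this example; note that 120 contains a 2 but does not begin with one.

```perl
use strict;
use warnings;

# True when the status code starts with a 2, as in 200 or 204.
sub is_success {
    my ($header_code) = @_;
    return $header_code =~ /^2/ ? 1 : 0;
}

foreach my $code (200, 204, 404, 500, 120) {
    my $verdict = is_success($code) ? "success" : "not success";
    print "$code: $verdict\n";
}
```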

 

Of course, testing for "truth" begs the question "what is true?". The answer to this is either intuitive or frustrating depending on your training. In Perl, everything is true except for an undefined value, an empty value ("") and zero, whether written as the number 0 or the string "0". Note that a value containing only a space (" ") is true. Truth is not the same as definedness: you can define a scalar as empty, for example. This can easily get confusing, and Perl gurus recommend you choose to test either for truth or for definedness, not both. In CGI scripts we recommend you stick to the truth, the whole truth and nothing but the truth.
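The quickest way to get a feel for these rules is to run a few values through a truth test. The sketch below does that and also reports whether each value is defined; the subroutine name truth and the list of samples are chosen only for this illustration.

```perl
use strict;
use warnings;

# Label a value as true or false the way an if test would see it.
sub truth {
    my ($value) = @_;
    return $value ? "true" : "false";
}

my @samples = (
    [ 'empty string ""', ""       ],
    [ 'string "0"',      "0"      ],
    [ 'number 0',        0        ],
    [ 'undef',           undef    ],
    [ 'space " "',       " "      ],
    [ 'string "0.0"',    "0.0"    ],
    [ 'string "cheese"', "cheese" ],
);

foreach my $sample (@samples) {
    my ($label, $value) = @$sample;
    my $defined = defined $value ? "defined" : "undefined";
    print "$label is ", truth($value), " and $defined\n";
}
```

The empty string and both forms of zero are false yet defined, which is exactly the distinction between truth and definedness.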

If we call this subroutine with a 200 status code then we'll want to continue with our script, so we put the rest of our existing header into the if's code block:

 

sub print_head {
	my $title = $_[0];
	my $header_code = $_[1];
	my $header_string = $_[2];
	
	if ($header_code =~ /^2/) {
		print <<EOF;
Content-type: text/html

<html><head><title>$title</title>
</head><body>
EOF
	}
}

Now that we have made this truth test, anything that doesn't match it will contain an HTTP status code that indicates an error (no more information coming). We use the else keyword to expand the if test and include "everything else" that failed the if test.

 

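sub print_head {
	my $title = $_[0];
	my $header_code = $_[1];
	my $header_string = $_[2];

	if ($header_code =~ /^2/) {
		print <<EOF;
Content-type: text/html

<html><head><title>$title</title>
</head><body>
EOF
	}
	else {
		print <<EOF;
Status: $header_code $header_string
Content-type: text/html

<html><head><title>$title</title>
</head><body>
<p>$header_string</p>
</body></html>
EOF
		exit;
	}
}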

Once we're done sending our HTTP error header to the client we instruct Perl to abort our script (not output anything else) by calling the exit function.

Here's how we might use our subroutine to test whether Perl was able to open a file for reading. If the open fails, our script will print a 500 server error to the client (with our custom error string) and then abort.

 

$filepath = "./mytextfile.txt";
open(FILE, "<$filepath") or 
	print_head("error", 500, "Faile
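The same open-or-report-an-error pattern can be sketched in a form that is easy to try from the command line. Everything here is an assumption made for illustration: the subroutine name read_file, the wording of the error string, and the file name no_such_file.txt. Instead of printing headers and exiting, it hands back a status code and message that a caller could pass on to an error routine such as print_head.

```perl
use strict;
use warnings;

# Try to read a file. On success return its contents; on failure return
# undef plus a status code and an error string.
sub read_file {
    my ($filepath) = @_;
    open(my $fh, "<", $filepath)
        or return (undef, 500, "Cannot open $filepath: $!");
    local $/;                      # slurp mode: read the whole file at once
    my $contents = <$fh>;
    close($fh);
    return ($contents);
}

my ($contents, $code, $message) = read_file("./no_such_file.txt");
if (defined $contents) {
    print "read ", length($contents), " bytes\n";
}
else {
    print "would report: $code $message\n";
}
```

Whether to show the file path to the user is a judgment call; on a public site a shorter message may be the safer choice.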




This article comes from osforge.com
http://www.osforge.com

The URL for this story is:
http://www.osforge.com/viewtutorial23.html