Web programming

Units WEB1P and WEB2P

Passing data from an HTML page to a program

It is important to know how the information submitted in an HTML form or in an HTML link is passed through the HTTP and CGI interfaces to a program.

HTTP level (browser to server)

Data is passed from a form to a server as parameters of the request. This can be done in two ways:

  1. The parameters can be tacked onto the request URI in the form of a query string, e.g.

    http://www.weather.co.uk/forecast?city=Portsmouth&day=today

    which would be passed in a request as

    GET /forecast?city=Portsmouth&day=today HTTP/1.1

    The query string starts with a question mark (?). Name/value pairs are separated from one another by ampersands (&), and within each pair the name and value are separated by an equals sign (=).

  2. The parameters can be sent as part of the request message body, e.g.
    POST /forecast HTTP/1.1
    ... [headers]

    city=Portsmouth&day=today

    Note the blank line separating the headers from the body.

In both cases, names and values must be URL-encoded: unsafe characters (spaces, question marks, ampersands and most other non-alphanumeric characters) must be converted into safe ones. A space is converted into a plus sign (+); other characters are represented by a percent sign (%) followed by two hexadecimal digits giving their ASCII code, e.g. %26 for an ampersand. The percent-encoding is specified by RFC 1738 (http://www.ietf.org/rfc/rfc1738.txt); the use of plus for space comes from the form encoding (application/x-www-form-urlencoded) defined in the HTML specification.
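As a rough illustration of these encoding rules, the following Python sketch uses the standard library's urllib.parse module. The form values are invented for the example, and Python is used here only as a convenient way to try the rules out; the scripts discussed later are written in Perl.

```python
from urllib.parse import quote_plus, unquote_plus, urlencode

# Hypothetical form values containing characters that are unsafe in a URL.
params = {"city": "St Albans", "note": "rain & wind?"}

# quote_plus encodes a single name or value: a space becomes "+",
# other unsafe characters become "%" followed by two hex digits.
encoded_value = quote_plus(params["note"])
print(encoded_value)                 # rain+%26+wind%3F

# urlencode encodes every name and value and joins the pairs with "&".
query = urlencode(params)
print(query)                         # city=St+Albans&note=rain+%26+wind%3F

# unquote_plus reverses the encoding.
print(unquote_plus(encoded_value))   # rain & wind?
```

Because the ampersand inside the value is encoded as %26, it cannot be mistaken for the ampersand that separates one name/value pair from the next.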

Note that with method 1, the user sees the parameters in the URL displayed in their browser's address bar. They do not see parameters submitted in the message body by method 2. While browsers and servers are not supposed to place a limit on the length of a URL, it is safest not to rely on a URL longer than 255 characters - hence a query string is not the way to convey potentially large quantities of data.

If the HTTP method is GET, the parameters are passed in the query string. If the HTTP method is POST, the parameters are passed in the message body (though in principle a POST request could carry some in the query string as well).
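To make the difference concrete, here is a minimal Python sketch that lays out the same parameters as a GET request and as a POST request. The function name build_request is hypothetical, and the headers shown (Host, Content-Type, Content-Length) are an assumption about what a typical browser would send; the original example leaves the headers unspecified.

```python
from urllib.parse import urlencode

def build_request(method, path, params, host="www.weather.co.uk"):
    """Return the text of a minimal HTTP/1.1 request carrying params."""
    data = urlencode(params)
    if method == "GET":
        # Parameters travel in the query string on the request line.
        lines = [f"GET {path}?{data} HTTP/1.1", f"Host: {host}", "", ""]
    else:
        # Parameters travel in the body, after the blank line.
        # The headers here are assumed, not taken from the original notes.
        lines = [
            f"POST {path} HTTP/1.1",
            f"Host: {host}",
            "Content-Type: application/x-www-form-urlencoded",
            f"Content-Length: {len(data.encode('ascii'))}",
            "",
            data,
        ]
    return "\r\n".join(lines)

params = {"city": "Portsmouth", "day": "today"}
get_request = build_request("GET", "/forecast", params)
post_request = build_request("POST", "/forecast", params)
print(get_request)
print(post_request)
```

In the POST version the Content-Length header tells the server how many bytes of body to expect, which matters again at the CGI level.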

If you click on a link (HTML <A> tag) in a page, the browser sends a GET request. If you use a form in an HTML page, you can specify (via the "method" attribute of the <FORM> tag) which method to use.

Which method should you use in a given situation? Section 9.1 of the HTTP specification (http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.1) says that, by convention, GET (and HEAD) should only be used for information retrieval. POST should therefore be used for actions that may have unexpected significance or be "unsafe". GET also has the property of "idempotence": aside from error or expiration issues, the side-effects of multiple identical requests are the same as for a single request. POST should therefore also be used for actions that might have incremental effects (e.g. an online order, where making the request twice gets you two items, not one).
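The distinction can be sketched with a toy example in Python. The ToyShop class and its method names are hypothetical stand-ins for a server-side program; no real HTTP is involved.

```python
class ToyShop:
    """In-memory stand-in for a server-side resource (hypothetical)."""

    def __init__(self):
        self.orders = []

    def handle_get(self):
        # Retrieval only: no state changes, so repeating it is harmless.
        return len(self.orders)

    def handle_post(self, item):
        # Each call adds another order, so repeating it has an
        # incremental effect.
        self.orders.append(item)
        return len(self.orders)

shop = ToyShop()
shop.handle_post("umbrella")
shop.handle_post("umbrella")   # an accidental resubmission
print(shop.handle_get(), shop.handle_get())   # 2 2
```

Repeating the GET handler leaves the shop unchanged, whereas the repeated POST has placed two orders.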

CGI level (server to program)

The rules for CGI are described on page 482 of Stein. More details are in the CGI specification: http://hoohoo.ncsa.uiuc.edu/cgi/. To illustrate how this works in practice, let's look at two versions of the same program that show how a CGI script receives its input.

Both versions are generated by PerlBuilder, using its CGI Wizard. The HTML form used as input can be found at mynameis.htm. You can download the original source of version 1 at mynameis1.pl and version 2 at mynameis2.pl.

To make the following notes easier to follow, a printable version of the programs (with line numbers) is also available. Version 1 is in mynameis1.pdf; version 2 in mynameis2.pdf. You might want to print those off before continuing.

Even if you don't know Perl, hopefully the program is sufficiently clear that you will be able to follow it.

Look at mynameis1 first.

        Each of the different sections generated by the CGI Wizard is identified by "AUTOGEN"/"ENDAUTOGEN" comment blocks. You should never alter the code between those lines, but you can add code outside those sections.

        The logic of this script (as with any other generated by the CGI Wizard) is as follows:

·       first, get the input submitted by the form (lines 9-19)
·       secondly, validate it - in this case there is no input validation
·       thirdly, log the input - again in this case, no logging is done
·       fourthly, send any email messages - again, no emails are sent in this example
·       fifthly, produce HTML output (lines 36-50)

        The key to getting access to the input submitted by the form is the subroutine GetFormInput. It is the first thing called (line 9). Its body can be found on lines 59-96. It is automatically generated by the CGI Wizard, and exactly the same subroutine is generated for every HTML form you process with the wizard.

The important bits of GetFormInput relating to CGI are:

        Line 64 looks at the environment variable REQUEST_METHOD to see whether the request made was GET or POST.

        If the request is a POST, a block of text is read in from the standard input (line 65). The web server arranges to write the data sent by the form to this channel. The amount of data to be read is determined by the CONTENT_LENGTH environment variable.

        If the request isn't a POST (i.e. it is probably a GET), the text is read from the QUERY_STRING environment variable (line 68).

        Either way, by line 70 the variable $buf contains the data submitted by the form, but it is still encoded. If there is no input, lines 70-71 simply return - there is no more work to do.

        Assuming there is input (line 74), an example of the input would look like this:

name=Jim&gender=male&likewpss=1&agegroup=31-40

        Line 74 splits that text up at the ampersand ("&") symbols, so that we have a list (@fval) of distinct input name and value pairs.

        The foreach loop starting on line 75 iterates through that list.

        It first of all splits the variable name from the value (line 76).

        Lines 77-80 decode any special characters that appear in the variable value (77-78) or name (79-80). Don't worry too much about how this is done.

        Lines 82-86 place the value obtained in a hash called %field. The key is the variable name. If there is more than one value associated with the same name (as there can be quite often on HTML forms that include multiple-selection lists, or tick boxes with the same name), a comma-separated list of these values is formed (line 86).

        So, by the end of GetFormInput, the appropriate fields of the hash %field have been assigned the values sent by the form.
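The steps just described can be sketched in Python as follows. This is not the code that the CGI Wizard generates (that is Perl, and is in the downloadable scripts); it is a rough equivalent written from the description above. The function name get_form_input is hypothetical, and the environment and input stream are passed in as arguments so that the sketch can be run without a web server. In a real CGI program they would be os.environ and sys.stdin.buffer.

```python
import io
from urllib.parse import unquote_plus

def get_form_input(environ, stdin):
    """Return a dict of decoded form fields, following the steps above."""
    # Choose the input channel according to the request method.
    if environ.get("REQUEST_METHOD") == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        buf = stdin.read(length).decode("ascii")
    else:
        buf = environ.get("QUERY_STRING", "")

    field = {}
    if not buf:
        return field  # no input, nothing more to do

    for pair in buf.split("&"):
        # Split the name from the value at the first "=".
        name, _, value = pair.partition("=")
        # Decode "+" and "%xx" sequences in both name and value.
        name, value = unquote_plus(name), unquote_plus(value)
        # Repeated names are collected into a comma-separated list
        # (the exact separator is an assumption of this sketch).
        if name in field:
            field[name] += "," + value
        else:
            field[name] = value
    return field

# A GET request: the data arrives in QUERY_STRING.
get_env = {
    "REQUEST_METHOD": "GET",
    "QUERY_STRING": "name=Jim&gender=male&likewpss=1&agegroup=31-40",
}
get_fields = get_form_input(get_env, io.BytesIO(b""))
print(get_fields)

# A POST request: the data arrives on standard input.
post_body = b"name=Jim&hobby=tea+drinking&hobby=cake"
post_env = {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": str(len(post_body))}
post_fields = get_form_input(post_env, io.BytesIO(post_body))
print(post_fields)
```

The second call shows the handling of a repeated name: both hobby values end up in a single comma-separated entry.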

The main script copies the field values into similarly named local variables (lines 16-19). This is a CGI Wizard option (you could write the script to refer to elements of %field directly), but it is useful to have it on for the sake of readability - having a list of the inputs obtained from the form helps the script document itself.

The print statements on lines 36-50 are intended to reproduce the HTML document specified to the CGI Wizard (mynameis_out.htm). Where the value of an input from the form appears in the output, the CGI Wizard generates appropriate code (e.g. lines 55-58).

mynameis2 is almost exactly the same:

        Instead of having the CGI Wizard generate GetFormInput, mynameis2.pl uses the CGI.pm module. (This is an option on the first tab of the Wizard.) This produces more compact code.

        Line 8 imports the CGI module.

        Line 10 creates a new CGI object and makes $query reference it.

        Lines 17-20 get the values submitted from the form by calling the param method on the CGI object for each in turn. Behind the scenes, the CGI module does almost identical things to those done by GetFormInput in mynameis1.pl, but without the programmer having to worry about it in any way. This is what makes the CGI module so useful, and why it is preferable to parsing the input by hand.
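The same idea, letting a library do the splitting and decoding, can be sketched in Python. Python has no CGI.pm; the standard library function urllib.parse.parse_qs is used here as a loose analogue, and the sketch is not a description of how the Perl module works internally.

```python
from urllib.parse import parse_qs

# The example input from the form, still URL-encoded.
query_string = "name=Jim&gender=male&likewpss=1&agegroup=31-40"

# parse_qs splits and decodes the pairs in one call. Each name maps
# to a list, because a name may be submitted more than once.
params = parse_qs(query_string)
name = params["name"][0]
agegroup = params["agegroup"][0]
print(name, agegroup)    # Jim 31-40

# A repeated name yields a list with one entry per value.
multi = parse_qs("hobby=tea+drinking&hobby=cake")
print(multi["hobby"])    # ['tea drinking', 'cake']
```

Compared with splitting and decoding by hand, the calling code shrinks to a single call, which is the point being made about the CGI module.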

For more details of the CGI module, see the documentation associated with it in the CPAN library. If you have the module installed on your computer, you may find it in a file with a name something like C:/perl/html/lib/CGI.html.

 

Last updated by Prof Jim Briggs of the School of Computing at the University of Portsmouth

 
The web programming units include some material that was formerly part of the WPRMP, WECPP, WPSSM and WEMAM units.