Webserver use, configuration and management
In the lecture on evaluation and testing, we highlighted a couple of well-known log analysis tools, but a wide range is available. Your first task is to spend a few minutes searching the web for others not mentioned in the lecture, and to compile a brief summary of their features: for example cost (or whether freeware), implementation language (and hence what support is needed on the server), whether results are accessible over the web or only via offline analysis, and which log file formats are supported. Do include Analog and Webalizer in your search; new versions of them are bound to contain new features. Discuss your table of results with others in your group.
First have a look at the data by inspecting it with a text editor. Any program that will open a plain text file will do, but remember that if the text file originated on a Unix/Linux system the line endings are coded differently from those in text files produced by Windows systems. You should be able to identify most of the fields in the log file; consult the Apache documentation or ask if you cannot.
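If you prefer a programmatic look at the fields, a short script can do the same job as manual inspection. The sketch below assumes the log is in Apache's Combined (or Common) Log Format; the regular expression and the sample line are illustrations only, and the exact field order on your server depends on its LogFormat directive.

```python
import re

# One line of Apache Common/Combined Log Format: host, identity, user,
# [timestamp], "request", status, bytes. (Combined format adds quoted
# referer and user-agent fields after these, which this pattern ignores.)
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line doesn't match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

# A made-up sample line for illustration:
sample = '203.0.113.7 - - [12/Jan/2004:10:05:32 +0000] "GET /index.html HTTP/1.0" 200 4523'
print(parse_line(sample))
```

Running `parse_line` over every line of the file gives you the fields split out much as the Excel import wizard would, but with full control over the separators.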
An alternative simple tool would be a spreadsheet. If you open the file in Excel it will invoke the import wizard and give you the opportunity to get the fields from the log file into different columns of the spreadsheet – not perfect, but a worthy try. Try several different ways of importing to see if you can separate out the fields properly. (Hint – what might be sensible separators in addition to the default “tab” suggestion?)
If you struggle, here's a version in Excel format.
The next task is to check whether there are any 404 response codes, indicating page not found. Could you extract or group the lines with a 404 response so as to be able to concentrate on them? Excel would simplify this task with its filter facility (Data|Filter|AutoFilter).
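The same filtering can be done in a few lines of code. This sketch assumes Common/Combined Log Format, where the status code is the first token after the quoted request field; it tolerates lines that don't fit that shape by skipping them.

```python
def lines_with_status(lines, code="404"):
    """Keep only the log lines whose status field matches `code`.

    Assumes the status code is the first token after the quoted
    "request" field; malformed lines are silently skipped.
    """
    out = []
    for line in lines:
        parts = line.split('"')
        if len(parts) >= 3 and parts[2].split()[:1] == [code]:
            out.append(line)
    return out
```

For example, `lines_with_status(open("access.log"))` (the filename is an assumption) returns just the page-not-found entries, ready for closer inspection.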
Clearly the webmaster should be working at eliminating these 404 errors. Can you make any suggestions or observations?
Try to get the data grouped by client IP address. Can you say anything about the addresses reported in the log file? Do you have any way of finding out which domain they come from? What might you deduce from a full IP lookup? Can you identify any different countries among the client hosts you checked? There are some non-UK ones in the Jan2004 log file. Is this a useful task? Do the log file analysers you looked at in the first part of the session support this activity?
Can you make an estimate (or exact figure) on the number of distinct IPs listed as clients in the log file? Can you say anything about the number of likely actual human readers of the web pages?
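Grouping by client IP and counting distinct addresses is straightforward once the lines are in a program. The sketch below assumes the client address is the first whitespace-separated field of each line (as in Common/Combined Log Format); the reverse-DNS helper is best-effort and will return None for addresses with no PTR record.

```python
from collections import Counter
import socket

def client_counts(lines):
    """Count requests per client IP, taken as the first field of each line."""
    return Counter(line.split()[0] for line in lines if line.strip())

def reverse_lookup(ip):
    """Best-effort reverse DNS lookup; returns None if the address
    has no PTR record or the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None
```

`len(client_counts(lines))` gives the number of distinct client IPs, and `.most_common(5)` shows the heaviest clients. Bear in mind when estimating human readers that one IP may be a proxy or NAT gateway hiding many users, and some "clients" will be search-engine crawlers, so distinct IPs only loosely approximate real readers.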
Calculate the average file size transferred. (Once the file is loaded into Excel this should be easy.) Filtering on response code 200 would help you see the relevant data. What proportion of requests were successful?
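The same averages can be computed in code. This sketch assumes each record is a dict with string "status" and "size" fields, as a Common Log Format parser would produce (the size field may be "-" when no body was sent, so non-numeric sizes are skipped).

```python
def transfer_stats(records):
    """Return (average bytes transferred, proportion of 200 responses).

    Each record is assumed to be a dict with string "status" and "size"
    fields; a size of "-" (no body transferred) is excluded from the
    average.
    """
    if not records:
        return 0.0, 0.0
    sizes = [int(r["size"]) for r in records if r["size"].isdigit()]
    avg = sum(sizes) / len(sizes) if sizes else 0.0
    ok = sum(1 for r in records if r["status"] == "200")
    return avg, ok / len(records)
```

Note that this counts only status 200 as "successful"; you might reasonably also count other 2xx and 3xx codes, which is worth discussing with your group.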
Can you estimate the average number of hits a day? What is the maximum number of hits per day? What is the least number of hits per day? Clearly, undertaking this task with just a text editor is going to be difficult, though with the log file loaded into Excel it is very much simpler. (Hint: use the Data|Subtotals... menu option. This does require you to have added a column label at the top of each column.) Can you extract just the date from the combined date/time field?
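Extracting the date and tallying per day is also easy programmatically. The sketch below assumes the timestamp appears in square brackets in the form dd/Mon/yyyy:hh:mm:ss, as in Apache's default log format, so the date is everything up to the first colon inside the brackets.

```python
from collections import Counter

def hits_per_day(lines):
    """Tally requests per date, taking the date portion (dd/Mon/yyyy)
    from the [timestamp] field of each log line."""
    days = Counter()
    for line in lines:
        start = line.find("[")
        if start != -1:
            # Everything between "[" and the first ":" is the date.
            days[line[start + 1:].split(":", 1)[0]] += 1
    return days
```

From the resulting Counter, `sum(daily.values()) / len(daily)` gives the average hits per day, and `max(daily.values())` / `min(daily.values())` give the busiest and quietest days.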
If you have time it would be good to download one of the free tools and try it out on the noonshadow log files.