
Readme for analog3.11

Choosing a logfile

This is a rather long page, so here is a quick summary of the most important points:
The basic command for selecting a logfile is
LOGFILE logfilename
or you can just put the logfile name on the command line without any arguments, e.g., analog logfilename. A - sign or the word stdin is interpreted as standard input: this is useful on Unix systems for constructing pipes. The word none means that the list of logfiles specified so far is erased. All logfiles must be on your local disk -- analog doesn't fetch them from across the network. In the Mac version, you can also analyse a single logfile by dragging it onto the analog icon.

You can have several LOGFILE commands. You can include wildcards in the logfile name (but not necessarily in the directory name: this is system-dependent), and you can use a list of logfiles separated by commas (without spaces). So the following commands would tell analog to read logfile1, c:\logs\logfile2, and all files ending in .log:

LOGFILE logfile1,*.log
LOGFILE c:\logs\logfile2
The LOGFILE commands are cumulative, except that any logfiles on the command line or in user-specified configuration files override any in the default configuration file, and are themselves overridden by any in the mandatory configuration file.
Analog knows about several different types of logfile. By default it will attempt to see if your logfile is of one of the types it knows about, based on the first line. (Note: if the first line of your logfile is corrupt, or if your logfile has lines in different formats, you'll have to tell analog the logfile type yourself). The types it can diagnose are the common log format, the NCSA combined format, referrer log and browser log, the W3 extended log format, the Microsoft IIS format (sometimes), the Netscape format, the WebSTAR format, and the Netpresenz format (sometimes). Examples of all these formats are given at the end of this page. If you have debugging on, analog will report what type of logfile it thinks yours is.

The reason for the "sometimes" in the previous paragraph is as follows. The Microsoft and Netpresenz formats are extremely badly designed in that the date can occur in either of the forms date/month/year or month/date/year, and they don't say which they're using. Analog will detect them automatically if it can tell which date format is being used (e.g., 13/2/98 or 2/13/98), but if it can't, it will tell you to use one of the LOGFORMAT strings below. Sometimes the date can even be in another format altogether, in which case analog won't be able to auto-detect it.
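The date ambiguity described above can be illustrated with a short Python sketch. This is not analog's own code: the function name classify_date and the return values are hypothetical, chosen to mirror the -INT and -NA format words.

```python
def classify_date(text):
    """Guess the field order of a date such as '13/2/98'.

    Returns 'INT' for date/month/year, 'NA' for month/date/year,
    or None when both leading numbers are at most 12 and the
    order cannot be decided from this date alone.
    """
    first, second, _year = (int(part) for part in text.split("/"))
    if first > 12 and second <= 12:
        return "INT"   # the first number cannot be a month
    if second > 12 and first <= 12:
        return "NA"    # the second number cannot be a month
    return None        # ambiguous (or not a valid date in either order)
```

A date such as 2/3/98 returns None here, which corresponds to the case where you have to name the format yourself.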

You can also specify a different type of logfile, using the LOGFORMAT or DEFAULTLOGFORMAT command. If all your logfiles are of formats that analog can diagnose, you need never use these commands.

When you start up analog, all logfiles have the default logfile format. This is normally automatic detection, as explained above, but you can change it if your logfiles are always in a format which analog doesn't know about. You do this by means of the command

DEFAULTLOGFORMAT format
-- we'll discuss what the formats can be in a minute.

Sometimes you might want to analyse several logfiles with different formats. For this you need the LOGFORMAT command. This command only applies to future logfiles in the same configuration file. So if you change the format with a command like

LOGFORMAT format
then any logfiles you select with a LOGFILE command later in the same configuration file will get the new format. If you put the LOGFORMAT after the LOGFILE command, it will not take effect for that logfile, and you will most likely get a "can't auto-detect format" warning.

The possible formats for use with the DEFAULTLOGFORMAT and LOGFORMAT commands are of two types. First there are some symbolic words, and then there are log format strings. We'll look at the words first.

There are format words for all the built-in formats analog knows about. For example, COMMON will select common format; you can also have COMBINED, REFERRER, BROWSER, EXTENDED, MICROSOFT-NA (North American date format), MICROSOFT-INT (international date format), MS-EXTENDED (Microsoft's attempt at extended format), MS-COMMON (a buggy version of common format in some versions of Microsoft software), NETSCAPE, WEBSTAR, NETPRESENZ-NA (North American) or NETPRESENZ-INT (international). There are also the words AUTO for automatic detection and DEFAULT for whatever the default log format is.

If your logfile is not in one of the recognised formats, you can tell analog about your format using a log format string. You only ever need this if your logfile has lines which are not in one of the standard formats. The format string consists of a template for the logfile line, with the various fields and special characters replaced by codes as follows.

%S    host (computer making the request)
%r    file requested
%R    Mac-style filename, with colons instead of slashes
%A    browser with +'s instead of spaces
%f    referrer (URL referring to the file)
%u    user (tip: a cookie can usefully be defined as %u too)
%v    virtual host (also called virtual domain)
%d    day of the month
%m    month in digits
%M    month, three letter abbreviation
%y    year, last two digits
%Y    year, four digits
%h    hour of the day
%n    minute of the hour
%a    a for am or p for pm (if %h is 12-hour clock)
%b    number of bytes transferred
%c    HTTP status code
%C    Special code, specific to particular servers
%q    query string (part of filename after ?, if recorded in a separate field)
%j    junk: ignore this field (field can be empty too)
%w    white space: spaces or tabs
%W    optional white space
%%    % sign
\n    new line
\t    tab stop
\\    single backslash
(I shall refer to the first seven things above as items.) So for example, the common log format, which looks like
jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243
can be represented by the LOGFORMAT command
LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b)
including two items, host and file. (The parentheses are needed because the argument contains spaces. Note also the use of %j to ignore two fields, the seconds and the timezone.)

Logfiles often contain lines in several different formats, so you can specify several log formats one after the other and they will accumulate. For example, the definition of common format should also include the line

LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j] "%j %r" %c %b)
to handle lines where the HTTP/1.0 part of the request is absent. Or you might give two LOGFORMAT commands naming different formats, one after the other, to represent a logfile which had lines in both those formats. Analog tries to match the line to the first format first, then if that fails the next, and so on, so the order of the formats is important. Usually you want to specify the most common one first, to minimise the time spent trying to match lines to inappropriate formats. The DEFAULTLOGFORMAT also accumulates in this way.
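As a rough illustration of why the order matters, here is a Python sketch that tries a list of patterns in turn and stops at the first one that matches. The regular expressions only cover the quoted request part of a line, and the names WITH_PROTOCOL, WITHOUT_PROTOCOL and first_match are hypothetical; analog's real matching works on LOGFORMAT strings, not on regular expressions.

```python
import re

# Two hand-written patterns standing in for two accumulated formats:
# a request with the trailing HTTP/1.0 part, and one without it.
WITH_PROTOCOL = re.compile(r'"\S+ (?P<file>\S+) \S+"')
WITHOUT_PROTOCOL = re.compile(r'"\S+ (?P<file>\S+)"')

def first_match(request, formats):
    """Return (index of the first matching format, file), or None."""
    for index, pattern in enumerate(formats):
        match = pattern.fullmatch(request)
        if match:
            return index, match.group("file")
    return None  # no format matched: the line would count as corrupt

FORMATS = [WITH_PROTOCOL, WITHOUT_PROTOCOL]
```

Every line that only matches the second pattern costs one failed attempt against the first, which is why the most common format is best listed first.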

The log formats which analog can handle are those which are known as instantaneously decipherable: this means that the character which terminates a string can never occur in the string. In the above example, if the hostname ever contained a space, the line would be marked as corrupt, because analog terminates the host at the first space, not at the first occurrence of space-dash-space, and then the rest of the line wouldn't match. Of course, hostnames should never contain spaces, so this shouldn't be a problem. There are a couple of other restrictions. If there is any date or time information, then the year, month, date, hour and minute must all be present. Also, the same information may not occur twice in the format (so you can't have both %m and %M, for example).
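The terminating-character rule can be sketched in Python by turning a format string into a regular expression in which every field stops at the next literal character of the template. This is only an illustration under stated assumptions, not analog's implementation: compile_format and FIELD_NAMES are hypothetical names, only the codes used in the common-format example are handled, and every field is assumed to be followed by a literal character or the end of the line.

```python
import re

# Hypothetical names for the codes used in the common-format example.
FIELD_NAMES = {"S": "host", "u": "user", "r": "file", "c": "status",
               "b": "bytes", "d": "day", "M": "month", "Y": "year",
               "h": "hour", "n": "minute"}

def compile_format(template):
    """Translate a format string into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(template):
        if template[i] == "%":
            code = template[i + 1]
            i += 2
            if i < len(template):
                # The field ends at the first occurrence of the next
                # literal character, so it may not contain that character.
                body = "[^" + re.escape(template[i]) + "]*"
            else:
                body = ".*"  # last field: runs to the end of the line
            if code == "j":
                parts.append("(?:" + body + ")")  # junk: matched, not kept
            else:
                parts.append("(?P<" + FIELD_NAMES[code] + ">" + body + ")")
        else:
            parts.append(re.escape(template[i]))
            i += 1
    return re.compile("".join(parts))

COMMON = '%S - %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b'
LINE = ('jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] '
        '"GET /~sret1/ HTTP/1.0" 200 1243')

match = compile_format(COMMON).fullmatch(LINE)
fields = match.groupdict() if match else None
```

With a hostname containing a space, the host field stops at that space, the following " - " fails to match, and fullmatch returns None, which corresponds to the line being marked as corrupt.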

Sometimes you need to read one of the fields in a logfile, but not analyse it. For example, if you have a separate common log and referrer log, the referrer log might look like

[14/Mar/1996:17:48:10] http://guide-p.infoseek.com/Titles -> /~sret1/analog/
But the requests for /~sret1/analog/ would already have been counted when reading the main logfile, so you don't want to count them again now. You get round this by specifying a * in that item in the format string, like this:
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %f -> %*r)
Any of the seven items can be treated in this way.

Here are the exact rules about which logfile gets which log formats. The default logfile format starts off at AUTO. You can change it with a DEFAULTLOGFORMAT command, and then the default format accumulates unless you specify DEFAULTLOGFORMAT AUTO to return to automatic detection.

The current logfile format starts off at DEFAULT. You can change it with a LOGFORMAT command, and then the current format accumulates until a LOGFILE command intervenes; then it restarts at the next LOGFORMAT command. It also restarts if you specify LOGFORMAT AUTO or LOGFORMAT DEFAULT; or when the current format is reset to DEFAULT automatically, which happens at the end of the command line, and of every configuration file, and whenever a LOGFILE none command is encountered.

The default logfile selected at compilation time always gets the default format (although exactly what the default format is can still be changed with a DEFAULTLOGFORMAT command). Any logfile declared later, in a configuration file for example, gets the current log format at the time it is selected. If you specify several logfiles, they will all use the same format, unless there's a LOGFORMAT command or an implicit return to DEFAULT format between them.

There's also a second argument to the logfile command, which specifies a prefix to add to all the filenames in that logfile. This is useful if you've got several different servers or virtual hosts, when the same filename may occur on each of the servers. The argument can contain a %v, and the name of the virtual host will then be inserted at that point. For example,
LOGFILE log1,log2 http://www.%v.mydomain.com
would translate a filename /file.html with virtual host spam in log1 or log2 to http://www.spam.mydomain.com/file.html. If you are using the second argument to the LOGFILE command, you will probably want to use the SUBDIR command as well.

If %v is included in the argument and the logfile line doesn't have a virtual host, that line will be marked as corrupt. If VHOSTLOWMEM 3 is specified, the %v's will not be translated and will just appear as %v in the output.
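The effect of the second argument can be sketched in Python as follows. The function name apply_prefix is hypothetical, and returning None stands in for a line being marked as corrupt; the VHOSTLOWMEM 3 case is not modelled.

```python
def apply_prefix(prefix, filename, vhost=None):
    """Prepend the LOGFILE prefix to a filename, expanding %v if present."""
    if "%v" in prefix:
        if vhost is None:
            return None  # no virtual host on this line: treated as corrupt
        prefix = prefix.replace("%v", vhost)
    return prefix + filename
```

For example, apply_prefix("http://www.%v.mydomain.com", "/file.html", "spam") gives the translated name from the example above.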

There is one other command which applies to individual logfiles, in a similar way to the LOGFORMAT. Sometimes your server is not (or believes it is not) in the same timezone as you. So that you can give your statistics in your local time, there is a command LOGTIMEOFFSET to change the time by a certain number of minutes. You have to be careful using this. Because daylight saving time operates in different parts of the world at different times, analog cannot attempt to convert between different timezones. So it's your responsibility to set the right offset for different times of year. For example, if you were in Chicago, but your server was recording time in GMT, you would need to specify two different time offsets, one of minus five hours for summer and one of minus six hours for winter. You would need to split your logfiles in the right places and then run commands like
LOGTIMEOFFSET -300
LOGFILE summer*.log
LOGTIMEOFFSET -360
LOGFILE winter*.log
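
The arithmetic behind the Chicago example can be checked with a short Python sketch. The function apply_offset is a hypothetical stand-in for what a time offset in minutes does to each logged time; it is not taken from analog.

```python
from datetime import datetime, timedelta

def apply_offset(logged_time, offset_minutes):
    """Shift a logged time by a signed number of minutes."""
    return logged_time + timedelta(minutes=offset_minutes)

# Minus five hours in summer and minus six hours in winter, in minutes.
SUMMER_OFFSET = -5 * 60
WINTER_OFFSET = -6 * 60

summer_local = apply_offset(datetime(1998, 7, 1, 17, 45), SUMMER_OFFSET)
winter_local = apply_offset(datetime(1998, 1, 15, 3, 0), WINTER_OFFSET)
```

Note that the winter example crosses midnight, so the local date is the previous day; this affects daily totals as well as hourly ones.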

While we're on the subject of time offsets, there is one other similar command, which is not directly to do with logfiles. You can specify a TIMEOFFSET command to say how much analog should offset the time of the computer on which it is running, to get your local time.

It is often convenient to store logfiles compressed to save disk space. Analog on the Mac can read logfiles compressed using gzip. And analog on Unix, Win32, and VMS 7.0 and above can read compressed logfiles provided that you use an UNCOMPRESS command to say how to uncompress them. You need to supply the types of file that you want to uncompress in a comma-separated list, together with the name of a command that will uncompress the files to standard output (rather than to a file). For example, on Unix you might use
UNCOMPRESS *.gz,*.Z  /usr/bin/gzcat
whereas on Windows NT, you might use
UNCOMPRESS *.gz "c:\Program Files\gzip\gzip -cd"
and on VMS, it could be
UNCOMPRESS *.LOG-GZ;*  "gunzip -c"
This would be a suitable command to include in the default configuration file.
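
The way a filename is paired with an uncompression command can be pictured with this Python sketch. It assumes shell-style wildcard matching as provided by Python's fnmatch module, which may differ in detail from analog's own matching; RULES and uncompress_command are hypothetical names, and the sketch only selects the command without running it.

```python
import fnmatch

# Each rule pairs a list of wildcard patterns with a command that
# writes the uncompressed file to standard output.
RULES = [(["*.gz", "*.Z"], "/usr/bin/gzcat")]

def uncompress_command(filename, rules):
    """Return the command for the first rule matching filename, or None."""
    for patterns, command in rules:
        if any(fnmatch.fnmatchcase(filename, pattern) for pattern in patterns):
            return command
    return None  # no rule matches: the file is read as it is
```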

If analog determines when it starts to uncompress a logfile that that file isn't wanted for the analysis, two undesirable things can happen. Either the program might pause until the logfile is fully uncompressed, or there might be a "broken pipe" error reported. This is system dependent, and out of analog's control.

Appendix: logfile formats

Here is a summary of the various logfile formats which analog knows about.

The common logfile format is written by most servers. Its lines look like

jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243
Specifying LOGFORMAT COMMON is the same as specifying the three commands
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b)
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r" %c %b)
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%r" %c %b)
Some versions of Microsoft software have a buggy version of this with an extra quote mark before the HTTP like this:
jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ "HTTP/1.0" 200 1243
Analog will understand these, but (as with any two formats) it will reject lines if the format changes half way through.
The NCSA referrer log looks like
[14/Mar/1996:17:48:10] http://guide-p.infoseek.com/Titles -> /~sret1/analog/
and the browser (or agent) log looks like
[14/Mar/1996:17:45:08] Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)
The respective LOGFORMAT commands are
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %f -> %*r)
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %B)
In both of these logfiles the date can be omitted, except that if the date is omitted in the browser log, analog will not be able to detect the log format automatically. (It doesn't contain enough clues, so there is too much danger of confusing other log formats with it; just use "LOGFORMAT %B".)
The NCSA combined log is the same as the common log, except that it has the referrer and browser on the end in quotes, like this:
jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243
"http://www.statslab.cam.ac.uk/" "Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)"
except all on one line. If you are using the Apache server, you can generate this with the mod_log_config module, using the command
LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-Agent}i\""
The corresponding LOGFORMAT commands are
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b "%f" "%B")
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r" %c %b "%f" "%B")
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%r" %c %b "%f" "%B")
It is usually better to use the combined log than separate logs, because it stores more information in less space.
The W3 extended log, the Netscape log, and the WebSTAR log can be recognised because they must include at or near the top a line telling analog what to expect on subsequent lines. Analog constructs a LOGFORMAT template based on this header line. (They may also contain later lines changing the format).

The extended log is described at http://www.w3.org/TR/WD-logfile.html. Its header line looks like

#Fields: date time cs-uri
In the rest of the logfile, the fields can be separated by spaces or tabs. There is also Microsoft's attempt at the extended format -- unfortunately they didn't read the spec., so they didn't enclose the browser and referrer in quotes, and they replaced spaces in the browser name with +'s.
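
To give a feel for how a template might be built from such a header line, here is a Python sketch. The mapping from field names to format codes is an assumption made for illustration (it assumes dates written as year-month-day and times as hour:minute:second, and joins fields with a white space code); it is not taken from analog's source, and template_from_header and FIELD_TEMPLATES are hypothetical names.

```python
# Hypothetical mapping from extended-log field names to format codes.
# Fields that are not listed are skipped as junk.
FIELD_TEMPLATES = {
    "date": "%Y-%m-%d",
    "time": "%h:%n:%j",
    "cs-uri": "%r",
}

def template_from_header(header):
    """Build a format template from a '#Fields:' header line."""
    marker = "#Fields:"
    if not header.startswith(marker):
        return None  # not a field declaration
    names = header[len(marker):].split()
    return "%w".join(FIELD_TEMPLATES.get(name, "%j") for name in names)
```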

The WebSTAR file has a header line like

In the rest of the logfile, the fields are separated by tabs. Some other Mac servers also use the WebSTAR format, or something looking like it. Analog will understand these too. Finally, the Netscape header line looks like
format=%Ses->client.ip% [%SYSDATE%] "%Req->reqpb.clf-request%"
%Req->srvhdrs.clf-status% %Req->srvhdrs.content-length%

Sometimes these three logfile formats can contain header lines which refer to the same item in two different ways. Analog doesn't know which one you want to count, so such header lines will generate a "corrupt format line" warning. You can then use a LOGFORMAT command to specify the format more precisely.

The Microsoft IIS logfile looks like
, -, 21/02/97, 00:03:46, W3SVC1, SPIDER,,
30, 303, 1455, 200, 0, GET, /siege.htm, -,
(except all on one line), which corresponds to the command
LOGFORMAT (%S, %u, %d/%m/%y, %h:%n:%j, W3SVC%j, %j, %v, %j, %j, %b, %c, %j, %j, %r, %j,)
However, the format is extremely badly designed, in that the date follows local conventions: in other words, in North America the above example would have the date 02/21/97 instead. Analog will diagnose which form the logfile is in if possible: but if both the date and the month are at most 12, there is no way to tell which format it is. In this case, you need to use the LOGFORMAT command MICROSOFT-NA for North American date format, or MICROSOFT-INT for international date format. It may even be that the date is in neither of these formats, in which case you need to use a LOGFORMAT command of your own.

There are also various third-party extensions to the Microsoft format to include, for example, the browser and referrer. Analog can't automatically diagnose these: you need to write a LOGFORMAT string for them.

The Netpresenz logfile is unusual in that each entry can spread over several lines. It looks like
5:54 pm  14/11/96  HTTP    get file  Research.html
Referer: http://guide-p.infoseek.com/Titles
The fields are separated by tabs. It is equivalent to four LOGFORMAT commands:
LOGFORMAT (%h:%n %aM\t%m/%d/%y\t%S\tHTTP\t\t%C\t%j\t\n%R\nReferer: %f)
LOGFORMAT (%h:%n %aM\t%m/%d/%y\t%S\tHTTP\t\t%C\t%j\t\n%R)
LOGFORMAT (%h:%n %aM\t%m/%d/%y\t%S\tHTTP\t\t%C\t%R)
Again, the Netpresenz format uses local conventions for the date and time. Analog will diagnose it where it can: otherwise, you will have to use
LOGFORMAT NETPRESENZ-NA    # dates like 9:14 AM  3/23/98 (upper case AM)
LOGFORMAT NETPRESENZ-INT   # dates like 9:14 am  23/3/98 (lower case am)
Again, it can be that the date and time is in neither of these forms, in which case you will have to enter your own LOGFORMAT string.
Stephen Turner
E-mail: analog-author@lists.isite.net
