| 1 | The Webalizer - A log file analysis program -- DNS information |
| 2 | |
| 3 | The webalizer has the ability to perform reverse DNS lookups, and |
| 4 | fully supports both IPv4 and IPv6 addressing schemes. This document |
| 5 | attempts to explain how it works, and some things that you should be |
| 6 | aware of when using the DNS lookup features. |
| 7 | |
| 8 | Note: The Reverse DNS feature may be enabled or disabled at compile |
| 9 | time. DNS lookup code is enabled by default. You can run The |
| 10 | Webalizer using the '-vV' command line options to determine what |
| 11 | options are enabled in the version you are using. |
| 12 | |
| 13 | |
| 14 | How it works |
| 15 | ------------ |
| 16 | |
| 17 | DNS lookups are made against a DNS cache file containing IP addresses |
| 18 | and resolved names. If an IP address is not found in the cache file, |
| 19 | it will be left as an IP address. For lookups to be performed, a |
| 20 | cache file MUST be specified when the Webalizer is run, either using |
| 21 | the '-D' command line switch or the "DNSCache" configuration file |
| 22 | keyword. If no cache file is specified, no DNS lookups will be |
| 23 | attempted. The cache file can be created in three different ways. |
| 24 | |
| 25 | 1) You can have the Webalizer pre-process the specified log file at |
| 26 | run-time, creating the cache file before processing the log file |
| 27 | normally. This is done by setting the number of DNS Children |
| 28 | processes to run, either by using the '-N' command line switch or |
| 29 | the "DNSChildren" configuration keyword. This will cause the |
| 30 | Webalizer to spawn the specified number of processes which will |
| 31 | be used to do reverse DNS lookups. Generally, a larger number |
| 32 | of processes will result in faster resolution of the log; however, |
| 33 | setting it too high may cause overall system degradation. A setting |
| 34 | of between 5 and 20 should be acceptable, and there is a maximum |
| 35 | limit of 100. If used, a cache filename MUST be specified also, |
| 36 | using either the '-D' command line switch, or the "DNSCache" |
| 37 | configuration keyword. Using this method, normal processing will |
| 38 | continue only after all IP addresses have been processed, and the |
| 39 | cache file is created/updated. |
| 40 | |
| 41 | 2) You can pre-process the log file as a standalone process, creating |
| 42 | the cache file that will be used later by the Webalizer. This is |
| 43 | done by running the Webalizer with a name of 'webazolver' (ie: the |
| 44 | name 'webazolver' is a symbolic link to 'webalizer') and specifying |
| 45 | the cache filename (either with '-D' or DNSCache). If the number |
| 46 | of child processes is not given, the default of 5 will be used. In |
| 47 | this mode, the log will be read and processed, creating a DNS cache |
| 48 | file or updating an existing one, and the program will then exit |
| 49 | without any further processing. |
| 50 | |
| 51 | 3) You can use The Webalizer (DNS) Cache file Manager program 'wcmgr' |
| 52 | to create and manipulate a cache file. A blank cache file can be |
| 53 | created which would be later populated, or data for the cache file |
| 54 | can be imported using tab delimited text files. See the wcmgr(1) |
| 55 | man page for usage information. |
| 56 | |
| 57 | |
| 58 | Run-time DNS cache file creation/update |
| 59 | --------------------------------------- |
| 60 | |
| 61 | The creation/update of a DNS cache file at run-time occurs as follows: |
| 62 | |
| 63 | 1) The log file is read, creating a list of all IP addresses that are |
| 64 | not already cached (or cached but expired) and need to be resolved. |
| 65 | Addresses are expired based on the TTL value specified using the |
| 66 | 'CacheTTL' configuration option or after 7 days (default) if no TTL |
| 67 | is specified. |
| 68 | |
| 69 | 2) The specified number of children processes are forked, and are used |
| 70 | to perform DNS lookups. |
| 71 | |
| 72 | 3) Each IP address is given, one at a time, to the next available child |
| 73 | process until all IP addresses have been processed. Each child will |
| 74 | update the cache file when a result is returned. This may be either |
| 75 | a resolved name or a failed lookup, in which case the address will be |
| 76 | left unresolved. Unresolved addresses are not normally cached, but |
| 77 | can be, if enabled using the 'CacheIPs' configuration file keyword. |
| 78 | |
| 79 | 4) Once all IP addresses have been processed and the cache file updated, |
| 80 | the Webalizer will process the log normally. Each record it finds |
| 81 | that has an unresolved IP address will be looked up in the cache file |
| 82 | to see if a hostname is available (ie: was previously found). |
| 83 | |
| 84 | Because there may be a significant amount of time between the initial |
| 85 | unresolved IP list and normal processing, the Webalizer should not be |
| 86 | run against live log files (ie: a log file that is actively being written |
| 87 | to by a server), otherwise there may be additional records present that |
| 88 | were not resolved. |
| 89 | |
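As a rough illustration of step 1 above, the following shell sketch lists
the unique client addresses found in a log, which gives an idea of how many
lookups a run might need. It is not how the Webalizer itself builds its
list. It assumes the client address is the first whitespace-separated field
of each record (as in the common log format), and the function name
'list_addresses' is made up for this example.

```shell
#!/bin/sh
# list_addresses: read log records on stdin and print each distinct
# client address once. Hypothetical helper; assumes the address is field 1.
list_addresses() {
    awk '
        # dotted-quad IPv4 address
        $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { print $1; next }
        # loose IPv6 match: hex digits, colons and dots, at least one colon
        $1 ~ /^[0-9A-Fa-f:.]*:[0-9A-Fa-f:.]*$/ { print $1 }
    ' | sort -u
}
```

For example, 'list_addresses < /var/log/my_www_log | wc -l' would print the
number of distinct addresses. Records that already contain a hostname in
the first field are skipped.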
| 90 | |
| 91 | Stand-Alone DNS cache file creation/update |
| 92 | ------------------------------------------ |
| 93 | |
| 94 | The creation/update of the DNS cache file, when run in stand-alone mode, |
| 95 | occurs as follows: |
| 96 | |
| 97 | 1) The log file is read, creating a list of all IP addresses that are |
| 98 | not already cached (or cached but expired) and need to be resolved. |
| 99 | |
| 100 | 2) The specified number of children processes are forked, and are used |
| 101 | to perform DNS lookups. If the number of processes was not specified, |
| 102 | the default of 5 will be used. |
| 103 | |
| 104 | 3) Each IP address is given, one at a time, to the next available child |
| 105 | process until all IP addresses have been processed. Each child will |
| 106 | update the cache file when a result is returned. |
| 107 | |
| 108 | 4) Once all IP addresses have been processed and the cache file updated, |
| 109 | the program will terminate without any further processing. |
| 110 | |
| 111 | |
| 112 | Larger sites may prefer to use a stand-alone process to create the DNS |
| 113 | cache file, and then run the Webalizer against the cache file. This |
| 114 | allows a single cache file to be used for many virtual hosts, and reduces |
| 115 | the processing needed if many sites are being processed. The Webalizer |
| 116 | can be used in stand alone mode by running it as 'webazolver'. When |
| 117 | run in this fashion, it will only create the cache file and then exit |
| 118 | without any further processing. A cache filename MUST be specified, |
| 119 | however unlike when running the Webalizer normally, the number of child |
| 120 | processes does not have to be given (will default to 5). All normal |
| 121 | configuration and command line options are recognized; however, many |
| 122 | of them will simply be ignored. This allows the use of a standard |
| 123 | configuration file for both normal use and stand-alone use. |
| 124 | |
| 125 | |
| 126 | Examples: |
| 127 | --------- |
| 128 | |
| 129 | webalizer -c test.conf -N 10 -D dns_cache.db /var/log/my_www_log |
| 130 | |
| 131 | This will use the configuration file 'test.conf' to obtain normal |
| 132 | configuration options such as hostname and output directory. It |
| 133 | will then either create or update the file 'dns_cache.db' in the |
| 134 | default output directory (using 10 child processes) based on the |
| 135 | IP addresses it finds in the log /var/log/my_www_log, and then |
| 136 | process that log file normally. |
| 137 | |
| 138 | |
| 139 | webalizer -o out -D dns_cache.db /var/log/my_www_log |
| 140 | |
| 141 | This will process the log file /var/log/my_www_log, resolving IP |
| 142 | addresses from the cache file 'dns_cache.db' found in the default |
| 143 | output directory "out". The cache file must be present as it will |
| 144 | not be created with this command. |
| 145 | |
| 146 | |
| 147 | for i in /var/log/*/access_log; do |
| 148 | webazolver -N 20 -D /var/lib/dns_cache.db "$i" |
| 149 | done |
| 150 | |
| 151 | The above is an example of how to run through multiple log files |
| 152 | creating a single DNS cache file. This might typically be used on |
| 153 | a larger site that has many virtual hosts, each keeping its log |
| 154 | files in a separate directory. It will process each access_log it |
| 155 | finds in /var/log/* and create a cache file (/var/lib/dns_cache.db). |
| 156 | This cache file can then be used to process the logs normally with |
| 157 | the Webalizer in a read-only fashion (see next example). |
| 158 | |
| 159 | |
| 160 | for i in /etc/webalizer/*.conf; do webalizer -c "$i" -D /etc/cache.db; done |
| 161 | |
| 162 | This will process each configuration file found in /etc/webalizer, |
| 163 | using the DNS cache file /etc/cache.db. This will also typically be |
| 164 | used on a larger site with multiple hosts. Each configuration file |
| 165 | will specify a site-specific log file, hostname, output directory, etc. |
| 166 | The cache file used will typically be created using a command similar |
| 167 | to the one previous to this example. |
| 168 | |
| 169 | |
| 170 | Cache File Maintenance |
| 171 | ---------------------- |
| 172 | |
| 173 | The Webalizer DNS cache files generally require very little or no |
| 174 | special attention. There are times though when some maintenance |
| 175 | is required, such as occasional purging of very old cache entries. |
| 176 | The Webalizer never removes a record once it's inserted into the |
| 177 | cache. If a record expires based on its timestamp, the next time |
| 178 | that address is seen in a log, its name is looked up again and the |
| 179 | timestamp is updated. However, there will always be addresses that |
| 180 | are never seen again, which will cause the cache files to continue |
| 181 | to grow in size over time. On extremely busy sites or sites that |
| 182 | attract many one time visitors, the cache file may grow extremely |
| 183 | large, yet only contain a small number of valid entries. Using |
| 184 | The Webalizer (DNS) Cache file Manager ('wcmgr'), cache files can |
| 185 | be purged, removing expired entries and shrinking the file size. |
| 186 | A TTL (time to live) value can be specified, so the length of time |
| 187 | an entry remains in the cache can be varied depending on individual |
| 188 | site requirements. In addition to purging cache files, 'wcmgr' can |
| 189 | also be used to list cache file contents, import/export cache data, |
| 190 | lookup/add/delete individual entries and gather overall statistics |
| 191 | regarding the cache file (number of records, number expired, etc.). |
| 192 | |
| 193 | To purge a cache file using 'wcmgr', an example command would be: |
| 194 | |
| 195 | wcmgr -p31 /path/to/dns.cache |
| 196 | |
| 197 | This would purge the 'dns.cache' cache file of any records that are |
| 198 | over 31 days old, and would reclaim the space that those records |
| 199 | were using in the file. If you would like to see the records that |
| 200 | get purged, adding the command line option '-v' (verbose) will cause |
| 201 | the program to print each entry and its age as they are removed. |
| 202 | You can also use the 'wcmgr' to display statistics on cache files |
| 203 | to aid in determining when a cache file should be purged. See the |
| 204 | 'wcmgr' man page (wcmgr.1) for additional information on the various |
| 205 | options available. |
| 206 | |
| 207 | |
| 208 | Stupid Cache Tricks |
| 209 | ------------------- |
| 210 | |
| 211 | The DNS cache files used by The Webalizer allow for efficient IP address |
| 212 | to name translations. Resolved names are normally generated by using an |
| 213 | existing DNS name server to query the address, either locally or over |
| 214 | the Internet. However, using The Webalizer (DNS) Cache file Manager, |
| 215 | almost any IP address to Name translation can be included in the cache. |
| 216 | One such example would be for mapping local network addresses to real |
| 217 | names, even though those addresses may not have real DNS entries on the |
| 218 | network (or may be 'local' addresses prohibited from use on the Internet). |
| 219 | A simple tab delimited text file can be created and imported into a cache |
| 220 | for use by The Webalizer, which will then be used to convert the local |
| 221 | IP addresses to real names. Additional configuration options for The |
| 222 | Webalizer can then be used as would be normally. For example, consider |
| 223 | a small business with 10 computers and a DSL router to the Internet. |
| 224 | Each machine on the local network would use a private IP address that |
| 225 | would not be resolved using an external (public) DNS server, so would |
| 226 | always be reported by The Webalizer as 'unknown/unresolved'. A simple |
| 227 | cache file could be created to map those unresolved addresses into more |
| 228 | meaningful names, which could then be further processed by the Webalizer. |
| 229 | An example might look something like: |
| 230 | |
| 231 | # Local machines |
| 232 | 192.168.123.254 0 0 gw.widgetsareus.lan |
| 233 | 192.168.123.253 0 0 mail.widgetsareus.lan |
| 234 | 192.168.123.250 0 0 sales.widgetsareus.lan |
| 235 | 192.168.123.240 0 0 service.widgetsareus.lan |
| 236 | 192.168.123.237 0 0 mgr.widgetsareus.lan |
| 237 | 192.168.123.235 0 0 support1.widgetsareus.lan |
| 238 | 192.168.123.234 0 0 support2.widgetsareus.lan |
| 239 | 192.168.123.232 0 0 pres.widgetsareus.lan |
| 240 | 192.168.123.230 0 0 vp.widgetsareus.lan |
| 241 | 192.168.123.225 0 0 reception.widgetsareus.lan |
| 242 | 192.168.123.224 0 0 finance.widgetsareus.lan |
| 243 | 127.0.0.1 0 1 127.0.0.1 |
| 244 | |
| 245 | |
| 246 | There are a couple of things here that should be noted. The first |
| 247 | is that the timestamps (first zero on each line above) are set to |
| 248 | zero. This tells The Webalizer that these cached entries are to |
| 249 | be considered 'permanent', and should never be expired (infinite |
| 250 | TTL or time to live). The second thing to note is that the resolved |
| 251 | names are using a non-standard TLD (top level domain) of '.lan'. |
| 252 | The Webalizer will map this special TLD to mean "Local Network" in |
| 253 | its reports, which allows local traffic to be grouped separately |
| 254 | from normal Internet traffic. Lastly, you may notice that the |
| 255 | last line of the file contains an entry with the same IP address |
| 256 | where a name should be. This entry will prevent the Webalizer |
| 257 | from ever trying to look up 127.0.0.1, which is the 'localhost' |
| 258 | address, when it is found in a log. The second number after the IP |
| 259 | address (1) tells the Webalizer that it is an unresolved entry, not |
| 260 | a resolved hostname (ie: has no name). Entries such as this one can |
| 261 | be used to reduce DNS lookups on addresses that are known not to |
| 262 | resolve. |
| 263 | |
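Because the import file must be tab delimited, and tabs are easily lost
when text is copied or edited, it can help to generate the file with a
small script. The sketch below writes three of the entries shown above
using real tab characters and then checks that every line has exactly four
fields. The helper name 'emit' and the use of a temporary file are choices
made for this example only.

```shell
#!/bin/sh
# Write a tab-delimited import file: address, timestamp, flag, name.
# Timestamp 0 = never expires; flag 1 = unresolved entry (see text above).
import_file=$(mktemp)            # temporary file; any path would do

# emit ADDRESS TIMESTAMP FLAG NAME: print one tab-separated record
emit() { printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"; }

{
    emit 192.168.123.254 0 0 gw.widgetsareus.lan
    emit 192.168.123.253 0 0 mail.widgetsareus.lan
    emit 127.0.0.1       0 1 127.0.0.1
} > "$import_file"

# Count lines that do not have exactly four tab-separated fields.
bad=$(awk -F '\t' 'NF != 4 { n++ } END { print n + 0 }' "$import_file")
echo "$import_file: $bad malformed line(s)"
```

The resulting file can then be imported into a cache file with 'wcmgr';
see the wcmgr(1) man page for the import option.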
| 264 | |
| 265 | Considerations |
| 266 | -------------- |
| 267 | |
| 268 | Processing of live log files is discouraged, as log records written |
| 269 | between the time of DNS resolution and normal processing may contain |
| 270 | addresses that were never looked up and so remain unresolved. |
| 271 | |
| 272 | If you are using STDIN for the input stream (log file) and have run-time |
| 273 | DNS cache file creation/update enabled, the program will exit after the |
| 274 | cache file has been created/updated and no output will be produced. If |
| 275 | you must use STDIN for the input log, you will need to process the stream |
| 276 | twice, once to create/update the cache file, and again to produce the |
| 277 | reports. The reason for this is that stream inputs from STDIN cannot |
| 278 | be 'rewound' to the beginning like files can, so must be given twice. |
| 279 | |
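One way to give a stream twice is to wrap the command that produces it in
a small shell function and run it once per pass. This is only a sketch: the
function name 'process_stream_twice' is made up for this example, and it
assumes that 'webazolver', like the Webalizer, reads STDIN when no log file
is named and that all other settings come from the default configuration
file.

```shell
#!/bin/sh
# process_stream_twice CACHE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND twice, since a pipe cannot be rewound.
# Pass 1 feeds the stream to webazolver to create/update CACHE;
# pass 2 feeds it to webalizer, which resolves names from CACHE.
process_stream_twice() {
    cache=$1
    shift
    "$@" | webazolver -D "$cache" || return 1
    "$@" | webalizer -D "$cache"
}
```

It might be called as 'process_stream_twice /var/lib/dns_cache.db zcat
some_log.gz', where 'some_log.gz' stands for whatever produces the stream.
An absolute path is used for the cache file so that both passes refer to
the same file.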
| 280 | Cached DNS addresses have a default TTL (time to live) of 7 days. This |
| 281 | may now be changed using the CacheTTL config file keyword to any value |
| 282 | from 1 to 100 (days). You may also now specify if unresolved addresses |
| 283 | should be stored in the DNS cache. Normally, unresolved IP addresses |
| 284 | are NOT saved in the cache and are looked up each time the program is |
| 285 | run. |
| 286 | |
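The expiry rule can be sketched in shell as follows. This is an
illustration only. It assumes that cache timestamps are seconds since the
epoch and that an entry counts as expired once its age exceeds the TTL; the
exact comparison used by the Webalizer is not stated in this document. A
timestamp of zero marks a permanent entry, as described earlier. The
function name 'is_expired' is made up for this example.

```shell
#!/bin/sh
# is_expired TIMESTAMP NOW [TTL_DAYS]
# Succeeds (exit status 0) if the entry should be looked up again.
# Assumptions: times are seconds since the epoch; TTL defaults to 7 days.
is_expired() {
    ts=$1
    now=$2
    ttl_days=${3:-7}
    # A zero timestamp never expires; otherwise compare age with the TTL.
    [ "$ts" -ne 0 ] && [ $(( now - ts )) -gt $(( ttl_days * 86400 )) ]
}

now=$(date +%s)
if is_expired $(( now - 8 * 86400 )) "$now"; then
    echo "8 day old entry: expired"
fi
if ! is_expired 0 "$now"; then
    echo "timestamp 0: permanent"
fi
```

With the default TTL, an entry stored 8 days ago is treated as expired,
while one stored 6 days ago, or one with a zero timestamp, is not.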
| 287 | There is an absolute maximum of 100 child processes that may be created, |
| 288 | however the actual number of children should be significantly less than |
| 289 | the maximum. Typical usage should be between 5 and 20. |
| 290 | |
| 291 | Special thanks to Henning P. Schmiedehausen <hps@tanstaafl.de> for the |
| 292 | original dns-resolver code he submitted, which was the basis for this |
| 293 | implementation. Also thanks to Jose Carlos Medeiros for the initial IPv6 |
| 294 | support code. |
| 295 | |