Welcome to the wonderful world of web server statistics!
This document is intended to explain how web server log analysis is performed and what can be learned from it.
Specifically, it is aimed at users of Webalizer (one of the applications used on ICTEA's servers), but most of it applies to other web server analysis applications as well.
If you are a beginner in web analytics, or simply want to know how web server analysis is done, this document is for you.
Visit me, please!
You have a website and want to know if someone is visiting it, and if so, what they see and how often.
Luckily for you, most web servers keep a log of what is happening, so all you have to do is get access to it and take a look.
Logs are text files, so you can read them with any text editor (such as Windows Notepad). Each time someone visits a page on your website, or requests any component of it (identified by a URL, or Uniform Resource Locator), the web server writes a line at the end of the log recording that request.
Unfortunately, logs can seem nearly impossible for an uninitiated human to read. While you may be able to tell that someone accessed your website, getting any other information out of them takes something extra.
A typical log entry might look like:
192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117
This indicates a request made to the web server by a computer with IP address 192.168.45.13 for the URL /mypage.html. It also records the date and time the request was made, the type of request, the result code of that request and how many bytes were sent to the remote browser.
There will be a line like that for every request made to the web server during the period covered by the log.
A 'Hit' is another way of saying 'request to the web server', so each log line represents a 'hit'. If you want to know how many 'hits' your server received, just count the number of lines in the log file. And since each line records a request for a specific URL from a particular IP address, you can easily find out how many requests each of your web pages received, or how many requests came from a particular IP address, simply by counting the lines that contain them.
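To make the idea concrete, here is a minimal sketch (in Python) of that counting, assuming a hypothetical log file named 'access.log' in the Common Log Format shown above:

from collections import Counter

hits = 0
per_url = Counter()
per_ip = Counter()

with open("access.log") as log:               # hypothetical file name
    for line in log:
        parts = line.split()
        if len(parts) < 10:
            continue                          # skip malformed lines
        ip = parts[0]                         # first field: remote IP address
        url = parts[6]                        # seventh field: requested URL
        hits += 1
        per_url[url] += 1
        per_ip[ip] += 1

print("Total hits:", hits)
print("Most requested URLs:", per_url.most_common(5))
print("Most active IP addresses:", per_ip.most_common(5))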
Yes, it is that simple. And while this can be done by hand with a text editor or any other text processing tool, it is more practical and easier to use a program specifically designed to analyze the logs, such as Webalizer: it does the work for you, provides tools to analyze other aspects of your web server, and presents the data in a way that is easy to interpret.
How does Webalizer work? -
Well, to understand what it can analyze, you should know what information your web server records and how that information is obtained.
In other words, you should understand how the HTTP protocol (HyperText Transfer Protocol) works, along with its strong and weak points.
Basically, a web server simply waits to receive a request from a web browser. Once the request is received, the server processes it and returns something to the browser that made the request (and, as noted above, the request is logged).
Requests are typically for a URL, although a remote browser may also ask for other information, such as the type of web server, the supported versions of the HTTP protocol, modification dates, etc.; these, however, are not common.
To visualize the interaction between the server, the browser and the web pages, let us walk through an example of the information flow. Suppose a simple web page, 'mypage.html', which is a plain HTML page containing two images: 'myimage1.jpg' and 'myimage2.jpg'.
The interaction between the server and the browser would be something like the following:
- The browser requests the URL 'mypage.html'.
- The server receives the request and sends the HTML page.
- The browser notices that the page contains two graphics, so it first requests 'myimage1.jpg'.
- The server receives the request and sends the graphic.
- The browser then requests the second graphic, 'myimage2.jpg'.
- The server receives the request and sends the graphic.
- The browser displays the page with the two images to the user.
The following lines are added to the server log:
192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117
192.168.45.13 - - [24/May/2005:11:20:40 -0400] "GET /myimage1.jpg HTTP/1.1" 200 231
192.168.45.13 - - [24/May/2005:11:20:41 -0400] "GET /myimage2.jpg HTTP/1.1" 200 432
What can we learn from this exchange? Based on what was stated above, we can count the number of lines in the log file and determine that the server received three 'hits' (requests) during this period. We can also determine the number of requests for each URL (in this case, one each).
Looking at the lines, we can see that the server received three requests from the IP address 192.168.45.13, and when each of those requests was received. The two numbers at the end of each line are the response code and the number of bytes sent to the requesting browser. Response codes indicate how the server handled the request, and they are defined as part of the HTTP protocol.
In this example, all of them are 200, which means everything was OK. A response code you may be very familiar with is '404 - Not Found', indicating that the requested URL could not be found on the server. There are many other response codes, but these are the most common.
And that is what you can know precisely from the logs. You may wonder, then, why most log analysis programs display many other values. You can also derive other, more 'obscure' values directly from the same data, such as the count of each response code, the number of requests (hits) during a given period, or the total number of bytes sent to remote browsers.
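As an illustration, here is a minimal sketch (again assuming the hypothetical 'access.log' in Common Log Format) that tallies response codes and sums the bytes sent:

from collections import Counter

status_counts = Counter()
total_bytes = 0

with open("access.log") as log:               # hypothetical file name
    for line in log:
        parts = line.split()
        if len(parts) < 10:
            continue
        status = parts[-2]                     # second-to-last field: response code
        sent = parts[-1]                       # last field: bytes sent ('-' when none)
        status_counts[status] += 1
        if sent.isdigit():
            total_bytes += int(sent)

print("Requests per response code:", dict(status_counts))
print("Total bytes sent:", total_bytes)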
Other values can only be obtained by making certain assumptions; these cannot be considered completely accurate, and some can be quite inaccurate.
A web server can use other log file formats that provide more information than the CLF (Common Log Format) mentioned above, and these formats will be discussed shortly.
For now, know that all you can determine precisely is the IP address making the request, the URL requested, and when the request was made, as shown in the example above.
The Good, the Bad and the Ugly.-
Now you have an idea of how a web server works and what information can be obtained from its log, such as the number of requests for specific URLs, the number of IP addresses making requests, how many requests were made from each IP address, and when they were made.
With that information, you can answer questions like: "What is the most popular URL on my site?", "What is the next most popular?", "Which IP address has made the most requests to my server?" and "How busy has the server been over a given period of time?".
Most analysis programs will also let you answer questions like "At what time of day is my website most active?" or "What day of the week is the server busiest?". They allow you to analyze the visit patterns of the website, which may not be apparent just by looking at the raw log files.
All of these questions can be answered with absolute precision by a simple analysis of the log files.
That was the good news. The bad news? For all the things you can determine by looking at the log files, there are many things you cannot determine with precision. Unfortunately, some analysis programs (especially commercial ones) invite you to think otherwise, and forget to tell you that many of their figures are merely assumptions and cannot be considered accurate at all. Like what? you may ask. Well, for example, what some programs call 'visitor navigation', or simply 'navigation', which is supposed to indicate which pages a user visited and in what order. Or the time a visitor supposedly spent on your website. Another value that cannot be calculated precisely is 'visits', that is, how many users 'have visited' your site over a period of time. None of these can be calculated accurately, for several reasons. Here are some:
The HTTP protocol is 'stateless'.-
With a program that runs on your own computer, you can always determine what the user is doing: you start the program, do whatever you need, and exit when you are done. The HTTP protocol, however, is different. The web server only sees requests arriving from an IP address. The remote machine connects, sends the request, receives the response and disconnects. The web server has no idea what is happening on the remote side, or even what was done with the response it sent.
This makes it impossible to determine things like how long a visitor spent on your website. For example, suppose an IP address requests your home page and then, 15 minutes later, requests another page of your website. Can you determine how long the visitor spent on your site? The answer is, of course, no! All you know is that there is a 15-minute gap between the requests; you have no idea what the remote machine did in between.
Some analysis programs would say the visitor spent at least 15 minutes on your website, plus some additional time viewing the last page requested (5 minutes or so). This is really just an estimate and nothing else.
Individual visitors cannot be determined.-
Web servers receive requests and send responses to the IP address making the request. There is no way to determine who or what is behind that IP address, only that it made the request. It could be an individual, a program running on a computer, or many people sharing the same IP address (more on this later). Some of you will point out that the HTTP protocol has a mechanism for user authentication, where a 'username' and 'password' are required to access a website or certain pages within it. While this is true, it is something that public websites, which are the majority, do not use (otherwise they would not be public).
For example, let us say a request to the server is made from one IP address, and one minute later another IP address makes a request. Can you say how many people visited your site? Again, the answer is no! One of those requests may have come from a search engine 'spider', a program designed to crawl the web looking for links and the like. Or both requests may have come from the same user, just from different IP addresses.
Some analysis programs try to determine the number of visitors based on things like the IP address plus the browser type, but even so, the results are just estimates based on inaccurate assumptions.
The topology of the Internet makes even IP addresses problematic.-
In the past, each machine that wanted to use the Internet had its own unique IP address. However, as the Internet grew, so did the demand for IP addresses. As a result, different methods of connecting to the Internet were developed to alleviate the shortage of IP addresses.
Suppose, for example, a home user with a dial-up connection. He places a phone call to the Internet service provider, the computers negotiate the connection, and he is assigned an IP address from a pool of addresses allocated to the provider. Once the user disconnects, that IP address becomes available to other users. The home user typically gets a different IP address each time he connects, which means that if he is disconnected for any reason, he will get a different IP address when he reconnects.
Given this situation, the same user may appear with different IP addresses over a period of time.
Another typical situation occurs in a business where the PCs use private IP addresses to talk to each other internally and connect to the Internet through a 'gateway' or 'firewall' machine that translates their private IPs into the public IP used by the gateway/firewall. This makes all the users in the company appear as if they were using the same IP address.
'Proxy' servers operate in a similar manner, so there may be thousands of users appearing to come from the same IP address. In these situations, can you say how many users have visited your site if the log file shows 10 requests from the same IP address in the last hour? Again, the answer is no! It could be one user or several different users behind the same firewall. And what if the log shows 10 requests from 10 different IP addresses? Were they 10 different users? They may have been 10 different users, one or more users plus a search engine spider, or any combination.
But wait, there is more.-
Well, what have we learned? In short, you do not know who or what is making requests to the server, and you cannot assume that a single IP address is a single user. Of course you can make all kinds of estimates, but that is all they are, and they cannot be considered accurate.
Consider the following example: address A makes a request to the server. One minute later, address B makes another request, and 10 minutes after that, address A makes yet another request. What can be determined from this sequence? Well, we could assume visits from two users. But what if address A is a firewall? The two requests from A may have been two different users. And what if the user at A disconnected, got a different address (address B) when reconnecting, while someone else connected at that moment and was given address A? Or perhaps one user is connecting through proxies, and all three requests actually came from the same person. And can you tell how the user browsed the site or how long he stayed?
Fortunately, you now know that the answer to these questions is a resounding NO. Without being able to identify unique users, we cannot say what a single user does. However, all is not lost. Over time, people have found ways to work around these limitations. Systems have been built to overcome the 'stateless' nature of the HTTP protocol: 'cookies' and other identifiers are used to track users, as are dynamic web pages backed by databases. However, all of this is external to the protocol itself, so it is not recorded by the server as standard, and specialized tools are needed to analyze it.
In all other cases, any value claiming to provide this kind of information can only be considered an estimate based on certain assumptions.
An example of this can be found in Webalizer itself. The concept of a 'Visit' is a value that cannot be considered accurate, although it is one of the things Webalizer reports. It was included because of numerous requests from people using the program. It is based on the assumption that a single IP address represents a single user, and we have already seen how weak that assumption is in the real world. If you read the program documentation, you will see clearly that the 'Visits' value (along with 'Entry page' and 'Exit page') should not be considered accurate, but simply an estimate. We have not yet discussed the concepts of 'Entry' and 'Exit' pages, but they are based on the concept of 'Visit', which we already know is not accurate. They are supposed to be the first and last pages that a visit to your website sees: when a request considered to start a new 'Visit' is received, the URL of that request is, in theory, the entry page to the website; similarly, the last URL requested is recorded as the page from which the site was abandoned (the exit page).
Similarly, since they are based on the concept of a 'Visit', the notions of visit navigation ('path' or 'trail'), that is, which pages were visited and in what order, should be treated with the same caution.
One of the 'funniest' values shown by some analysis programs is the one that guesses where a visit comes from, based on where the domain making the request is registered. A clever idea, but worthless. Take, for instance, the domain of the provider AOL (America On-Line), which is registered in Virginia. The program considers everyone who uses AOL to be living in Virginia, which we know is not the case, since this provider has Internet access points all around the world (AOL is one of the most important connectivity providers in the USA).
Personally, some time ago I used to connect to the Internet from Madrid through AOL. To the websites I was visiting, the visit appeared to come from Virginia, which it was not.
Other values you can determine.-
Now that you have seen what is and is not possible, you may be wondering about the other values these programs display and how accurate they can be.
Fortunately, based on what you have read so far, you should be able to work it out yourself. Right? One of these values is the 'page' or 'page view'. As we know, a web page is typically made up of an HTML text document plus other elements such as images, audio or other multimedia objects, style sheets, etc. A request for a web page can generate dozens of requests for these elements, but most people want to know how many web pages were requested without having to count all the elements that make them up. You can get this number if you decide which types of files can be considered a 'page'. On a typical server, these would be the URLs with a .htm or .html extension. Maybe your website is dynamic, in which case your page extensions will be .asp, .php or .pl. Obviously you do NOT want to count .gif or .jpg images as pages, nor style sheets, Flash graphics or other elements. You can analyze the log file and count requests for URLs that match your criteria for a 'page', but most web analytics programs (including Webalizer) do this work for you.
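For example, here is a minimal sketch of that filtering, assuming the hypothetical 'access.log' again and a site whose pages end in .htm, .html or .php (adjust the extension list to match your own site):

PAGE_EXTENSIONS = (".htm", ".html", ".php")    # assumed page extensions

page_views = 0
with open("access.log") as log:                # hypothetical file name
    for line in log:
        parts = line.split()
        if len(parts) < 10:
            continue
        url = parts[6].split("?")[0]           # requested URL, minus any query string
        if url.endswith("/") or url.lower().endswith(PAGE_EXTENSIONS):
            page_views += 1                    # count it as a page view, not just a hit

print("Page views:", page_views)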
More information.-
So far we have only considered the CLF (Common Log Format), but there are others. The most common is called the 'combined' format, which takes the CLF format and adds two more pieces of information: the 'user agent' and the 'referrer'.
The 'user agent' is simply the name of the browser or program (Internet Explorer, Mozilla Firefox, Chrome, Opera, Konqueror, Safari, etc.) used to make the request to the web server.
The 'referrer' is supposed to be the page from which the user arrived at your website.
Unfortunately, both can be misleading. The 'user agent' name can be set to whatever you like in modern browsers. A common trick among Opera users is to set the 'user agent' to Internet Explorer, so that they can visit sites that only allow visits from Internet Explorer.
As for the 'referrer', according to the HTTP protocol documentation (the RFC), browsers may or may not send it, and even when they do send it, it is not required to be accurate or even informative.
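Here is a minimal sketch of reading these two extra fields from a line in the 'combined' format; the sample line (its referrer and user agent values) is hypothetical:

import re

COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('192.168.45.13 - - [24/May/2005:11:20:39 -0400] '
        '"GET /mypage.html HTTP/1.1" 200 117 '
        '"http://www.example.com/links.html" "Mozilla/5.0 (X11; Linux)"')

match = COMBINED.match(line)
if match:
    print("Referrer:  ", match.group("referrer"))
    print("User agent:", match.group("agent"))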
The Apache web server (one of the most widely used on the Internet) can also record other things, such as cookie information, the time taken to handle a request, and much more.
Unfortunately, the inclusion and placement of this information in the server log files is not standardized.
Another format, developed by the W3C (World Wide Web Consortium), can record various pieces of information in any order within a record, so a header is needed to map the fields.
Web analytics programs handle some of these formats better than others.
Analysis Techniques.-
The only way to get an accurate view of what your server is doing is to look at its log files.
This is how most programs get their data, and it is the most accurate method. Other methods can be used, with varying results.
A common method, which was very popular for a time, was the use of a 'users counter'. Basically, it consisted of a small dynamic graphic included in a web page which incremented a counter and displayed its value every time the page was requested.
One problem with this method was that a different graphic had to be included for each page you wanted to track. Another problem occurred if the remote user's browser did not display images (because that option was disabled, or because it was a text-only browser). It was also possible to 'inflate' the counter simply by pressing the browser's 'refresh' button again and again.
Similar methods were developed using Java and JavaScript, in an attempt to obtain even more information about the visit, such as the screen resolution and the operating system used.
Of course, this could also easily be 'tricked'.
Some companies developed systems capable of monitoring a remote web server by having you include an image or JavaScript element on your website; their system was contacted each time that image or JavaScript element was requested.
All of these had the same problems and limitations: by disabling the display of images and/or Java/JavaScript in the browser, a user could navigate the web completely 'concealed' from them (though not from the log files).
Be aware that these kinds of counters and remote measurement services are not as accurate as you might think.
Conclusion.-
By now it should be obvious that only certain values can be determined accurately from the server log files.
Some values are totally accurate, and others are vague and can be trusted to a greater or lesser degree depending on the assumptions that were made.
Want to know how many requests resulted in a '404 - Not Found' error? Go and count them, and fully trust the number you get.
Want to know the number of users who have visited your site? Good luck with that! Unless you go beyond the log files, whatever value you get will be imprecise. But by now you should have an idea of what is and is not possible, so when you look at the log reports you can tell what the numbers mean and how far to trust them.
You will also know by now that a lot depends on the configuration of the analytics program, and that a misconfiguration gives bad results.
Consider the example of 'pages'. If your analytics program thinks that only URLs with a .htm or .html extension are pages, and your site is made up of .php pages only, the number of page views will be totally invalid, not because the program is bad, but because someone gave it the wrong information on which to base its calculations.
Remember that knowledge is power, so now you have the power to ask the right questions and get precise results.
So, next time you see the server report, you will see it from a different perspective given your new knowledge.
Glossary of Terms.-
Hits represent the total number of requests made to the server during a given period of time (month, day, hour, etc.).
Files represent the total number of requests (hits) that resulted in something actually being sent back by the server. Not all requests do: for example, requests answered with '404 - Not Found', or requests for pages that are already in the browser's cache on the visitor's PC, send no data.
Tip: by comparing 'hits' and 'files' you can get a crude indication of repeat visitors, since the greater the difference between the two, the greater the number of users requesting pages they already have in their browser cache (that is, pages they have already visited).
Sites is the number of unique IP addresses making requests to the server. Care should be taken when using this value for any other purpose: many visitors may appear to come from a single 'site', and a single visitor may also appear to come from several different IP addresses, so at best it is only an estimate of the number of visitors to your website.
A Visit begins when a remote machine requests a page from your website for the first time. As long as it keeps making requests within a timeout period, all of those requests are considered part of the same 'Visit'.
If the remote machine makes a request to the server and the time elapsed since its last request is greater than the timeout period (usually 30 minutes), it is counted as a new 'Visit'. Since only page requests generate 'Visits', requests for graphics and other non-page URLs are not counted, which reduces the number of false 'Visits'.
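A minimal sketch of this timeout logic, assuming requests have already been parsed from the log into (ip, time in seconds, url) tuples sorted by time (the sample data and extension list are hypothetical):

TIMEOUT = 30 * 60                              # the usual 30-minute timeout
PAGE_EXTENSIONS = (".htm", ".html", ".php")    # assumed page extensions

requests = [
    ("192.168.45.13", 0,    "/mypage.html"),
    ("192.168.45.13", 60,   "/other.html"),
    ("192.168.45.13", 7200, "/back-again.html"),   # more than 30 minutes later
]

last_seen = {}     # ip -> time of its last page request
entry_pages = []   # first page of each visit
exit_pages = {}    # ip -> last page requested so far (the eventual exit page)
visits = 0

for ip, when, url in requests:
    if not url.lower().endswith(PAGE_EXTENSIONS):
        continue                               # only pages start or extend a visit
    if ip not in last_seen or when - last_seen[ip] > TIMEOUT:
        visits += 1                            # first request, or timeout exceeded: new visit
        entry_pages.append(url)                # this URL becomes the entry page
    last_seen[ip] = when
    exit_pages[ip] = url                       # the most recent page becomes the exit page

print("Visits:", visits)                       # 2 with the sample data
print("Entry pages:", entry_pages)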
Pages are the requested URLs that are considered whole pages, as opposed to the individual elements (such as graphics and audio clips) that make them up.
Some people call this value 'page views' or 'page impressions'; by default it counts any URL whose extension is .htm, .html, .php, .cgi, etc.
Traffic is measured in KBytes: 1 KByte (1 KB) is 1024 bytes. This value indicates the amount of data (traffic) transferred between the server and the remote machine.
A Site is a remote machine that makes requests to your server, identified by the remote machine's IP address or hostname.
URL (Uniform Resource Locator). Every request made to a web server asks for 'something'.
The URL is that 'something', and represents an object on the web server that is either accessible to the remote user or results in an error (e.g. '404 - Not Found'). URLs can point to objects of any type (HTML, audio, graphics, etc.).
Referrers are the URLs that lead a user to your website or cause the browser to request something from your server.
Most requests have your own URLs as referrers, since most HTML pages contain links to other objects such as graphics files.
If one of your HTML pages contains links to 10 images, then each request for that HTML page will produce 10 more 'hits' whose 'referrer' is the URL of your own HTML page.
Search Strings are obtained by examining the referrer string and looking for the known query patterns of various search engines.
The search engines and patterns to look for can be specified by the user in a configuration file, but by default the main search engines are recognized.
Note: Only available if this information is in the log files.
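As an illustration, here is a minimal sketch of extracting a search string from a referrer URL; the engine-to-parameter mapping below is a simplified assumption (real analyzers use much longer pattern lists):

from urllib.parse import urlparse, parse_qs

ENGINE_PARAMS = {          # assumed query parameter per engine
    "google.": "q",
    "bing.": "q",
    "yahoo.": "p",
}

def search_string(referrer):
    parsed = urlparse(referrer)
    query = parse_qs(parsed.query)
    for engine, param in ENGINE_PARAMS.items():
        if engine in parsed.netloc and param in query:
            return query[param][0]
    return None

# hypothetical referrer taken from a 'combined' format log line
print(search_string("http://www.google.com/search?q=web+log+analysis"))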
User Agent is a name that refers to the browser used by the user (Internet Explorer, Mozilla Firefox, Chrome, Opera, Safari, Konqueror, etc.).
Each browser identifies itself to the server with its own name. Note, however, that many browsers allow the user to change this name, so you may find fake names in the report.
Note: Only available if this information is in the log files.
Entry/Exit Pages are the pages requested first in a 'Visit' (Entry) and last (Exit).
These pages are calculated using the logic of 'Visits' indicated above.
When a new 'Visit' is counted, the page requested is recorded as the 'Entry page', and whatever page is requested last is recorded as the 'Exit page'.
Countries are determined based on the top-level domain (e.g. .com, .es, .fr, .it, etc.) of the 'site' making the request.
This is, however, questionable, since nowadays there is no strict control over domains as there was in the past. A .COM domain may reside in the US or anywhere else in the world, and an .IL domain may actually be in Israel, but it can also be located in the US or anywhere else in the world. The domains most frequently encountered are: .COM (US commercial), .NET (Internet services), .ORG (non-profit organizations) and .EDU (educational institutions).
A high percentage may be shown as Unresolved/Unknown, since many Internet connections do not resolve to a name but appear simply as an IP address.
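A minimal sketch of this (admittedly questionable) guess, using a tiny hypothetical subset of the top-level domain mapping:

TLD_NAMES = {              # hypothetical subset of the mapping
    "com": "US commercial",
    "net": "Internet services",
    "org": "Non-profit organizations",
    "es":  "Spain",
}

def country_of(hostname):
    tld = hostname.rsplit(".", 1)[-1].lower()
    return TLD_NAMES.get(tld, "Unresolved/Unknown")

print(country_of("proxy.aol.com"))   # reported as 'US commercial' even if the user is in Madrid
print(country_of("192.168.45.13"))   # bare IP addresses remain Unresolved/Unknown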
Response Codes are defined as part of the HTTP protocol. These codes are generated by the web server and indicate the state in which each request was completed.