How it Works
Log File Archive
Archiving Access Activity
Many web servers keep their raw log files only for a limited period, yet this information may be crucial in determining what happened at a particular time. If the web server crashes or the log files are lost then you need the security of an offsite log backup. Hosting companies may well back up the web site files but not the logs: logs can grow very large, so backing them up is an expensive option for a host to provide. Some servers keep only the previous month or week of logs on the site before they are replaced by the next.
All web servers will store information about the accesses made to a site in a log file archive. This enables a web master to investigate problems. Standard web hosting companies generate their 'free' access statistics by processing these log files to work out what is being accessed on the site and when.
Site Vigil can archive log file contents automatically to a local folder on your PC. This gives you the security of an offsite copy of all of the data without needing to remember to FTP the contents every week or month.
Web Server Log Files
A web site host runs a special service that manages the HTTP protocol. The HTTP service is the standard way that HTML pages are transferred to browsers over the Internet. Each HTTP request received by the server is typically a request for the contents of a particular web page or graphics file. The server logs all these requests, with each new request or 'hit' recorded as a separate line. The server log file is the source of the information used to generate the web site statistics offered by most web hosting companies. It records the date and time, source IP address, data requested, referring web page and browser. The referral data gives vital information about the links people are using to reach a web site. The data requested is the full URL used to reach the site; quite often this will include the keywords used by the search engine to list the web site.
Here is a sample line from a log file :
The meaning of each of these fields is as follows :
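As a sketch of how such a line is structured, the following Python fragment picks apart a made-up line in the widely used Apache 'combined' format (the IP address, URLs and browser string are illustrative assumptions, not data from any real server) :

```python
import re

# An illustrative line in Apache "combined" log format (made up for this
# example, not taken from a real server log).
line = ('203.0.113.7 - - [10/Oct/2005:13:55:36 +0100] '
        '"GET /index.html HTTP/1.1" 200 2326 '
        '"http://www.google.com/search?q=site+monitoring" '
        '"Mozilla/4.0 (compatible; MSIE 6.0)"')

# One named group per field: IP address, identity, user, timestamp,
# request line, status code, bytes sent, referrer and browser (agent).
pattern = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

hit = pattern.match(line).groupdict()
print(hit['ip'])        # the source IP address
print(hit['status'])    # the HTTP status code
print(hit['referrer'])  # the search query that referred the visitor
```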
Site Vigil can automatically fetch and analyse log files in a variety of different formats.
Log File Formats
Logging accesses to Web sites
All web servers will store information about the accesses made to a site in a log file. This enables a web master to investigate problems and statistics to be generated showing what is being accessed on the site and when.
There are two main formats in use, and different servers (Microsoft® IIS, Apache®, ...) use different log formats. However, they are all text based, and each access to a resource (HTML page or graphics file) is recorded as a single line of information (a hit). A web site administrator can usually control how much information goes into the log file, as the full set of information is rather large. Each log file record may contain the following fields of information :
Some servers will archive log files in compressed format (e.g. ZIP, GZ or CAB).
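As a sketch of how a compressed archive can still be read line by line, the following Python fragment uses the standard gzip module on an in-memory example (the log line is made up for illustration) :

```python
import gzip
import io

# A tiny gzip-compressed "log file" built in memory for illustration;
# in practice you would open a downloaded access.log.gz file instead.
raw = b'192.0.2.1 - - [01/Jan/2005:00:00:00 +0000] "GET / HTTP/1.0" 200 512\n'
compressed = gzip.compress(raw)

# gzip.open accepts filenames too, e.g. gzip.open("access.log.gz", "rt"),
# so compressed archives can be read line by line like plain text logs.
with gzip.open(io.BytesIO(compressed), 'rt') as log:
    records = [line.strip() for line in log]

print(records[0])
```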
Site Vigil supports a wide variety of log file formats and will automatically select an appropriate one by scanning a sample log file. It can then analyse the records and pick out error and referral records.
Specifying information format
The HTTP protocol transfers information around in binary format; it is up to the client and server to negotiate so that the client (typically a browser) is only sent information that it can understand. This negotiation is carried out using MIME (Multipurpose Internet Mail Extensions) types.
As the acronym suggests, MIME was originally developed to describe the content of email messages but is now much more widely used within HTTP. It uses a simple two-part text description of the content format, consisting of a type and a subtype. So text/html indicates that the content is basically text but in HTML format, text/plain is for raw untagged text (as in a .txt file) and image/jpeg indicates a graphics file in JPEG image format.
When a browser requests data it states the MIME types it is willing to accept in the response; the server then chooses an available format for the response from these types.
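As an illustration of the type/subtype scheme, the following Python sketch uses the standard mimetypes module to map resource names to the MIME type a server would typically report in the Content-Type response header :

```python
import mimetypes

# Map resource names to the two-part MIME description (type/subtype)
# that a server would typically send in the Content-Type header.
for name in ('index.html', 'readme.txt', 'photo.jpeg'):
    mime, _encoding = mimetypes.guess_type(name)
    print(name, '->', mime)
# index.html -> text/html
# readme.txt -> text/plain
# photo.jpeg -> image/jpeg
```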
Using Ping to check a Server is working
One important facet of site monitoring is knowing as soon as possible that servers have failed or are not accessible. A web server provides multiple services, not just HTTP, and just because a server is not responding to HTTP requests does not imply it is not functioning at all.
The simplest means of establishing whether a site or server is alive is using the Internet Control Message Protocol (ICMP) to Ping a server. This is a much simpler request than fetching an HTML page in terms of communication overheads. It runs over the IP protocol and so checks that the IP part of TCP/IP is functioning correctly. This protocol is also used by the tracert command line utility to find the route that communication takes to a server.
The Ping connectivity check works well in an office intranet situation too: it can regularly monitor whether key servers and workstations are responding properly to IP traffic within a local area network.
The protocol supports a number of commands but the ECHO command is the one of interest for Ping monitoring. It instructs routers to pass the message over IP to a particular destination IP address, requesting an ECHO REPLY to be sent back. Measuring the time between issuing the ECHO and receiving the ECHO REPLY determines the responsiveness of the remote server. The ICMP echo reply includes a Time to Live (TTL) value, which indicates the number of router hops the message has gone through from the source. The packet starts off with an initial TTL value (commonly 64, 128 or 255, depending on the operating system) and each router it passes through decrements the value by one. If the number of hops is erratic or suddenly becomes large this indicates a router problem.
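The hop count can be estimated from the TTL seen in an echo reply. The following Python sketch assumes the sender used one of the common initial TTL values (64, 128 or 255); this is a monitoring heuristic, not part of the ICMP protocol itself :

```python
# Estimate how many routers a packet crossed from its remaining TTL,
# assuming the sender started at one of the common initial TTL values.
def estimate_hops(observed_ttl):
    for initial in (64, 128, 255):
        if observed_ttl <= initial:
            # Each router decremented the TTL by one on the way.
            return initial - observed_ttl
    raise ValueError('a TTL above 255 is not valid')

print(estimate_hops(247))  # 8 hops if the packet started at 255
print(estimate_hops(54))   # 10 hops if it started at 64
```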
The same ECHO command can be used to trace a route over the Internet (as the tracert program does). In this case the protocol's TTL field is used to limit how many hops between routers a request can make before it fails. If the limit is reached then a failure response is returned, carrying the IP address of the most distant router reached on the path. By iterating over TTL values until the destination server is reached, all the routers can be identified. By inspecting the time delay between routers along the communication path, bottlenecks can be easily identified.
Site Vigil supports ping monitoring and reports any problems accessing servers (IP addresses) and also keeps track of how long it takes to access them. It keeps track of the TTL value returned and so also monitors the route to the server. All this information is displayed in a graphical format for each IP address being monitored.
Connecting to the correct Service Port
Each IP address can be accessed on a range of numeric port numbers. The port number is part of a client connect request and can be specified as part of a URL; when a URL omits the port number the default port for that service is assumed (80 for the HTTP web service). Ports map onto inter-communicating sockets: when a server socket is set up it chooses a unique port number on which to listen for requests (as part of the bind socket API call), and the client issues a connect to the server giving an IP address and a port.
In most cases a port number is assigned to a particular service, so the number is really acting as a name for the service that is required. To most users the only ports of interest on the Internet are those used for HTTP and FTP.
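As a sketch of how the default port is chosen when a URL omits one, the following Python fragment uses the standard urllib.parse module with a small illustrative table of well-known ports (this subset is an assumption, not a complete list) :

```python
from urllib.parse import urlsplit

# Well-known default ports for common URL schemes (illustrative subset).
DEFAULT_PORTS = {'http': 80, 'https': 443, 'ftp': 21}

def effective_port(url):
    parts = urlsplit(url)
    # urlsplit reports the explicit port, or None when the URL omits it,
    # in which case the service's default port applies.
    return parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]

print(effective_port('http://example.com/index.html'))   # 80
print(effective_port('http://example.com:8080/admin'))   # 8080
print(effective_port('ftp://example.com/pub/file.zip'))  # 21
```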
A more comprehensive list of standard ports is as follows :
Indirect proxy access to the Internet
Originally each computer wishing to use the Internet had to connect to it directly. This is fine for servers or a home user dialling up for a connection, but not convenient for an office environment where hundreds of PCs may want to use the Internet at the same time. To solve this problem proxy servers are used. These servers have a dedicated Internet connection and make requests on behalf of (as a proxy for) all the computers wishing to access the Internet through them.
Most browsers have connection settings that allow you to configure the IP address and port used to communicate with a proxy server. HTTP requests are then sent to the proxy server over TCP/IP, which in turn sends them out onto the Internet. A proxy server may run on a separate machine (often in conjunction with a firewall) or as an ordinary program running on a PC. It needs to keep track of all client requests so it can route the responses sent back from the server to the browser that requested the information.
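As a sketch of how a program (rather than a browser) can be pointed at a proxy server, the following Python fragment configures the standard urllib library; the proxy address 192.0.2.10:8080 is an assumption for illustration, substitute your own server :

```python
import urllib.request

# Route HTTP and HTTPS requests through a proxy server. The address
# below is a placeholder; use your own proxy's IP address and port.
proxy = urllib.request.ProxyHandler({
    'http': 'http://192.0.2.10:8080',
    'https': 'http://192.0.2.10:8080',
})
opener = urllib.request.build_opener(proxy)

# Installing the opener makes urllib.request.urlopen use the proxy for
# every subsequent request made by this process.
urllib.request.install_opener(opener)
```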
You can configure Site Vigil to access the Internet via a Proxy Server for HTTP and HTTPS requests.
Getting information from a user
Queries are an important part of HTTP. They are used to pass additional information to a server about the data requested. The most widespread usage is when an HTML form is submitted: with an HTTP GET the various values entered on the form are sent as a query string tagged onto the end of the URL (an HTTP POST sends them in the request body instead). Search engines such as Google use this mechanism to send the search phrase or keywords that the user typed in when the Search button is clicked. Each web server is free to use whichever keywords it likes in the query string; there are few constraints it has to follow.
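As a sketch of how the keywords can be recovered from such a query string, the following Python fragment parses a made-up Google referrer URL (the 'q' parameter name is Google's convention; other engines use different names) :

```python
from urllib.parse import urlsplit, parse_qs

# A referrer URL of the kind a search engine sends (made up for this
# example). The search phrase travels in the 'q' query parameter.
referrer = 'http://www.google.com/search?q=web+site+monitoring&hl=en'

query = parse_qs(urlsplit(referrer).query)
keywords = query.get('q', [''])[0]
print(keywords)  # web site monitoring
```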
If you use Site Vigil to monitor referrals to a web site you can get it to report details of all the queries users have specified to reach your web site. This is a crucial source of information when choosing the most effective set of keywords to use.
Web Site Ranking
There are a number of Internet services that attempt to rank web sites in some sort of popularity order. As there are about 500 million web sites this is not an easy task and relative ranking scores are not to be totally trusted.
For example, on one particular day the top web sites according to Alexa were Yahoo!, MSN, Google, Passport, eBay and Microsoft.
It is not possible to look at individual web site traffic statistics to gauge popularity, because server logs are not publicly accessible. Instead, services like Google and Alexa build up their statistics through their browser toolbars: each time a toolbar is used to reach a web site, the click is recorded and added to their database. Because only toolbar users are counted, all ranking measures are rather inaccurate; it is best to treat them as a very rough indication of relative popularity only.
Google also includes a Page Rank figure as a rough estimate (a score out of ten) of the importance of a page. Most web sites manage to score between 3 and 6 out of ten. Only very large and very popular web sites score 8 or more (currently CNN scores 8/10 and BBC News 9/10). View with suspicion any page with a rank below 3.
Looking at how visitors find your site
Do you know how people are reaching your web site?
When other web sites have links to a site, you get a referral each time a user follows one of those links. This is usually a person clicking on a link in a browser, or else an automated scanning engine or robot.
It is important to monitor the number of referrals coming to a site so you can adjust the site content, and therefore the keywords, to attract more visitors. It may be that people are coming to the site for the wrong reason and are not going to stay or come back again.
When a user follows an HTML link to get from one web page to another, the web server typically stores how the page was referred to (locally or remotely). Tracking how people get referred to a web site is crucial to measuring its effectiveness. Are people getting to the information quickly and easily? Are they only looking at one page and then leaving the web site? Which keywords and phrases are people using to reach your site?
Some firewalls give the option to remove the referral information in order to give the user more anonymity; if a web site needs to be sure it knows where it has been referenced from, it needs to include this as part of the query part of a URL.
Referral monitoring allows the sources of external web traffic to be identified. Typically Google will be the largest source of referrals, as most people find web sites from a search engine and Google is by far the most popular one at present.
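As a sketch of how referral sources can be tallied, the following Python fragment groups some made-up referrer URLs by referring host using the standard library :

```python
from collections import Counter
from urllib.parse import urlsplit

# Referrer URLs as they might appear in a server log (made-up sample).
referrers = [
    'http://www.google.com/search?q=site+vigil',
    'http://www.google.com/search?q=log+monitoring',
    'http://directory.example.org/tools.html',
    '-',  # servers log '-' when no referrer was sent
]

# Count referrals per referring host, skipping empty entries.
sources = Counter(urlsplit(r).hostname for r in referrers if r != '-')
for host, hits in sources.most_common():
    print(host, hits)
```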
By using referrer information you can work out :
Site Vigil can scan the server log files and check all the referrals to a site. It will then alert you when a new source of referrals starts and even warn you when the expected level of referrals declines. Referral monitoring is a crucial part of web site management. A detailed history of referrals gives you the information needed to check out any new references to your site.
Internet Standards and Proposals
The technical standards that govern how the various parts of the Internet function are documented as Request For Comments (RFC) documents. Although the name suggests that these are just early proposals, these documents include the actual working standards for much of Internet technology. Some RFCs are experimental and some have been entirely superseded, so it is important to refer to the appropriate RFC.
They are co-ordinated by committees of Internet professionals. There are over three thousand RFCs in existence covering all aspects of the Internet. A good starting point is the Internet Engineering Task Force (IETF) web site www.ietf.org. There are copies of the RFCs on different sites, including the IETF and the World Wide Web Consortium (W3C).
Here is a list of commonly referenced RFCs, but be warned that their technical nature can make them tough reading :
RFC777 : Internet Control Message Protocol
Automated Internet Scan
On the Internet a robot is not some mechanical human-like servant but just a special type of computer program. Ever wondered how search engines build up their indices of web sites? Search engines, amongst other programs, use robots to continually trawl web sites, analyzing the contents as if they were human visitors using a browser. They use HTTP just like browsers in order to access information. A server can state which pages should be inspected by robots; the instructions are stored in the robots.txt file. Over time, search engines have grown much more sophisticated and the way they scan sites is complex. Many will first scan the site's index page, coming back after weeks or months to drill down and scan the rest of the web site.
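As a sketch of how robots.txt instructions are interpreted, the following Python fragment uses the standard urllib.robotparser module on a made-up rule set ('ExampleBot' is an illustrative robot name; a real robot would fetch the file from the site before scanning) :

```python
import urllib.robotparser

# Parse a small robots.txt rule set supplied as lines of text. The rules
# below allow robots everywhere except under /private/.
rules = urllib.robotparser.RobotFileParser()
rules.parse([
    'User-agent: *',
    'Disallow: /private/',
])

print(rules.can_fetch('ExampleBot', '/index.html'))      # True
print(rules.can_fetch('ExampleBot', '/private/a.html'))  # False
```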
The server log usually includes a browser field in each record indicating the name of the robot. No well-behaved robot will flood a server with requests, as this would affect the server's performance; instead they spread their site scan over hours or days. Robots should include a contact URL or email address in the browser information of the HTTP request header so that a web master can analyze robot activity.
Site Vigil can easily report on recent visits by robots; this can indicate that search engines are busy indexing a web site.