The W3C Reference Library, a.k.a. Library of Common Code, is a general
code base written in portable C that can be used as a basis for
building clients, servers and many other Web applications. It contains
reference code for accessing HTTP, FTP, Gopher, News, WAIS, Telnet
servers, and the local file system. Furthermore it provides modules
for parsing, managing and presenting hypertext objects to the user and
a wide spectrum of generic programming utilities. The Library is the
basis for many World-Wide Web applications and the
W3C reference applications are built on top of it.
This document describes the architecture of the Library in generic
terms without referring too much directly to the code itself. It is
meant to give an overview of the design which is required if you
intend to enhance the Library. If you are looking for a specific
description of the API then please read the User's
Guide.
Introduction
The W3C Reference Library is a general code base that can be used as a
basis for building a large variety of World-Wide Web applications. Its
main purpose is to provide services to transmit data objects rendered
in many different media types either to or from a remote server using
the most common Internet access methods or the local file system. It
provides plain C reference implementations of those specifications and
is especially designed to be used on a large set of different
platforms. Version 3.1 supports more than 20 Unix flavors, VMS,
Windows NT, and ongoing work is being done to extend the set of
platforms.
Plain C does not support an object-oriented model directly; it merely
makes the concept possible. Even so, many of the data structures in the
Library are modelled on class notation. This leads to situations where
forced type casting is required in order to use a reference to a
subclass where a superclass is expected. The forced type casting
problem, and inheritance in general, would be solved if an
object-oriented programming language were used instead of C, but the
current standardization and deployment level of object-oriented
languages would imply that part of the portability would
be lost in the transition. There are several intermediate solutions
under consideration where one or more object-oriented APIs built on
top of the Library provide the application programmer with a cleaner
interface.
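The forced type casting described above can be sketched as follows. This is a minimal illustration only; the names Base, Derived, Base_retain and Derived_retain are hypothetical and are not taken from the Library. It relies on the C rule that a pointer to a structure may be converted to a pointer to its first member.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "superclass": fields shared by all objects. */
typedef struct Base {
    const char *class_name;
    int         ref_count;
} Base;

/* Hypothetical "subclass": the superclass is the FIRST member, so a
   pointer to Derived may be converted to a pointer to Base. */
typedef struct Derived {
    Base        super;
    const char *address;
} Derived;

/* Works on the superclass part only. */
int Base_retain(Base *me)
{
    assert(me != NULL);
    return ++me->ref_count;
}

/* The forced cast: a subclass reference is used where the
   superclass is expected. */
int Derived_retain(Derived *me)
{
    return Base_retain((Base *) me);
}
```

Calling Derived_retain on a Derived object updates the ref_count stored in its embedded Base. The compiler cannot check that the cast is meaningful, which is the weakness the paragraph above points out.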
Many of the features of the Library are demonstrated in the Line Mode Browser which is a text
terminal client built right on top of the Library. Even though this
application is usable as an independent Web application, its main
purpose is to show a working example of how the Library can be
used. However, it is important to note that the Line Mode Browser is
only one way of using the Library and many other applications
may want to use it in other ways.
The development of the W3C Reference Library was started by Tim Berners-Lee in 1990,
and today the Library is a multi functional code base with a large
amount of knowledge about network programming and portability built
into it with help from Ari
Luotonen, Jean-Francois
Groff, Håkon W. Lie
and a large number of people on the Internet.
Basic Design Model
The main criterion behind the design of the W3C Reference Library was
to make it easily extendable as new Internet standards evolve for
transportation and representation of data objects. The philosophy was
to make it possible to dynamically "plug in" new modules without
touching the inner parts of the Library. On platforms that support
dynamic linking this can be used to change the functionality of an
application completely at runtime, and eventually the Library can be
extended to support some of the new concepts of mobile code where new
modules can be downloaded from the network at runtime as they are
needed in the application. The result of this concept was a Library
architecture consisting of 5 main parts as illustrated in the figure
below:

The figure is similar to a protocol stack where the lower layers
provide a set of services to the upper layers. This is also the case
in the Library where the "layering" is as follows:
- Generic Tools
- The Library provides a large set of generic utility modules such
as container classes, string utilities, network utilities etc. They
serve the important function of separating the upper layer code from
platform specific implementations by means of a large set of macros that
make the Library more portable. The modules are used throughout the
Library itself and can easily be employed in many applications.
- Core
- This part is the fundamental part of the Library. The size of the
core is deliberately kept small and it is important to note that it
can do nothing on its own; all the functionality for accessing the
network, parsing data objects, handling user interaction, logging
etc. is part of the upper modules in the figure. The core provides a
standard interface to the application program for requesting a service
but most often the handling of the request itself takes place outside
the core.
- Stream Modules
- All data is transported back and forth between the application and the
network using streams. Streams are objects that accept
blocks of characters, much as ANSI C FILE streams accept blocks
of characters. A block can be as small as one character but large
blocks are normally preferred for better performance. Often, even
though not required, a stream has an output to which it directs
outgoing data. An example of a stream with no output is a stream that
acts like a black hole - it absorbs data without ever sending it out
again. However, the typical situation for a stream is to have an
output and to perform some kind of data conversion on the incoming
data before it is redirected to the output.
- Access Modules
- The Access modules are protocol specific modules that make the
application capable of communicating with a wide range of Internet
services. The Library comes with a wide set of protocol access modules
such as HTTP, FTP, Gopher, WAIS, NNTP, Telnet, rlogin, TN3270, and the
local file system, but new ones can easily be added to the list.
- Application Modules
- The application modules are often specific for client applications
including functions that require user interaction, management of
history lists, call back functions, logging etc. The reference
implementations of these modules are often intended for character based
applications like the Line Mode
Browser. More advanced clients can override them; that is, a
module with an identical interface is provided by the application, and
the loading of the default module is suppressed.
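The stream idea from the Stream Modules entry above can be sketched in a few lines of C. This is an assumption-laden illustration, not the Library's stream interface: the Stream structure, its put_block method and the two example streams are hypothetical names. It shows a "black hole" stream with no output and a converting stream that rewrites each block before passing it to its output.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stream "class": one method plus an optional output. */
typedef struct Stream Stream;
struct Stream {
    int    (*put_block)(Stream *me, const char *buf, size_t len);
    Stream *output;      /* NULL when the stream has no output */
    size_t  absorbed;    /* bytes swallowed (used by the black hole) */
};

/* Black hole: absorbs data without ever sending it on. */
int BlackHole_put_block(Stream *me, const char *buf, size_t len)
{
    (void) buf;
    me->absorbed += len;
    return 0;
}

/* Converter: upper-cases each block, then redirects it to the output. */
int Upper_put_block(Stream *me, const char *buf, size_t len)
{
    char   tmp[256];
    size_t done = 0;
    assert(me->output != NULL);
    while (done < len) {
        size_t n = len - done < sizeof tmp ? len - done : sizeof tmp;
        size_t i;
        for (i = 0; i < n; i++)
            tmp[i] = (char) toupper((unsigned char) buf[done + i]);
        if (me->output->put_block(me->output, tmp, n) != 0)
            return -1;
        done += n;
    }
    return 0;
}
```

A chain is built by pointing the output field of one stream at the next. The caller only ever talks to the first stream in the chain.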
When writing an application most of the code interacting with the
Library will consist of access modules, stream modules, and
application modules. These modules can either provide additional
functionality or override existing functionality in the Library in
order to make use of more platform dependent implementations. The
latter will typically be the case with the application modules which
must be adjusted to a given graphic platform.
The User's Guide explains more on how to set
up and use the Access modules and the Stream modules in an application
and how to use the application modules. The rest of this document on
the architecture of the W3C Reference Library is devoted to describing
the Core of the Library.
Overview of the Core
The main concept in the Library is a "request/response" model where an
application issues a request for a URI (URL). The Library then tries
to fulfill the request as efficiently as possible, either by requesting
the URL from the origin server, a proxy server, or a gateway, or by
reading it directly from the local file system or a locally cached
version. Data is delivered back to the application as soon as it is
ready, which minimizes the access delay for the application. From version 3.0, the
Library supports threads including its own platform independent thread
model called "libwww threads". This allows
multiple requests to be handled simultaneously without blocking the
application while waiting on data.
Requests and Responses
The "request/response" model is illustrated in the control/data
diagram shown below. The diagram shows only the core modules - the
other modules are "pasted in" later. Note, that the Library code is to
the right of the thick vertical line (green), and the application to
the left can be any type of application, for example a proxy or a
client. The architecture of the Library supports clients and
proxies in much the same way, as it makes little difference to
the Library: a client has a user interface whereas a server has a
network interface. It is a good idea to study the Line Mode Browser and the httpd as reference implementations using
the Library to see this duality.
Another thing to note is that the Library from version 3.1 supports
large scale data flow from the application to the network as well as
from the network to the application. This has an important impact on
the functionality that can be put into applications, for example
allowing collaborative authoring possibilities via the Web. The
architecture behind this is described in the section "Post Webs - an API for PUT and POST".

The thin lines (red) are control flow, the thick lines (blue) are data
flow and the "lightning" (magenta) is control flow as a result of
events handled by the Library. Let's see what happens when an
application issues a request. The description is based on having an
event loop - this can either be the one provided by the Library or an
external event loop provided by the application. The section on libwww threads explains more on how this can
be set up. The numbers refer to the figure above.
- The event manager is waiting for an event from the application.
This can for example be a user clicking the mouse on a link or typing a
number on the keyboard.
- When an event arrives, the event manager calls the user event
handler provided by the application.
- The user event handler issues a request by calling the access
manager.
- The access manager contacts the cache manager to see if the object
is already cached. If data is to be sent to the network (for
example using the HTTP PUT method) then the cache manager is not
consulted.
- If the cache manager says "no" then the protocol manager is
contacted to download the object. If "yes" then the cache file is
accessed.
- The cache manager can also contact the protocol manager directly
if the cached object turns out to be stale or a reload has been explicitly
requested by the application.
- If the protocol manager can successfully access the data object
then the cache manager is contacted in order to cache or refresh the
object.
- When data is arriving, either from the cache manager or the
protocol manager it is passed to the format manager that handles any
data format conversion as requested by the application.
- The protocol can recursively call the access manager in case of
redirections and inadequate access authentication for the request
(after prompting the user).
- The converted data is either handed from the network to the
application or from the application to the network as it gets
ready. If no data is ready, control is given back to the event
manager.
- When data is ready to be sent or received from the network, the
event manager calls the protocol manager directly to handle the data.
- When the request is terminated the application is called with the
result of the request so that it can, for example, update a history list
of visited documents.
This description is the "macro" description of how the core modules
interact and in the rest of this document we shall see more of the
details of what is going on inside the core modules and what data
structures are involved. Note that by using a threaded model, the
Library can handle multiple requests simultaneously. An example of how
to do this is described in the section "Libwww
Threads".
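The core of the request flow described above (consult the cache, fall back to the protocol manager, refresh the cache, then hand the data to the format manager) can be sketched as a toy program. Everything here is a stand-in under stated assumptions: the function names, the fixed-size in-memory cache and the canned response are hypothetical, and no network access takes place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 8

typedef struct { char uri[64]; char body[64]; int used; } CacheEntry;
static CacheEntry cache[CACHE_SLOTS];   /* zero-initialized */
static int network_fetches;             /* calls to the "protocol manager" */

/* Stand-in cache manager lookup: returns the body or NULL on a miss. */
static const char *cache_lookup(const char *uri)
{
    int i;
    for (i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].used && strcmp(cache[i].uri, uri) == 0)
            return cache[i].body;
    return NULL;
}

/* Stores the object in the first free slot (dropped if the cache is full). */
static void cache_store(const char *uri, const char *body)
{
    int i;
    for (i = 0; i < CACHE_SLOTS; i++)
        if (!cache[i].used) {
            strncpy(cache[i].uri, uri, sizeof cache[i].uri - 1);
            strncpy(cache[i].body, body, sizeof cache[i].body - 1);
            cache[i].used = 1;
            return;
        }
}

/* Stand-in protocol manager: pretends to download the object. */
static const char *protocol_fetch(const char *uri)
{
    (void) uri;
    network_fetches++;
    return "<p>hello</p>";
}

/* Stand-in format manager: copies the data to the application. */
static size_t format_deliver(const char *body, char *out, size_t outlen)
{
    size_t n = strlen(body);
    if (n >= outlen)
        n = outlen - 1;
    memcpy(out, body, n);
    out[n] = '\0';
    return n;
}

/* Access manager: cache first, then the protocol manager on a miss. */
size_t access_get(const char *uri, char *out, size_t outlen)
{
    const char *body = cache_lookup(uri);
    assert(outlen > 0);
    if (body == NULL) {
        body = protocol_fetch(uri);
        cache_store(uri, body);
    }
    return format_deliver(body, out, outlen);
}

int access_network_fetches(void) { return network_fetches; }
```

Requesting the same URI twice reaches the stand-in protocol manager only once; the second request is served from the cache. Staleness checks, redirection and the event loop are left out of the sketch.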
- Access Manager
- The access manager is the main entry point for requesting a data
object pointed to by a URI. It has a set of methods that allows the
application to request different services, for example to get a URI,
post a URI, or to search a URI. When the application issues a request,
the access manager does the following:
- Translates the URI according to the rules given, for example by a
rule file. It also looks for gateways or proxies that should be
contacted for a specific access method. Rules can be registered
dynamically as described in the User's Guide.
- If the request is on the local file system, the access manager
verifies that access to local files is allowed. This might not always
be the case, for example when the Line
Mode Browser is used as a login shell for telnet sessions.
- Then the cache manager is contacted to see if the object
has already been accessed. The application might administer a memory
cache, in which case this is consulted before the file cache.
- If the data object is not cached then the protocol module is
called to actually perform the access to the network.
- When a request is to be terminated, the access manager can log the
result of the request to a local file so that the "browse route" can
be reconstructed.
- Protocol Manager
- The protocol manager is invoked by the access manager in order to
access a document not found in memory or in cache. The manager
consists of a set of protocol modules handling the access schemes
HTTP, FTP, NNTP, Gopher, WAIS, Telnet, and access to the local file
system. The protocol modules are registered dynamically (using static
linking) and the User's Guide describes how
modules can be registered. Each protocol module is responsible for
establishing the connection to the remote server (or the local
file system) and for extracting information using a specific access
method. When data arrives from the network, it is passed on to the
format manager.
- Format Manager
- The stream format manager takes care of the transportation of
streams of data from the network to the application and vice versa. It
also performs any parsing and data format conversion requested based
on a set of registered format converters and a simple algorithm for
selecting the best conversion. Like the protocol modules, data format
converters can be registered dynamically, and the current set of
streams includes among others: MIME, SGML, HTML, and LaTeX.
- Cache Manager
- The cache manager is used to save data objects once they have been
downloaded from the network. The cache uses the hierarchy indicated
in the URLs as a way to identify items in the cache. It is still under
construction and requires a lot of work to become a highly efficient cache
manager.
- Error Manager
- This module manages an information stack which contains
information about all errors that occurred during the communication with a
remote server, or simply information about the current state. Using a
stack for this kind of information provides the possibility of nested
error messages where each message can be classified and filtered
according to its impact on the current request, for example "Fatal",
"Non-Fatal", "Warning" etc. The filtering can be used to decide which
level of messages will be passed back to the user.
- Net Manager
- The net manager provides an interface for handling asynchronous
sockets, which are an integral part of the Library.
- Event Manager
- The event manager is a "session layer" handling which thread
should be the active thread. A thread can either be an internal libwww
thread or an external thread, for example a Posix thread, and the
event manager can itself be either the internal Library manager or an
external event manager. Currently the internal event manager uses a
select function call to decide which thread should be made the active
one, however an external event manager can use another decision
model. One of the design ideas behind the event manager is that it can
be extended to a full session layer manager handling for example the
control of an HTTP-NG connection. The event manager is described
together with the internal thread model in the section "Libwww Threads".
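The dynamic registration of protocol modules mentioned under the Protocol Manager above can be sketched as a small table keyed by access scheme. The names Protocol_register, Protocol_find, load_http and load_ftp are hypothetical; the actual registration interface is the one described in the User's Guide. The sketch compares schemes case-sensitively for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PROTOCOLS 16

/* Hypothetical protocol module entry point: returns a status code. */
typedef int ProtocolLoad(const char *uri);

typedef struct {
    const char   *scheme;   /* e.g. "http" */
    ProtocolLoad *load;
} Protocol;

static Protocol protocols[MAX_PROTOCOLS];
static size_t   protocol_count;

/* Register (or replace) the module that handles an access scheme. */
int Protocol_register(const char *scheme, ProtocolLoad *load)
{
    size_t i;
    assert(scheme != NULL && load != NULL);
    for (i = 0; i < protocol_count; i++)
        if (strcmp(protocols[i].scheme, scheme) == 0) {
            protocols[i].load = load;     /* override existing module */
            return 0;
        }
    if (protocol_count == MAX_PROTOCOLS)
        return -1;
    protocols[protocol_count].scheme = scheme;
    protocols[protocol_count].load = load;
    protocol_count++;
    return 0;
}

/* Find the module for the scheme in front of the first ':' of a URI. */
ProtocolLoad *Protocol_find(const char *uri)
{
    const char *colon = strchr(uri, ':');
    size_t i, len;
    if (colon == NULL)
        return NULL;
    len = (size_t) (colon - uri);
    for (i = 0; i < protocol_count; i++)
        if (strlen(protocols[i].scheme) == len &&
            strncmp(protocols[i].scheme, uri, len) == 0)
            return protocols[i].load;
    return NULL;
}

/* Two toy modules used for illustration only. */
int load_http(const char *uri) { (void) uri; return 200; }
int load_ftp(const char *uri)  { (void) uri; return 226; }
```

Because the table is filled at run time, an application can add a new scheme or replace a default module without touching the code that dispatches requests.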
Core Objects and Managers
The central data structures are the structures that are a part of the
core entity. Each of the core modules explained in section "Control and Data Flow" relies on one
or more of the central data structures. This section describes the
relationship between the core modules and the central data structures
and the relationship between the central data structures
themselves.
The figure below is very similar to the one in section "Control and Data Flow", but it also
introduces the set of central data structures as boxes that represent
the main structures connected to the corresponding core modules. This
does not mean that these are the only existing relations, but it can
be used as an indication.

HTRequest
- The HTRequest structure contains information necessary to
handle a request issued by the application. It contains information
about the method to be used (for example "GET" and "PUT"), user
preferences (language, content type etc.) specific for this request,
where the output of the data object should go etc. The HTRequest
structure ties together the other structures used by the core modules
in order to handle the request. It is intended to live until the
request reaches a final state, either success or failure, after which
it can be discarded.
Normally, the HTRequest structure is created by the application, but
the Library is capable of creating HTRequest structures on its own
under certain circumstances. An example is when the Library creates a
"Post Web" as explained in section "Building a
POST Web, an API for PUT and POST".
HTAnchor
- Anchors represent any data objects which may be the sources or
destinations of hypertext links. The HTAnchor structure contains all
information about the object, whether it has been loaded,
metainformation like language, media type etc., and any relations to
other objects. The Library defines two anchor classes: a parent anchor
and a child anchor. The former contains information about whole data
objects and the latter contains information about subparts of a data object. The
HTAnchor structure is a generic superclass of both parent anchors and
child anchors. Section "Anchor Objects"
describes anchors and their relations in more detail.
HTNetInfo
- HTNetInfo is a network interface specific structure that
contains all information required to read and write from the
network. It contains the current socket descriptor (or ANSI C file
descriptor) used for reading and writing, which input buffer to use
and where to put the data once they are read. It also contains timing
information on how long it takes to connect to a remote host and how
many times it has tried to connect. This information is used by the DNS Cache in order to optimize access on
multi homed hosts.
The HTNetInfo structure is also a key structure in the libwww thread
model where a thread is identified by this structure. The libwww
thread model is explained in "Description of
libwww Threads".
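One way the connect timing kept in HTNetInfo could help on multi-homed hosts is sketched below. The smoothing rule, the HostEntry structure and the function names are assumptions made for illustration; the text does not specify how the DNS Cache weighs its measurements.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HOMES 8

/* Hypothetical DNS cache entry for a host with several IP addresses. */
typedef struct {
    size_t homes;                  /* number of addresses in use */
    long   weight_ms[MAX_HOMES];   /* smoothed connect time per address */
} HostEntry;

/* Fold a measured connect time into the weight of one address.
   New weight = 3/4 old + 1/4 sample (an assumed smoothing rule). */
void Host_update(HostEntry *host, size_t home, long connect_ms)
{
    assert(host != NULL && home < host->homes);
    host->weight_ms[home] = (3 * host->weight_ms[home] + connect_ms) / 4;
}

/* Pick the address that has been fastest to connect to so far. */
size_t Host_best(const HostEntry *host)
{
    size_t i, best = 0;
    assert(host != NULL && host->homes > 0);
    for (i = 1; i < host->homes; i++)
        if (host->weight_ms[i] < host->weight_ms[best])
            best = i;
    return best;
}
```

After each connection attempt the measured time is folded into the weight for that address, and the next request tries the address with the lowest weight first.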
HTCache
- The HTCache structure contains metainformation about
every cached object, such as the number of times it has been requested from
the cache, the content type, the size, and how long it took to obtain
the data from the network. As the cache manager is yet to be fully
specified this structure is likely to change in the near future.
HyperDoc
- The HyperDoc structure is different from the other
central data structures as it is only declared in the Library - the
definition is left to the application. It is intended to contain
information about data objects, especially hypertext objects that are
to be presented to a user. As an example of a definition, you can look
at the Line Mode Browser
where it is defined in the GridText
Module. Here it is called "_HText" structure and it contains all
information needed to present and manage a data object in a text based
environment.
The memory management of the HyperDoc object is also left to the
application along with the definition. The Library does not use any
information from the object at all - the only interaction is that the
access manager checks if a HyperDoc object exists for a given anchor
or not as a part of servicing a request. The application can use this
to maintain a set of HyperDoc objects in memory as a fast
cache. Again, the Line Mode Browser can be used as an example as it
keeps the 5 latest accessed hypertext objects in memory (regardless of
their size) in order to allow fast back track for the user. The
relation to the HTAnchor object requires that there is a link from the
HyperDoc to the corresponding anchor in order for the application to
do proper garbage collection of the HyperDoc objects.
Even though the Library does not interfere with the contents of the
HyperDoc object it does provide an API for managing the object. This
API is known as the "HText" API and it is described further in the User's Guide.
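The split between declaration and definition described above for HyperDoc can be sketched with an incomplete structure type. The Anchor structure and all function names below are hypothetical; only the idea that the type is declared on the Library side and defined by the application comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* --- "Library" side: the type is declared but never defined here. --- */
typedef struct HyperDoc HyperDoc;

typedef struct Anchor {
    const char *address;
    HyperDoc   *document;   /* NULL until the application attaches one */
} Anchor;

/* The only question the library asks: is a document attached? */
int Anchor_has_document(const Anchor *anchor)
{
    assert(anchor != NULL);
    return anchor->document != NULL;
}

/* --- "Application" side: supplies the definition and the memory. --- */
struct HyperDoc {
    Anchor     *anchor;     /* link back, needed for garbage collection */
    const char *text;
};

void HyperDoc_attach(HyperDoc *doc, Anchor *anchor, const char *text)
{
    doc->anchor = anchor;
    doc->text = text;
    anchor->document = doc;
}

/* Discarding a document clears the anchor's pointer via the back link. */
void HyperDoc_detach(HyperDoc *doc)
{
    if (doc->anchor != NULL)
        doc->anchor->document = NULL;
    doc->anchor = NULL;
}
```

Code compiled against only the declaration can store and compare HyperDoc pointers but cannot look inside the object, which matches the statement that the Library does not use any information from it.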
HTErrorInfo
- The HTErrorInfo contains information about errors that occurred
in the protocol manager. Each request (in the form of an HTRequest
structure) has an error stack which is a linked list of HTErrorInfo
structures. The HTErrorInfo structure contains an error number that
refers to a list of error messages, the severity of the error, any
parameters registered together with the error, and if this specific
error should be ignored by the application or not - independently of
the severity. A parameter can for example be a file name causing the
error.
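An error stack of the kind described above can be sketched as a singly linked list with a severity filter. The ErrorInfo structure, the Severity values and the function names are hypothetical and only mirror the fields named in the text (error number, severity, parameter, ignore flag).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { ERR_WARNING = 1, ERR_NON_FATAL = 2, ERR_FATAL = 3 } Severity;

/* Hypothetical error record; the newest error sits on top of the stack. */
typedef struct ErrorInfo {
    int               number;     /* index into a table of messages */
    Severity          severity;
    const char       *parameter;  /* e.g. a file name, may be NULL */
    int               ignore;     /* non-zero: never show this error */
    struct ErrorInfo *next;
} ErrorInfo;

/* Push a new error on the stack; returns 0 on success, -1 on failure. */
int Error_push(ErrorInfo **stack, int number, Severity severity,
               const char *parameter)
{
    ErrorInfo *e;
    assert(stack != NULL);
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->number = number;
    e->severity = severity;
    e->parameter = parameter;
    e->ignore = 0;
    e->next = *stack;
    *stack = e;
    return 0;
}

/* Count the errors that should reach the user at a given threshold. */
size_t Error_count_shown(const ErrorInfo *top, Severity threshold)
{
    size_t n = 0;
    for (; top != NULL; top = top->next)
        if (!top->ignore && top->severity >= threshold)
            n++;
    return n;
}

/* Release the whole stack when the request is discarded. */
void Error_free(ErrorInfo **stack)
{
    while (*stack != NULL) {
        ErrorInfo *next = (*stack)->next;
        free(*stack);
        *stack = next;
    }
}
```

The ignore flag is checked independently of the severity, so a fatal error can still be hidden from the user if the application marks it as ignored.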
HTStream
- The stream
structure is an object which accepts sequences of characters. It is a
destination of data which can be thought of much like an output stream
in C++ or an ANSI C file stream for writing data to a disk or another
peripheral device. The broad definition makes streams very flexible
and they are used as the main method to transport data from the
application to the network and vice versa. The Library defines two
stream classes: a generic stream class and a specialized stream class
for structured data using SGML lexical tokens. The contents of the two
classes are described in detail in section "Stream Objects".
The following figure illustrates the relations between the central
data structures themselves. As before there might be other relations
between the structures, but these are the main relations.

- When an application issues a request the access manager binds the
anchor corresponding to a URL together with a request object. The
binding exists until the request reaches a final state after which the
application can discard the request object. Normally the anchor object
stays in memory during the whole life time of the application as the
set of anchors represent the part of the Web that the application has
been in touch with including metainformation etc.
- The application can make a binding between the request object and
the desired destination for the data when it arrives, typically from
the network. The request object is by default bound to a presentation stream which
presents a hypertext object to the user on the screen, but it can also
be written to a file, represented as source text etc.
- If the file cache is enabled a cache object is created and linked
to the anchor object by the cache manager so that the access manager
on any future requests can use the cached version (if not stale). As
mentioned, the cache manager is yet to be fully designed, and the
current approach may change.
- If the data object is not found in the cache or in memory the
protocol manager is called by the access manager. The protocol manager
then executes a specific protocol module which creates a netinfo
object and binds it to the request object. The netinfo object is
maintained uniquely by the protocol module and is removed by the
protocol module as soon as the communication with the remote server
reaches a final state.
- The request object also has a link to any error information
related to it. At the end of the request this information is handled
by the error manager and an error message may be generated and passed
to the user.
- When data starts arriving, typically from the network, it is
directed down the stream chain, which can either already exist or be
created as data arrives (stream chains are described in the section
"Stream Objects"). In the case where the
application is transmitting a data object to a remote server, there
are two stream chains directed in opposite directions: one from the
application to the network and one from the network to the
application.
- The end of the stream chain is the stream that the user may have
defined when the request first was issued or it can be the default
destination which is presenting the information on the screen. Between
the first and the last stream in the stream chain there can be any
number of other stream objects performing operations either directly
on the data, or on the stream flow itself. A T-stream is an example of
the latter where the stream flow is divided into two.
- The application receives the data arriving from the network via
the "HText" object (or any of the other stream interfaces as explained
in section The HTML Parser in
the User's Guide).
- The HyperDoc object must have a link to the HTAnchor
object so that it can be verified whether the anchor has a data object
attached to it or not. The HyperDoc may have a link to the request
structure but this is not required.
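The T-stream mentioned in the list above, where the stream flow is divided into two, can be sketched as follows. The Stream structure and the function names are hypothetical; the counting stream merely stands in for a real destination such as the screen or a file.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Stream Stream;
struct Stream {
    int    (*put_block)(Stream *me, const char *buf, size_t len);
    Stream *out1;     /* first branch (unused by the counter) */
    Stream *out2;     /* second branch (unused by the counter) */
    size_t  seen;     /* bytes received (used by the counter) */
};

/* Leaf stream: counts the bytes it receives. */
int Counter_put_block(Stream *me, const char *buf, size_t len)
{
    (void) buf;
    me->seen += len;
    return 0;
}

/* T-stream: forwards every block unchanged to both branches. */
int Tee_put_block(Stream *me, const char *buf, size_t len)
{
    int r1, r2;
    assert(me->out1 != NULL && me->out2 != NULL);
    r1 = me->out1->put_block(me->out1, buf, len);
    r2 = me->out2->put_block(me->out2, buf, len);
    return (r1 != 0 || r2 != 0) ? -1 : 0;
}
```

The T-stream does not touch the data at all. It operates on the stream flow itself, so either branch can be extended into a longer chain without the other branch noticing.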
The Anchor Object
Anchors represent any references to data objects which may be the
sources or destinations of hypertext links. This section contains a
general description of the model used to bind anchors together in an
internal representation in the W3C Reference
Library.
The anchors are organized into a sub-web which represents the
part of the web that the application (often the user) has been in
touch with. In this sub-web, any anchor can be the source of zero,
one, or many links and it may be the destination of zero, one, or many
links. That is, any anchor can point to and be pointed to by any
number of links. Having an anchor being the source of many links is
often used in the POST method, where for example the same data object
is to be posted to a News group, a mailing list and an HTTP
server. This is explained in the section "Building a POST Web, an API for PUT and
POST".
Every data object has an anchor associated with it. Anchors exist
throughout the lifetime of the application, but as this generally is
not the case for data objects, it is possible to have an anchor
without a data object. If the data object is stored in the file cache
or in memory, the parent anchor contains a link to it so the
application can access it either directly or through the file cache manager. There are two
types of anchors in the Library:
- parent anchors
- Represent whole data objects. That is, the destination of a link
pointing to a parent anchor is the full contents of the data object.
Parent anchors are used to store all information about a data object,
for example the content type, language, and length.
- child anchors
- Represent subparts of a data object. A subpart is declared by
making a NAME tag in the anchor declaration and a child anchor is the
destination of a link if the HREF link declaration contains a "#" and
a tag appended to the URI. Child anchors do not contain any
information about the data object itself. They only keep a handle (or
a "tag") pointing into the data object kept by the corresponding
parent anchor.
Both types of anchors are subclasses of a generic anchor
class which defines a set of outgoing links to where the anchor
points. Every parent anchor points to an address which may or may not
exist. In the case of posting an anchor to a remote server, the
address pointed to is yet to be created. The client can assign an
address for the object but it might be overridden (or completely
denied) by the remote server. The relationship between parent anchors
and child anchors is illustrated in the figure.

- Each parent anchor keeps a list of its children, which is used to avoid
having multiple instances of the same child and in the garbage
collection of anchors.
- All child anchors have a pointer to their parent as only the
parent anchors keep information about the data object itself. Parent
anchors simply have a pointer to themselves.
- Every parent anchor has an address which is a URL pointing to a
resource that may or may not exist.
- Parents can have a data object associated using the HyperDoc
structure. In this case anchors B and C have a data
object but A does not, which can either be because the anchor has
not yet been requested or the data object has been discarded from
memory by the application.
- Any anchor can have any number of links pointing to a set of
destinations. In most situations there is only one destination, but
multiple destinations are typical when posting data objects to a remote
server.
- This anchor has two destinations. By default the main destination
will be the one selected.
- Each parent anchor keeps a list of other anchors pointing to it. This
information is required if a single parent anchor (and its children)
is removed from the sub-web.
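The parent and child relations listed above can be sketched with three small structures. All names are hypothetical. The sketch shows a parent pointing to itself, children pointing to their parent, and a lookup that creates a child only when no child with the same tag exists yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct ParentAnchor ParentAnchor;

/* Generic anchor "superclass": every anchor knows its parent. */
typedef struct Anchor {
    ParentAnchor *parent;
} Anchor;

typedef struct ChildAnchor {
    Anchor              super;
    char               *tag;     /* the part after '#' */
    struct ChildAnchor *next;
} ChildAnchor;

struct ParentAnchor {
    Anchor       super;          /* super.parent points to itself */
    const char  *address;
    ChildAnchor *children;
};

void Parent_init(ParentAnchor *p, const char *address)
{
    p->super.parent = p;
    p->address = address;
    p->children = NULL;
}

/* Return the child with this tag, creating it only if it is new, so
   that the same child never exists twice. */
ChildAnchor *Parent_find_child(ParentAnchor *p, const char *tag)
{
    ChildAnchor *c;
    size_t len;
    assert(p != NULL && tag != NULL);
    for (c = p->children; c != NULL; c = c->next)
        if (strcmp(c->tag, tag) == 0)
            return c;
    c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    len = strlen(tag) + 1;
    c->tag = malloc(len);
    if (c->tag == NULL) {
        free(c);
        return NULL;
    }
    memcpy(c->tag, tag, len);
    c->super.parent = p;
    c->next = p->children;
    p->children = c;
    return c;
}

/* The children list is also what garbage collection walks. */
void Parent_free_children(ParentAnchor *p)
{
    while (p->children != NULL) {
        ChildAnchor *next = p->children->next;
        free(p->children->tag);
        free(p->children);
        p->children = next;
    }
}
```

Links between anchors and the list of incoming references are left out to keep the sketch short.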
Protocol Manager
Under
construction. Any suggestions or ideas are welcome at
libwww@w3.org.
Libwww Threads and Net Objects
In a single-process, single-threaded environment every request to, for
example, the I/O interface blocks any further progress in the process.
Any combination of a multiprocess or multi-threaded implementation of
the Library makes provision for the application to request several
independent documents at the same time without getting blocked by slow
I/O operations. As a Web application is expected to spend much of its
time doing I/O such as "connect" and "read", a high degree of
optimization can be obtained if multiple threads can run at the same
time.
Library version 3.0 was designed to be thread compatible. It can
either be used with conventional threads or with the "libwww thread" concept which
allows an application to handle requests in a constrained asynchronous
manner using non-blocking I/O and an event loop based on a select
system call. As a result, I/O operations such as establishment of a
TCP connection to a remote server and reading from the network can be
handled without letting the user wait until the operation has
terminated. Instead the user can issue new requests, interrupt ongoing
requests, scroll a document etc.
Version 3.1 of the Library has an enhanced libwww thread model as it
supports writing large amounts of data from the application to the
network, also using non-blocking I/O operations. The main purpose of
Library 3.1 was to provide basic support for remote collaborative
work through the HTTP methods
PUT and POST.
As libwww threads are not really threads but a notion of using
non-blocking I/O for accessing data objects from the network (or local
file system), it can be used on any number of platforms with or
without native support for threads, and this section describes the
model behind libwww threads and how it affects applications.
The Net object
The Net object contains all the state information required to
stop and resume the execution of a request using asynchronous I/O. The use of
asynchronous I/O has an important implication for the implementation of
the access modules in the Library, for example the HTTP module which
is explained later:
- Global variables can be used only if they are at all times
independent of the current state of the active Net object.
- Automatic variables can be used only if they are initialized on
every entry to the function and stay state independent of the current
request throughout their lifetime.
- All information necessary for completing a request must be kept in
an autonomous data object that is passed around via the stack.
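The three rules above amount to writing each access module as a resumable state machine. The following sketch assumes a hypothetical Net structure and simulates the network with a counter; it is not the Library's HTTP module. All state lives in the object that is passed in, so the function can return whenever it would block and be called again later.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NET_BEGIN, NET_CONNECTING, NET_SENT, NET_DONE } NetState;

/* Hypothetical Net object: everything needed to resume lives here. */
typedef struct {
    NetState state;
    int      polls_left;   /* simulated: calls until "connect" completes */
    size_t   received;     /* bytes read so far */
} Net;

#define NET_WOULD_BLOCK 1
#define NET_OK          0

/* Called once per event. Returns NET_WOULD_BLOCK when it has to wait;
   the caller simply calls again later with the same Net object. */
int Net_step(Net *net)
{
    assert(net != NULL);
    switch (net->state) {
    case NET_BEGIN:
        net->state = NET_CONNECTING;
        /* fall through */
    case NET_CONNECTING:
        if (net->polls_left > 0) {
            net->polls_left--;
            return NET_WOULD_BLOCK;       /* give control back */
        }
        net->state = NET_SENT;
        return NET_WOULD_BLOCK;           /* wait for the response */
    case NET_SENT:
        net->received += 12;              /* simulated read */
        net->state = NET_DONE;
        return NET_OK;
    case NET_DONE:
        return NET_OK;
    }
    return NET_OK;
}

/* Test driver only: steps one request to completion and returns the
   number of steps taken. A real event loop would interleave requests. */
int Net_run(Net *net)
{
    int steps = 0;
    do {
        steps++;
    } while (Net_step(net) == NET_WOULD_BLOCK);
    return steps;
}
```

Net_step uses no global variables and no automatic variables that survive between calls, which is what allows several Net objects to be stepped in any interleaving.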

The main reason for keeping the Net object separate from the Request
object is that some requests require more than one Net object, for
example FTP which has a Net object for the control TCP connection and a
Net object for each data TCP connection. In the case of HTTP/1.0 and
HTTP/1.1, there is a 1:1 correspondence between a Net object and a
request object. In HTTP/1.2 a Net object can live longer than a single
request as persistent connections might handle a set of requests over
the same TCP socket. Net objects can be used in three different ways:
- All requests are preemptive and all I/O is blocking
- Requests are non-preemptive, managed by an internal event loop
- Requests are non-preemptive, managed by an external event loop
The three modes are described in more detail in the section on Internal and External Events. In mode 2) the Net
object is used to make the binding between the socket based internal
event loop (using a select() call) and a request, so that
a socket ready for an I/O action can make the corresponding libwww
thread active. In modes 1) and 3) the Net object represents the socket
interface of a request.
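The binding used in mode 2) can be sketched with the POSIX select() call. The DemoNet structure, its resume callback and demo_dispatch() are assumptions made for this example, not the Library's own event code: each object ties a descriptor to the action that makes its request active again.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical binding between a socket and the request it serves. */
typedef struct DemoNet DemoNet;
struct DemoNet {
    int    fd;                      /* descriptor watched for reading */
    size_t bytes_seen;              /* bookkeeping for the example    */
    void (*resume)(DemoNet *net);   /* makes the request active again */
};

/* Example callback: consume whatever is readable right now. */
void demo_drain(DemoNet *net)
{
    char buf[64];
    ssize_t n = read(net->fd, buf, sizeof buf);
    if (n > 0) net->bytes_seen += (size_t)n;
}

/* Wait up to timeout_ms for any registered descriptor to become
 * readable and resume the matching requests.
 * Returns the number of requests resumed, or -1 if select() fails. */
int demo_dispatch(DemoNet *nets, size_t count, long timeout_ms)
{
    fd_set readable;
    struct timeval tv;
    int maxfd = -1, ready, resumed = 0;
    size_t i;

    assert(nets != NULL || count == 0);
    FD_ZERO(&readable);
    for (i = 0; i < count; i++) {
        assert(nets[i].fd >= 0 && nets[i].fd < FD_SETSIZE);
        FD_SET(nets[i].fd, &readable);
        if (nets[i].fd > maxfd) maxfd = nets[i].fd;
    }
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    ready = select(maxfd + 1, &readable, NULL, NULL, &tv);
    if (ready < 0) return -1;

    for (i = 0; i < count; i++) {
        if (FD_ISSET(nets[i].fd, &readable)) {
            nets[i].resume(&nets[i]);
            resumed++;
        }
    }
    return resumed;
}
```

The timeout keeps the loop from sleeping forever when no descriptor becomes ready.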
Families and Groups
Net objects are often related to other Net objects and hence it is useful to
introduce a mechanism by which related Net objects can be treated
together. The Library provides two ways of grouping Net objects:
- Family
- A Family is a set of Net objects that are logically
related to each other, for example an HTML document with a set of
inlined objects (images, video, audio etc). On a GUI client a family
can be regarded as a set of objects to be displayed in the same
widget. Members of a family do not have to come from the same origin
server, even though this is often the case for inlined images;
what is important is the logical connection between them. There are
two situations in particular that concern a whole family:
- The user hits the interrupt button in a widget and all the threads
in the family are to be killed.
- Each family can be assigned a priority that allows
background families to be loaded with a lower priority than the
foreground family.
Families are typically created by the application, as the logical
relation often is known at the application level only. However, the
Library might add a Net object to a family for example in the case of
FTP where multiple Net objects are related to the same request.
- Group
- A Group is a set of Net objects that all are requesting a
(possibly different) resource from the same server. The Net objects do
not have to be logically related - they can belong to different
families but they have one thing in common: they are all pointing to
the same remote server. A group is used to support persistent
connection management which is the case in HTTP/1.2 and HTTP-NG. In
the case of a proxy server, all requests to that proxy server will
belong to the same group.
Groups also serve to manage the number of open TCP connections an
application uses at any one point in time in order not to overload
either the application or the network. Groups are typically created by
the Library as the final destination for a request is often known at
the Library level only.
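As a rough illustration of the two groupings, the sketch below tags each request with a family number and a host name. The DemoNet record and both functions are hypothetical and only show how a family-wide interrupt and a per-server count could be expressed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record; the Library's real Net object differs. */
typedef struct {
    int         family;   /* logical owner, e.g. one document window */
    const char *host;     /* group key: the server being contacted  */
    int         killed;   /* set when the request is interrupted    */
} DemoNet;

/* Interrupt every live request in one family; returns how many. */
size_t demo_kill_family(DemoNet *nets, size_t count, int family)
{
    size_t i, hit = 0;
    assert(nets != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (nets[i].family == family && !nets[i].killed) {
            nets[i].killed = 1;
            hit++;
        }
    }
    return hit;
}

/* Count live requests in the group for `host`, for example to enforce
 * a limit on open connections to that server. */
size_t demo_group_size(const DemoNet *nets, size_t count, const char *host)
{
    size_t i, n = 0;
    assert(host != NULL && (nets != NULL || count == 0));
    for (i = 0; i < count; i++)
        if (!nets[i].killed && strcmp(nets[i].host, host) == 0)
            n++;
    return n;
}
```

Note that the two keys are independent: killing family 1 leaves other families untouched even when they share a server.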
Creation and Termination of a Net object
A Net object is created by the Net manager from within the Request
manager every time a request is passed to the Library. A request can
either be issued by the application or the Library itself for example
as a result of redirection, access authentication, or when a new data
connection is created in an FTP request. All new Net objects are
automatically associated with a group which might already exist or be
created together with the new Net object.
When a Net object has been created, the Request manager returns
immediately to the caller and does not see the request before it has
terminated either with a success or an error as result. The request
can either be started immediately by the Net manager or put into a
queue if the maximum number of open TCP connections is reached. When a
request is terminated there is typically a set of tasks that the
application would like to perform:
- Update the history list
- Report the result to the log manager
- Update the display
- etc.
Handling the termination of a request is based on call back functions
that can be registered in the Net manager dynamically at run
time. Multiple call back functions can be registered in which case
they are all called from the Net manager in the sequence they were
registered. As an example, the Request manager registers a call back
function to handle the status of the request with regard to some
internal actions. This function is registered at initialization time
of the Library. The application can add its own call back functions to
be called on termination of a request.
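A minimal sketch of such a registry follows. The DemoAfterList type, the callback signature and the two sample handlers are assumptions for illustration and do not reproduce the Net manager's interface; the point is only that handlers run in the order in which they were registered.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_CALLBACKS 8

/* Hypothetical signature: request status plus caller-supplied data. */
typedef void (*DemoAfterFn)(int status, void *context);

typedef struct {
    DemoAfterFn fn[DEMO_MAX_CALLBACKS];
    void       *ctx[DEMO_MAX_CALLBACKS];
    size_t      count;
} DemoAfterList;

/* Append a callback; returns 0 on success, -1 if the list is full. */
int demo_after_add(DemoAfterList *list, DemoAfterFn fn, void *context)
{
    assert(list != NULL && fn != NULL);
    if (list->count == DEMO_MAX_CALLBACKS) return -1;
    list->fn[list->count] = fn;
    list->ctx[list->count] = context;
    list->count++;
    return 0;
}

/* Invoke every callback in registration order. */
void demo_after_run(const DemoAfterList *list, int status)
{
    size_t i;
    assert(list != NULL);
    for (i = 0; i < list->count; i++)
        list->fn[i](status, list->ctx[i]);
}

/* Sample handlers that append a marker to a trace string (context). */
void demo_update_history(int status, void *context)
{
    (void)status;
    strcat((char *)context, "H");
}

void demo_log_result(int status, void *context)
{
    strcat((char *)context, status == 0 ? "L+" : "L-");
}
```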
Internal and External Events
This section describes what happens when an event arrives at the
Library - either from the application or from the network. The Library
provides three different ways of handling events, and it is necessary
to be aware of these modes in the design phase of an application as
they have an impact on the architecture of the application. The
Library can be used in multiple modes simultaneously and an
application can change mode as a function of the action requested by
the user. The three different modes are described in the
following:
- Base Mode
- In this mode all requests are handled in a preemptive way that does
not allow for any events to pause the execution of a thread or kill
it. This mode is in other words strictly single threaded and the major
difference between this mode and the next two modes is that all
sockets are made blocking instead of non-blocking. This mode can
either be used in forking applications or in threaded applications
using an external thread model where non-blocking I/O is not a
requirement.
- Active Mode
- In this mode the event loop is placed in the Library in the HTEvntrg module. The mode
can either be used by character based applications with a limited
capability of user interaction, or it can be used by more advanced GUI
clients where the window widget allows redirection of user events to
one or more sockets that can be recognized by a select() call. It
should be noted that even though all sockets are non-blocking, the
select() function blocks if no sockets are pending, so if no
actions are pending, the application is put to sleep in the select call.
The HTNet module
contains a thread scheduler which gives highest priority to
redirected user events, which allows smooth operation of GUI
applications with a fast response time. This mode has a major impact
on the design of the application as much of the application code may
find itself within call back functions. As an example, this mode is
currently used by the Arena client and
the Line Mode Browser.
- Passive mode
- This mode is intended for applications where user events can not
be redirected to a socket or there is already an event loop that can
not work together with the event loop in the Library. The major
difference from the Active mode is that instead of using the
event loop defined in the HTEvntrg module, this
module is overridden by the application as described in the "User's Guide". The Passive mode has the
same impact on the application architecture as the Active mode
except for the event loop, as all library interactions with the
application are based on call back functions.
One important limitation in the libwww thread model is that the
behavior is undefined if an external scheduler with a preemptive
scheduling mechanism is used to drive the internal threads in the Library.
The reason for this is that the Library, when using one stack and one
set of registers as in Active mode, is "libwww thread safe" only
when a change of active thread is done as a result of a blocking I/O
operation. However, using an external thread model, this problem does
not exist.
Providing Call Back Functions
The thread model in the Library is foreseen to work with native thread
interfaces but can also be used in a non-threaded environment. In the
latter case, the Library handles the creation and termination of its
internal threads without any interaction required by the
application. The thread model is based on call back functions of which
at least one user event handler and an event terminator must be
supplied by the application. However, the application is free to
register as many additional user event handlers as it wants.
The dashed lines from the event loop to some of the access modules
symbolize that the access method is not yet implemented using
non-blocking I/O, but the event loop is still a part of the
call-stack. In this situation the Library will automatically use
blocking sockets which is equivalent to the Base Mode.
- User Event Handlers
- An application can register a set of user event handlers to handle
events on sockets defined by the application to contain actions taken
by the user. This can for example be interrupting a request, starting a
new request, or scrolling a page. However, this requires that the actual
window manager supports redirection of events to sockets.
- Event Termination
- This function is called from the Library every time a request is
terminated. It passes the result of the request so that the
application can update the history list etc. depending on the
result. From the Library's point of view there is little difference
between a user event handler and this function, as it in both cases is
a call back function.
- Timeout Handler
- In Active mode, the select() function in the Library
event loop is blocking even though the sockets are non-blocking. This
means that if no actions are pending on any of the registered sockets
then the application will block in the select() call. However, in
order to avoid sockets hanging around forever, a timeout is provided
so that hanging threads can be terminated.
Often an event handler needs to return information about a change of
state as a result of an action executed by the handler, for example if
a new request is issued, an ongoing request is interrupted, the
application is to be terminated etc. This information can be handed back
to the Library using the return values of the call back function.
There are several situations where a thread has to be killed before it
has terminated normally. This can either be done internally by the
Library or the application. The application indicates that a thread is
to be interrupted, for example if the user has requested the operation
to stop, by using a specific return value from one of the user event
handlers. The Library then kills the thread immediately and the result
is returned to the application.
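How a return value can steer the Library is sketched below. The return codes, the DemoRequest record and the key bindings are invented for this example; the Library defines its own set of values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes for a user event handler. */
typedef enum {
    DEMO_EVENT_OK,          /* nothing for the Library to do       */
    DEMO_EVENT_INTERRUPT,   /* kill the request tied to this event */
    DEMO_EVENT_TERMINATE    /* leave the event loop altogether     */
} DemoEventResult;

typedef struct {
    int active;        /* request still running             */
    int interrupted;   /* set when killed before completion */
} DemoRequest;

typedef DemoEventResult (*DemoUserHandler)(int key);

/* Sample handler: 'z' interrupts, 'q' quits, anything else is ignored. */
DemoEventResult demo_key_handler(int key)
{
    if (key == 'z') return DEMO_EVENT_INTERRUPT;
    if (key == 'q') return DEMO_EVENT_TERMINATE;
    return DEMO_EVENT_OK;
}

/* Pass one user event to the handler and act on its return value.
 * Returns 1 if the event loop should keep running, 0 if it should stop. */
int demo_handle_event(DemoRequest *req, DemoUserHandler handler, int key)
{
    assert(req != NULL && handler != NULL);
    switch (handler(key)) {
    case DEMO_EVENT_INTERRUPT:
        if (req->active) {
            req->active = 0;        /* thread is killed immediately */
            req->interrupted = 1;   /* result reported to the caller */
        }
        return 1;
    case DEMO_EVENT_TERMINATE:
        return 0;
    case DEMO_EVENT_OK:
    default:
        return 1;
    }
}
```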
The Cache Manager
Caching is a required part of any efficient Internet access
application as it saves bandwidth and improves access performance
significantly in almost all types of accesses. This section describes
the architecture behind the cache management in the Library. The cache
management is intended to be used both as a proxy cache and a client
cache or simply as a cache relay. It does not include the interaction
between an application and a proxy server as this is regarded as an
external access and hence outside the scope of the local cache. The
basic structure of the cache is illustrated in the figure below.

The figure describes the cache hierarchy starting from left to right;
it does not describe the data flow. Any of the three cache handlers
can be left out in which case a cache request will fall through to the
next handler in the hierarchy and finally be passed to the protocol
manager which issues a request to either the origin server, a proxy
server, or a gateway. Any of the handlers can also be short circuited
by using a set of cache directives which are explained in the User's Guide. In the following, each
part will be described in more detail.
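The fall-through behavior can be sketched as a chain of handlers that is tried in order. The handler type, the three toy caches and the example URLs are hypothetical; a NULL slot stands for a cache handler that has been left out, and the final string stands in for a request issued by the protocol manager.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A handler returns the cached body, or NULL to fall through. */
typedef const char *(*DemoCacheHandler)(const char *url);

/* Toy handlers standing in for the memory, private file and shared
 * file caches; each one knows exactly one URL. */
const char *demo_memory_cache(const char *url)
{
    return strcmp(url, "http://a.example/") == 0 ? "from memory" : NULL;
}

const char *demo_private_cache(const char *url)
{
    return strcmp(url, "http://b.example/") == 0 ? "from private file" : NULL;
}

const char *demo_shared_cache(const char *url)
{
    return strcmp(url, "http://c.example/") == 0 ? "from shared file" : NULL;
}

/* Try each handler in order, skipping slots that are left out. If no
 * cache answers, the request goes to the network (simulated here). */
const char *demo_lookup(const DemoCacheHandler *chain, size_t count,
                        const char *url)
{
    size_t i;
    assert(url != NULL && (chain != NULL || count == 0));
    for (i = 0; i < count; i++) {
        if (chain[i] != NULL) {
            const char *hit = chain[i](url);
            if (hit != NULL) return hit;
        }
    }
    return "from network";
}
```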
The memory cache is completely handled by the application and is only
consulted by the Library when servicing a request. It is considered
private to a specific instance of an application and is not intended
to be shared between instances. Handling the memory cache includes the
following tasks: object storage, garbage collection, and object
retrieval. The application can initiate a memory cache handler by
registering a call back function that is called from within the
Library on each request. The details of this registration are described
in the User's Guide.
Traditionally, the memory cache is based on handling the graphic
objects described by the HyperDoc structure in memory as
the user keeps requesting new documents. The HyperDoc
structure is only declared in the Library - the real definition is
left to the application as it is for the application to handle graphic
objects. For example, the Line Mode Browser has its own definition of
the HyperDoc structure called HText which describes a fully parsed HTML object
with enough information to display itself to the user. However, the
memory cache handler can handle other objects than HTML, for example
images, audio clips etc. It is important to note that the Library does
not imply any limitations on the usage of the memory cache.
The memory cache must define its own garbage collection algorithm
which can be based on available memory etc. Again, the Line Mode Browser has a very simple
memory management of how long objects stay around in memory. It is
determined by a constant in the GridText
module and is by default set to 5 documents. This approach can be much
more advanced and the memory garbage collection can be determined by
the size of the graphic objects, when they expire etc. but the API is
the same no matter how the garbage collector is implemented.
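A very simple policy of the kind described, keeping only the most recent documents, might look like the sketch below. DemoMemCache and its functions are assumptions for illustration and are not taken from the GridText module; only the limit of 5 comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_DOCS 5   /* mirrors the default of 5 documents */

typedef struct {
    char  *url[DEMO_MAX_DOCS];   /* oldest entry at index 0 */
    size_t count;
} DemoMemCache;

/* Store a document key, discarding the oldest one when full.
 * Returns 0 on success, -1 if memory could not be allocated. */
int demo_memcache_add(DemoMemCache *cache, const char *url)
{
    char *copy;
    assert(cache != NULL && url != NULL);
    copy = malloc(strlen(url) + 1);
    if (copy == NULL) return -1;
    strcpy(copy, url);

    if (cache->count == DEMO_MAX_DOCS) {
        free(cache->url[0]);   /* garbage collect the oldest entry */
        memmove(&cache->url[0], &cache->url[1],
                (DEMO_MAX_DOCS - 1) * sizeof cache->url[0]);
        cache->count--;
    }
    cache->url[cache->count++] = copy;
    return 0;
}

/* Return 1 if the URL is still cached, 0 otherwise. */
int demo_memcache_has(const DemoMemCache *cache, const char *url)
{
    size_t i;
    assert(cache != NULL && url != NULL);
    for (i = 0; i < cache->count; i++)
        if (strcmp(cache->url[i], url) == 0) return 1;
    return 0;
}

/* Release every entry. */
void demo_memcache_clear(DemoMemCache *cache)
{
    size_t i;
    assert(cache != NULL);
    for (i = 0; i < cache->count; i++) free(cache->url[i]);
    cache->count = 0;
}
```

A more advanced collector could weigh object size or expiry time inside demo_memcache_add() without changing the callers.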
Private File Cache
The private file cache is to be regarded as a direct extension of the
memory cache and is intended for intermediate term storage of data
objects. Like the memory cache, it is intended to be private to a single
instance of an application as long as the instance is
running. However, as a file cache is persistent, it can be shared
between several instances of various applications as long as exactly
one instance owns the private cache at any one time. The single
ownership of a private cache means that the cache can be accessed via
the local file system by one instance of an application only.
There are two purposes of the private file cache:
- To maintain a persistent cache for applications that do not have a
shared cache.
- To maintain a private persistent cache for specific groups of
documents that are not to be shared among other applications. Examples
of such are documents with an HTTP header Pragma: Private
which will be introduced in HTTP/1.1.
Often an important difference between the memory cache and the file
cache is the format of the data. As mentioned above, in the memory
cache, the cached objects can be pre-parsed objects ready to be
displayed to the user. In a file cache the data objects are always
stored along with their metainformation so that important header
information like Expires, Last-Modified, Language etc. is a part of
the stored object together with any unknown metainformation that might
be a part of the object.
Shared File Cache
A shared file cache which can be accessed by several independent
applications requires its own cache manager in order to ensure a
consistent cache and to handle garbage collection. A shared file cache
can in many ways be regarded as similar to a proxy cache as a single
application does not know when a cached object is either discarded or
refreshed in the shared cache area.
If a shared cache manager does exist then the only remaining purpose
of a private file cache is to store explicitly private objects. All
other objects will be stored in the shared cache.
As for the private file cache, the data objects are always stored
along with their metainformation so that any metainformation
associated with an object can be returned to the requesting
application.
Data Transportation using Streams
A stream is an object
which accepts sequences of characters. It is a destination of data
which can be thought of much like an output stream in C++ or an ANSI
C-file stream for writing data to a disk or another peripheral
device. It can be anything that accepts data, for example another
stream, ANSI C-file stream, or even a black hole which absorbs data
without ever sending it out again. Streams are used to transport data
internally in the Library between the application, the network, and
the local file system. Streams can be cascaded into a stream chain by
directing the output of a stream which often is called the sink or
target into another stream. This means that the processing of data
can be done as the total effect of several cascaded streams.
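A minimal sketch of cascading, assuming a hypothetical DemoStream type with a single put_block method: a converter stream upper-cases each byte and passes it to its target, a sink that collects data in a buffer. None of these names are the Library's.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef struct DemoStream DemoStream;
struct DemoStream {
    /* Accept `len` bytes; returns 0 on success, -1 on error. */
    int (*put_block)(DemoStream *me, const char *data, size_t len);
    DemoStream *target;     /* next stream in the chain, or NULL */
    char        buf[64];    /* used by the buffer sink only      */
    size_t      used;
};

/* Sink: collect data in a fixed, NUL-terminated buffer. */
static int sink_put(DemoStream *me, const char *data, size_t len)
{
    if (len > sizeof me->buf - 1 - me->used) return -1;
    memcpy(me->buf + me->used, data, len);
    me->used += len;
    me->buf[me->used] = '\0';
    return 0;
}

/* Converter: upper-case each byte, then hand it to the target. */
static int upper_put(DemoStream *me, const char *data, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        char c = (char)toupper((unsigned char)data[i]);
        if (me->target->put_block(me->target, &c, 1) != 0) return -1;
    }
    return 0;
}

void demo_sink_init(DemoStream *s)
{
    assert(s != NULL);
    s->put_block = sink_put;
    s->target = NULL;
    s->used = 0;
    s->buf[0] = '\0';
}

void demo_upper_init(DemoStream *s, DemoStream *target)
{
    assert(s != NULL && target != NULL);
    s->put_block = upper_put;
    s->target = target;
    s->used = 0;
    s->buf[0] = '\0';
}
```

The writer only sees the first stream; the combined effect of the chain is the conversion followed by the storage.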
From version 3.1 of the Library, streams are both used to transport
data from the application to the network and vice versa which enables
applications to send data objects to the remote server which is a
requirement for doing collaborative work using HTTP as the transport
carrier. The stream-based architecture allows the Library to be event
driven in the sense that data is put down a stream as it gets ready,
for example from the network, and any necessary actions then cascade
off this event. An event can also be data arriving from the
application which would be the case when an application is posting a
data object to a remote server.
The Library has two fundamental stream classes which are described in
the following:
- A generic superclass
- A structured stream subclass
Apart from these classes, many stream modules have their own subclass
definitions of either the generic stream class or the structured
class. These definitions can be found in the individual stream
modules.
The Generic Stream Class
The generic stream class is a superclass of all other streams and it
provides a uniform interface to all stream objects regardless of what
stream sub-class they originate from. The generic stream class is
defined with the following set of methods.
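As a rough illustration of how such a class can be expressed in plain C, the sketch below uses a table of function pointers shared by all streams of one class. The type and method names (DemoStream, put_block, flush, free_stream, abort_stream) are assumptions made for this example and are not the Library's actual declarations. A "black hole" stream serves as the simplest possible instance.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct DemoStream DemoStream;

/* Method table shared by all streams of one class. */
typedef struct {
    const char *name;
    int (*put_block)(DemoStream *me, const char *data, size_t len);
    int (*flush)(DemoStream *me);
    int (*free_stream)(DemoStream *me);
    int (*abort_stream)(DemoStream *me);
} DemoStreamClass;

/* Every stream starts with a pointer to its class. */
struct DemoStream {
    const DemoStreamClass *isa;
    size_t absorbed;   /* black hole bookkeeping: bytes swallowed */
};

static int hole_put_block(DemoStream *me, const char *data, size_t len)
{
    (void)data;
    me->absorbed += len;   /* accept the data and discard it */
    return 0;
}

static int hole_noop(DemoStream *me)
{
    (void)me;
    return 0;
}

static const DemoStreamClass demo_black_hole_class = {
    "BlackHole", hole_put_block, hole_noop, hole_noop, hole_noop
};

/* Initialise a caller-owned black hole stream. */
void demo_black_hole_init(DemoStream *s)
{
    assert(s != NULL);
    s->isa = &demo_black_hole_class;
    s->absorbed = 0;
}

/* Write a C string through the uniform interface, whatever the class. */
int demo_put_string(DemoStream *s, const char *str)
{
    assert(s != NULL && str != NULL);
    return s->isa->put_block(s, str, strlen(str));
}
```

Because callers go through the method table only, demo_put_string() works unchanged for any subclass that fills in the same table.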

The Structured Stream Class
A structured stream is a subclass of a stream, but instead of just
accepting data, it also accepts the SGML "begin element", "end
element", and "put entity". The conversion from a generic stream to a
structured stream is done by the SGML tokenizer which recognizes basic
SGML mark up like "<", ">", entities etc.

A structured stream therefore represents a structured document. The
elements and entities in the stream are referred to by numbers, rather
than strings. A DTD contains the mapping between element names and
numbers, so each instance of a structured stream is associated with a
corresponding DTD. The only DTD which is currently in the Library is
an extended version of the HTML DTD level 1, but work is under way
to update this to comply with the emerging HTML level 3 specification.
As for generic streams, it is not required that the stream
actually has an output - it can for example be a stream writing to a
file where no output is required.
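The idea of referring to elements by number can be sketched as follows. The element numbers, the DemoStructured type and the element-counting stream are hypothetical; in the Library the numbers would come from the DTD associated with the stream.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element numbers that a DTD would assign. */
enum { DEMO_EL_TITLE = 0, DEMO_EL_P = 1, DEMO_EL_COUNT = 2 };

typedef struct DemoStructured DemoStructured;

/* Structured class: the data method plus the SGML-aware methods. */
typedef struct {
    const char *name;
    int (*put_block)(DemoStructured *me, const char *data, size_t len);
    int (*begin_element)(DemoStructured *me, int element_number);
    int (*end_element)(DemoStructured *me, int element_number);
    int (*put_entity)(DemoStructured *me, int entity_number);
} DemoStructuredClass;

struct DemoStructured {
    const DemoStructuredClass *isa;
    int    depth;                   /* currently open elements    */
    size_t opened[DEMO_EL_COUNT];   /* how often each was started */
    size_t text_bytes;              /* characters seen so far     */
};

static int cnt_put_block(DemoStructured *me, const char *data, size_t len)
{
    (void)data;
    me->text_bytes += len;
    return 0;
}

static int cnt_begin(DemoStructured *me, int el)
{
    if (el < 0 || el >= DEMO_EL_COUNT) return -1;   /* not in the DTD */
    me->opened[el]++;
    me->depth++;
    return 0;
}

static int cnt_end(DemoStructured *me, int el)
{
    if (el < 0 || el >= DEMO_EL_COUNT || me->depth == 0) return -1;
    me->depth--;
    return 0;
}

static int cnt_entity(DemoStructured *me, int entity_number)
{
    (void)entity_number;
    me->text_bytes += 1;   /* an entity stands for one character here */
    return 0;
}

static const DemoStructuredClass demo_counter_class = {
    "ElementCounter", cnt_put_block, cnt_begin, cnt_end, cnt_entity
};

/* Initialise a caller-owned stream that merely counts what it sees. */
void demo_counter_init(DemoStructured *s)
{
    assert(s != NULL);
    s->isa = &demo_counter_class;
    s->depth = 0;
    s->opened[DEMO_EL_TITLE] = 0;
    s->opened[DEMO_EL_P] = 0;
    s->text_bytes = 0;
}
```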
Cascaded Streams
Streams are often cascaded into a stream chain, but before explaining
why a stream chain is a flexible construction for data transportation,
let's have a look at what kinds of streams the Library provides. The
stream modules can be divided into groups depending on their behavior:
- Protocol Streams
- Internal streams that parse or generate protocol specific
information to communicate with remote servers.
- Converters
- Streams that can be used to convert data from one media type to
another or create a data object and present it to the user.
- Presenters
- These are streams that save the data to a local file and call an
external program, for example a postscript viewer.
- I/O Streams
- Streams that can write data to a socket or an ANSI C FILE object.
This can be used when redirecting a request to a local file or when
saving a document in the cache.
- Basic Streams
- A set of basic utility streams with no or little internal contents
but required in order to cascade streams.
The first four stream classes often fall into a natural order in a
stream chain which is indicated in the figure below. Here two
typical stream pipes are shown for data flowing from the network to
the application and vice versa:

As a more specific example, the figure below shows how streams are
cascaded when data from a remote HTTP server is handled by the
Library. In this case, the stream chain is built as data arrives at
the Library from the network: the first stream can decide whether it
is a 0.9 or a 1.0 response from the first line in the response; the
HTTP header parser stream can decide the format of the body when the
header part is parsed and so forth. In other situations the stream
chain can be setup before data arrives if the format is known a priori
to the data acquisition.
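The decision taken by the first stream can be sketched as a check on the leading bytes of the response. It relies on the fact that an HTTP/1.0 response begins with a status line starting with "HTTP/", whereas a 0.9 response starts directly with the data object. The function and enum names are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    DEMO_HTTP_UNKNOWN,   /* too few bytes to decide yet           */
    DEMO_HTTP_09,        /* body starts immediately, no status    */
    DEMO_HTTP_1X         /* a status line starting with "HTTP/"   */
} DemoHttpKind;

/* Classify a response from the bytes received so far. */
DemoHttpKind demo_sniff_response(const char *data, size_t len)
{
    static const char prefix[] = "HTTP/";
    const size_t plen = sizeof prefix - 1;

    assert(data != NULL || len == 0);
    if (len == 0) return DEMO_HTTP_UNKNOWN;
    if (len < plen) {
        /* Could still become "HTTP/" once more data arrives. */
        return memcmp(data, prefix, len) == 0 ? DEMO_HTTP_UNKNOWN
                                              : DEMO_HTTP_09;
    }
    return memcmp(data, prefix, plen) == 0 ? DEMO_HTTP_1X : DEMO_HTTP_09;
}
```

Once the kind is known, the appropriate next stream can be put in place: a header parser for a status-line response, or the body handler directly for 0.9.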
The ground symbol symbolizes that all data goes into a black hole
from which nothing is radiated. The two stream outputs going to the
application from each of the converters symbolize that error
information is separated from other data objects. This allows the
application to direct any body part in an error message, for example
from a "401 Unauthorized" HTTP status code, to a separate "debug"
window where it can be displayed without affecting the current
document view.
The Data Format Manager
Under
construction. Any suggestions or ideas are welcome at
libwww@w3.org.
Handling Error Situations
MORE
What is a libwww Request?
Under
construction. Any suggestions or ideas are welcome at
libwww@w3.org.
Protocol Modules as State machines
A part of the libwww thread model is to keep track of the current
state in the communication interface to the network. As an example,
this section describes the current implementation of the HTTP module and how it has
been implemented as a state machine. The HTTP module is based on the
HTTP 1.0 specification but is backwards compatible with the 0.9
version. The major difference from the implementations before
version 3.0 of the Library is that this version is a state machine
based on the state diagram illustrated below. This implementation has
several advantages even though the HTTP protocol is stateless by
nature.

The individual states and the transitions between them are explained
in the following sections.
- BEGIN State
- This state is the idle state or initial state where the HTTP
module awaits a new request passed from the application.
- NEED_CONNECTION State
- The HTTP module is now ready for setting up a connection to the
remote host. The connection is always initiated by a connect
system call. In order to minimize the access to the Domain Name
Server, all host names of previously visited hosts are stored in a local
host cache as explained in section "DNS Cache
and Host Name Canonicalization". The cache handles multi homed
hosts in a special way in that it measures the time it takes to
actually make a connection to one of the IP-addresses. This time is
stored together with the specific IP-address and the host name in the
cache and on the next connection to the same host the IP-address with
the fastest connect time is chosen.
- NEED_REQUEST State
- The HTTP Request is what the
application sends to the remote HTTP server just after the
establishment of the connection. The request consists of a HTTP header
line, a set of HTTP Headers, and possibly a data object to be posted
to the server. The header line has the following format:
<METHOD> <URI> <HTTP-VERSION> CRLF
- SENT_REQUEST State
- When the request is sent the module waits until a response is
given from the server or the connection is timed out in case of an
error situation. As the module does not know whether the remote server
is an HTTP 0.9 server or an HTTP 1.0 server, it must look at the first part of
the response to figure out what version of HTTP is returned. The
reason is that the HTTP protocol 0.9 does not contain a HTTP header
line in the response. It simply starts to send the requested data
object as soon as the GET request is handled.
- NEED_ACCESS_AUTHORIZATION State
- If a 401 Unauthorized status
code is returned the module asks the user for a user id and a
password, see also the "HTTP Basic
Access Authorization Scheme". The connection is closed before the
user is asked for the user-id and password so any new request
initiated upon a 401 status code
causes a new connection to be established. This is done in order to
avoid having the connection hanging around waiting while the
application is waiting for user input.
- REDIRECTION State
- The remote server returns a redirection status code if the URI has either been
moved temporarily or permanently to another location, possibly on
another HTTP server or any other service, for example FTP or
gopher. The HTTP module supports both a temporary and a permanent
redirection code returned from the server:
- 301 Moved
- The load procedure is recursively called on a 301 redirection
code. The new URI is passed back to the user as information via the Error and
Information module, and a new request is generated. The new request
can be of any
access scheme accepted in a URI. An upper limit on the number of redirections
has been defined (10 by default) in order to avoid infinite loops.
- 302 Found
- The functionality is the same as for a 301 Moved return status. A
clever application can use the returned URI to change the document in
which the URI originates so that the URI points to the new location.
- NO_DATA State
- When a return code indicates that no data object or resource
follows the HTTP headers the HTTP module can terminate the request and
pass control back to the application.
- NEED_BODY State
- If a body is included in the response from the server, the module
must prepare to read the data from the network and direct it to the
destination set up by the application. This is done by setting up a
stream stack with the required conversions.
- GOT_DATA State
- When the data object has been parsed through the stream stack, the
HTTP module terminates the request and hands control back to the
application.
- ERROR or FAILURE State
- If at any point in the request handling a fatal error occurs the
request is aborted and the connection closed. All information about
the error is passed back to the application via the Error and
Information Module. As the HTTP protocol is stateless, all errors
between the client and the server are fatal. If the erroneous request
is to be repeated, the request starts in the initial state.
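The states above can be sketched as an enumeration with a transition function. The state names follow the list; the transition rules are a simplified reading of the text, the use of status 204 to reach NO_DATA is an assumption, and none of this is the HTTP module's actual code.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    DEMO_BEGIN, DEMO_NEED_CONNECTION, DEMO_NEED_REQUEST, DEMO_SENT_REQUEST,
    DEMO_NEED_ACCESS_AUTHORIZATION, DEMO_REDIRECTION, DEMO_NO_DATA,
    DEMO_NEED_BODY, DEMO_GOT_DATA, DEMO_ERROR
} DemoHttpState;

#define DEMO_MAX_REDIRECTS 10   /* default limit named in the text */

typedef struct {
    DemoHttpState state;
    int redirects;   /* redirections followed so far */
} DemoHttp;

/* Advance one step. `status` is the HTTP status code and is consulted
 * in SENT_REQUEST only. Returns the new state. */
DemoHttpState demo_http_next(DemoHttp *h, int status)
{
    assert(h != NULL);
    switch (h->state) {
    case DEMO_BEGIN:           h->state = DEMO_NEED_CONNECTION; break;
    case DEMO_NEED_CONNECTION: h->state = DEMO_NEED_REQUEST;    break;
    case DEMO_NEED_REQUEST:    h->state = DEMO_SENT_REQUEST;    break;
    case DEMO_SENT_REQUEST:
        if (status == 401)
            h->state = DEMO_NEED_ACCESS_AUTHORIZATION;
        else if (status == 301 || status == 302)
            h->state = DEMO_REDIRECTION;
        else if (status == 204)
            h->state = DEMO_NO_DATA;
        else if (status >= 200 && status < 300)
            h->state = DEMO_NEED_BODY;
        else
            h->state = DEMO_ERROR;
        break;
    case DEMO_NEED_ACCESS_AUTHORIZATION:
        /* The connection was closed; retry on a new one. */
        h->state = DEMO_NEED_CONNECTION;
        break;
    case DEMO_REDIRECTION:
        h->state = (++h->redirects > DEMO_MAX_REDIRECTS)
                 ? DEMO_ERROR : DEMO_NEED_CONNECTION;
        break;
    case DEMO_NEED_BODY:       h->state = DEMO_GOT_DATA;        break;
    case DEMO_NO_DATA:
    case DEMO_GOT_DATA:
    case DEMO_ERROR:
        break;   /* terminal: control returns to the application */
    }
    return h->state;
}
```

Because the current state is stored in the DemoHttp object, the function can be left and re-entered between any two steps.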
Post Webs - a Generic Model for Posting on the Web
The HTTP PUT and POST are required
features when extending the Web to a
fully collaborative tool with features like remote authoring,
annotations, update of data bases etc. Many Web applications are
currently capable of transferring data from HTML forms to a HTTP
server. However, form data is typically small amounts of text based
data, and a more generic mechanism is needed for transmitting an
arbitrary data object to any kind of remote server. This document
describes how this functionality can be provided by the "Post Web"
model and how this model interacts with the user, the application, and
the W3C Reference Library. One of the advantages
of this model is that it does not require any modification, neither to
the HTTP/1.0 specification nor to
the HTML form definition.
What is a Post Web?
A "Post Web" is used as an abstraction mechanism for enabling the user
to perform multiple operations (methods) on a data object rendered in
multiple representations determined for multiple destinations. This
may seem complicated but the Post Web is in fact a very simple model
as will become clear in the following sections. The purpose of the
Post Web is to take a set of common situations from the world of email
and news; merge it with the features of HTTP, and put the result into
the Web model. This leads to the following set of requirements:
- A post operation can involve one source and a multiple number of
destinations.
- The source can either be a URL referencing a local or a remote
data object, or it can be any object internally managed by the
application, for example a memory buffer containing a document created
by the user.
- Any of the destinations can be a URL referencing either a local or
a remote data object. The object may or may not exist by the time the
posting is initiated.
- The model must not be limited to use HTTP but should be a generic
mechanism for any kind of access scheme supported by the Web model.
- The model must provide possibility for data format conversion from
one media type to another on the fly when the data object is moved
from the source to one or more of the destinations.
- The user must be able to specify a relation between a source and
any of the destinations, for example "Written by". This is equivalent
to the "<LINK>" element in HTML and the "Link:" header in HTTP
and is used to incorporate semantics into the Web topology.
- It must be possible to specify individual operations used for each
destination where an operation can be any non-idempotent operation (or
method) defined by HTTP/1.0. For
example, if three destinations are specified then one can use PUT,
another POST, and the third can use LINK. In the following, post
written in lower case refers to any non-idempotent HTTP method whereas
POST written in uppercase refers to a specific HTTP method.
The Post Web model provides a homogeneous interface to a post
operation regardless of the destination, the specific method, and the
data format used. It describes the full operation from defining the
source and destinations to actually transfer the data over the
network. This process involves three players: the user, the
application, and the W3C Reference Library. Each of these uses the
Post Web model but on different levels of abstraction:
- The user
- To the user, the Post Web is a way of defining a source object and
one or more destinations to where the object is to be posted. The
model allows the user to describe relations between the source and any
of the destinations and also what method should be used.
- The application
- To the application, the Post Web is a set of bindings between a
source and any of the destinations describing a request for changing
the current Web topology. A binding is described by the link itself, a
link relation, the method (operation) to be performed, and if any data
format conversion has to be performed.
- The Library
- The Library interprets the Post Web as a set of related requests
specifying the access scheme, the operation to be done, the data flow
between them, and the data formats in this data flow.
The following paragraphs describe the three layers of abstraction and how
they are interconnected, and thus define the Post Web model.
For all the possible destinations in a Post Web, the user can specify
what method should be applied, any relations between the source and
any of the destinations, and if any data format conversion should be
performed. The relations are semantically identical to the HTML "Link" tag and the HTTP "Link" header, and they can for
example describe authorship, relations to other data objects etc.
The description of the Post Web model includes a basic example in
which a user wants to post the same data object or variations thereof
to two mailing lists, a news group and at the same time store the data
object on a remote HTTP server. This scenario can be graphically
represented as a Post Web consisting of five nodes: one source and
four destinations:

This document does not specify the user interface for building a Post
Web as this is tightly connected to the platform involved, but
obviously it should take advantage of any graphic features
etc. Typically a GUI-client could use drag-and-drop icons for building
the Web. For example, the Post Web could be visualized using a
collection of icons representing commonly used recipients and then let
the user drag lines between the data object to be posted and the
recipients.
When the user has finished specifying the source, the destinations,
the methods, and any relations between them, the user's version of the
Post Web is ready to be submitted and the application can take the
information and convert it to a lower abstraction level.
While the description of the user's view of a Post Web is fairly
abstract, an actual application must transform the information into a
specific representation supported by the Library. To the application,
the Post Web is a request for change in the topology of the Web. The
application can describe this change using anchor objects,
which are the Library's representation of the Web, where each node
represents a data object or a subpart of a data object that the
application has been in contact with while browsing on the Web.
In the figure below, each of the four anchors has a data object and a
URL related to it. Any of the addresses or data objects may or may not
exist when the Post Web is submitted by the application. If the source
does not exist then this will result in an error, but if a destination
data object does exist when the post operation is committed then this
might result in replacement, deletion, update, or any other outcome,
depending on the method applied.

The Library provides an API for handling anchor objects including how
to link the objects together as indicated in the figure above. This is
explained in more detail in the User's Guide.
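As a rough illustration of the idea of binding a source anchor to destination anchors, the sketch below models anchors and links as plain C structures. All names here (PWAnchor, PWLink, pw_link, pw_find_link) are hypothetical and chosen for this example only; the Library's actual anchor API is the one described in the User's Guide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is NOT the Library's anchor API. */
typedef enum { PW_PUT, PW_POST } PWMethod;

#define PW_MAX_LINKS 8              /* illustrative limit */

typedef struct PWAnchor PWAnchor;

typedef struct {
    PWAnchor   *dest;               /* destination anchor */
    PWMethod    method;             /* operation to apply at the destination */
    const char *relation;           /* link relation (free text in this sketch) */
} PWLink;

struct PWAnchor {
    const char *url;                /* address of the data object */
    PWLink      links[PW_MAX_LINKS];/* outgoing links to destinations */
    size_t      nlinks;
};

/* Bind a destination to a source. Returns 0 on success, -1 if the
   source has no room for more links. */
static int pw_link(PWAnchor *src, PWAnchor *dst, PWMethod m, const char *rel)
{
    assert(src != NULL && dst != NULL);
    if (src->nlinks >= PW_MAX_LINKS) return -1;
    src->links[src->nlinks].dest = dst;
    src->links[src->nlinks].method = m;
    src->links[src->nlinks].relation = rel;
    src->nlinks++;
    return 0;
}

/* Return the link from src to dst, or NULL if the two are not related. */
static const PWLink *pw_find_link(const PWAnchor *src, const PWAnchor *dst)
{
    size_t i;
    for (i = 0; i < src->nlinks; i++)
        if (src->links[i].dest == dst) return &src->links[i];
    return NULL;
}
```

In the example scenario, an application would call pw_link once for each of the four destinations, each with its own method and relation, before handing the source anchor over for posting.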
When the application has bound the source anchor to the destination
anchors with the appropriate methods and link relations, the Post Web
can be handed over to the Library in order to transfer the data object
from the source to the destinations. The Library is responsible for
handling the actual protocol communication, and hence this part of the
Post Web model is the lowest layer of abstraction. Therefore the
design goals for this layer of the Post Web are somewhat more technical
than those for the first two layers:
- Posting to multiple destinations must be compatible with libwww
threads and external thread implementations. In the case of libwww
threads, it must use non-blocking, interruptible I/O.
- The Library must be capable of handling concurrent write and read
operations to and from the network.
- There must be no timing requirements that can lead to race
conditions between any of the destinations and the source or between
destinations.
- Redirections and access authentication must be handled on both the
source side and any of the destinations.
Internally, the Library represents a Post Web in two different ways: a
static and a dynamic binding between the source and the
destinations. The static binding is created when the application
issues the request, and it exists until all the sub-requests in the
Post Web have reached a final state. The dynamic binding depends on
the data flow and exists only as long as data is passed through the
Post Web. The dynamic binding can be set up and taken down
independently of the static binding, and often this happens multiple
times during the handling of a request.
As described in the section "Central
Data Structures", the HTRequest structure is
one of the central data structures used to describe a request from the
application. This structure is used in the static binding between the
source and the destinations and it is initialized as soon as the
request is passed to the Library from the application.

At this point no information is known about the data object itself, so
the static binding only contains information about who the source and
the destinations are. The dynamic binding carries information about
data format, content length and other essential metainformation about
the object. The dynamic binding is basically a stream chain that is
established as this information becomes available from the source
server:

- As soon as the source server (which might be the local file system
or a remote HTTP server) is ready to accept a request, the request is
sent off by the Library.
- The Library then waits until the source server starts sending back
a response. In the meantime, the application can issue other
requests as the model is based on non-blocking I/O.
- As soon as data arrives and the data format is identified, the
dynamic bindings between the source and the destinations can be
set up. The binding is basically a connection between the target of the
source request and the input of any of the destination requests. In the
case of multiple destinations, T-streams can be added to supply the
required number of outgoing data flows.
- The destination is now ready for transmitting a request. In the
case of HTTP, the destination request cannot be transmitted before
the full header is known, which is when the metainformation from the
source data object has been parsed.
- A response will arrive for each of the destination requests,
determining whether the posting can continue or not.
- When the dynamic binding is established, any data format
conversion can be inserted between the target of the source request
and the input of any of the destination requests. A converter can
either be placed directly at the target or on any of the inputs, so
that all destinations can have different renditions of the data
object. As the content length will often change when a converter other
than a through line is used, it may be necessary to insert a content
length counter stream which buffers the data object before it is
emitted from the stream.
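The stream chain described in the steps above can be sketched with a deliberately small stream interface. The Stream type and the sink, tee and counter streams below are hypothetical stand-ins written for this example and are not the Library's own stream classes. The sketch shows a T-stream duplicating the data flow to two outputs, and a content length counter that holds the object back until its size is known.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal stream interface; NOT the Library's stream API. */
typedef struct Stream Stream;
struct Stream {
    int (*put)(Stream *self, const char *buf, size_t len); /* 0 on success */
    int (*close)(Stream *self);                            /* end of data  */
    void *state;
};

/* Destination stand-in: collects what it receives in a fixed buffer. */
typedef struct { char data[256]; size_t len; int closed; } Sink;

static int sink_put(Stream *s, const char *buf, size_t len)
{
    Sink *k = s->state;
    if (k->len + len > sizeof k->data) return -1;
    memcpy(k->data + k->len, buf, len);
    k->len += len;
    return 0;
}
static int sink_close(Stream *s) { ((Sink *)s->state)->closed = 1; return 0; }

/* T-stream: forwards every block, and the end of data, to two outputs. */
typedef struct { Stream *a, *b; } Tee;

static int tee_put(Stream *s, const char *buf, size_t len)
{
    Tee *t = s->state;
    int ra = t->a->put(t->a, buf, len);
    int rb = t->b->put(t->b, buf, len);
    return (ra == 0 && rb == 0) ? 0 : -1;
}
static int tee_close(Stream *s)
{
    Tee *t = s->state;
    int ra = t->a->close(t->a);
    int rb = t->b->close(t->b);
    return (ra == 0 && rb == 0) ? 0 : -1;
}

/* Content length counter: buffers the whole object, so that the length
   is known before any data is passed on to the output stream. */
typedef struct { Stream *out; char *data; size_t len, cap, final_length; } Counter;

static int counter_put(Stream *s, const char *buf, size_t len)
{
    Counter *c = s->state;
    if (len == 0) return 0;
    if (c->len + len > c->cap) {
        size_t ncap = (c->len + len) * 2;
        char *p = realloc(c->data, ncap);
        if (p == NULL) return -1;
        c->data = p;
        c->cap = ncap;
    }
    memcpy(c->data + c->len, buf, len);
    c->len += len;
    return 0;
}
static int counter_close(Stream *s)
{
    Counter *c = s->state;
    int r, rc;
    c->final_length = c->len;  /* a real chain would emit this in a header */
    r = (c->len > 0) ? c->out->put(c->out, c->data, c->len) : 0;
    free(c->data);
    c->data = NULL;
    c->len = c->cap = 0;
    rc = c->out->close(c->out);
    return (r == 0 && rc == 0) ? 0 : -1;
}
```

With a tee whose first output is a plain sink and whose second output is a counter in front of another sink, the first destination sees the data as it arrives, while the second receives it only when the source signals the end of data. This is the price of knowing the content length up front.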
Updating the Web Topology
The application can use the result of the operation returned from the
Library to regard the change in the topology of the Web as
successful, erroneous, or any degree in between. The application can
use this information to update, for example, any graphical visualization
of the part of the Web that the user has traversed.
The result of posting a data object varies from protocol to
protocol. Typically transaction oriented protocols can provide an
immediate result whereas relayed protocols cannot. As a general rule
in the design of the Library, protocols other than HTTP should be
supported but not extended beyond their individual limitations. This
means that the Library has to be flexible enough to handle more than
one kind of result from a posting transaction, depending on the protocol
used. As an example, an immediate result from a post transaction is
available using NNTP or HTTP whereas the result from SMTP might be
delayed several days. In practice there is no way that the application
can await a response for that amount of time, and it should therefore
be treated as "Accepted" with no guarantee of completeness.
The Library handles the update of the internal anchor representation
of the Web by registering the outcome of each post operation and binding
it to the link between the source and the destination. This allows
the application to query how two anchors are related and what the
outcome of the operation was that caused the link to be established.
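The distinction between immediate and delayed results can be illustrated with a small classification routine. This is only a sketch: the result codes, the PWLinkRecord structure and the scheme names used as keys are assumptions made for the example, and treating every scheme other than the listed ones as relayed is a cautious choice of this sketch rather than documented Library behaviour.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical outcome codes; the Library's own result codes may differ. */
typedef enum { PW_PENDING, PW_OK, PW_ACCEPTED, PW_ERROR } PWResult;

/* Does the access scheme report the final outcome within the transaction?
   Following the text, HTTP and NNTP do, whereas mail (SMTP) is relayed.
   Unknown schemes are treated as relayed in this sketch. */
static int pw_is_immediate(const char *scheme)
{
    return strcmp(scheme, "http") == 0 ||
           strcmp(scheme, "news") == 0 ||
           strcmp(scheme, "nntp") == 0;
}

/* Map the success of the transfer to the result stored on the link. */
static PWResult pw_classify(const char *scheme, int transfer_ok)
{
    if (!transfer_ok) return PW_ERROR;
    return pw_is_immediate(scheme) ? PW_OK : PW_ACCEPTED;
}

/* Hypothetical record of one source-to-destination link and its outcome. */
typedef struct {
    const char *src_url;
    const char *dst_url;
    PWResult    result;
} PWLinkRecord;

/* Register the outcome of a post operation on the link, so that the
   application can later query what happened. */
static void pw_register(PWLinkRecord *link, const char *scheme, int transfer_ok)
{
    link->result = pw_classify(scheme, transfer_ok);
}
```

A mail destination whose message was handed over successfully ends up as PW_ACCEPTED, which signals to the application that delivery is not guaranteed.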
DNS Cache and Host Name Canonicalization
Excessive communication with remote Domain Name Servers (DNS) can
add significant time overhead when requesting a document from a
remote server, which can degrade the performance of the
application. This is often the case in spite of DNS's own cache, as
the request still has to cross the network. In order to prevent this,
the Library has its own internal memory cache of host names, which is
updated every time a host name is resolved through DNS. Once the
host name has been resolved into an IP-address, it is stored in the
cache. The entry stays in the cache until either an error occurs when
connecting to the remote host or it is removed during garbage
collection. However, as each entry is fairly small, the cache can
hold a large number of entries.
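The life cycle of a cache entry can be sketched as follows. This is a minimal illustration using a fixed-size table and IPv4 addresses; the names (HostCache, hc_find, hc_store, hc_drop) and the table size are assumptions made for the example and do not describe the Library's actual cache implementation. Garbage collection is left out.

```c
#include <assert.h>
#include <string.h>

#define HC_SLOTS    64         /* illustrative table size */
#define HC_NAME_MAX 64         /* illustrative host name limit */

typedef struct {
    char          name[HC_NAME_MAX]; /* empty string marks a free slot */
    unsigned char addr[4];           /* resolved IPv4 address */
} HostEntry;

typedef struct { HostEntry slot[HC_SLOTS]; } HostCache;

/* Look up a host name; returns NULL on a cache miss. */
static HostEntry *hc_find(HostCache *c, const char *name)
{
    int i;
    for (i = 0; i < HC_SLOTS; i++)
        if (c->slot[i].name[0] != '\0' && strcmp(c->slot[i].name, name) == 0)
            return &c->slot[i];
    return NULL;
}

/* Store a freshly resolved address, overwriting any existing entry for
   the same name. Returns 0 on success, -1 if the name is unusable or
   the table is full. */
static int hc_store(HostCache *c, const char *name, const unsigned char addr[4])
{
    int i;
    HostEntry *e;
    if (name[0] == '\0' || strlen(name) >= HC_NAME_MAX) return -1;
    e = hc_find(c, name);
    for (i = 0; e == NULL && i < HC_SLOTS; i++)
        if (c->slot[i].name[0] == '\0') e = &c->slot[i];
    if (e == NULL) return -1;
    strcpy(e->name, name);
    memcpy(e->addr, addr, 4);
    return 0;
}

/* Remove an entry, for example after connecting to its address failed. */
static void hc_drop(HostCache *c, const char *name)
{
    HostEntry *e = hc_find(c, name);
    if (e != NULL) e->name[0] = '\0';
}
```

A caller would consult hc_find before asking the resolver, call hc_store after a successful resolution, and call hc_drop when a connect attempt to the cached address fails.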
Multi-homed hosts are treated specially as all available IP-addresses
returned from DNS are stored in the cache. Every time a request is
made to the host, the time-to-connect is measured and a weight
function is calculated to indicate how fast the IP-address was. The
weight function used is

[equation not preserved in this text version]

where one parameter indicates the sensitivity of the function and the
other is the connect time. If one IP-address is
not reachable a penalty of x seconds is added to the weight where the
penalty is a function of the error returned from the "connect"
call. The next time a request is initiated to the remote host, the
IP-address with the smallest weight is used.
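Since the weight function itself is not reproduced in this text, the sketch below uses an assumed form, an exponentially weighted moving average of the connect time, purely to illustrate how weights could be updated and compared. The formula, the parameter alpha (standing in for the sensitivity) and all names are assumptions of this example and should not be read as the function actually used by the Library.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HOMES 8             /* illustrative limit on addresses per host */

typedef struct {
    size_t homes;               /* number of IP-addresses known for the host */
    double weight[MAX_HOMES];   /* smaller weight means faster address */
} HomeWeights;

/* ASSUMED weight function (not taken from the source):
       w_new = alpha * w_old + (1 - alpha) * connect_secs
   where 0 < alpha < 1. A smaller alpha reacts faster to new samples. */
static void hw_update(HomeWeights *h, size_t used, double connect_secs,
                      double alpha)
{
    assert(used < h->homes && alpha > 0.0 && alpha < 1.0);
    h->weight[used] = alpha * h->weight[used] + (1.0 - alpha) * connect_secs;
}

/* Add a penalty, in seconds, to an address that could not be reached.
   How the penalty is derived from the connect error is left to the caller. */
static void hw_penalize(HomeWeights *h, size_t failed, double penalty_secs)
{
    assert(failed < h->homes);
    h->weight[failed] += penalty_secs;
}

/* Pick the address with the smallest weight for the next request. */
static size_t hw_best(const HomeWeights *h)
{
    size_t i, best = 0;
    for (i = 1; i < h->homes; i++)
        if (h->weight[i] < h->weight[best]) best = i;
    return best;
}
```

With two addresses starting at weight zero, a two second connect on the first one makes the second the preferred choice, and a penalty on the second one reverses the preference again.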
A problem with both the host name cache and the data object cache is
detecting when two URLs are equivalent. The only way this can be done
internally in the Library is to canonicalize the URLs before they are
compared. This has for some time been done by looking at the path
segment of the URLs and removing redundant information, so that the
following paths are treated as equivalent:
foo/./bar/ = foo/redundant/../bar/ = foo/bar/
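The path simplification can be sketched as a single in-place pass over the path segments. The function name path_simplify and the handling of corner cases (leading ".." segments are kept, empty segments are preserved) are choices made for this example, not a description of the Library's own routine.

```c
#include <assert.h>
#include <string.h>

/* Simplify a path in place: "." segments are dropped and a ".." segment
   cancels the segment before it. Leading ".." segments are kept, since
   there is nothing in front of them to cancel. */
static void path_simplify(char *path)
{
    char *in = path, *out = path, *base;

    if (*in == '/') { in++; out++; }        /* keep a leading slash */
    base = out;                             /* nothing before base is removed */

    while (*in != '\0') {
        size_t n = strcspn(in, "/");        /* length of this segment */
        int slash = (in[n] == '/');         /* is the segment followed by '/'? */
        int dot = (n == 1 && in[0] == '.');
        int dotdot = (n == 2 && in[0] == '.' && in[1] == '.');

        if (dot) {
            /* redundant: emit nothing */
        } else if (dotdot && out > base) {
            out--;                          /* step back over the '/' */
            while (out > base && out[-1] != '/')
                out--;                      /* and over the previous segment */
        } else {
            memmove(out, in, n);            /* copy the segment down */
            out += n;
            if (slash) *out++ = '/';
            if (dotdot) base = out;         /* a kept ".." cannot be cancelled */
        }
        in += n + (size_t)slash;
    }
    *out = '\0';
}
```

Applied to the two redundant forms shown above, the function leaves the buffer holding foo/bar/ in both cases, so a plain string comparison is enough afterwards.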
The method has been optimized and extended so that host names are also
canonicalized. Hence the following URLs are all recognized to be
identical:
http://www/ = http://www.w3.org:80/ = http://Www.W3.Org/ =
http://www.w3.org./ = http://www.w3.org/
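Part of the host name canonicalization can be sketched as simple string rewriting: folding the name to lower case, dropping a trailing dot, and dropping a port number equal to the default for the scheme. The function host_canon below is a hypothetical helper written for this example. It does not expand an unqualified name such as "www" into a fully qualified one, which requires knowledge of the local domain, and it does not handle address literals containing colons.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Canonicalize a "host[:port]" string in place. default_port is the
   default port of the scheme as a string, for example "80" for http. */
static void host_canon(char *hostport, const char *default_port)
{
    char *p, *colon;
    size_t n;

    /* Host names are case-insensitive: fold to lower case. */
    for (p = hostport; *p != '\0'; p++)
        *p = (char)tolower((unsigned char)*p);

    /* Drop the port when it is the default one. */
    colon = strchr(hostport, ':');
    if (colon != NULL && strcmp(colon + 1, default_port) == 0)
        *colon = '\0';

    /* Drop a trailing dot on the host part, keeping any remaining port. */
    colon = strchr(hostport, ':');
    n = (colon != NULL) ? (size_t)(colon - hostport) : strlen(hostport);
    if (n > 0 && hostport[n - 1] == '.')
        memmove(hostport + n - 1, hostport + n, strlen(hostport + n) + 1);
}
```

With "80" as the default port, the host parts www.w3.org:80, Www.W3.Org and www.w3.org. all come out as www.w3.org, while a non-default port is preserved.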
However, the canonicalization does not recognize alias host names,
as this would require that the alias information is stored in the cache. In
order to do this, a separate resolver library must be provided, as this
information is normally not returned by the default resolver
libraries. Also, these libraries do not support non-blocking sockets, and
hence delays cannot be avoided when resolving a host name. The
solution is of course to write a resolver library which handles these
features, and this is under consideration.
Henrik Frystyk, libwww@w3.org, November 1995