W3C Library Internals
Any suggestions or
ideas are welcome at libwww@w3.org.
This guide describes the modules and their internal relations in the
W3C Reference Library. Every module in the
Library has an HTML document associated with it containing a detailed
description of the functionality and interface to other modules. This
page is the top node for the implementation specific documentation and
contains links to all the modules in the Library. The documentation
is dynamically kept up to date as the actual include files (.h) are
generated from the HTML documents using the Line Mode Browser.
This document is also available as one big
HTML file intended for printout. Please note that not all
links in this version work!
Table of Contents
- Core Modules
- Application Preferences
- Application Specific Implementations
- Network Specific Modules
- Protocol Modules
- Protocol Utility Modules
- Libwww Thread Modules
- Central Data Structures
- Generic Streams
- Structured Streams
- HTML Presentation Modules
- Style Sheets
- URI Management
- Generic Programming Utilities
When compiling the Library please make sure that you have a relatively
new version of the Line Mode Browser
that parses the HTML documents correctly (2.13 and newer
versions). Find out what version of the Line Mode Browser you are
using by typing
./www -version
Also remember that when editing the module interfaces or adding
functionality, you should always edit the HTML files and not the .h
files!
Core Modules
The "core modules" are
the fundamental part of the Library. The core entity contains hooks
for the dynamic modules and provides the major access points for
applications issuing requests, for example to access a data object.
- WWWLib Include
- This is not really a core module, but an important part as this is
the only include file you need to use the Library.
- Access Manager
- The access manager is the main entry point for requesting a data
object pointed to by a URI. It has a set of methods that allows the
application to request different services, for example to get a URI,
post a URI, or to search a URI.
- Protocol Manager
- The protocol manager is invoked by the access manager in order to
access a document not found in memory or in file cache. The manager
consists of a set of protocol modules handling the access schemes
HTTP, FTP, NNTP, Gopher, WAIS, Telnet, and access to the local file
system. The protocol modules are registered dynamically (using static
linking) and the User's Guide describes how
modules can be registered. Each protocol module is responsible for
establishing the connection to the remote server (or the local
file-system) and extracting information using a specific access
method. When data arrives from the network, it is passed on to the
format manager.
- Format Manager
- The stream format manager takes care of data format conversion
requested based on a set of registered format converters and a simple
algorithm for selecting the best conversion.
- Cache Manager
- The cache manager is used to save data objects once they have been
downloaded from the network. The cache uses the hierarchy indicated
in the URLs to identify items in the cache, but it is still under
construction and requires a lot of work to become a highly efficient cache
manager!
- Error Manager
- This module manages an information stack which contains
information on all errors that occurred during the communication with a
remote server or simply information about the current state. Using a
stack for this kind of information provides the possibility of nested
error messages where each message can be classified and filtered
according to its impact on the current request, for example "Fatal",
"Non-Fatal", "Warning" etc. The filtering can be used to decide which
level of messages will be passed back to the user.
- Event Manager
- The event manager is a "session layer" handling which thread
should be the active thread. A thread can either be an internal libwww
thread or an external thread, for example a Posix thread, and the
event manager can itself be either the internal Library manager or an
external event manager. Currently the internal event manager uses a
select function call to decide which thread should be made the active
one; an external event manager can, however, use another decision
model. The event manager is described together with the internal
thread model in the section "Libwww Threads", and more
modules are described in the section "Libwww Thread Modules".
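The information stack kept by the error manager can be pictured with a small sketch. This is not the Library's implementation: the names ErrorStack, error_push and error_filter, the fixed stack size, and the four severity classes are assumptions made only for illustration. The sketch shows how messages can be stacked as they occur and later filtered by their impact on the request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical severity classes, ordered from least to most serious. */
typedef enum { SEV_INFO, SEV_WARNING, SEV_NON_FATAL, SEV_FATAL } Severity;

typedef struct {
    Severity    severity;
    const char *message;
} ErrorInfo;

#define ERROR_STACK_MAX 16   /* arbitrary limit chosen for the sketch */

typedef struct {
    ErrorInfo items[ERROR_STACK_MAX];
    size_t    count;
} ErrorStack;

/* Push a message on the stack; returns 0 on success, -1 if the stack is full. */
int error_push(ErrorStack *stack, Severity severity, const char *message)
{
    assert(stack != NULL);
    if (stack->count >= ERROR_STACK_MAX)
        return -1;
    stack->items[stack->count].severity = severity;
    stack->items[stack->count].message = message;
    stack->count++;
    return 0;
}

/* Copy up to max messages whose severity is at least threshold into out,
   newest first, and return how many were copied. */
size_t error_filter(const ErrorStack *stack, Severity threshold,
                    const char **out, size_t max)
{
    size_t copied = 0;
    size_t i;

    assert(stack != NULL && out != NULL);
    for (i = stack->count; i > 0 && copied < max; i--) {
        if (stack->items[i - 1].severity >= threshold)
            out[copied++] = stack->items[i - 1].message;
    }
    return copied;
}
```

An application that only wants to show warnings and worse would call error_filter with SEV_WARNING as the threshold and present the returned messages to the user.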
Application Preferences
These modules handle all the modules that can be registered, either
dynamically (using static binding) or statically, by the application.
Default Initializations
The HTInit module
defines a standard set of
initializations where all protocol modules and converters are
set up at application startup in HTLibInit(). Often
you can take this module and just override it with your own
preferences.
Handling a Configuration File
Bindings can also be set up by a rule file that is handled by the Configuration File
Manager. The format of the rule file has yet to be specified and
still needs some work.
Bindings to the local file system
The Bind Module makes
bindings between file suffixes and content-type, content-language,
charset etc. Applications use this when talking to servers that do not
support media types, for example FTP servers, and servers use
it when performing format negotiation between multiple representations
of a document.
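A binding between file suffixes and content types can be sketched as a simple lookup table. The table contents, the SuffixBinding type and the function content_type_for are hypothetical and do not reflect the interface of the Bind Module; the sketch assumes a case-sensitive match on the last suffix only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical binding between a file suffix and a media type. */
typedef struct {
    const char *suffix;        /* suffix including the leading dot */
    const char *content_type;  /* media type bound to the suffix */
} SuffixBinding;

/* Illustrative table; a real application would register its own bindings. */
static const SuffixBinding bindings[] = {
    { ".html", "text/html" },
    { ".txt",  "text/plain" },
    { ".gif",  "image/gif" },
    { ".ps",   "application/postscript" }
};

/* Return the content type bound to the last suffix of filename,
   or fallback when no binding matches. */
const char *content_type_for(const char *filename, const char *fallback)
{
    const char *dot;
    size_t i;

    assert(filename != NULL);
    dot = strrchr(filename, '.');
    if (dot == NULL)
        return fallback;
    for (i = 0; i < sizeof bindings / sizeof bindings[0]; i++) {
        if (strcmp(dot, bindings[i].suffix) == 0)
            return bindings[i].content_type;
    }
    return fallback;
}
```

A client receiving "README.txt" from an FTP server could use such a lookup to decide that the data should be treated as text/plain.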
Registering of Methods
The idea behind this module is to allow dynamic registration of HTTP
methods, for example PUT, POST, GET etc. and then also allow new
methods to be registered. This has not yet been implemented but the
structure is there :-)
Registration of Access Schemes
Access schemes like HTTP, FTP etc. can be registered dynamically
(using static binding) and the HTProt module provides the
support for this.
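The idea of registering access schemes at run time can be illustrated with a small table of scheme names and load functions. The names scheme_register, scheme_lookup, LoadFunction and the two stand-in loaders are assumptions made for this sketch; they are not the interface of the HTProt module.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature of a protocol module's load function. */
typedef int (*LoadFunction)(const char *url);

typedef struct {
    const char  *scheme;   /* for example "http" */
    LoadFunction load;
} SchemeEntry;

#define MAX_SCHEMES 16     /* arbitrary limit chosen for the sketch */

static SchemeEntry registry[MAX_SCHEMES];
static size_t      registry_count = 0;

/* Register a handler for a scheme; registering a scheme again replaces
   the earlier handler. Returns 0 on success, -1 if the table is full. */
int scheme_register(const char *scheme, LoadFunction load)
{
    size_t i;

    assert(scheme != NULL && load != NULL);
    for (i = 0; i < registry_count; i++) {
        if (strcmp(registry[i].scheme, scheme) == 0) {
            registry[i].load = load;
            return 0;
        }
    }
    if (registry_count >= MAX_SCHEMES)
        return -1;
    registry[registry_count].scheme = scheme;
    registry[registry_count].load = load;
    registry_count++;
    return 0;
}

/* Find the handler for a URL by comparing the text before the first ':'. */
LoadFunction scheme_lookup(const char *url)
{
    const char *colon;
    size_t len, i;

    assert(url != NULL);
    colon = strchr(url, ':');
    if (colon == NULL)
        return NULL;
    len = (size_t)(colon - url);
    for (i = 0; i < registry_count; i++) {
        if (strlen(registry[i].scheme) == len &&
            strncmp(registry[i].scheme, url, len) == 0)
            return registry[i].load;
    }
    return NULL;
}

/* Stand-in loaders that only identify themselves by their return value. */
int load_http(const char *url) { (void)url; return 1; }
int load_ftp(const char *url)  { (void)url; return 2; }
```

With this arrangement the access manager does not need to know the protocol modules at compile time; it only asks the registry which load function belongs to the scheme of the URL at hand.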
Registration of Proxies and Gateways
The HTProxy Module
provides functionality for registering proxies and gateways
dynamically. This has traditionally been handled by environment
variables, but is now a registration module just like the registration
of access schemes.
Application Specific Modules
When all public functions and variables within a module are
overwritten by a module other than the one in the Library, the linker
takes the new version and ignores the module in the Library. The
following modules are implemented in the Library in order to support
the Line Mode Browser but can be
overwritten by GUI clients etc.
Displaying and Prompting User Messages
The HTAlert module
contains the code for prompting the user for file names, userid,
password etc. Furthermore, it presents messages containing status
information, error messages etc. to the user. The implementation in
the library is meant for the Line Mode Browser (i.e. it writes to
stderr) but can easily be overwritten by GUI browsers.
History Manager
The HTHist module
records and replays on request the documents which the user
visits. There are no calls to this module within the Library so if the
application does not use it then it is not linked in at all. If the
application wants a more advanced history management, then this should
be overwritten.
Internal Event loop
The internal event loop in the HTEvtrg module is made to
support libwww threads. If
an application wants its own event loop, then this module must be
overwritten.
Default Initializations
The HTInit module
defines a standard set of initializations where all protocol modules
and converters are set up at application startup in HTLibInit(). Often
you can take this module and just override it with your own
preferences.
Network Modules
Network modules handle all the network access, including DNS access and
reading from and writing to the network. Most of these
modules are internal to the Library, however some applications might
use some of them directly.
TCP Communication
The functionality of the
HTTCP Module covers several topics but they are all related to
TCP/IP communication. All active and passive connection establishment
from the Protocol Modules goes through
this module. Furthermore, the module manages a local host cache of
visited hosts so that the Domain Name Server is only consulted when
necessary.
Other topics include:
- I/O status indication (errno etc.)
- Information on remote hosts
- Information on local host (domain name etc.)
- Information on current user (mail address)
Reading Data from a Socket
The HTSocket Module
controls all the functionality for reading information from a socket
and from the local file system using either a socket or an ANSI C file
descriptor.
The module is currently being transformed into a completely reentrant
version, but the old non-reentrant interface is still needed for some
functions. However, it will be taken out as soon as possible.
Protocol Modules
A protocol module is invoked by the HTAccess module in order
to access a document. Each protocol module is responsible for handling
the transmission of a data object either from the application to a
remote server, or vice versa.
The protocol modules are registered dynamically (using static linking)
and the User's Guide describes how modules can
be registered. Each protocol module is responsible for establishing
the connection to the remote server (or the local file-system) and
extracting information using a specific access method. When data arrives
from the network, it is passed on to the format manager.
Most of the protocol modules are now implemented as state machines in
order to support libwww
Threads.
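What it means for a protocol module to be a state machine can be sketched as follows. The states, the Transfer structure and the function transfer_step are hypothetical and much simpler than any real protocol module; the io_ready argument stands in for whatever the event loop reports about the socket. The point of the sketch is that each call performs at most one operation that could block and then returns, so that another thread can be served in the meantime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states of a single transfer. */
typedef enum {
    ST_BEGIN, ST_CONNECTING, ST_SENDING, ST_READING, ST_DONE
} TransferState;

typedef enum { STEP_WOULD_BLOCK, STEP_FINISHED } StepResult;

typedef struct {
    TransferState state;   /* remembered between calls */
} Transfer;

/* Advance the transfer. io_ready tells whether the operation the current
   state is waiting for can complete now; it is consumed by the first
   operation so that each call does at most one such operation. */
StepResult transfer_step(Transfer *t, int io_ready)
{
    assert(t != NULL);
    for (;;) {
        switch (t->state) {
        case ST_BEGIN:                 /* needs no I/O */
            t->state = ST_CONNECTING;
            break;
        case ST_CONNECTING:
            if (!io_ready) return STEP_WOULD_BLOCK;
            io_ready = 0;
            t->state = ST_SENDING;
            break;
        case ST_SENDING:
            if (!io_ready) return STEP_WOULD_BLOCK;
            io_ready = 0;
            t->state = ST_READING;
            break;
        case ST_READING:
            if (!io_ready) return STEP_WOULD_BLOCK;
            io_ready = 0;
            t->state = ST_DONE;
            break;
        case ST_DONE:
            return STEP_FINISHED;
        }
    }
}
```

Because all progress is stored in the Transfer structure and not on the call stack, the caller can interleave several transfers simply by calling transfer_step on whichever one is ready.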
When the client passes a request to the Library, an HTRequest structure
is filled out and passed to a load function in the access manager, for
example
HTLoadAnchor. HTRequest contains all information needed by the
Library in order to fulfill a request.
- File access
- This module provides access to files on a local file system. Due
to general confusion about the "file://" access scheme in the URL Specifications, the module tries FTP access on
failure.
- FTP access
- This is a complete state based FTP client which is capable of
communicating with a lot of weird FTP servers. It uses
PASV as the default method for establishing the data
connection as PORT does not work if the application is
run from a firewall machine, as is often the case with proxy servers.
- HTTP access
- The HTTP
module handles document search and retrieval using the HTTP protocol. See also information on
the current
implementation of the HTTP client. The module is now a complete
state machine, which is required by the libwww thread model. It uses
streams for both outgoing and incoming data: the outgoing stream is
implemented in HTTPReq.c
and the incoming stream in HTTP.c.
- News access
- The NNTP internet news protocol is handled by HTNews which builds a
hypertext object.
This module is under
reconstruction!
- Gopher access
- The internet gopher access to menus and flat files (and links to telnet
nodes, WhoIs servers, CSO Name Server etc.) is handled by HTGopher Module.
- Telnet access
- This module provides the possibility of running telnet sessions in
a subshell. It also provides functionality for rlogin and tn3270.
- WAIS access
- WAIS access is not compiled into the Library by default as it
requires the freeWAIS library. This is easily changed in the platform
dependent
Makefile.include in the
WWW/All/<platform>
directory. If the freeWAIS library is present then the application
can communicate directly with a WAIS server. Otherwise it must go
through a gateway
program.
Protocol Utility Modules
The protocol modules themselves can be registered dynamically (using
static binding) but some of the functionality used by the modules is
kept in a set of protocol utility modules that are described in this
section.
Access Authorization
In order to prevent unauthorized access on a Web server, a basic
authorization scheme has been developed, see Access
Authorization for more details on the scheme. The access
authorization is implemented in the following modules:
- HTAABrow
- This module contains WWW browser specific code, that is, composing
the HTTP Authorization header, recording user information etc.
- HTAAUtil
- This module contains the authorization code that is common to both the
servers and clients, e.g., handling information on different authentication
schemes etc.
- UU Encoding and Decoding
- Provides functions to encode and decode a data buffer according to
the RFC
1421 "Privacy Enhancement for Internet Electronic Mail".
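The encoding step can be sketched as below, assuming the printable encoding of RFC 1421, which maps each group of three bytes to four characters from a 64-character alphabet and pads the last group with '='. The function name printable_encode and its signature are hypothetical and are not taken from the Library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The 64 printable characters, indexed by their 6-bit value. */
static const char alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode len bytes from in into out as NUL-terminated text.
   out must have room for 4 * ((len + 2) / 3) + 1 bytes.
   Returns the number of characters written, excluding the NUL. */
size_t printable_encode(const unsigned char *in, size_t len, char *out)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    while (len - i >= 3) {             /* full groups of three bytes */
        out[o++] = alphabet[in[i] >> 2];
        out[o++] = alphabet[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        out[o++] = alphabet[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)];
        out[o++] = alphabet[in[i + 2] & 0x3f];
        i += 3;
    }
    if (len - i == 1) {                /* one byte left: two pad characters */
        out[o++] = alphabet[in[i] >> 2];
        out[o++] = alphabet[(in[i] & 0x03) << 4];
        out[o++] = '=';
        out[o++] = '=';
    } else if (len - i == 2) {         /* two bytes left: one pad character */
        out[o++] = alphabet[in[i] >> 2];
        out[o++] = alphabet[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        out[o++] = alphabet[(in[i + 1] & 0x0f) << 2];
        out[o++] = '=';
    }
    out[o] = '\0';
    return o;
}
```

A browser composing a basic Authorization header would, for instance, run the "userid:password" string through an encoder of this kind before sending it.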
Presenting Directory Listings and other Listings
When listings return from the protocol modules they are converted into HTML
and passed to the client. Listings might be HTTP directory listings, Gopher
menus, FTP directory listings, CSO Name server etc. The modules providing this
functionality are:
- HTDir
- This is a very configurable module to actually present the listings.
- HTDescript
- This module handles the description field in an HTTP directory listing.
For an HTML file, the default action is to peek at the title of the document.
- HTIcons
- This module handles the set of icons used in the listings (HTTP, Gopher,
FTP etc.).
Logging Requests
The HTLog Module is a
simple log manager that can log the result of a request to the access
manager.
Thread Modules
From version 3.0 (unreleased) of the W3C Reference Library, support
for libwww threads and
other thread models has been implemented, and the multithreaded functionality
has been added as an extra set of modules. For the moment, only the
HTTP module can take full advantage of libwww threads, but both FTP and
Gopher are foreseen for the same functionality. The modules included
are:
Net Manager
The HTNet module
registers sockets as ready for read or write (this includes the
connect statement that is basically a write request). It is an
internal module to the Library.
Internal Event Loop
The HTEvent module is
the Library's own version of the event loop serving the HTTP client,
and it is the application interface to the multithreaded
Library. Clients can either use this module as is or they can
overwrite it with their own event loop. Note that GUI clients
can use the current implementation! The module is now
ported to Windows NT thanks to Charlie Brooks.
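The core decision of a select based event loop, namely finding out which descriptor can be served next, can be sketched as follows. The sketch assumes a POSIX system; the function wait_for_readable is hypothetical and only covers the read case, whereas a real event loop would also watch for write and exception conditions and dispatch to the owning thread.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait until one of the descriptors in fds is readable or the timeout
   expires. Returns the index of the first readable descriptor, or -1 on
   timeout or error. (Hypothetical helper, POSIX only.) */
int wait_for_readable(const int *fds, int count, long timeout_ms)
{
    fd_set readable;
    struct timeval timeout;
    int i;
    int max_fd = -1;

    assert(fds != NULL && count > 0);
    FD_ZERO(&readable);
    for (i = 0; i < count; i++) {
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_SET(fds[i], &readable);
        if (fds[i] > max_fd)
            max_fd = fds[i];
    }
    timeout.tv_sec = timeout_ms / 1000;
    timeout.tv_usec = (timeout_ms % 1000) * 1000;

    /* select() modifies the set so that only ready descriptors remain. */
    if (select(max_fd + 1, &readable, NULL, NULL, &timeout) <= 0)
        return -1;
    for (i = 0; i < count; i++) {
        if (FD_ISSET(fds[i], &readable))
            return i;
    }
    return -1;
}
```

An event loop built on this would call the function repeatedly and make the thread that owns the returned descriptor the active one.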
Central Data Structures
The central data structures are the structures that are a part of the
core entity. Each of the core modules, as explained in section "Control and Data Flow",
relies on one or more of the central data structures. This section
describes the relationship between the core modules and the central
data structures and the relationship between the central data
structures themselves.
Anchors
All anchor management is handled by the HTAnchor module. You
should normally not have to look directly into the HTAnchor structure,
but use the methods provided by the anchor manager.
Request
The HTRequest
structure is defined in the access manager. It is
currently not a completely opaque data structure, but it soon will be,
so be prepared to use methods like the ones that handle the HTAnchor object. Then we
can also better call it an object and not a data structure ;-).
Streams
Almost every stream module has its specific implementation of the
stream structure, but the generic one is defined in the HTStream Module. The
structured stream definition is placed in the HTStruct module. Neither of
these modules has any code directly associated with it; this is left to
the specific stream modules, for example the HTFWrite stream.
HyperDoc Structure
The HyperDoc structure is different from the other central data
structures as it is only declared in the Library - the definition is
left to the application. It is intended to contain information about
data objects, especially hypertext objects that are to be presented to
a user. As an example of a definition, you can look at the Line Mode Browser where it is defined in
the GridText
module. Here it is called "_HText" structure and it contains all
information needed to present and manage a data object in a text based
environment.
Even though the Library does not interfere with the contents of the
HyperDoc object it does provide an API for managing the object. This
API is known as the "HText
API" and it is described further in the User's Guide.
Stream Modules
A stream is an object
which accepts sequences of characters. It is a destination of data
which can be thought of much like an output stream in C++ or an ANSI
C-file stream for writing data to a disk or another peripheral
device. The Library defines a generic stream class in the HTStream module, but
almost all stream modules define their own subclass definition of the
stream object.
Protocol Streams
These are normally internal streams that parse or generate protocol
specific information to communicate with remote servers.
- HTTP Request Stream
- This stream is one of the first real protocol streams - more are to come!
Converters and Presenters
Streams that can be used to convert data from one media type to
another or create a graphic object and present it to the user. Some of these
streams save the data to a local file and then call an
external program, for example a PostScript viewer. These are normally
initialized as an application preference.
- SGML Tokenizer
- Parses the data and generates a
structured stream. Each parser instance is created with reference
to a particular DTD structure.
- Plain to HTML Converter
- This stream takes a plain file and converts it into HTML. Like the
SGML tokenizer, it also converts a generic stream into a structured
stream.
- Plain Text Presentation
- Takes plain ASCII text and presents it to the user as preformatted text.
- HTTP/MIME header parser
- Parses a MIME format message and puts all the information in an Anchor object.
- WAIS source file Stream
- Parses a WAIS source description file. By default, this
is enabled even if direct WAIS access is not present (no linking with
the freeWAIS library).
- Guessing Stream
- If the input format is unknown when a
stream stack is set up, this module scans part of the stream and, on a
statistical basis, determines the content-type and thereby the type of stream needed.
- External Parser with Call back
- This is a call back stream module where the implementation is
defined in the application and not in the Library.
- Save Locally
- The HTSaveLocally stream saves the data object to a local file.
- Save Locally and Execute Application
- The HTSaveAndExecute stream saves the data object to a
local file and calls an external application, for example a PostScript
viewer.
- Save Locally, Execute Application and Call Back
- The HTSaveAndCallBack stream saves the data object to a
local file, calls an external application and when the stream is
freed, the libwww application gets called with a specified call back
function.
I/O Streams
Streams that can write data to a socket or an ANSI C FILE object.
This can be used when redirecting a request to a local file or when
saving a document in the cache.
- ANSI C File Writer stream
- Writes to an ANSI C FILE * object, as opened by fopen, etc.
- Cache Writer Stream
- This is the stream that is used by the cache manager.
- Socket Writer Stream
- Writes to a socket or something opened with the UNIX file I/O open
function.
- Net to Text Converter
- Converts "Net ASCII" line terminators
<CRLF>
into the equivalent C representation which is a '\n'.
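The conversion of line terminators can be sketched as an in-place rewrite of a buffer. The function net_to_text is a hypothetical helper, not the Library's stream. It assumes that each CRLF pair lies completely inside the buffer; a real stream would have to remember a trailing carriage return between calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replace every CRLF pair in buf with a single '\n', working in place.
   A carriage return that is not followed by a line feed is kept.
   Returns the new length of the data in buf. */
size_t net_to_text(char *buf, size_t len)
{
    size_t in = 0, out = 0;

    assert(buf != NULL || len == 0);
    while (in < len) {
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n') {
            buf[out++] = '\n';
            in += 2;
        } else {
            buf[out++] = buf[in++];
        }
    }
    return out;
}
```

Since the output is never longer than the input, the write position can never overtake the read position, which is what makes the in-place rewrite safe.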
Basic Streams
A set of basic utility streams with no or little internal contents
but required in order to cascade streams.
- Tee Stream
- Just writes into two streams at once. Useful for taking a copy for a cache.
- Black Hole
- A quite expensive way of piping data into a hole, where it is forgotten forever.
- Through Line
- A short-circuited stream that returns the same output sink that it is called with.
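How such basic streams cascade can be sketched with a stream object that carries a function pointer for accepting characters. The Stream structure shown here, with a single put_character method, and all the function names are simplifications invented for this sketch; the actual stream class in the HTStream module is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal stream object: one method plus private context. */
typedef struct Stream Stream;
struct Stream {
    void (*put_character)(Stream *me, char c);
    void *context;
};

/* Black hole: accepts every character and discards it. */
static void black_hole_put(Stream *me, char c) { (void)me; (void)c; }

void black_hole_init(Stream *s)
{
    s->put_character = black_hole_put;
    s->context = NULL;
}

/* Counting sink, used here only to observe what a stream receives. */
typedef struct { size_t count; char last; } Counter;

static void counter_put(Stream *me, char c)
{
    Counter *k = me->context;
    k->count++;
    k->last = c;
}

void counter_init(Stream *s, Counter *k)
{
    k->count = 0;
    k->last = '\0';
    s->put_character = counter_put;
    s->context = k;
}

/* Tee: forwards every character to two target streams. */
typedef struct { Stream *first; Stream *second; } TeeTargets;

static void tee_put(Stream *me, char c)
{
    TeeTargets *t = me->context;
    t->first->put_character(t->first, c);
    t->second->put_character(t->second, c);
}

void tee_init(Stream *s, TeeTargets *t, Stream *first, Stream *second)
{
    assert(first != NULL && second != NULL);
    t->first = first;
    t->second = second;
    s->put_character = tee_put;
    s->context = t;
}

/* Convenience helper: feed a NUL-terminated string into a stream. */
void stream_put_string(Stream *s, const char *text)
{
    while (*text != '\0')
        s->put_character(s, *text++);
}
```

The caller of a tee stream does not need to know that two sinks are attached; it writes to the tee exactly as it would write to any other stream, which is what makes cascading possible.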
Structured Stream Modules
The SGML stream generates
a structured stream that is a subclass of a generic stream. The Library defines a
structured stream class in the HTStruct module, but
almost all stream modules define their own subclass definition of the
stream object.
A structured stream uses a DTD definition expressed in a C data
structure - no, it does not understand a real DTD (yet). The
definition of the HTML DTD is placed in the HTMLPDTD module.
- HTML Presentation
- The HTMLPresent stream presents an HTML object to the user
using the "HText API" as explained in the section on HTML Presentation.
- HTML Generation
- The HTMLGenerator stream generates an HTML object which
for example can be written to a file.
- HTML to C conversion
- The HTMLToC stream converts an HTML object into a C
compliant text object. This is for example used by the Line Mode Browser in order to generate C
like include files from the HTML files that you have been reading
throughout this document.
- HTML to LaTeX Conversion
- The HTMLToTeX stream is a very simple HTML to LaTeX
converter. It is not error free but a start.
- HTML to Plain Text Converter
- The HTMLToPlain stream takes an HTML object and converts
it into plain text using the styles in the "HText API".
Presentation Modules
Generating a Graphic Object
This document describes the methods provided for presenting a graphic object to the user. The
implementation in the Library is made for a text oriented browser, so
more advanced GUI clients must overwrite some of these modules. See
more information on which modules to
overwrite.
As mentioned in the how to get started
guide, the definition of a graphic object is free for the
application. Some graphic objects work by storing the whole structure
of the document. Others work by converting the nested structure into
a linear sequence of styled text for display.
Generally, a new platform has a new implementation of the hypertext
object. A GUI client must overwrite the graphic object modules in the
Library in order to take advantage of a more advanced
user-interface. The graphic object as defined in the Library has two
interfaces, depending on how much of the Library code the client wants
to handle on its own:

- SGML Level
- If the client has its own HTML parser then the interface is
between the client HTML parser and the Library SGML parser. The SGML
parser is a general SGML parser which can be setup with a specific DTD
and it feeds the HTML parser with structured data. In this case, you
will be emulating the HTML
module, and generating a hypertext object from the structured
stream. The actual structured stream definition is in the SGML module.
- HTML Level
- If the client wants to use the HTML parser in the Library then
this is the second interface to the Library. The hypertext object is
parsed and the communication with the client is based on a set of
call-back functions in the HTML parser. The call-back functions are
all defined as prototypes in the HText module but the client
must provide the actual code that defines the presentation method used
for a specific HTML tag. If you wish to maintain the structure of the
SGML file within your object, then the SGML interface will be a better
place to connect your code.
You are free to define the structure of the hypertext object (the
structure is left undefined in the HText module). You may want to
define your own styles and font definitions.
Style Sheets
We are currently working
on implementing an HTML level
3 parser in the Library, so the following description will soon be
out of date.
The Style module in
the Library currently only handles a flat style structure with no
functionality for nested styles. You don't have to use this module, as
you can replace the entry points in the HTML module with your own,
to prevent the library version from being loaded.
URI Management
The functionality for handling URIs is placed in the following modules:
Parsing URIs
The HTParse module
provides functions for parsing URIs and simplifying them by removing
redundant information. These functions are called automatically by the Anchor manager every
time an anchor is created in order to minimize the number of redundant
anchors.
URI Encoding and Decoding
The HTEscape module
can search a URI for unsafe characters and escape and unescape them
according to the URI
Specifications.
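Escaping and unescaping can be sketched as below. Which characters count as unsafe depends on the part of the URI being handled; the sketch simply assumes that letters, digits and the characters "-", "_", ".", "~" and "/" are safe and escapes everything else as a percent sign followed by two hexadecimal digits. The function names are hypothetical and do not describe the HTEscape interface.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumption for this sketch: only these characters are left unescaped. */
static int is_safe(unsigned char c)
{
    return isalnum(c) || c == '-' || c == '_' || c == '.' ||
           c == '~' || c == '/';
}

/* Write an escaped copy of in to out (NUL-terminated).
   Returns 0 on success, -1 if out_size is too small. */
int uri_escape(const char *in, char *out, size_t out_size)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        if (is_safe(c)) {
            if (o + 1 >= out_size) return -1;   /* keep room for the NUL */
            out[o++] = (char)c;
        } else {
            if (o + 3 >= out_size) return -1;
            out[o++] = '%';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0f];
        }
    }
    if (o >= out_size) return -1;
    out[o] = '\0';
    return 0;
}

/* Value of a hexadecimal digit, or -1 if c is not one. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode %XX sequences in place; malformed sequences are left untouched. */
void uri_unescape(char *s)
{
    char *out = s;

    assert(s != NULL);
    while (*s != '\0') {
        int hi = (*s == '%') ? hex_value(s[1]) : -1;
        int lo = (hi >= 0) ? hex_value(s[2]) : -1;
        if (lo >= 0) {
            *out++ = (char)(hi * 16 + lo);
            s += 3;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
}
```

Unescaping can be done in place because the decoded text is never longer than the escaped text, whereas escaping needs a separate and larger output buffer.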
Utilities
This document covers the basic programming utilities that can be used
in the client or server to make life easier.
Container Modules
These modules are generic data object storage modules that might be used
wherever convenient. The general rule for freeing memory from these modules is
that the free methods handle data structures generated within the modules, whereas
user data is for the caller to free. The modules consist of:
- Binary Trees
- This is a complete balanced binary tree that might be used for storing
and sorting a large number of data objects, e.g. filenames in directory
listings etc.
- Dynamic Strings
- A Chunk is a block wise expandable array of type (char *) and is a sort of
apology for real strings in C. Chunks make it easier to handle dynamic strings
of unknown size. Using chunks is often faster than using the String Copy Routines.
- Linked Lists
- This module provides the functionality for managing a generic list of data
objects. The module is implemented as a singly linked list using the scheme
first in - last out (FILO).
- Association Lists
- This is a small module built on top of HTList that provides a way to
store Name-Value pairs in an easy way.
- Strings
- Routines for dynamic arrays of characters include string copy,
case insensitive comparison etc. It also contains functions for
generating date and time stamps, MessageID etc.
- Atoms
- Atoms are strings which are given representative pointer values so that
they can be stored more efficiently, and comparisons for equality done more
efficiently. The pointer values are in fact entries into a hash table.
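The idea behind atoms can be sketched with a small hash table that hands out one unique pointer per distinct string. The names Atom, atom_for and hash_name, the table size and the hash function are all assumptions made for this sketch and are not taken from the Library's atom module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define ATOM_BUCKETS 101   /* arbitrary table size chosen for the sketch */

typedef struct Atom Atom;
struct Atom {
    Atom *next;   /* next atom in the same hash bucket */
    char *name;   /* private copy of the string */
};

static Atom *buckets[ATOM_BUCKETS];

/* Simple multiplicative string hash reduced to a bucket index. */
static size_t hash_name(const char *name)
{
    size_t h = 0;
    while (*name != '\0')
        h = h * 31 + (unsigned char)*name++;
    return h % ATOM_BUCKETS;
}

/* Return the unique atom for name, creating it on first use.
   Returns NULL if memory runs out. */
const Atom *atom_for(const char *name)
{
    size_t slot;
    Atom *a;

    assert(name != NULL);
    slot = hash_name(name);
    for (a = buckets[slot]; a != NULL; a = a->next) {
        if (strcmp(a->name, name) == 0)
            return a;                 /* already known: same pointer */
    }
    a = malloc(sizeof *a);
    if (a == NULL)
        return NULL;
    a->name = malloc(strlen(name) + 1);
    if (a->name == NULL) {
        free(a);
        return NULL;
    }
    strcpy(a->name, name);
    a->next = buckets[slot];
    buckets[slot] = a;
    return a;
}
```

Once two strings such as media type names have been turned into atoms, testing them for equality is a single pointer comparison and no longer a character by character string comparison.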
Basic Utilities
Look into these modules
before you start defining system dependent stuff. Most things are
already defined here! The list of basic utility modules is currently
as follows:
- System specifics
- The tcp.h file
includes system-specific include files and flags for I/O to network
and disk. The only reason for this file is that the Internet world is
more complicated than Posix and ANSI.
- Platform Independent macros
- The HTUtil.h
include file contains things we need everywhere, generally macros for
declarations, booleans, etc.
Henrik Frystyk, libwww@w3.org, November 1995