W3C Lib LibGuide

W3C Library Internals

Any suggestions or ideas are welcome at libwww@w3.org.

This guide describes the modules and their internal relations in the W3C Reference Library. Every module in the Library has an HTML document associated with it containing a detailed description of its functionality and its interface to other modules. This page is the top node for the implementation specific documentation and contains links to all the modules in the Library. The documentation is dynamically kept up to date as the actual include files (.h) are generated from the HTML documents using the Line Mode Browser.

NOTE This document is also available as one big HTML file intended for printout. Please note that not all links in this version work!

Table of Contents

  1. Core Modules
  2. Application Preferences
  3. Application Specific Implementations

  4. Network Specific Modules
  5. Protocol Modules
  6. Protocol Utility Modules
  7. Libwww Thread Modules

  8. Central Data Structures
  9. Generic Streams
  10. Structured Streams
  11. HTML Presentation Modules
  12. Style Sheets

  13. URI Management
  14. Generic Programming Utilities

When compiling the Library, please make sure that you have a relatively new version of the Line Mode Browser that parses the HTML documents correctly (version 2.13 or newer). Find out which version of the Line Mode Browser you are using by typing
	./www -version
Also remember that when editing the module interfaces or adding functionality, you should always edit the HTML files and not the .h files!

Core Modules

The "core modules" are the fundamental part of the Library. The core entity contains hooks for the dynamic modules and provides the major access points for applications issuing requests, for example to access a data object.
WWWLib Include
This is not really a core module, but an important part as this is the only include file you need to use the Library.
Access Manager
The access manager is the main entry point for requesting a data object pointed to by a URI. It has a set of methods that allows the application to request different services, for example to get a URI, post a URI, or to search a URI.
Protocol Manager
The protocol manager is invoked by the access manager in order to access a document not found in memory or in the file cache. The manager consists of a set of protocol modules handling the access schemes HTTP, FTP, NNTP, Gopher, WAIS, Telnet, and access to the local file system. The protocol modules are registered dynamically (using static linking) and the User's Guide describes how modules can be registered. Each protocol module is responsible for establishing the connection to the remote server (or the local file system) and extracting information using a specific access method. When data arrives from the network, it is passed on to the format manager.
Format Manager
The stream format manager takes care of the data format conversions that are requested, based on a set of registered format converters and a simple algorithm for selecting the best conversion.
Cache Manager
The cache manager is used to save data objects once they have been downloaded from the network. The cache uses the hierarchy indicated in the URLs as a way to identify items in the cache, but it is still under construction and requires a lot of work to become a highly efficient cache manager!
Error Manager
This module manages an information stack which contains information about all errors that occurred during the communication with a remote server, or simply information about the current state. Using a stack for this kind of information makes nested error messages possible, where each message can be classified and filtered according to its impact on the current request, for example "Fatal", "Non-Fatal", "Warning" etc. The filtering can be used to decide which level of messages will be passed back to the user.
Event Manager
The event manager is a "session layer" that decides which thread should be the active thread. A thread can either be an internal libwww thread or an external thread, for example a Posix thread, and the event manager can itself be either the internal Library manager or an external event manager. Currently the internal event manager uses a select function call to decide which thread should be made the active one; an external event manager can use another decision model. The event manager is described together with the internal thread model in the section "Libwww Threads", and more modules are described in the section "Libwww Thread Modules".
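
The information stack kept by the error manager can be pictured with a small sketch. Everything below (the SketchError names, the fixed stack size, the three severity levels) is an assumption made for illustration and is not taken from the Library's error module; it only shows how messages can be pushed as they occur and later filtered by severity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical severity levels, ordered from least to most serious. */
typedef enum {
    SKETCH_WARNING = 0,
    SKETCH_NON_FATAL = 1,
    SKETCH_FATAL = 2
} SketchSeverity;

typedef struct {
    SketchSeverity severity;
    const char *message;
} SketchError;

#define SKETCH_STACK_MAX 16

typedef struct {
    SketchError items[SKETCH_STACK_MAX];
    size_t count;
} SketchErrorStack;

/* Push one message; returns 0 on success, -1 if the stack is full. */
int sketch_error_push(SketchErrorStack *stack, SketchSeverity severity,
                      const char *message)
{
    assert(stack != NULL);
    if (stack->count >= SKETCH_STACK_MAX) return -1;
    stack->items[stack->count].severity = severity;
    stack->items[stack->count].message = message;
    stack->count++;
    return 0;
}

/* Copy the messages at or above `threshold` into `out`, newest first.
   Returns the number of messages copied (at most `out_max`). */
size_t sketch_error_filter(const SketchErrorStack *stack,
                           SketchSeverity threshold,
                           const char **out, size_t out_max)
{
    size_t found = 0;
    size_t i = stack->count;
    while (i > 0 && found < out_max) {
        i--;
        if (stack->items[i].severity >= threshold)
            out[found++] = stack->items[i].message;
    }
    return found;
}
```

An application that only wants to show serious problems would call the filter with the "non-fatal" threshold and ignore plain warnings.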

Application Preferences

These modules handle everything that can be registered either dynamically (using static binding) or statically by the application.

Default Initializations

The HTInit module defines a standard set of initializations where all protocol modules and converters are set up at application startup in HTLibInit(). Often you can take this module and just override it with your own preferences.

Handling a Configuration File

Bindings can also be set up by a rule file that is handled by the Configuration File Manager. The format of the rule file is yet to be specified, and the module still needs some work.

Bindings to the local file system

The Bind Module makes bindings between file suffixes and content-type, content-language, charset etc. Client applications use it when talking to servers that do not support media types, for example FTP servers, and servers use it when performing format negotiation between multiple representations of a document.
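
As a rough illustration of a suffix binding, the sketch below looks up a content type from the last suffix of a file name. The table contents and the names SketchBinding and sketch_content_type are assumptions made for this example and do not reflect the Bind Module's actual interface.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    const char *suffix;        /* file suffix, including the dot */
    const char *content_type;  /* media type bound to that suffix */
} SketchBinding;

/* Hypothetical binding table; a real application would register its own. */
static const SketchBinding sketch_bindings[] = {
    { ".html", "text/html" },
    { ".txt",  "text/plain" },
    { ".gif",  "image/gif" },
    { ".ps",   "application/postscript" }
};

/* Return the content type bound to the last suffix of `filename`,
   or NULL if the name has no suffix or the suffix is not bound. */
const char *sketch_content_type(const char *filename)
{
    const char *dot;
    size_t i;
    assert(filename != NULL);
    dot = strrchr(filename, '.');
    if (dot == NULL) return NULL;
    for (i = 0; i < sizeof sketch_bindings / sizeof sketch_bindings[0]; i++) {
        if (strcmp(dot, sketch_bindings[i].suffix) == 0)
            return sketch_bindings[i].content_type;
    }
    return NULL;
}
```

A fuller version would bind content-language and charset in the same way and would handle several suffixes on one file name.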

Registering of Methods

The idea behind this module is to allow dynamic registration of HTTP methods, for example PUT, POST, GET etc. and then also allow new methods to be registered. This has not yet been implemented but the structure is there :-)

Registration of Access Schemes

Access schemes like HTTP, FTP etc. can be registered dynamically (using static binding) and the HTProt module provides the support for this.
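
The phrase "dynamically (using static binding)" can be read as follows: the protocol code is linked into the program as usual, but which scheme maps to which code is decided by a table filled in at run time. The sketch below shows that idea only; the names SketchLoader, sketch_protocol_add and sketch_protocol_find are hypothetical and are not the HTProt interface.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical loader signature: returns 0 on success. */
typedef int (*SketchLoader)(const char *url);

typedef struct {
    const char *scheme;
    SketchLoader load;
} SketchProtocol;

#define SKETCH_MAX_PROTOCOLS 8
static SketchProtocol sketch_protocols[SKETCH_MAX_PROTOCOLS];
static size_t sketch_protocol_count = 0;

/* Register a loader for a scheme; returns -1 if the table is full. */
int sketch_protocol_add(const char *scheme, SketchLoader load)
{
    assert(scheme != NULL && load != NULL);
    if (sketch_protocol_count >= SKETCH_MAX_PROTOCOLS) return -1;
    sketch_protocols[sketch_protocol_count].scheme = scheme;
    sketch_protocols[sketch_protocol_count].load = load;
    sketch_protocol_count++;
    return 0;
}

/* Find the loader for the scheme part of `url` (the text before the
   first ':'), or NULL if no loader has been registered for it. */
SketchLoader sketch_protocol_find(const char *url)
{
    const char *colon = strchr(url, ':');
    size_t len, i;
    if (colon == NULL) return NULL;
    len = (size_t)(colon - url);
    for (i = 0; i < sketch_protocol_count; i++) {
        if (strlen(sketch_protocols[i].scheme) == len &&
            strncmp(sketch_protocols[i].scheme, url, len) == 0)
            return sketch_protocols[i].load;
    }
    return NULL;
}

/* Two stand-in loaders, statically linked into the program. */
int sketch_load_http(const char *url) { (void)url; return 0; }
int sketch_load_ftp(const char *url)  { (void)url; return 0; }
```

An application would register the schemes it wants at startup and leave out the rest, so unregistered schemes simply find no loader.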

Registration of Proxies and Gateways

The HTProxy Module provides functionality for registering proxies and gateways dynamically. This has traditionally been handled by environment variables, but is now a registration module just like the registration of access schemes.

Application Specific Modules

When all public functions and variables within a module are overwritten by a module other than the one in the Library, the linker takes the new version and ignores the module in the Library. The following modules are implemented in the Library in order to support the Line Mode Browser but can be overwritten by GUI clients etc.

Displaying and Prompting User Messages

The HTAlert module contains the code for prompting the user for file names, userid, password etc. Furthermore, it presents messages containing status information, error messages etc. to the user. The implementation in the library is meant for the Line Mode Browser (i.e. it writes to stderr) but can easily be overwritten by GUI browsers.

History Manager

The HTHist module records and replays on request the documents which the user visits. There are no calls to this module within the Library so if the application does not use it then it is not linked in at all. If the application wants a more advanced history management, then this should be overwritten.

Internal Event loop

The internal event loop in the HTEvtrg module is made to support libwww threads. If an application wants its own event loop, then this module must be overwritten.

Default Initializations

The HTInit module defines a standard set of initializations where all protocol modules and converters are set up at application startup in HTLibInit(). Often you can take this module and just override it with your own preferences.

Network Modules

Network modules handle all the network access, including DNS access, reading from and writing to the network etc. Most of these modules are internal to the Library; however, some applications might use some of them directly.

TCP Communication

The functionality of the HTTCP Module covers several topics but they are all related to TCP/IP communication. All active and passive connection establishment from the Protocol Modules goes through this module. Furthermore, the module manages a local host cache of visited hosts so that the Domain Name Server is only consulted when necessary.

Other topics include:

Reading Data from a Socket

The HTSocket Module controls all the functionality for reading information from a socket and from the local file system using either a socket or an ANSI C file descriptor.

The module is currently being transformed into a completely reentrant version, but the old non-reentrant interface is still needed for some functions. However, it will be taken out as soon as possible.

Protocol Modules

A protocol module is invoked by the HTAccess module in order to access a document. Each protocol module is responsible for handling the transmission of a data object either from the application to a remote server, or vice versa.

The protocol modules are registered dynamically (using static linking) and the User's Guide describes how modules can be registered. Each protocol module is responsible for establishing the connection to the remote server (or the local file system) and extracting information using a specific access method. When data arrives from the network, it is passed on to the format manager.

Most of the protocol modules are now implemented as state machines in order to support libwww Threads. When the client passes a request to the Library, an HTRequest structure is filled out and passed to a load function in the access manager, for example HTLoadAnchor. HTRequest contains all information needed by the Library in order to fulfill a request.
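
To show why a state machine suits the thread model, here is a minimal sketch of a protocol module that can be left at any point and resumed later. The states, events and the function sketch_step are invented for this example; the real protocol modules have their own states and keep them together with the request.

```c
#include <assert.h>

/* Hypothetical states of one request, in the order they are visited. */
typedef enum {
    SKETCH_BEGIN,
    SKETCH_CONNECTING,
    SKETCH_SENDING,
    SKETCH_READING,
    SKETCH_DONE,
    SKETCH_FAILED
} SketchState;

/* Hypothetical outcome of trying to carry out the current state. */
typedef enum {
    SKETCH_EV_OK,
    SKETCH_EV_WOULD_BLOCK,
    SKETCH_EV_ERROR
} SketchEvent;

/* Advance the machine by one step. WOULD_BLOCK leaves the state
   unchanged, so the same step can be retried when the socket is ready. */
SketchState sketch_step(SketchState state, SketchEvent event)
{
    if (state == SKETCH_DONE || state == SKETCH_FAILED) return state;
    if (event == SKETCH_EV_ERROR) return SKETCH_FAILED;
    if (event == SKETCH_EV_WOULD_BLOCK) return state;
    switch (state) {
    case SKETCH_BEGIN:      return SKETCH_CONNECTING;
    case SKETCH_CONNECTING: return SKETCH_SENDING;
    case SKETCH_SENDING:    return SKETCH_READING;
    case SKETCH_READING:    return SKETCH_DONE;
    default:
        assert(!"terminal states are handled above");
        return state;
    }
}
```

Because the current state is an ordinary value, it can be stored with the request while another request is served, which is the property the thread model relies on.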

File access
This module provides access to files on a local file system. Due to general confusion about the "file://" access scheme in the URL Specifications, the module tries FTP access on failure.
FTP access
This is a complete state-based FTP client which is capable of communicating with a lot of weird FTP servers. It uses PASV as the default method for establishing the data connection, as PORT does not work if the application is run from a firewall machine, as is often the case with proxy servers.
HTTP access
The HTTP module handles document search and retrieval using the HTTP protocol. See also information on the current implementation of the HTTP client. The module is now a complete state machine, which is required by the libwww thread model. It uses streams for both outgoing and incoming data: the outgoing stream is implemented in HTTPReq.c and the incoming stream in HTTP.c.
News access
The NNTP internet news protocol is handled by HTNews which builds a hypertext object.

This module is under reconstruction!

Gopher access
The internet gopher access to menus and flat files (and links to telnet nodes, WhoIs servers, CSO Name Server etc.) is handled by HTGopher Module.
Telnet access
This module provides the possibility of running telnet sessions in a subshell. It also provides functionality for rlogin and tn3270.
WAIS access
WAIS access is not compiled into the Library by default as it requires the freeWAIS library. This is easily changed in the platform dependent Makefile.include in the
	WWW/All/<platform>
directory. However, if this library is present then the application can communicate directly with a WAIS server. Otherwise it must go through a gateway program.

Protocol Utility Modules

The protocol modules themselves can be registered dynamically (using static binding), but some of the functionality used by the modules is kept in a set of protocol utility modules that are described in this section.

Access Authorization

In order to prevent unauthorized access on a Web server, a basic authorization scheme has been developed, see Access Authorization for more details on the scheme. The access authorization is implemented in the following modules:
HTAABrow
This module contains WWW Browser specific code, that is, composing the HTTP Authorization Header, recording user information etc.
HTAAUtil
This module contains the authorization code that is common to both the servers and clients, e.g., handling information on different authentication etc.
UU Encoding and Decoding
Provides functions to encode and decode a data buffer according to the RFC 1421 "Privacy Enhancement for Internet Electronic Mail".
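
RFC 1421 describes a printable encoding in which every group of three bytes is written as four characters from a 64-character alphabet, with "=" used as padding. The sketch below encodes a buffer that way, on the assumption that this is the encoding the module refers to. The function name sketch_encode is hypothetical and the code is not the Library's implementation.

```c
#include <assert.h>
#include <stddef.h>

static const char sketch_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode `len` bytes from `in` into `out` as a NUL-terminated string.
   `out` must have room for 4 * ((len + 2) / 3) + 1 characters.
   Returns the number of characters written, excluding the NUL. */
size_t sketch_encode(const unsigned char *in, size_t len, char *out)
{
    size_t i = 0, o = 0;
    assert(in != NULL && out != NULL);
    /* Full groups: 3 input bytes become 4 output characters. */
    while (len - i >= 3) {
        out[o++] = sketch_alphabet[in[i] >> 2];
        out[o++] = sketch_alphabet[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        out[o++] = sketch_alphabet[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)];
        out[o++] = sketch_alphabet[in[i + 2] & 0x3f];
        i += 3;
    }
    /* Trailing group of 2 or 1 bytes, padded with '='. */
    if (len - i == 2) {
        out[o++] = sketch_alphabet[in[i] >> 2];
        out[o++] = sketch_alphabet[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        out[o++] = sketch_alphabet[(in[i + 1] & 0x0f) << 2];
        out[o++] = '=';
    } else if (len - i == 1) {
        out[o++] = sketch_alphabet[in[i] >> 2];
        out[o++] = sketch_alphabet[(in[i] & 0x03) << 4];
        out[o++] = '=';
        out[o++] = '=';
    }
    out[o] = '\0';
    return o;
}
```

Decoding reverses the mapping, turning each group of four characters back into up to three bytes.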

Presenting Directory Listings and other Listings

When listings return from the protocol modules they are converted into HTML and passed to the client. Listings might be HTTP directory listings, Gopher menus, FTP directory listings, CSO Name server etc. The modules providing this functionality are:
HTDir
This is a very configurable module that actually presents the listings.
HTDescript
This module handles the description field in an HTTP directory listing. For an HTML file, the default action is to peek at the title of the document.
HTIcons
This module handles the set of icons used in the listings (HTTP, Gopher, FTP etc.).

Logging Requests

The HTLog Module is a simple log manager that can log the result of a request to the access manager.

Thread Modules

From version 3.0 (unreleased) of the W3C Reference Library, multi-threaded functionality supporting libwww threads and other thread models has been added as an extra set of modules. For the moment, only the HTTP module can take full advantage of libwww threads, but the same functionality is foreseen for both FTP and Gopher. The modules included are:

Net Manager

The HTNet module registers sockets as ready for read or write (this includes the connect statement that is basically a write request). It is an internal module to the Library.

Internal Event Loop

The HTEvent module is the Library's own version of the event loop serving the HTTP client, and it is the application interface to the multi-threaded Library. Clients can either use this module as is or they can overwrite it with their own event loop. Note that GUI clients can use the current implementation!

The module is now ported to Windows NT thanks to Charlie Brooks.
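
An event loop built on select boils down to asking the operating system which registered sockets are ready and then serving one of them. The sketch below, for POSIX systems, shows a single turn of such a loop. The function names are hypothetical, the descriptors must be smaller than FD_SETSIZE, and the code is not the Library's event loop.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait up to `timeout_ms` for any descriptor in `fds` to become
   readable. Returns the first ready descriptor, or -1 on timeout
   or error. */
int sketch_wait_readable(const int *fds, int count, int timeout_ms)
{
    fd_set readable;
    struct timeval timeout;
    int i, max_fd = -1;

    assert(fds != NULL && count > 0);
    FD_ZERO(&readable);
    for (i = 0; i < count; i++) {
        FD_SET(fds[i], &readable);
        if (fds[i] > max_fd) max_fd = fds[i];
    }
    timeout.tv_sec = timeout_ms / 1000;
    timeout.tv_usec = (timeout_ms % 1000) * 1000;

    if (select(max_fd + 1, &readable, NULL, NULL, &timeout) <= 0)
        return -1;
    for (i = 0; i < count; i++) {
        if (FD_ISSET(fds[i], &readable)) return fds[i];
    }
    return -1;
}

/* One turn of the loop: wait, then read from whichever descriptor is
   ready. Returns the number of bytes read, or -1 if nothing was ready. */
long sketch_event_turn(const int *fds, int count, int timeout_ms,
                       char *buffer, size_t size)
{
    int fd = sketch_wait_readable(fds, count, timeout_ms);
    if (fd < 0) return -1;
    return (long)read(fd, buffer, size);
}
```

A real event loop would also watch descriptors for writing and would call the handler registered for the ready descriptor rather than reading directly.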

Central Data Structures

The central data structures are the structures that are a part of the core entity. Each of the core modules, as explained in the section "Control and Data Flow", relies on one or more of the central data structures. This section describes the relationship between the core modules and the central data structures and the relationship between the central data structures themselves.

Anchors

All anchor management is handled by the HTAnchor module. You should normally not have to look directly into the HTAnchor structure, but use the methods provided by the anchor manager.

Request

The HTRequest structure is defined in the access manager. It is currently not a completely opaque data structure, but it soon will be, so be prepared to use methods like the ones that handle the HTAnchor object. Then we can also better call it an object and not a data structure ;-).

Streams

Almost every stream module has its own implementation of the stream structure, but the generic one is defined in the HTStream Module. The structured stream definition is placed in the HTStruct module. Neither of these modules has any code directly associated with it - this is left to the specific stream modules, for example the HTFWrite stream.

HyperDoc Structure

The HyperDoc structure is different from the other central data structures as it is only declared in the Library - the definition is left to the application. It is intended to contain information about data objects, especially hypertext objects that are to be presented to a user. As an example of a definition, you can look at the Line Mode Browser where it is defined in the GridText module. Here it is called "_HText" structure and it contains all information needed to present and manage a data object in a text based environment.

Even though the Library does not interfere with the contents of the HyperDoc object, it does provide an API for managing the object. This API is known as the "HText API" and it is described further in the User's Guide.

Stream Modules

A stream is an object which accepts sequences of characters. It is a destination of data which can be thought of much like an output stream in C++ or an ANSI C-file stream for writing data to a disk or another peripheral device. The Library defines a generic stream class in the HTStream module, but almost all stream modules define their own sub class definition of the stream object.
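
The relation between a generic stream class and a stream module's own sub class can be sketched in plain C as a structure of function pointers plus a structure that embeds the generic part first. All names below (SketchStream, SketchCounter and so on) are invented for this illustration and are not the definitions found in the HTStream module.

```c
#include <assert.h>
#include <stddef.h>

typedef struct SketchStream SketchStream;

/* The "class": a table of methods shared by all streams of one kind. */
typedef struct {
    const char *name;
    void (*put_character)(SketchStream *me, char c);
    void (*put_block)(SketchStream *me, const char *data, size_t len);
} SketchStreamClass;

/* The generic stream: every concrete stream starts with a class pointer. */
struct SketchStream {
    const SketchStreamClass *isa;
};

/* A sub class: a stream that counts the bytes and lines it receives. */
typedef struct {
    SketchStream base;   /* must come first so the pointers can be cast */
    size_t bytes;
    size_t lines;
} SketchCounter;

static void counter_put_character(SketchStream *me, char c)
{
    SketchCounter *self = (SketchCounter *)me;
    self->bytes++;
    if (c == '\n') self->lines++;
}

static void counter_put_block(SketchStream *me, const char *data, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) counter_put_character(me, data[i]);
}

static const SketchStreamClass sketch_counter_class = {
    "SketchCounter", counter_put_character, counter_put_block
};

/* Set up a counter so that it can be used through the generic interface. */
void sketch_counter_init(SketchCounter *counter)
{
    assert(counter != NULL);
    counter->base.isa = &sketch_counter_class;
    counter->bytes = 0;
    counter->lines = 0;
}
```

Code that feeds data into a stream only needs the generic pointer and calls through the method table, so it works unchanged with any sub class.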

Protocol Streams

These are normally internal streams that parse or generate protocol specific information to communicate with remote servers.
HTTP Request Stream
This stream is one of the first real protocol streams - more are to come!

Converters and Presenters

Streams that can be used to convert data from one media type to another, or to create a graphic object and present it to the user. Some of these streams save the data to a local file and then call an external program, for example a PostScript viewer. They are normally initialized as an application preference.
SGML Tokenizer
Parses the data and generates a structured stream. Each parser instance is created with reference to a particular DTD structure.
Plain to HTML Converter
This stream takes a plain file and converts it into HTML. Like the SGML tokenizer, it also converts a generic stream into a structured stream.
Plain Text Presentation
Takes plain ASCII text and presents it to the user as preformatted text.
HTTP/MIME header parser
Parses a MIME format message and puts all the information in an Anchor object.
WAIS source file Stream
Parses a WAIS source description file. By default, this is enabled even if direct WAIS access is not present (no linking with the freeWAIS library).
Guessing Stream
If the input format is unknown at the time when the stream stack is set up, then this module scans a part of the stream and on a statistical basis determines the type of stream needed from the content-type.
External Parser with Call back
This is a call back stream module where the implementation is defined in the application and not in the Library.
Save Locally
The HTSaveLocally stream saves the data object to a local file.
Save Locally and Execute Application
The HTSaveAndExecute stream saves the data object to a local file and calls an external application, for example a PostScript viewer.
Save Locally and Call Back
The HTSaveAndCallBack stream saves the data object to a local file and calls an external application. When the stream is freed, the libwww application gets called with a specified call back function.

I/O Streams

Streams that can write data to a socket or an ANSI C FILE object. This can be used when redirecting a request to a local file or when saving a document in the cache.
ANSI C File Writer stream
Writes to an ANSI C FILE * object, as opened by fopen, etc.
Cache Writer Stream
This is the stream that is used by the cache manager.
Socket Writer Stream
Writes to a socket or something opened with the UNIX file I/O open function.
Net to Text Converter
Converts "Net ASCII" line terminators <CRLF> into the equivalent C representation which is a '\n'.
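
The line terminator conversion can be sketched as a function that rewrites a buffer in place. The name sketch_net_to_text is hypothetical. Unlike a real stream, this sketch sees the whole buffer at once and therefore does not have to remember a carriage return that arrives at the very end of one block.

```c
#include <assert.h>
#include <stddef.h>

/* Convert every CR LF pair in `data` to a single '\n', in place.
   A CR that is not followed by LF is kept as it is.
   Returns the new length of the data. */
size_t sketch_net_to_text(char *data, size_t len)
{
    size_t in = 0, out = 0;
    assert(data != NULL);
    while (in < len) {
        if (data[in] == '\r' && in + 1 < len && data[in + 1] == '\n') {
            data[out++] = '\n';
            in += 2;
        } else {
            data[out++] = data[in++];
        }
    }
    return out;
}
```

A stream version would keep one flag of state between calls so that a CR at the end of one block and an LF at the start of the next are still joined.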

Basic Streams

A set of basic utility streams with no or little internal contents but required in order to cascade streams.
Tee Stream
Just writes into two streams at once. Useful for taking a copy for a cache.
Black Hole
A quite expensive way of piping data into a hole where it is then forgotten forever.
Through Line
A short circuited stream that returns the same output sink as it is called with.
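
How such small streams are cascaded can be shown with a sketch of a tee and a black hole. The SketchSink type and all function names below are assumptions made for this example; the byte-counting sink exists only so that the effect of the tee can be observed.

```c
#include <assert.h>
#include <stddef.h>

typedef struct SketchSink SketchSink;
struct SketchSink {
    void (*put_block)(SketchSink *me, const char *data, size_t len);
};

/* Black hole: accepts any data and throws it away. */
static void hole_put_block(SketchSink *me, const char *data, size_t len)
{
    (void)me; (void)data; (void)len;
}
SketchSink sketch_black_hole = { hole_put_block };

/* A sink that only counts the bytes it is given. */
typedef struct {
    SketchSink base;     /* must come first so the pointers can be cast */
    size_t bytes;
} SketchCounterSink;

static void counter_sink_put_block(SketchSink *me, const char *data, size_t len)
{
    (void)data;
    ((SketchCounterSink *)me)->bytes += len;
}

void sketch_counter_sink_init(SketchCounterSink *counter)
{
    assert(counter != NULL);
    counter->base.put_block = counter_sink_put_block;
    counter->bytes = 0;
}

/* Tee: passes every block on to two other sinks. */
typedef struct {
    SketchSink base;
    SketchSink *first;
    SketchSink *second;
} SketchTee;

static void tee_put_block(SketchSink *me, const char *data, size_t len)
{
    SketchTee *self = (SketchTee *)me;
    self->first->put_block(self->first, data, len);
    self->second->put_block(self->second, data, len);
}

void sketch_tee_init(SketchTee *tee, SketchSink *first, SketchSink *second)
{
    assert(tee != NULL && first != NULL && second != NULL);
    tee->base.put_block = tee_put_block;
    tee->first = first;
    tee->second = second;
}
```

With a tee in front, one branch can feed the presentation while the other feeds a cache writer; replacing a branch with the black hole turns that branch off without changing the rest of the cascade.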

Structured Stream Modules

The SGML stream generates a structured stream that is a sub class of a generic stream. The Library defines a structured stream class in the HTStruct module, but almost all stream modules define their own sub class definition of the stream object.

A structured stream uses a DTD definition expressed in a C data structure - no, it does not understand a real DTD (yet). The definition of the HTML DTD is placed in the HTMLPDTD module.

HTML Presentation
The HTMLPresent stream presents an HTML object to the user using the "HText API" as explained in the section on HTML Presentation.
HTML Generation
The HTMLGenerator stream generates an HTML object which, for example, can be written to a file.
HTML to C conversion
The HTMLToC stream converts an HTML object into a C compliant text object. This is for example used by the Line Mode Browser in order to generate C-like include files from the HTML files that you have been reading throughout this document.
HTML to LaTeX Conversion
The HTMLToTeX stream is a very simple HTML to LaTeX converter. It is not error free but a start.
HTML to Plain Text Converter
The HTMLToPlain stream takes an HTML object and converts it into plain text using the styles in the "HText API".

Presentation Modules

Generating a Graphic Object

This document describes the methods provided for presenting a graphic object to the user. The implementation in the Library is made for a text oriented browser, so more advanced GUI clients must overwrite some of these modules. See more information on which modules to overwrite. As mentioned in the how to get started guide, the definition of a graphic object is free for the application. Some graphic objects work by storing the whole structure of the document. Others work by converting the nested structure into a linear sequence of styled text for display.

Generally, a new platform has a new implementation of the hypertext object. A GUI client must overwrite the graphic object modules in the Library in order to take advantage of a more advanced user-interface. The graphic object as defined in the Library has two interfaces, depending on how much of the Library code the client wants to handle on its own:

SGML Level
If the client has its own HTML parser then the interface is between the client HTML parser and the Library SGML parser. The SGML parser is a general SGML parser which can be setup with a specific DTD and it feeds the HTML parser with structured data. In this case, you will be emulating the HTML module, and generating a hypertext object from the structured stream. The actual structured stream definition is in the SGML module.
HTML Level
If the client wants to use the HTML parser in the Library then this is the second interface to the Library. The hypertext object is parsed and the communication with the client is based on a set of call-back functions in the HTML parser. The call-back functions are all defined as prototypes in the HText module but the client must provide the actual code that defines the presentation method used for a specific HTML tag. If you wish to maintain the structure of the SGML file within your object, then the SGML interface will be a better place to connect your code.
You are free to define the structure of the hypertext object (the structure is left undefined in the HText module). You may want to define your own styles and font definitions.

Style Sheets

We are currently working on implementing an HTML level 3 parser in the Library, so the following description will soon be out of date.

The Style module in the Library currently only handles a flat style structure with no functionality for nested styles. You don't have to use this module, as you can replace the entry points in the HTML module with your own, to prevent the library version from being loaded.

URI Management

The functionality for handling URIs is placed in the following modules:

Parsing URIs

The HTParse module provides functions for parsing URIs and for simplifying them by removing redundant information. The simplification is automatically called by the Anchor manager every time an anchor is created in order to minimize the number of redundant anchors.

URI Encoding and Decoding

The HTEscape module can search a URI for unsafe characters and escape and unescape them according to the URI Specifications.
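
Escaping replaces each unsafe character with a "%" followed by two hexadecimal digits, and unescaping reverses this. The sketch below illustrates both directions. The set of characters treated as safe is an assumption chosen for the example (which characters are really safe depends on the part of the URI), and the function names are hypothetical rather than the HTEscape interface.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: ASCII letters, digits and "-_.~/" are safe. */
static int sketch_is_safe(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '_' || c == '.' || c == '~' || c == '/';
}

/* Escape `in` into `out`, which holds `out_size` bytes including the NUL.
   Returns the length written, or -1 if `out` is too small. */
long sketch_escape(const char *in, char *out, size_t out_size)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;
    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        if (sketch_is_safe(c)) {
            if (o + 1 >= out_size) return -1;
            out[o++] = (char)c;
        } else {
            if (o + 3 >= out_size) return -1;
            out[o++] = '%';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0f];
        }
    }
    if (o >= out_size) return -1;
    out[o] = '\0';
    return (long)o;
}

static int sketch_hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Unescape `text` in place; a malformed "%" sequence is copied unchanged.
   Returns the new length. */
size_t sketch_unescape(char *text)
{
    size_t in = 0, out = 0;
    assert(text != NULL);
    while (text[in] != '\0') {
        int hi, lo;
        if (text[in] == '%' &&
            (hi = sketch_hex_value(text[in + 1])) >= 0 &&
            (lo = sketch_hex_value(text[in + 2])) >= 0) {
            text[out++] = (char)(hi * 16 + lo);
            in += 3;
        } else {
            text[out++] = text[in++];
        }
    }
    text[out] = '\0';
    return out;
}
```

Unescaping can be done in place because the result is never longer than the input, whereas escaping needs a separate, larger buffer.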

Utilities

This document covers the basic programming utilities that can be used in the client or server to make life easier.

Container Modules

These modules are generic data object storage modules that might be used wherever convenient. The general rule for freeing memory from these modules is that the free methods handle data structures generated within the modules, whereas user data is for the caller to free. The modules consist of:

Binary Trees
This is a complete balanced binary tree that might be used for storing and sorting a large number of data objects, e.g. filenames in directory listings etc.
Dynamic Strings
A Chunk is a block-wise expandable array of type (char *) and is a sort of apology for real strings in C. Chunks make it easier to handle dynamic strings of unknown size, and using them is often faster than using the String Copy Routines.
Linked Lists
This module provides the functionality for managing a generic list of data objects. The module is implemented as a singly linked list using a first in - last out (FILO) scheme.
Association Lists
This is a small module built on top of HTList that provides an easy way to store Name-Value pairs.
Strings
Routines for dynamic arrays of characters include string copy, case insensitive comparison etc. It also contains functions for generating date and time stamps, MessageID etc.
Atoms
Atoms are strings which are given representative pointer values so that they can be stored more efficiently, and comparisons for equality done more efficiently. The pointer values are in fact entries into a hash table.
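
The idea behind atoms can be sketched as a small hash table that hands out one unique entry per distinct string, so that two strings can afterwards be compared by comparing pointers. The names SketchAtom and sketch_atom_for, the hash function and the table size are assumptions for this example and are not the Library's atom module.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BUCKETS 64

typedef struct SketchAtom {
    struct SketchAtom *next;  /* next atom in the same hash bucket */
    char *name;               /* private copy of the string */
} SketchAtom;

static SketchAtom *sketch_table[SKETCH_BUCKETS];

static unsigned sketch_hash(const char *s)
{
    unsigned h = 0;
    while (*s != '\0') h = h * 31u + (unsigned char)*s++;
    return h % SKETCH_BUCKETS;
}

/* Return the unique atom for `name`, creating it on first use.
   Returns NULL if memory runs out. Atoms are never freed in this sketch. */
const SketchAtom *sketch_atom_for(const char *name)
{
    unsigned bucket;
    SketchAtom *atom;
    size_t len;

    assert(name != NULL);
    bucket = sketch_hash(name);
    for (atom = sketch_table[bucket]; atom != NULL; atom = atom->next) {
        if (strcmp(atom->name, name) == 0) return atom;
    }

    atom = malloc(sizeof *atom);
    if (atom == NULL) return NULL;
    len = strlen(name) + 1;
    atom->name = malloc(len);
    if (atom->name == NULL) {
        free(atom);
        return NULL;
    }
    memcpy(atom->name, name, len);
    atom->next = sketch_table[bucket];
    sketch_table[bucket] = atom;
    return atom;
}
```

Once two media types have been turned into atoms, checking whether they are equal is a single pointer comparison instead of a string comparison.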

Basic Utilities

Look into these modules before you start defining system dependent stuff. Most things are already defined here! The list of basic utility modules is currently as follows:
System specifics
The tcp.h file includes system-specific include files and flags for I/O to network and disk. The only reason for this file is that the Internet world is more complicated than Posix and ANSI.
Platform Independent macros
The HTUtil.h include file contains things we need everywhere, generally macros for declarations, booleans, etc.


Henrik Frystyk, libwww@w3.org, November 1995