WNILS Working Group Chris Weider INTERNET-DRAFT Merit Network, Inc. Jim Fullton CNIDR Simon Spero UNC-Chapel Hill October, 1993 Architecture of the Whois++ Index Service Status of this memo: The authors describe an architecture for indexing in distributed databases, and apply this to the WHOIS++ protocol. This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts. Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress." Please check the I-D abstract listing contained in each Internet Draft directory to learn the current status of this or any other Internet Draft. This Internet Draft expires April 26, 1994. 1. Purpose: The WHOIS++ directory service [Deutsch, et al, 1992] is intended to provide a simple, extensible directory service predicated on a template-based information model and a flexible query language. This document describes a general architecture designed for indexing distributed databases, and then applys that architecture to link together many of these WHOIS++ servers into a distributed, searchable wide area directory service. 2. Scope: This document details a distributed, easily maintained architecture for providing a unified index to a large number of distributed WHOIS++ servers. This architecture can be used with systems other than WHOIS++ to provide a distributed directory service which is also searchable. 3. Motivation and Introduction: It seems clear that with the vast amount of directory information potentially available on the Internet, it is simply not feasible to build a centralized directory to serve all this information. If we are to distribute the directory service, the easiest (although not necessarily the best) way of building the directory service is to build a hierarchy of directory information collection agents. In this architecture, a directory query is delivered to a certain agent in the tree, and then handed up or down, as appropriate, so that the query is delivered to the agent which holds the information which fills the query. This approach has been tried before, most notably in some implementations of the X.500 standard. However, there are number of major flaws with the approach as it has been taken. This new Index Service is designed to fix these flaws. WNILS Working Group Whois++ Index Service Weider, et al. 3.1. The search problem One of the primary assumptions made by recent implementations of distributed directory services is that every entry resides in some location in a hierarch- ical name space. While this arrangement is ideal for reading the entry once one knows its location, it is not as good when one is searching for the location in the namespace of those entries which meet some set of criteria. If the only criteria we know about a desired entry are items which do not appear in the namespace, we are forced to do a global query. Whenever we issue a global query (at the root of the namespace), or a query at the top of a given subtree in the namespace, that query is replicated to _all_ subtrees of the starting point. The replication of the query to all subtrees is not necessarily a problem; queries are cheap. However, every server to which the query has been replicated must process that query, even if it has no entries which match the specified criteria. This part of the global query processing is quite expensive. A poorly designed namespace or a thin namespace can cause the vast majority of queries to be replicated globally, but a very broad namespace can cause its own navigation problems. Because of these problems, search has been turned off at high levels of the X.500 namespace. 3.2. The location problem With global search turned off, one must know in advance how the name space is laid out so that one can guide a query to a proper location. Also, the layout of the namespace then becomes critical to a user's ability to find the desired information. Thus there are endless battles about how to lay out the name space to best serve a given set of users, and enormous headaches whenever it becomes apparent that the current namespace is unsuited to the current usages and must be changed (as recently happened in X.500). Also, assuming one does impose multiple hierarchies on the entries through use of the namespace, the mechanisms to maintain these multiple hierarchies in X.500 do not exist yet, and it is possible to move entries out from under their pointers. Also, there is as yet no agreement on how the X.500 namespace should look even for the White Pages types of information that is currently installed in the X.500 pilot project. 3.3. The Yellow Pages problem Current implementations of this hierarchical architecture have also been unsuited to solving the Yellow Pages problem; that is, the problem of easily and flexibly building special-purpose directories (say of molecular biologists) and of automatically maintaining these directories once they have been built. In particular, the attributes appropriate to the new directory must be built into the namespace because that is the only way to segregate related entries into a place where they can be found without a global search. Also, there is a classification problem; how does one adequately specify the proper categories so that people other than the creator of the directory can find the correct subtree? Additionally, there is the problem of actually finding the data to put into the subtree; if one must traverse the hierarchy to find the data, we have to look globally for the proper entries. 3.4. Solutions The problems examined in this section can be addressed by a combination of two new techniques: directory meshes and forward knowledge. WNILS Working Group Whois++ Index Service Weider, et al. 4. Directory meshes and forward knowledge We'll hold off for a moment on describing the actual architecture used in our solution to these problems and concentrate on a high level description of what solutions are provided by our conceptual approach. To begin with, although every entry in WHOIS++ does indeed have a unique identifier (resides in a specific location in the namespace) the navigational algorithms to reach a specific entry do not necessarily depend on the identifier the entry has been assigned. The Index Service gets around the namespace and hierarchy problems by creating a directory mesh on top of the entries. Each layer of the mesh has a set of 'forward knowledge' which indicates the contents of the various servers at the next lower layer of the mesh. Thus when a query is received by a server in a given layer of the mesh, it can prune the search tree and hand the query off to only those lower level servers which have indicated that they might be able to answer it. Thus search becomes feasible at all levels of the mesh. In the current version of this architecture, we have chosen a certain set of information to hand up the mesh as forward knowledge. This may or may not be exactly the set of information required to construct a truly searchable directory, but the protocol itself doesn't restrict the types of information which can be handed around. In addition, the protocols designed to maintain the forward knowledge will also work perfectly well to provide replication of servers for redundancy and robustness. In this case, the forward knowledge handed around by the protocols is the entire database of entries held by the replicated server. Another benefit provided by the mesh of index servers is that since the entry identification scheme has been decoupled from the navigation service, multiple hierarchies can be built and easily maintained on top of the existing data. Also, the user does not need to know in advance where in the mesh the entry is contained. Also, the Yellow Pages problem now becomes tractable, as the index servers can pick and choose between information proffered by a given server; because we have an architecture that allows for automatic polling of data, special purpose directories become easy to construct and to maintain. 5. Components of the Index Service: 5.1. WHOIS++ servers The whois++ service is described in [Deutsch, et al, 1992]. As that service specifies only the query language, the information model, and the server responses, whois++ services can be provided by a wide variety of databases and directory services. However, to participate in the Index Service, that underlying database must also be able to generate a 'centroid', or some other type of forward knowledge, for the data it serves. 5.2. Centroids as forward knowledge The centroid of a server is comprised of a list of the templates and attributes used by that server, and a word list for each attribute. The word list for a given attribute contains one occurrence of every word which appears at least once in that attribute in some record in that server's data, and nothing else. WNILS Working Group Whois++ Index Service Weider, et al. For example, if a whois++ server contains exactly three records, as follows: Record 1 Record 2 Template: User Template: User First Name: John First Name: Joe Last Name: Smith Last Name: Smith Favourite Drink: Labatt Beer Favourite Drink: Molson Beer Record 3 Template: Domain Domain Name: foo.edu Contact Name: Mike Foobar the centroid for this server would be Template: User First Name: Joe John Last Name: Smith Favourite Drink: Beer Labatt Molson Template: Domain Domain Name: foo.edu Contact Name: Mike Foobar It is this information which is handed up the tree to provide forward knowledge. As we mention above, this may not turn out to be the ideal solution for forward knowledge, and we suspect that there may be a number of different sets of forward knowledge used in the Index Service. However, the directory architecture is in a very real sense independent of what types of forward knowledge are handed around, and it is entirely possible to build a unified directory which uses many types of forward knowledge. 5.3. Index servers and Index server Architecture A whois++ index server collects and collates the centroids (or other forward knowledge) of either a number of whois++ servers or of a number of other index servers. An index server must be able to generate a centroid for the information it contains. In addition, an index server can index any other server it wishes, which allows one base level server (or index server) to participate in many hierarchies in the directory mesh. 5.3.1. Queries to index servers An index server will take a query in standard whois++ format, search its collections of centroids, determine which servers hold records which may fill that query, and then either a) forward the query to the appropriate servers on behalf of the user, (chaining in the X.500 terminology) or b) notify the user's client of the next servers to contact to submit the query (referral in the X.500 model). 5.3.2. Index server distribution model and centroid propogation The diagram on the next page illustrates how a mesh of index servers might be created for a set of whois++ servers. Although it looks like a hierarchy, the protocols allow (for example) server A to be indexed by both server D and by server H. WNILS Working Group Whois++ Index Service Weider, et al. whois++ index index servers servers servers for for whois++ lower-level servers index servers _______ | | | A |__ |_______| \ _______ \----------| | _______ | D |__ ______ | | /----------|_______| \ | | | B |__/ \----------| | |_______| | F | /----------|______| / _______ _______ / | | | |- | C |--------------| E | |_______| |_______|- \ \ _______ \ ______ | | \----------| | | G |--------------------------------------| H | |_______| |______| Figure 1: Sample layout of the Index Service mesh _______________________________________________________________________________ In the portion of the index tree shown above, whois++ servers A and B hand their centroids up to index server D, whois++ server C hands its centroid up to index server E, and index servers D and E hand their centroids up to index server F. Servers E and G also hand their centroids up to H. The number of levels of index servers, and the number of index servers at each level, will depend on the number of whois++ servers deployed, and the response time of individual layers of the server tree. These numbers will have to be determined in the field. 5.3.3. Centroid propogation and changes to centroids Centroid propogation is initiated by an authenticated POLL command (sec. 5.2). The format of the POLL command allows the poller to request the centroid of any or all templates and attributes held by the polled server. After the polled server has authenticated the poller, it determines which of the requested centroids the poller is allowed to request, and then issues a CENTROID-CHANGES report (sec. 5.3) to transmit the data. When the poller receives the CENTROID-CHANGES report, it can authenticate the pollee to determine whether to add the centroid changes to its data. Additionally, if a given pollee knows what pollers hold centroids from the pollee, it can signal to those pollers the fact that its centroid has changed by issuing a DATA-CHANGED command. The poller can then determine if and when to issue a new POLL request to get the updated information. The DATA-CHANGED command is included in this protocol to allow 'interactive' updating of critical information. WNILS Working Group Whois++ Index Service Weider, et al. 5.3.4. Centroid propogation and mesh traversal When an index server issues a POLL request, it may indicate to the polled server what relationship it has to the polled. This information can be used to help traverse the directory mesh. Only one field is specified in the current proposal to transmit the relationship information, although it is expected that richer relationship information will be shared in future revisions of this protocol. The field used for this information is the X-hierarchy field, and can take on three values. The first is 'topology', which indicates that the indexing server is at a higher level in the network topology (e.g. indexes the whole regional ISP). The second is 'geographical', which indicates that the polling server covers a geographical area subsuming the pollee. The third is 'administrative', which indicates that the indexing server covers an administrative domain subsuming the pollee. 5.3.5. Query handling and passing algorithms When an index server receives a query, it searches its collection of centroids, and determines which servers hold records which may fill that query. As whois++ becomes widely deployed, it is expected that some index servers may specialize in indexing certain whois++ templates or perhaps even certain fields within those templates. If an index server obtains a match with the query _for those template fields and attributes the server indexes_, it is to be considered a match for the purpose of forwarding the query. There are two methods of forwarding a query, called 'chaining' and 'referral'. 5.3.5.1. Query referral Query referral is the process of informing a client which servers to contact next to resolve a query. The syntax for notifying a client is outlined in section 5.5. 5.3.5.2. Query chaining Query chaining is done when the queried index server takes responsibility for resubmitting the query to the appropriate lower servers. The server will then forward the query using the syntax in section 5.4, but then takes no further responsibility for the query. A whois++ query can specify the 'trace' option, which causes each server which receives the query to send its IANA handle and an identification string to the client. 6. Syntax for operations of the Index Service: 6.1. Data changed syntax The data changed template look like this: DATA-CHANGED: Version-number: // version number of index service software, used to insure // compatibility Time-of-latest-centroid-change: // time stamp of latest centroid change, GMT Time-of-message-generation: // time when this message was generated, GMT Server-handle: // IANA unique identifier for this server Best-time-to-poll: // For heavily used servers, this will identify when // the server is likely to be lightly loaded // so that response to the poll will be speedy, GMT Authentication-type: // Type of authentication used by server, or NONE Authentication-data: // data for authentication END DATA-CHANGED // This line must be used to terminate the data changed // message WNILS Working Group Whois++ Index Service Weider, et al. 6.2. Polling syntax POLL: Version-number: // version number of poller's index software, used to // insure compatibility Type-of-poll: // type of forward data requested. CENTROID is the only // one currently defined Start-time: // give me all the centroid changes starting at this time, GMT End-time: // ending at this time, GMT Template: // a standard whois++ template name, or the keyword ALL, for a // full update. Field: // used to limit centroid update information to specific fields, // is either a specific field name, a list of field names, // or the keyword ALL Server-handle: // IANA unique identifier for the polling server. // this handle may optionally be cached by the polled // server to announce future changes X-Hierarchy: // This field indicates the relationship which the poller // bears to the pollee. Typical values might include // 'Topology', 'Geographical", or "Administrative" Authentication-type: // Type of authentication used by poller, or NONE Authentication-data: // Data for authentication END POLL // This line must by used to terminate the poll message 6.3. Centroid change report As the centroid change report contains nested multiply-occuring blocks, each multiply occurring block is surrounded *in this paper* by curly braces '{', '}'. These curly braces are NOT part of the syntax, they are for identification purposes only. The syntax of a Data: item is either a list of words, one word per line, with the syntax: word weight or the keyword: ANY . The weight is not required, but is expected to be used by weighting servers. The keyword ANY as the only item of a Data: list means that any value for this field should be treated as a hit by the indexing server. The field Any-field: needs more explanation than can be given in the body of the syntax description below. It can take two values, True or False. If the value is True, the pollee is indicating that there are fields in this template which are not being exported to the polling server, but wishes to treat as a hit. Thus, when the polling server gets a query which has a term requesting a field not in this list for this template, the polling server will treat that term as a 'hit'. If the value is False, the pollee is indicating that there are no other fields for this template which should be treated as a hit. This field is required because the basic model for the WHOIS++ query syntax requires that the results of each search term be 'and'ed together. This field allows polled servers to export data only for non-sensitive fields, yet still get referrals of queries which contain sensitive terms. CENTROID-CHANGES: Version-number: // version number of pollee's index software, used to // insure compatibility Start-time: // change list starting time, GMT End-time: // change list ending time, GMT Server-handle: // IANA unique identifier of the responding server Case-sensitive: // states whether data is case sensitive or case // insensitive. values are TRUE or FALSE Authentication-type: // Type of authentication used by pollee, or NONE Authentication-data: // Data for authentication Compression-type: // Type of compression used on the data, or NONE Size-of-compressed-data: // size of compressed data if compression is used Operation: // One of 3 keywords: ADD, DELETE, FULL // ADD - add these entries to the centroid for this server // DELETE - delete these entries from the centroid of this // server // FULL - the full centroid as of end-time follows { // The multiply occurring template block starts here BEGIN TEMPLATE Template: // a standard whois++ template name Any-field: // TRUE or FALSE. See beginning of 6.3 for explanation. { // the template contains multiple field blocks BEGIN FIELD Field: // a field name within that template Data: // Either the keyword ANY, or // the word list itself, one per line, cr/lf terminated // each word may be optionally followed by a and a weight. END FIELD } // the field ends with END FIELD END TEMPLATE } // the template block ends with END TEMPLATE END CENTROID-CHANGES // This line must be used to terminate the centroid // change report 6.4. Forwarded query FORWARDED-QUERY: Version-number: // version number of forwarder's index software, used to // insure compatibility Forwarded-From: // IANA unique identifier of the server forwarding query Forwarded-time: // time this query forwarded, GMT (used for debugging) Trace-option: // YES if query has 'trace' option listed, NO if not. // used at message reception time to generate trace // information Query-origination-address: // address of origin of query Body-of-Query: // The original query goes here Authentication-type: // Type of authentication used by queryer Authentication-data: // Data for authentication END FORWARDED-QUERY // This line must be used to terminate the body of the // query 6.5. Query referral SERVERS-TO-ASK: Version-number: // version number of index software, used to insure // compatibility Query-id: // some query identifier so the client knows which query to // issue to the following servers Body-of-Query: // the original query goes here Next-Servers: // A list of servers to ask next, either IP addresses or // hostnames, one per line, cr/lf terminated END SERVERS-TO-ASK 7. References Deutsch, et al. Architecture of the WHOIS++ service. August 1992. Available by anonymous FTP as ucdavis.edu://pub/archive/wnils/Architecture.Overview WNILS Working Group Whois++ Index Service Weider, et al. 8. Author's Addresses Chris Weider clw@merit.edu Industrial Technology Institute, Pod G 2901 Hubbard Rd, Ann Arbor, MI 48105 O: (313) 747-2730 F: (313) 747-3185 Jim Fullton fullton@cnidr.org MCNC Center for Communications Post Office Box 12889 3021 Cornwallis Road Research Triangle Park North Carolina 27709-2889 O: 919-248-1499 F: 919-248-1405 Simon Spero ses@sunsite.unc.edu 310 Wilson Library CB #3460 University of North Carolina Chapel Hill, NC 27599-3460 O: (919) 962-9107 F: (919) 962-5604