WWW specification: understanding the Web (*)

Haim Kilov
Bellcore, MRE-2F049
445 South Street
Morristown, NJ 07960
haim@cc.bellcore.com

WWW has been described in terms of navigation, i.e., nodes, links, and keywords. This description is based on currently existing technology. However, users still complain about getting lost -- and properly so! There exists a need to understand the Web in service-oriented terms, i.e., in terms of intellectual contents of WWW information. Only in this manner will it be possible to provide a better roadmap -- in fact, a concept map -- for WWW. This paper will show how information modeling concepts (rooted in programming methodology concepts) may be used to understand the Web documents.

Abstraction and precision

As the Web is a very good example of an open system, it is possible to use the reference model for Open Distributed Processing [ODP 2] as a framework for understanding. This framework is based on information semantics rather than on syntactic details and describes important concepts essential for formulating, understanding, and implementing a specification.

Information -- intellectual contents of documents -- has to be specified in an abstract manner. Abstraction (suppression of irrelevant details to enhance understanding [ODP 2]) is essential because humans cannot deal with large amounts of "unstructured" information. Therefore, to understand documents, higher-level precisely-defined constructs, such as composition, dependency, reference, and so on, have to be used instead of links [Kilov 94].

Existing approaches to understanding (hypertext) documents have too often been based on existing products. Document users were less than happy with these approaches. There have already been requests and papers on the need for a better framework to deal with documents ("unstructured data"), some using SQL as an example [Eddison 94].
Indeed, SQL provides modeling facilities on a higher level than links and keyword search, but these facilities are still quite rudimentary with respect to information semantics and are based more on technology (relational DBMSs) than on the information viewpoint. The model for and the documentation of a system need not be the same as the model for and the documentation of the enterprise (or an application -- a task --) for which the system has been created [Raven 94]. Although certain object concepts are being incorporated in SQL, there still are problems in reconciling the existing, value-based, SQL framework, with the identity-based object-oriented one. These issues are out of scope of document modeling, but the general need to better understand and model documents is pretty well exemplified by efforts of this kind.

A document should be understood -- and specified! -- using concepts that describe its information contents rather than using concepts that describe existing software tools. More generally, understanding an enterprise and an application is possible only in terms of concepts that describe appropriate business rules [Kilov, Ross 94; ODP 2] rather than in terms of existing systems technology. [As an aside, it has been acknowledged for a while that programming concepts are independent of existing programming languages and software tools [Hehner 93, Dijkstra 76].]

Fortunately, programming methodology concepts -- the ones used to create and understand specifications -- are perfectly applicable to describe an information model as well! This approach may and should be used to better understand and specify both traditional and non-traditional (e.g., Web) documents. There is no need to invent (from scratch) new models for representing and communicating information in documents because generic concepts from existing information models are perfectly reusable.
This permits, in particular, reuse of the same constructs (generic relationships) to understand and specify both an enterprise and a document that describes the enterprise. Obviously, such an approach makes creating and documenting a specification substantially easier. Observe that the specifications are usually declarative, i.e., they define the "what" rather than the "how". In other words, they define the invariants, i.e., the predicates that have to be true for the entire lifetime of the set of objects, as well as the pre- and postconditions for each operation, i.e., the predicates that have to be true for the operation to occur and, correspondingly, the predicates that have to be true immediately after the occurrence of an operation [ODP 2]. These specifications are simple because they do not have to describe any algorithmic details. Business rules may perfectly be defined in this manner [Kilov, Ross 94; ODP 2; Swatman 93; Ainsworth 94; Hayes 93; Trader 94]. This approach reuses concepts well-known in programming methodology [Dijkstra 76, Hehner 93]. A specification has, first and foremost, to be understandable. To create such a specification, i.e., to understand information better, cooperation is needed between subject matter experts and modelers. This is well-known in information modeling. Only a formal specification provides complete understanding because an "ordinary" natural language specification may be redundant or inconsistent due to the properties of a natural language. However, the subject matter expert (SME) does not have to read and understand the formal notation used: formal specifications very helpful for the modelers [Ainsworth 94, Kilov 93] may well be translated into stylized English for the SMEs. Indeed, essential specifications in [Kilov, Ross 94] are formulated both in Object Z and in stylized English, i.e., translated from Object Z. The English specifications are presented to and perfectly understandable by a careful SME. 
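
As a rough illustration of the declarative style described above (an invariant plus pre- and postconditions, with no algorithmic detail), the following Python sketch may help. It is not taken from any of the cited specifications: the class DocumentCollection, its fields, the operation relate, and the particular predicates are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentCollection:
    """Hypothetical model: named fragments plus relationships over them."""

    fragments: set = field(default_factory=set)
    # Each relationship is (relationship type, set of participating fragments).
    relationships: list = field(default_factory=list)

    def invariant(self) -> bool:
        """Predicate that must hold for the whole lifetime of the collection:
        every relationship refers only to fragments that exist."""
        return all(parts <= self.fragments for _, parts in self.relationships)

    def relate(self, rel_type: str, *parts: str) -> None:
        """Operation described by a precondition and a postcondition."""
        participants = frozenset(parts)
        # Precondition: at least two participants, all of them known fragments.
        assert len(participants) >= 2 and participants <= self.fragments
        before = set(self.fragments)
        self.relationships.append((rel_type, participants))
        # Postcondition: the relationship is recorded and no fragment changed.
        assert (rel_type, participants) in self.relationships
        assert self.fragments == before
        # The invariant still holds after the operation.
        assert self.invariant()


collection = DocumentCollection(fragments={"specification", "refinement"})
collection.relate("Ordinary Reference", "specification", "refinement")
```

The assertions state what must be true before and after the operation; how the relationship is stored is incidental and could be changed without touching the specification.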
At least some English specifications in [ODP 2, Trader 94] have also been translated from more formal ones.

"Finding an object": concept maps created by document users

A document user browses and understands the document using its implicit or explicit model provided by the document author. Ideally, a document model (concept map) should be shown explicitly, by presenting appropriate information specifications to the users. Good prefaces to good textbooks widely use constructs like composition, reference, and dependency for this purpose (for example, [Hehner 93] uses reference associations and compositions in a quick tour of his book). The Web provides an excellent opportunity for document authors to specify document models explicitly. However, there exist bad examples as well: when a Web document is presented only as an ordered composition of its pieces to be retrieved one after the other, this specification does not add anything to the intellectual contents of the document, and probably just relates to the document's logical layout.

To understand a document better, its user may want to make notes -- i.e., to create appropriate documents -- himself: "Roland Barthes suggested... that when you read you should be writing yourself, transposing the information from print to your notebook" [Ulmer 88]. To do that, the document user has to find concepts and relationships between them in the original text (a student often implements that by highlighting appropriate fragments of the textbook in different ways). In the same manner, an enterprise modeler together with the SME has to find enterprise objects and relationships between them -- i.e., business rules that are already there! -- to create an information model. Obviously, different users of a document will find different interrelated concepts of interest, in the same manner as different customers of an enterprise will find different interrelated objects of interest.
In many cases, the document model provided by its author is a very good framework for understanding its concepts. This model (concept map) should be accessible to the document user together with the document itself. However, the framework provided by the author may have to be appropriately extended (or restricted) by the document user. Understanding a document by internalizing the concepts of that document is made possible by finding fragments of interest and relationships between them and by writing about (i.e., specifying, in a formal or informal manner) these fragments and relationships (see, e.g., [Ulmer 88, Wegner 94]). Obviously, information modeling experience suggests good ways to formulate these specifications. Future users of a document may understand it better by reusing not only the specification of the document provided by its author, but also specifications of different document viewpoints provided by different earlier users of the document. Thus, user-creatable roadmaps -- in fact, concept maps -- become essential in order to better understand a document. Obviously, these considerations are perfectly applicable to collections of documents. The Web interfaces provide convenient tools to represent such concept maps, but these tools are quite restricted: they usually are applied by document authors rather than users and hardly permit users to mark up document fragments not anticipated by the authors (a fragment may be defined as the -- existing -- unit by which information is picked up (Tim Berners-Lee)); they usually do not distinguish between different types of relationships between these fragments; they usually represent binary relationships; and they do not provide a way to find the relationships in which a particular document fragment participates. 
The latter shortcoming is especially important: the knowledge of relationships that refer to a given document fragment substantially improves understanding both of the fragment and of these relationships.

Again, information modeling provides important reusable concepts [Kilov, Ross 94] -- including a library of generic relationship types -- applicable for creating document concept maps. Application-specific relationships used in these concept maps are created by instantiating these generic relationships for particular documents and their fragments. It is possible to use a graphical or a linear representation of these relationships, so that existing tools used to navigate the Web may also be used to represent document concept maps, i.e., to understand the documents. In this manner, it will be possible to distinguish between different types of relationships between documents; to represent both binary and non-binary relationships; and to provide a way to find the relationships in which a particular document participates.

It is important to notice that a linear representation of a relationship does not have explicit links: a link is a property of a particular (graphical) representation of a relationship rather than of the relationship itself. Let us look at two examples of such a linear representation, for Bellcore legal disclaimers [Kilov 94-1] and for the organization of the family of ODP standards:

    DisclaimerTypes: Exhaustive Subtyping (Disclaimer, {Technical Analysis Disclaimer, Technical Audit Disclaimer, General Disclaimer}),

or

    ODP Refinement: Ordinary Reference (Descriptive Model [ODP 2], Prescriptive Model [ODP 3]).

This consideration alone shows that links are not necessary for understanding a relationship and therefore a hypertext or a Web document. The analogy with goto's or pointers in traditional programming is obvious [Kilov 94].

Document users may want to acquaint themselves with examples of document concept maps before creating their own maps.
Examples of this kind may be provided in the Web. Obviously, reading is easier than writing, and for quite a few document users it will be sufficient to choose among several existing concept maps for the same collection of documents, rather than to create their own maps. In the same manner, but to a larger extent, an implementor of a specification construct in most cases picks and chooses a particular element of an implementation library as a refinement of this construct rather than invents his own refinement [Kilov, Ross 94; Welsh 94].

Several kinds of documents have a well-established information model. Examples of such documents are well-known: databases, spreadsheets, program code, error messages, etc. Fragments of these "structured" documents and relationships between them are quite rigid; a document user often cannot change them or specify new fragments (although in relational databases specification of new relationships between existing fragments is possible). In fact, software development may well be considered as a document-based process [Welsh 93, Welsh 94] which refers to both traditional documents (e.g., requirements) and structured ones (e.g., databases or program code), with the need to specify, clearly and explicitly, relevant fragments of these documents and relationships (e.g., reference, composition, dependency, and so on) between these fragments. The same information modeling concepts may be used to better understand a collection of documents of any kind.

Collective behavior

A document component cannot be understood and used in isolation: it is referred to in the specifications of structural and behavioral properties of one or, more often, several, documents. Moreover, a document is also not isolated: a traditional document has explicit and implicit references to other documents; and, as we see in the Web, there are quite a few explicit links between documents (but not between their components?).
These references and links implement different types of relationships between documents. The concepts used in understanding and modeling a single document, such as precise and abstract specifications of relationships between document fragments, are perfectly (re)usable for understanding and modeling a collection of documents. In particular, a new composite document may be created by reusing fragments of different existing documents and organizing them in an appropriate way. The Web provides an excellent framework for doing so: before creating information, a Web user is encouraged to find and reorganize it.

The concept of specifications referring to collections of objects and, in particular, to collective behavior of these objects, is reasonably well-known. It is encountered both in information modeling [Kilov, Ross 94] and in several standards related to object reference models and open systems [ODP 2, GRM 94, OODBTG 91, FM 94]. Obviously, collective behavior is essential to understand any enterprise, including a collection of interrelated documents.

Viewpoints

The Web, like any other large information system, presents a challenge of specifying and reconciling several different (and possibly time-varying) viewpoints referring to the "same" information. In particular, the information and technology viewpoints should be clearly distinguished. For a document, be it a traditional or a Web document, it means distinguishing between the document's intellectual contents and its (logical and physical) layout.

On the other hand, several different information viewpoints may emphasize different information characteristics of the "same" enterprise. As mentioned earlier, for a document it means distinguishing among different relationships between different document fragments. These fragments and relationships are different because different users are interested in different aspects of a document or a document collection.
In particular, a document reader can create -- and precisely specify -- new fragments and new relationships between fragments of an existing document (without changing its text!). A good example is quoted by [Welsh 94]: in the WEB system for literate programming, it is possible to be interested either in the narrative describing the development steps, or in the consolidated copy of the Pascal program -- the result of this development. Let us consider another example -- the international standard describing the Open Distributed Processing Trading Function [Trader 94]. This standard describes the means to advertise services and to match service offers with service requests. The reader of this document may be interested either in a general overview, or only in aspects related to the service-oriented viewpoints, or only in aspects related to the implementation-related viewpoints, or else in aspects related to formal specifications, and so on. For each of these viewpoints, only certain fragments of [Trader 94], together with certain, viewpoint-specific, relationships between these fragments, will be of interest. There is no need for all of these fragments and relationships to have been perceived as such and therefore to have been marked up by the authors of [Trader 94]. Notice that a good traditional index for a traditional document may provide a reasonable approximation of several concept maps representing several different viewpoints for a document. Obviously, creating a particular viewpoint merges the activities of a traditional document reader, writer, and editor, together with the activities of an information modeler. Providing an explicit specification of a document viewpoint does not differ from providing an explicit specification of any enterprise or application (as P.Wegner noted, there exists a strong analogy between document engineering and software engineering [Wegner 94]). 
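
The idea that several readers can each specify their own fragments and typed relationships over the same unchanged document can be sketched as follows. This is only a minimal illustration under stated assumptions: the names Relationship, ConceptMap, participating, and linear are invented here, and the fragment names used for [Trader 94] are hypothetical labels, not actual section titles of that standard. The relationship type names (Ordinary Reference, Composition) follow the generic relationship types mentioned in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    name: str            # application-specific name, e.g. "ODP Refinement"
    kind: str            # generic relationship type, e.g. "Ordinary Reference"
    participants: tuple  # fragment or document names; may be more than two


class ConceptMap:
    """One viewpoint: typed relationships over fragments of a document.

    The document text itself is never modified; a map only names fragments.
    """

    def __init__(self, owner):
        self.owner = owner
        self.relationships = []

    def add(self, name, kind, *participants):
        self.relationships.append(Relationship(name, kind, tuple(participants)))

    def participating(self, fragment):
        """Find every relationship in which the given fragment participates."""
        return [r for r in self.relationships if fragment in r.participants]

    def linear(self):
        """Render each relationship in a linear form that needs no links."""
        return [f"{r.name}: {r.kind} ({', '.join(r.participants)})"
                for r in self.relationships]


# Two viewpoints on the same material, created by different people.
author = ConceptMap("author")
author.add("ODP Refinement", "Ordinary Reference",
           "Descriptive Model", "Prescriptive Model")

reader = ConceptMap("reader")  # fragment names below are hypothetical
reader.add("Trader parts", "Composition",
           "Trader", "overview", "formal specification")
```

Here a relationship is a named, typed tuple of participants rather than a pair of link endpoints, so non-binary relationships and the "which relationships mention this fragment?" query come for free; a real tool would also need some way to identify fragments inside a document, which this sketch deliberately leaves out.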
As noted earlier, a collection of documents need not be fixed: it may evolve reflecting the evolution of the enterprise described by these documents. Documents describing software development are a well-known example [Welsh 93, Welsh 94]. In these documents, the reference associations between specification fragments and their refinements should be preserved during "maintenance", i.e., updates. The invariant for a reference association states that the properties of the maintained entity should correspond to the properties of its reference entity. In this example, the refinement (code document) is a maintained entity with respect to its reference entity -- its specification, so that properties of the refinement should correspond to the properties of the specification. This invariant should be satisfied all the time, and not just when the refinement is created. In more technical terms, the reference association between the specification and its refinement is an ordinary reference rather than a reference for create. A more detailed description of reference associations with many examples can be found in [Kilov, Ross 94].

Fewer restrictions on documents

The approach of using information concepts rather than computer implementations permits creating semantically rich information models of documents, without artificial restrictions imposed by currently existing products. The same problem (and solution) exists in information modeling for traditional enterprises. Some of these unnecessary restrictions deal with the following:

-- a document is not always a tree. Although the Web does not impose a tree-like document structure, too many documents (both traditional and hypertext ones!) are still presented as trees;

-- composition is encountered more often than subtyping. Dogmatic adherence to the "traditional" object-oriented approach used in programming languages emphasizes subtyping hierarchies and inheritance and de-emphasizes compositions, thus leading to inappropriate information models;

-- relationships are often non-binary. Composition is a typical example [Kilov, Ross 94]: "A composite type corresponds to one or more component types, and a composite instance corresponds to zero or more instances of each component type. There exists at least one resultant property of a composite instance dependent upon the properties of its component instances. There exists also at least one emergent property of a composite instance independent of the properties of its component instances. The sets of application-specific types for the composite and its components should not be equal.";

-- the same component may belong to several different composites (i.e., a composition is not hierarchical).

More generally, the same document fragment may belong to several different relationships, in the same manner as in any other enterprise the same (business) object may belong to several different relationships. This becomes visible in a traditional document when it is marked up by its readers: the same fragment may be marked up using highlighters of different colors, where a highlighter of a particular color is used to specify (the semantics of) a collection of fragments related in a particular way.

Relationships in a document or collection of documents explicitly specify different ways of browsing this document or this collection -- it is well-known that even traditional documents are not read cover to cover. Obviously, the document author cannot and should not anticipate all potential relationships (and even all potential document fragments representing concepts) of interest to the future readers of his document, and therefore the reader of a document should be able to specify ("highlight") these concept maps.
In traditional documents, these implicit specifications have been implicitly tolerated (almost any used textbook from a university library may be used as an example). In Web documents, these (semantic) relationship specifications should become precise and explicit, be defined using concepts rather than links (compare [Murray 93, Kilov 94]), and be explicitly promoted.

Names

Naming within different contexts (including different names for the same thing) should be considered very carefully in dealing with the huge amount of Web documents. In particular, a name should denote a document (or its component) rather than a place where (a pointer to) the document happens to be stored.

Using names to uniquely identify particular documents within WWW is a quite non-trivial task because the Web is a huge open system. [ODP 2] clearly states that naming in an open system is possible only relative to a given naming context. A naming context is defined as a relation between a set of names and a set of entities, whereby the set of names belongs to a single name space. An identifier is defined as an unambiguous name of an entity in such a context. [For an almost trivial example, consider such "unreal" names as section numbers in a document: in the context of the subsequent version of the document, the same sections may well be denoted by different names.] Establishing object equality (and therefore document equality) by comparing their properties (e.g., behaviors) is possible only at some abstraction level whereby details irrelevant for this level are suppressed.

There have been various proposals of establishing document identity within the Web, based, e.g., on reusing the idea of establishing a book identity -- something like an ISBN for every Web document. This may be quite difficult to organize, and may lead to the existence of many names for the same document with respect at least to documents that do not have a well-defined "owner".
The number of such documents may be quite large due to the current proliferation of essentially the same document from various Web sources. Nevertheless, this idea is a good first step as it abstracts away document storage considerations.

The concept of "keyword search" is rather close to the concept of naming. Indeed, a keyword is supposed to denote a concept that is searched for by the Web user. However, it is rather well-known that keyword search is often inadequate, as the following quote from comp.infosystems.www perfectly shows:

    ">Is there something wrong with just using WAIS?

    No, not at all, unless you don't know exactly what you're looking for, exactly what words to search on, and there aren't too many documents that match those words, and you can figure out which sources to search, and there are sources that cover the subject, and there aren't too many sources that cover the subject... and a few other reasons.

    ...most search either return nothing or too much, but rarely the desired answer.

    Nick Arnett
    Multimedia Computing Corp. (strategic consulting)
    Campbell, California"

Standards

Several international standards for information management and open systems (such as [GRM 94, ODP 2]) are applicable to the specification of documents and document collections. In fact, these standards have influenced and have been influenced by information modeling considerations, and have been successfully used for modeling complex enterprises. Specifying enterprises and documents that describe these enterprises by means of the same concepts and using the same standards leads to improved understanding and consistent and readable documents. Obviously, reuse of these standards to specify Web documents is very helpful to both specifiers and users of these documents. In particular, it should be noticed that these standards promote the use of abstract and precise (formal) specifications to better understand and specify information systems in general.
Web documents in particular will be understood substantially better in this manner.

Acknowledgment

Thanks go to Mark Buckley for helpful comments.

References

[Ainsworth 94] M.Ainsworth, A.H.Cruickshank, P.J.L.Wallis, L.J.Groves. Viewpoint specification and Z. Information and Software Technology, Vol. 36 (1994), No. 1, pp. 43-51.

[Dijkstra 76] E.W.Dijkstra. A Discipline of Programming. Englewood Cliffs, NJ: Prentice Hall, 1976.

[Eddison 94] P.Eddison. Adopting the SQL paradigm: text-retrieval solves data access problems. IMC Journal, Vol. 30 (1994), No. 2, pp. 11-13.

[FM 94] ANSI ASC X3H7. Object model features matrix. Document number X3H7-93-007v7. April 1994.

[GRM 94] ISO/IEC JTC1/SC21/WG4. Information Technology - Open Systems Interconnection - Management Information Services - Structure of Management Information - Part 7: General Relationship Model. CD ISO/IEC 10165-7 N 8454. March 30, 1994.

[Hayes 93] I.Hayes. Specification case studies. Second Edition. Prentice-Hall, 1993.

[Hehner 93] E.C.R.Hehner. A practical theory of programming. Springer Verlag, 1993.

[Kilov 93] H.Kilov. Information Modeling and Object Z: Specifying Generic Reusable Associations, in Proceedings of NGITS-93 (Next Generation Information Technology and Systems, Haifa, Israel, June 28-30, 1993), ed. O.Etzion and A.Segev, pp. 182-91.

[Kilov, Ross 94] H.Kilov, J.Ross. Information modeling: an object-oriented approach. Prentice-Hall, 1994.

[Kilov 94-1] H.Kilov. Information modeling: a path to document analysis, in Proceedings of Electronic Document Delivery Conference (EDD'94), pp. 267-280.

[Kilov 94] H.Kilov. On understanding hypertext: are links essential? ACM Software Engineering Notes, Vol. 19, No. 1 (January 1994), p.30.

[Murray 93] P.Murray. Tyrannical links, loose associations, and other difficulties of Hypertext. ACM SIGLINK Newsletter, Vol. 2, No. 1 (March 1993), pp. 10-12.

[ODP 2] ISO/IEC JTC1/SC21/WG7. Basic Reference Model for Open Distributed Processing - Part 2: Descriptive Model. (CD 10746-2, February 1994).

[ODP 3] ISO/IEC JTC1/SC21/WG7. Basic Reference Model for Open Distributed Processing - Part 3: Prescriptive Model. (ISO/IEC JTC1/SC21/WG7 N 7525, December 1992).

[OODBTG 91] Object Data Management Reference Model. (ANSI Accredited Standards Committee. X3, Information Processing Systems.) Document Number OODB 89-01R8. 17 September 1991. (Also in: Computer Standards & Interfaces, Vol. 15 (1993), pp. 124-142.)

[Raven 94] M.E.Raven and R.Thompson. Can principles of object-oriented system documentation be applied to user documentation? * (The journal of computer documentation), Vol. 18 (1994), No. 1, pp. 15-19.

[Swatman 93] P.Swatman. Increasing formality in the specification of high-quality information systems in a commercial context. Department of Computer Science, Curtin University of Technology, Australia, 1993.

[Trader 94] ISO/IEC JTC1/SC21/WG7. Information Technology - Open Distributed Processing - ODP Trading Function. ISO/IEC JTC1/SC21/WG7 N 897, 1994-02-04.

[Ulmer 88] Gregory L.Ulmer. Handbook for a theory hobby. Visible Language, Vol. 22, No. 4 (1988), pp. 399-423.

[Wegner 94] P.Wegner. Course on computer literacy for non-majors. Brown University (Providence, RI), Department of Computer Science, CS-94-21, Draft, May 1, 1994.

[Welsh 93] Jim Welsh and Jun Han. Software documents: concepts and tools. Technical Report No. 93-23, Software Verification Research Centre, The University of Queensland, Australia, 1993.

[Welsh 94] J.Welsh. Software is history! In: A Classical Mind (Essays in Honour of C.A.R.Hoare), ed. by A.W.Roscoe. Prentice-Hall, 1994, pp. 419-429.

(*) Copyright 1994, Bell Communications Research, Inc. (Bellcore).
Permission to use, copy, modify and distribute this material for any lawful purpose and without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies, and that the name of Bellcore not be used in advertising or publicity pertaining to this material without the specific, prior written permission of an authorized representative of Bellcore. BELLCORE MAKES NO REPRESENTATIONS OR WARRANTY, EXPRESS OR IMPLIED, ABOUT THE ACCURACY, SUFFICIENCY, OR SUITABILITY OF THIS MATERIAL FOR ANY PURPOSE. IT IS PROVIDED "AS IS", WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. Bellcore expressly disclaims any liability for any damage or injury incurred by any person arising out of the sufficiency, accuracy, or utility of any information contained herein. Any use of this material is at the sole risk of the user.