Chapter 6: Metadata

Introduction

This chapter defines the term metadata. Metadata is central to the research, and its definition and operation are essential to understanding the thesis. Emerging mapping technologies receive extensive consideration because the evolving Web is composed of increasingly small (atomic) components, and the discovery and use of these components is essential to the AccessForAll metadata approach at the core of the research. There are a number of ways to build a metadata profile of a resource and, as the technology involved is the very technology to be exploited by the research, some of the possibilities, such as Topic Maps and the Resource Description Framework [RDF], are included in this chapter. The discussion of the evolving Web in terms of versions 1, 2 and now 3 is also extended.

As a leading agency in the development of metadata and metadata practices, the Dublin Core Metadata Initiative [DCMI] is frequently cited in this and following chapters. DCMI was, until recently, a project of the world's biggest library cataloguing organisation, the Online Computer Library Center [OCLC]; the name comes from the location of the OCLC offices in Dublin, Ohio, USA. DCMI has been lightly staffed and relies mostly on volunteer support. The author has been involved in the administration and substantive work of DCMI for most of its history.

The following diagrams (Table 2) offer a metaphoric view of the process known in computer science as the 'abstraction' of metadata from the data itself. Objects are replaced by representations, which are themselves iteratively replaced by further representations. For this to be possible, metadata needs to be what is known as a first-class object, not just data. Table 2 aims to promote a somewhat 'intuitive' model of metadata that is elaborated in this chapter.

Image: drawers
In the home, we put our clothes away and remember which drawer holds what and assume that, if we're not wearing the clothes, they will be in the drawers or in the wash. We know which drawer to go to for our socks.

Image: drawers
In the office, we put documents in files in drawers and number them so we can look up the number, or name, and find the file and thus the document.

Image: drawers
In the digital world, we have invisible digital objects so we write labels for them and look through the labels to find the object we want.

Image: drawers
If we label our digital objects in the same way, even using the same grammar, we can attach a lot of different labels to the same object and still find what we want.

Image: drawers
If we have rules for organising the labels, we can use the labels to sort and organise the objects.

Image: drawers
Then we can connect objects to each other by referring to the labels, even without looking at the objects themselves.

Table 2: A progressive set of images showing how (RDF or other) tagging of content can be used to separate content from tags and then the tags themselves can be tagged, or sorted in multiple ways.

Definitions of metadata

Before anyone started to work on metadata (as it is now known) by extending the idea of library catalogues into the Web world, some metadata was already being developed for the Web. This was known as the Platform for Internet Content Selection [PICS], designed to enable users to select resources according to criteria of their own choosing. PICS enabled 'pull' as opposed to 'push' publication, relying on the flexibility of applications to manipulate content. To make this work, a series of digits was used to indicate the user's choice for each of a set of criteria. A number was embedded in the resource and, when the resource was received at the user's end, the browser could determine from that number whether the resource was to be displayed. A number such as 13271 meant that the user wanted value 1 for the first criterion, 3 for the next, 2 for the next and so on. Users could thus select on a set of criteria so long as there were no more than 10 values for any one criterion, as shown in Figure 29.

Figure 29: insert image of PICS showing how one 'number' referred to ten sets of choices...
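The digit-per-criterion scheme described above can be sketched in a few lines of code. This is an illustrative reconstruction, not the actual PICS syntax: the criterion names and the filtering policy are invented examples, assumed only to show why each criterion is limited to 10 values (one decimal digit).

```python
# Hypothetical sketch of the PICS-style encoding described above.
# Each digit position corresponds to one rating criterion, and the
# digit itself is the chosen value, so no criterion can have more
# than 10 values (0-9). The criterion names are invented examples.

CRITERIA = ["violence", "language", "nudity", "education", "advertising"]

def decode_rating(code: str) -> dict:
    """Map each digit of the code to its criterion, in order."""
    return {name: int(digit) for name, digit in zip(CRITERIA, code)}

def acceptable(code: str, limits: dict) -> bool:
    """A browser-side check: display the resource only if every rated
    value falls within the user's chosen limit for that criterion."""
    rating = decode_rating(code)
    return all(rating[c] <= limit for c, limit in limits.items())

# The number 13271 from the text: value 1 for the first criterion,
# 3 for the next, 2 for the next, and so on.
rating = decode_rating("13271")
```

A browser holding the user's limits could then call `acceptable` on the number embedded in each incoming resource and suppress any resource that exceeds a limit.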

At the time of the PICS development, Tim Berners-Lee, the inventor of the Web, was writing his set of axioms for the architecture of the Web, a document that has since evolved. It stated:

Metadata is machine understandable information for the web (W3C Metadata Activity).

The phrase "machine understandable" is key.  We are talking here about information which software agents can use in order to make life easier for us, ensure we obey our principles, the law, check that we can trust what we are doing, and make everything work more smoothly and rapidly. Metadata has well defined semantics and structure.

Metadata was called "Metadata" because it started life, and is currently still chiefly, information about web resources, so data about data.  In the future, when the metadata languages and engines are more developed, it should also form a strong basis for a web of machine understandable information about anything: about the people, things, concepts and ideas.  We keep this fact in our minds in the design, even though the first step is to make a system for information about information. (Berners-Lee, 1997)

The current version of the Berners-Lee document says:

    1. Metadata is data
    2. Metadata may refer to any resource which has a URI
    3. Metadata may be stored in any resource no matter to which resource it refers
    4. Metadata can be regarded as a set of assertions, each assertion being about a resource (A u1 ...).
    5. Assertions which state a named relationship between two resources are known as links (A u1 u2)
    6. Assertion types (including link relationships) should be first class objects in the sense that they should be able to be defined in addressable resources and referred to by the address of that resource A in { u }
    7. The development of new assertion types and link relationships should be done in a consistent manner so that these sort of assertions can be treated generically by people and by software. (Berners-Lee, 1997)

Berners-Lee provides a theoretical approach to metadata that is essential to any technology. He goes on to explain how the logical operations that computers perform well can be enabled by correctly formed metadata.

The Dublin Core Metadata Initiative's description in plain English points out that:

Metadata has been with us since the first librarian made a list of the items on a shelf of handwritten scrolls. The term "meta" comes from a Greek word that denotes "alongside, with, after, next." More recent Latin and English usage would employ "meta" to denote something transcendental, or beyond nature. Metadata, then, can be thought of as data about other data. It is the Internet-age term for information that librarians traditionally have put into catalogs, and it most commonly refers to descriptive information about Web resources.

A metadata record consists of a set of attributes, or elements, necessary to describe the resource in question. For example, a metadata system common in libraries -- the library catalog -- contains a set of metadata records with elements that describe a book or other library item: author, title, date of creation or publication, subject coverage, and the call number specifying location of the item on the shelf.

The linkage between a metadata record and the resource it describes may take one of two forms:
1. elements may be contained in a record separate from the item, as in the case of the library's catalog record; or
2. the metadata may be embedded in the resource itself.

Examples of embedded metadata that is carried along with the resource itself include the Cataloging In Publication (CIP) data printed on the verso of a book's title page; or the TEI header in an electronic text. Many metadata standards in use today, including the Dublin Core standard, do not prescribe either type of linkage, leaving the decision to each particular implementation. (DCMI Usage Guide, 2005)

The (draft) guidelines for the use of the forthcoming AGLS metadata standard for Australia restate the DCMI definition (section 1.3, "AGLS Metadata Standard Part 2: Usage Guide", draft, not yet publicly available)

and continue:

The properties in the sets of DCMI and AGLS Metadata Terms form the current AGLS Metadata Standard. AGLS can be used for describing both online (ie web pages or other networked resources) and offline resources (eg books, museum objects, paintings, paper files etc). AGLS is intended to describe more than information resources – it is also designed to describe services and organisations. (section 1.4, "AGLS Metadata Standard Part 2: Usage Guide", draft, not yet publicly available)

In describing the Content Standard for Digital Geospatial Metadata, the Clinton administration's Federal Geographic Data Committee (FGDC) said:

The objectives of the standard are to provide a common set of terminology and definitions for the documentation of digital geospatial data. The standard establishes the names of data elements and compound elements (groups of data elements) to be used for these purposes, the definitions of these compound elements and data elements, and information about the values that are to be provided for the data elements. (FGDC, 1998)

They go on to add:

The standard was developed from the perspective of defining the information required by a prospective user to determine the availability of a set of geospatial data, to determine the fitness [of] the set of geospatial data for an intended use, to determine the means of accessing the set of geospatial data, and to successfully transfer the set of geospatial data. ... The standard does not specify the means by which this information is organized in a computer system or in a data transfer, nor the means by which this information is transmitted, communicated, or presented to the user. (FGDC, 1998)

There are many definitions of metadata but generally they share two characteristics: they concern "a common set of terminology and definitions" and they have a shared structure for that language. Although metadata is analogous to catalogue and other filing descriptions, the name usually indicates that it is recorded and used electronically.

One difficulty in the use of the term is that it is, correctly, a plural noun; as the plural is awkward and rarely observed, it will herein be treated as a singular noun, following common practice (Buca, 1998).

Another difficulty arises from the frequency with which the word 'mapping' is used. It is used to denote the relating of one metadata scheme to another. It is also used in the expression 'metadata application profile' (MAP), where it means a particular set of metadata rules and, more specifically as used by the DCMI, a set of metadata rules that combines rules from other sets.

Yet another difficulty is a quality of good metadata: one person's metadata can be another's data. The characteristic of metadata being referred to here is its 'first class' nature: any metadata can be either the data about some other data or itself the subject of other metadata. This is exemplified by the work of the Open Archives Initiative [OAI], which developed a standard for describing metadata so that it can be 'harvested'.

In a very early presentation about the Web (Table 3), Berners-Lee (1994) briefly showed his vision for the role of metadata:

Slide: web
This is the classic picture of hypertext which used to be a part of talks in the days when we had to explain what hypertext is. The web is a set of nodes and links.

Slide: web-level
To a user, this has become an exciting world, but there is very little machine-readable information there. The meaning of the documents is clear to those with a grasp of (normally) English, and the significance of the links is only evident from the context around the anchor.

To a computer, then, the web is a flat, boring world devoid of meaning.

This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them.

Slide: author level
For example, a document might describe a person. The title document to a house describes a house and also the ownership relation with a person.

Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values. Only when we have this extra level of semantics will we be able to use computer power to help us exploit the information to a greater extent than our own reading.

An important effect of developing security protocols on the web is the abstract space of web information is linked to reality. By taking verifiable responsibility for web statements, a party guarantees an isomorphism between the web and reality.

Slide: semantic level
This means that machines, as well as operating on the web information, can do real things. For example, a program could search for a house and negotiate transfer of ownership of the house to a new owner. The land registry guarantees that the title actually represents reality.

Table 3: Berners-Lee - The Need for Semantics in the Web (1994)

In fact, although it is often hoped that metadata will be human-readable, the more useful it becomes to computers, the more it seems to become unreadable to humans. In large part this is because it is encoded in languages that require the reader to distinguish the encoding from the metadata itself, but it is also perhaps an artifact of how it is presented. (This problem is usually overcome by applications that present the content of the metadata file in human-readable form, just as word processors show only the 'text' of their files.)

Atlases are useful collections of maps, traditionally collected from a range of cartographers (Ashdowne et al, 2000). Such a collection makes more sense, and is more useful, if the conventions for representation used in each map are the same. The way of writing metadata descriptions and terms should be defined in an open way so they can be interpreted by machines and people.

In the research, metadata is used to denote structured descriptions of resources that are organised in a common way and use a common language.

When collecting descriptive metadata for discovery, one usually has a database or repository, and specifications for the structure of the data to be stored in that repository, which make it possible to 'publish' the data in a consistent way. Sharing metadata between repositories requires that all the metadata have the same structure, but those who develop metadata tend to want idiosyncratic structures that make it most useful for their local purposes. Local specificity and global share-ability (interoperability) are thus competing interests. Sharing the metadata means that more people can use it, whereas local specificity makes it more valuable in the immediate context, where it is usually engaged with more frequently, and where the cost is often borne.

metadata axes
Figure 30: simple/complex; global/local

One of the features of good metadata is that it is suitable for use in a simple way but that it can handle complexity, or extension. Another is that it operates broadly on the dimension of locally-specific to globally-interoperable (Figure 30).

The Dublin Core Metadata Element Set (more recently known as the DC Terms) provides an excellent example of how this might be achieved. It is a formal definition of the way in which descriptive information about a resource can be organised. It has a core set of elements that have been found to be extremely useful in describing almost every type of resource on the Web. Elements can be qualified in various ways for greater precision. In addition, selected elements can be combined with others in what is called an 'application profile' to create a new set for a given purpose. 'Dublin Core' metadata conforms to the formal Dublin Core definition of metadata, although the only constraint on the number of elements used is that there must be a unique identifier for the resource being described. DC metadata can be expressed in a range of computer languages and used with descriptions in any natural (or other) language.
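The shape of a simple DC description can be sketched as follows. The element names (identifier, title, creator, date) are genuine DC elements; the record values and URI are invented examples, and the "DC.element" meta-tag style follows the common convention for embedding DC metadata in HTML pages.

```python
# A minimal Dublin Core description sketched as a Python mapping.
# Only the identifier is strictly required; other elements are
# optional and repeatable (creator appears twice here).

record = {
    "identifier": "http://example.org/report.html",  # hypothetical URI
    "title": "Annual Accessibility Report",
    "creator": ["J. Smith", "L. Jones"],              # repeated element
    "date": "2008-05-01",
}

def to_html_meta(rec: dict) -> str:
    """Serialise a DC record as HTML <meta> tags, one tag per value,
    using the conventional 'DC.' name prefix for embedded metadata."""
    lines = []
    for element, value in rec.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            lines.append(f'<meta name="DC.{element}" content="{v}">')
    return "\n".join(lines)

html = to_html_meta(record)
```

The same record could equally be serialised in RDF or another language, which is the point of keeping the description separate from any one encoding.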

Formal Definition of DC Metadata

Originally, DC metadata was used in HTML tags in simply encoded resources. The choice of meaning for so-called core elements was, to a certain extent, arbitrary, based on a pragmatic approach to the high cost of quality metadata and the experience of cataloguers in the bibliographic world. Some of the definitions were arrived at as a sort of compromise, however, and they were fairly loosely defined, even where some experienced cataloguers knew there were problems being hidden within the definitions. These early definitions can be thought of as terms, arbitrarily chosen, to classify descriptions. Over the last decade, such definitions have been revisited numerous times as the community has learned about interoperability. Today, it is recognised that terms alone are not enough: a structure defining those terms and their relationships, with each other and with other terms, is necessary for interoperability. The DC metadata terms and supporting documentation have, therefore, been slowly improved, always with care to ensure that changes will not alienate existing systems.

Currently, the DC terms are defined as follows:

Each term is specified with the following minimal set of attributes:

Name: The unique token assigned to the term.
URI: The Uniform Resource Identifier used to uniquely identify a term.
Label: The human-readable label assigned to the term.
Definition: A statement that represents the concept and essential nature of the term.
Type of Term: The type of term, such as Element or Encoding Scheme, as described in the DCMI Grammatical Principles.
Status: Status assigned to term by the DCMI Usage Board, as described in the DCMI Usage Board Process.
Date issued: Date on which a term was first declared.

Where applicable, the following attributes provide additional information about a term:

Comment: Additional information about the term or its application.
See: A link to authoritative documentation.
References: A citation or URL of a resource referenced in the Definition or Comment.
Refines: A reference to a term refined by an Element Refinement.
Qualifies: A reference to a term qualified by an Encoding Scheme.
Broader Than: A reference from a more general to a more specific Vocabulary Term.
Narrower Than: A reference from a more specific to a more general Vocabulary Term. (DC Terms)
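The attribute set quoted above amounts to a record type, and modelling it that way makes the distinction between the minimal and the optional attributes concrete. In this sketch the structure follows the quoted list; the example instance uses the real DC 'audience' term and its purl.org URI, but the attribute values given for it are abbreviated paraphrases, not the authoritative registry entries.

```python
from dataclasses import dataclass
from typing import Optional

# A sketch of the DC term attribute set quoted above: the first seven
# attributes are the minimal set, the rest apply only where relevant.

@dataclass
class DCTerm:
    name: str                        # unique token assigned to the term
    uri: str                         # URI that uniquely identifies the term
    label: str                       # human-readable label
    definition: str
    type_of_term: str                # e.g. "Element" or "Encoding Scheme"
    status: str                      # assigned by the DCMI Usage Board
    date_issued: str                 # date the term was first declared
    comment: Optional[str] = None    # optional: additional information
    refines: Optional[str] = None    # optional: term refined by this one
    qualifies: Optional[str] = None  # optional: term qualified by this one

# Example instance: the real 'audience' term, with paraphrased values.
audience = DCTerm(
    name="audience",
    uri="http://purl.org/dc/terms/audience",
    label="Audience",
    definition="A class of entity for whom the resource is intended.",
    type_of_term="Element",
    status="Recommended",
    date_issued="2001-05-21",
)
```

Because every term carries its own URI, a description can reference terms unambiguously regardless of which schema declared them, which is what the abstract model discussed below relies on.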

Defining terms in this way is made sufficient by the adoption of a model into which this information fits. This model is known as an abstract model, following the computer science way of describing such a model. It might also be thought of as the meta-schema for DC metadata: not the schema itself, but how the DC schemas operate.

A Formal statement of the Grammar of DC Metadata

Despite the aim of strict adherence to the original definitions of the DC terms, it became difficult to deal with the many attempts to expand, qualify and otherwise change them. Doggedly sticking to the original documentation, without further explanation or improved interoperability, threatened the utility of DC metadata as the technology developed. In 2000, Thomas Baker described the grammar of the then DCMES in an attempt to make clear how it manages the extensibility of elements (Figure 31).

Baker slide of DCMES grammar
Figure 31: DC metadata as grammar (1) (Baker, 2000)

with an example (Figure 32).

baker slide of examples of grammar
Figure 32: DC metadata as grammar (2) (Baker, 2000)

Application Profiles

In 1999, a meeting about how to use DC metadata in educational portals was convened at Kattamingga in Australia (by the author, as part of the work to develop the metadata for Victoria's education portal [Victorian Education Channel]). The meeting was attended by some of the leading cataloguers of educational Web resources at the time, including Stuart Sutton and Nancy Morgan from the University of Washington's GEM Project; Jon Mason from EducationAu, responsible for the Australian federal education network [EdNA]; Robyn White from the New Zealand educational cataloguing project; and one of the two directors of the Dublin Core Metadata Initiative, Stuart Weibel, from the research arm of the Online Computer Library Center [OCLC]. At this meeting, educationalists discussed the suitability of the then-current set of DC terms for the description of learning resources. The international group agreed that there were extra elements they wanted to use and that, if there were a way of 'regularising' these, interoperability between educational catalogues (repositories) would be improved.

Ad hoc rules for extensions and alterations of terms were suggested on the spot by the Director of the DCMI, Weibel, who said that any qualifications should:

In addition, there was the idea that certain communities would find particular terms useful and that the DCMI should provide for their inclusion, perhaps as a second layer of terms for use by specific communities. Significantly, this was the first formal application profile. An application profile was understood to be a metadata profile, conformant to DC principles, but suited to the needs of the local or domain-specific community using it. The development led to the formation of working groups for communities of interest within the DCMI structure; the Education Working Group was followed by others, such as the Government Working Group, which likewise developed an application profile. Many years later, the term 'audience' (originally suggested at the Kattamingga meeting) was added to the core set of DC terms [DCMI Terms]. (For sentimental reasons, perhaps, the core is still usually referred to as having 15 elements despite the addition of the audience element.)

In 2000, Rachel Heery and others (Heery et al, 2000) wrote what has become a seminal article on application profiles, which are now established within DC practice. The essence of an application profile is that it allows for the mixing of metadata terms from different schemas; the constraint is that it should not itself define new metadata terms but must derive them from existing schemas. When this is not possible because the community in fact wants a new term, the community defines that term in a new namespace and then refers to it alongside the other terms used in the application profile.
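The 'mix but do not mint' rule can be sketched as follows. The three DC term URIs are real; the local namespace, the 'pedagogy' term and the validation policy are hypothetical, invented only to show a community-defined term living outside the profile that references it.

```python
# Sketch of an application profile as a selection of terms drawn from
# existing namespaces (after Heery et al). The profile defines no new
# terms itself: the community's extra term is declared in its own
# (hypothetical) namespace and merely referenced here.

DC = "http://purl.org/dc/terms/"
LOCAL = "http://example.org/edu-terms/"   # hypothetical local namespace

PROFILE = {
    "title":    DC + "title",
    "creator":  DC + "creator",
    "audience": DC + "audience",
    "pedagogy": LOCAL + "pedagogy",       # community-defined term
}

def undeclared(record: dict) -> list:
    """Report properties used in a record that the profile does not
    declare; a conformant record uses only declared properties."""
    return [prop for prop in record if prop not in PROFILE]

errors = undeclared({"title": "Fractions 101", "genre": "worksheet"})
# 'genre' is not declared by this profile, so it is reported.
```

Because every property resolves to a URI in some namespace, two repositories using different profiles can still recognise the terms they share.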

The DCMI glossary of 2006 offered the following:

application profile
In DCMI usage, an application profile is a declaration of the metadata terms an organization, information resource, application, or user community uses in its metadata. In a broader sense, it includes the set of metadata elements, policies, and guidelines defined for a particular application or implementation. The elements may be from one or more element sets, thus allowing a given application to meet its functional requirements by using metadata elements from several element sets including locally defined sets. For example, a given application might choose a specific subset of the Dublin Core elements that meets its needs, or may include elements from the Dublin Core, another element set, and several locally defined elements, all combined in a single schema. An application profile is not considered complete without documentation that defines the policies and best practices appropriate to the application. (DCMI Glossary-A, 2006)

Dublin Core Abstract Model

In an attempt to further clarify the Dublin Core approach to metadata, the DCMI Architecture Working Group published two diagrams and some descriptions of them in March 2005. Version 1.0 of what is known as the Abstract Model [DCMI AM] emerged after six months of interaction and consideration by that Working Group in an open forum.

It should be noted that its authors, Powell et al., stated that "the UML modeling used here shows the abstract model but is not intended to form a suitable basis for the development of DCMI software applications". Elsewhere in the same documentation, however, software developers were explicitly stated to be one of the three target audiences for the DCAM, the other two being developers of syntax encoding guidelines and of application profiles.

That Abstract Model was a substantial step towards making it easier for implementers to model the DC metadata but it still did not solve all the problems. In 2006, a funded effort to provide an abstract model was commissioned by the DCMI. This produced a more precise, formal graphical representation (Figures 33 - 35).

DC resource model
Figure 33: DCMI Resource Model (Powell et al, 2007)
DCMI Description model
Figure 34: DCMI Description Set Model (Powell et al, 2007)
DCMI Vocabulary model
Figure 35: DCMI Vocabulary Model (Powell et al, 2007)

The diagrams, however, still did not adhere to the strict rules of the Unified Modeling Language (UML) for such diagrams and so were not as easy to interpret as had been hoped. Several papers were presented at the DC 2006 Conference (Palacios et al, 2006; Pulis & Nevile, 2006) in which the authors argued for a still better representation in strict UML form, pointing to a number of inconsistencies in the then-current version. A new version was commissioned in 2007.

Having more precisely defined models enables profile developers to be more certain about what they need to do. This is important and the lack of a clear model, to a large extent, explains many of the difficulties faced in the accessibility metadata work prior to its release.

In 2008, Nilsson (2008) presented once more on the use of metadata, showing how difficult it is to match metadata from different models and arguing that metadata should at least be developed using compatible models. This led the author to a new image to explain DC metadata (Nevile, 2008), showing the relationship between the general Semantic Web model, championed by Berners-Lee, and the current DC model. The DC model suggested an octopus, as shown in Figure 36.

The octopus model facilitates discussion of many of the complex features of the DC abstract model in a non-technical context (Figure 36).

octopus-annotated
Figure 36: The octopus metaphor (octopus image from http://www.sportsbettingandcasino.com/files/2008/05/octopus.jpg)

The octopus metaphor works well for considering many of the major rules of DC metadata. The number of suckers might indicate the level of refinement of terms, and the number of legs relate to the terms, traditional DC terms or otherwise, that are used. The terms might be considered the dimensions of description, of which DC traditionally had 15, but it added 'audience' and now 'accessibility' to make 17.

DC as a mapping language for resources

DC metadata can be seen as providing a multi-level mapping of the characteristics of Web resources:

  1. elements
  2. qualifiers
  3. values

with a facility for application profiles that contain combinations of these.

As some might see it, DC provides infinitely extensible, n-level mapping of resources in n dimensions.

In general, the maps of metadata are not read so much as used in the discovery or identification process. But mapping in this sense is analogous to mapping as we commonly think of it in the cartographic sense. There are rules for the co-ordinates (descriptions) of resources, and there are structural rules, known in the information world as taxonomies, that act as topologies. The browse structure of a Web site allows one to zoom in and out on details, and analogues of map intersections and location finders are common.

As the research is grounded in Dublin Core metadata, the preceding explanations have been provided to explain the context for the work. The following sections provide additional descriptions that are considered relevant to the research and its understanding.

Metadata tags

In Web 2.0, resources are distributed and combined in many ways at the instigation of both the publisher and the user (Chapter 2). It is not possible to limit the ways in which this will be done and it is not yet clear how to 'freeze' or later reconstruct any given instantiation of a resource. Arguably, Web 3.0 will be a Web in which this is done by machines (Garshol, 2004).

There is another aspect of Web 2.0 that is relevant to the work in accessibility. Social interaction on the Web is being generated in many cases by what is known as 'tagging' of resources. These resources are often very small, atomic, objects such as an image, or a small piece of text, or a sound file. The increasing availability of atomic objects, or objects in what is becoming known as micro-formats, is expected to increase the accessibility of the Web (Chapter 11).

Usually, words used as values in informal tags are not organised in traditional formal thesauri, as they are in more structured metadata, but instead inform what are called folksonomies. These have very different characteristics from the more traditional library subject terms and generally are not structured; that is, users typically add tags with subject, author, format, etc., all mixed in together. This is not inevitable, however, and some users are precise in their use of tags, even encoding them to relate to standard DC Terms, for example (Johnston, 2006).
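The gap between flat tags and structured metadata can be sketched with the 'machine tag' convention (namespace:property=value) that some precise taggers adopt. The convention itself is real, but the parsing policy and the example tag strings below are invented illustrations, not any particular site's scheme.

```python
# Sketch: recovering structure from tags. A 'machine tag' such as
# 'dc:creator=trant' carries a namespace, a property and a value;
# a plain folksonomy tag carries none, so here it is treated as a
# bare subject keyword (an assumed policy, not a standard).

def parse_tag(tag: str):
    """Split 'ns:property=value' into its three parts; fall back to
    (None, 'subject', tag) for an unstructured tag."""
    if ":" in tag and "=" in tag:
        ns, rest = tag.split(":", 1)
        prop, value = rest.split("=", 1)
        return (ns, prop, value)
    return (None, "subject", tag)

tags = ["octopus", "dc:creator=trant", "dc:subject=museums"]
parsed = [parse_tag(t) for t in tags]
```

A tag repository applying such a rule could fold the structured minority of tags into standard metadata records while still keeping the unstructured majority as subject keywords.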

'Tagging' has become a feature of what many people think of as Web 2.0, the social information space in which users contribute to content. This is often done simply by adding 'tags', freely chosen labels, to others' content. For example, a user may visit a site and then send a tag referring to that site to a tag repository organised by a service such as del.icio.us or digg. Typically the values of such tags are chosen freely by the user, so they may vary enormously for a single concept, and the concepts associated with a given tag often vary just as widely. In 2006, the STEVE Museum's Jennifer Trant (2006) reported that museum visitors who viewed paintings on a site were prone to submit one tag but a completely different one when they re-visited the same painting remotely or searched for it via a digital image.

In response to the increased use of tags on sites, the author started a community within the Dublin Core Metadata Initiative concerned with the relationship between standard metadata and tagging (DC Social Tagging). It is not yet known whether tagging is merely a fashion or here to stay as a robust way of getting user-generated metadata, but it is of interest to see how users use words, which might help in the selection of terms for standard thesauri. It is also hoped that the energy available for tagging in the wider community can be harnessed in the future to provide much-needed accessibility metadata.

The simplicity with which tags can be associated with content, and simultaneously find their way into a metadata repository, suggests that this might provide a way to capture metadata for accessibility, particularly for popular sites with many visitors. The energy that is apparently available for the tagging process is also of interest: can it be harnessed to produce accessibility metadata about resources?

While atomic objects have been on the Web since the beginning, in general they have been published within composite resources where the components have not been separately identified and they have rarely been described in metadata. The move is towards what is known as microformats:

a set of simple open data format standards that many (including Technorati) are actively developing and implementing for more/better structured blogging and web microcontent publishing in general. (Microformats, 2007)

Associated with this move is the practice of many Web users of entering information stores through 'back doors'. So many people use Google or its equivalents to find what they want and then 'click' their way into the middle of Web sites that the time has come to think seriously about the role of Web sites. Blogs and wikis as publishing models are increasingly becoming the source of information for many people. The challenge for information managers is to know how to present the available content.

Representations of DC metadata

Metadata has been shown to operate at three levels in the Singapore Framework: elements, qualifiers and vocabularies (Nilsson, 2007). The idea is that for the vocabularies to be shared, their infrastructure must be shared. The Singapore Framework provides a model for this that allows for the use of different vocabulary terms to interoperate and for the extension of any set of terms without loss of interoperability. The rules for the definition, identification and use of elements, qualifiers and vocabularies, ensure such interoperability. They do this in several ways including by requiring that all three levels of definition are referenced in a consistent way.
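The requirement that all three levels of definition be referenced in a consistent way can be illustrated with a small sketch. This is not DCMI software: the dictionaries, the example.org URIs and the `describe` helper are hypothetical, invented for illustration, although the `purl.org/dc` namespace URIs are real DCMI namespaces. The point is simply that every level of definition is identified by a URI, which is what lets independently authored descriptions interoperate.

```python
# Illustrative sketch of the Singapore Framework's three levels, each
# identified by a URI. The example.org URIs are hypothetical.
element = {
    "level": "element",
    "uri": "http://purl.org/dc/elements/1.1/subject",   # a real DC element URI
    "label": "Subject",
}
qualifier = {
    "level": "qualifier",                 # refines the meaning of an element
    "uri": "http://purl.org/dc/terms/LCSH",
    "refines": element["uri"],
    "label": "LCSH encoding scheme",
}
vocabulary_term = {
    "level": "vocabulary",                # a value drawn from a shared vocabulary
    "uri": "http://example.org/vocab/metadata",          # hypothetical term URI
    "label": "Metadata",
}

def describe(resource_uri, element, value):
    """Build one statement; every part of the definition is URI-referenced."""
    return {"resource": resource_uri, "property": element["uri"], "value": value}

record = describe("http://example.org/doc1", element, vocabulary_term["uri"])
print(record)
```

Because the element, the qualifier and the vocabulary term each carry their own URI, a second application can extend any of the three sets without breaking records built against the original ones.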

The constraints in the Singapore Framework mean that increasingly complex descriptions can be developed without loss of interoperability. One can usefully work with the Baker 'grammar' (Baker, 2000) to interpret Dublin Core metadata. This is a 'texty' way of working with metadata, very well suited to computer processing based on simple logic, including that of the Semantic Web.

The Singapore Framework provides a tight definition of DC metadata. Humans, working in libraries in the past, have been less constrained in some ways: they have been able to use human reasoning and decision-making when all else has failed. Computers cannot do this. A prime example of how humans have simplified their work has been the adoption of the Dewey Decimal Classification [DDC] system for the arrangement of books in libraries. The numerical system of subject classification allows every book to have a place in a library, but it does not cope with the fact that some books belong to several subjects, so where a book is placed is often, in fact, quite arbitrary.

Digital resources do not have to be placed in a unique location. In fact, their location is irrelevant so long as it is clearly identified by a URI. It is not necessary to have a hierarchical or linear organisational structure for digital resources. While this is an advantage in some cases, it does mean that the presentation of resources can be achieved in a number of ways. The focus for information managers, then, is on how to make available, to represent, information about the available resources.

In what follows, some of the new ways of representing the organisation of resources are explained. But before this happens, it is necessary to gain clarity about the terms thesaurus, taxonomy and ontology.

Taxonomies and ontologies

In some ways, the three terms thesaurus, taxonomy and ontology have assumed greater prominence in the everyday world in the last decade. Information management is a major challenge for humans, and so having ways to work with information has become more relevant to a wider sector of the community than previously.

Alan Gilchrist explains:

Wittgenstein said something to the effect that if you wanted to know the meaning of a word, you should look to see how it is used. Looking at the applications of thesauri, taxonomies and ontologies it is easy to see a progression of ideas that has resulted in some overlapping of detail. Clearly, this potential for confusion has been exacerbated by the presence of different players. In broad terms, one may say that the post-Roget thesaurus has been the domain of information scientists; taxonomies appear to have been generated by a combination of information technologists and systems developers in corporate business together with software vendors; and ontologies have been adapted from the work of philosophers by people working in artificial intelligence (and there are cases of these two disciplines working together on ontologies). (Gilchrist, 2003)

Gilchrist points to the Oxford Dictionary definition and continues:

Thesaurus
A “treasury” or “storehouse” of knowledge, as a dictionary, encyclopaedia or the like. A collection of concepts or words arranged according to sense; also a dictionary of synonyms and antonyms.

Taxonomy
Classification, esp. in relation to its general laws or principles; that department of science, or of a particular science or subject, which consists in or relates to classification; especially the systemic classification of living organisms.

Ontology
The science or study of being; that department of metaphysics which relates to the being or essence of things, or to being in the abstract.

These three definitions will be familiar to many, but all three words have now been appropriated and subverted by contemporary workers in the information sciences. In the process, there tends to be, at times, significant overlap between, and even contradiction in, the three words as they are currently used. It would not be sensible to pontificate on the “correct” meanings of these words, but in trying to delineate the central characteristics of these three terminological constructs, it is hoped that the reader will gain a clearer understanding of their differences and similarities, as well as how the three might begin to be used more closely together. (Gilchrist, 2003)

Organization schemes like ontologies are conceptual; they reflect the ways we think. To convert these conceptual schemes into a format that a software application can process we need more concrete representations... (Lombardi, 2003)

For the purposes of the research, the exact definitions of these terms are not necessary. The term 'ontology' will be used to represent the organisational structure of terms and concepts. The language of the Singapore Framework (elements, qualifiers, vocabularies) is used where possible. Following Hunter (2003), 'higher-level ontologies' will be used to mean vocabularies, in the Singapore Framework sense, and 'ontologies' to mean elements and their qualifications, now known officially by DCMI as qualified elements. The idea behind all these terms is of interest: they are all terms for describing the management of terms and their meanings. With the increasingly graphical representation of metadata, including tags, graphical representations of metadata, or metadata maps, are starting to emerge. These can take a range of forms, as shown below.

Other representations of metadata

Tag clouds

Tag metadata is described on the Microformats Web site as follows:

rel="tag" hyperlinks are intended to be visible links on pages and posts. This is in stark contrast to meta keywords (which were invisible and typically never revealed to readers), and thus is at least somewhat more resilient to the problems which plagued meta keywords.
Making tag hyperlinks visible has the additional benefit of making it more obvious to readers if a page is abusing tag links, and thus providing more peer pressure for better behavior. It also makes it more obvious to authors, who may not always be aware what invisible metadata is being generated on their behalf. (Microformats-2)
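Because rel="tag" links are ordinary, visible hyperlinks, they can be harvested with very little code. The sketch below uses Python's standard `html.parser` module to collect the targets of tag links; the sample HTML fragment is invented for illustration.

```python
from html.parser import HTMLParser

class TagLinkExtractor(HTMLParser):
    """Collect the hrefs of hyperlinks marked with rel="tag" (the microformat)."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # rel may hold several space-separated values, e.g. rel="tag nofollow"
        if tag == "a" and "tag" in a.get("rel", "").split():
            self.tags.append(a.get("href"))

sample = ('<p><a rel="tag" href="/tags/accessibility">accessibility</a> '
          '<a href="/about">about</a></p>')
parser = TagLinkExtractor()
parser.feed(sample)
print(parser.tags)  # ['/tags/accessibility']
```

A crawler built along these lines could feed the collected tags straight into a metadata repository of the kind discussed above.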

Tags are presented to the users in a variety of ways including in tag piles (or clouds) as shown in Figure 37:

tag-cloud
Figure 37: A tag cloud (Library Thing)

Tag clouds have no standard structure (see Figure 37 above). They tend to appear simply as piles of words. They can be organised in a variety of ways such as alphabetically or temporally, and may have more popular terms displayed in larger font than less popular ones. The tags being organised may relate to formal ontologies, but do not necessarily do so.
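A tag cloud of this kind can be derived mechanically from tag frequencies. The sketch below (with invented tag submissions) sorts the tags alphabetically and scales font sizes linearly between the least and most popular terms; this linear scaling is one common approach among many, not a standard.

```python
from collections import Counter

# Hypothetical tag submissions from visitors to a single resource.
submissions = ["art", "painting", "art", "impressionism", "art", "painting"]

counts = Counter(submissions)

def font_size(count, min_px=10, max_px=32):
    """Scale a tag's font size linearly between the least and most popular tag."""
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        return max_px
    return min_px + (count - lo) * (max_px - min_px) // (hi - lo)

# Alphabetical presentation, more popular terms in larger type.
cloud = {tag: font_size(n) for tag, n in sorted(counts.items())}
print(cloud)  # {'art': 32, 'impressionism': 10, 'painting': 21}
```

Temporal ordering, or any other arrangement, is just a different sort key over the same counts.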

Other systems use graphical representation to show relationships between terms used, displaying the underlying structure in hierarchical, or other, maps. Sometimes this is done explicitly, as in the case of the subject terms used in the Dewey Decimal Classification [DDC] and Topic Maps, for example, or implicitly, as with the DC terms, where an abstract model is completed for any set of actual terms.

Topic maps

Lars Marius Garshol describes several types of content organising schemes, starting with a model that is implied by the collections themselves:

Data Model - A description of data that consists of all entities represented in a data structure or database and the relationships that exist among them. It is more concrete than an ontology but more abstract than a database dictionary (the physical representation).

Resource Description Framework (RDF) - a W3C standard XML framework for describing and interchanging metadata. The simple format of resources, properties, and statements allows RDF to describe robust metadata, such as ontological structures. As opposed to Topic Maps, RDF is more decentralized because the XML is usually stored along with the resources.

Topic Maps - An ISO standard for describing knowledge structures and associating them with information resources. The topics, associations, and occurrences that comprise topic maps allow them to describe complex structures such as ontologies. They are usually implemented using XML (XML Topic Maps, or XTM). As opposed to RDF, Topic Maps are more centralized because all information is contained in the map rather than associated with the resources. (Garshol, 2002)

Garshol writes:

When XML is introduced into an organization it is usually used for one of two purposes: either to structure the organization's documents or to make that organization's applications talk to other applications. These are both useful ways of using XML, but they will not help anyone find the information they are looking for. What changes with the introduction of XML is that the document processes become more controllable and can be automated to a greater degree than before, while applications can now communicate internally and externally. But the big picture, something that collects the key concepts in the organization's information and ties it all together, is nowhere to be found.

topic map diagram
[Figure 38: Topic maps]

This is where topic maps come in. With topic maps you create an index of information which resides outside that information, as shown in the diagram above. The topic map (the cloud at the top) describes the information in the documents (the little rectangles) and the databases (the little "cans") by linking into them using URIs (the lines).

The topic map takes the key concepts described in the databases and documents and relates them together independently of what is said about them in the information being indexed. ...

The result is an information structure that breaks out of the traditional hierarchical straightjacket that we have gotten used to squeezing our information into. A topic map usually contains several overlapping hierarchies which are rich with semantic cross-links like "Part X is critical to procedure V." This makes information much easier to find because you no longer act as the designers expected you to; there are multiple redundant navigation paths that will lead you to the same answer. You can even use searches to jump to a good starting point for navigation. (Garshol, 2002)

Topic maps need not be just for describing the content of the resource, such as the subject of the resource. They could be used to describe the accessibility characteristics of that content.
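As a sketch of how that might look, the structure below imitates the topic map pattern of topics, associations and occurrences in miniature. It is not the XTM syntax, and the topic names, URIs and the 'accessibility' occurrence type are all hypothetical; the last occurrence records an accessibility characteristic rather than a subject.

```python
# A minimal, hypothetical topic-map-like structure (not full XTM):
# topics, associations between topics, and occurrences linking topics
# to resources by URI.
topics = {"lecture-1", "transcript", "captioned-video"}

associations = [
    ("transcript", "is-alternative-to", "lecture-1"),
]

occurrences = [
    ("lecture-1", "video", "http://example.org/lectures/1.mp4"),
    ("transcript", "text", "http://example.org/lectures/1.txt"),
    # An occurrence describing accessibility characteristics of the content.
    ("lecture-1", "accessibility", "captions:none; audioDescription:none"),
]

def occurrences_of(topic):
    """Return every (type, resource) pair linked to a topic."""
    return [(kind, uri) for t, kind, uri in occurrences if t == topic]

print(occurrences_of("lecture-1"))
```

A navigation tool could then surface the accessibility occurrence alongside the subject occurrences, exactly as it surfaces any other link in the map.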

Faceted classification, according to Garshol, was first developed by S.R. Ranganathan in the 1930s,

and works by identifying a number of facets into which the terms are divided. The facets can be thought of as different axes along which documents can be classified, and each facet contains a number of terms. How the terms within each facet are described varies, though in general a thesaurus-like structure is used, and usually a term is only allowed to belong to a single facet ...

In faceted classification the idea is to classify documents by picking one term from each facet to describe the document along all the different axes. This would then describe the document from many different perspectives. (Garshol, 2004)

In Ranganathan's case, he picked five axes. There has been significant work on faceted classification and recently it has been demonstrated to be a powerful and useful way to use metadata. Again, this technology could be used to present accessible versions of resources to different communities of users.
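The pick-one-term-per-facet idea can be sketched directly. The facets and terms below are invented for illustration; a real scheme would draw its terms from controlled vocabularies.

```python
# Hypothetical facets: each axis holds its own terms, and a document is
# classified by picking one term from each facet.
facets = {
    "subject":  {"metadata", "accessibility", "cartography"},
    "format":   {"text", "video", "audio"},
    "audience": {"researcher", "student", "general"},
}

def classify(**choices):
    """Validate one term per facet and return the faceted classification."""
    for facet, term in choices.items():
        if term not in facets.get(facet, set()):
            raise ValueError(f"{term!r} is not a term in facet {facet!r}")
    return choices

doc = classify(subject="accessibility", format="text", audience="researcher")
print(doc)
```

Searching along any one axis (say, every document whose `format` is "text") then retrieves the same document from a different perspective, which is the point of the technique.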

Garshol shows that topic maps have a very rich structure for information about an object that is also quite likely to be interoperable. As his example, he gives the image in Figure 39.

a topic map example
colour legend
Identifier Meaning
round discs names of topics
arrows associations
aqua occurrences of a topic
blue vocabulary languages
red query languages
green markup languages
yellow these show the association types
Note: different colours could be read as scope of topic names, or type of topic
Figure 39: Topics maps as an ontology framework

Ontopia's Omnigator is a tool that allows the user to click on any topic name and have it become the 'centre of the universe' with its connections surrounding it. This makes interactive navigation around the graphical maps very simple and intuitive, and seamless across topic maps encoded differently [Ontopia]. The same idea could be used to group resources with particular accessibility characteristics.

Resource Description Framework

The Resource Description Framework (RDF), referred to earlier in connection with Berners-Lee's early interest in the Semantic Web (1994), provides a very flexible way of mapping resources. RDF requires the description of properties of resources to be strictly in the form:

resource ----- relationship ----- property

or

subject ---- predicate ---- object

as in

http://dublincore.org ---- has title ---- Dublin Core Metadata Initiative

The theory is that if all the properties are so described, it will be easy to make logical connections between them. Currently, RDF is implemented in XML, as that is the language of most common use today, but the framework is independent of the encoding. RDF maps, like other good metadata systems, are interoperable and extensible. Figures 40 and 41 show how RDF maps interoperate.

rdf-1 and rdf-2
Figure 40: Two fragments of the Semantic Web
rdf-3
Figure 41: The two map fragments in Figure 40 as combined simply by overlaying the matching entities Q-colour-code #r23g67b98i to form the greater map (Nevile & Lissonnet, 2003).
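The overlaying of matching entities can be mimicked with plain triples: because each entity is identified by a URI, merging two descriptions amounts to taking the union of two sets of statements. The fragments below are hypothetical; the 'maintains' predicate and the titles are invented for illustration.

```python
# Two hypothetical fragments of a Semantic Web graph, each a set of
# (subject, predicate, object) triples.
fragment_a = {
    ("http://dublincore.org", "has title", "Dublin Core Metadata Initiative"),
    ("http://dublincore.org", "maintains", "http://purl.org/dc/terms/"),
}
fragment_b = {
    ("http://dublincore.org", "has title", "Dublin Core Metadata Initiative"),
    ("http://purl.org/dc/terms/", "has title", "DCMI Metadata Terms"),
}

# Set union: statements about the same URI overlay automatically, and the
# duplicated statement collapses into one.
merged = fragment_a | fragment_b
print(len(merged))  # 3 distinct statements remain
```

The two fragments share a statement about `http://dublincore.org`, so the merged map contains three statements rather than four: the graphs have been joined at the matching entity.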

DC metadata and the Semantic Web

The earliest work on the Semantic Web involved a proposal from Tim Berners-Lee that contained an image showing how it might integrate a number of information management activities (Figure 42) (Brickley, 1999).

sweb original
Figure 42: Tim Berners-Lee's original Semantic Web diagram (Brickley, 1999)

While the original vision is not yet realised, the similarity between it and the claims for Web 3.0 is evident. If such a general information system can be realised, it will provide a context for other systems, such as that of DC metadata. The 'octopus' representation of DC metadata shows how DC metadata fits within the Semantic Web model, sharing many characteristics but with additional constraints. When the constraints are relaxed, there is no distinction between a DC record and other Semantic Web data. The constraints make DC metadata easier to use in the traditional ways catalogue records have been used. The two images in Figure 43 show a typical Semantic Web model and a DC-constrained model, both as graphical maps (Nevile, 2008).

part1
part2
Figure 43: the Sweb and the DC models

The DC metadata, once placed within the wider context of the Semantic Web, can have additional arrows added and so DC (meta)data can be transformed, without loss, into Semantic Web data. This supports perfect interoperability.

Chapter Summary

It may be, as some would suggest (Vickery, 2008), that the most important thing in the Web today is the facility to find things. Metadata, of one sort or another, is essential to this process and hence its significance in Web research and development worldwide. It is not a new topic, but it is attracting unprecedented attention, and the technical complexity of it has grown significantly.

In this chapter, there has been extensive discussion of metadata. First, the meaning and use of the term was clarified. Then the characteristics of metadata were considered. The research is about how metadata should be developed and specified, and it is fundamental to such work to know what is considered to be metadata, and how it is described, if it or its use is to be evaluated. The Dublin Core definition rules and model of metadata were then described. Dublin Core metadata, while not perfect, is the most precisely defined general metadata and the foundation for many metadata systems, such as metadata harvesting standards. Representation of metadata can take many forms, and some of these were described in the chapter before a quick reference to cross-walking between several metadata schemas. The following chapters draw upon the content of this chapter. Chapter 11 continues discussion of the issues relating to interoperability.

In the next chapter, there is a discussion of a yet more specific kind of metadata: accessibility metadata. Accessibility metadata is at the core of the research.
