Wednesday, June 01, 2005

On the Theory of Library Catalogs and Search Engines

Supplementing the talk on "Principles and Goals of Cataloging", German Librarians' Annual Conference Augsburg 2002.

Nothing is more practical than a good theory. A banal statement, considering that a theory should always enable its users to easily derive the statements they need for practice.
But a theory for catalogs or cataloging? Is that really necessary? A question anyone is likely to ask who has never been confronted with the matter nor considered it with any seriousness.

Using Internet search engines, and knowing their operation is fully automated, people tend to view with skepticism all practical and theoretical effort invested in catalogs. Any good search engine, however, has to be based on a good theory - though that theory may differ quite a bit from a catalog theory.

What do libraries and the Internet have in common?

Both provide access to collections of recordings. One need not use the difficult-to-define concepts of information and knowledge here. We may leave it open whether or not an "information society" exists, or a "knowledge economy", and whether everything that is squeezed between book covers or onto Web pages is information or knowledge. The "Pisa Studies" have reminded everybody that knowledge doesn't come without learning. Possessing printed matter does not mean possessing knowledge: printed text turns into living knowledge only after reading and understanding, and then this knowledge will sit in the head of the reader and not on the paper or screen. Nobody will doubt that ours is a learning society, and recordings of experience and insight are of central importance for learning. One learns from direct interaction between humans, by one's own doing, by observation, or through studying, which mostly consists of taking in what others have recorded.

In many cases, suitable recordings have to be found first. Millions of humans, over millennia, have recorded their experience and encounters, their findings, their insights, and their inspiration. When this started with the Greeks, Plato saw in it a symptom of decline: people wouldn't exercise their memories any more because they would now rely on inferior surrogates. But people did not stop at making use of their own recordings, they started using those of others as well. Collecting began. Libraries were created. After collecting more than a few hundred papers or papyri, a system of ordered shelving had to be invented or else the usefulness of the collection would have suffered.

How did cataloging come about?

Once several thousand items have been collected, their physical arrangement, whatever the system, becomes tedious. One will need finding aids, i.e., secondary recordings (or meta-recordings), which will reveal where in the collection a particular item is located. This is the birth of cataloging: it shifts the process of ordering from the shelf to paper, to files or, nowadays, to databases. Unless one also invents a nice theory along with this, the usefulness of the catalog will diminish with its size rather than increase.

Once one has millions, the assembling of the finding-aids in itself becomes quite a considerable effort. No wonder there are attempts at automating the process, at least for collections that exist in digital formats. The metaphoric term "search engine" suggests, misleadingly, that a machine peruses the documents as such, focusing on their content. The actual searching is, however, always performed on surrogate files the system constructs for this purpose. Software can only match character strings, not concepts or ideas. This cannot be done in just some arbitrary way, but there has to be a systematic way, an algorithm, which means a theory.
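The point that searching runs on a constructed surrogate file, and that software only matches character strings, can be made concrete with a minimal sketch of an inverted file. The sample documents and the function names (`tokenize`, `build_inverted_file`, `search`) are invented for illustration and do not describe how any particular search engine works.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into plain word strings."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_file(documents):
    """Build the surrogate file: each word maps to the ids of documents containing it."""
    inverted = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            inverted[word].add(doc_id)
    return inverted

def search(inverted, query):
    """Return ids of documents that contain every query word (AND search)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(inverted.get(words[0], set()))
    for word in words[1:]:
        result &= inverted.get(word, set())
    return result

# Hypothetical sample documents, invented for this sketch.
documents = {
    1: "Principles and goals of cataloging",
    2: "A theory of search engines",
    3: "Catalog theory and the card catalog",
}
inverted = build_inverted_file(documents)

print(search(inverted, "theory catalog"))  # {3}
print(search(inverted, "catalogs"))        # set(): the string "catalogs" occurs nowhere
```

The second query fails although two documents are clearly about catalogs: the program compares strings, and "catalogs" matches neither "catalog" nor "cataloging".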

Contents of libraries and Internet

Combine libraries, archives, and the Internet, and they comprise nothing less than the accumulated intellectual and artistic recordings of humankind, inasmuch as these survive, from all periods, all countries and cultures, in all languages and scripts and about all subjects, by all individuals who ever wished to make a contribution. The size and the complexity of this is staggering. It is naive to expect that navigating this multidimensional universe might be easy or might be made a simple matter. One may try to simplify the description of the world, but the world itself will not become any simpler that way. Note that the initial enthusiasm of the metadata movements has softened a bit...
A catalog attempts to help with finding documents and with orientation among documents, and internet search engines strive to do the same. The question is: in what way, with what principles and methods, by what theory can or should they work in order to help the most people in the largest number of cases in the best possible way. No one single method can serve all purposes and all searchers all the time - everybody will know this who has tried to find anything on more than one occasion.

Books or Internet - a matter of taste?

There is not really an either-or situation. Only the combined contents of both worlds constitute the complete universe of recorded knowledge and achievement. Library catalogs on the Internet do not change this, be they as comfortable as they may, because catalogs carry only descriptions, not the publications themselves, which exist only on paper or in microform. To digitize all of these and make them full-text searchable is presently utopian: there are many millions of texts and new ones continue to be produced by the tens of thousands per year, a great many of which are not in machine-readable formats, Google's efforts notwithstanding. Catalogs contain only very brief and standardized descriptions of the documents, whereas for internet content full text is the norm. But there the diversity is enormous, and most documents lack a standardized description (a.k.a. "metadata"). From this it follows there will be a number of differences between catalogs and internet search engines. In libraries, we not only have to understand that but we should also be able to transfer this knowledge to our readers.
Further down we make an attempt to juxtapose catalogs and search engines in a table.

First, however, let us look at catalogs as such, and at the difference between the contemporary device, the OPAC, and the card catalog (now gathering dust, if not discarded). We also have to ask what consequences should be envisioned for cataloging rules. It goes without saying that the OPAC is here to stay and that card catalogs are history, but one may still learn from a comparison.
For readers who want more detail, there's the introductory chapter of an outstanding book: Martha M Yee's and Sara Shatford Layne's "Improving Online Public Access Catalogs" (ALA, 1998. - ISBN 0-8389-0730-X).

What is the principal problem today in searching?
The true problem with OPACs is no longer, as it was for cards, that users have difficulty just finding something at all. Instead, for most queries the OPAC does bring up some results - then, however, there is no easy way of knowing if this is all there is and whether the best-suited items have been brought up at all. Users, in other words, cannot know whether they have missed something, or a lot, or even very important things. Their awareness of this is generally low, and catalog use studies have shown that it is very difficult to entice users into making several different attempts - or briefly, to set them thinking. Their confidence in technology, in other words, is improperly high. Most of them use just the standard or default options and make barely more than one attempt. This is probably based on an overall least-effort tendency, or on the unexamined assumption that what's offered as default is also the best possible way and others are inferior. The catalog itself cannot overcome this. The catalog may be as good as it gets, that's not the point. Users have to think and judge for themselves, today as much as yesterday, and this is not going to change with any new generation of technology. And they should even be happy about this, for otherwise they themselves might be replaced by machines... Be that as it may: there certainly have to be easy ways of searching for simple questions, but ambitious and knowledgeable users should also be provided with sophisticated techniques and invited to use them.

What is a good catalog?

From all we know, we may characterize it like this (formulated originally by LC's Thomas Mann, as quoted in M. Yee's book):
  • Reliability: Starting from a citation, one should be able to ascertain quickly and with certainty if the item is in the collection or not. (In an unreliable catalog, esp. the latter may require many trials before one can be sure. Catalogs need to have that feature because of acquisitions checking, for example, or to find out whether an ILL order is necessary.)
  • Serendipity: Browsing functions are essential, firstly because one doesn't always have precise search criteria, and secondly because chance findings are sometimes valuable. That's one reason why users tend to go to the stacks first when they know the arrangement. Catalogs should therefore make related materials browsable - the question of course being, what exactly is "related"? OPACs can, for example, support browsing in these ways:
    1. provide alphabetical indexes of names, terms, titles etc., browsable up and down,
    2. present result sets in more than one arrangement for the user to choose, and
    3. make related publications accessible via hyperlinks (for subject terms, classification codes, names).
  • Depth: This covers two aspects that are not exactly part of cataloging:
    1. a policy saying what materials or objects are subject to cataloging. Classically, these are books, meaning self-contained knowledge packages. More often than not, however, a book consists of several or even many packages of recorded knowledge, each of which represents a unit that might become the subject of a bibliographic record itself - because someone might well be searching for it. Just think of proceedings volumes or festschriften, not to mention periodicals. With the exception of belles-lettres, readers will, in many cases, be interested in, and thus actually be looking for, a chapter or part of a book rather than the whole volume. If cataloging restricts itself to title page information, the catalog will be completely oblivious to all the constituent parts of books. For economic reasons (labor, space), not many libraries have ever done chapter-level cataloging. One important case is that of "multipart publications" with individually titled volumes: are these to be cataloged as a whole or each volume separately - or both? The focus of European cataloging seems to have been heavily on the parts, whereas American catalogers have more often only perceived the whole.
    2. a concept for subject indexing. Is it enough to assign a few subject terms and/or classification symbols to a document to nail down its content matter, or should the aim be to index every subject that is actually dealt with in some part of the publication? There are experiments, for example, with tables of contents of books (as in OhioLink). There are also experiments with automatic assignment of additional terms or notations by software.
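The first browsing technique listed under "Serendipity", an alphabetical index that can be browsed up and down, might be sketched as follows. The index entries and the function name `browse` are assumptions made for this illustration, and the sketch does not reproduce the behavior of any particular OPAC.

```python
import bisect

def browse(index, term, before=2, after=2):
    """Return the entries around the position where `term` would file in a sorted index."""
    pos = bisect.bisect_left(index, term)
    return index[max(0, pos - before): pos + after + 1]

# Hypothetical name index, kept in sorted order like a filed card sequence.
name_index = sorted([
    "chekhov, anton",
    "clemens, samuel langhorne",
    "layne, sara shatford",
    "panizzi, antonio",
    "tchaikovsky, peter",
    "twain, mark",
    "yee, martha m",
])

# The user is unsure of the spelling and enters only the beginning of the name.
for entry in browse(name_index, "tchaik"):
    print(entry)
```

Because the user lands in the neighborhood of the term rather than on an exact match, a partly known spelling is enough, and the neighboring entries are what make chance findings possible.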


From one dimension to many

The most decisive difference between conventional and online catalogs is this:
(We are not talking about technical differences here, like availability around the clock around the globe, just catalog theory!)

Card Catalog: a linear sequence of entries, i.e., a one-dimensional space, the ordering principle being the alphabet on the lowest level, names/titles/subjects on an upper level. Some libraries had several catalogs for two or more time periods or for otherwise defined parts of their holdings. Every document can be represented by more than one card in several places of the sequence, one of these being called the "main entry". It served two purposes. Firstly, that of collocating related works in one place (like an author's works under an established form of his/her name). Its second and probably more important function was to provide a predictable location for the item in the catalog: if one knew the principle, one was able to find with certainty what one was looking for in just one attempt. Practicability limited the number of cards per item to an average of well below ten. There are many conceivable ways of arranging a card catalog, and in particular, of determining the entries to be represented in it. The pattern, once chosen, has to be followed consistently in order for the catalog to be reliable. Therefore, a card catalog is the utmost extreme of pre-coordination. Very elaborate rules had proved to be necessary in order to establish the pre-coordination.

OPAC: in principle, it contains an unordered mass of structured records. Software, however, can easily produce a dozen or more different indexes, each being a linear, sorted sequence of certain parts of the records. Logically, these are still quite like card sequences, but then software, processing a user query, can extract arbitrary subsequences and merge or intersect them with subsequences from one or more of the other indexes, yielding subsets of the database which can then be presented in one or more different meaningful arrangements. Criteria like names, titles, numbers, subject terms etc. may thus be combined in all conceivable ways. Indexes are thus like the axes of a multi-dimensional space in which software enables the user to navigate. Multi-dimensional spaces are abstract, mathematical entities and are therefore hard for many users to comprehend. In contrast to card catalogs, OPACs thus rely heavily on post-coordination.
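Post-coordination, as just described, can be sketched in a few lines: one index per criterion, with the record sets drawn from several indexes intersected at query time. The records, the field names and the functions `build_indexes` and `query` are hypothetical and serve only to illustrate the principle.

```python
from collections import defaultdict

# Hypothetical structured records; the field names are assumptions of this sketch.
records = {
    1: {"author": "twain, mark", "title": "the adventures of tom sawyer", "subject": "boys"},
    2: {"author": "twain, mark", "title": "life on the mississippi", "subject": "mississippi river"},
    3: {"author": "yee, martha m", "title": "improving online public access catalogs",
        "subject": "online catalogs"},
}

def build_indexes(records):
    """Build one index per criterion, each mapping a key to a set of record ids."""
    indexes = {
        "author": defaultdict(set),
        "title_word": defaultdict(set),
        "subject": defaultdict(set),
    }
    for rec_id, rec in records.items():
        indexes["author"][rec["author"]].add(rec_id)
        indexes["subject"][rec["subject"]].add(rec_id)
        for word in rec["title"].split():
            indexes["title_word"][word].add(rec_id)
    return indexes

def query(indexes, **criteria):
    """Intersect the record sets selected from each named index (boolean AND)."""
    result = None
    for index_name, key in criteria.items():
        hits = indexes[index_name].get(key, set())
        result = set(hits) if result is None else result & hits
    return result or set()

indexes = build_indexes(records)
print(query(indexes, author="twain, mark"))                            # {1, 2}
print(query(indexes, author="twain, mark", title_word="mississippi"))  # {2}
```

A card catalog would need a separately filed card for every such combination it wanted to support. Here the combination is computed only when someone asks for it.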

The actual arrangement of the pre-coordinated card sequence results from two decisions:

  1. Entries: What are the criteria for the selection of entries - which persons or other entities are to be represented by a card for a main or added entry, and which not?
  2. Headings: What is the exact spelling of the card headings for the selected entities?
    The difficulties encountered here gave rise to the whole edifice known today as authority control.
Metadata schemes, as an aside, seem to neglect the second question more often than not, at least when it comes to names and titles. This relates to the assumption that OPACs no longer require the elaborate edifice of rules that had been necessary for cards because now every detail can be made searchable, so if one access point fails one can try another.
This is, however, a premature conclusion, becoming apparent when looking at the situations in which a catalog is consulted:


Standard situations of catalog use

The situation most frequently encountered is probably the factual search. For this, catalogs are not very helpful because they contain descriptions of reference works only, not their contents. Search engines, however, index the available documents directly and in their entirety and can thus lead immediately to the facts contained therein. When looking for facts, search engines are therefore the first stop for almost anybody these days: the engines serve as directory, dictionary, encyclopedia, atlas, calendar, timetable, picture book, etc. Catalogs can only point users to all those reference tools, which makes the search for facts more cumbersome and time-consuming.
If, however, we turn to document searching, we can observe at least three broad categories of situations frequently encountered when people use a catalog or search engine:

(a) Known item search ("I know exactly what I need"): looking for something cited or referred to in some other place, like a bibliography (before the advent of hyperlinks).

The user then has to know what data elements are likely to yield results. Rules for the selection of these search criteria are called "entry rules".
For cards, these rules had to be very restrictive because, for economic reasons, one could only ever produce and file a very limited number of cards for any given item. In contrast, OPACs produce and arrange their indexes automatically. Index entries, and thus access points, can therefore be very numerous. As one attempt fails, for whatever reason, another and yet another can be tried in rapid succession. Before long, a lack of reliability will be perceived, leading to the desire to have more things standardized (or under authority control) than ever before, like publishers' names or place names.
In addition, there have to be rules governing the description of items. Descriptions have to be brief but to the point: they have to ensure that the database user will be able to differentiate between dissimilar items, like different versions or editions of a document. The important principle is: meticulous transcription from the item at hand.
(b) Collocation search ("I want everything written by XYZ"): What the user knows is, for example, little more than a name or title, or one single document. Starting from this, they want to find all logically related items, like other editions or versions, translations and so on, or all of the output of one author. This objective calls for rules that bring together what belongs together. Such rules are traditionally called "rules for headings" because it was the card headings that eventually brought together all the cards that described one author's works and such. Roughly, headings rules prescribe that a name or title be spelled in exactly the same way all the time. Related items do not come together all by themselves when names or titles are different. Many a name and title therefore has to be spelled differently from what's printed on the title page or equivalent - which may be at odds with situation (a), which requires precise transcription. Sometimes, because of this, a name or title has to be recorded in both the standardized form and the form found in the piece itself. For card catalogs, this led to the invention of reference cards (like Samuel Langhorne Clemens: see Mark Twain). For databases, references are collected in "authority files". An authority record for a person contains all the different forms of a name encountered. With an OPAC properly set up, this should then lead to the same result for any query using any of the different forms. For every single document then, just the authority form or its id-number has to be recorded, plus the form found in the piece itself for proper identification and distinction. Some authority records contain as many as 30 or more forms, for example for names like Chekhov or Tchaikovsky.
The only authoritative authority file in the AACR world is the one of the Library of Congress, for names of persons and corporate bodies. For persons, this file also contains the titles ("uniform titles") of many works that have been issued in numerous editions and translations.
In Germany, the Deutsche Bibliothek is running similar files, based on German cataloging rules (RAK = Regeln für Alphabetische Katalogisierung).
(c) Subject search ("I'm looking for material on xyz"): Very often, someone embarks on a search without prior knowledge of any specific title or any author related to the subject. This situation is, in principle, much more problematic than (a) or (b). "What is this book about?" is a question that very often cannot be answered with a brief list of terms (see above, remarks on "depth"). Books are normally not full-text searchable for lack of access to the source file. Situation (c) is, however, likely the most frequent and important one for many end-users, who tend to perceive (a) and (b) as rather unproblematic. There are authority files for subject terms just like for names and titles: the Library of Congress Subject Headings (LCSH) for English-speaking countries, the SWD maintained by the Deutsche Bibliothek for German libraries.

Situation (b) and its aspect of "editions of a work" often gets overlooked or is not given much attention. It may occur less frequently than the others - how many works, after all, run into two or more editions? One gets more of a sense for it when considering the following search situations, all of which can only be successful if the catalog does indeed "bring together what belongs together":
  • Some users don't know there is a newer (better, more complete) edition than the one they have been referred to.
  • A citation may be imprecise but still good enough to find at least one edition - this one should then lead to the others.
  • Users are sometimes happy with any edition of a cited work, no matter the real title.
  • Users may enjoy the serendipity resulting from being presented with more than one edition.
And something else: the fact alone that a translation exists or that several editions have been produced may be viewed as a quality indicator. The card catalog made this readily apparent when editions were all filed under the "uniform title" (and referenced from the various real titles). For OPACs, it might be considered to use the presence of edition statements and uniform titles for ranking in result sets. If this has already been done somewhere, not much is known about it. OPACs can (and should) of course provide a link to "related editions/versions" based on the presence of a uniform title.

Perfection, however, is out of reach: for example, very often a library has only one edition of a work and the cataloger is unaware of the existence of others (esp. ahead of time before other editions would be published!). Then, only this edition can be found in the catalog, but not under any other title by which it may be known to a searcher. Such cases are less frequent in large, shared databases.

Plus ça change, plus n'est pas la même chose ...

Technology enabling proliferation like never before, it is now very common to encounter diverse "manifestations" of a text: the same content can be presented in different versions or file formats and with all sorts of modifications. This can aggravate the difficulties with collocation searches (situation (b)). And titles, though being the most important element identifying a document or work, are not handled with a lot of care on the Internet.
Classically, the manifestation problem varies from one discipline to another. It is probably least virulent in the sciences and in technical disciplines, for it is rather an exception for a document to live through more than one edition. In belles lettres, it is more common, but music has arguably the most and the worst examples: many pieces can be found in dozens of interpretations, titles changing all the time as well as the forms of names (Tchaikovsky!). The "uniform title" is therefore nowhere as important as in music to bring all editions or versions together.

AACR are concerned with the formal level, not the subject level!

The AACR code of cataloging rules, like the German RAK, deals with situations (a) and (b) only. These pose problems that can be solved by purely formal or descriptive means, whereas (c) requires attention to the content of things cataloged.
In the world of cards, there were sometimes (in Germany, nearly always) separate catalogs for situation (c). OPACs, however, always combine formal and subject access points in the same database, though not generally in the same index. They can differ in having or not having an "anyword index" that combines all words (but not phrases) occurring in bibliographic records. In any case, it seems important to have uniform access forms for personal and corporate names, serving both kinds of access. German rules are not yet fully unified in this regard.

The problems described here have been known at least since Antonio Panizzi's work at the British Museum in the 19th century (his "Ninety-One Rules" were published in 1841). He had set himself the task of setting up the first complete catalog for the library. His employers found his ideas somewhat overly complicated and were reluctant to support him. This situation keeps repeating itself...

Attempts at formulating international principles for cataloging set in only in the mid 20th century, the all-time highlight being the IFLA Conference of 1961 in Paris. The "Statement of Principles" promulgated there became the foundation for AACR as well as for RAK. Only as late as 1999, IFLA came up with a new milestone paper, entitled "Functional Requirements for Bibliographic Records", which is gaining ground not just in library circles but also in metadata projects. Some of its main points are presented in a separate paper, "What should catalogs do?", for the German Annual Conference in May, 2002, Augsburg.

Is AACR2 inextricably intertwined with MARC21 (and RAK with MAB)?

The MARC21 and MAB exchange formats were created to serve the exchange of library data. The Deutsche Bibliothek creates RAK records in MAB2 format, the Library of Congress produces AACR2 records in MARC21. However: the Deutsche Bibliothek can and does deliver the same data cast into the MARC mold. Format and rules are not inextricably intertwined: a data format is nothing more than a container. With a bit of goodwill, wrinkles can be ironed out. A worldwide, unified exchange format can be envisioned, despite rules remaining different. UNIMARC was created for this purpose, but it has not caught on. Some samples have been set up for demonstration.

Catalogs and search engines

Time and again, catalogs and search engines are juxtaposed in a pears vs. apples comparison.

The intention here is not to find out which is the better gadget but to show what differences exist. Not just librarians may be interested in getting a clearer picture of strengths and weaknesses.

There is actually no competition, for catalogs and search engines cover different ground. Most print material remains offline and thus inaccessible for harvesters, and on the other hand, many online resources have unprintable characteristics and thus could not be published in print.

There are, however, widening "grey" areas: Genuine internet resources are being cataloged to enrich catalogs. And search engines index files that contain book reviews, abstracts, whole chapters, descriptions, etc. Some categories of publications, like preprints and dissertations, which used to appear in print are now mounted on webservers. Important older books no longer subject to copyright are digitized and made freely available. The works of "classics" in many languages are freely available as text files, most prominent example being the "Project Gutenberg". And reference works that used to be published in book form are increasingly made available online and turned into databases or (in library cataloging parlance) "continuing integrating resources". And then, last but not least, there is Google's effort to digitize books on a grand scale. At the time of this writing, one cannot do much more than speculate about the potential of this project.



Catalog / Search Engine

Document base, Coverage

Catalog: Describes a particular collection, predominantly books, located in one or several buildings.

Search engine: Indexes documents distributed all over the planet. The majority of these "resources" are not very much like books.

Size

Catalog: The collection is a selection from a much larger number of existing documents. The selection will mostly be by objective and quality criteria but it can also be subjective. However, lack of funds can cause the lack of important materials. Union catalogs describe many more items than individual catalogs, but not everything is easily accessible.

Search engine: The intention is for comprehensive and global coverage, but in reality no more than some 30% of accessible materials are indexed by any one search engine. Selection for quality is generally not possible. Size and currency of coverage are not obvious to the user, selection is an automatic process. Many documents covered have never been published conventionally, and most conventionally published material is not on the web.

Objectives

Catalog: A catalog has clearly defined goals (RAK §101), one of which is to ensure reliable access for some types of queries. "Known item searches" and "Collocation searches" are deemed particularly important. In many cases, one has to know the right search terms with some accuracy in order to be able to ascertain presence or absence of an item in the collection.

Search engine: Guiding principles for search engines would be difficult to work out, at least in the sense that one could know with a high degree of certainty how the presence or absence of something can be ascertained. In particular, "Subject searches" and "Collocation searches" cannot technically be made reliable. For "Known item searches", the situation is better: knowing two or three characteristic and not too common words the text must contain, an AND search is very reliable. The dominating use, however, may well be the factual search: with some luck, it is nowhere else that one can so swiftly find an address, a statistical figure, a historic date, a word's meaning, or a picture.

Expectations of users

Catalog: Holdings of a library are usually smaller than users expect for their fields of interest, though libraries usually try to build balanced collections of quality materials of long-term value. Union catalogs may be viewed as catalogs for a much larger yet virtual collection.

Search engine: The number of "documents" indexed may be much larger than any user would imagine, but valuable resources are side by side with utter ephemera and all sorts of useless matter. There are various attempts to use formal criteria for "ranking".

Nature of data

Catalog: Data consist of highly standardized brief descriptions, following elaborate codes of rules. The most widely used codes are AACR2 and RAK. Every item is represented by a structured record containing well-defined data fields. The data formats have been designed to accommodate all elements prescribed by the rules. The most widely used formats are MARC21 and MAB. Some examples are provided to illustrate how code and format complement each other.

Search engine: There are no standardized descriptions of the documents indexed. The database consists of nothing but large inverted files, derived directly from the documents but never shown as such. Standardization in the sense of authority control is not possible because of a general lack of standardized metadata. Even where metadata exist, they are not always helpful: they are insufficiently standardized, too simple and meager. The most widely advocated semantic standard is the "Dublin Core", but this is a container, like MARC, and what matters is its content. But for content, any standard like AACR2 is mostly absent.

Creation and content of the database

Catalog: Full texts are not available for direct access or automatic indexing. Catalog records are just very brief and artificial surrogates. Descriptions are based on title pages or equivalents and little else. Record structure is still related to traditional catalog card structure in terms of content and layout. Automatic cataloging (scanning title pages etc.) is not feasible, descriptions have to be prepared by manual and intellectual input.

Search engine: Some search engines index the entire text of web documents. Things like title pages either do not exist or are not detectable by software. Programs can, however, evaluate the proximity of words and whether they are highlighted or specifically tagged (headlines, image tags).

Search criteria

Catalog: Searches can be restricted to certain fields and boolean combinations thereof: names, title words, title phrases, subjects etc. Some OPACs have an "anyword" index allowing for keyword searches in the entire text of the records. With regard to books and similar documents, search criteria relate to a book as a whole, not to any of its parts, like chapters or contributions (the "depth" of indexing, in other words, is rather limited).

Search engine: Full-text searching is the default. There are mostly no fields for titles, names, subjects, so these do not exist as search criteria. If a title search is possible, then it operates on the titles "as is", and not all web sources have proper titles. Searches for URL components can be a useful complement. Because of the full-text searching (which means more "depth"), using combinations of not-too-common words can often yield good results where no library catalog would turn up anything, but one can just as well get scores of irrelevant items. There may be additional functions like, for example, image searching, based on image tags in HTML text. Some engines do a kind of ranking that attributes more weight to words in the opening section.

Browsing

Catalog: Instead of direct queries, most OPACs also offer index browsing (up and down, in sorted lists of terms). Browsable indexes can assist in finding words and names the exact spelling of which is not known. Also, it can be useful to see which inflected forms exist (plural, genitive, etc.), for an untruncated word search will find only that particular spelling, whereas titles can contain other forms. English may be the least afflicted language in this regard. Serendipity can also be helped a lot by browsable indexes.

Search engine: Search engines generally do not feature browsable indexes. Although their absence is rarely noticed, such indexes would be very helpful because of the total lack of authority control. The enormous amount of data may make the production of browsable indexes unfeasible. Because of full-text indexing, the inflection problem is less serious: the important words will usually occur in several inflected forms in any given text. But: there are prominent search engines not yet featuring truncation...

Result set arrangement

Catalog: Result sets are usually shown sorted by author or title, or in reverse chronological order. Some systems offer a choice. For ranking, an OPAC might employ word proximity, language, number of pages, or facts like existence of a uniform title or edition statement. Not many OPACs presently apply any ranking technique. This may be because the very brief textual content of catalog records severely limits the applicability of techniques developed for search engines.

Search engine: Some engines present results in no predictable order. Some talk of relevance ranking, employing various formal techniques. Strictly speaking, relevance can be judged only by the person searching, not by a machine. The word is used only as a metaphor, like so many in the computing field. One ought to make users aware of it. Search engines can, however, use criteria like link evaluation that have no parallel in catalog data. Ordering by date or alphabet is not possible because there are no corresponding data fields. Standard HTML files do not even contain a creation date, and the
