OCLC/BIBSYS — what the world doesn't know

With BIBSYS, academic libraries in Norway have enjoyed community cataloguing for pretty much every educational institution and a common user interface (both front and back end). This same system was maintained as a subscription-based cloud service with a take-your-data-with-you policy for libraries that wanted to pull out.

What the world doesn't know is that this system dates from 1988 (or indeed 1975 if you think of the pre-Web days), and has been used, abused, loved and hated by many thousands of users. BIBSYS is synonymous with "the library" at most institutions. The collective catalogue and common interfaces aren't BIBSYS' only products, but they are far and away its most important. The system was developed in-house here in Trondheim, funded by the state and subscriptions.

In 2010, following a laughable process, BIBSYS selected OCLC as a development partner for a next-generation library catalogue. What made the process laughable is that the report specifying the requirements for a new system was seemingly penned by people who were as ignorant of BIBSYS and its function as the man in the street outside Norway. This was obviously a bad start; however, once it was handed over to the contract lawyers and the representatives of the rather conservative academic libraries, what was once laughable became anything but.

The agreement with OCLC, for all its talk of something new, is delivering something Norway already had: BIBSYS, just based on other technologies. Given that the world has moved on since 1988, you might think that the time has come to think anew; it seems that this is a difficult task when you haven't had these technologies before, and who learns from other people's mistakes? You pay your money and you get what you had, just recoded.

Now, I'm obviously not happy about this — I said long ago (and I am not the only voice to raise this sentiment) that the last thing we need is another library catalogue; academic libraries don't do books, they do "stuff" (from the entire 50 million kroner (USD 8.7 million) literature budget, NTNU university library bought 7500 books). Library catalogues don't do "stuff".

Now is the time to be a bit radical — you don't need OCLC or BIBSYS to share and collaborate, you need to SHARE AND COLLABORATE. Now. Go away and do that.

The semantic web stack: complexity and incomplexity

[Yes, it is a word (I checked the OED)]

I'm currently helping some folk I know — database folk — work with semantic technologies, and I'm noticing a few things that are maybe worth mentioning about how difficult the semantic web is for established IT people.

I take for granted that people understand HTTP; we are, after all, in a world where content negotiation and RESTful APIs serving all manner of JSON and XML and other established services are commonplace. Except they're not. For the majority of IT people, these things aren't actually staples of their working life; only a small minority of people use these technologies daily in their enterprise. And the majority of these are Web people, not enterprise IT (who make up the vast majority of IT people).

In reality, people work with SOAP/WSDL and all the misery that entails; they use proprietary APIs that have nothing to do with REST and HTTP. This isn't the stone age, it's commercial, enterprise IT. It seems like madness to anyone coming from a Webdev background, but it's there because it works.

The people who develop and maintain large databases, who code non-serious applications and who design architectures for their enterprise do things I can understand, because this is "normal IT". When it comes to explaining the semantic web stack to them, they understand the words and technologies (maybe with the exception of RDF, OWL, RIF, SWRL and the like), but how these all add up to create something different is a place where they have no footing, no point of reference.

This isn't odd, because while an HTTP URI is a recognizable thing, changing its status from what people commonly understand it to be to the name of a thing is kind of a leap; similarly, content negotiation is understandable, but it's hardly easy to understand in light of the interminable discussions on the mailing lists. Sure, there are analogues: you get the feeling that a triplestore can be seen as a database, SPARQL is really SQL…
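
To make the content negotiation part a little less abstract: the same URI names the thing, and the client's Accept header decides which representation comes back. What follows is only a rough sketch in Python, assuming a hypothetical server that can offer HTML, RDF/XML and Turtle for one resource. The function name and the list of media types are made up for illustration, and the parsing is deliberately simplified (q-values and the */* wildcard only), so it is not a full implementation of the HTTP rules.

```python
def pick_representation(accept_header, available):
    """Return the available media type the client prefers, or None.

    Simplified: handles q-values and the */* wildcard only, not
    type/* ranges or media type parameters other than q.
    """
    preferences = []
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        quality = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        preferences.append((quality, media_type))

    # Highest q first; sorted() is stable, so header order breaks ties.
    for quality, media_type in sorted(preferences, key=lambda p: -p[0]):
        if quality <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*" and available:
            return available[0]
    return None


# Hypothetical representations of one and the same resource.
AVAILABLE = ["text/html", "application/rdf+xml", "text/turtle"]

# A browser-like client and an RDF-aware client asking for the same URI.
browser = pick_representation(
    "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8", AVAILABLE
)
rdf_client = pick_representation(
    "text/turtle;q=0.9, application/rdf+xml", AVAILABLE
)
```

The point of the sketch is that nothing about the URI changes between the two calls; only the preference the client states does.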

But it dawned on me: the real problem is just the core concept of semantics: what is it that you are doing when you're doing semantics? You're naming a thing and you're describing it in a self-documenting and hopefully logical way. And it's here more than anywhere else that the problem arises because, even with an understanding of all of the other complexities (which are in essence technical detail), it's the actual semantics and how you put it to use in your particular case that is the problem.

To understand the semantic web in an abstract sense is pretty easy: it's not difficult to find some concise definition of what the semantic web is. Putting this into a practical context for a person who is used to having a grasp on what they do is far more difficult. An IT person is used to getting stuff, by instinct and by reading the documentation; however, no amount of reading documentation can make you see the practical consequences of doing semantics until you actually have a model that lets you see what you can do (and what you've already done wrong).

Linked data in libraries & why it isn't going to work

I guess the major problem with linked data in libraries is that there is no real need for it; let's face it, the systems that exist today serve their purpose for the majority of use cases, and probably all of the actual use cases. We built the systems we made on sound foundations, and we keep them running because they do the job we want them to do.

Talking about linked data in this kind of environment isn't going anywhere. It's rather counter-productive, because changing the systems we have for linked-data-based systems is not cost effective, nor is it realistic. Currently, the linked data "system" does not actually exist; there are some "products", but nothing that does what a library wants them to do without prohibitively expensive development.

It's nevertheless tantalizing to imagine a world of linked data that makes it possible to find any information quickly and with a cool interface, and I'm sure that this is the kind of experience libraries are willing to pay handsomely for. Truthfully, this kind of thing is exactly what we're seeing from the major players in the systems marketplace; it's no coincidence that the likes of Ex Libris talk about linked data in positive terms — it has been seen as a way forward (generally and specifically — by the Library of Congress).

Contrast this however with the outpourings of the "SuperMARC" community — surely a pastiche of thinking within the library IT community — and you see an issue that isn't going away: people are satisfied with what they've got, because even if it's a grandparent, it still works for them. Any plausible change to the status quo is just an addition to what is already there, nothing more, nothing less. Take BIBSYS' assertion that the new library system from OCLC gives us "a system with all of the necessary functionality to run a library".

This is codification of the values and aspirations of libraries — it might seem ridiculous to many, but what is being said echoes the outcome of hundreds of focus groups in libraries around the world that have brainstormed "what do we need to do to be relevant". These sessions return results like "be present in Google", "be where the user is", "smart interface", "search everything at once". The thing is that the SuperMARC-ers are right, all of this is possible with current technology. It's current licensing that's the problem.

Looking at the many projects that people are working on at the moment, and the statements from the Library of Congress, you can clearly see that we're nearing the end of the first wave of linked data, where we collate and convert MARC records to linked data. The real problem is that we're doing this. We have assumed that the data is actually worth keeping. It isn't.

The record is the issue; the cohesive concept of a thing that is a manifestation of a work, described by reference to properties and values that were developed to help identify physical things in the real world, doesn't translate well to the digital reality. Here, it's possible to go beyond the physical and into the conceptual, to dig deep and find new relevancies, connections and contexts. In this sense, no work has been done (FRBR, by the way, should simply be ignored because it isn't a useful model in theory or in practice, as it builds on thinking that draws its hypotheses from the long tradition of library and information science).

I suggested a topic for LODLAM in San Francisco in June this year: "What are the objectives of linked open data for libraries, archives and museums? From the perspective of libraries, the objectives of providing metadata have traditionally been “finding, identifying, selecting and obtaining”; within other domains, I am sure that the same kinds of principles apply; however, there seems to be something lacking in these objectives seen from the context of the semantic web. What are the new objectives when creating semantic metadata? At NTNU, we think that contextualizing (what contexts surround the item, its facets), comparing and sharing are important, but there is surely more". No takers.

What's going on is a continuation of a theme, and the tune is all wrong, the stage is set for a different production; in fact, we're brewing beer. It's a real issue, because linked data offers nothing unless it is used in an appropriate way, and I don't think we're even close to finding out what this actually means.

Sure, if you're in a position to do it, using the linked data stack is a great idea (it makes for speedy development and has powerful logic), but it isn't a panacea, and just doing what you're doing now with linked data is not a good idea; there are better, more cost-effective ways of doing this. But face facts: you aren't going to create a killer app this way.

What would make a killer app is having the time and dedication to go and reinvent what bibliographic data is, add the contextual information by hand rather than by algorithm (how do you know what to program?), and find the links that turn the data from "distributed library card" to "starting point for the rest of my career". I desperately want these things for libraries, but I don't see them happening anyplace.

That said, we at NTNU have tried to create the data "the way it should be done"; we have added contexts and created data-driven applications. The problem is that this isn't mainstream, it's backwater; we're alone. We have plans with other Norwegian institutions, but they struggle with the same issue.

So, tomorrow, I sign a contract that will see me leave the library and move into industry to work with linked data; I passionately believe in linked data and openness, and I think we'll be doing great, great things. I'm excited, but I'm also down because I also love libraries and the work we do. So long and thanks for all the MARC.

Of BLOBS, Institutional archives/repositories and CMSes — an exercise in bad design (by design)

One of the recurring themes in my life is my contempt for BLOBs; this is not as bad as it sounds (in one respect), since I mean BLOB as a data type. I can't stand them, they annoy me and I never want to see them. I have thought this since 1998 (the last time I used a BLOB).

Most people don't get so worked up about data types; I feel they should, and they would if they knew how much misery these things wreak on us every day.

Let me explain: a BLOB is basically a large, amorphous lump of data that you store in your database; in most cases this is an image or a video. Why is this so bad? Well, for starters, databases are meant to contain data structures, not files. You can tell this because as soon as you start dumping lots of binary data in there, your database application grinds to a halt. (Oh, it doesn't, you say? No, it does, it's just that you can't, or won't, see it.)

This is just a little bit sad, because sure, speed isn't everything. It gets worse. There are a number of activities that are trivial when, for instance, storing things in the file system that become non-trivial when stored as BLOBs: things like read-write operations, which you'll see a) become disastrously slow and b) require an insane infrastructure that isn't necessary or smart.
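
For what it's worth, the alternative I have in mind is nothing exotic: the file lives in the file system and the database holds the structured bit, i.e. a reference to it. Here is a minimal sketch in Python using SQLite, assuming a made-up `item` table and a temporary directory standing in for real storage; the table, column and file names are all hypothetical.

```python
import os
import sqlite3
import tempfile


def store_file(conn, storage_dir, name, payload):
    """Write the bytes to disk and record only the path and size in the database."""
    path = os.path.join(storage_dir, name)
    with open(path, "wb") as handle:
        handle.write(payload)
    conn.execute(
        "INSERT INTO item (name, path, size_bytes) VALUES (?, ?, ?)",
        (name, path, len(payload)),
    )
    conn.commit()
    return path


def load_file(conn, name):
    """Look up the path in the database, then read the bytes from the file system."""
    row = conn.execute("SELECT path FROM item WHERE name = ?", (name,)).fetchone()
    if row is None:
        raise KeyError(name)
    with open(row[0], "rb") as handle:
        return handle.read()


# In-memory database and a throwaway directory, purely for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item ("
    "name TEXT PRIMARY KEY, path TEXT NOT NULL, size_bytes INTEGER NOT NULL)"
)
storage_dir = tempfile.mkdtemp()

payload = b"\x00\x01binary image data"
stored_path = store_file(conn, storage_dir, "scan-0001.tif", payload)
round_trip = load_file(conn, "scan-0001.tif")
```

Because the file is an ordinary file, linking to it, serving it and backing it up are the usual file-system operations; the database only has to answer questions about structure.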

This is why CMSes are horrible, why we need stuff like SWORD to make institutional archives and repositories bearable and why our content is largely at odds with Google and friends. Simple things like linking to a file (and the godawful library concept "deep linking") become exercises in soul-attrition for sane developers.

So, the next time you review a system, please, please do yourself a favour: if it uses the BLOB data type, walk away.

NTNU linked open data initiative

NTNU, headed up by the linked open data team at NTNU University Library, has been working with linked open data since 2009, and has driven the production of multiple data sets and systems that use these.

Examples of data sets produced by NTNU include:

  • Theatrical productions from Sør-Trøndelag theatre
  • 1.5 million name authorities (Rådata nå!)
  • Academic conference authorities
  • NTNU Special collections
  • Norwegian Scientific Disciplines (NVD)
  • TEKORD (Norwegian controlled vocabulary for science with UDC numbers)
  • Norwegian MeSH
  • Journals metadata (including Sherpa/ROMEO, NVD-metadata and Norwegian "scientific level")
  • Research database descriptions
  • NTNU research experts database

The largest data set, "Rådata nå!" (Norwegian for "raw data now!"), provides 1.5 million descriptions of name authorities, including links to biography and related data from DBpedia and library data sets. It was produced in a collaborative project with BIBSYS, funded by the Norwegian archives, libraries and museums authority. This was one of the first large Norwegian public data sets to be released under a liberal licence (ODC PDDL), and possibly the first truly large RDF data set to be released in Norway (>10 million triples).

The work on these data sets has been part of a key exploration of semantic technologies for use in the daily operations of NTNU University Library; examples of this include the tools for semantically enabled journals search for distribution on different platforms, databases interface, theatrical role and show search and especially the NTNU special collections search interface. This latter system is entirely semantically driven from the bottom to the top. Every layer from data to user experience and integration with external systems employs linked data as its backbone, drawing together non-library data with library data to supplement and improve the quality of the cataloguing and create a growbed for unique approaches to unique materials.

Continuing work on these and other projects shows exactly how powerful semantic tools really are, and how they change the way we work. A commitment to semantic technology is essential for survival in an age of intelligent data.

Special collections: archives, photos, manuscripts, maps and stuff

I'm currently working on NTNU's special collections, modelling the metadata for manuscripts and journals and creating a feature-rich skin that will allow people to experience the content via alternative browsing and navigation methods. I'm also working on the database of productions and roles from Trøndelag Teater that is maintained by NTNU. In this work, I use linked open data, essentially RDF with weblinks to other sources of raw data, which means I'm providing others with our data for free, but we are in turn accessing classification systems, geodata and biography for free. One of the major benefits of this, aside from the really obvious ones, is that you get multilinguality for free.

The greatest benefit of working in this way, however, is that we can create unstructured, non-normalized data and use it as if it were structured and normalized; if we need an extra descriptive field on an item, we just add it. In fact, because we're describing things as things (rather than trying to press them into an A4 framework), we can have totally heterogenous data residing in the same system without any problems.
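
A toy illustration of what "describing things as things" looks like in practice, written in Python rather than in RDF proper. This is only a sketch under loud assumptions: the URIs, property names and values are invented for the example and are not our actual data or vocabulary, and a plain list of tuples stands in for a real RDF store.

```python
# Each statement is a (subject, predicate, object) triple.
# All URIs and values below are invented for illustration.
EX = "http://example.org/"

triples = [
    (EX + "manuscript/1", EX + "title", "Letter from a bishop"),
    (EX + "manuscript/1", EX + "writtenIn", EX + "place/trondheim"),
    (EX + "production/7", EX + "title", "Peer Gynt"),
    (EX + "production/7", EX + "stagedAt", EX + "theatre/trondelag-teater"),
]


def describe(store, subject):
    """Collect everything said about one thing, whatever shape it has."""
    description = {}
    for s, p, o in store:
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


# Needing an extra descriptive field means adding a statement,
# not altering a schema or touching any other item.
triples.append((EX + "manuscript/1", EX + "inkColour", "brown"))

manuscript = describe(triples, EX + "manuscript/1")
production = describe(triples, EX + "production/7")
```

The manuscript and the theatre production share one store and one access function even though they have almost no properties in common, which is the heterogeneity described above.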

The back-end we're using is Talis platform, a data store for RDF that allows you to create streams based on logical queries in SPARQL; the streams can be provided in a multitude of formats, but the ones we're using are JSON and XML/RSS. These simple techniques allow us to get access to our data via HTTP, which means that the data is integratable with virtually any system, and particularly CMSes that restrict access to lower level programming languages (in fact, you can do pretty much everything we do with JavaScript alone, and we largely do).
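
To give a feel for how little is needed on the consuming side, here is a sketch in Python of turning a JSON result stream into something a template or script can use. It assumes the store answers in the layout of the W3C SPARQL JSON results format; whether a given stream from the platform looks exactly like this is an assumption on my part, and the sample response below is hand-written with invented values rather than fetched from anywhere.

```python
import json

# A hand-written sample in the W3C SPARQL JSON results layout; values are invented.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["production", "title"]},
  "results": {"bindings": [
    {"production": {"type": "uri", "value": "http://example.org/production/7"},
     "title": {"type": "literal", "xml:lang": "no", "value": "Peer Gynt"}},
    {"production": {"type": "uri", "value": "http://example.org/production/8"},
     "title": {"type": "literal", "value": "Hedda Gabler"}}
  ]}
}
"""


def rows_from_results(document):
    """Flatten SPARQL JSON results into plain dicts of variable name to value."""
    parsed = json.loads(document)
    variables = parsed["head"]["vars"]
    rows = []
    for binding in parsed["results"]["bindings"]:
        # A variable may be unbound in a given row, hence the .get() calls.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


rows = rows_from_results(SAMPLE_RESPONSE)
titles = [row["title"] for row in rows]
```

In a real setup the document would come from an HTTP GET against the store; the flattening step stays the same.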

The hard sell for me at the moment is that it is always difficult working in a vacuum; so I'm interested in hearing from others that would like to do similar things. Because there is no requirement that data is structured in any particular way, just provide some RDF for us to load into our store (I can help with this) and we can see how you can use the data in your systems.

Because a datastore like Talis platform is a powerful tool, the data-crunching is done by the storage, relieving the pressure on the programmer at your end, which means that you can implement creative solutions easily. Nevertheless, I can help with this too if needs be.

It's important to note that this approach ultimately offers not only federated searching, but also interlinking between related stuff — we have one version of the constitution, another institution has another — it would be nice to be able to refer to one another using the simple mechanisms of linked data.

New Ideas: Institutional refund policy

Having a quick browse about the Interweb, I came across an article that the publisher wanted money for and next to it an in-brackets link, "(refund policy)". The publisher is Ingenta, and the policy seems both reasonable and sane. In fact, I'd go as far as saying that I would probably do that if I had been in the market for acquiring articles for use on a pay-per-view basis (I'm not, as I work at an institution that pays for the content I require).

Now apply this to institutional acquisitions...

Open data

My take: Open data is a way of providing the raw materials (data) for people to create systems and applications that can help you do the things you need to do. Examples of this include:

  • Mixing together map data and planning office data to create a way of finding out who is building what in your neighbourhood
  • Using weather data to provide personalized information for places you're interested in
  • Using library data and research data to create easy to use reading lists for medical professionals
  • Using government spending data and map data to show which local councils are doing the best job

It's important to note here that this data is "open" in the sense that you're free to use and distribute it under a liberal licence. Of course, some people might say that this data is not free; however, in the majority of cases the data is already created and paid for with taxpayers' money (in the case of maps, weather, library and government data). There is also a business case for commercial enterprise: making your data available means that you're creating another way of channeling people into your services and products (of course, you're not going to publish sensitive data!). In fact, people using the data are probably creating the services that you would otherwise be paying for.

Open data is important because of exactly this kind of thing: people getting together and creating the services that people want. An original idea can occur to anyone, but it is unlikely to succeed commercially; however, a simple application can be knocked up based on existing data without investment, and the usefulness of this product can then be assessed. Nine times out of ten, this kind of thing won't be of wide appeal; however, in the cases where it does work, the service acts as a broker for the services creating the data. Consider, for example, Amazon.com's model, whereby book jacket metadata is sourced by them on the understanding that the use of this data leads to more traffic on their site.

From a societal point of view, the main product you want to channel users into is democracy and an informed choice regarding the policies and people that work well. The provision of government data such as ministers' spending reports and provision of healthcare services can inform voters as to how their local and national government is performing. This has its benefits.

For you individually, open data is a tool for creating the information you need. If you're a business owner, you can take others' open data and mix it together with yours to show how and where you should be creating your market advantage. As a private individual, you can help yourself get a better overview of what's going on and where you can get the best deal (both in terms of money and experience). If you use open data, you can create a richer experience both for yourself and others; and others can create their own rich experiences of your data if you publish open data.

The first steps to publishing open data are:

  • Getting permission to publish your data (if you own the data, then great, if not…ask your boss)
  • Licensing your data (it's important to make sure that the data can be used)
  • Making it available (just putting it out there might not be enough, maybe people need help to use it)
  • Letting people know about it (CKAN/The data hub is a great place to start)

You'll find that the benefits of open data outweigh the perceived negatives. In fact, it's odd that we haven't always done things this way.

SPARQL: Oh yes, you can!

I work with library management systems, and this means occasionally venturing into userland (hell, someone has to). One of my favourite questions to pose members of staff at the help desk in my library is "I'd like a green book with four t's in its subtitle", which might sound like an odd question if you've never worked a library help desk.

The problem is that there is no search technology in our systems (and, I suspect, in any commercially available system) that would allow a library patron or even a back-end user to query bibliographic data in this way. It simply wasn't feasible that anyone would want this kind of information, and the bibliographic data stored in the database won't contain information about the book's colour anyway, even if you could find books with n t's in their subtitles (in fact I guess that you can't even restrict searches in our systems to subtitles, even though this data is technically available in the data store).

This is one of the major problems with the way bibliographic data is managed; the formats used are so arcane and domain specific that they don't work in reality. Sure, my "odd" question can't be answered perhaps because it is a bit "odd", but for each such "odd" question, there are a number of similar real questions that library systems prevent from being answered partly by design and partly through ignorance of what people want (and that this changes over time).

In a standard interface, a user is typically presented with a simple search box, mostly a box that blindly searches a set of specified indexes in the data store. Not very smart. Then there is an "advanced search" that allows a mockery of Boolean logic to be applied fieldwise (remember folks: no parentheses, no logic, or someone else's logic). For expert systems, things like CQL are available, but these are still dependent on the data in the store, which, as pointed out, is limited in its content.

So, given that our armory consists of things like CQL and not much else, your only hope is to dump the data store using another arcane standard from the Z39. spectrum, which means a lot of work understanding the content and typically using regular expressions to do your search.

That said, you're never going to get away from the initial problem that the data is largely broken because it doesn't take into account what people using the data actually want (they want to find stuff, sure, but they want to find stuff in different ways, and certainly the ways in which Google records data about things shows us that we have a lot to learn in this field).

Given that libraries have used essentially the same format for storing their data since the 1960s, it isn't a surprise that there are better alternatives; in fact, given that libraries now record data for many diverse information types, it's about time that we had a data format that is extensible enough to a) represent diverse things and b) do so in a precise, self-documenting way. I offer you RDF.

RDF is extensible and comes with methods for defining relations between a thing and data related to that thing; RDF can therefore be used a) to replace the format of records and b) model the way in which we think about data about things. Goodbye MARC and goodbye AACR2 (and RDA for the perverts among you). RDF has two obvious benefits. First, it's a domain-independent web standard, which means bibliographic data is directly relatable to other data. Second, it comes with a strong, yet simple query language, SPARQL.

SPARQL can, given the availability of the information in the source data, provide the answer to any question because it implements logic at a high level and provides tools for textual string parsing and comparison. Sure, the majority of users will never use anything like this, but it provides system designers and advanced users with a better toolset than they possess today. Another benefit of SPARQL is that it allows queries to be federated to diverse data sources. I can for example query data stores containing articles simultaneously with sources for bibliographic data, simply.
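
For the curious, the green book with four t's might be asked for along the following lines. This is a sketch, not something run against our data: the `ex:` vocabulary is hypothetical, the query leans on SPARQL 1.1 string functions (STRLEN, REPLACE, LCASE), and it only works if cover colour has actually been recorded. The Python below the query spells out the same logic over a few invented records so the counting trick is easy to follow.

```python
# The query as it might look in SPARQL 1.1; the ex: vocabulary is hypothetical.
ODD_QUESTION = """
PREFIX ex: <http://example.org/terms/>
SELECT ?book WHERE {
  ?book ex:coverColour "green" ;
        ex:subtitle ?subtitle .
  # Count the t's: length of the subtitle minus its length with every t removed.
  FILTER (STRLEN(?subtitle) - STRLEN(REPLACE(LCASE(?subtitle), "t", "")) = 4)
}
"""

# The same logic in plain Python over invented records.
books = [
    {"id": "book/1", "coverColour": "green", "subtitle": "Notes on theatre history"},
    {"id": "book/2", "coverColour": "green", "subtitle": "A short history"},
    {"id": "book/3", "coverColour": "red", "subtitle": "Letters to the sea"},
]


def count_letter(text, letter):
    """Case-insensitive count of one letter, mirroring the LCASE/REPLACE trick."""
    lowered = text.lower()
    return len(lowered) - len(lowered.replace(letter, ""))


def green_books_with_four_ts(records):
    """Keep records whose cover is green and whose subtitle has exactly four t's."""
    return [
        record["id"]
        for record in records
        if record["coverColour"] == "green"
        and count_letter(record["subtitle"], "t") == 4
    ]


matches = green_books_with_four_ts(books)
```

Note that the red book also has four t's in its subtitle; it is the combination of the two conditions that answers the question, and that is exactly the kind of combination the usual search interfaces can't express.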

So, the next time you get the "odd" question, sigh and think of RDF & SPARQL.

LOD-LAM summit experiences

The LOD-LAM summit, 02–03 June 2011, San Francisco, USA.

LOD-LAM is an abbreviation of Linked Open Data for Libraries Archives and Museums, and the summit brought together people from all over the world, many from the US, but also many from Europe, Asia and Oceania. The summit was set up as an unconference utilizing "open space technology"; what this in essence means is that the organizing committee brings together the people, provides the physical space, logistics and some structure to ensure that things end and start at the same time, while the participants create the content by suggesting topics and leading discussion groups.

I attended LOD ABC, a session on the history and concepts behind RDF and linked open data. This was a useful and interesting illustration of how well an unconference can work; I attended this session mainly because I know something about the topic and have opinions about it. It's always good to see how other people think about these things and see if there are new perspectives to be had or things to learn. I think the discussion here was at a level that made the experts think hard about the way in which they present things, because there were many novices who wanted to understand the topic.

The second session I attended dealt with historical/geographical information in LOD. This session brought together a group of people who were interested in representing information that relates to concepts that aren't used today, an example here being placenames that were used historically. This session showed that there is a lot of work that needs to be done creating good ways of representing this kind of information, and that we need to do some collaborative work on converting existing information sources to linked data and creating new ones.

RDFa in EPUB brought together a group of people who were interested in producing something tangible: a semantic layer in the EPUB format, whereby individual concepts can be marked up semantically. This idea could lead to rich interfaces that can, for example, help readers understand texts or create a contextually richer experience than that offered by simple texts. The discussion in this group got slightly bogged down by the recent announcement by Google, Microsoft and Yahoo! that they will be supporting the Microdata format; however, it was a dedicated bunch who have worked outside the framework of the groups in order to achieve their goals, based on RDFa. The experience here was good in that it was nice to see people who are very passionate about textual mark-up.

On the last day I ran two groups on practical modelling and creation of linked data; here we looked at some theatrical metadata and tried to model this. The two groups I worked with proved very astute in their understanding and asked a lot of good questions. A special mention should go to Shawn Simister from Freebase who added a lot to the modelling discussions. Based on this, we produced over 2000 triples from our original metadata (we did not implement the entire model).

Additionally, there was a session at the end of the first day where people presented Dork shorts, which are two-minute presentations of things that people are doing; I presented the work we've been doing at NTNU during this session.

Outside the main conference, the food was provided by a diverse array of caterers in the area of San Francisco where the summit was held. This was very well thought out and included, on the first day, insightful and engaging talks about the area and its culture. We also had a tour of the Internet Archive, which was well worth a visit.

All in all, this was a very worthwhile experience; it is great to meet the people here (some old, some new) and share ideas about things and the work we're doing. It seems as if now is a very exciting time, and that we're moving very quickly towards a critical mass where linked open data is a normal and usable technology. At NTNU, we're comparatively conservative and our progress matches this; I believe, however, that we could perhaps benefit from a more radical effort regarding openness and linked data/RDF. We have a lot to learn here.

The summit worked well, in no small part thanks to the boundless enthusiasm of the organizers, but also because of the engagement of the attendees. Because people were attending on merit (you needed to apply and be vetted to attend), the level of interest in this relatively niche topic was immense; this was something that made the entire experience very good and even worth the jet lag.