Wednesday, May 13, 2009

LCSH as linked data: what is an LC Subject Heading?

The Library of Congress Subject Headings have been placed online in SKOS. You can search within the set or download the entire thing in RDF/XML or N-Triples. This is a welcome development.

I must say that I would also welcome some documentation on the decisions that were made, as viewing the actual data has left me with a number of questions. I'm going to begin my comments with a question about scope, and with some confusion that it is causing me as I think about how I would want to use this data.

What's an LC Subject Heading?

It appears that the LCSH file that is online represents those authority records whose LC control numbers begin with "sh", as in: sh 00009880. (These number 342,684 records.) However, if you do a Subject Authority Headings search in the LC authorities database you will retrieve any authority record that can be used as a subject. This means that you will retrieve personal names, corporate names, and geographic entities that can be used as subjects. (Note, this is probably a large portion of the name authority file.) The result is a mixture of records with LCCNs that begin "n" (for name file) and those that begin "sh" (for subject heading file). I'm at a loss to understand what determines whether a heading has an LCCN beginning with "sh" and would love to get an explanation.
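To make the prefix distinction concrete, here is a minimal Python sketch that sorts control numbers by their alphabetic prefix. The mapping covers only the two prefixes mentioned above ("sh" and "n"); the function name, the table, and the second sample number are illustrative assumptions, not anything LC publishes.

```python
import re

# Assumed mapping, limited to the two prefixes discussed in this post.
LCCN_FILES = {"sh": "subject heading file", "n": "name file"}


def lccn_file(lccn: str) -> str:
    """Guess the authority file from an LCCN's alphabetic prefix."""
    # Accept forms with or without a space, e.g. "sh 00009880" or "sh2002000509".
    match = re.match(r"\s*([a-z]+)\s*\d+", lccn.lower())
    if match is None:
        raise ValueError(f"not a recognizable LCCN: {lccn!r}")
    # Prefixes outside the assumed table are reported as unknown.
    return LCCN_FILES.get(match.group(1), "unknown")


print(lccn_file("sh 00009880"))  # a number quoted in the post
print(lccn_file("n 12345678"))   # made-up number, for illustration only
```

A check like this tells you which file a record lives in, but not what kind of heading it is, which is the question that matters for use.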

The result is that a search in the LCSH file on the word "Italy" brings up 3,516 headings that have the word somewhere in the heading. However, the heading "Italy" alone is not included. You do have:
Italy, Central
Italy, Northern
Italy, Southern
and you have:
Italy, Northern--Civilization
Italy, Northern--Civilization--Germanic influences
etc.
But not "Italy."

A search in the name heading database on LC's online authority file yields a name heading entry for "Italy." That database (whose response is in the form of a browse list) has innumerable pages for corporate names under the initial term "Italy":
Italy.
Italy. Ambasciata (India)
Italy. Confederazione fascista degli industriali.
It also includes "Italy, Southern" with its LC control number "sh 85069035".

The upshot is that the LC Subject heading file at http://id.loc.gov is not the same as a subject heading search in the online authorities database. Nor is it always logical which file a heading falls into. The heading "Italy. Ambasciata (India)" is in the name heading file as a corporate name, but "Palazzo Dell'Ambasciata di Spagna (Rome, Italy)" is in the subject heading file as a corporate name. There undoubtedly is a set of rules that explains all of this, but it seems to me that a separation of the subject file and the name files creates a split between headings that will not be mirrored in actual use.

This may not matter if the files are combined in the end, and the URI makes it look like all authorities will have ids that directly follow "/authorities/" in the URI. However, although they are both coded as corporate names, the "Palazzo... " record gets the "cool URI" http://id.loc.gov/authorities/sh2002000509#concept. Note the ending in "concept". I don't know what hash ending will be given to entries from the names file, but I do find it odd that corporate names could have two different hash endings, depending on which file they are from. To be frank, especially since the division into different files doesn't seem terribly logical, and since many items in the name file can also be used as concepts, I would prefer that the "#" indicate the type of heading (personal name, corporate name, conference, geographical name, topic) rather than the file that it comes from. That is, the "#" would reflect the MARC tag: 100, 110, 111, 150, 151.
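As a rough illustration of that preference, the Python sketch below builds a URI whose fragment comes from the MARC tag rather than from the source file. The fragment names and the function are hypothetical; only the base URI, the control number, and the five tags come from the discussion above.

```python
# Hypothetical fragment names: the proposal is only that the fragment
# reflect the MARC authority tag, not what the fragments would be called.
HEADING_TYPES = {
    "100": "personal-name",
    "110": "corporate-name",
    "111": "conference",
    "150": "topic",
    "151": "geographic-name",
}

BASE = "http://id.loc.gov/authorities/"


def typed_uri(lccn: str, marc_tag: str) -> str:
    """Build a URI whose fragment names the heading type, not the file."""
    # Assumption: the URI uses the control number with whitespace removed,
    # as in the sh2002000509 example.
    normalized = "".join(lccn.split())
    return f"{BASE}{normalized}#{HEADING_TYPES[marc_tag]}"


print(typed_uri("sh 2002000509", "110"))
```

Under a scheme like this, a corporate name would carry the same fragment whether its record sits in the subject file or the name file.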

12 comments:

Jonathan said...

I think maybe they'd ALL get #concept endings, as SKOS 'concepts' either way?

But the semantic web URI stuff confuses me, I admit. I still don't really understand why any of those have #concept in the first place, instead of just being a straight URI path without the need for a fragment identifier.

dchud said...

These might be good questions to ask on the list set up for this purpose. (third paragraph just above the form)

Karen Coyle said...

Thanks, Dan. Hmmm. I never would have thought to look for a list under "Contact us". I'll join and post in both places (since I suspect that few others have found the list).

Jeffrey Beall said...

Jurisdictional place names are excluded, and I agree, it would be great if they could be added.

I also wish Flickr would handle this data more neatly.

Ryan Shaw said...

Jonathan, the #concept is an example of the Hash URI pattern, intended to distinguish between an abstract concept and an information resource representing that concept. It really doesn't serve any other purpose, and I agree it is confusing. I personally would prefer it if the LOC used http://id.loc.gov/authorities/sh2002000509 as the identifier for the abstract concept, which when resolved would 303 redirect to http://id.loc.gov/authorities/sh2002000509/page (or something similar), which would be an HTML page with RDFa embedded. But whatever, I'm just glad they finally got it back online and will be even happier when the various name authorities are added.
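A minimal Python sketch of the two patterns, for readers who want to see the difference: with a hash URI the client strips the fragment before making the request, while the slash pattern relies on the server answering with a 303. The `see_other` function and the `/page` suffix are hypothetical stand-ins for server behavior, not anything id.loc.gov is known to do.

```python
from urllib.parse import urldefrag

# Hash pattern: the fragment never reaches the server, so the document
# and the concept share one retrievable URL.
hash_uri = "http://id.loc.gov/authorities/sh2002000509#concept"
document_url, fragment = urldefrag(hash_uri)
print(document_url, fragment)


def see_other(concept_uri: str) -> tuple[int, dict[str, str]]:
    """Hypothetical server response for the slash pattern."""
    # The concept URI itself is never served; the client is sent on to a
    # separate document that describes the concept.
    return 303, {"Location": concept_uri.rstrip("/") + "/page"}


status, headers = see_other("http://id.loc.gov/authorities/sh2002000509")
print(status, headers["Location"])
```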

Bruce said...

Jurisdictional place names and personal and corporate names aren't really concepts, though. There's also the issue that the library world tends to have a rather different view of these issues than does most of the rest of the world. E.g. names refer to people and organizations. So that might be a different effort than the subject headings (for which SKOS is the obvious vocabulary).

Karen Coyle said...

Bruce, I think you and I are in agreement. If your book is about the Grand Canyon, then that geographical entity is the subject of that book. It doesn't make the Grand Canyon any less geographical, but you need to be able to say that the book is "about" whatever it is about.

The way I see this is that the Grand Canyon is always the Grand Canyon -- that is its essence. It being the subject of a book doesn't change it from the Grand Canyon to something else. I'd like "subject of the book" to be a relationship between a book and an entity. The entity could be anything. The topical entities (those that aren't things in the real world) can be in a SKOS vocabulary. What would be the best way to define the rest? OWL?

Ryan Shaw said...

Bruce and Karen, I don't think the distinction between "topical entities" and "things in the real world" is as clear as you assume it to be. There are a lot of different ways to think about these things, but from one perspective "The Grand Canyon", "Napoleon Bonaparte", and "Socialism" are all just names we associate with groups of sentences. So I don't see any problem with having them all in a SKOS vocabulary. That doesn't preclude having them in other kinds of vocabularies or ontologies for other purposes.

Jonathan said...

But "Jurisdictional place names and personal and corporate names" definitely are "real world resources" not "information resources", right?

So, Bruce, you disagree with Ryan, and think that in addition to the hash tag being used to differentiate "real world" from "information", different hash tags need to be used to differentiate "concepts" from "other things"?

Man, I think these are ontological questions which individual implementers need to be free to figure out on their own; there's no way to embed them in common infrastructural 'standards'. Different contexts will have different ideas on whether Italy is a "concept" or not. Not to mention that it's all _really confusing_, which doesn't bode well for its widespread adoption.

I have to admit that I'm no fan of httpRange-14, as heretical as it is to say that these days. I think it's overly complicated, abstract, and confusing: an attempt to find an elegant theoretical solution to a problem, which doesn't really solve the problem at all; it just puts some theoretical frosting on top.

What's wrong with Ryan's suggestion to "use http://id.loc.gov/authorities/sh2002000509 as the identifier for the abstract concept, which when resolved would 303 redirect to http://id.loc.gov/authorities/sh2002000509/page" (or maybe http://id.loc.gov/authorities/sh2002000509/html and http://id.loc.gov/authorities/sh2002000509/rdf)?

That would seem a LOT less confusing to me, and avoid the issue of whether something "really" is a concept or not -- shouldn't what it really "is" be asserted in RDF, that can be changed, and that can be asserted differently in different communities, rather than be embedded in a URI that hopefully will not change and will be used across contexts and communities? What would you lose by doing things this way? I don't get it.

Karen Coyle said...

Thanks for this great discussion, folks!

I think we've got two threads here: what will make LCSH and LC names work better, and what would be our ideal system for identifying concepts. LC already has divided the world between "names of things" and "topics" only I think it has done so rather inconsistently (or at least, I can't find the logic). I also think it doesn't work as linked data (but that's the next post, which I'm still working on).

The advantage of identifying entities (regardless of realness or not) and then allowing them to have relationships is that you can share the identity even if you and someone else want to use them in different relationships. If we start giving "Italy" separate identities for Italy as a subject and Italy as a geo-political entity, then we're going to end up with a lot of different identifiers for the same thing. The definition should be neutral so that the entity can be used in a variety of relationships ("is author of" "is subject of" "is owner of").

If your definition is about the entity, not about its use in a particular set of metadata, then it can be re-used by others. So the Ryan/Jonathan URI without the hash, along with a combined authorities file from LC, would get my vote (with caveats that are to follow).

Meanwhile, I still want to know if LC coded the term as a personal name, geographic name, topic, etc. This is going to help me in using these terms in my own metadata.

Bruce said...

Just to go back to this as I come across it again to clarify my point: I'm just saying that "Grand Canyon" isn't a "concept"; it's a geographic place. "Napoleon" isn't really the name, but the person.

Karen Coyle said...

Bruce, I've just finished reading FRSAR and it confirms for me that in the library world the name is the object of the subject heading, not the real world thing or even the concept of the real world thing. I disagree with this because names are ambiguous, but it's clear to me that what library cataloging cares about is the name. So the name "Napoleon" is the object of the heading, not the person. This definitely separates the library world from the Semantic Web. I'll post on FRSAR shortly.