Geospatial Metadata Reconciliation

At UC San Diego, we are currently in the midst of our subject reconciliation project for the DAMS. Nearly all of our more than 12,000 local subject authorities (topics, people/corporations, places, species, etc.) will be getting linked data URIs. I’ve outlined the process in previous posts, but it generally runs as follows:

  • Clean the existing labels via some simple regular expressions, as sketched after this list (Note: Rawson and Muñoz have a wonderful post about how we should be more precise about the word ‘cleaning’ with regard to data)
  • Reconcile to FAST via an OpenRefine reconciliation service, which gets us FAST URIs for topics but also provides handy MARC tags to classify as many other subject types as we can (a sketch of a batch reconciliation call also follows the list)
  • Break up complex subjects that didn’t reconcile, and iteratively reconcile and break them up again until matches are made
  • Reconcile to different vocabularies for the appropriate types (id.loc.gov, GeoNames, VIAF, etc.) if need be, informed by both our own RDF predicates and information (like MARC tags) brought back from the FAST service
  • Clean up our local authorities further based on what the reconciliation process surfaces, specifically our bad labels
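
Here is a minimal sketch in Python of the cleaning and breaking-up steps. The patterns are illustrative stand-ins, not our actual expressions, and the ‘--’ split assumes pre-coordinated LCSH-style headings:

```python
import re

# Illustrative cleanup patterns; our real expressions differ.
CLEANUP_PATTERNS = [
    (re.compile(r"\s{2,}"), " "),        # collapse runs of whitespace
    (re.compile(r"\s+([,;:.])"), r"\1"), # no space before punctuation
    (re.compile(r"\.$"), ""),            # drop a trailing period
]

def clean_label(label: str) -> str:
    """Run each pattern over the label and trim the result."""
    for pattern, replacement in CLEANUP_PATTERNS:
        label = pattern.sub(replacement, label)
    return label.strip()

def split_complex(subject: str) -> list[str]:
    """Break a pre-coordinated heading on '--' subdivisions so the
    pieces can be reconciled individually."""
    return [part.strip() for part in subject.split("--")]

print(clean_label("San Diego  (Calif.) ."))         # San Diego (Calif.)
print(split_complex("San Diego (Calif.)--History")) # ['San Diego (Calif.)', 'History']
```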

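The reconciliation calls themselves follow the OpenRefine reconciliation API. This sketch assumes a FAST reconciliation service (for example, lawlesst/fast-reconcile) running at a hypothetical local URL:

```python
import json
import requests

# Hypothetical endpoint; point this at wherever your service runs.
RECONCILE_URL = "http://localhost:8000/reconcile"

def reconcile(labels: list[str]) -> dict:
    """Send a batch of labels using the reconciliation API's
    multi-query form and return candidates keyed by query id."""
    queries = {f"q{i}": {"query": label} for i, label in enumerate(labels)}
    resp = requests.post(RECONCILE_URL,
                         data={"queries": json.dumps(queries)},
                         timeout=30)
    resp.raise_for_status()
    return resp.json()

candidates = reconcile(["Jalisco", "San Diego"])
for qid, payload in candidates.items():
    for result in payload.get("result", []):
        print(qid, result["id"], result["name"], result["score"])
```
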
The most difficult part of the process so far (besides automating reconciliation against ever-ambiguous name authorities) revolves around places: the geospatial. A really good intro to some of the issues in geospatial metadata in libraries is in this post by Christina Harlow. Thankfully, some of this existential angst is not applicable to our situation, as we are not dealing with MODS and XML. Our authorities already live in RDF, so simply saying that “Jalisco” is the same as “http://sws.geonames.org/4004156/” is sufficient; we can then pull whatever information we need from that external resource, like geocoordinates, variant labels in different languages, etc.
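
Recording that equivalence is a single triple. A minimal sketch with rdflib, using a made-up local URI since I’m eliding our actual identifier scheme:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import OWL, SKOS

g = Graph()

# Hypothetical local authority URI; our actual identifiers differ.
local = URIRef("https://library.ucsd.edu/authority/jalisco")
geonames = URIRef("http://sws.geonames.org/4004156/")

g.add((local, SKOS.prefLabel, Literal("Jalisco")))
# owl:sameAs matches the "is the same as" language above;
# skos:exactMatch is a softer alternative if full identity is too strong.
g.add((local, OWL.sameAs, geonames))

print(g.serialize(format="turtle"))
```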

However, the data we’re reconciling is a mess. As Christina points out, library cataloging practice means many of the state abbreviations are ancient, and GeoNames has no entries that remotely conform to them. As you can imagine, we have a lot of collections about places in California, which means we have thousands of geographic subjects ending in ‘(Calif.)’. Worse still are headings with multiple levels in the parentheses, like ‘(San Diego, Calif.)’.

I spent quite a while figuring out how the GeoNames reconciliation service could overcome this hurdle, and none of my experimentation yielded results better than around 50% reconciliation. Although you might think that is a satisfactory result, the service often seemed to be matching on little more than string length, returning terms with roughly as many characters as the heading, parenthetical content included, which made for some pretty bad results. In addition, OpenRefine services like these are generally limited to displaying three results at most, so we were often faced with three very low-quality matches. We knew the ‘correct’ place was in the candidate pool; it just wasn’t displayed.

No matter what I tried, the results were almost always near 100% reconciliation to id.loc.gov, 75% to FAST, and 50% to GeoNames.

I ended up with a very clumsy but effective solution. Since the content of the parentheses was causing the GeoNames service issues, I decided to try simply deleting anything within parentheses, using the rather brute-force regular expression \(.+\). I did this solely to experiment and perhaps establish a baseline to work from, but to my surprise, reconciliation of about 150 locations went from ~50% to nearly 85%. Furthermore, the candidates that came back for ambiguous locations were high quality; only a few headings were so ambiguous that the correct match still fell outside the three displayed results. In the end, after a bit of selection from the returned matches, we were closer to 90% reconciliation. Applied to thousands of subjects, having to review only 10% of places manually instead of nearly half is a rather big deal in a project this large.
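
In Python terms, the experiment amounts to something like this (the sample headings are made up, but the expression is the one above):

```python
import re

# The brute-force expression from above. Because .+ is greedy, a heading
# with more than one parenthesized group loses everything from the first
# '(' to the last ')'.
PARENS = re.compile(r"\(.+\)")

def strip_qualifier(heading: str) -> str:
    """Delete any parenthetical qualifier and trim what remains."""
    return PARENS.sub("", heading).strip()

for heading in ["Jalisco (Mexico)",
                "San Diego (Calif.)",
                "Balboa Park (San Diego, Calif.)"]:
    print(strip_qualifier(heading))
# Jalisco
# San Diego
# Balboa Park
```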

Obviously, though, this larger situation is not where we want to be. The needs and use cases are pretty well articulated at this point, but the solutions are still not available. BIBFRAME’s cataloging interfaces seem to grasp this need fairly well: when a cataloger starts typing into a field controlled by an authority, the interface automatically queries linked data sources and brings back suggestions, which are then stored as URIs. Unfortunately we are not there yet for our systems, although Hydra as a community certainly understands the need, and I have learned that gems like questioning authority are alive and well and would provide a similar experience. It still remains to be hammered out how you would provide that kind of on-the-fly autocomplete matching to authority URIs in a batch situation, but I have no doubt that will eventually be explored; a very rough sketch of one possible shape follows.
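
This sketch uses GeoNames’ public searchJSON endpoint to do, in batch, roughly what an autocomplete widget does per keystroke. Any real workflow would need its own GeoNames username plus scoring and human review on top:

```python
import requests

GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"
USERNAME = "demo"  # replace with your own GeoNames account name

def suggest(place: str, max_rows: int = 5) -> list[dict]:
    """Ask GeoNames for candidate matches for one place label."""
    resp = requests.get(GEONAMES_SEARCH, params={
        "q": place, "maxRows": max_rows, "username": USERNAME,
    }, timeout=10)
    resp.raise_for_status()
    return resp.json().get("geonames", [])

for candidate in suggest("Jalisco"):
    print(candidate["name"], candidate["geonameId"])
```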

You also might question why I’m reconciling to GeoNames and not some other source. I am actually very open to alternatives; I consider GeoNames a “good enough for now” service. After some recent discussions with experts in our supercomputing center, it sounds like we should sit down with them and explore different linked data sources for geospatial authorities.

Reconciliation is still a very new process. While it is fun to experiment as I have, you inevitably run up against the limitations of tools like OpenRefine or of your own coding skill. I know people like Christina Harlow and others are looking into building better tools specifically for reconciliation: more comprehensive, more agnostic (OS-wise, hardware-wise, language-wise), and overall more valuable. But in the meantime, we just have to bash things together until they work.

Written on July 14, 2016