Saturday, December 10, 2016

Aggregating institutional repositories - A rethink

In recent months, I've become increasingly concerned about the competition that individual, siloed institutional repositories face from bigger, more centralised repositories such as subject repositories and commercial competitors like ResearchGate.

In a way the answer seems simple: just get someone to aggregate all the institutional repositories on one site and start building services on top of that to compete. Given that all institutional repositories already support OAI-PMH, this seems to be a trivial thing to do. Yet I'm coming to believe that in most cases, creating such an aggregator is pointless. Or rather, if your idea of an aggregator is simply getting an OAI-PMH harvester, pointing it at the OAI-PMH endpoints of your members' repositories, dumping everything into a search interface like VuFind (or even something commercial like Summon or EDS) without any other attempt to standardise metadata, and calling it a day, you might want to back off a bit and rethink. For the aggregator to add value, you will need to do more work.

A simplistic history of aggregation in libraries

Let me tell you a story...

In the 90s, libraries began to offer online catalogues to allow users to find out for themselves what was available in their (mostly print) collections. These sources of information were siloed, and while they were on the web, they were mostly invisible to web crawlers. The only way you could find out what libraries had in their collections was to go to each of their catalogues and search.

So someone said "Why don't we aggregate them all together?" and union catalogues (including virtual union catalogues based on federated searching) were built, e.g. Copac. People could now search across various silos in one place and all was well.

Librarians and scholars used such union catalogues to decide what to request via interlibrary loan (ILL) and from whom, and to make collection decisions. Many were still invisible to Google and web search engines (except for a few innovators like OCLC), but it was still better than nothing.

By the late 90s and early 2000s, libraries began to create "digital libraries" (e.g. with the Greenstone digital library software). It was the wild west, and digital libraries at the time built up digital collections consisting of practically anything of interest, such as digitized images of music scores, maps and photographs - anything except peer reviewed material. Most material in digital libraries was difficult to find or invisible via web search engines for various reasons (e.g. the non-text nature of the content, lack of support for web standards etc.), so some degree of aggregation at national or regional levels made sense.

Today larger collections like Europeana exist and all was well.

Then came the rise of institutional repositories, and by the 2010s most universities had one.

What mainly distinguished institutional repositories from their predecessors was that many of them were designed around distributing scholarly, peer reviewed (or likely to be peer reviewed) content.

While it's true that many institutional repositories contain a healthy electronic thesis collection, and some even inherited the mission of what would earlier have been called digital libraries and carry grey literature and other digital objects such as data, the main focus was always on textual journal articles.

The other major difference is that by then all institutional repositories worth the name supported the OAI-PMH standard, which made harvesting and aggregating the metadata of their content easy.
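To give a feel for how little is involved, here is a minimal sketch of the harvesting step in Python. It only builds a ListRecords request and parses an unqualified Dublin Core response. The endpoint URL and the sample record are made up for illustration, the helper names are my own, and a real harvester would also need to fetch the URL and follow resumption tokens, which this sketch leaves out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Return one dict per record, mapping DC element names to lists of values."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        fields = {}
        for element in record.iterfind(".//oai_dc:dc/*", NS):
            name = element.tag.split("}", 1)[-1]  # drop the namespace part
            fields.setdefault(name, []).append((element.text or "").strip())
        records.append(fields)
    return records


# Hand-written sample response (illustrative only, not from a real repository).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.edu:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample article</dc:title>
          <dc:creator>Tan, A.</dc:creator>
          <dc:subject>open access</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://repo.example.edu/oai/request"))
print(parse_records(SAMPLE))
```

Notice that nothing in the parsed record says whether a full text file is attached, which matters later in this post.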

And of course, the same logic seems to suggest itself again: why not aggregate all the contents together? And today we have global aggregators like CORE (not this other CORE - Commons Open Repository Exchange), BASE and OAIster, as well as national and regional aggregators built around associations and organizations.

In my region, for example, there's the AUNILO (ASEAN University Network Inter-Library Online) institutional repository discovery service, which aggregates content from over 20 institutional repositories in ASEAN. Most university libraries in Singapore are also part of PRRLA (Pacific Rim Research Library Alliance, formerly PRDLA), which also has a Pacific Rim Library (PRL) project built around OAI-PMH harvesting.

I'm sure similar projects exist all around the world, based on aggregating data by harvesting via OAI-PMH harvesters. And yet I'm coming to believe that in most cases, creating such an aggregator is pointless unless additional work is done.

Or rather, if your idea of an aggregator is simply getting an OAI-PMH harvester, pointing it at the OAI-PMH endpoints of your members' repositories, dumping everything into a search interface like VuFind (or even something commercial like Summon or EDS), and calling it a day, you might want to back off a bit and rethink.

I argue that unlike union catalogues or aggregations of digital libraries (by which I mean collections other than the traditional institutional repository of text based scholarly articles), aggregation of institutional repositories is likely to be pointless unless you bring more to the table.

Here's why.

1. Items in your institutional repository are already easily discoverable

Unlike in the case of most library catalogues, items in your institutional repository are already easily findable in Google and Google Scholar. There is little value in creating an aggregator when such excellent and popular ones as Google and Google Scholar exist.

101 Innovations in Scholarly Communication - 89% use Google Scholar to search for literature/data

Given the immense popularity of Google Scholar, what would your simple aggregator based around OAI-PMH offer that Google Scholar does not that would make people come to your site to search?

2. Most simple repository aggregators don't link reliably to full text or even index full text

Union catalogues existed at a time when it was acceptable for users to find items that had no full text online. You used them to find which libraries had the print holdings, and either went down there to view the item or used interlibrary loan to get it.

In today's world, direct to full text is the expected paradigm and you get undergraduates wondering why libraries bother to subscribe to online subject indexes that show items the library may not have access to.

Now how much worse do you think they feel when they search one of your repository aggregators and realise they can't figure out which item has full text or not until they click on it? This is where a glaring weakness in OAI-PMH rears its head.

I first encountered this problem when setting up my web scale discovery service, Summon, a few years back. I was surprised to realise that while I could easily harvest entries from my institutional repository (DSpace) into Summon via OAI-PMH, I couldn't easily get Summon to recognise whether an item from the DSpace repository had full text or not.

I remember being stunned to be told that there was no field among the default DSpace fields that indicated whether full text was present.

This sounds crazy by today's standards, but a little understanding of the context of the time (1999) when OAI-PMH came about helps. It's a long story, but (correct me if I'm wrong) it was conceived at a time when the preprint server arXiv was the model, and it was envisioned that repositories would consist of 100% full text items, so there was no need for such a standard field. Of course, this is not what happened. Due to varying goals for what an institutional repository should be, and the reluctance of researchers to self deposit, we have a mix of full text and metadata-only items.

Another quirk of OAI-PMH that might surprise many is that it only allows harvesting of metadata, not full text. In today's world, where full text is king, people are accustomed to web search engines (and the library full text databases that have followed their lead) matching against the whole document, and have search habits designed for that. They find aggregators based around OAI-PMH, which contain only metadata, odd to use. This is the same problem many students have with traditional catalogues.

I understand there are algorithmic workarounds that try to determine whether full text exists, and some aggregators use them with varying results, but many don't and just display everything they grab via OAI-PMH.
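As a rough illustration of what such a workaround might look like, here is a sketch that guesses from an unqualified Dublin Core record whether a file is attached. The heuristics (a file-like MIME type in dc:format, or a URL ending in a document extension in dc:identifier or dc:relation) are my own assumptions, not what any particular aggregator actually does, and they will miss repositories that expose only landing page URLs.

```python
import re

# Assumed heuristics: hints that a record points at a file rather than only a landing page.
FILE_URL = re.compile(r"^https?://\S+\.(pdf|docx?|ps)(\?\S*)?$", re.IGNORECASE)
FILE_FORMATS = {"application/pdf", "application/postscript"}


def guess_has_full_text(record):
    """Guess whether an unqualified Dublin Core record has full text attached.

    `record` maps DC element names to lists of string values.
    True means a hint was found; False means "unknown", not "definitely absent".
    """
    for value in record.get("format", []):
        if value.strip().lower() in FILE_FORMATS:
            return True
    for field in ("identifier", "relation"):
        for value in record.get(field, []):
            if FILE_URL.match(value.strip()):
                return True
    return False


with_file = {"identifier": ["https://repo.example.edu/bitstream/123/paper.pdf"]}
landing_only = {"identifier": ["https://repo.example.edu/handle/123"]}
print(guess_has_full_text(with_file), guess_has_full_text(landing_only))
```

The asymmetry in the return value is the point: the metadata can sometimes confirm that a file exists, but it can never confirm that one doesn't.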

To top it all off, Google Scholar has none of these problems. They can pretty reliably identify whether the full text exists and where, and combined with the library links program, this lets you easily tell if you have access to the item.

They crawl and index the full text, and can find items based on matching in full text and can often provide helpful search snippets before you even click into the result.

A vanity search of myself allows me to see where my name appears in context in the full text not just in abstracts

3. Aggregation doesn't have much point due to lack of consistency in standards

Think back to union catalogues of traditional catalogues, back then called OPACs. The nice thing about them was that most were created using the same consistent standards.

There was MARC, call number schemes like LCC/DDC/UDC, and subject heading standards like LCSH/MeSH that you could crosswalk. So you could browse by subject headings or call numbers.

I'm probably painting too positive a view of how consistent those standards were, but I think it's fair to say that in comparison institutional repositories are in an even worse state.

Under the heading for "Minimal Repository Implementation" in "Implementation Guidelines for the Open Archives Initiative Protocol for Metadata Harvesting" we see it advises that "It is important to stress that there are many optional concepts in the OAI-PMH. The aim is to allow for high fidelity communication between repositories and harvesters when available and desirable."

Also, under the section on Dublin Core, which today is pretty much the default, we see "Dublin Core (DC) is the resource discovery lingua franca for metadata. Since all DC fields are optional and repeatable, most repositories should have no trouble creating at least a minimal mapping of their native metadata to unqualified DC."

Clearly, we see the original framers of OAI-PMH decided to give repositories a lot of flexibility on what was mandatory and what wasn't and only specified a minimum set.

In addition, the "lingua franca for metadata", unqualified Dublin Core, was perhaps in hindsight not the best option, not when most of your content is journal articles.

Even Google Scholar recommends against the use of Dublin Core in favour of other metadata schemes like Highwire Press tags, Eprints tags or BEpress tags.

In the section on getting indexed in Google Scholar, they advise repository owners to "use Dublin Core tags (e.g., DC.title) as a last resort - they work poorly for journal papers because Dublin Core doesn't have unambiguous fields for journal title, volume, issue, and page numbers."
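To make the contrast concrete, here is a small sketch that renders an article's details as Highwire Press style meta tags, where journal title, volume, issue and pages each get their own field. The citation_* tag names are the ones Google Scholar documents. The function name and the keys of the input dict are hypothetical, chosen just for this example.

```python
from html import escape

# Highwire Press tag name -> key in our (hypothetical) article dict.
OTHER_FIELDS = [
    ("citation_publication_date", "date"),
    ("citation_journal_title", "journal"),
    ("citation_volume", "volume"),
    ("citation_issue", "issue"),
    ("citation_firstpage", "first_page"),
    ("citation_lastpage", "last_page"),
    ("citation_pdf_url", "pdf_url"),
]


def highwire_meta_tags(article):
    """Render an article dict as a list of Highwire Press <meta> tags."""

    def tag(name, value):
        return f'<meta name="{name}" content="{escape(str(value))}">'

    tags = []
    if article.get("title"):
        tags.append(tag("citation_title", article["title"]))
    # One citation_author tag per author, in order.
    tags.extend(tag("citation_author", author) for author in article.get("authors", []))
    for name, key in OTHER_FIELDS:
        if article.get(key):
            tags.append(tag(name, article[key]))
    return tags


example = {"title": "Repositories & aggregators", "authors": ["Tan, A."],
           "journal": "Journal of Examples", "volume": "12", "first_page": "34"}
print("\n".join(highwire_meta_tags(example)))
```

In unqualified Dublin Core, the journal name, volume and page would have to be squeezed into generic elements such as dc:source or dc:identifier, and a harvester would have to guess which is which.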

Even something as fundamental today as the DOI (and in the future ORCID) isn't mandated.

I found that out recently when a very useful service, which lets you input a DOI and find a copy in repositories (among other methods, it searches items indexed in BASE), failed for our institutional repository. The DOI identifier wasn't in our unqualified Dublin Core feed, and that feed was what BASE picked up. The lack of standards is holding repositories back.
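For what it's worth, once a DOI is exposed in dc:identifier, picking it out on the harvesting side is not hard. Below is a minimal sketch. The regular expression is a common approximation for modern DOIs rather than a complete grammar, and the function name is my own.

```python
import re

# Approximate pattern for modern DOIs (an assumption, not a full DOI grammar).
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(identifiers):
    """Return the first DOI found among dc:identifier values, lowercased, or None."""
    for value in identifiers:
        match = DOI_PATTERN.search(value)
        if match:
            # Strip trailing punctuation that often clings to DOIs in free text.
            return match.group(0).rstrip(".,;)").lower()
    return None


print(extract_doi(["https://repo.example.edu/handle/123", "doi:10.1234/ABC.5678"]))
print(extract_doi(["https://repo.example.edu/handle/123"]))
```

The second call returns None, which is exactly the situation our repository was in: with only a handle URL in the feed, no amount of clever parsing on the aggregator's side can recover the DOI.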

Leaving this aside, most institutional repository content does not use controlled vocabulary for subject headings or subject classification, though it could easily do so. I'm not sure why this happened (I have a feeling that until recently the people working on institutional repositories were not the same people working on cataloguing).

As a result, you lose something catalogues gave you: there, once you have aggregated all the content, you can easily slice it by discipline (e.g. LC call number range) or by subject headings (e.g. LCSH).

With aggregators of repositories you get a mass of inconsistent data. Your subjects are the equivalent of author supplied keywords and there is no standardised way to filter to specific disciplines like Economics or Physics.

The more I think about it, the more I feel this lack of standardisation is hurting repositories.

For example, I love the Digital Commons Network, which allows me to compare and benchmark performance across all papers in the same discipline posted via Digital Commons repositories. This is possible only because Digital Commons, as a hosted service, has a standardised set of disciplines.

What should your aggregator of repositories do?

So if you have read all this and are undeterred, and still want to create an aggregator of institutional repositories, what should you do?

Here are some of the things I think you should shoot for beyond just aggregating everything and dumping it into one search box.

a) Try to detect reliably if an entry you harvested has full text

b) Try to index full text not just metadata

CORE seems to match full text in my search?

One way to reliably detect whether full text exists is to agree on a metadata field indicating full text that every repository you harvest from must provide. But that currently won't scale to a global level. Another way is to crawl the repositories and extract the PDF full text.
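As a sketch of the crawling route, the snippet below looks for citation_pdf_url meta tags (the Highwire Press tag that Google Scholar reads) in a record's landing page. This assumes the repository emits such tags, which not all do. The sample HTML and the helper names are invented for illustration, and actually fetching the page and downloading the PDF are left out.

```python
from html.parser import HTMLParser


class PdfLinkFinder(HTMLParser):
    """Collect the content of every <meta name="citation_pdf_url"> tag."""

    def __init__(self):
        super().__init__()
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "citation_pdf_url":
            content = attributes.get("content")
            if content:
                self.pdf_urls.append(content)


def find_pdf_urls(html_text):
    """Return the PDF URLs advertised in a landing page's meta tags."""
    finder = PdfLinkFinder()
    finder.feed(html_text)
    return finder.pdf_urls


# Made-up landing page for illustration.
LANDING_PAGE = """<html><head>
<meta name="citation_title" content="A sample article">
<meta name="citation_pdf_url" content="https://repo.example.edu/bitstream/123/paper.pdf">
</head><body>...</body></html>"""

print(find_pdf_urls(LANDING_PAGE))
```

An empty result here would suggest a metadata-only record, or a repository that simply doesn't emit the tag, so an aggregator would still need a fallback.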

Ideally the world should be moving away from OAI-PMH and start exposing content using newer methods like ResourceSync, so that more than metadata alone is synced. I understand that the PRRLA is working on a next generation repository among its members that will use ResourceSync.

c) Create consistent standards among repositories you are going to harvest

If you are going to aggregate repositories from, say, a small set of member institutions, it is important to focus not just on the tech but also on metadata standards. It's going to be tough, but if all member institutions can agree on mapping to a standard (hint: look at this), perhaps even something as simple as a mapping to disciplines, the value of your aggregator increases a lot.
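Even the simplest version of such a mapping can be sketched in a few lines. The crosswalk table below is entirely hypothetical. In practice it would be whatever list of disciplines the member institutions agree on, and the keywords that fail to map are probably the most useful output, since they tell you where the agreement still has gaps.

```python
# Hypothetical crosswalk from free-text keywords to an agreed list of disciplines.
DISCIPLINE_MAP = {
    "economics": "Economics",
    "macroeconomics": "Economics",
    "econometrics": "Economics",
    "physics": "Physics",
    "quantum mechanics": "Physics",
    "condensed matter": "Physics",
}


def map_disciplines(subjects):
    """Map dc:subject keywords to disciplines.

    Returns (sorted list of disciplines, list of keywords that could not be mapped).
    """
    disciplines = set()
    unmapped = []
    for subject in subjects:
        key = " ".join(subject.lower().split())  # normalise case and whitespace
        if key in DISCIPLINE_MAP:
            disciplines.add(DISCIPLINE_MAP[key])
        else:
            unmapped.append(subject)
    return sorted(disciplines), unmapped


print(map_disciplines(["Macroeconomics", "  Quantum   Mechanics", "tea"]))
```

With a discipline attached to every record, the aggregator can finally offer a "show me only Economics" filter, which author supplied keywords alone cannot support.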

d) Value added services and infrastructure beyond user driven keyword discovery

Frankly, aggregating content just for discovery isn't going to be a game changer, even if you provide the best experience, with consistent metadata that allows browsing, full text indexing and so on, because services like Google Scholar are already good enough.

So what else should you do when you aggregate a big bunch of institutional repositories? This is where it gets vague, but the ambitions of SHARE, while big, show that aggregators should go beyond just supporting keyword based discovery.

See for example this description of SHARE

"For these reasons, a large focus of SHARE’s current grant award is on metadata enhancement at scale, through statistical and computational interventions, such as machine learning and natural language processing, and human interventions, such as LIS professionals participating in SHARE’s Curation Associates Program. The SHARE Curation Associates Program increases technical, curation confidence among a cohort of library professionals from a diverse range of backgrounds. Through the year-long program, associates are working to enhance their local metadata and institutional curatorial practices or working directly on the SHARE curation platform to link related assets (such as articles and data) to improve machine-learning algorithms."

SHARE isn't alone. Other "repository networks" include OpenAIRE (Europe), LA Referencia (Latin America) and NII (Japan), which work along similar lines, trying to standardise metadata etc.

Others have talked about adding a social layer over aggregated data, similar to ResearchGate, or providing an infrastructure for new forms of scholarly review and evaluation.

Towards a next generation repository?

In past posts I've been trying to work out my thinking on institutional repositories. It's a complicated subject, particularly given the competition from larger, more centralised subject and social repositories like ResearchGate.

I'm coming to think that to counter this, individual smaller repositories need to link up, yet this cannot currently be done in an effective way.

This is where "next generation repositories" come in. You have probably heard about them, most prominently under the umbrella of COAR (Confederation of Open Access Repositories).

What I have described above is in fact my layman's understanding of what the next generation repositories must achieve (For a more official definition see this) and why.

Officially, the next generation repositories work focuses on repository interoperability (see The Case for Interoperability and The Current State of Repository Interoperability), which includes working groups on controlled vocabulary, open metrics and even linked data.

All this is necessary for institutional repositories to take their place as necessary and equal partners in the scholarly communication network.


I had the opportunity to attend the Asian Open Access Summit in Kuala Lumpur in November and learned a lot. In particular, the talk on repository networks by Kathleen Shearer from COAR helped clarify my thinking on the subject.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.