Enabling Computational Access at Scale: Are Repositories Serving Collections as Data?

Always Already Computational at Open Repositories 2018: Report from Bozeman, Montana

The annual international Open Repositories conference offered an opportunity to hone in on the technical infrastructure and user-centered design behind collections as data efforts. On June 6, 2018 Always Already Computational: Collections as Data project team members Hannah Frost (Stanford University) and Sarah Potvin (Texas A&M University) convened a panel that featured collections as data work from MIT, the University of Pennsylvania, Simon Fraser University, the University of British Columbia, and the Max Planck Institute for Psycholinguistics. This set of featured projects included homegrown applications and evaluation and use of some of the most popular open source repository platforms: DSpace, Samvera/Fedora, and Islandora.

Panelists Helen Bailey (MIT), Kate Lynch (UPenn), and Mark Jordan (Simon Fraser) addressed the following question:

How do today’s open source repository systems support or inhibit working with collections as data?

For the full experience watch a recording of the panel and reference panelist slides.

livestream recording

Hannah Frost opened the panel by introducing the concept of collections as data, providing an update on the Always Already Computational project, and acknowledging complementary efforts such as the Library of Congress Collections as Data meetings and COAR’s Next Generation Repositories report (2017).

Helen Bailey spoke about the evolution of MIT’s attempts to provide API access to their collection of electronic theses and dissertations (ETDs), which were published in DSpace@MIT. She described the group’s tests of Fedora 4 in their initial work, and their development of Backrest, a backport to enable API REST access to their older version of DSpace. Bailey concluded by previewing ongoing work to support MIT’s API-first approach to service engineering. For more on MIT’s Text and Data Mining project for ETDs, see Richard Rodgers’s Collections as Data Facet and view Bailey’s presentation to our 2018 Forum.

Mark Jordan offered three examples of collections as data implementations. The first two - Simon Fraser’s “Fetch Collections” tool and The Max Planck Institute for Psycholinguistics’ Language Archive implementation - use Islandora, and are not yet in production. The third - the University of British Columbia’s Open Collections project - is a bespoke interface that unifies digital collections managed in several repositories. The Max Planck Institute example serves a targeted need: users seeking to train software, typically in speech and speaker recognition, using the Institute’s collection of researcher data, which includes audio and video language corpus data representing languages from around the world. Users are able to select collections in Islandora and request a zip file, which is generated using an Islandora ZIP Download module developed by Discovery Garden. Simon Fraser’s “Fetch Collections” tool is designed to be downloaded by users to run functions locally on Islandora collections. The University of British Columbia’s Open Collections project delivers digital objects and metadata in a variety of forms and formats. This fully-fledged collections as data implementation packages API access with documentation and tools for building queries to run across output.

Kate Lynch rounded out the panel by detailing and demoing an ongoing project at the University of Pennsylvania to bring the OPenn collections as data implementation into a repository. The OPenn open access manuscript collection, currently housed in a locally-developed, standalone application that provisions multiple modes of access, will be replicated in Colenda, Penn Libraries’ Samvera-based repository. The Libraries will integrate storage, delivery, and preservation across the two applications. Colenda’s emphasis on long-term preservation and version tracking will provide enhancements such as persistent opaque identifiers and IIIF protocol implementation. (For more on the OPenn project, see Dot Porter’s Collections as Data Facet and view Porter’s presentation to our 2018 Forum.)

The five cases detailed by Bailey, Jordan, and Lynch offered insight into how repositories are being adapted and extended to provision access to collections as data beyond OAI-PMH or APIs, how libraries are integrating collections as data as a core service, and the support needed to scaffold this access.

We hope that Open Repositories will continue to host fruitful exchanges about collections as data, and to frame community-supported work to make repository platforms more conducive to delivering collections as data.

Sarah Potvin

Hannah Frost