
By Jeffrey Edmunds, Digital Access Coordinator at the Penn State University Libraries
The phrase “Open Access” evokes, for most of us, Open Access resources: OA books, OA articles, OA journals. Equally important, however, and often overlooked, is the metadata describing them. The visibility and discoverability of Open Access resources depend in part on good metadata, and since OA materials fall outside the traditional library workflows of acquisitions and cataloging, they are frequently under-described. A lack of good metadata impedes their discovery and lessens their visibility in the scholarly communication ecosystem.
Metadata for Open Access books originates with publishers, who generally use a metadata format known as ONIX (ONline Information eXchange). To be useful for libraries, ONIX data needs to be transformed into MARC (MAchine Readable Cataloging), the international standard for the capture and exchange of bibliographic metadata. MARC records are then loaded into library discovery systems such as catalogs to facilitate search and retrieval of Open Access materials.
Unfortunately, the current ecosystem remains biased against Open Access resources: most academic libraries rely on a tiny number of large publishers and discovery providers for both their electronic resources, locked behind paywalls, and the systems to manage them. It is likewise skewed against the free and open sharing of metadata.
WorldCat, the world’s largest union catalog, with over 76,000,000 records as of September 2023, is a proprietary database, and OCLC members pay a fee to use it. Over the past several years, OCLC has taken an increasingly litigious stance on “its” records, going so far as to place copyright notices in some records and to block through legal challenges the free sharing of WorldCat records. While OCLC’s attempts to prevent the co-optation of its metadata by for-profit entities like Clarivate are understandable, the locking down of bibliographic records in its database runs counter to the ethos of shared cataloging that libraries champion as a core value. The result of this stance is that OCLC records are not Open Access and therefore are not freely sharable as such.
Another problematic aspect of records describing Open Access resources in WorldCat is that OCLC and Program for Cooperative Cataloging (PCC) policies preclude the explicit tagging of MARC records with OA markers. The PCC has a “provider-neutral e-resource” policy,[1] which means that a MARC 506 field, used to record licensing restrictions, cannot be used for an Open Access note:
506 \\$aOpen Access$fUnrestricted online access$2star
Similarly, the 856$7 subfield[2], adopted in 2019 to allow the specification of Open Access resources (using a value of “0” [zero]), is not mentioned as allowed in the PCC guidelines.
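To make the two markers concrete, here is a minimal Python sketch that renders fields in the same text notation used above and checks a record for either marker. The helper names (format_field, is_marked_open_access), the tuple layout, and the example.org URL are assumptions made for illustration; they are not part of any MARC library or of OCLC or PCC tooling.

```python
# Illustrative sketch only: helper names and the record layout are hypothetical.
# Fields are modelled as (tag, indicators, subfields) tuples, where subfields
# is a list of (code, value) pairs.

BLANKS = "\\\\"  # two backslashes stand for two blank indicators in this notation


def format_field(tag, indicators, subfields):
    """Render one field as text, e.g. '506 \\\\$aOpen Access...'."""
    body = "".join(f"${code}{value}" for code, value in subfields)
    return f"{tag} {indicators}{body}"


def is_marked_open_access(fields):
    """Return True if any field carries an explicit Open Access marker."""
    for tag, _indicators, subfields in fields:
        # Marker 1: a 506 note whose $a reads "Open Access"
        if tag == "506" and any(
            code == "a" and value.strip().lower() == "open access"
            for code, value in subfields
        ):
            return True
        # Marker 2: an 856 link whose $7 (access status) is "0"
        if tag == "856" and ("7", "0") in subfields:
            return True
    return False


oa_note = ("506", BLANKS, [("a", "Open Access"),
                           ("f", "Unrestricted online access"),
                           ("2", "star")])
# The URL is a placeholder, not a real resource.
oa_link = ("856", "40", [("u", "https://example.org/book"), ("7", "0")])
plain_link = ("856", "40", [("u", "https://example.org/book")])

print(format_field(*oa_note))
print(format_field(*oa_link))
print(is_marked_open_access([plain_link]))           # no marker present
print(is_marked_open_access([plain_link, oa_note]))  # 506 marker present
```

A record that follows the provider-neutral guidelines would typically look like plain_link here: it links to the resource but says nothing machine-readable about its Open Access status.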
In summary, MARC records for Open Access resources in WorldCat are neither Open Access themselves (freely downloadable and sharable by libraries or other entities) nor do they accurately reflect, in a consistent and predictable way, the Open Access nature of the resources they describe.
With this problem in mind, what can we do to address it?
One solution might be a cooperatively built and maintained silo for bibliographic metadata that
- includes records for only Open Access resources (i.e. those with a CC license explicitly associated with them)
- includes only records that contain explicit mention of the resource’s Open Access status
- includes only records that themselves have a CC license attached to them
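The three criteria above amount to a simple inclusion test. The following Python sketch shows one way such a test could look; the CandidateRecord structure, its field names, and the crude "starts with CC" license check are all assumptions made for illustration, not a description of any existing system.

```python
from dataclasses import dataclass


@dataclass
class CandidateRecord:
    """Hypothetical summary of a metadata record proposed for inclusion."""
    title: str
    resource_license: str     # license of the resource itself, "" if none is stated
    states_open_access: bool  # does the record explicitly say the resource is OA?
    record_license: str       # license attached to the metadata record itself


def is_cc_license(license_name: str) -> bool:
    """Crude assumption: any license string starting with 'CC' counts as Creative Commons."""
    return license_name.strip().upper().startswith("CC")


def is_eligible(record: CandidateRecord) -> bool:
    """Apply the three criteria: OA resource, explicit OA statement, CC-licensed record."""
    return (
        is_cc_license(record.resource_license)
        and record.states_open_access
        and is_cc_license(record.record_license)
    )


candidates = [
    CandidateRecord("Openly licensed book", "CC BY 4.0", True, "CC0"),
    CandidateRecord("OA book, unmarked record", "CC BY 4.0", False, "CC0"),
    CandidateRecord("OA book, proprietary record", "CC BY 4.0", True, "All rights reserved"),
]

for candidate in candidates:
    print(candidate.title, "->", is_eligible(candidate))
```

Only the first candidate passes all three checks. A real repository would need a more careful license test than a string prefix, for example by matching license URIs.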
Thoth (thoth.pub) is a good start. Developed by Javier Arias in the context of the Community-led Open Publication Infrastructures for Monographs (COPIM) project funded by UKRI and the Arcadia Fund, Thoth describes itself as an “open metadata management and dissemination platform” focused in part on the “creation, curation, and dissemination of high-quality metadata records.” The metadata records in Thoth are made available under a CC0 (public domain) license.[3]
Unfortunately (from a library perspective), Thoth is intended for use by publishers rather than libraries, at least for now. As a result, the metadata creation interface is more akin to an ONIX-based system than a MARC-based system.[4] On the other hand, the metadata for any given object in the database can be output in one of many formats, including various versions and flavors of ONIX, CSV, JSON, KBART, and even MARC. The MARC records (the format most favored by libraries) are quite good, but could benefit from certain enhancements, such as the addition of subject terms from authoritative lists (the Library of Congress Subject Headings, the Getty Art and Architecture Thesaurus, Homosaurus, etc.) and Library of Congress or Dewey classification numbers.
For now, the Thoth metadata interface is publisher-dependent, i.e. a publisher can only create and edit metadata for its own publications. Ideally, an Open Access metadata repository would more closely mimic a cooperative cataloging model, whereby anyone with an account can improve, correct, and enhance any metadata record in the system. Obviously, such a scenario would require discussion among all the stakeholders (publishers, libraries, etc.) and a shared understanding of best practices for metadata creation and enhancement. There would have to be guardrails in place to ensure persistence and quality of metadata.
In the meantime, as mentioned in a recent OAPEN/DOAB blog post,[5] the Penn State University Libraries has collaborated with OAPEN to make good-quality MARC records for its entire corpus of nearly 30,000 titles freely available as Open Access under a CC license via its institutional repository. This effort, and platforms like Thoth, are important initial steps toward building an infrastructure for the creation, maintenance, curation, and sharing of metadata in alignment with the principles of Open Access: good metadata should be made available as early as possible in the research and publication process, it should be Open Access itself (i.e. both free of cost and freely sharable), and the Open Access nature of the resources the metadata describes should be explicitly stated.
As authors, researchers, libraries, funding agencies, and other interested stakeholders work to push the scholarly communications ecosystem toward greater openness and equity, open metadata will be a crucial component of their shared endeavor.
[1] https://www.loc.gov/aba/pcc/scs/documents/PN-RDA-Combined.pdf
[2] https://www.loc.gov/marc/bibliographic/bd856.html
[3] For more on Thoth metadata and the rationale behind it, see https://copim.pubpub.org/pub/open-metadata-thoth/release/1
[4] See https://www.youtube.com/watch?v=DiEZI_3Ksmg for a walk-through of the Thoth metadata module
[5] https://www.oapen.org/blog/?link=https%3A%2F%2Foapen.hypotheses.org%2F706
This work is licensed under a Creative Commons Attribution 4.0 International License.