How much are we undercounting Open Access? A plea for better and open metadata.

It’s still pretty early days for measuring open access journal articles. For a long time the only reliable way to determine if an article had an open access version available was via a Google Scholar title search. You can’t get at Google Scholar via an API or by web crawling (because of the CAPTCHAs they have in place), so any attempt to pull open access data from Google Scholar at scale is a long and painful process.

There are, though, a number of tools that try to capture open access at scale, ones that:

  • pull and merge open access content from repositories (e.g. BASE)
  • index full open access journals and their articles (e.g. DOAJ)
  • capture journals that become free to read after an embargo (e.g. PMC)
  • collect individual copyright statements for published articles (e.g. Crossref)

The brilliance of Unpaywall is in how it pulls all these sources together to create one database (covering Green, Gold, Bronze, and Hybrid OA). One that allows mass gathering of open access data but is also easy for individuals to use when looking for an OA version of a single article.

Being able to obtain data about open access publications at scale opens all sorts of doors that were previously closed. We can see what kind of information is available to those without institutional access. We can start collecting open access articles and do fancy analysis of them with text and data mining and algorithms. We can track research funder OA compliance. We can measure where OA research has practical benefit to researchers.

Unpaywall is only as good as the data it can find, though, and that’s what I want to talk about.

I was poking around some articles published as hybrid open access in Taylor and Francis (T&F) journals and was surprised that Unpaywall did not identify them as open access.

Example #1:

Whiteside, Alan, et al. “Mixed results: the protective role of schooling in the HIV epidemic in Swaziland.” African Journal of AIDS Research 16.4 (2017).

Published under a CC-BY license, yet the Unpaywall API does not currently find an open access version.

Example #2:

Shams, Tahireh Ashley, Sophie Gosselin, and Ryan Chuang. “Unintentional ingestion of black henbane: two case reports.” Toxicology Communications 1.1 (2017): 37-40.

Published under a CC-BY license, yet the Unpaywall API does not currently find an open access version.

This isn’t Unpaywall’s fault. Unpaywall can’t find the open access version of these articles because T&F hasn’t made this information available.
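As a rough illustration of what those checks look like, here's a minimal Python sketch against Unpaywall's v2 REST endpoint. The DOI and email address below are placeholders for illustration, not the real identifiers for the articles above; the endpoint does require you to pass your own email.

```python
import json
from urllib.request import urlopen  # only needed for a live lookup

UNPAYWALL_API = "https://api.unpaywall.org/v2/"

def unpaywall_url(doi, email):
    """Build the Unpaywall v2 REST URL for a DOI (the email parameter is required)."""
    return f"{UNPAYWALL_API}{doi}?email={email}"

def best_oa_link(record):
    """Return the best open access URL from an Unpaywall record, or None if no OA version was found."""
    loc = record.get("best_oa_location")
    return loc.get("url") if loc else None

# A record shaped like an Unpaywall response for an article it thinks is paywalled:
sample = {"doi": "10.1234/example", "is_oa": False, "best_oa_location": None}
print(best_oa_link(sample))  # prints None

# Live lookup (requires network; substitute a real DOI and your own email):
# record = json.load(urlopen(unpaywall_url("10.1234/example", "you@example.com")))
# print(record["is_oa"], best_oa_link(record))
```

For the T&F articles above, a call like this comes back with no OA location, even though the published versions are CC-BY.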

The main source of information about hybrid open access is Crossref, where publishers mint their DOIs and deposit content metadata. Publishers can also report the copyright license for each article in Crossref, which is how Unpaywall identifies hybrid open access.
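You can see what a publisher actually deposited by querying Crossref's REST API and looking at the `license` field of a works record. A minimal sketch, where the DOI and record are made up for illustration:

```python
def license_urls(work):
    """Collect license URLs from a Crossref works 'message' object (empty if none deposited)."""
    return [lic.get("URL") for lic in work.get("license", [])]

# A record shaped like Crossref's /works/{doi} 'message' for a CC-BY article:
sample = {"DOI": "10.1234/example",
          "license": [{"URL": "http://creativecommons.org/licenses/by/4.0/",
                       "content-version": "vor"}]}
print(license_urls(sample))                          # one CC-BY license URL
print(license_urls({"DOI": "10.1234/no-license"}))   # [] (nothing deposited)
```

If the publisher never deposits that `license` field, tools like Unpaywall have nothing to go on, no matter how open the article actually is.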

Crossref recently launched a new Participation Reports feature in beta that lets you see how complete the metadata is for each publisher. T&F – listed under their parent company Informa UK – has license information for only 4% of their content in Crossref. Compare this with Elsevier, which has license information for 96% of their content.

(When it comes to Open References though, T&F is at 99% while Elsevier is at 0%. I’ve ranted about this before.)

T&F needs to start reporting its copyright license information to Crossref so Unpaywall and other tools can capture these open access publications. Right now, since Unpaywall is being used in multiple library search tools (e.g. Scopus and Web of Science), users searching for these articles there will not find the open access links. The search tools won’t show they have access to these articles unless their institution has a subscription.

Open does not matter if it’s not discoverable. We need to start pushing publishers to make more (and better) metadata available in Crossref so it can push out to places like Unpaywall.

This got me thinking: what about other open access content that isn’t discoverable? A recent study of Unpaywall found that if there was a truly open access copy of an article, Unpaywall would find it 77% of the time. Roughly, for every ten articles with an open access version, it finds about eight and misses two. They calculated this number by title searching in Google Scholar for an open copy.

Interestingly, they did not count articles on academic social networks, like ResearchGate, as true open access in this Google Scholar analysis. It’s pretty well documented now that a lot of the papers on ResearchGate infringe copyright, and if Unpaywall wants to be widely used by researchers, they gotta cover their butts: they don’t index ‘sources of dubious legality’ like ResearchGate.

Not indexing ResearchGate means a lot of free-to-read content gets missed, though. A recent analysis showed that ResearchGate is the biggest source of free-to-read content in Google Scholar.

What else does Unpaywall miss? Capturing hybrid OA is almost impossible if publishers don’t make that data available somewhere. Unpaywall also struggles with the challenge of matching open, self-archived versions of articles to the final published record it has. That’s a hard thing to do when there is no linked persistent identifier, or when the title of the self-archived version differs from the published one.
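When no persistent identifier links the two versions, matchers typically fall back to fuzzy title comparison. Here's a minimal sketch of the normalization step (real matchers also compare authors, years, and so on; the self-archived title below is invented for illustration):

```python
import re

def normalize_title(title):
    """Lowercase and collapse punctuation/whitespace so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

published = "Mixed results: the protective role of schooling in the HIV epidemic in Swaziland"
self_archived = "Mixed Results - The Protective Role of Schooling in the HIV Epidemic in Swaziland."
print(normalize_title(published) == normalize_title(self_archived))  # True
```

Even a simple normalization like this catches a lot of case and punctuation drift, but it breaks down when a preprint went out under a genuinely different title.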

I was looking at Semantic Scholar the other day and noticed they appear to have launched a beta version of their own open access finder. At first I thought they were using Unpaywall, but in some cases they find open versions Unpaywall does not.

For instance, Semantic Scholar’s beta ‘Alternate Sources’ feature finds a self-archived copy of this article in MIT’s DSpace, but Unpaywall does not. Here’s another example where Semantic Scholar finds an OA copy but Unpaywall doesn’t.

This isn’t to say the Semantic Scholar tool is better than the Unpaywall tool. I’m sure there are lots of cases of the reverse as well where Unpaywall finds OA but Semantic Scholar doesn’t. The data is messy. Merging is hard.

That Unpaywall study really hit home for me that Google Scholar is still the de facto tool for checking if an OA version of an article is available. For all the articles I linked to above, Google Scholar finds the OA copy every time. I gotta wonder: how much OA does Google Scholar miss? How would we know, or even check? Google practically is the web sometimes.

There are also a few other areas I can think of where we could be undercounting OA, beyond hybrid or self-archived. Including Unpaywall data in Web of Science makes it a great tool for doing a quick OA analysis, but Web of Science doesn’t include the OA preprint data Unpaywall has. It’s still pretty early days for preprints, but they are growing at a high rate, and if you’re using just Web of Science for OA analysis, there’s probably more OA out there than your numbers show.

Another area where we could be undercounting OA is delayed open access journals. A lot of these are medical or health journals, and once their issues pass the embargo they become free to read in PubMed Central. There’s no real standard for delayed open access journals though. No reporting system. How many delayed open access journals are there outside of the ones in PubMed Central? It would be cool if journals could start reporting this information to Crossref as well. Or if we could create a DOAJ for these kinds of journals. Some kind of standard that tracks these journals and ensures that the content remains open.

To wrap it up: we’ve still got a ways to go when it comes to measuring open access. Pushing publishers to more reliably report copyright licenses to Crossref would be a big step forward, but there’s lots of little stuff we can do too.

Published by Ryan Regier

Doing lots of different stuff. Follow me on twitter at: @ryregier
