Please read my paper

There is no shortage of reading material in the research community. Scholars now publish nearly two million journal articles a year in more than 28,000 English-language scholarly journals.[1] This is the result of a well-established trend: the number of journal titles has grown roughly 3% a year since the first scholarly journals appeared over three and a half centuries ago. In 1665 the Royal Society began to publish Philosophical Transactions, the same year that the Journal des sçavans commenced publication in Paris. As Michael Mabe has pointed out in his analysis,[2] this growth in journals simply scales with the number of researchers. There is no way for an individual to deal with the volume of content, even the subset of the literature restricted to the researcher’s own field of interest.[3] Access to the scholarly literature has nonetheless become a high-profile issue in the community. Yet because much of the literature is supported by subscription access, the more acute problem is discovery of the most important articles, and their associated data, by the inquisitive researcher. How can one identify the truly essential research articles pertinent to a particular research endeavor?
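As a back-of-the-envelope check of those growth figures (my own arithmetic, not a number drawn from the cited sources), compounding 3% annual growth from a single title in 1665 lands in the same neighborhood as today’s journal count:

    # Rough consistency check: ~3% annual growth in journal titles,
    # compounded from a single title in 1665 to roughly the present day.
    years = 2014 - 1665                 # about three and a half centuries
    titles = 1 * 1.03 ** years          # exponential growth from one journal
    print(f"{titles:,.0f} journal titles")  # on the order of 30,000, close to the ~28,000 quoted above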

Text and data mining

To solve this problem, we will not rely on an army of research assistants; even the low-cost, extended workdays of graduate students cannot make a dent in it. Clearly, the dominant reader of the future will be a machine rather than a person. Adapting the scholarly literature to be efficiently and accurately machine-readable, and developing machine-reading tools with user-friendly interfaces, are frontier development projects in the publishing and information technology communities. This enterprise has a catchy name, text and data mining (TDM for short), and there is considerable discussion of its prospects and potential benefits in the publishing community and among its customers and policy makers.

In its simplest form, text mining is what the search and indexing routines of commercial search engines such as Google, Yahoo, and Bing do when applied to the full corpus of literature on the web, allowing key topics to be discovered and exposed. Most scholarly publishers sign agreements with these firms to allow their content to be “crawled” by robotic readers, which tag the content for identification against key words and terms. Taken to the next level, TDM uses more sophisticated analytic tools, such as natural language processing and machine learning, to recognize relationships in unstructured text and to extract key identifiers, such as names, chemical structures, and experimental methods. TDM is a ripe arena for research, development, and testing of techniques.
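To make the two levels concrete, here is a minimal sketch of the indexing-and-retrieval layer of text mining, written in Python with the scikit-learn library; the abstracts and the query are invented placeholders, and this illustrates the general technique rather than any publisher’s or search engine’s actual pipeline:

    # Minimal text-mining sketch: index a few invented abstracts with TF-IDF
    # term weights and rank them against a free-text query. This is roughly
    # what a crawl-and-index step provides before any deeper natural language
    # processing is applied.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    abstracts = [
        "Spontaneous symmetry breaking and plasmon modes in superconductors.",
        "Broken symmetries and the masses of gauge bosons in field theory.",
        "Mining electronic health records for adverse drug reaction signals.",
    ]

    # Build the document-term matrix: rows are documents, columns are terms.
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(abstracts)

    # Score every document against the query by cosine similarity.
    query = "symmetry breaking and the origin of gauge boson mass"
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix).ravel()

    for score, text in sorted(zip(scores, abstracts), reverse=True):
        print(f"{score:.2f}  {text}")

The more sophisticated level replaces this simple bag-of-words representation with named-entity recognition and relation extraction, but the overall crawl, index, and analyze shape remains the same.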

Research funding agencies should be, and are, facilitating such development. When the US DOE recently announced its public access plan for scholarly publications and data, the plan was criticized in some camps for not spelling out details or requirements for TDM of publicly accessible content. A careful reading of the February 2013 OSTP memorandum (the basis for DOE’s policy; other US funding agencies will soon follow suit) shows that it does not mandate specific requirements for TDM. Given the relative infancy of the field, the memorandum instead encouraged creative “reuse” of scholarly content, providing broad incentives rather than specific mandates at this early stage of TDM’s development. One can easily imagine the havoc that could be generated if scholarly publishers had to open up their complete content to unregulated crawling by machine readers. All public and private databases of digital content have to be protected from ubiquitous online threats.

The potential value of TDM tools and techniques is greatly enhanced if the widest possible collection of content, from related and seemingly unrelated subjects, is made available for the mining exercise. That opens up the possibility of serendipitous discoveries when connections or relationships are examined beyond the reach of narrowly targeted searches. Within the realm of physics, we have the recent example of last year’s Nobel Prize associated with the discovery of the Higgs boson: the fundamental theoretical work in high-energy physics by Peter Higgs and others drew on Philip W. Anderson’s earlier examination of quantum phenomena in superconductors.[4] Even so, careful vetting of crawling tools, and running them against the primary content or against special mirror platforms created for the purpose, are sensible approaches. Under the public access statutes introduced by the UK government in April 2013, researchers with access to subscribed content are permitted to copy such content for noncommercial TDM purposes. Most publishers that I interact with will allow controlled TDM of their content upon request. The field is so new that there have been only a handful of such requests so far.

But this situation will change quickly. The most active work is occurring in the biomedical and pharmaceutical fields, where important topics such as drug discovery and patient reactions can be tracked across the literature. For a recent example of mining electronic health records for patterns, see Scientific Data. For those interested, a succinct analysis of TDM in this field has recently been published by a collaboration called the Pharma Documentation Ring together with the publishing organization the Association of Learned & Professional Society Publishers.

For more general references on this quickly developing field, particularly with respect to scholarly publishing, I refer you to a recent STM statement, a report by the Publishing Research Consortium, and new TDM service initiatives recently announced by the publishing services organizations the Copyright Clearance Center and CrossRef.




[2] M. Mabe, Serials 16(2), 191–197 (2003).

[3] A. G. Fraser and F. D. Dunstan, “On the impossibility of being an expert,” BMJ 341, c6815 (2010).

[4] P. W. Anderson, Phys. Rev. 130, 439 (1963).