How the ILL uses deep learning to find citations of its experiments

For decades, the Institut Laue-Langevin (ILL) has been at the centre of neutron science research, where in recent years scientists have turned their efforts to everything from the hunt for dark matter to the mysteries of limoncello.

The internationally financed facility was founded in 1967 as a joint initiative between France and Germany, with Britain becoming the third leading partner in 1973.

During its half-century in existence the centre, located in Grenoble, France, has been at the forefront of research in topics from materials engineering to superconductivity, while one of its instruments, the Physique Fondamentale 2, was for decades the only facility for ultra-cold neutron research.

Head of IT at the ILL, Jean-François Perrin, describes the instruments at the facility to Computerworld UK as similar to a synchrotron, but instead of using X-rays, the instruments at the ILL fire neutron beams to better understand matter.

At the moment the IT group is tracking two key metrics: the number of requests from scientists to run experiments at the ILL, and the number of publications that have arisen from work there. The second metric led to a publication analysis project, so that the ILL can determine where the results of its experiments are being cited.

"We have more or less 2,000 scientists visiting the ILL every year," says Perrin. "We generate a volume of data which is currently in the order of 300 terabytes a year, and a total archive of a petabyte."

In terms of infrastructure, the ILL keeps that petabyte of archived data in local storage on an IBM system. For the publication analysis project, however, the organisation runs a small Apache Spark cluster of eight machines, with about 1 terabyte of RAM and 265 CPU cores.

The publication analysis project makes use of deep learning techniques to surface the role of ILL instruments in research, combing databases for evidence of the facility's contributions.

"One thing that is becoming more and more important is the management of data - open data, fair data," says Perrin. "Until recently the data was used by the scientist in the experiment ... but since 2011 we have a policy which states that after three or five years, the data should become publicly available for all the scientists who are not lucky enough to get access to the ILL - to use this data for other analyses."

Because this data could be cited by anyone, the team at the ILL needs to trace where it crops up, what other scientists are using it for, and the kinds of papers that it's published in.

"Currently there is clearly a lack of information available - a lack of metrics - this is something new to invent," he says. "So three years ago we started a project, which was financed by the European Commission, in order to get more visibility on what is done with the data at the ILL."

This has included, for instance, the creation of a digital object identifier (DOI), which is a digital marker that researchers can use in their papers to trace citations back to the original data. In 2012 this DOI had "more or less no users" but as of 2018 roughly 20 percent of instrument users had cited their data using the digital signifier. That leaves 80 percent who had not, of course, but it is progress nonetheless.

To discover topics that might be of further interest for future research, the IT team searches for the frequency of words in its own literature compared to the frequency of the same words in the wider web. "When we notice a word is happening more frequently, we follow this word and the evolution of this word in our documents," says Perrin.
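The frequency comparison Perrin describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ILL's actual pipeline: the function names, the smoothing constant and the ratio threshold are all hypothetical, and the two input strings stand in for the facility's own literature and a sample of the wider web.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return each word's share of all the words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def overrepresented(own_text, reference_text, min_ratio=2.0, smoothing=1e-6):
    """List words that are markedly more frequent in own_text than in reference_text.

    min_ratio and smoothing are illustrative assumptions. The smoothing term
    avoids division by zero for words that never appear in the reference text.
    """
    own = relative_frequencies(own_text)
    ref = relative_frequencies(reference_text)
    ratios = {
        word: freq / (ref.get(word, 0.0) + smoothing)
        for word, freq in own.items()
    }
    flagged = [(word, ratio) for word, ratio in ratios.items() if ratio >= min_ratio]
    # Most over-represented words first
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    own_literature = "neutron beam neutron scattering the sample"
    wider_web = "the the the sample beam weather news the"
    for word, ratio in overrepresented(own_literature, wider_web):
        print(word, round(ratio, 1))
```

Running the same comparison on successive batches of documents and recording each flagged word's ratio over time would give a rough view of the "evolution" of a word that Perrin mentions.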

One of the difficulties is separating these words from the noise. Here, the team uses deep learning techniques to try to determine whether the keywords are relevant to projects conducted using ILL instruments or unrelated.

Open source

Dependence on commercial licences can be restrictive, and the ILL runs 80 percent of its IT on open source technology.

Opting for a commercial solution "would be a nightmare" with licences, says Perrin, not only from a cost perspective but also on the licence management side - as regular audits from the top commercial vendors would prove a "burden which does not fit well with science and the need of the scientists".

Experiments using instruments such as those found at the ILL have always been inextricably linked to their underlying infrastructure, but now even more so, says Perrin, as the sheer volume of data created during the experiment or required for image analysis can be far beyond the capacity of researchers at their home labs.

"If you look at the data of the ILL 10 years ago, the volume for a year was less than a terabyte, now we are in the order of 200, 300 terabytes last year," he says. "The consequences of that is that 10 years ago the scientists were coming to the ILL to perform their experiment, then they would go back to their lab for performing the analysis, for getting the science out of the data."

But now, with some series of experiments generating as much as 70 terabytes of data, taking that information out of the lab is "more or less impossible".

"They don't have the capacity to store the volume of data, they don't have the capacity to analyse the volume of data, and then this is a really tricky solution," Perrin says.

To address this, the organisation, together with others across Europe and partially financed by the European Commission, is working on an OpenStack infrastructure where the users could "instantiate [a] machine for performing the analysis and having access to the data".

This project is only just starting up, but it fits in with other recent initiatives such as PaNOSC, a project that kicked off in January 2019 and will bring together six European research infrastructures - CERIC-ERIC, ELI-DC, the European Spallation Source, European XFEL, the ESRF and the ILL - along with the e-infrastructure providers EGI and GEANT.

PaNOSC will aim to develop common policies and strategies among these organisations with a view to opening their data, and ultimately to realise a "data commons for Neutron and Photon science".

"Especially for large instruments like us where the volume of data is becoming so huge ... [you can] no way continue to run an organisation like the ILL or the ESRF, these big machines, without a solid IT infrastructure," says Perrin.

In future, the ILL is looking into how AI algorithms might be able to predict the results of experiments run on the actual instruments.

"It doesn't mean that the experiment is not necessary, but many people are looking for such a thing," Perrin says. "Based on the previous experiments done and the result of this experiment, learning from that, applying deep learning techniques - could we get information on the possible subsets of the next experiment?"


Copyright © 2019 IDG Communications, Inc.
