Projects - eLife Innovation Sprint 2021

Building Tools For Readable Papers With Clickable Depth

Led by Milo Johnson

The Problem: The linear format of scientific papers is an entrenched, constrained result of history that holds back efficient and effective transmission of information between scientists. Science requires depth, and publishers rightly require a high level of detail in scientific papers. But the reality is that the majority of readers don’t care about the majority of things in a paper. And we know it! At conferences, scientists give 5-minute talks at poster sessions that effectively communicate their work to 95% of their audience, and the other 5% can ask follow-up questions about the details. In contrast, papers hide their meaning in paragraphs of dense text, and we spend entire semesters teaching students how to parse academic papers based on an array of “tips and tricks.” Recently, more scientists have started to publish blog posts or Twitter threads about their work. In these less-constrained formats, they can get right to the point: important results, key analyses, simple figures. In many cases, these blog posts or Twitter threads are more efficient and effective ways of communicating scientific information than the papers they summarize. However, they lack depth. My project aims to build a highly flexible, open-source tool for authoring papers with hierarchical, interactive depth.

The solution: My project is an open-source tool for writing readable papers with clickable depth. This tool would comprise two main modules: a hierarchical structure for organizing content and a suite of flexible content creation tools. Last year I created an interactive summary of a paper (https://www.miloswebsite.com/figure80/stories/interaction_action/). This summary had “clickable depth,” so that more detailed methods descriptions are initially hidden but expandable, and it also featured interactive data browsers for readers to explore the data themselves. Since working on this project, I have become excited about expanding on the idea of clickable depth. The basic concept is to be able to link blocks of content not just linearly (as in a Twitter thread), but also hierarchically. This format will have the benefits of blog posts and Twitter threads while also allowing authors to tuck the important details behind the relevant major points, so that the 5% of readers who care can find them easily. Within the blocks of content, users will be able to add text, images, slideshows, video, and interactive data visualizations (possibly integrated with tools like Vega-Lite). These tools would be built from the ground up with scientists in mind, unlike platforms designed primarily for social media or e-commerce.
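As a rough illustration of what "clickable depth" could look like in data terms, the sketch below models content as a tree of blocks and renders each block as a collapsed HTML `<details>` element. The `Block` class and `render` function are hypothetical names invented for this example; the project has not defined its content structure yet, so this is one possible shape under that assumption, not the tool's design.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from html import escape


@dataclass
class Block:
    """Hypothetical content unit: a visible summary plus optional hidden detail."""
    summary: str
    body: str = ""
    children: list[Block] = field(default_factory=list)


def render(block: Block) -> str:
    """Render a block as a collapsed <details> element, recursing into children."""
    inner = f"<p>{escape(block.body)}</p>" if block.body else ""
    inner += "".join(render(child) for child in block.children)
    return f"<details><summary>{escape(block.summary)}</summary>{inner}</details>"


# A major point with two deeper blocks hidden behind it.
paper = Block(
    summary="Main result",
    body="One-paragraph statement of the key finding.",
    children=[
        Block(summary="Methods", body="Detailed protocol, hidden until clicked."),
        Block(summary="Data", body="Link to the interactive data browser."),
    ],
)
html_page = render(paper)
```

Because every block has the same shape, depth can be nested arbitrarily: the reader who wants only the headline sees one line, and the reader who wants the details expands as far as the author has written.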

Proposed Work at Sprint: During this Sprint, I would like to work on

defining the most useful scope for this project,
defining the hierarchical content structure (e.g. HTML and Markdown with a JavaScript static site generator to create final products), and
building a prototype.

I think the expertise of others in the Sprint will be incredibly useful for all three of these goals. The chief deliverables would be a document defining the structure of the system and a prototype. A reach goal would be to build a prototype with a functional backend.

COAR Notify

Led by Paul Walk, Martin Klein

The problem: There is significant and growing interest in connecting pre-prints to peer-review, publication and endorsement services, in what is known as the “publish, then review” model. Some integrations of this kind have already been prototyped and developed. However, these are by nature “point-to-point” solutions linking, for example, a single significant preprint repository to a review service. Ideally, overlay journals should be able to review and publish preprints available in any preprint server or repository.

Many repositories are developed, deployed and managed in low-resource conditions. This means that development resources are scarce, and it is not viable to create individual, service-specific solutions for each and every integration requirement.

If we are to connect pre-prints in many, distributed repositories to peer-reviews and similar resources in a wide range of services then we need a general, interoperable protocol suitable for the linking of resources across a distributed service environment.

The solution: The resource-oriented nature of the Web is well-suited to an environment which places value in the fact that control of resources is distributed across a large number of repositories. In such an environment, it makes sense to take a pass-by-reference approach to interaction between different networked services, rather than relying on machine or human mediated processes to pass copies of resources around the network.

Resources in repositories have stable URIs that can be used for referencing. This means that a request for review can be sent as a standards-based notification that carries a resource’s stable URI to the inbox of a review service. This also means that the review service can obtain the resource that is to be reviewed by visiting that stable URI. Generally, this means that it becomes possible to invoke and use remote services on the network, by passing instructions to them together with, crucially, URIs identifying particular resources.
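To make the pass-by-reference idea concrete, here is a minimal sketch of a review request expressed with W3C Activity Streams 2.0 terms (`Offer`, `object`, `origin`, `target`). The function name, the choice of fields, and all URLs are assumptions made for illustration; they are not the COAR Notify payload specification, which should be consulted for the real patterns.

```python
import json
import uuid


def build_review_request(preprint_uri: str, repository_inbox: str, review_inbox: str) -> dict:
    """Build an illustrative Activity Streams 2.0 'Offer' asking a service to review a preprint."""
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "id": f"urn:uuid:{uuid.uuid4()}",
        "type": "Offer",
        # Pass by reference: only the stable URI travels, never a copy of the preprint.
        "object": {"id": preprint_uri, "type": "Document"},
        # Where the review service can send its reply.
        "origin": {"inbox": repository_inbox, "type": "Service"},
        # The inbox this notification is addressed to.
        "target": {"inbox": review_inbox, "type": "Service"},
    }


# Hypothetical URLs, for illustration only.
notification = build_review_request(
    preprint_uri="https://repository.example.org/preprints/123",
    repository_inbox="https://repository.example.org/inbox/",
    review_inbox="https://review-service.example.org/inbox/",
)
body = json.dumps(notification, indent=2)
# Under Linked Data Notifications, a sender would deliver `body` with an HTTP POST
# to the target inbox using Content-Type: application/ld+json (not performed here).
```

The receiving service needs nothing beyond the notification itself: it dereferences the `object` URI to fetch the preprint and uses the `origin` inbox to report back.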

Proposed work at Sprint: The COAR Notify project is developing a model for such notifications (https://notify.coar-repositories.org), building on the standard W3C Linked Data Notifications protocol and the W3C Activity Streams 2.0 vocabulary.

We have some early development partners helping us to explore some preliminary use-cases, and we expect to be able to deploy a reference implementation of one of these use-cases later in 2021.

Having done this, the next step will be to invite wider participation. We believe that the eLife Sprint could offer a perfect opportunity to engage with like-minded technical people, who will be able to see the potential in what we are building, and who could work with us to develop more use-cases and prototypes.

As part of our project (in advance of the Sprint) we are developing a notification “inbox” for testing and experimentation purposes. This would be made available to the Sprint as a “target” for notifications. We are also developing documentation and example code to enable people to quickly assemble client software in the programming languages of their choice (or even using command-line HTTP client tools such as curl).

We would design the Sprint activity to deliver two things:

Feedback on existing notification payload specifications, and their suitability for more use-cases
One or more prototyped notification exchanges demonstrating the use of Notify to address new resource-linking opportunities or requirements between repositories and other services.

MERITS: Metaresearch Evaluation Repository for Identifying Trustworthy Science

Led by Cooper Smout, Dawn Holford, Paola Masuzzo

The problem: In recent years, there has been an explosion of interest in post-publication peer-review, with many models proposing multidimensional article-level ratings (e.g., Kriegeskorte, 2012, Frontiers in Computational Neuroscience) as an alternative to unidimensional journal-level metrics (e.g., the journal impact factor). In line with these ideas, a growing number of preprint review platforms now solicit reviewers’ ratings of preprints on multiple dimensions (e.g. PREreview, Plaudit, Scibase, Rapid Reviews Covid-19, Crowdpeer; see Reimagine Review for more examples). Presently, however, these ratings remain siloed within each project, limiting the interoperability, searchability, and comparison between sites, and preventing research that could otherwise be conducted into the nature of these ratings and how they relate to real-world outcomes (e.g. citations, patents, replicability).

Additionally, it remains difficult for stakeholders (e.g. researchers, journalists, administration staff) to identify which preprints/articles have been peer-rated, or where to find such ratings. This lack of transparency limits the exposure of the evaluations, and makes it hard for readers to identify whether a newly published preprint has been evaluated or not. Highlighting this information could help limit the impact of low-quality research, such as the large number of low-quality preprints that have been promoted by the media throughout the COVID pandemic.

The solution: We propose to develop a central database for organising and storing multi-dimensional point-based article ratings. This database will serve two key purposes: (1) to amalgamate ratings from different preprint review platforms into a single location, and (2) to help researchers/journalists identify and find peer ratings of articles/preprints. Ratings will be stored in a common machine-readable format and linked to the original article and source of the rating (i.e. review platform). Articles will be identified using DOIs.

In the future, we anticipate that this database could be expanded to serve other purposes, for example, storing evaluation data collected during meta-research experiments (e.g. replicability ratings collected under the RepliCATS program), allowing peer reviewers at traditional journals to enter ratings submitted during the peer review process, and/or amalgamating multiple ratings across different sites (subject to appropriate research on this topic).

Proposed work at Sprint: For each deliverable, we will identify key users (e.g., meta-researchers, librarians, journalists), write user stories to understand their needs, and conduct functionality tests for the key users.

Deliverable 1: Prototype database

Establish project goals, code of conduct and contributor roles
Identify the types of data (e.g., from PREreview, Plaudit, Docmaps) to import into the database (we will bring examples to the Sprint) and determine a common syntax for them.
Determine platform to host database
Build database and launch a prototype version on the host platform (note: we have funding available to cover hosting costs)
Create a project Readme, Licence and Roadmap
Develop code to import data and prevent duplication, and search database by DOI
Write contribution guidelines for future data imports
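The import, de-duplication, and DOI search steps listed above could be prototyped along the following lines. This is a sketch under stated assumptions: the table layout, column names, and example rows are hypothetical (no common format has been agreed yet), and an in-memory SQLite database stands in for whatever hosting platform is chosen.

```python
import sqlite3

# Hypothetical schema: column names are illustrative, not an agreed format.
SCHEMA = """
CREATE TABLE rating (
    doi        TEXT NOT NULL,
    platform   TEXT NOT NULL,
    rater_id   TEXT NOT NULL,
    dimension  TEXT NOT NULL,
    score      REAL NOT NULL,
    source_url TEXT NOT NULL,
    UNIQUE (doi, platform, rater_id, dimension)
)
"""


def normalise_doi(doi: str) -> str:
    """Lower-case a DOI and strip common prefixes so equivalent forms compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def count_ratings(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM rating").fetchone()[0]


def import_ratings(conn: sqlite3.Connection, rows: list[tuple]) -> int:
    """Insert rating rows, silently skipping duplicates; return how many were new."""
    before = count_ratings(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO rating VALUES (?, ?, ?, ?, ?, ?)",
        [(normalise_doi(row[0]), *row[1:]) for row in rows],
    )
    conn.commit()
    return count_ratings(conn) - before


def search_by_doi(conn: sqlite3.Connection, doi: str) -> list[tuple]:
    """Return every stored rating for an article, with a link back to its source."""
    return conn.execute(
        "SELECT platform, dimension, score, source_url FROM rating WHERE doi = ?",
        (normalise_doi(doi),),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
rows = [  # made-up example data
    ("https://doi.org/10.1234/ABC", "platform-a", "rater-1", "rigour", 4.0, "https://example.org/reviews/1"),
    ("10.1234/abc", "platform-a", "rater-1", "rigour", 4.0, "https://example.org/reviews/1"),
    ("10.1234/abc", "platform-b", "rater-9", "novelty", 3.0, "https://example.org/reviews/2"),
]
added = import_ratings(conn, rows)
```

Duplicates are rejected by the `UNIQUE` constraint rather than by application logic, so re-running an import from the same platform is safe. Normalising the DOI first means the URL form and the bare form of the same identifier count as one article.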

Deliverable 2 (subject to time and relevant skills): Database API

Determine API framework (e.g. OpenAPI 3.0)
Develop API

Deliverable 3 (subject to time and relevant skills): Web interface to visualise the API

Design web interface
Design a visual representation of the rating data, including links to the original reviews
Write website text
Develop and launch the web interface

Genestorian

Led by Manuel Lera Ramirez

The Problem: Genotypes of model organism strains are recorded by researchers as manually entered text, loosely following an allele nomenclature that becomes increasingly obsolete with the growing diversity of genetic modifications. When consulting a publication or a laboratory’s strain database, it can be hard or impossible to understand how a strain was generated. The best way to record the genotype of a strain would be to document the sequence modifications with respect to the reference genome, and the biological resources used to produce those. Electronic Laboratory Notebooks do not provide this functionality.

Documentation of the generation of recombinant DNA resources is possible with proprietary software. However, these services do not allow exporting the history of recombinant DNA entries in an Open Source machine-readable format. This prevents the integration of collections with other bioinformatic tools and limits the usability of data produced by researchers. Conversely, Open Source libraries for in silico genetic engineering have limitations. Besides the fact that researchers need to know how to code to use them, these libraries do not provide functionality for inventorying biological entities or the relationships between them.

Producing a simple tool to document genetic engineering unambiguously would improve the reproducibility of science, fostering collaboration and knowledge transfer.

The Solution: I work in a fission yeast lab and in my free time I have been developing Genestorian, an Open Source web application to document genetic modifications in model organism collections. Genestorian will keep a laboratory inventory of strains, oligonucleotides and plasmids with sequence traceability, using a relational database, in silico molecular biology libraries, and genome databases. New plasmid and allele sequences will be generated in silico from sequences retrieved from the corresponding laboratory inventory and the model organism genome database. Strains and plasmids will be linked to the resources and genetic-engineering methods used to produce them, and to the experimental data validating genetic operations.

Genestorian will allow users to document strain and plasmid generation unambiguously on the browser through an intuitive web interface. It will also provide programmatic access through an API, and enable data exportation in a machine-readable format. This information can then be used to provide documentation of the history of a biological resource for publications, repositories or collaborators.

Right now Genestorian is at a very early stage of development, but a landing page is set up with a brief description and a video of a prototype: https://www.genestorian.org/

Proposed work at Sprint: I am a self-taught programmer, and although I write software for my research, I have never made major contributions to software projects with many users. Additionally, I am new to web development, so I lack the experience to make technical decisions that will impact security, how easy it is to use or deploy the application, etc.

I see the Sprint as an opportunity for me to discuss the design of the application with experts at an early stage so I can incorporate their feedback into the planning of the software. Additionally, I would like to discuss with them how to incorporate user feedback into the development cycle, to produce an easy to use software tool that meets the needs of the end-user.

Finally, I am trying to find funding for the development of this project. I have applied for a grant to develop this tool in the lab of Jürg Bähler at UCL, but we still do not have an answer from the funding agency. Should we not receive the grant, I hope to learn more about funding opportunities for Open Source software for research during the Sprint.

Octopus

Led by Alexandra Freeman

The Problem: Octopus is a platform designed to be the new primary research record. Initiated at the eLife sprint in 2018, it is now hoping to launch by the end of 2021.

The platform itself is nearly ready, but in Octopus every publication needs to be linked to an existing one – so the problem is creating a framework of linked publications, extracted from the Open Source literature, to which all new publications can be linked when Octopus launches.

In Octopus, publications are one of 8 types, and the framework we will be creating will be of Problems – research questions. These need to be automatically generated, extracted from the existing Open Source corpus, using natural language processing, and then hierarchically linked to form a branching structure.

With that framework created, Octopus will be ready for launch – everyone coming with a new publication will be able to find the most closely related research question to link their new publication to. This branching structure will mean that in future, scientific work will all be much more easily discovered and navigated.

The solution: The solution we envisage is:

To take JISC’s Open Access corpus, CORE (JISC is a partner to Octopus)
To pre-filter the corpus by presence of an Abstract
To extract the sentences that are most likely to define the research question of each paper, using adaptations of an algorithm that Kevin Heffernan originally developed for extracting the main findings of a paper. Combine these with the title, journal and author keywords of the relevant paper (and keep the DOI).
To cluster those research questions by semantic similarity using existing NLP algorithms, run iteratively to make hierarchical clusters
To form a ‘research question’ (or Problem) for each cluster which is not necessarily perfect English, but understandable, and keep the DOIs of the papers that link to that Problem.
To feed that framework of Problems into the Octopus database, along with the linked papers.
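The iterative clustering step above can be illustrated with a toy example. Note the swapped-in technique: this sketch uses bag-of-words cosine similarity and a greedy single-pass grouping as stand-ins, whereas the plan calls for existing NLP algorithms that capture semantic similarity. The questions, DOIs, thresholds, and function names are all invented for illustration.

```python
import math
import re
from collections import Counter


def vectorise(text: str) -> Counter:
    """Toy stand-in for a semantic embedding: lower-cased word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(items: list[tuple[object, Counter]], threshold: float) -> list[tuple[list, Counter]]:
    """Greedy single pass: join the most similar existing cluster, or start a new one."""
    clusters: list[tuple[list, Counter]] = []
    for member, vec in items:
        best = max(clusters, key=lambda c: cosine(c[1], vec), default=None)
        if best is not None and cosine(best[1], vec) >= threshold:
            best[0].append(member)
            best[1].update(vec)  # Counter.update adds counts, giving a summed cluster vector
        else:
            clusters.append(([member], Counter(vec)))
    return clusters


# Made-up research questions keyed by made-up DOIs.
questions = {
    "10.1000/a": "How does temperature affect yeast growth rate?",
    "10.1000/b": "Does temperature affect the growth rate of yeast?",
    "10.1000/c": "Which nutrients limit yeast growth?",
    "10.1000/d": "What drives galaxy cluster formation?",
}

# First pass: a tight threshold groups near-identical questions.
level1 = cluster([(doi, vectorise(q)) for doi, q in questions.items()], threshold=0.6)
# Second pass: cluster the clusters with a looser threshold to form parent Problems.
level2 = cluster(list(level1), threshold=0.3)
```

Each pass keeps the DOIs attached to their cluster, so every Problem in the resulting hierarchy can still be traced back to the papers it was derived from.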

Proposed Work at Sprint: It is not clear how far through the above process we will be by October, when the sprint will happen. Hopefully we will have had a chance to refine the algorithms and perhaps be ready to run them en masse on the large corpus. The result of this will be a node and link map of all the current problems in science – quite something to behold!

Whatever stage we’re at, the knowledge and expertise and time of a group of participants will be incredibly useful to get the Problem framework into Octopus ready for launch.

If this first milestone is achieved, then the final stage would be to create automatic emails to the authors of the papers which have been used from the corpus, asking them to check the classification of their own paper and the formation of the research question to which it has been linked. We will thus be crowd-sourcing a human touch to perfect the Problem framework prior to launch.

R4E curriculum: For Community-led Reproducibility Education

Led by April Clyburne-Sherin

The Problem: Today there is no reliable source for an average researcher to tap when they need introductory-level, short-duration content on reproducible methods and tools. While initiatives like The Turing Way and The Carpentries provide excellent on-ramps, researchers first require context to understand how these initiatives apply to them, their research, and their needs. There is a critical gap at the introductory level. Our global community-led reproducibility education project, Reproducibility for Everyone (R4E), builds an introductory layer upon these other initiatives, and can help expand their reach and impact.

Researchers need to identify where gaps exist in awareness of open practices and fill that need with timely, targeted educational offerings. R4E aims to scale open research practices by linking researchers to the tools, communities, and practices that will help them work open effectively. R4E volunteer researchers run practical introductory workshops covering a conceptual framework for reproducibility. They cover fundamental methods, tools, and initiatives to improve reproducibility. During the eLife Sprint, R4E aims to seed, enhance, and develop curriculum with wide reuse at scale to help fill educational gaps and introduce more researchers to reproducibility.

The solution: Our project aims to apply an open source development approach to community-led reproducibility education. The landscape of reproducibility changes quickly, and we learn more from our participants and instructors with every R4E workshop. In the eLife Sprint, we aim to seed and develop new curriculum modules from the greater researcher community. We also aim to recruit contributors to review and revise existing curriculum modules. The R4E curriculum is iterative: instructors add discipline-specific tools and methods, modules that feel long can be tightened, and popular modules expanded. We hope to open up this iteration to a larger open source community beyond our existing R4E volunteers. Designing an open contribution process and testing it during the eLife Sprint benefits the quality of the curriculum through a larger pool of diverse reviewers and increases the variety of training materials through the seeding of new curriculum ideas to be developed and tested. To achieve this, we will create and document a process for virtual collaboration on module creation, from proposal to review to publishing, using online tools such as GitHub, Slack, and the Open Science Framework.

Proposed Work at Sprint: The main aim of participating in the sprint is to recruit diverse open source contributors to R4E curriculum development and improvement. A secondary aim is to receive feedback on the contribution process so we can improve our documentation and processes for new contributors. Milestones we hope to achieve include:

Seed new R4E curriculum modules
Expand or adapt existing R4E curriculum modules to include new audiences, topics, or disciplines
Review and revise existing R4E curriculum modules to include new tools, methods, vocabulary
Translate existing curriculum modules into new languages
Gather feedback and improve contributor guides, workshop guides, and other onboarding materials
Identify gaps in documentation and barriers to participation

Research Group Handbook

Led by Natalie Thurlby, James Thomas, Alastair Tanner

The Problem: The research culture of a group has an impact both on the people working within it (especially students and postdocs), and on the quality of the work that they create together. This culture is created through everyday decisions: if and how team meetings are run, how the group approaches reproducibility and publication practices, what’s in the lab code of conduct, who does the administration work, etc.

Often these things are not written down in a group handbook, since this is a prohibitive time commitment for stretched researchers. When they are, they are usually not a collaborative effort.

When these things aren’t written down:

People in the team can’t easily get the support they need.
Practices are more difficult to change for the better (e.g. making the group more inclusive or reproducible). The team may suffer from the Tyranny of Structurelessness (https://www.jofreeman.com/joreen/tyranny.htm): since the process for decision making isn’t known, it’s difficult for group members with less power to influence it.

The Solution: We want to create a template research group handbook, and materials to help people use it. We hope this will help people to take ownership of these decisions by reducing the barrier to thinking about the research culture. Both of these paired resources will be on GitHub and use Jupyter Book. This will build on and link to other Open Source resources.

The template handbook will reduce the time required for research groups to co-create a research handbook by containing templates and activities to create resources such as the team’s roles and responsibilities or group values (e.g. “We work Openly, and here’s how”). The template repository (at a very early stage) is on GitHub, including plans for what it might contain: https://github.com/very-good-science/our-handbook-template/issues/1

The how-to-use guidebook will explain how to make best use of the template, both technically (e.g. how to create a repository from a template) and practically (e.g. run an annual team meeting to maintain the handbook).

The placeholder repository is here: https://github.com/very-good-science/our-meta-handbook

As a bonus, doing a writing (rather than a coding) task works excellently as an introduction to online collaboration on GitHub!

Proposed work at Sprint: The resources currently have a very basic skeleton, which needs fleshing out considerably to have:

A more complete template handbook; we would like to aim for two new sections of the template handbook, e.g. “working on side projects”, “getting the most out of one-to-ones”, or “writing a preregistration”. These might be sections that we’ve already identified or ideas that the participants come with.
An example handbook using the template repository: so that we can show potential users what the end goal is for them. E.g. “The Jean Golding Institute’s Data Science handbook”.
More detailed instructions for how to use the template repository.
More detailed contribution guidelines for future contributors, who fit the user profile “I’ve never used GitHub before, but have this resource that I use with my group, how can I add it in?”.

Understanding Repository Structures Through Charts: A GitHub Navigation Tool

Led by Michał Krassowski

The Problem: The research community is increasingly relying on software for data preparation and analysis. While many researchers are now aware of the need to deposit the scripts (code) and/or notebooks in online repositories prior to submission, these online repositories may be difficult to navigate and understand for an external researcher who tries to gain important insight into the analysis flow and/or details. The reasons for navigation difficulty include: a high variety of script languages (e.g. R, Python, Perl, Julia) and tools (Jupyter, R Markdown, etc.) used in the analyses (some of which do not have tooling to support easy cross-document navigation), complexity of the analyses performed (which implies dozens or hundreds of analysis scripts), and limited time of investigators who often decide that documenting the scripts and their organisation is not the best use of their time.

While the ideal solution would be requiring investigators to thoroughly document their analysis/code repositories, this is unlikely to become the norm soon. Even once such practices become standard, we will still be left with many existing repositories that lack any documentation.

The Solution: We propose to create a code/data flow overview tool that can create an easy-to-use chart for understanding the repository structure given only its GitHub or GitLab URL. An example code overview of a simple paper is available in the https://github.com/krassowski/multi-omics-state-of-the-field repository (Figure 3); upon clicking on the figure an SVG plot is shown in which the analysis scripts (here Jupyter notebooks) can be clicked to open the relevant file in the repository; the files saved by each of the scripts are indicated as data nodes and connected with appropriate scripts by arrows to indicate the relation (save/load). This chart was generated using previous proof-of-concept work in this area (https://github.com/krassowski/nbpipeline). The proposed application would work by statically analysing the scripts in online repositories, looking for known methods that save and load data (and script) files, and using them to create a directed graph presented as an interactive visualisation.
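The static-analysis idea can be sketched for Python scripts using the standard-library `ast` module. This is an illustrative sketch, not the nbpipeline implementation: the rule table, function names, and the two example scripts are assumptions, and only calls whose first argument is a string literal are detected.

```python
import ast

# Hypothetical, user-extensible rule table: call name -> relation to the file argument.
DEFAULT_RULES = {
    "read_csv": "load",
    "to_csv": "save",
    "savefig": "save",
}


def find_io(source: str, rules: dict[str, str] = DEFAULT_RULES) -> list[tuple[str, str]]:
    """Return (relation, filename) pairs for known calls with a literal path as first argument."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue
        func = node.func
        # Handles both `pd.read_csv(...)` (attribute) and `read_csv(...)` (bare name).
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        first = node.args[0]
        if name in rules and isinstance(first, ast.Constant) and isinstance(first.value, str):
            found.append((rules[name], first.value))
    return found


def build_graph(scripts: dict[str, str]) -> set[tuple[str, str]]:
    """Directed edges: data file -> script for loads, script -> data file for saves."""
    edges = set()
    for script, source in scripts.items():
        for relation, filename in find_io(source):
            edges.add((filename, script) if relation == "load" else (script, filename))
    return edges


# The scripts are only parsed, never executed, so pandas need not be installed.
scripts = {
    "clean.py": 'import pandas as pd\ndf = pd.read_csv("raw.csv")\ndf.dropna().to_csv("clean.csv")\n',
    "plot.py": 'import pandas as pd\ndf = pd.read_csv("clean.csv")\ndf.plot().figure.savefig("figure.png")\n',
}
edges = build_graph(scripts)
```

Here `clean.csv` ends up with an incoming edge from `clean.py` and an outgoing edge to `plot.py`, which is exactly the script-to-script dependency a reader needs to see. Paths built at runtime (variables, f-strings) are invisible to this approach, which is one reason the rule set would need to be extensible.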

The proposed tool could be deployed as a web application allowing researchers to analyse specific repositories, embedded into the workflow of journals (the navigation plots could be suggested as supplementary figures for larger projects), and made available for methodology researchers who wish to use it for analysis of thousands of existing research script repositories. Finally, graphs generated by the proposed tool could be suggested (via pull request) as an addition to the existing repositories that are currently missing any documentation whatsoever.

Proposed work at Sprint: The Sprint would be used to develop a proof-of-concept version of the tool, supporting only a subset of possible scenarios, including:

scripts and Jupyter notebooks written with Python,
scripts and RMarkdown notebooks using R,
bash scripts (including wget/curl commands)

By ensuring support of multiple file formats and languages from the beginning, the tool will in future allow for analysis of real-world research repositories, which nowadays often mix and match various languages and technologies (to use the best tool available for the job). Given the variety of functions that can be used to save or load a file, the tool must be highly customisable and allow for extension of the default static analysis rules. It is also crucially important to define what information should be shown in the proposed overview plot (what is important for a researcher who wants to understand a given repository?) and what customisation options should be made available in the potential web application serving as a frontend for the user (a researcher visiting an undocumented repository). It is also crucial to gather a representative collection of repositories that the tool is expected to work with, from a variety of disciplines and on a variety of platforms, which can only be accomplished with input from contributors representing diverse research expertise.

By the end of the Sprint the tool should allow a researcher without a programming background to analyse and understand an example repository.

VCMS: Virtual Conference Management System

Led by Simon Duerr

The problem: Scientific conferences emit a lot of CO2, are often not properly integrated into modern data publishing workflows, and use antiquated poster formats that are poorly accessible and definitely not FAIR. Plus, conferences primarily benefit a select caste of affluent researchers from the global north who can afford to travel, both financially and in terms of their caregiving responsibilities (family, dog, side job etc.). This is an especially important issue at the moment because of the urgency of the climate crisis and the ongoing pandemic.

The solution: An open source virtual conference management system (VCMS) to host smaller scientific meetings virtually at no cost (if hosting is provided by the university). The solution should come with timezone-aware scheduling, an abstract submission portal, one-click ORCID registration, newsletter functionality, a virtual poster session allowing for interactive poster presentation via video, use of pronouns across the conference to create a safe space, a whiteboard for brainstorming, chat, a job board, mentor meeting rooms and automatic archiving of the conference with a DOI to preserve a track record of the conference. A demo of the existing features can be found on vcms.simonduerr.eu

Proposed work at Sprint: The whole project is modular, and a first version including some of the modules above will be published soon (a short security review is needed first; the first module is already published on github.com/VirtualConf). A conference with 54 participants has already been hosted with the tool. Some modules, such as the chat and the job board, are not yet implemented. In addition, for non-technical people, a review of the user interface and UX could be conducted and training material created. The archiving feature of the conference could also be extracted into a standalone tool, archive-scientific-conference.com: a site where one can provide a CSV file and a zip of conference posters, and that will automatically publish all of them on Zenodo with the specified license (maybe after emailing the authors), provide a DOI for each poster, and archive the conference in a FAIR manner with one click. Conference websites will disappear over time, but conferences archived in this way will not.
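The first step of such a standalone archiving tool, pairing each row of the CSV with a poster file in the zip before anything is published, might look like the sketch below. The CSV column names (`title`, `authors`, `license`, `filename`), the function name, and the example data are hypothetical, and the Zenodo upload itself is deliberately left out.

```python
import csv
import io
import zipfile


def match_posters(csv_text: str, zip_bytes: bytes) -> tuple[list[dict], list[str]]:
    """Pair each CSV row with a file in the zip; return (records, missing filenames)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        available = set(archive.namelist())
    records, missing = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["filename"] in available:
            records.append({
                "title": row["title"],
                "authors": [name.strip() for name in row["authors"].split(";")],
                "license": row["license"],
                "file": row["filename"],
            })
        else:
            missing.append(row["filename"])
    return records, missing


# Build a tiny in-memory zip and CSV as made-up example input.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("poster01.pdf", b"%PDF- placeholder")
csv_text = (
    "title,authors,license,filename\n"
    "Protein folding at scale,A. Author; B. Author,CC-BY-4.0,poster01.pdf\n"
    "A poster without a file,C. Author,CC-BY-4.0,poster02.pdf\n"
)
records, missing = match_posters(csv_text, buffer.getvalue())
```

Reporting missing files up front lets the organiser fix the submission before any record is deposited, which matters because a published DOI is not meant to be withdrawn.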