Crawling data portals for improving research communication


Led by Andreiwid Correa and Kellyton Brito

It remains difficult for consumers of open data to discover, select and compare open-data repositories and platforms. A solution to collate and display this information will provide valuable data for benchmarking exercises, and ultimately help inform consumers. 

Aim

Open data has powered and enabled many initiatives worldwide, usually in the form of data portals or data platforms. However, it remains difficult for consumers of open data to discover, select and compare open-data repositories and platforms. How many data portals can address their specific use case? What are the most used open-data software platforms? How can the level of openness be compared across repositories in different countries? A solution that collates and displays this information will provide valuable data for benchmarking exercises and ultimately help inform consumers.

Work at the Sprint

At its heart, the solution needs a way to automatically identify and collect open-data sources. Doing so means crawling web pages worldwide, which requires a lot of computational resources. To overcome this, we rely on the Common Crawl project, an open repository of web crawl data that provides a freely accessible copy of the textual web. The solution scans the Common Crawl URL index iteratively, on a monthly basis, to automatically identify data portals and gather information about their availability. It should then produce open, accessible data ready for reuse by the community, through the development of visualisation platforms and/or APIs.
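
As a rough illustration of the identification step, the sketch below scans URL index records and guesses which hosts are data portals from path signatures. It assumes the records arrive as JSON lines with a `url` field, and the signature table, the `find_portals` function and the sample hosts are hypothetical placeholders chosen for this example, not the project's actual detection rules.

```python
import json
from urllib.parse import urlsplit

# Hypothetical URL-path signatures, for illustration only.
# A real detector would need a much richer and validated rule set.
PLATFORM_SIGNATURES = {
    "CKAN": "/api/3/action/",
    "Socrata": "/api/views",
}


def find_portals(index_lines):
    """Return {host: platform} for hosts whose URL path matches a signature."""
    portals = {}
    for line in index_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        parts = urlsplit(record["url"])
        for platform, signature in PLATFORM_SIGNATURES.items():
            if signature in parts.path:
                # Keep the first platform seen for each host.
                portals.setdefault(parts.hostname, platform)
                break
    return portals


# Made-up sample records standing in for one batch of index results.
sample = [
    '{"url": "https://data.example.org/api/3/action/package_list", "status": "200"}',
    '{"url": "https://example.com/blog/post", "status": "200"}',
    '{"url": "https://opendata.example.net/api/views/abcd-1234", "status": "200"}',
]
portals = find_portals(sample)
```

Working from the index alone keeps the monthly scan cheap, because only URLs are inspected and no page content is downloaded. Checking whether each candidate portal is actually available would be a separate step.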

We are looking for…

UX and design contributors with a client-side web development background, to help build a responsive web-based system. Contributors with knowledge of and interest in open data, and/or experience with platforms and mechanisms for publishing data, would also be very welcome.