OpenRefine

Free, open-source, powerful tool for working with messy data.

Harmonising Bibliographic Data using OpenRefine

What is OpenRefine?

OpenRefine, as its developers describe it, is a free, open-source, powerful tool for working with messy data. We will use it to clean bibliographical data such as author keywords, indexed keywords, affiliations, and author names.

Installing OpenRefine

OpenRefine runs on Windows, Mac, and Linux. It can be downloaded from the official website:

https://openrefine.org/download

Java must be installed and configured on your computer to run OpenRefine, so download and install Java before proceeding with the OpenRefine installation. Please note that the OpenRefine version used for this guide works with Java 8 to Java 15 but not Java 16 or later; check the download page for the Java requirements of the release you install.

Basic Steps to Clean Bibliographical Data using OpenRefine

For this guide, we will use bibliographical data downloaded from Scopus, namely the scopus.csv file.

1. Open the OpenRefine application.

screenshot_2678.png

2. Choose the files that you want to clean and click Next.

screenshot_2679.png

3. Click Create Project

screenshot_2680.png

4. Identify the column that you want to clean (such as Author Keywords), click the dropdown button of that column, then click Edit cells ➜ Split multi-valued cells.

Before splitting, you may want to convert all values in the column to lowercase (Edit cells ➜ Common transforms ➜ To lowercase).

screenshot_2681.png

5. Enter the separator used for that column.

screenshot_2683.png
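To make the effect of steps 4 and 5 concrete, here is a minimal Python sketch of what lowercasing and splitting a multi-valued cell amounts to. It is not OpenRefine's implementation: the function name `split_keywords` and the sample cell are hypothetical, and the semicolon separator is an assumption, so check which separator your own export uses.

```python
def split_keywords(cell, separator=";"):
    """Lowercase a multi-valued cell and split it into individual keywords."""
    if not cell:
        return []
    parts = cell.lower().split(separator)
    # Strip surrounding whitespace and drop empty fragments
    return [p.strip() for p in parts if p.strip()]


# Hypothetical Author Keywords cell (the separator is an assumption)
cell = "Bibliometric Analysis; Scopus; VOSviewer"
print(split_keywords(cell))
```

The sketch also trims whitespace around each keyword. In OpenRefine that is a separate operation (Edit cells ➜ Common transforms ➜ Trim leading and trailing whitespace).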

6. Click the dropdown button of the Author Keywords column, then click Facet ➜ Text facet.

screenshot_2684.png

7. Edit the facet on the left side of the screen, either by going through the keywords one by one or by using the Cluster function.

This is where the cleaning takes place, so expect this step to take longer than the others. Start by cleaning the keywords with each of the methods and functions available under Cluster. Once that is done, screen and edit all keywords MANUALLY. Then check Cluster AGAIN, and repeat until all the keywords have been cleaned and harmonised.

screenshot_2685.png
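For readers curious about how clustering finds near-duplicate keywords, the sketch below approximates the idea behind the fingerprint key-collision method, one of the methods offered in the Cluster dialog. It is a simplified illustration rather than OpenRefine's actual code: the function names `fingerprint` and `cluster` and the sample keywords are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Simplified fingerprint key: normalise case, accents, punctuation and word order."""
    text = value.strip().lower()
    # Replace accented characters with their closest ASCII form
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove punctuation (anything that is not a letter, digit, underscore or whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    # Split into tokens, remove duplicates, sort, and re-join
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a fingerprint key; keep only groups with more than one variant."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Hypothetical keywords as they might appear in a text facet
keywords = ["e-commerce", "ecommerce", "machine learning", "machine  learning.", "scopus"]
for key, variants in cluster(keywords).items():
    print(key, "->", variants)
```

Variants that differ only in punctuation, spacing, accents, or word order end up under the same key and are proposed as one cluster. Keywords that differ in spelling or wording (for example singular versus plural forms) do not collide under this key, which is why trying the other clustering methods and a manual pass are still needed.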

8. Once the cleaning is done, join the multi-valued cells again (Edit cells ➜ Join multi-valued cells). Enter the same separator that you used when you split the cells.

screenshot_2686.png

9. Export the file in the same format that you imported. Your file is now ready for bibliometric analysis in VOSviewer or Biblioshiny.

screenshot_2687.png
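As a rough illustration of what joining and exporting produce, the following Python sketch re-joins cleaned keywords with a separator and writes the records as CSV. It is only a sketch under assumptions: the function name `join_keywords`, the sample records, the `Title` column, and the `"; "` separator are all hypothetical, and the output is written to an in-memory buffer rather than a real file.

```python
import csv
import io


def join_keywords(keywords, separator="; "):
    """Join cleaned keywords back into a single multi-valued cell."""
    return separator.join(keywords)


# Hypothetical records after cleaning: one list of keywords per document
records = [
    {"Title": "Paper A", "Author Keywords": ["bibliometric analysis", "scopus"]},
    {"Title": "Paper B", "Author Keywords": ["e-commerce", "scopus"]},
]

# Write to an in-memory buffer; a real export would write to a .csv file instead
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Title", "Author Keywords"])
writer.writeheader()
for record in records:
    row = dict(record)
    row["Author Keywords"] = join_keywords(record["Author Keywords"])
    writer.writerow(row)

print(buffer.getvalue())
```

Using the same separator as in the original export matters because the analysis tools split the keyword column on that separator when they read the file.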

 

Aidi Ahmi

Tunku Puteri Intan Safinaz School of Accountancy
Universiti Utara Malaysia
06010 UUM Sintok
Kedah, Malaysia
Tel: +6049287222
E-mail: aidi[at]uum.edu.my