
Best Practices for Cleaning and Storing Data

By Matt Arthur

Friends,

As the next academic year approaches, many of you will begin to produce, store, and manage large amounts of data. Researchers generally have a basic idea of what they want to do with their data after collection: they may want to run specific statistical tests, make plots, or simply summarize the results of their study in tabular form. Many of us have invested a great deal of time in learning how to process data to answer our research questions. How we manage those data, however, is often more nebulous. Best practices for data organization vary somewhat by discipline, but in the discussion that follows, we will review some universal practices that can both simplify data storage and minimize the chance of data loss.

[Image: “Pave your Way to a Data-Driven Future” against a blue background. Caption: Data is everywhere!]

Storage Considerations

We shall begin by reviewing some basic guidelines for data storage. In their paper “Good Enough Practices in Scientific Computing,” Wilson et al. (2017) provide an in-depth discussion of data management practices. The following list samples a few of the points the authors address; I encourage interested readers to check out the entire paper for more information.

1. Always save raw data as originally generated. This guideline matters both for reproducing your analysis and for correcting mistakes. Consider setting raw data files to “read-only” status so they cannot be accidentally overwritten. It is also important to carefully document how the raw data were produced. Many experimenters already do this by keeping detailed lab and field notes, but the practice matters just as much when gathering data from the internet: keep track of the websites you use and the parameters of any queries you run. If you are using second-hand data, make sure you understand every variable/attribute and how it is measured, and be mindful of any clustering or grouping in the raw data that may affect your analyses.

2. Always back up your data. Many researchers suggest a “3-2-1” rule for backups: keep 3 copies of the data, on 2 different media, 1 of which is off-site (Perkel 2019). For instance, you might keep one copy of the data on your computer, a second copy on a cloud-based storage service, and a third copy on an external hard drive. This is, of course, a general guideline; for extremely large files, creating many copies may be unrealistic. A sketch after this list shows one way to script and verify such copies.

3. Create the dataset you wish you had received. Copy the raw data file and make a clean version that is easy for humans and machines alike to read. Wilson et al. suggest changing variable names to be human-readable. It is also a good idea to convert your file to a non-proprietary format, such as CSV (for tabular data) or JSON/XML (for non-tabular data). This is also a good opportunity to re-code data entries; for instance, you may want to code missing values as “NA” rather than “-99”, which could be mistaken for a real measurement. This step may further involve removing erroneous data points, dropping columns, aggregating data, et cetera. A minimal cleaning sketch follows this list.

4. Record all your processing steps. This is essential for reproducing your analysis. If your raw data file is ever updated, a complete record of the original processing steps lets you regenerate the cleaned version exactly.
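
To make points 1, 3, and 4 concrete, here is a minimal Python sketch of a cleaning script. The file paths and column names are hypothetical stand-ins for your own project; the point is that the raw file is locked down and never edited, while every cleaning step lives in code that can be rerun at any time.

    import os
    import stat

    import pandas as pd

    # Hypothetical paths; substitute your own project's files.
    RAW_PATH = "data/raw_survey.csv"
    CLEAN_PATH = "results/survey_clean.csv"

    # Point 1: remove write permission so the raw file cannot be
    # edited by accident.
    os.chmod(RAW_PATH, stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)

    # Point 3: build a cleaned copy without touching the original.
    df = pd.read_csv(RAW_PATH)

    # Rename cryptic columns (the originals here are invented) to
    # human-readable names.
    df = df.rename(columns={"q1_resp": "age", "q2_resp": "state"})

    # Re-code the sentinel -99 as a proper missing value.
    df = df.replace(-99, pd.NA)

    df.to_csv(CLEAN_PATH, index=False)

Because the cleaning lives in a script rather than in ad hoc edits, point 4 comes almost for free: the script itself is the record of your processing steps, and rerunning it regenerates the cleaned file.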
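
And here is one way to script the “3-2-1” copies from point 2. The destination paths are again hypothetical; the checksum comparison guards against silent corruption during the copy.

    import hashlib
    import shutil
    from pathlib import Path

    SOURCE = Path("data/raw_survey.csv")
    # Hypothetical destinations: an external drive and a synced cloud folder.
    BACKUPS = [Path("/mnt/external/raw_survey.csv"),
               Path("~/CloudDrive/raw_survey.csv").expanduser()]

    def sha256(path):
        # Hash the file in chunks so large files do not exhaust memory.
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    original = sha256(SOURCE)
    for dest in BACKUPS:
        shutil.copy2(SOURCE, dest)  # copy2 also preserves timestamps
        if sha256(dest) != original:
            raise RuntimeError(f"Backup at {dest} does not match the source")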

When storing and manipulating data, consider how your data files will fit into the scheme of your research project. Wilson et al. recommend putting each project in a separate directory with the following basic substructure (a short script for creating this skeleton appears after the list):

  • A “doc” folder for text documentation and “ReadMe” files
  • A “data” folder for raw data and metadata
  • A “src” folder for any source code (C/C++, Python, R, et cetera)
  • A “bin” folder for any compiled code and programs
  • A “results” folder for cleaned data files and analysis
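
Creating this skeleton by hand is easy to get wrong from one project to the next; a few lines of Python can lay it out the same way every time. The project name below is a placeholder.

    from pathlib import Path

    project = Path("my_analysis")  # hypothetical project name

    # Create the standard subdirectories, skipping any that already exist.
    for sub in ["doc", "data", "src", "bin", "results"]:
        (project / sub).mkdir(parents=True, exist_ok=True)

    # An empty ReadMe placeholder to fill in as the project takes shape.
    (project / "doc" / "ReadMe.txt").touch()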

The authors also discuss how to organize manuscripts based on project results; see their paper for additional details.

[Image: the rainbow-colored words “Own your Data” against a white background.]

Regarding Spreadsheets

No blog on data management, cleaning, and storage would be complete without a discussion about spreadsheets. They are ubiquitous in the business world and are used frequently in academic research as well. Spreadsheet tools such as Microsoft Excel make it easy to visualize data and reports in tabular form, but they may also lead us astray if improperly managed.

Broman and Woo (2018) offer several suggestions for effectively managing spreadsheets, some of which are discussed below.

1. Always be consistent. This rule applies to everything we put in spreadsheets: choose a single code for missing values and a single format for dates. If you have a variable called “State”, don’t put “CA” in some rows and “California” in others. Beware of trailing spaces, too; “CA” will be interpreted differently from “CA   ”, and Excel’s TRIM() function is helpful for dealing with this issue. It is also important to be consistent across files. If you are collaborating with peers, I recommend adopting a shared design convention for your spreadsheets along the lines of the guidelines in this discussion. A consistency-checking sketch follows this list.

2. Separate data from calculations. I agree with Broman and Woo’s suggestion to place raw data and calculations in separate files; that way, we avoid accidentally corrupting the raw data, which can be stored as originally generated. In my experience, it is usually acceptable to put cleaned data into a spreadsheet, but it is still advisable to keep it in a tab separate from any calculations. I also recommend separating manual inputs from calculations: I typically design spreadsheets with an input tab where I can change any manually entered parameters for filtering or summarizing data. This minimizes the risk of changing cells that are not meant to be changed.

3. Organize your data into rectangles. This is the best way to avoid issues with analysis since spreadsheets are inherently rectangular. If your data blocks have blank cells in and around them, then spreadsheet formulas are more likely to depend on the precise location and structure of those blocks. In this situation, updating the spreadsheet may be tedious and error-prone. Also, rectangular data are much easier to export to tabular formats such as CSV. Broman and Woo provide some visual examples of non-rectangular data and discuss corrective measures.

4. Document everything. I suggest giving every spreadsheet file a “documentation” tab that describes the general purpose and function of the workbook. Broman and Woo recommend a data dictionary, which can be placed in a separate file. For each variable in your study, document the variable’s name, units of measure, and a general description. In my experience, it is also a good idea to list each variable’s type and width, i.e., the amount of storage you expect it to occupy; a “Zip Code” variable, for example, may be stored as a character type occupying 5 bytes. If someone (including your future self) ever wants to load the data into a relational database, it is helpful to have this information in advance. A small example data dictionary follows this list.
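
If a shared spreadsheet has already drifted out of consistency, a short script can find the stragglers. Here is a minimal Python sketch using pandas; the file name, column, and category codes are invented for illustration.

    import pandas as pd

    # Hypothetical CSV export of a shared spreadsheet.
    df = pd.read_csv("roster.csv")

    # Strip stray whitespace, the script-side analogue of Excel's TRIM().
    df["State"] = df["State"].str.strip()

    # Collapse long names onto a single consistent code.
    df["State"] = df["State"].replace({"California": "CA", "Texas": "TX"})

    # Flag anything still outside the expected set of codes.
    valid = {"CA", "TX"}
    bad = df.loc[~df["State"].isin(valid), "State"]
    if not bad.empty:
        print("Inconsistent entries:", bad.unique().tolist())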
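
As for the data dictionary, it can be as simple as a small CSV file kept alongside the data, with one row per variable. The variables below are invented for illustration:

    variable,type,width,units,description
    zip_code,char,5,,Five-digit USPS ZIP code stored as text
    age,int,3,years,Respondent age at last birthday
    state,char,2,,Two-letter state abbreviation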

[Image: an anthropomorphized Microsoft Excel icon smiles and flashes a peace sign beneath the words “I Excel at Most Things”. Caption: We Excel at Excel.]

The preceding discussion is by no means an exhaustive list of data practices, and your situation may be governed by a different set of guidelines; for example, data privacy or collaboration requirements may limit your storage options. My hope, however, is that the discussion above offers recommendations you can adapt to your own circumstances. Next quarter, try to incorporate some of these ideas if they suit you. With good data practices, you can avoid information loss and make research a more enjoyable experience for you and your colleagues. Best of luck in the coming year!

References

Broman, K. W., & Woo, K. H. (2018). Data Organization in Spreadsheets. The American Statistician, 72(1), 2–10. https://doi.org/10.1080/00031305.2017.1375989

Perkel, J. (2022, June 17). Data Handling and Storage. Data Management Resources: Guides at University of Houston. Retrieved August 12, 2022, from https://guides.lib.uh.edu/datamanagement/handlingdata

Perkel, J. M. (2019). 11 ways to avert a data-storage disaster. Nature, 568, 131–132. https://doi.org/10.1038/d41586-019-01040-w

Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good Enough Practices in Scientific Computing. PLoS Computational Biology, 13(6), e1005510. https://doi.org/10.1371/journal.pcbi.1005510