puddingR

When writing stories for The Pudding, our team goes through many steps. We start by asking a question about the world that we want to investigate. Then find data to help us answer that question. Next, that data often needs to be analyzed, tidied, and exported for making interactive graphics on the web.

The puddingR package exists to help streamline the analysis and data export process.

This document should help introduce you to puddingR’s tools, and show you how to use them in Pudding (and other) projects.

Installation

If you’re reading this document, you’ve probably already installed the package. But in case you haven’t, install the latest from GitHub and then load it into your R environment.

# install the latest
devtools::install_github("the-pudding/puddingR")

# load into your R environment
library(puddingR)

Set Up The Template Directory

Like The Pudding’s front-end starter template, puddingR assists in setting up a file directory for analysis. Using the same structure helps others to know where files are when looking for them, and helps to keep things organized.

To set up the template directory, open RStudio (download the latest here).

To create a new project, make a new directory (either using point and click or the command line mkdir project-name) and set that as your working directory.

setwd("Desktop/myProject")

No need to worry about the dangers of setting a working directory on your local machine, we’re about to overwrite it and use the here package to manage all absolute and relative references from here on out.

Inside your project directory, we now want to create the analysis starter template. create_project() exists specifically for this purpose.

By default, it creates the following file structure:

analysis
|-- assets
  |-- data
    |-- open_data
    |-- processed_data
    |-- raw_data
|-- packrat
|-- plots
|-- R
|-- reports
|-- rmds
|-- analysis.Rproj

The intention of the directories is as follows:

assets: much like the assets folder in our front-end starter template, anything we may need for our project can go in here
assets/data/raw_data: Data you have just received/downloaded etc. prior to cleaning/analyzing
assets/data/processed_data: Data that you have cleaned/processed in some way
assets/data/open_data: Data that’s ready to go into our open data repo, alongside codebooks to describe the data, and scripts used to go from raw data to released data
packrat: No need to manually touch anything in this directory, but it automatically stores packages you use in your project
plots: Any plots you exported. This is useful if you’re exporting SVG figures to clean up in Figma.
R: Sometimes it’s nice to separate R functions that you use often in your analysis into a separate R script. Those go in here.
rmds: Any Rmarkdown documents that you generate during your analysis
reports: When you knit an Rmd document that is the default Pudding template, you will receive an HTML document in return. These HTML reports should go in the reports folder.
analysis.Rproj: The R project file that helps keep your project contained to this directory.

If you’re ok with the file structure as indicated above, we can run:

create_project()

The default values for everything will create the directory structure as shown above, with an Rmarkdown file (in the rmds folder) to get you started.

It also sets up packrat as a package dependency management system. This allows each of your projects to run with their own versions of packages, and allows you to ensure that the correct versions of packages are retained when you share your project.

All of the above features can be altered via argument options in the create_project() function.

If, for instance, you want the parent directory to be called “R-analysis”, we can change the name argument:

create_project(name = "R-analysis")

If you’d like to create a custom file structure, perhaps something like this:

analysis
|-- data
|-- packrat
|-- R
|-- rmds
|-- analysis.Rproj

We can indicate that via the dirs argument:

create_project(dirs = c("data", "R", "rmds"))

Note that you don’t need to add the .Rproj file as that is generated automatically, or the packrat file as that is generated automatically as long as packagedeps is set to "packrat" (the default value).

For more information on the default values and other arguments, try:

?puddingR::create_project

Export Processed Data

When working on analyzing data, you often need to export data for multiple purposes. Sometimes you export processed data so that you don’t need to run the preliminary analysis on the raw data every time. Other times, you may need the processed data for use in creating web visualizations. Further still, you may need to export the data for sharing in a public repository.

In the puddingR package, there are a few functions to help with these types of tasks. The most basic of which is called export_data().

This function works alone to export data in either csv, tsv, or json format to the location you prefer.

Using the default options, like this:

# export your R dataframe called df to filename "analyzed_data"
puddingR::export_data(df, filename = "analyzed_data")

will export a dataframe (in this case called df) from your R environment, to a csv file in the processed_data directory of our default puddingR layout.

If, you wanted to export the file instead to the open_data directory, you can indicate that in the location argument:

puddingR::export_data(df, filename = "analyzed_data", location = "open")

You can also output to two locations at once. For instance, if you want this file to go to the open_data directory and also in the src/assets/data of the front-end Pudding starter template (assuming this is in the same parent directory as your analysis directory), you can do this:

# use a character vector to output to multiple locations simultaneously
puddingR::export_data(df, filename = "analyzed_data", location = c("open", "js"))

If you aren’t using the default puddingR directory structure, you can specify a custom directory in the directory argument:

# use the directory argument to output to a location outside 
# the default puddingR directory structure
puddingR::export_data(df, filename = "analyzed_data", directory = "myProject/thisOne/")

For more information on the default values and other arguments, try:

?puddingR::export_data

Create Data Codebooks

Sharing data openly is great, and something that we strive to do at The Pudding. But making data publicly accessible isn’t useful unless it is provided alongside useful metadata.

Previously, we had created guidelines of how to include relevant metadata that had to be copy & pasted into separate files and manually filled out for each file that was to be released.

The puddingR packages comes with the function create_codebook() to help streamline this process.

Imagine, you’d like to create a codebook full of metadata for a dataframe in your R environment called df. Using the default values like so:

puddingR::create_codebook(df, filename = "data_codebook")

will save a markdown styled file to the open_data/intermediate directory of the default puddingR file structure.

Any directory can be specified by changing the output_dir argument:

puddingR::render_codebook(df, filename = "data_codebook", output_dir = "myProject/thisOne/")

Also, by default, overwrite is set to FALSE. This is because rendering the codebook is not the final step. This will create a table containing all of the column names, and the types of data contained in each column, but descriptions of each column, as well as descriptions about the data collection method, source, the contact person etc. must all be filled in manually. The codebook generated gives a guide of what kinds of information should be included, but it is up to you to make sure that it is complete.

Combining Codebooks into a Single README

In The Pudding open data repository, the data used for each project is contained in a single folder and all of the codebooks for each file are combined into a single README. Making it easier to find everything at your disposal in the folder. To speed up this process, the puddingR package offers knit_data_readme() which combines several individual codebooks into a single README file.

If you are using the puddingR file structure, you can run the function with no specified arguments:

puddingR::knit_data_readme()

This will create a README.md file inside of open_data combining all of the .md files that it finds in the open_data/intermediate/ directory. Then, everything inside open_data except for the intermediate/ directory can be uploaded to the open data repository.

If you aren’t using the puddingR directory system, you can specify the directory where all of your codebooks are stored, as well as where the README file should be generated.

puddingR::knit_data_readme(directory = "my_codebooks", output_dir = "public_data")

Important: You should not manually change anything in the README.md file. All manual specifications of data should be made in the corresponding codebooks. Then, knit_data_readme() can be run repeatedly to incorporate any updated codebooks. Otherwise, you may accidentally overwrite your file.

Export Code Chunks

While sharing data and metadata is excellent practice, it’s even better if you can also share the code you used to go from raw data to processed data. If you’ve been analyzing your data in an .Rmd file, the puddingR package can make this process easier.

In the past, my workflow for sharing code was to manually go through my .Rmd file, copying the relevant code chunks from .Rmd into a separate .R script that served only one purpose: converting the raw data into the shared, processed data. Now, using the export_code() function, we can do this all at once.

Let’s assume that all of your analysis work exists in analysis.Rmd. Inside of that file you have 50 code chunks, but you only need a few to display this particular bit of data cleaning. You need to ensure that your code chunks are named. So, instead of your code chunk starting like this:

```{r}

Give each code chunk a unique name like this:

```{r load_data}

or ```{r find_average}

Now, if we want to only export those two code chunks into a separate .R script, we can run

puddingR::export_code(file = "analysis.Rmd",
                      chunks = c("load_data", "find_average"),
                      output_file = "average")

This will create average.R in the open_data directory. It will contain all of the code that was included in the load_data and find_average code chunks.

If you want to use a custom directory:

puddingR::export_code(file = "analysis.Rmd",
                      chunks = c("load_data", "find_average"),
                      output_file = "average",
                      output_dir = "myFolder/")

Exporting Data, Codebooks, and Code Simultaneously

Now that you know you can export data, a codebook for metadata, and the code used to clean and process data separately, let’s look at the export_all() function.

This powerful function combines export_data(), export_code(), and create_codebook() all in one neat function.

Let’s imagine the following situation. You want to:

Export a dataframe in your R environment called df.
Make the data available for open data sharing and to be exported to the standard front-end Pudding Starter Template.
Export a codebook with the data
Export the code used to generate the data

You are using the puddingR standard directory structure and The Pudding front-end starter template.

You could do:

# export the data
puddingR::export_data(df, filename = "analyzed_data", location = c("open", "js"))

# export the codebook
puddingR::create_codebook(df, filename = "data_codebook")

# export the code
puddingR::export_code(file = "analysis.Rmd",
                      chunks = c("load_data", "find_average"),
                      output_file = "average")

Or, you could use the export_all() function instead:

puddingR::export_all(df,
                      filename = "analyzed_data",
                      location = c("open", "js"),
                      codebook = TRUE,
                      scripts = c("load_data", "find_average"),
                      script_file = "analysis.Rmd")

Which does the exact same thing.