puddingR.Rmd
When writing stories for The Pudding, our team goes through many steps. We start by asking a question about the world that we want to investigate. Then find data to help us answer that question. Next, that data often needs to be analyzed, tidied, and exported for making interactive graphics on the web.
The puddingR
package exists to help streamline the analysis and data export process.
This document should help introduce you to puddingR
’s tools, and show you how to use them in Pudding (and other) projects.
If you’re reading this document, you’ve probably already installed the package. But in case you haven’t, install the latest from GitHub and then load it into your R environment.
Like The Pudding’s front-end starter template, puddingR
assists in setting up a file directory for analysis. Using the same structure helps others to know where files are when looking for them, and helps to keep things organized.
To set up the template directory, open RStudio (download the latest here).
To create a new project, make a new directory (either using point and click or the command line mkdir project-name
) and set that as your working directory.
No need to worry about the dangers of setting a working directory on your local machine, we’re about to overwrite it and use the here
package to manage all absolute and relative references from here on out.
Inside your project directory, we now want to create the analysis starter template. create_project()
exists specifically for this purpose.
By default, it creates the following file structure:
analysis
|-- assets
|-- data
|-- open_data
|-- processed_data
|-- raw_data
|-- packrat
|-- plots
|-- R
|-- reports
|-- rmds
|-- analysis.Rproj
The intention of the directories is as follows:
knit
an Rmd document that is the default Pudding template, you will receive an HTML document in return. These HTML reports should go in the reports folder.If you’re ok with the file structure as indicated above, we can run:
The default values for everything will create the directory structure as shown above, with an Rmarkdown file (in the rmds
folder) to get you started.
It also sets up packrat
as a package dependency management system. This allows each of your projects to run with their own versions of packages, and allows you to ensure that the correct versions of packages are retained when you share your project.
All of the above features can be altered via argument options in the create_project()
function.
If, for instance, you want the parent directory to be called “R-analysis”, we can change the name
argument:
If you’d like to create a custom file structure, perhaps something like this:
We can indicate that via the dirs
argument:
Note that you don’t need to add the .Rproj file as that is generated automatically, or the packrat
file as that is generated automatically as long as packagedeps
is set to "packrat"
(the default value).
For more information on the default values and other arguments, try:
When working on analyzing data, you often need to export data for multiple purposes. Sometimes you export processed data so that you don’t need to run the preliminary analysis on the raw data every time. Other times, you may need the processed data for use in creating web visualizations. Further still, you may need to export the data for sharing in a public repository.
In the puddingR
package, there are a few functions to help with these types of tasks. The most basic of which is called export_data()
.
This function works alone to export data in either csv
, tsv
, or json
format to the location you prefer.
Using the default options, like this:
# export your R dataframe called df to filename "analyzed_data"
puddingR::export_data(df, filename = "analyzed_data")
will export a dataframe (in this case called df
) from your R environment, to a csv file in the processed_data
directory of our default puddingR
layout.
If, you wanted to export the file instead to the open_data
directory, you can indicate that in the location
argument:
You can also output to two locations at once. For instance, if you want this file to go to the open_data
directory and also in the src/assets/data
of the front-end Pudding starter template (assuming this is in the same parent directory as your analysis directory), you can do this:
# use a character vector to output to multiple locations simultaneously
puddingR::export_data(df, filename = "analyzed_data", location = c("open", "js"))
If you aren’t using the default puddingR
directory structure, you can specify a custom directory in the directory
argument:
# use the directory argument to output to a location outside
# the default puddingR directory structure
puddingR::export_data(df, filename = "analyzed_data", directory = "myProject/thisOne/")
For more information on the default values and other arguments, try:
Sharing data openly is great, and something that we strive to do at The Pudding. But making data publicly accessible isn’t useful unless it is provided alongside useful metadata.
Previously, we had created guidelines of how to include relevant metadata that had to be copy & pasted into separate files and manually filled out for each file that was to be released.
The puddingR
packages comes with the function create_codebook()
to help streamline this process.
Imagine, you’d like to create a codebook full of metadata for a dataframe in your R environment called df
. Using the default values like so:
will save a markdown styled file to the open_data/intermediate
directory of the default puddingR
file structure.
Any directory can be specified by changing the output_dir
argument:
Also, by default, overwrite
is set to FALSE
. This is because rendering the codebook is not the final step. This will create a table containing all of the column names, and the types of data contained in each column, but descriptions of each column, as well as descriptions about the data collection method, source, the contact person etc. must all be filled in manually. The codebook generated gives a guide of what kinds of information should be included, but it is up to you to make sure that it is complete.
In The Pudding open data repository, the data used for each project is contained in a single folder and all of the codebooks for each file are combined into a single README. Making it easier to find everything at your disposal in the folder. To speed up this process, the puddingR
package offers knit_data_readme()
which combines several individual codebooks into a single README
file.
If you are using the puddingR
file structure, you can run the function with no specified arguments:
This will create a README.md
file inside of open_data
combining all of the .md files that it finds in the open_data/intermediate/
directory. Then, everything inside open_data
except for the intermediate/
directory can be uploaded to the open data repository.
If you aren’t using the puddingR
directory system, you can specify the directory where all of your codebooks are stored, as well as where the README
file should be generated.
Important: You should not manually change anything in the
README.md
file. All manual specifications of data should be made in the corresponding codebooks. Then,knit_data_readme()
can be run repeatedly to incorporate any updated codebooks. Otherwise, you may accidentally overwrite your file.
While sharing data and metadata is excellent practice, it’s even better if you can also share the code you used to go from raw data to processed data. If you’ve been analyzing your data in an .Rmd file, the puddingR
package can make this process easier.
In the past, my workflow for sharing code was to manually go through my .Rmd file, copying the relevant code chunks from .Rmd into a separate .R script that served only one purpose: converting the raw data into the shared, processed data. Now, using the export_code()
function, we can do this all at once.
Let’s assume that all of your analysis work exists in analysis.Rmd
. Inside of that file you have 50 code chunks, but you only need a few to display this particular bit of data cleaning. You need to ensure that your code chunks are named. So, instead of your code chunk starting like this:
```{r}
Give each code chunk a unique name like this:
```{r load_data}
or ```{r find_average}
Now, if we want to only export those two code chunks into a separate .R script, we can run
puddingR::export_code(file = "analysis.Rmd",
chunks = c("load_data", "find_average"),
output_file = "average")
This will create average.R
in the open_data
directory. It will contain all of the code that was included in the load_data
and find_average
code chunks.
If you want to use a custom directory:
Now that you know you can export data, a codebook for metadata, and the code used to clean and process data separately, let’s look at the export_all()
function.
This powerful function combines export_data()
, export_code()
, and create_codebook()
all in one neat function.
Let’s imagine the following situation. You want to:
df
.You are using the puddingR
standard directory structure and The Pudding front-end starter template.
You could do:
# export the data
puddingR::export_data(df, filename = "analyzed_data", location = c("open", "js"))
# export the codebook
puddingR::create_codebook(df, filename = "data_codebook")
# export the code
puddingR::export_code(file = "analysis.Rmd",
chunks = c("load_data", "find_average"),
output_file = "average")
Or, you could use the export_all()
function instead:
puddingR::export_all(df,
filename = "analyzed_data",
location = c("open", "js"),
codebook = TRUE,
scripts = c("load_data", "find_average"),
script_file = "analysis.Rmd")
Which does the exact same thing.