class: center, middle, inverse, title-slide # The Promise and Perils of Packaging Your Processes ## (but mostly promise) ### Karl Hailperin
khailper
@khailper
khailper@gmail
### 2018/06/20 --- layout: true <div class="my-footer"><span>khailper.github.io/process_packaging_pres</span></div> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> <!-- reusing footer code from https://github.com/rstudio-education/arm-workshop-rsc2019/blob/master/static/slides/xaringan.Rmd, https://github.com/rstudio-education/arm-workshop-rsc2019/blob/master/static/slides/css/sfah.css, and https://github.com/yihui/xaringan/wiki/Footer-and-header-lines stylesheet code from https://holtzy.github.io/Pimp-my-rmd/#footer_and_header--> --- # Who am I? - Data science consultant - Have built R packages for a client to streamline data processing and visualization - They/them pronouns --- # What this talk is (and isn't) about - It's about practices I've found useful in developing packages for personal use or at work. - Not (necessarily) about destined for Github/CRAN - Not about how to build packages/intro to `devtools` and `roxygen2` (but there will be resources on that at the end) --- # Why should I bother building a package if no one else is going to use it? - `devtools` and `usethis` provide tools for documenting what you were thinking when you wrote the code and what your dependencies are - `testthat` makes it easier to build unit tests that will keep you from breaking things (or at least not noticing that you're breaking things) - Good rule of thumb (that I can't find the source for): If you re-use the same code three times, write a function. If you re-use the same function three times, write a package. - Future you will be grateful - Low-stress way to develop related skills --- # Also, you might be sharing your processes with your co-workers - shared != public - It's likely your team is going to run into same problem. An internal package means you only have to solve it once. - Code review is quicker when you don't have to double-check the same things every time. - Using a package of common processes also acts as a style guide. --- # How repeated code is like for loops ## (and therefore cupcakes) - [For those unfamiliar with the above reference](https://www.youtube.com/watch?v=GyNqlOjhPCQ) - Repeating code obscures changes. - Repeating code leads to accidental changes. - Need to remember to manually propagate changes to all versions. --- # utils.R is your friend (internal functions) - (`utils.R` is a convention from [Hadley Wickham](http://r-pkgs.had.co.nz/namespace.html) for that file where I put the functions that only exist to be called by other functions.) - Remember that not everything needs to be `@exported`-ed - "Rule of three" applies to code within functions - moving chunks of code into a (well-named) internal function can make it easier to review and test your code - makes it easy to update a common workflow or use a parameter to 'switch' which workflow your using --- # Using sensible defaults - Use defaults to hide repeated processes that you don't always want to run. - For me, this was driven home by a _very_ slow processing step I almost never used. - You may have steps that happen enough to put in your package, but sometimes it's critical to skip those steps. - often a simple TRUE/FALSE parameter is enough --- # Case study in internal functions and defaults - Starting point was something like this: ```r # assorted documentation ommitted # usethis::use_pipe() allows us to use %>% #' @export read_page_views <- function(){ here::here("data", "page_views.csv") %>% readr::read_csv() %>% # data comes in as UTC dplyr::mutate(time = lubridate::with_tz(time, tz = "America/Chicago"), # some non-standard characters in page names were # coming out garbled, and needed to be replaced # for use in plot labels page_name = regex_stuff(page_name)) } ``` --- # But then we added bunch of other data sources... -- - Now code looks more like `here::here("data", "web_data","page_views.csv")`. -- - Those first two steps are repeated a lot across similar functions. -- - Also, there's discussion of reading directly from AWS. -- - So now `read_page_views()` looks more like this: -- ```r #' @export read_page_views <- function(){ * read_data_file(location = "web_data", * file_name = "page_views.csv") %>% # data comes in as UTC dplyr::mutate(time = lubridate::with_tz(time, tz = "America/Chicago"), page_name = regex_stuff(page_name)) } ``` -- - `read_data_file` is an internal function that wraps the first two steps --- # That seems overly complicated... -- - That's not entirely unfair. -- - I may have omitted some code. -- - The code looked more like this: ```r #' @export *read_page_views <- function(location){ * data_file <- dplyr::case_when( * location == "local" ~ read_data_file(location = "web_data", * file_name = "page_views.csv"), * location == "aws" ~ read_data_file(location = "aws", * file_name = "page_views.csv"), * TRUE ~ stop("Helpful error message") * ) data_file %>% # data comes in as UTC dplyr::mutate(time = lubridate::with_tz(time, tz = "America/Chicago"), page_name = regex_stuff(page_name)) } ``` --- # read_data_file() pseudo-code ```r read_data_file <- function(location, file_name){ if (location = "aws"){ # code to handle reading from AWS } else{ here::here("data", location, file_name) %>% readr::read_csv() } } ``` - Added bonus: we can give a default value for `location` parameters. - Went from `read_page_views <- function(location = "local"){# ...}` to `read_page_views <- function(location = "aws"){# ...}` when AWS authorization system was ready. - Actual code still just uses `read_page_views()`. - Plus, `read_data_file()` makes it easy to create new `read_*()` functions as we add data sources. Just edit the wrapper and write some data-specific unit tests. --- # Remember regex_stuff(page_name)? -- - Yes...why? -- - It turns out it takes *forever* and almost never matters -- - Better implementation: -- ```r #' @export *read_page_views <- function(location = "local", process_name = FALSE){ data_file <- dplyr::case_when( location == "local" ~ read_data_file(location = "web_data", file_name = "page_views.csv"), location == "aws" ~ read_data_file(location = "aws", file_name = "page_views.csv"), TRUE ~ stop("Helpful error message") ) %>% mutate(time = lubridate::with_tz(time, tz = "America/Chicago")) * if (process_name){ * dplyr::mutate(data_file, * page_name = regex_stuff(page_name)) * } else { * return(data_file) * } } ``` --- # Additional resources - [RStudio's cheetsheet](https://github.com/rstudio/cheatsheets/blob/master/package-development.pdf) - [_R Packages_ by Hadley Wickham](http://r-pkgs.had.co.nz/) - [WIP 2<sup>nd</sup> edition with Jenny Bryan](https://r-pkgs.org/) - ["Writing an R package from scratch" by Hilary Parker](https://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/) - [Thomas Westlake's version of "Writing an R package from scratch" using `usethis`](https://r-mageddon.netlify.com/post/writing-an-r-package-from-scratch/) - ["R Package Primer" by Karl Broman](https://kbroman.org/pkg_primer/) - ["`usethis` workflow for package development" by Emil Hvitfedlt](https://www.hvitfeldt.me/blog/usethis-workflow-for-package-development/) - [The tidyverse style guide](https://style.tidyverse.org/) - If you're interested in packaging your employer's colo(u)r preferences for use in `ggplot2`: [Creating corporate colour palettes for ggplot2]( https://drsimonj.svbtle.com/creating-corporate-colour-palettes-for-ggplot2) by Simon Jackson --- # Thank you/questions