This vignette walks you through several installation options and workflows for analysing data in bulk on a high performance computing (HPC) platform.
Loading package onto an “airgapped” HPC system
If you are working on a secure or “airgapped” system without internet access, using the R remotes package to install nutrientprofiler will not work.
Instead, if there is a file/data transfer process available, the zipped package can be transferred to the secure computational platform and installed locally.
This process can be broken down into 3 steps:
- Downloading the package/zipped archive to your local system
- Transferring the files to the secure system (after any required security checks)
- Installing without internet access on the secure system.
The instructions here assume access to some form of archived CRAN mirror so that other commonly used packages are available. Please check through the packages listed in the code snippets and ensure you have access to these on your system.
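As a quick way to carry out that check, the sketch below reports which packages cannot be found in the current library paths. The package list is an assumption assembled from the snippets in this vignette, and the helper name find_missing_packages is hypothetical; adjust both to suit the snippets you intend to run.

```r
# Packages used by the snippets in this vignette (assumed list; edit as needed)
required_pkgs <- c("tidyr", "dplyr", "readxl", "devtools")

# Return the subset of `pkgs` that cannot be found in the current library paths
find_missing_packages <- function(pkgs) {
  # system.file() returns "" when a package is not installed
  is_installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
  pkgs[!is_installed]
}

missing_pkgs <- find_missing_packages(required_pkgs)
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All listed packages are available.")
}
```

Running this on the secure system before transferring anything shows which dependencies need to be requested from your CRAN mirror.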
1. Downloading the package to your local machine
You can download a specific tagged release (see the GitHub releases page) or a development pre-release version of the code from a specific branch of the repository.
Tagged releases allow for better reproducibility; however, if a pre-release version needs to be used, reproducibility can still be ensured by recording the download link used and the access date.
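One possible way to keep that record is a small log file saved next to the archive. The sketch below appends the download link, destination filename and access time to a csv file; the function name record_download and the log filename download_log.csv are hypothetical choices, not part of nutrientprofiler.

```r
# Append one row (url, destination file, access time) to a csv log file
record_download <- function(url, destfile, log_file = "download_log.csv") {
  entry <- data.frame(
    url = url,
    destfile = destfile,
    accessed = format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"),
    stringsAsFactors = FALSE
  )
  # write the header only when the log file is first created
  log_exists <- file.exists(log_file)
  write.table(entry, log_file, sep = ",", row.names = FALSE,
              col.names = !log_exists, append = log_exists)
  invisible(entry)
}

# Example: log the v1.0.0 release download used later in this vignette
record_download(
  "https://github.com/Leeds-CDRC/nutrientprofiler/archive/refs/tags/v1.0.0.tar.gz",
  "./nutrientprofiler-v1.0.0.tar.gz"
)
```

The log can be transferred with the archive so that the provenance of the installed version stays with it.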
1.1 Downloading a tagged release
The recommended method is to download the most recent tagged release as a .tar.gz archive.
From your local desktop you can run the following R snippet to download the v1.0.0 release of the package:
# download the v1.0.0 release as a .tar.gz archive
# to your current directory
download.file("https://github.com/Leeds-CDRC/nutrientprofiler/archive/refs/tags/v1.0.0.tar.gz",
              destfile = "./nutrientprofiler-v1.0.0.tar.gz")
Alternatively, this can be downloaded directly from the GitHub releases page.
A different release can be specified by changing the release tag. Make sure to also update the destination filename to prevent confusion:
# replace <tag> with the required version number
download.file("https://github.com/Leeds-CDRC/nutrientprofiler/archive/refs/tags/<tag>.tar.gz",
              destfile = "./nutrientprofiler-<tag>.tar.gz")
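To keep the release tag and the destination filename in step, both can be built from a single value. This is a minimal sketch that only constructs strings following the URL pattern shown above; the helper name release_archive is hypothetical.

```r
# Build the download URL and a matching destination filename for a release tag
release_archive <- function(tag, ext = "tar.gz") {
  ext <- match.arg(ext, c("tar.gz", "zip"))
  list(
    url = sprintf(
      "https://github.com/Leeds-CDRC/nutrientprofiler/archive/refs/tags/%s.%s",
      tag, ext
    ),
    destfile = sprintf("./nutrientprofiler-%s.%s", tag, ext)
  )
}

archive <- release_archive("v1.0.0")
archive$url
archive$destfile

# The pair can then be passed on (needs internet access, so left commented out):
# download.file(archive$url, destfile = archive$destfile)
```

Changing the tag in one place then updates both the link and the saved filename.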
If required, .zip archives are also available on the GitHub releases page, or can be downloaded using the R snippet below. Please note that .zip archives require a slightly different installation method from .tar.gz archives; please read through the installation steps first.
# download the <tag> release as a zip archive
# to your current directory
download.file("https://github.com/Leeds-CDRC/nutrientprofiler/archive/refs/tags/<tag>.zip",
              destfile = "./nutrientprofiler-<tag>.zip")
1.2 Downloading the current development version or a specific branch
If you want to install a pre-release version from a specific branch, replace tags/version-number in the URL with heads/branch-name (and give the destination file a sensible name).
Please record the date of download and most recent commit identifier as the branch may be updated or changed following your download and installation.
For example, the following downloads the current package on the “VarEdits” branch of the repository:
# replace "tags/version-number" with "heads/branch-name":
# in this example, we replaced "tags/v1.0.0" with "heads/VarEdits"
download.file("https://github.com/Leeds-CDRC/nutrientprofiler/archive/refs/heads/VarEdits.tar.gz",
              destfile = "./nutrientprofiler-VarEdits.tar.gz")
Again, a zip file version can be downloaded by replacing tar.gz in the snippet above with zip. A zip file can also be downloaded from the project GitHub page by navigating to the required branch, using the green “Code” button to open the “Clone” option menu, and selecting the “Download ZIP” option.
For added reproducibility, you can specify the full git commit ID so that the download is pinned to one exact version of the code:
# use the unique commit id after "archive/" to specify a specific commit on a branch
# use the first 7 characters of the commit id to tag your downloaded archive
download.file("https://github.com/Leeds-CDRC/nutrientprofiler/archive/8502784c9e1402505530d87db001fc23fb0fb6df.tar.gz",
              destfile = "./nutrientprofiler-8502784.tar.gz")
Again, make sure to modify the installation command to match the archive name you’ve supplied.
2. Transferring your code to the secure platform
This step will vary depending on the data transfer policies and processes enforced by your institution. While the package archive can be unzipped for testing, it should be saved on the secure system in its original compressed format.
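Whatever transfer process is used, it can be worth confirming that the archive arrived intact. The sketch below uses tools::md5sum() from base R to compute a checksum before transfer and compare it afterwards. The helper names archive_checksum and verify_archive are hypothetical, and a temporary file stands in for the real archive so that the example runs anywhere.

```r
# Return the MD5 checksum of a file as a plain character string
archive_checksum <- function(path) {
  stopifnot(file.exists(path))
  unname(tools::md5sum(path))
}

# Compare a file against a checksum that was recorded earlier
verify_archive <- function(path, expected_md5) {
  identical(archive_checksum(path), tolower(expected_md5))
}

# Demonstration on a temporary file standing in for the package archive
demo_file <- tempfile(fileext = ".tar.gz")
writeLines("placeholder contents", demo_file)

recorded <- archive_checksum(demo_file)            # run on your local machine
transfer_ok <- verify_archive(demo_file, recorded) # run on the secure system
transfer_ok
```

In practice the recorded checksum would be noted down (or sent through the same transfer process) and then passed to verify_archive() with the path of the transferred archive.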
3. Installation
On the secure computing platform, once the archive has been transferred, you can then install the package. The installation method differs depending on the filetype.
3.1 Install .tar.gz archives
Using an appropriate relative path for the archive, you can install it from source:
# install the package directly from source
install.packages("./nutrientprofiler-v1.0.0.tar.gz", repos = NULL, type="source")
Change the suggested filename to suit your specific installation:
# install the package directly from source
install.packages("./nutrientprofiler-VarEdits.tar.gz", repos = NULL, type="source")
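To confirm that the installation succeeded, one option is to check that R can find the package and to report its version. This is a small sketch using base R only; it does not load the package, and the same check applies whichever archive type was installed.

```r
# TRUE if the package can be found in the current library paths
npm_installed <- nzchar(system.file(package = "nutrientprofiler"))

if (npm_installed) {
  # report the installed version so it can be recorded alongside your results
  message("nutrientprofiler version: ",
          as.character(packageVersion("nutrientprofiler")))
} else {
  message("nutrientprofiler was not found; check the archive path and the install output.")
}
```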
3.2 Install .zip archives
In order to install the package from a .zip file, you need to use the devtools package:
# install devtools (from your available CRAN mirror) and load it
install.packages("devtools")
library(devtools)
# install the package from the local .zip archive
devtools::install_local("./nutrientprofiler-v1.0.0.zip")
Again, change the suggested filename to suit your specific installation.
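If devtools is not available on the secure system, a possible alternative is to unpack the .zip archive and install from the extracted source folder with base R. The sketch below assumes that install.packages() accepts a source directory when repos = NULL, and that the package's dependencies are already installed; the helper name find_package_root is hypothetical.

```r
# Locate the package root: the shallowest folder under `dir` holding a DESCRIPTION file
find_package_root <- function(dir) {
  candidates <- list.files(dir, pattern = "^DESCRIPTION$",
                           recursive = TRUE, full.names = TRUE)
  if (length(candidates) == 0) stop("No DESCRIPTION file found under ", dir)
  # prefer the shortest path in case nested DESCRIPTION files exist
  dirname(candidates[which.min(nchar(candidates))])
}

zip_path <- "./nutrientprofiler-v1.0.0.zip"

# Only attempt the installation if the archive is present
if (file.exists(zip_path)) {
  extract_dir <- tempfile("nutrientprofiler_src_")
  unzip(zip_path, exdir = extract_dir)
  install.packages(find_package_root(extract_dir), repos = NULL, type = "source")
}
```

The package root is searched for, not hard-coded, because the name of the folder inside the archive depends on the tag or branch that was downloaded.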
Example workflow for processing data in bulk
Once nutrientprofiler has been installed on your system, it can be imported and used interactively or run within scripts on bulk data.
Ensure your data is formatted to match the example csv files, with the same column names.
Assuming the data that you want to analyse is stored in "data/example_data.csv" and you want to save the results to "results/example_data_results.csv" with all the original columns plus the results columns, you can use a script like this:
# load required libraries
library(tidyr)
library(dplyr)
library(nutrientprofiler)
# read in the data
npm_testcases <- read.csv("data/example_data.csv")
# Analyse all entries, including specific gravity conversion, NPM scoring and assessment
npm_testcases_results <- npm_testcases %>%
  rowwise() %>%
  mutate(sg = SGConverter(pick(everything()))) %>%
  mutate(test = NPMScore(pick(everything()), sg_adjusted_label = "sg")) %>%
  unnest(test) %>%
  rowwise() %>%
  mutate(assess = NPMAssess(pick(everything()))) %>%
  unnest(assess) %>%
  select(everything(), energy_score, sugar_score, salt_score, fvn_score,
         protein_score, satfat_score, fibre_score, NPM_score, NPM_assessment)
# Save results to a csv file (creating the results directory if it does not exist)
dir.create("results", showWarnings = FALSE)
write.csv(npm_testcases_results, "results/example_data_results.csv", row.names = FALSE)
This only requires slight modification if the desired input file is instead in "data/example_data.xlsx":
# load required libraries
install.packages("readxl") # If this is not already installed in the workspace
library(tidyr)
library(dplyr)
library(readxl)
library(nutrientprofiler)
# read in the data on the first sheet of the spreadsheet (sheet = 1)
npm_testcases <- read_excel("data/example_data.xlsx", sheet = 1)
# The rest of the workflow is the same...
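When several input files need to be processed in one run, the same workflow can be wrapped in a loop over a directory. The sketch below is one possible arrangement and not part of nutrientprofiler: process_directory is a hypothetical helper, and process_fn stands for any function that takes a data frame and returns the results, for example a wrapper around the scoring pipeline shown above.

```r
# Apply `process_fn` to every csv file in `input_dir` and write each result
# to `output_dir` as <original name>_results.csv
process_directory <- function(input_dir, output_dir, process_fn) {
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  input_files <- list.files(input_dir, pattern = "\\.csv$", full.names = TRUE)

  for (input_file in input_files) {
    input_data <- read.csv(input_file)
    results <- process_fn(input_data)
    output_name <- paste0(
      tools::file_path_sans_ext(basename(input_file)), "_results.csv"
    )
    write.csv(results, file.path(output_dir, output_name), row.names = FALSE)
  }

  # return the processed file paths without printing them
  invisible(input_files)
}

# Hypothetical usage, where score_products() wraps the pipeline shown above:
# process_directory("data", "results", score_products)
```

Saved as a script, this can be run non-interactively with Rscript from a batch job on the HPC platform.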
Troubleshooting
The most common errors to arise are likely to be related to incorrect column names in your data, or incorrect datatypes for the values in these columns.
You can check the names of the columns using names(npm_testcases).
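To compare those names against the ones you expect, a set difference is usually enough. In the sketch below the helper name missing_columns is hypothetical, and the two expected names are illustrative only (they are the target names from the renaming example later in this section); the full set of required columns is defined by the example csv files.

```r
# Return the expected column names that are absent from `data_frame`
missing_columns <- function(data_frame, expected) {
  setdiff(expected, names(data_frame))
}

# Illustrative subset of expected names, not the complete list
expected_cols <- c("satfat_measurement_g", "fvn_measurement_percent")

# Small made-up data frame with one wrongly named column
example_data <- data.frame(fat_measurement_g = 1.2, fvn_measurement_percent = 40)

missing_columns(example_data, expected_cols)
# returns "satfat_measurement_g"
```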
You can replace column names using a script like this:
# load required libraries
library(tidyr)
library(dplyr)
library(nutrientprofiler)
# read in the data
npm_testcases <- read.csv("data/example_data.csv")
# Function to rename variables
replace_var_names <- function(data_frame){
  if ("fat_measurement_g" %in% names(data_frame)){
    data_frame <- rename(data_frame, satfat_measurement_g = fat_measurement_g)
  }
  if ("fruit_nut_measurement_percent" %in% names(data_frame)){
    data_frame <- rename(data_frame, fvn_measurement_percent = fruit_nut_measurement_percent)
  }
  # return the data frame explicitly so it is returned even when no column is renamed
  data_frame
}
# Call the function on the example data
replaced_names <- replace_var_names(npm_testcases)
The example shown above uses a function so that this can be reused with multiple data sets that have the same column naming issues.
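For the other common source of errors mentioned at the start of this section, incorrect datatypes, a similar reusable check can help. The sketch below finds columns that should be numeric but were read as text and converts them. The helper names are hypothetical, and which columns ought to be numeric is an assumption here; check this against the example csv files.

```r
# Report columns that are expected to be numeric but were read as another type
non_numeric_columns <- function(data_frame, numeric_cols) {
  present <- intersect(numeric_cols, names(data_frame))
  is_num <- vapply(data_frame[present], is.numeric, logical(1))
  present[!is_num]
}

# Convert the named columns to numeric; values that cannot be parsed become NA
coerce_to_numeric <- function(data_frame, cols) {
  for (col in cols) {
    data_frame[[col]] <- suppressWarnings(
      as.numeric(as.character(data_frame[[col]]))
    )
  }
  data_frame
}

# Made-up example: one measurement column was read as text because of "n/a"
example_data <- data.frame(
  satfat_measurement_g = c("1.5", "2.0", "n/a"),
  fvn_measurement_percent = c(40, 0, 10),
  stringsAsFactors = FALSE
)

bad_cols <- non_numeric_columns(
  example_data, c("satfat_measurement_g", "fvn_measurement_percent")
)
fixed <- coerce_to_numeric(example_data, bad_cols)
str(fixed)
```

Any NA values introduced by the conversion point to entries in the original file that need correcting by hand.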