For this guidance, we are going to use an example dataset called
example_data
. Please see the detailed documentation on what
parameter names and values are expected or required, and to see the
workflows explaining how to load in data.
This example data has a few typos in certain categories. This
documentations steps through one example workflow for filtering and
checking your input data before using the nutrientprofiler
library to analyse it.
# load in required libraries
library(nutrientprofiler)
# read in the data
example_data <- read.csv("data/example_data.csv")
Checking input data column names
While there are an array of functions to check parameter names and missing parameters, it is useful to be able to manually check your input data with base R or a library like tidyr or dplyr additionally.
You can check what column names are in your loaded dataframe using base R:
# Print out your data column names
names(example_data)
#> [1] "name" "brand"
#> [3] "product_category" "product_type"
#> [5] "food_type" "drink_format"
#> [7] "drink_type" "nutrition_info"
#> [9] "energy_measurement_kj" "energy_measurement_kcal"
#> [11] "sugar_measurement_g" "satfat_measurement_g"
#> [13] "salt_measurement_g" "sodium_measurement_mg"
#> [15] "fibre_measurement_nsp" "fibre_measurement_aoac"
#> [17] "protein_measurement_g" "fvn_measurement_percent"
#> [19] "weight_g" "volume_ml"
#> [21] "volume_water_ml"
You can compare these parameter names to the expected parameters that
nutrientprofiler
uses for analysis using the
inputDataCheck()
function:
inputDataCheck(example_data)
#> [1] "All required column names found. Proceed with analysis."
If required, parameter names can be fixed at this point using the
parameterRename()
and fillMissingParameters()
functions.
Checking input data values
Next, you can use the CheckValues()
function to produce
an overview of the values stored in your dataframe:
overview_of_values <- CheckValues(example_data)
print(overview_of_values)
#> Parameter_name Count_unique_values
#> 1 name 10
#> 2 brand 1
#> 3 product_category 1
#> 4 product_type 2
#> 5 food_type 2
#> 6 drink_format 5
#> 7 drink_type 2
#> 8 nutrition_info 5
#> 9 energy_measurement_kj 3
#> 10 energy_measurement_kcal 4
#> 11 sugar_measurement_g 5
#> 12 satfat_measurement_g 3
#> 13 salt_measurement_g 3
#> 14 sodium_measurement_mg 3
#> 15 fibre_measurement_nsp 2
#> 16 fibre_measurement_aoac 3
#> 17 protein_measurement_g 5
#> 18 fvn_measurement_percent 3
#> 19 weight_g 3
#> 20 volume_ml 4
#> 21 volume_water_ml 2
#> Unique_values
#> 1 lembas, zeno's icecream, mystic rush, delta ringer drink, welter water, janus's drink, beta ringer drink, zeta ringer drink, heavyweight water, bantam water
#> 2 NA
#> 3 NA
#> 4 Food, Drink
#> 5 , Ice cream
#> 6 , Ready, Powdered, Cordial, Powder
#> 7 , Carbted/juice drink
#> 8 , Preparation instructions given, As consumed, Prep. instructions not given, Preparation instructions not given
#> 9 266, NA, 188
#> 10 NA, 24, 194, 205
#> 11 50, 21, 11, 15, 19
#> 12 3, 11, 0
#> 13 NA, 0.08, 0.1
#> 14 0.6, NA, 100
#> 15 3, NA
#> 16 NA, 0.7, 0
#> 17 7, 3.5, 0, 0.5, 0.1
#> 18 0, 3, 6
#> 19 100, NA, 25
#> 20 NA, 100, 50, 20
#> 21 NA, 100
You can read these and compare them to the parameter table provided in the documentation, and can save the output to a csv file for future reference:
write.csv(overview_of_values, "path/to/output/file.csv", row.names=FALSE)
In this example, we have a few values with typos that will cause issues:
print(overview_of_values[6,])
#> Parameter_name Count_unique_values Unique_values
#> 6 drink_format 5 , Ready, Powdered, Cordial, Powder
print(overview_of_values[7,])
#> Parameter_name Count_unique_values Unique_values
#> 7 drink_type 2 , Carbted/juice drink
print(overview_of_values[8,])
#> Parameter_name Count_unique_values
#> 8 nutrition_info 5
#> Unique_values
#> 8 , Preparation instructions given, As consumed, Prep. instructions not given, Preparation instructions not given
We have Powder
instead of Powdered
;
Carbted/juice drink
instead of
Carbonated/juice drink
; and
Prep. instructions not given
instead of
Preparation instructions not given
. For clear cases of
typos like this, we can easily replace all incorrect values.
Fixing typos
We can easily fix typos in the dataset using the following script. This is kept separate from applying new values or applying defaults as is described in the Handling Input Data article, where an additional column is added to record where values were overwritten.
example_data$drink_format[example_data$drink_format=="Powder"] <- "Powdered"
example_data$drink_type[example_data$drink_type=="Carbted/juice drink"] <- "Carbonated/juice drink"
example_data$nutrition_info[example_data$nutrition_info=="Prep. instructions not given"] <- "Preparation instructions not given"
After updating these in place, you can check again if you have captured all the incorrect values:
CheckValues(example_data)
#> Parameter_name Count_unique_values
#> 1 name 10
#> 2 brand 1
#> 3 product_category 1
#> 4 product_type 2
#> 5 food_type 2
#> 6 drink_format 4
#> 7 drink_type 2
#> 8 nutrition_info 4
#> 9 energy_measurement_kj 3
#> 10 energy_measurement_kcal 4
#> 11 sugar_measurement_g 5
#> 12 satfat_measurement_g 3
#> 13 salt_measurement_g 3
#> 14 sodium_measurement_mg 3
#> 15 fibre_measurement_nsp 2
#> 16 fibre_measurement_aoac 3
#> 17 protein_measurement_g 5
#> 18 fvn_measurement_percent 3
#> 19 weight_g 3
#> 20 volume_ml 4
#> 21 volume_water_ml 2
#> Unique_values
#> 1 lembas, zeno's icecream, mystic rush, delta ringer drink, welter water, janus's drink, beta ringer drink, zeta ringer drink, heavyweight water, bantam water
#> 2 NA
#> 3 NA
#> 4 Food, Drink
#> 5 , Ice cream
#> 6 , Ready, Powdered, Cordial
#> 7 , Carbonated/juice drink
#> 8 , Preparation instructions given, As consumed, Preparation instructions not given
#> 9 266, NA, 188
#> 10 NA, 24, 194, 205
#> 11 50, 21, 11, 15, 19
#> 12 3, 11, 0
#> 13 NA, 0.08, 0.1
#> 14 0.6, NA, 100
#> 15 3, NA
#> 16 NA, 0.7, 0
#> 17 7, 3.5, 0, 0.5, 0.1
#> 18 0, 3, 6
#> 19 100, NA, 25
#> 20 NA, 100, 50, 20
#> 21 NA, 100
Note that for a large dataset with thousands of values, the
number of unique values for numerical parameters such as
weight_g
will be very high. Instead of printing the full
table, you should instead use subsets for your initial
overview. You can subset the overview table using code like
this:
overview_of_values[c("Parameter_name", "Count_unique_values")]
#> Parameter_name Count_unique_values
#> 1 name 10
#> 2 brand 1
#> 3 product_category 1
#> 4 product_type 2
#> 5 food_type 2
#> 6 drink_format 5
#> 7 drink_type 2
#> 8 nutrition_info 5
#> 9 energy_measurement_kj 3
#> 10 energy_measurement_kcal 4
#> 11 sugar_measurement_g 5
#> 12 satfat_measurement_g 3
#> 13 salt_measurement_g 3
#> 14 sodium_measurement_mg 3
#> 15 fibre_measurement_nsp 2
#> 16 fibre_measurement_aoac 3
#> 17 protein_measurement_g 5
#> 18 fvn_measurement_percent 3
#> 19 weight_g 3
#> 20 volume_ml 4
#> 21 volume_water_ml 2
This allows you to identify the categorical parameters (with manageable numbers of unique values) and to print these rows individually:
# Printing using the row index
print(overview_of_values[6,])
#> Parameter_name Count_unique_values Unique_values
#> 6 drink_format 5 , Ready, Powdered, Cordial, Powder
# Printing using the parameter name
print(overview_of_values[which(overview_of_values$Parameter_name == "drink_format"),])
#> Parameter_name Count_unique_values Unique_values
#> 6 drink_format 5 , Ready, Powdered, Cordial, Powder
Check validity of numeric data
In order to check numeric values, you can use max and minimum values (and measures of average if desired) to check that the values are within the expected bounds by using basic mathematical functions on the column of interest in the original product data dataframe:
print("protein_measurement_g values")
#> [1] "protein_measurement_g values"
print(paste("Max:", max(example_data$protein_measurement_g)))
#> [1] "Max: 7"
print(paste("Min.:", min(example_data$protein_measurement_g)))
#> [1] "Min.: 0"
print(paste("Mean:", mean(example_data$protein_measurement_g)))
#> [1] "Mean: 1.58"
print(paste("Median:", median(example_data$protein_measurement_g)))
#> [1] "Median: 0.5"
Individual values can be interrogated, plotted against other
parameters using a graphics library like ggplot
, and
replaced in a similar way to described above for the character-type
entries.
Please see the Handling Input Data for further pre-processing steps beyond this point.