Input data is likely to be messy and to not contain the required
column names for nutrientprofiler
. This vignette provides
some advice and guidelines on how to pre-process your data to allow
analysis and assessment with nutrientprofiler
.
You can see what parameters are expected in more detail in the Parameter guide documentation or the Example dataset reference. The order of parameters/columns in the loaded data do not matter.
Loading an example template
When possible, it is a good idea to ensure your data is in the required format before yourload it in to R. This can be achieved by using a template.
The nutrientprofiler
package offers an example dataset,
cdrcdrinks
. This can be saved out as an example csv file
and then edited and loaded with real data:
# Save an example csv file to your current working directory
write.csv(cdrcdrinks, "template_full.csv", row.names = FALSE)
Alternatively, just the headers can be saved, providing a blank csv file which can be opened in Excel or your editor of choice:
Checking column names and renaming parameters
If raw data has been acquired with different column names than required, a reproducible workflow can be set up to rename these parameters as required.
The first step is to check the existing column names:
inputDataCheck(cdrcdrinks)
#> [1] "All required column names found. Proceed with analysis."
If the dataframe contains names as expected, you’ll get a message telling you to proceed with analysis, as with the example dataset above.
However, lets say we have loaded in a dataset from csv
irregular_data
that does not contain the exact columns
required:
inputDataCheck(irregular_data)
#> [1] "The provided dataframe is missing the following required column names:"
#> [1] "fvn_measurement_percent"
#> [1] "The provided dataframe contains these unmatched columns names:"
#> [1] "fruit_nutrition_percent"
#> [1] "irregular_data <- parameterRename(missing_column_name = 'fvn_measurement_percent', associated_data_column = '<INSERT DATA COLUMN NAME HERE>', data_frame = irregular_data)"
This provides you with the name of the column(s) that do not match an
expected column name ("fruit_nutrition_percent"
in this
example), and provides you with a prompt to use the
parameterRename()
function.
You can copy this hint and replace
<INSERT DATA COLUMN NAME HERE>
with the data that
matches the listed missing_column_name
:
irregular_data <- parameterRename(missing_column_name = 'fvn_measurement_percent', associated_data_column = 'fruit_nutrition_percent', data_frame = irregular_data)
#> [1] "Replacing fruit_nutrition_percent with fvn_measurement_percent"
You can now check again if the required parameters are available.
inputDataCheck(irregular_data)
#> [1] "All required column names found. Proceed with analysis."
Missing parameters
Some of the parameters are optional, but still need an allocated
column in the input data table. This column can be simply filled with
NA
values. nutrientprofiler
provides some
simple functions to help achieve this quickly.
You can see if we run the inputDataCheck
there are no
returns for “unmatched columns”, suggesting that the data is missing as
opposed to unlabelled:
inputDataCheck(missing_data)
#> [1] "The provided dataframe is missing the following required column names:"
#> [1] "brand" "product_category"
#> [1] "The provided dataframe contains these unmatched columns names:"
#> character(0)
#> [1] "missing_data <- parameterRename(missing_column_name = 'brand', associated_data_column = '<INSERT DATA COLUMN NAME HERE>', data_frame = missing_data)"
#> [2] "missing_data <- parameterRename(missing_column_name = 'product_category', associated_data_column = '<INSERT DATA COLUMN NAME HERE>', data_frame = missing_data)"
We can store the list of missing parameters by using the
listMissingParameters
function:
missing_params <- listMissingParameters(missing_data)
print(missing_params)
#> [1] "brand" "product_category"
If we are happy that these are optional parameters that can be filled
with NA
values, we can pass this list (or a subset of it)
to the fillMissingParameters
function:
filled_data <- fillMissingParameters(missing_data, missing_params)
filled_data
#> name product_type food_type drink_format
#> 1 lembas Food
#> 2 zeno's icecream Food Ice cream
#> 3 mystic rush Drink Ready
#> 4 delta ringer drink Drink Powdered
#> 5 welter water Drink Cordial
#> 6 janus's drink Food
#> 7 beta ringer drink Drink Powdered
#> 8 zeta ringer drink Drink Powdered
#> 9 heavyweight water Drink Cordial
#> 10 bantam water Drink Cordial
#> drink_type nutrition_info
#> 1
#> 2
#> 3 Carbonated/juice drink
#> 4 Preparation instructions given
#> 5 As consumed
#> 6
#> 7 As consumed
#> 8 Preparation instructions not given
#> 9 Preparation instructions given
#> 10 Preparation instructions not given
#> energy_measurement_kj energy_measurement_kcal sugar_measurement_g
#> 1 266 NA 50
#> 2 NA 24 21
#> 3 NA 194 11
#> 4 188 NA 15
#> 5 NA 205 19
#> 6 NA 24 21
#> 7 188 NA 15
#> 8 188 NA 15
#> 9 NA 205 19
#> 10 NA 205 19
#> satfat_measurement_g salt_measurement_g sodium_measurement_mg
#> 1 3 NA 0.6
#> 2 11 0.08 NA
#> 3 0 NA 100.0
#> 4 0 NA 100.0
#> 5 0 0.10 NA
#> 6 11 0.08 NA
#> 7 0 NA 100.0
#> 8 0 NA 100.0
#> 9 0 0.10 NA
#> 10 0 0.10 NA
#> fibre_measurement_nsp fibre_measurement_aoac protein_measurement_g
#> 1 3 NA 7.0
#> 2 NA 0.7 3.5
#> 3 NA 0.0 0.0
#> 4 NA 0.0 0.5
#> 5 NA 0.0 0.1
#> 6 NA 0.7 3.5
#> 7 NA 0.0 0.5
#> 8 NA 0.0 0.5
#> 9 NA 0.0 0.1
#> 10 NA 0.0 0.1
#> fvn_measurement_percent weight_g volume_ml volume_water_ml brand
#> 1 0 100 NA NA NA
#> 2 0 NA 100 NA NA
#> 3 0 NA 100 NA NA
#> 4 3 25 NA 100 NA
#> 5 6 NA 100 NA NA
#> 6 0 NA 100 NA NA
#> 7 3 NA 50 NA NA
#> 8 3 25 NA NA NA
#> 9 6 NA 20 100 NA
#> 10 6 NA 100 NA NA
#> product_category
#> 1 NA
#> 2 NA
#> 3 NA
#> 4 NA
#> 5 NA
#> 6 NA
#> 7 NA
#> 8 NA
#> 9 NA
#> 10 NA
This then allows the data to be processed in bulk without triggerring errors related to missing data.
Data validity
Please see the article on Preprocessing input data to see advice on how to check whether parameters are valid and within the range expected, and to fix any typos in the dataset.
Missing data
Certain parameters are only required for certain product types, so we have included some functions to check if the necessary parameters are present for each specific product.
Find out what data is missing
report_table <- ReqParamCheck(incomplete_data)
#> [1] "Please note that the following parameters are optional and are not checked here:"
#> [1] "name" "brand" "product_category"
#> [1] "Please note that the following parameters have automatically applied defaults if missing and are not checked by this function:"
#> [1] "drink_format" "food_type" "drink_type"
#> [1] "If you wish to check the presence of these parameters and record when defaults are used,"
#> [1] "please see the `AutoDefaultParamCheck()` and `AutoDefaultParamNote()` functions"
#> [1] "Please see the `SaveReport()` function to output a csv version of this report table."
We can transpose and print this table to see how many rows are missing values for each parameter:
t(report_table)
#> [,1]
#> name NA
#> brand NA
#> product_category NA
#> product_type integer,0
#> food_type NA
#> drink_format NA
#> drink_type NA
#> nutrition_info integer,0
#> energy_measurement_kj integer,0
#> energy_measurement_kcal NA
#> sugar_measurement_g integer,3
#> satfat_measurement_g integer,0
#> salt_measurement_g integer,0
#> sodium_measurement_mg NA
#> fibre_measurement_nsp integer,0
#> fibre_measurement_aoac NA
#> protein_measurement_g integer,0
#> fvn_measurement_percent integer,0
#> weight_g integer,0
#> volume_ml NA
#> volume_water_ml integer,0
In addition, there are certain parameters that have automatically
assigned defaults if they are missing
("drink_format", "food_type", "drink_type"
). You can also
check if these parameters are missing using the
AutoDefaultParamCheck()
function. This adds the indices of
rows missing these values to the report dataframe object:
report_table <- AutoDefaultParamCheck(incomplete_data, report_table)
These “automatic defaults” can be recorded in the dataframe of
product data by using the AutoDefaultParamNote()
function,
in the default_params_used
column:
incomplete_data <- AutoDefaultParamNote(incomplete_data, report_table)
#> [1] "Checking drink_format"
#> [1] "Checking food_type"
#> [1] "Checking drink_type"
incomplete_data
#> name brand product_category product_type food_type
#> 1 lembas NA NA Food
#> 2 zeno's icecream NA NA Food Ice cream
#> 3 mystic rush NA NA Drink
#> 4 delta ringer drink NA NA Drink
#> 5 welter water NA NA Drink
#> 6 janus's drink NA NA Food
#> 7 beta ringer drink NA NA Drink
#> 8 zeta ringer drink NA NA Drink
#> 9 heavyweight water NA NA Drink
#> 10 bantam water NA NA Drink
#> drink_format drink_type nutrition_info
#> 1
#> 2
#> 3 Ready Carbonated/juice drink
#> 4 Powdered Preparation instructions given
#> 5 Cordial As consumed
#> 6
#> 7 Powdered As consumed
#> 8 Powdered Preparation instructions not given
#> 9 Cordial Preparation instructions given
#> 10 Cordial Preparation instructions not given
#> energy_measurement_kj energy_measurement_kcal sugar_measurement_g
#> 1 266 NA
#> 2 NA 24
#> 3 NA 194 11
#> 4 188 NA
#> 5 NA 205 19
#> 6 NA 24 21
#> 7 188 NA 15
#> 8 188 NA 15
#> 9 NA 205 19
#> 10 NA 205 19
#> satfat_measurement_g salt_measurement_g sodium_measurement_mg
#> 1 3 NA 0.6
#> 2 11 0.08 NA
#> 3 0 NA 100.0
#> 4 0 NA 100.0
#> 5 0 0.10 NA
#> 6 11 0.08 NA
#> 7 0 NA 100.0
#> 8 0 NA 100.0
#> 9 0 0.10 NA
#> 10 0 0.10 NA
#> fibre_measurement_nsp fibre_measurement_aoac protein_measurement_g
#> 1 3 NA 7.0
#> 2 NA 0.7 3.5
#> 3 NA 0.0 0.0
#> 4 NA 0.0 0.5
#> 5 NA 0.0 0.1
#> 6 NA 0.7 3.5
#> 7 NA 0.0 0.5
#> 8 NA 0.0 0.5
#> 9 NA 0.0 0.1
#> 10 NA 0.0 0.1
#> fvn_measurement_percent weight_g volume_ml volume_water_ml
#> 1 0 100 NA NA
#> 2 0 NA 100 NA
#> 3 0 NA 100 NA
#> 4 3 25 NA 100
#> 5 6 NA 100 NA
#> 6 0 NA 100 NA
#> 7 3 NA 50 NA
#> 8 3 25 NA NA
#> 9 6 NA 20 100
#> 10 6 NA 100 NA
#> default_params_used
#> 1 food_type
#> 2 <NA>
#> 3 <NA>
#> 4 drink_type
#> 5 drink_type
#> 6 food_type
#> 7 drink_type
#> 8 drink_type
#> 9 drink_type
#> 10 drink_type
We can use the report generated by ReqParamCheck()
and
AutoDefaultParamCheck()
to discover what parameters are
missing in more detail. When we printed the transposed report table
above (using t(report_table)
), we saw that some parameters
read NA
- these are optional parameters that were not
checked. Some parameters read integer, 0
meaning that these
parameters were checked, and no data is missing from product entries
where that parameter is required. Where data is missing,
integer, NUM
is repeated with a non-zero value in place of
NUM
, for example sugar_measurement_g
reads
integer, 3
. You can interogate which product entries are
missing data for sugar_measurement_g
using the double
bracket notation:
report_table$sugar_measurement_g[[1]]
#> [1] 1 2 4
Fill missing data
We can enter data for sugar_measurement_g
in products
with the indices 1 2 4
using the
ManualParamUpdate()
function. This function takes the
following arguments: data_frame
, the product data.frame
object being analysed, in this case incomplete_data
;
parameter_name
- in this case
"sugar_measurement_g"
; index_list
- these can
be supplied manually or from the report as shown above;
value
- the value to set the parameter to for these rows.
We are going to use an unrealistically high value of 100.0
in order to clearly see where this has been applied.
updated_data <- ManualParamUpdate(
incomplete_data,
"sugar_measurement_g",
report_table$sugar_measurement_g[[1]],
100.0)
Where values are required to be numeric, we can coerce the values to a numeric type:
updated_data$sugar_measurement_g <- as.numeric(updated_data$sugar_measurement_g)
We can see the updated sugar_measurement_g
values in our
dataframe, before and after the coercion:
# Update sugar values:
print(updated_data$sugar_measurement_g)
#> [1] "100" "100" "11" "100" "19" "21" "15" "15" "19" "19"
updated_data$sugar_measurement_g <- as.numeric(updated_data$sugar_measurement_g)
# Record of where parameters have been manually edited/applied:
print(updated_data$manual_params_used)
#> [1] "sugar_measurement_g" "sugar_measurement_g" NA
#> [4] "sugar_measurement_g" NA NA
#> [7] NA NA NA
#> [10] NA
This updated dataframe can then be used in the analysis workflow examples given in this documentation:
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
updated_data_results <- updated_data %>%
rowwise() %>%
mutate( sg = SGConverter(pick(everything()))) %>%
mutate(test = NPMScore(pick(everything()), sg_adjusted_label="sg")) %>%
unnest(test) %>%
rowwise() %>%
mutate(assess = NPMAssess(pick(everything()))) %>%
unnest(assess) %>%
select(everything(), energy_score, sugar_score, salt_score, fvn_score,
protein_score, satfat_score, fibre_score, NPM_score, NPM_assessment)