

prep_sett:    function

  - returns a list of lists, representing data settings and random forest
    model settings that are to be used in other functions
  - if called without modifying any arguments, the returned list just
    contains default settings

  - the returned list has 2 components named: 'data' and 'rf'

     - 'data' is a sub-list containing settings such as paths to input 
       and output datasets, data imputation options, options for 
       setting-aside of reference data-points for independent validation, 
       etc.


     - 'rf' is a sub-list containing settings that are directly related
       to fitting of, and prediction using, the random forest model.  Most
       of these settings come directly from the package responsible for
       the model (i.e. randomForest).

  - pass the list returned by this function to the print_sett() function
    to get a print-out of the current state of all settings. e.g.:

         sett = prep_sett() # default settings as sett
         print_sett(sett)   # print out current state of settings

  - after modifying settings, pass the settings list to the
    save_sett(sett) function to save the current state of these settings

      - data settings will only be saved if a valid specification has been
        made for sett$data$save_to_json
      - random forest settings will only be saved if a valid specification
        has been made for sett$rf$save_to_json
      - e.g.:

         sett$data$inRefCSV = 'myRefData.csv' # change data setting from default
         sett$rf$ntree = 600  # change random forest setting from default

         sett$data$save_to_json = 'myDataSettings.json'
         sett$rf$save_to_json = 'myRandomForestSettings.json'

         save_sett(sett) # save the current state of the settings to file

  - previously saved json settings files can be loaded using the 
    'DATA_load_from_json' and 'RF_load_from_json' arguments of the 
    prep_sett() function. e.g.:

         sett = prep_sett(DATA_load_from_json='myDataSettings.json',
                          RF_load_from_json='myRandomForestSettings.json')
         print_sett(sett)   # print out current state of settings


save_sett:    function

  - can be used to save an updated copy of the data and random forest
    settings (sett$data & sett$rf) to json file
  - arguments:

       sett :  a copy of the settings list, with valid specifications
               for sett$data$save_to_json and/or
               sett$rf$save_to_json

  - e.g.:

         sett = prep_sett() # default settings as sett

         sett$data$inRefCSV = 'myRefData.csv' 
         sett$rf$ntree = 600  

         sett$data$save_to_json = 'myDataSettings.json'
         sett$rf$save_to_json   = 'myRandomForestSettings.json'

         save_sett(sett) # save the current state of the settings

print_sett:    function

  - can be used to print the current state of the data and/or random
    forest settings to the console
  - arguments:

       sett :  a copy of the settings list

  - e.g.:

         sett = prep_sett() # default settings as sett

         sett$data$inRefCSV = 'myRefData.csv' 
         sett$rf$ntree = 600  

         print_sett(sett) # print data and rf settings

         print_sett(sett$data) # print just data settings

         print_sett(sett$rf) # print just rf settings

resetRefDataframe:    function

  - can be used to reset the reference data or imputer reference data 
    (refdat or impdat) data-frame to one of two previous states:
      - immediately before data imputation and splitting
      - immediately after data imputation and splitting

  - arguments:

       dat :  a copy of the data frame that is to be reset

       imputedAndSplit :  (default FALSE) logical representing what state
                           the data frame should be returned to

              if TRUE: immediately after data imputation (i.e. as returned
                       by the b_imputeDataAndSplitIndValidSet function)

              if FALSE: immediately before data imputation (i.e. as returned
                        by the a_initialDataPrep function)

  - returns: a copy of the reset data frame

  - e.g.:

         a = a_initialDataPrep(sett) 
         b = b_imputeDataAndSplitIndValidSet(a$refdat,sett)
         c = c_prepRandomForest(b$refdat,b$impdat,sett)

         refdat=c$refdat; impdat=c$impdat; rf=c$rf
         # at this point both refdat and impdat have prediction results
         # attached to them, from the c_prepRandomForest function

         # reset refdat and impdat to their state as returned
         # by b_imputeDataAndSplitIndValidSet
         refdat = resetRefDataframe(refdat,imputedAndSplit=TRUE)
         impdat = resetRefDataframe(impdat,imputedAndSplit=TRUE)

a_initialDataPrep:    function

  - reads input files, prepares a data frame representing reference data 
    (refdat), and extracts raster data at reference point locations

  - arguments:

       sett :  a copy of the settings list

  - returns: a list with two components:

       'refdat' the data frame representing the reference data, with 
                extracted raster data attached.

       'inrast' a copy of the input raster dataset, opened as a 
                raster::RasterStack object


b_imputeDataAndSplitIndValidSet:    function

  - takes a copy of the reference data data-frame, as returned by the 
    a_initialDataPrep function, imputes the data, and splits/sets-aside
    the data points into an independent validation set and a set that is
    to be used for building/training the Random Forest model

  - see the following data settings: 
          'indValidSplit','splitByClass',
          'impute_strategy','impute_by_class',

  - arguments:

       refdat  :  a copy of the reference data data-frame, as returned by
                  the a_initialDataPrep function
       sett    :  a copy of the settings list

       verbose :  (default: TRUE) boolean representing whether or not to
                   print details of the imputation and splitting to the 
                   console

  - returns: a list with two components:

       'refdat' a data frame representing the reference data (refdat), 
                with a new field/column ('indValidSet') representing 
                whether a data point has been kept for the model (value of 
                0), set aside for independent validation (value of 1), or,
                in some cases, omitted due to the presence of NoData in
                combination with the lack of an imputation strategy (value
                of -1)

       'impdat' a data frame representing the imputed (or subset)  
                reference data (impdat), after having implemented the 
                imputation strategy. It is similar to the refdat data 
                frame that is returned by this function, but NoData 
                values from the raster dataset have been imputed at 
                reference data locations. If the imputation strategy was 
                NA, then records/rows containing NoData values have been
                omitted from this data frame instead.


c_prepRandomForest:    function

  - prepares and fits/trains a Random Forest model based on the 
    imputed/omitted reference data (impdat) and the user-specified random 
    forest settings
  - also updates both the reference data and imputed reference data 
    data-frames with predictions and probabilities of class membership at
    reference data locations

  - arguments:

       refdat :  a copy of the reference data data-frame, as returned by
                 the b_imputeDataAndSplitIndValidSet function
       impdat :  a copy of the imputed and independent validation-split 
                 reference data data-frame, as returned by the
                 b_imputeDataAndSplitIndValidSet function
       sett   :  a copy of the settings list

  - returns: a list with three components:

       'refdat' a copy of the reference data data-frame (refdat) with 
                Random Forest predictions/probabilities added.

       'impdat' a copy of the imputed/omitted reference data data-frame 
                (impdat) with Random Forest predictions/probabilities 
                added.

       'rf'  the trained/fit Random Forest classifier object

d_independentValidation:    function

  - generates accuracy statistics from the Random Forest classification 
    based on the reference data points that were set aside for independent
    validation
  - accuracy statistics include:  error matrix, user's & producer's
                                   accuracies, overall accuracy, kappa

  - arguments:

       refdat :  a copy of the reference data data-frame, as returned by
                 the c_prepRandomForest function
       sett   :  a copy of the settings list

  - returns: a list with the following components:

       'errorMat'   a data frame representing the error matrix 

       'classAcc'   a data frame representing class-specific user's 
                      and producer's accuracies 

       'overallAcc' a float value representing the overall accuracy 

       'kappa'      a float value representing Cohen's kappa coefficient
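
  - e.g. a minimal base-R sketch of how these statistics relate to an
    error matrix. This is not the script's own code: the class values
    are made up, and the matrix orientation (rows = predicted, 
    columns = reference) is an assumption.

```r
# Hypothetical reference and predicted class numbers for eight validation points
ref  <- c(1, 1, 1, 2, 2, 2, 3, 3)
pred <- c(1, 1, 2, 2, 2, 3, 3, 3)

# Shared factor levels keep the matrix square even if a class is never predicted
classes  <- sort(unique(c(ref, pred)))
errorMat <- table(predicted = factor(pred, levels = classes),
                  reference = factor(ref,  levels = classes))

n          <- sum(errorMat)
overallAcc <- sum(diag(errorMat)) / n

# User's accuracy: correct / total predicted as that class (row totals)
usersAcc     <- diag(errorMat) / rowSums(errorMat)
# Producer's accuracy: correct / total reference points of that class (column totals)
producersAcc <- diag(errorMat) / colSums(errorMat)

# Cohen's kappa: agreement beyond what the row/column totals give by chance
expectedAcc <- sum(rowSums(errorMat) * colSums(errorMat)) / n^2
kappaCoef   <- (overallAcc - expectedAcc) / (1 - expectedAcc)
```

    With these made-up values the overall accuracy is 6/8 and kappa is
    27/43; the function's actual output layout may differ.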

e_consolidateArossIterations:    function

  - Consolidates independent validation results and Random Forest model
    characteristics across iterations into individual data frames and 
    (optionally) output CSV files

  - arguments:

       dat   :  a copy of the reference data (refdat) or imputed reference 
                data (impdat),

       sett  :  a copy of the settings list

       valid :  (optional) a list of independent validation results as 
                returned by the d_independentValidation function
                - this is required for consolidating independent 
                  validation accuracy statistics and/or error matrices

       rf    :  (optional) a list of random forest models as returned by 
                the c_prepRandomForest function
                - this is required for consolidating feature importances, 
                  out-of-bag error rates, and out-of-bag error matrices

  - returns: a list with the following components:

       'errorMat'  a data frame of error matrices across iterations 
                  (requires input for valid). This object is NA if 
                  valid argument is NA / not specified.

       'accStats'  a data frame of independent validation accuracy 
                  statistics across iterations (requires input for 
                  valid argument). This object is NA if valid is NA
                  / not specified.

       'featureImportances' a data frame of feature importances across
                     iterations (requires input for rf argument). This 
                     object is NA if rf argument is NA / not 
                     specified.

       'oobErrorRate'  a data frame of out-of-bag Random Forest error 
                       rates across iterations (requires input for rf 
                       argument). This object is NA if rf argument is NA. 

       'oobErrorMat'  a data frame of out-of-bag Random Forest error 
                      matrices across iterations (requires input for rf 
                      argument). This object is NA if rf argument is NA. 
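
  - e.g. a rough sketch of what consolidating across iterations can look
    like in base R. The values are invented, the column names are
    assumptions, and the real function may organise its output 
    differently.

```r
# Hypothetical per-iteration results, shaped like the 'overallAcc' and 'kappa'
# components described for d_independentValidation
valid <- list(
  list(overallAcc = 0.80, kappa = 0.70),
  list(overallAcc = 0.85, kappa = 0.78),
  list(overallAcc = 0.82, kappa = 0.73)
)

# Build one row per iteration, then stack the rows into a single data frame
accStats <- do.call(rbind, lapply(seq_along(valid), function(i) {
  data.frame(iteration  = i,
             overallAcc = valid[[i]]$overallAcc,
             kappa      = valid[[i]]$kappa)
}))

# Mean of each statistic across iterations
accMeans <- colMeans(accStats[, c("overallAcc", "kappa")])
```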

working_dir:    data setting and random forest setting

  - represents the current working directory
  - to change this setting you should use setwd(), as in: 
         setwd('c:\\working\\myWorkingDir')

load_from_json:    data setting and random forest setting

  - default:  NA  (do not load settings from json file)
  - a string representing the filename (if in working directory) or 
    the full path to an input json file storing the settings
  - if a valid filename or path is specified, settings (data or rf 
    settings) will be loaded from that file
  - filename should end in '.json'

save_to_json:    data setting and random forest setting

  - default:  NA  (do not save settings to json file)
  - a string representing the filename (in working directory) or the 
    full path to an output json file for storing the settings
  - if a valid filename or path is specified, settings (data or rf  
    settings) will be saved to that file
  - filename should end in '.json'

randSeedValue:    data setting and random forest setting

  - default:  NA    (no setting of random seed)
  - a numeric value that will be used to set the random seed
  - the randSeedValue setting in the data settings list (sett$data) is
    used to set the random seed when the a_initialDataPrep function 
    is called (i.e. seed for data prep such as independent validation 
    splitting)
  - the randSeedValue setting in the random forest settings list 
    (sett$rf) is used to set the random seed when the c_prepRandomForest
    function is called (i.e. seed for the random forest algorithm)
     - *CAUTION* if performing random forest fitting iterations, this 
       setting (sett$rf$randSeedValue) should be changed for each 
       iteration or the output results from each iteration will be the 
       same (assuming no change in the data).
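
  - e.g. the caution above can be illustrated with base R's set.seed().
    This sketch does not call the script's functions; sample() stands
    in for the random choices made during model fitting, and the base
    seed value is hypothetical.

```r
baseSeed <- 100   # hypothetical starting seed
nIter    <- 3

draws <- vector("list", nIter)
for (i in seq_len(nIter)) {
  # analogous to giving sett$rf$randSeedValue a new value for each iteration
  set.seed(baseSeed + i)
  draws[[i]] <- sample(1:1000, 5)
}

# Re-using the seed from iteration 1 reproduces that iteration's draw exactly
set.seed(baseSeed + 1)
repeatDraw <- sample(1:1000, 5)
identical(repeatDraw, draws[[1]])   # TRUE
```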

inRefCSV:    **REQUIRED** data setting

  - default:  ''    (empty string; user specification is required)
  - string representing the filename (if in working directory) or 
    the full path to an input CSV file representing reference data 
    points
  - at the very least, this file should contain:
     - a field/column containing integer values representing classes
     - two fields/columns containing values representing x and y 
       coordinates
         - these coordinates must be in the same spatial reference 
           system as the input raster dataset

FN_pointID:    data setting

  - default:  ''    (empty string; script will assign unique  
    identifier to each reference data point)
  - string representing the name of the field/column within the  
    input reference data CSV file that contains a 
    unique identifier for each reference data point
  - not required, but if specified, this information may be included  
    in some output files

FN_classnum:    **REQUIRED** data setting

  - default:  ''    (empty string; user specification is required)
  - string representing the name of the field/column within the  
    input reference data CSV file that contains integer values
    representing reference classes

FN_classlab:    data setting

  - default:  ''    (empty string; no class labels will be 
                     included)
  - string representing the name of the field/column within the 
    input reference data CSV file that contains class labels (e.g. 
    text representing the land cover that a given class number
    represents)
  - not required, but if specified, this information may be included 
    in some output files

FN_xy:    **REQUIRED**  data setting

  - default:  c('','')   (vector of two empty strings; user 
    specification is required)
  - vector containing two strings representing the names of the 
    fields/columns within the input reference data CSV file that 
    contain the x and y coordinates (respectively) of the 
    reference data point locations 
     - these coordinates must be in the same spatial reference  
       system as the input raster dataset

inRastPath:    **REQUIRED**  data setting

  - default:  ''    (empty string; user specification is required)
  - string representing the filename (if in working directory) or  
    the full path to an input raster dataset

indValidSplit:    data setting

  - default:  0.3    (30% of the valid reference data points will be set 
                      aside for independent validation)
  - a numeric value ( >=0.0 and <1.0 ) representing the proportion of the 
    reference data to split off / set-aside for independent validation.
  - if 0, no independent validation will be performed
  - the splitting-off of the independent validation set will take 
    place after data imputation
  - splitting of independent validation set will also depend on the 
    specified 'splitByClass' data setting

splitByClass:    data setting

  - default:  TRUE    (splitting off of the independent validation 
                       will occur at the class level)
  - boolean representing whether (TRUE) or not (FALSE) the 
    independent validation split proportion should be applied at 
    the class level
  - if FALSE, the split proportion (see 'indValidSplit' data 
    setting) will be applied to the reference dataset in general, 
    meaning the class proportions in the independent validation set
    may differ slightly from those in the model set
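
  - e.g. a minimal sketch of a class-level split in base R. The data
    frame, the seed, and the rounding of the per-class count are all
    assumptions for illustration; the script's own splitting logic may
    differ.

```r
set.seed(42)  # hypothetical seed, only so the sketch is repeatable

# Hypothetical reference data: 20 points of class 1 and 10 points of class 2
refdat <- data.frame(classNum = rep(c(1, 2), times = c(20, 10)))
indValidSplit <- 0.3

# Start with every point kept for the model (0), then flag validation points (1)
refdat$indValidSet <- 0
for (cl in unique(refdat$classNum)) {
  rows   <- which(refdat$classNum == cl)
  nValid <- round(length(rows) * indValidSplit)
  picked <- rows[sample.int(length(rows), nValid)]
  refdat$indValidSet[picked] <- 1
}

# Rows are classes; columns are indValidSet values (0 = model, 1 = validation)
splitCounts <- table(refdat$classNum, refdat$indValidSet)
```

    Here each class contributes 30% of its own points (6 of 20 and
    3 of 10), so class proportions are the same in both sets.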

minPointsForModel:    data setting

  - default:  10    (if fewer than 10 reference points are left 
                     for the model, an error will be raised)
  - integer value (>=0) representing the minimum number of reference
    data points that should be going into the model / Random Forest
    algorithm following the independent validation split 
  - ignored if indValidSplit is zero (all data are being used for 
    the model) 
  - if less than this number of data-points are remaining for the 
    model, an error will be raised 
  - if 'splitByClass' data setting is TRUE, then this number will 
    be checked against the number of reference data points of each 
    class that are going into the model 
  - if 'splitByClass' data setting is FALSE, then this number will 
    be checked against the overall number of reference data points,
    in general, that are going into the model 

saveDataCSV:    data setting

  - default:  NA    (no reference data CSV file will be saved)
  - string representing the filename (if in working directory) or 
    the full path to an output CSV file storing the following
    information:  

     - indicator of whether a data point was set aside for 
       independent validation (indValidSet). 
        - Value of 1 for set aside. 0 for not. If  'impute_strategy'  
          is set to NA (omission), the omitted rows are still  
          included in this file, but will have a value of -1 for 
          indValidSet. 

     - reference data: ref. class labels (classLab),ref. class 
       values (classNum), x/y coordinates (x/y) 

     - extracted (non-imputed / no omissions) training samples with
       NA for NoData 
        - Predictions may not have been made based on these data. 
          See 'impute_strategy', below. 
        - Field/Column names begin with 'ch#_' where # is any 
          number of digits representing the image channel from which 
          the data were extracted 

     - predictions (predict) and probabilities of class membership 
       at each reference data point 
        - Class membership probability fields are named 'prob_c#',
          where # represents a class number/value. 

  - if specified, file will be generated when the c_prepRandomForest()
    function is called
  - specified path must not already exist. No over-writing.

impute_strategy:    data setting

  - default:  'mean'    (NoData values will be substituted with the 
                         mean)
  - either NA, a string representing the data imputation strategy, 
    or numeric value for filling missing values

  - If NA, no imputation will be performed. Instead, any data 
    points that contain NoData / NA values in any channel, will be
    omitted.  This omission will be done prior to the independent 
    validation split.
  - If 'mean', missing values will be replaced with the mean of the 
    column.
  - If 'median', missing values will be replaced with the median of 
    the column.
  - If a numeric value is specified, missing values will be replaced
    with this value.

impute_by_class:    data setting

  - default:  TRUE    (imputation will be performed at the class 
                       level) 
  - logical representing whether (TRUE) or not (FALSE) the 
    imputation should be performed at the class level 
  - this setting is ignored if the impute_strategy is NA or a 
    constant value (numeric)
  - e.g.  if impute_strategy is 'mean' and impute_by_class is TRUE
    then, for a given channel, missing values for class 1 will be 
    replaced by the mean of the reference points that are of
    reference class 1.  The same for class 2, and so on.
  - e.g.  if impute_strategy is 'mean' and impute_by_class is FALSE
    then, for a given channel, missing values for class 1 will be 
    replaced by the mean of the channel (at ref. point locations) 
    regardless of reference class.
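
  - e.g. the two cases above can be sketched in base R as follows. The
    data frame and the fillWithMean() helper are hypothetical and are
    not part of the script.

```r
# Hypothetical extracted values for one channel, with NA standing in for NoData
impdat <- data.frame(
  classNum = c(1, 1, 1, 2, 2, 2),
  ch1      = c(10, NA, 14, 100, 104, NA)
)

# Replace NA values with the mean of the non-missing values in x
fillWithMean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# impute_by_class = TRUE: the mean is taken within each reference class
byClass <- ave(impdat$ch1, impdat$classNum, FUN = fillWithMean)

# impute_by_class = FALSE: the mean is taken over the whole channel
overall <- fillWithMean(impdat$ch1)
```

    With these values, the class-level fill gives 12 for class 1 and 102
    for class 2, while the overall fill gives 57 for both missing points.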

saveImputedDataCSV:    data setting

  - default:  NA    (no imputed reference data CSV file will be saved)
  - string representing the filename (if in working directory) or 
    the full path to an output CSV file storing a copy of the reference
    data with imputation (or omission in the case where 'impute_strategy'
    is NA) applied to missing values
  - field/column names are the same as for 'saveDataCSV', above, but 
    the extracted raster data ('ch#_...') have either been imputed
    or points have been omitted, based on the 'impute_strategy' and 
    'impute_by_class' options.
  - if specified, file will be generated when the c_prepRandomForest()
    function is called
  - specified path must not already exist. No over-writing.

saveRFclassifier:    data setting

  - default:  NA   (Random Forest classifier will not be saved to file)
  - string representing the filename (if in working directory) or 
    the full path to an output RData file (should end with .RData
    extension) storing a copy of the fitted Random Forest model
  - saving the model may be necessary if you later want to apply 
    predictions over the full image set
  - if specified, file will be generated when the c_prepRandomForest()
    function is called
  - specified path must not already exist. No over-writing.

saveImportancesCSV:    data setting

  - default:  NA    (no CSV file representing feature importances
                     from the current model will be saved)
  - string representing the filename (if in working directory) or 
    the full path to an output CSV file, storing a copy of the
    feature importances from the current Random Forest model
  - if specified, file will be generated when the c_prepRandomForest()
    function is called
  - specified path must not already exist. No over-writing.
  - if not specified, file representing feature importances across 
    several models can still be generated when the 
    e_consolidateArossIterations() function is called

saveErrorMatrixCSV:    data setting

  - default:  NA    (no CSV file representing independent 
                     validation error matrices will be saved)
  - string representing the filename (if in working directory) or 
    the full path to an output CSV file, storing a copy of the
    error matrices from the independent validation
  - if specified, file will be generated when the
    d_independentValidation() function is called
  - specified path must not already exist. No over-writing.
  - if not specified, file representing independent validation error
    matrices across several models can still be generated when the
    e_consolidateArossIterations() function is called

saveAccuracyCSV:    data setting

  - default:  NA    (no CSV file representing independent 
                     validation accuracy statistics will be saved)
  - string representing the filename (if in working directory) or 
    the full path to an output CSV file, storing a copy of the
    accuracy statistics from the independent validation
  - if specified, file will be generated when the
    d_independentValidation() function is called
  - specified path must not already exist. No over-writing.
  - if not specified, file representing independent validation 
    accuracy statistics across several models can still be generated
    when the e_consolidateArossIterations() function is called

consol_outCSV_dir:    data setting

  - default:  NA    (no CSV files representing info consolidated 
                     across Random Forest models will be saved)
  - string representing path (relative to working directory or full
    path) to a directory that CSV files output from the 
    e_consolidateArossIterations() function will be stored within
  - if specified, several CSV files representing statistics that 
    have been consolidated across model fitting & validation  
    iterations may be saved, including:

     - ...accuracyStats.csv: independent validation accuracy
              statistics.  Requires specification for the 'valid'
              argument when calling e_consolidateArossIterations()

     - ...importances.csv: feature importances.  Requires 
              specification for the 'rf' argument when calling 
              e_consolidateArossIterations()

     - ...indErrorMatrices.csv: independent validation error
              matrices.  Requires specification for the 'valid'
              argument when calling e_consolidateArossIterations()

     - ...OOBerrorStats.csv: Error rates from the model's out-of-bag
              data.  Includes overall error rate and class specific 
              error rates.  Requires specification for the 'rf'
              argument when calling e_consolidateArossIterations()

     - ...OOBerrorMatrices.csv: Error matrices from the model's 
              out-of-bag data. Requires specification for the 'rf'
              argument when calling e_consolidateArossIterations()

     '...' represents the 'consol_outCSV_basename' data setting

  - files will be generated when the e_consolidateArossIterations()
    function is called
  - over-writing of output CSV files not supported so a different 
    'consol_outCSV_dir' and/or 'consol_outCSV_basename' should be
    specified each time the e_consolidateArossIterations() function
    is called (unless previously-generated files have been moved or
    deleted)

consol_outCSV_basename:    data setting

  - default:  'consol_'    
  - string representing basename for CSV files output when calling
    the e_consolidateArossIterations() function
  - if specified as an empty string (''), CSV files can still be 
    generated but they won't have a basename
  - if specified, recommend ending with an underscore or other 
    character to separate the basename from the rest of the filename
  - over-writing of output CSV files not supported so a different 
    'consol_outCSV_dir' and/or 'consol_outCSV_basename' should be
    specified each time the e_consolidateArossIterations() function
    is called (unless previously-generated files have been moved or
    deleted)
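
  - e.g. a sketch of how the output filenames might be composed from 
    these two settings, and how to check for existing files before 
    calling e_consolidateArossIterations(). The directory and basename
    are hypothetical, and the exact way the script joins them is an
    assumption.

```r
consol_outCSV_dir      <- file.path(tempdir(), "consolidated")  # hypothetical directory
consol_outCSV_basename <- "run01_"                              # hypothetical basename

# Filename endings as listed under the 'consol_outCSV_dir' setting
suffixes <- c("accuracyStats.csv", "importances.csv", "indErrorMatrices.csv",
              "OOBerrorStats.csv", "OOBerrorMatrices.csv")
outPaths <- file.path(consol_outCSV_dir, paste0(consol_outCSV_basename, suffixes))

# Over-writing is not supported, so stop early if any target already exists
alreadyThere <- outPaths[file.exists(outPaths)]
if (length(alreadyThere) > 0) {
  stop("Output file(s) already exist: ", paste(alreadyThere, collapse = ", "))
}
```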

ntree:    random forest ('rf') setting

  *** see the official documentation for randomForest: ?randomForest 
