Difference between revisions of "Random Forest (ViGrA) Classification in SAGA"
Revision as of 16:39, 21 December 2014
Tutorial on performing Random Forest Classification using SAGA
Purpose
This tutorial demonstrates how to perform a Random Forest classification using the ViGrA tool in SAGA. Random Forest (RF) is an algorithm that uses an ensemble of decision trees. The predictions of the individual trees are combined, by majority vote for classification or by averaging for regression, to produce the final result. This tutorial covers the basics of creating training data and running a land cover Random Forest classification in SAGA.
Introduction
Random Forest (RF) classification is an ensemble learning method that uses decision tree classifiers. For each tree, a random sub-sample of the training data is taken and a classification is made from that sub-sample. This process is repeated for a user-specified number of trees, and the results are combined (by majority vote for classification) to improve accuracy and reduce over-fitting. RF is among the most widely used classification algorithms: it runs efficiently on large data sets, even with a large number of variables (100+). It provides an internal estimate of the classification error, and it can report which variables are important and which are not. It also allows for easy computation of similarities between samples and of the statistical uncertainty of a classification. Furthermore, a basic Random Forest image classification is available in the open-source software SAGA through the ViGrA library.
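The ensemble idea described above can be illustrated with a minimal sketch in plain Python. It is not the ViGrA implementation: each "tree" here is reduced to a single split on one value (for example, a NIR pixel value), and the sample values, class names and function names are invented for illustration. It only shows how random sub-samples and a majority vote fit together.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-split 'tree' on (value, label) rows.

    Candidate thresholds are the midpoints between neighbouring values;
    the one that misclassifies the fewest rows is kept.
    """
    values = sorted({value for value, _ in sample})
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    best = (len(sample) + 1, None, majority, majority)
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        below = Counter(label for value, label in sample if value <= threshold)
        above = Counter(label for value, label in sample if value > threshold)
        below_label = below.most_common(1)[0][0]
        above_label = above.most_common(1)[0][0]
        errors = len(sample) - below[below_label] - above[above_label]
        if errors < best[0]:
            best = (errors, threshold, below_label, above_label)
    _, threshold, below_label, above_label = best
    if threshold is None:
        # Only one distinct value in the sample: always predict the majority.
        return lambda value: majority
    return lambda value: below_label if value <= threshold else above_label

def train_forest(rows, tree_count, seed=0):
    """Train each tree on its own random sub-sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(tree_count):
        bag = [rng.choice(rows) for _ in rows]
        forest.append(train_stump(bag))
    return forest

def classify(forest, value):
    """Every tree votes; the most common class wins."""
    votes = Counter(tree(value) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical training rows: (NIR value, class)
rows = [(5, "water"), (8, "water"), (12, "water"),
        (60, "forest"), (72, "forest"), (90, "forest")]
forest = train_forest(rows, tree_count=25)
print(classify(forest, 7), classify(forest, 80))
```

A real Random Forest grows full decision trees over many variables and also randomizes which variables are tried at each split; the voting step is the same in principle.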
Further information on Random Forest can be found in the Wikipedia article on Random Forest.
Software
Random Forest classification can be run as a script in an environment such as R or Python. However, these environments can have a steep learning curve, and importing and exporting files can be complex. Fortunately, SAGA version 2.1.2 contains a Random Forest Classification tool that uses ViGrA. Note: the older SAGA version 2.0.8 does not contain the Random Forest Classification (ViGrA) tool.
Data
Two types of data are needed to perform a supervised Random Forest classification: variables and a training data set.
-The variables used for a classification are raster images. These include spectral image bands (such as Red, Green, Blue and Near Infrared), temperature, and Digital Elevation Models (DEMs) and their derivatives (such as slope and aspect). Other variables, such as vector and table data, can be used, but these options are not explored in this tutorial.
-Training data must be supplied to the tool as a polygon layer. Training data is usually represented as points or polygons, each of which refers to a specific class. A class is a unique feature or trait that is represented in a classification; classes are created to illustrate locations and patterns. The training data has to be created manually, which is often time-consuming, as the class of each point or polygon has to be determined visually. The larger the training data set, the better the spatial representation of the area and the greater the variability in random sample selection, both of which increase the classification accuracy. If your training data is in point format, it can easily be buffered to create small polygons that can be used.
Tutorial for Random Forest Classification (ViGrA)
Uploading information into SAGA
To load files into SAGA, go to File->Open. In the load screen that appears, click on the bottom right corner of the screen and select the all files option, then select your files and click Open.
To visualize your data, select the data you would like to display and click on Add to Map.
Creating Training Data (Points) for each class
To create training data, the overall data set should contain a minimum of 100 random points. We will create three point files, one for each of the three classes (water, forest, and urban). The water class is demonstrated below. Note: the forest and urban classes are created in the same fashion.
To create a point file, we start by creating a new layer file. Select Geoprocessing->Shapes->add new shape file.
Name the layer after your class, and make sure the shape type is set to points.
A point layer file has now been created, and we must add points to it. First, right click on the created layer and select Add to Map to ensure that the layer is active.
A small window will appear, asking which map to add the shape file to. As I am creating a class for water, I select the image band that best represents this class; in this case the NIR band represents water well, as black.
To add a point, activate the Add Shape feature under Edit->Add Shape.
Now that Add Shape has been activated, we can create a point by selecting the Action tool in the toolbar.
Then click on the location where you wish to create the point. A small square with a circle inside will appear, marking the location that you chose. To finalize the point, either press Enter or double left click, and then click on the only option offered, to edit selected shapes.
Repeat the Add Shape process for the water layer to keep adding points. Add enough points to properly sample the area. I would recommend a minimum of 30 points per class.
You should now have a training data set of points that resembles the following image.
Once your training data has been created for a class, repeat the same process to create a new layer file for each class you will be using in your classification.
Buffering and Merging layers
The points need to be converted into polygons, which is achieved by buffering them. Each point file has to be buffered individually before the files are merged. To buffer a point file, select Shapes->Tasks->Shapes buffer.
In the menu that pops up, set the buffer distance field to 'not set' and then enter an appropriate buffer size. I chose 10 for my classification. The more detailed your training data is, the smaller the buffer size you can select; if the features are very similar, you can choose a larger buffer size.
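To make clear what buffering does to a training point, here is a minimal sketch in plain Python that turns a point into a small circular polygon. It is an illustration only: the function names, the number of segments and the coordinates are assumptions, the buffer size is assumed to be in the map units of the layer, and SAGA's buffer tool may construct its polygons differently.

```python
import math

def buffer_point(x, y, distance, segments=16):
    """Approximate a circular buffer around (x, y) as a closed polygon ring."""
    ring = [
        (x + distance * math.cos(2 * math.pi * i / segments),
         y + distance * math.sin(2 * math.pi * i / segments))
        for i in range(segments)
    ]
    ring.append(ring[0])  # repeat the first vertex to close the ring
    return ring

def ring_area(ring):
    """Polygon area from the shoelace formula."""
    total = sum(x1 * y2 - x2 * y1
                for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(total) / 2

# Hypothetical water point buffered with a size of 10
polygon = buffer_point(100.0, 200.0, 10.0)
print(len(polygon), round(ring_area(polygon), 1))
```

The area grows with the square of the buffer size, so doubling the size roughly quadruples the number of pixels each training polygon covers. This is why a smaller size suits finely detailed classes.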
Once you have created the three buffer files, you can merge these polygons to create a single polygon data set for your classification training data. To merge layers, select Shapes->Construction->Merge Layers.
Then select the layers you wish to merge (making sure they are the buffer layers).
You will now have your overall polygon training data set to be used in your classification. NOTE: you should also create a smaller cross-validation data set following the same steps, using only 25-30% of your total points.
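One way to read the 25-30% recommendation is to set aside a random share of the points from each class for validation. The sketch below shows that idea in plain Python. It is an assumption about how the split could be done, not a SAGA feature; the function name, the default fraction of 0.3 and the point identifiers are hypothetical.

```python
import random

def split_points(points_by_class, validation_fraction=0.3, seed=42):
    """Randomly hold back a share of each class's points for validation."""
    rng = random.Random(seed)
    training, validation = {}, {}
    for label, points in points_by_class.items():
        shuffled = list(points)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * validation_fraction)
        validation[label] = shuffled[:cut]
        training[label] = shuffled[cut:]
    return training, validation

# Hypothetical point IDs, 30 per class as recommended in this tutorial
points = {
    "water": list(range(0, 30)),
    "forest": list(range(30, 60)),
    "urban": list(range(60, 90)),
}
training, validation = split_points(points)
for label in points:
    print(label, len(training[label]), len(validation[label]))
```

Splitting per class keeps every class represented in the validation set, which a single random draw over all points does not guarantee.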
Running a random forest classification
We will use the Random Forest Classification (ViGrA) tool to perform a supervised classification. To open the tool, select Imagery->Classification->Random Forest (ViGrA).
Then load all the image bands you wish to use by selecting the Grid system and then the Features.
Select the number of trees you wish to create and the other settings. Note: running 1000 trees took approximately 3 minutes on a higher-end PC (Intel i7-4790K 4.00 GHz quad-core processor). Refrain from running more trees than this unless you can let it run overnight. The settings I used are listed below; there is little documentation on some of them.
Data Objects
Grid
Grid system -> Selects which grid system to load your variables from
>> Features -> Select which features (image bands) you wish to use as variables
<< Random Forest classification -> Select create to create a new grid, or overwrite an existing one
< Prediction Probability -> Select create to create a new grid for each class specifically
Shapes
>> Training Areas -> Select the merged buffer training data that you will be using
Label Field -> Sets which attribute header will be used for the classification (recommended: not set)
Use Label as Identifier -> Lets you decide whether the header of the file is used as the class (leave unchecked if unsure)
Minimum Redundancy Feature Selection -> Unsure of its purpose for the classification. More information can be found under mRMR.
Options
Feature Probabilities -> Unsure of its purpose
Import from file -> If a previous Random Forest classification has been performed, the tree and settings can be loaded here
-Options
Export to File -> Exports your decision trees and settings to a file that can be loaded in a later classification
Tree count -> How many trees to create (try to remain below 1000, as more can slow the classification)
Samples per Tree -> Specifies the fraction of the total number of samples to be used per tree for learning
Sample with replacement -> Checking this box samples the training data with replacement, which improves randomness
Minimum Node Split Size -> The number of samples required to make a split; use 1 to use all samples and grow the trees completely
Features to Node -> Unsure; left at the default, square root
Stratification -> Unsure; left at the default, none
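The sampling options above can be made more concrete with a small sketch in plain Python. It shows the usual meaning of a square-root rule for the number of features tried at each split, and of a per-tree sample drawn as a fraction of the training samples, with or without replacement. The function names and the rounding are assumptions for illustration; how the tool itself rounds or draws samples is not documented here.

```python
import math
import random

def features_per_node(feature_count):
    """Square-root rule: how many features are tried at each split."""
    return max(1, round(math.sqrt(feature_count)))

def draw_tree_sample(sample_ids, fraction, with_replacement, rng):
    """Pick the training samples that one tree will learn from."""
    size = max(1, round(len(sample_ids) * fraction))
    if with_replacement:
        # The same sample may be drawn more than once.
        return [rng.choice(sample_ids) for _ in range(size)]
    # Every drawn sample is distinct.
    return rng.sample(sample_ids, size)

rng = random.Random(1)
sample_ids = list(range(100))   # hypothetical training samples
print(features_per_node(4))     # four image bands: Blue, Green, Red, NIR
print(len(draw_tree_sample(sample_ids, 0.5, True, rng)))
```

With four image bands, the square-root rule means each split considers only two of them, which is one of the ways the individual trees are kept different from one another.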
After the classification has run, the final image is produced. It may be visually displeasing, as the class colours are assigned at random.
Validation
There are many ways to validate a classification. One method for validating a Random Forest classification is to use a small sample of your training data (30%), perform a Random Forest classification on this data following the same steps described in this tutorial, and compare the result to your actual Random Forest classification. Other validation methods are not covered here, as they are not included in the tool and would need to be broken into multiple tutorials.
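As a rough illustration of what a comparison can look like once the class labels have been extracted at the validation locations, the sketch below counts agreements and disagreements between two lists of labels in plain Python. It is not part of the SAGA tool; the function names and the label lists are invented, and extracting the labels from the grids is assumed to have been done separately.

```python
from collections import Counter

def confusion_counts(reference, predicted):
    """Count how often each (reference class, predicted class) pair occurs."""
    return Counter(zip(reference, predicted))

def overall_accuracy(reference, predicted):
    """Share of locations where both label lists agree."""
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)

# Hypothetical labels at six validation locations
reference = ["water", "water", "forest", "forest", "urban", "urban"]
predicted = ["water", "water", "forest", "urban", "urban", "urban"]

print(round(overall_accuracy(reference, predicted), 3))
for (ref, pred), count in sorted(confusion_counts(reference, predicted).items()):
    print(ref, "->", pred, count)
```

The pair counts show which classes are being confused with each other, which a single accuracy figure hides.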
How to edit class colours
How to change the colours in a map for better visual representation.
Now that you have successfully performed a Random Forest Classification, you may find the colours of the output map displeasing.
To change the properties of a map, we must ensure that the properties tab is active. To activate it, click on the window tab and select show properties.
Then choose the lookup table option.
From here, change the colours to your desired scheme.
You will then have your final product.
Conclusion
Overall, the tool works well and can successfully produce a classification. Although it may be limited in the speed of its execution, the user does not need any knowledge of scripting to run a classification. Given its lack of documentation and explanation, before this tutorial was created only users who had performed a Random Forest classification (RFC) before were likely to understand how to use the tool. However, the largest drawback of the tool is that it does not expose the decision trees used to make the classification. These trees provide a visual representation of which variables are important and which are not, which is one of the greatest strengths of performing an RFC. Therefore the tool can run a successful RFC, but it loses part of the strength of RF.
References
SAGA software - http://www.saga-gis.org/en/index.html
ViGrA website - http://ukoethe.github.io/vigra/
Variables: IKONOS satellite images in the Blue, Green, Red and Near Infrared bands of southern Gatineau Park, provided by Murray Richardson