NAME
MachineLearning::DecisionTrees - create, test, print out, store, and employ decision trees
SYNOPSIS
use MachineLearning::DecisionTrees;
my $grove = MachineLearning::DecisionTrees->new({
"InputFieldNames" => ["Input1", "Input2", "Input3"],
"OutputFieldNames" => ["Output1", "Output2"],
"InputDataTypes" => {
"Input1" => ["male", "female"],
"Input2" => "number",
"Input3" => "special_number"},
"TrainingData" => $training_data_file_path});
my $all_is_well = $grove->get_success();
my $message = $grove->get_message();
my $saved_ok = $grove->save($file_path);
my $grove = MachineLearning::DecisionTrees->open($file_path);
my $tree_printout = $grove->print_out($tree_name);
my $tree_test_results = $grove->test({
"TreeName" => $tree_name,
"ValidationData" => $validation_data_file_path,
"NodeList" => \@selected_node_numbers});
my $completed_ok = $grove->employ({
"TreeName" => $tree_name,
"TargetData" => $writable_data_file_path,
"NodeList" => \@selected_node_numbers});
DESCRIPTION
This module defines a package for creating a MachineLearning::DecisionTrees object.
Decision trees as implemented by the MachineLearning::DecisionTrees module favor quality over quantity. That is, they are optimized to find the highest quality predictions for a single output value without undue regard to how many data records (or instances) get screened out in the process. This approach is highly useful for many applications. For example, you might want to narrow down a list of applicants for a grant while trying to predict which recipients will make exceptional use of the grant; or, out of many potential financial investments, you might want to identify a small group that have unusually good prospects for success.
The output values supported by the MachineLearning::DecisionTrees module are 1 and 0, and the accuracy of the results is important only for the 1 values. When choosing the attribute on which to split a node, the criterion is maximization of the highest resulting ones percentage. (In the case of a tie, the highest quantity of records breaks the tie.) NOTE: The selection of the first three attributes (resulting in the children, grandchildren, and great-grandchildren of the root node) is performed organically. That is, the best out of all possible permutations of attributes for the first three slots "wins". Subsequent attributes are chosen individually.
Decision trees as implemented by the MachineLearning::DecisionTrees module come in groves. A grove is one or more decision trees, all built from the same data file. The data file used to build (or train) a grove contains one or more output fields, and each output field corresponds to one tree. When printed out, a decision tree identifies itself using the name of the output field for which it was built.
A decision tree is made up of nodes, starting with a root node (which typically appears at the top in graphical representations, highlighting the fact that trees in nature are upside down :-) and branching from there via child nodes. During the building, testing, or employment of a decision tree, each parent node divides data records associated with it among its children according to each record's value for a particular input field (or attribute). A node with no children is a leaf node and terminates a particular path through the tree.
A path through a decision tree always starts at the root node and ends when one of the following conditions occurs during tree creation, which comprises training and pruning phases:
During training, there are no further attributes available with which to generate children from the current node.
There is only one record associated with the current node. NOTE: The training algorithm never creates a child node that does not have any records associated with it or that does not produce any "1" results.
The pruning process has determined that application of the remaining attributes to the current branch does not improve its performance.
Pruning
After a tree has been built, it gets pruned from the bottom up. The pruning algorithm removes any leaf node if the records associated with one of the node's ancestors (other than the root node) have an equal or higher percentage of ones for the output field on which the tree is based than do the records associated with the leaf node. Any nodes that become leaf nodes during the pruning process are themselves subject to pruning.
Once a tree has been pruned, the remaining leaf nodes are sorted by quality (defined as the magnitude of the ones percentage for the records associated with the leaf node). In a printout of the tree, the leaf nodes appear in order by quality (highest to lowest) together with the exact ones percentage for each leaf node based on the training data. For each leaf node, the printout also gives the number of associated training records and the path through the tree.
Input fields and output fields
The data used to build, test, or employ decision trees must include at least two input fields, which are used as attributes by nodes during spawning.
The data must also include one or more output fields in which each value is either a 1 or a 0. (The output values can be blank when the data is passed to the employ() method.) For example, there might be a field for whether you enjoy certain TV shows that have been on in the past and another field for whether your spouse enjoys them. You could create a grove of two decision trees from this data: one tree to select shows that you would most likely enjoy out of the new television offerings starting up in the fall, and another tree to select shows that your spouse would most likely enjoy. (With luck, there will be some that you both enjoy!)
Input data types
Each input field uses one of three data types: enumeration, number, or special_number.
An enumeration type comprises a list of two or more possible unique text values (for example, male and female for a Gender field).
A number type comprises decimal numbers. The tree-building algorithm divides the values in a number field into the following categories after statistically analyzing all the values for that field in the data set. (In the Criteria column below, m is the mean and d is the population standard deviation.)
Category Criteria
-------- --------
abnormally_low Less than m - 2d
low Greater than or equal to m - 2d, and less than
m - d
medium Greater than or equal to m - d, and less than
or equal to m + d (that is, within one standard
deviation of the mean)
high Greater than m + d, and less than or equal to
m + 2d
abnormally_high Greater than m + 2d
A special_number type is a decimal number for which the sign and the absolute value have separate significance. For example, the sign might indicate direction (up or down) while the absolute value indicates momentum, or the sign might indicate approval versus disapproval while the absolute value indicates the intensity of the response. The categories for this type are abnormally_strong_negative, strong_negative, medium_negative, weak_negative, abnormally_weak_negative, abnormally_weak_positive_or_zero, weak_positive, medium_positive, strong_positive, and abnormally_strong_positive.
The categories into which the algorithm places numeric inputs act as enumerated values.
PREREQUISITES
To use this module, you must have the MachineLearning module installed.
METHODS
- MachineLearning::DecisionTrees->new($args);
-
This is the constructor.
In addition to the class-name argument, which is passed automatically when you use the
MachineLearning::DecisionTrees->new()syntax, the constructor takes a reference to a hash containing the following keys:InputFieldNames OutputFieldNames InputDataTypes TrainingDataThe values associated with the InputFieldNames and OutputFieldNames keys must each be a reference to an array of field names. All field names (for input and output fields combined) must be unique. Field names must not contain commas or line-break characters. There must be at least two input field names and at least one output field name.
The value associated with the InputDataTypes key must be a reference to a hash in which the keys are input field names (which must match those specified by the InputFieldNames argument) and each value indicates the data type for the corresponding input field. If the value is an array reference, the field has a data type of
enumerationand the array must contain two or more strings, all unique, representing the possible values for the input field. Otherwise, the value indicating the data type must be the stringnumberor the stringspecial_number.The value associated with the TrainingData key must be the path to a file containing CSV-format training data. The first line in the file must contain field names, which must match the field names specified by the InputFieldNames and OutputFieldNames arguments.
The constructor returns a reference to a MachineLearning::DecisionTrees object (a grove), which is implemented internally as a hash. All functionality is exposed through methods.
If the constructor receives a valid hash reference providing all required information, the
get_success()instance method returns true (1) and theget_message()instance method returns an empty string; otherwise,get_success()returns false (0) andget_message()returns a string containing an error message. - $grove->get_success();
-
This returns true (
1) if the grove was initialized successfully; otherwise, it returns false (0). - $grove->get_message();
-
When
get_success()returns true (1), this returns an empty string. Whenget_success()returns false (0),get_message()returns an error message. - $grove->save($file_path);
-
This method saves (serializes) the grove to the specified file.
If the serialization and file-writing operations succeed, this method returns true (
1); otherwise, the method sets the_successfield to false (0), places an error message in the_messagefield, and returns false (0). - MachineLearning::DecisionTrees->open($file_path);
-
This method returns a new MachinLearning::DecisionTrees object (a grove) created by restoring such an object from the specified file. If the file-reading and deserialization operations succeed, the resulting object's
get_success()method returns true (1) and theget_message()method returns an empty string. Otherwise, theopen()method creates and returns a new object that has a false value in the_successfield and an error message in the_messagefield. - $grove->print_out($tree_name);
-
This method prints out a particular tree within the grove. The name that you provide to the method is the name of the output field corresponding to the tree that you want to study.
The printout lists each leaf node in order by quality (highest to lowest). For each leaf node, the printout gives a number with which you can specify the node when you call the
test()method or theemploy()method, the ones percentage for the node after training, the number of records associated with the node after training, and the path through the tree leading to the node.This method returns a string.
- $grove->test($args);
-
This method takes a reference to a hash with the following fields:
TreeName - The name of the tree, which is the same as the name of the output field for which the tree was built. ValidationData - The path to a file containing CSV-format validation data. This must have the same fields and data types as the training data used to create the grove. NodeList - A reference to an array containing the numbers of the nodes that you want to use in the test to indicate "1" results.This method returns a string giving the test results, which comprise the number of records in the test data, the number of records assigned a "1" result during the test, and the accuracy (as a percentage) of those "1" results based on the validation data.
- $grove->employ($args);
-
This method takes a reference to a hash with the following fields:
TreeName - The name of the tree, which is the same as the name of the output field for which the tree was built. TargetData - The path to a file containing CSV-format data. This must have the same fields and data types as the training data used to create the grove. NodeList - A reference to an array containing the numbers of the nodes that you want to use to generate "1" results.This method divides the records from the specified data file among the leaf nodes of the specified tree, assigning a "1" result to any record associated with one of the nodes given by the
NodeListargument. For those records, the method places a1in the output field associated with the tree. For the remaining records, the method places a0in that field.The data file that you provide to this method must contain an output field with the same name as the tree; however, the values in that field can be blank. If an output value is not blank, it must be a
1or a0to conform to the data type defined for output fields. This method populates or overwrites the values in the output field with new results.If everything goes OK, this method returns true (
1); otherwise, the method sets the_successfield to false (0), places an error message in the_messagefield, and returns false (0).
AUTHOR
Benjamin Fitch, <blernflerkl@yahoo.com>
COPYRIGHT AND LICENSE
Copyright 2009 by Benjamin Fitch
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.