NOTES ON SETTING UP AND USING RANDOM FORESTS

The public domain version of random forests has been stripped down from my experimental version, but contains the main features. I apologize in advance for all bugs and would like to hear about them. To find out how this program works, read my paper "Random Forests-Random Features". It's available as a technical report if you go to my department home page (www.stat.berkeley.edu) and click on technical reports. It will be published soon in Machine Learning.

The program is written in extended Fortran 77, making use of a number of VAX extensions. It runs under f77 on SUN workstations and under Absoft Fortran 77 (available for Windows), but may have hang-ups on other f77 compilers.

Random forests

  • does classification
  • computes variable importance (in two ways)
  • computes proximity measures between cases
  • computes densities
  • gives a measure of outlyingness for each case

    The last two are done for the unsupervised case, i.e. when there are no class labels. The density estimation is experimental and has not been thoroughly tested, so treat it with caution. I have used proximities to cluster data and they seem to do a reasonable job.

    I. SETTING PARAMETERS

    The first five lines following the parameter statement need to be filled in by the user.

    LINE 1 DESCRIBING THE DATA

    mdim = number of variables
    nsample = number of cases (examples or instances)
    nclass = number of classes
    maxcat = the largest number of values assumed by a categorical variable in the data
    ntest = the number of cases in the test set. Put ntest=0 if there is no test set.
    ltest = 0 if there is no test set, 1 if the test set has no class labels, 2 if the test set has class labels.

    If the data is unsupervised, then put nsample=twice the number of cases in the data. If iden=1 or if noutlier=1, the code will automatically add an equal number of cases. Put nclass=2. The original data is labeled class #1, and the added data is labeled class #2. The added data is obtained by sampling independently from the marginals of the original data.
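The construction of the synthetic class #2 can be sketched in a few lines of Python. This is an illustration of the idea only, not the program's Fortran: the function name add_synthetic_class is hypothetical, and drawing each coordinate with replacement from the observed values of that variable is an assumption about how "sampling from the marginals" is carried out.

```python
import random

def add_synthetic_class(data, seed=0):
    """Return (cases, labels): the original cases labeled 1, followed by
    an equal number of synthetic cases labeled 2."""
    rng = random.Random(seed)
    n = len(data)
    mdim = len(data[0])
    # Each coordinate of a synthetic case is drawn from the observed values
    # of that variable, independently of the other coordinates. This keeps
    # the marginals but destroys the dependence between variables.
    synthetic = [
        [data[rng.randrange(n)][m] for m in range(mdim)]
        for _ in range(n)
    ]
    cases = [list(row) for row in data] + synthetic
    labels = [1] * n + [2] * n
    return cases, labels

original = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
cases, labels = add_synthetic_class(original)
# len(cases) is twice the number of original cases, matching the nsample rule above.
```

If the forest cannot tell class #1 from class #2, the variables in the original data are close to independent and there is little structure to find.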

    If there are no categorical variables in the data, set maxcat=1. If there are categorical variables, the number of categories assumed by each categorical variable has to be specified in an integer vector called cat, i.e. setting cat(5)=7 implies that the 5th variable is categorical with 7 values. If maxcat=1, the values of cat are automatically set equal to one. If not, the user must fill in the values of cat in the early lines of code.

    For a J-class problem, random forests expects the classes to be numbered 1,2, ...,J. For an L-valued categorical, it expects the values to be numbered 1,2, ... ,L.
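Data whose classes or categories are coded some other way has to be renumbered before it is read in. A minimal Python sketch of such a preprocessing step follows; the helper name renumber and the choice of sorted order are illustrative assumptions, not part of the program.

```python
def renumber(values):
    """Map arbitrary labels to 1..J in sorted order.
    Returns the recoded list and the mapping that was used."""
    mapping = {v: k for k, v in enumerate(sorted(set(values)), start=1)}
    return [mapping[v] for v in values], mapping

codes, mapping = renumber(["setosa", "virginica", "setosa", "versicolor"])
# codes   == [1, 3, 1, 2]
# mapping == {"setosa": 1, "versicolor": 2, "virginica": 3}
```

Keep the mapping so that predicted classes can be translated back to the original labels, and apply the same mapping to the test set.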

    A test set can have two purposes--first: to check the accuracy of RF on a test set. The error rate given by the internal estimate will be very close to the test set error unless the test set is drawn from a different distribution. Second: to get predicted classes for a set of data with unknown class labels. In both cases the test set must have the same format as the training set. If there is no class label for the test set, assign each case in the test set to class #1, i.e. put cl(n)=1, and set labelts=0. Else set labelts=1.

    LINE 2 SETTING UP THE RUN

    mtry = number of variables randomly selected at each node
    jbt = number of trees to grow
    look = how often you want to check the prediction error
    modvar = 0 (run with all variables)
    modvar = 1 (run with selected variables)

    mtry:

    this is the only parameter that requires some judgment to set, but forests isn't too sensitive to its value as long as it's in the right ballpark. I have found that setting mtry equal to the square root of mdim gives generally near optimum results. My advice is to begin with this value and try a value twice as high and one half as high, monitoring the results by setting look=1 and checking the test set error for a small number of trees. With many noise variables present, mtry has to be set higher.
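The starting values suggested above can be worked out as follows. This Python sketch is only arithmetic on mdim; the function name mtry_candidates is hypothetical, and rounding the square root to the nearest integer and clipping to the range 1..mdim are assumptions.

```python
import math

def mtry_candidates(mdim):
    """Return the values of mtry worth trying first: about half the
    square root of mdim, the square root itself, and twice that."""
    base = max(1, round(math.sqrt(mdim)))
    half = max(1, base // 2)
    double = min(mdim, 2 * base)
    # A set removes duplicates when mdim is very small.
    return sorted({half, base, double})

candidates = mtry_candidates(100)   # [5, 10, 20]
```

Each candidate would then be tried in a short run with look=1, keeping the value with the lowest error.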

    jbt:

    this is the number of trees to be grown in the run. Don't be stingy--random forests produces trees very rapidly. If you want auxiliary information like variable importance or proximities, grow a lot of trees--say 1000 or more. Sometimes I run out to 5000 trees if there are many variables and I want the variable importances to be stable.

    look:

    random forests carries along an internal estimate of the test set error as the trees are being grown. This estimate is printed to the screen every look trees. Setting look=10, for example, gives the current internal estimate every tenth tree added. Setting look=jbt+1 eliminates the output. The final internal estimate will always appear on the screen. If lbtest=1, the test set error will also appear on screen. Do not be dismayed to see the error rates fluttering around slightly as more trees are added. Their behavior is analogous to the sequence of averages of the number of heads in tossing a fair coin.
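The coin-tossing analogy is easy to reproduce. The Python sketch below simulates a fair coin and records the running proportion of heads; it says nothing about the program's internals, and the function name running_averages is hypothetical.

```python
import random

def running_averages(n_tosses, seed=0):
    """Proportion of heads after each of n_tosses tosses of a fair coin."""
    rng = random.Random(seed)
    heads = 0
    averages = []
    for t in range(1, n_tosses + 1):
        heads += rng.randint(0, 1)      # 1 = heads, 0 = tails
        averages.append(heads / t)
    return averages

averages = running_averages(10000)
# The t-th average can differ from the previous one by at most 1/t, so the
# sequence keeps fluttering, but by smaller and smaller amounts.
```

The error rate printed every look trees behaves in much the same way: it wobbles early in the run and settles as trees accumulate.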

    modvar:

    If modvar=1, then the user must specify which variables to use in the run. This is done near the beginning of the program where the user specifies the values of the binary vector incl. If incl(m)=0 the mth variable is not included in the run. If incl(m)=1 the variable is included. modvar=1 sets all values of incl to 0, so the user need only set incl(m)=1 for the variables to be included.

    LINE 3 OPTIONS

    imp = 1 turns on the variable importances method described in my tech report.
    igini = 1 prints out variable importances defined as the sum of the decreases in the gini criterion on all trees due to a given variable. Its results are generally (but not always--see remarks) consistent with the first definition. It is more sensitive than imp when there are many variables. The gini computation is always on, whether or not imp is on.
    iprox = 1 turns on the computation of the intrinsic proximity measures between any two cases.
    jiden = 1 computes the density of labelless data with respect to the product of their marginals.
    noutlier = 1 computes an outlyingness measure for labelless data. If this is on, then iprox must also be switched to one.

    NOTE: to get output from these options, output file names must be specified in the output section (the lower part of the main program).

    LINE 4

    ipi: pi is a real-valued vector of length nclass which sets prior probabilities for classes. ipi=1 sets these priors equal to the class proportions. If the class proportions are very unbalanced, you may want to put larger priors on the smaller classes. This can be done in subroutine prep.
    icost: misscost is a real-valued matrix that is nclass by nclass. misscost(i,j) is the cost for classifying a class i case as class j. misscost is then symmetrized in the program. icost=1 sets misscost(i,j)=1 if i is not equal to j and 0 if i equals j. Other values for this matrix can be entered in subroutine prep.
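The two defaults described above (ipi=1 and icost=1) amount to the following. This is a Python illustration of the definitions, not the code in subroutine prep; the function names default_priors and default_misscost are hypothetical, and the symmetrization step is left out because these notes do not say how it is done.

```python
from collections import Counter

def default_priors(cl, nclass):
    """Priors equal to the class proportions; cl holds labels 1..nclass."""
    counts = Counter(cl)
    n = len(cl)
    return [counts.get(j, 0) / n for j in range(1, nclass + 1)]

def default_misscost(nclass):
    """Cost 1 for every misclassification, 0 for a correct classification."""
    return [[0.0 if i == j else 1.0 for j in range(nclass)]
            for i in range(nclass)]

pi = default_priors([1, 1, 1, 2], nclass=2)   # [0.75, 0.25]
misscost = default_misscost(3)
```

With unbalanced classes, as in this example, raising the prior of class 2 above 0.25 makes the forest less inclined to ignore the smaller class.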

    LINE 5 OUTPUT CONTROLS

    Note: user must supply file names for all output listed below or send it to the screen

    infow = 1 prints the following columns to a file

    impw = 1 prints the following columns to a file

    iginiw = 1 prints the following columns to a file

    iproxw = 1 prints to file

    iden = 1 prints the following columns to a file

    noutlierw = 1 prints the following columns to a file

    ntestw = 1 prints the following columns to a file

    LINE 6 Pay no attention.

    OTHER USER WORK:

    The user has to write the code that reads in the data; I have left an example. This needs to be done after the dimensioning of arrays. If maxcat>1, then the values of cat need to be filled in. If modvar=1, then the user has to specify which variables to include.

    REMARKS:

    There is some default output to the screen. It gives the final internal error rate, and, if applicable, the test set error rate. It also gives the confusion matrix--based on the internal test sets.

    The proximities can be used in the clustering program of your choice. Their advantage is that they are intrinsic rather than an ad hoc measure. I have used them in some standard and home-brew clustering programs and gotten reasonable results. The proximities between class 1 cases in the unsupervised situation can be used to cluster.
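Most clustering programs want distances rather than proximities. The Python sketch below shows one way to make the conversion. It assumes that the proximity of two cases is the fraction of trees in which they land in the same terminal node, a definition these notes do not spell out, so treat it as an assumption. The input leaf_ids and both function names are hypothetical.

```python
def proximities(leaf_ids):
    """leaf_ids[t][n] is the terminal node reached by case n in tree t.
    Returns the matrix of fractions of trees in which two cases share a node."""
    ntrees = len(leaf_ids)
    n = len(leaf_ids[0])
    counts = [[0] * n for _ in range(n)]
    for leaves in leaf_ids:
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    counts[i][j] += 1
    return [[c / ntrees for c in row] for row in counts]

def to_distances(prox):
    """Turn proximities in [0, 1] into dissimilarities for a clustering program."""
    return [[1.0 - p for p in row] for row in prox]

# Four trees, four cases: cases 0 and 1 usually share a node, as do 2 and 3.
leaf_ids = [
    [1, 1, 2, 2],
    [1, 1, 1, 2],
    [3, 3, 4, 4],
    [5, 6, 6, 6],
]
prox = proximities(leaf_ids)
dist = to_distances(prox)
```

The resulting matrix is symmetric with zeros on the diagonal, so it can be handed to any program that accepts a dissimilarity matrix.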

    I have not played with the density estimate p(1| x)/p(2| x) very much, so be careful in interpreting it. If RF has a high error rate in discriminating between class #1 and the synthetic class #2, the density estimation is not reliable. Similarly the measure of outlyingness has not been tested. If users try the density estimation and outlyingness measure, I would appreciate comments.

    Two measures of variable importance: the two measures of variable importance are defined differently. The imp measure is easier to understand (see my paper) and is based only on the test sets left out on each tree construction. The gini is intrinsic to the tree construction using the training set. On microarray data with 5000 variables and fewer than 100 cases, the gini is a more sensitive measure.

    When I have run them both on a lower dimensional problem, they occasionally single out slightly different subsets of variables as being important. But when I use modvar to enter each set, the error rates are about equal to each other and to the error rate from running with all variables.
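The gini measure, the sum over all trees of the decreases in the gini criterion due to a variable, can be sketched in Python as follows. This illustrates the definition and is not the program's code: the function names are hypothetical, and weighting each child node by its share of the parent's cases is an assumption.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Drop in impurity when parent is split into left and right,
    each child weighted by its share of the parent's cases (assumed)."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

def gini_importance(splits, mdim):
    """splits holds one (variable, parent, left, right) tuple per split over
    all trees, with variables numbered 1..mdim."""
    total = [0.0] * mdim
    for m, parent, left, right in splits:
        total[m - 1] += gini_decrease(parent, left, right)
    return total

splits = [
    (1, [1, 1, 2, 2], [1, 1], [2, 2]),   # variable 1 separates the classes
    (2, [1, 1, 2, 2], [1, 2], [1, 2]),   # variable 2 separates nothing
]
importance = gini_importance(splits, mdim=2)   # [0.5, 0.0]
```

Because every split in every tree contributes, this measure draws only on the training cases used to grow each tree, unlike the imp measure, which uses the cases left out.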