Why does adding a redundant predictor to randomForest improve prediction?














I write in hopes of understanding an odd behavior of the randomForest package. I am trying to predict a factor y with 9 levels using 8 binary factors X1-X8. I get good accuracy (0.8959), and the following confusion matrix:



              y (actual)
predicted   A   B   C   D   E   F   G   H   I
        A  75   0   0   0   0   0   0   0   0
        B   0 121   0   5   0   0   0   0   0
        C   0   0 156   1   0   0   0   1   0
        D   0   0   0   0   0   0   0   0   0
        E   1   6   0  73 172   3   0   1   1
        F   0   0   0   0   0  90   0   0   0
        G   0   0   0   1   0   0  31   0   0
        H   0   0   0   1   0   0   0  84   0
        I   0   0   0   3   0   0   0   0 106


Notice that RF makes no predictions for row D of the confusion matrix. Now I perform the following experiment: I make a copy of the first column of the predictor matrix, call it "junk", and append it to the predictor matrix. Now randomForest gives improved accuracy (0.9657) and the following confusion matrix:



              y (actual)
predicted   A   B   C   D   E   F   G   H   I
        A  73   0   0   0   0   0   0   0   0
        B   2 119   0   5   0   0   0   0   0
        C   0   2 156   1   0   0   0   1   0
        D   1   6   0  73   4   3   0   1   1
        E   0   0   0   0 168   0   0   0   0
        F   0   0   0   0   0  90   0   0   0
        G   0   0   0   1   0   0  31   0   0
        H   0   0   0   1   0   0   0  84   0
        I   0   0   0   3   0   0   0   0 106


Note that randomForest now makes good predictions for row D of the confusion matrix.



In summary, appending a redundant copy of one of the predictor variables to the predictor matrix improves accuracy of randomForest. Further, it doesn't make much difference which predictor you append. They all give roughly the same accuracy and roughly the same confusion matrix.



I append code and data below. Can someone explain what is happening?



Code:



### Save the data file below as compressed.csv and adjust the path
setwd("C:/tmp")   # use forward slashes (or "C:\\tmp") on Windows

rm(list = ls())
library(randomForest)

# read the compressed data and expand each row according to the
# number of times it was observed (NUM column)
compressed <- read.csv("compressed.csv")
num <- compressed$NUM
newnum <- rep(1:length(num), num)
dat <- compressed[newnum, 2:10]

y <- dat$y          # 9-level factor to predict
x <- dat[, 2:9]     # 8 binary factor predictors X1-X8

# original data produces bad results
# for row D of the confusion matrix
set.seed(323)
badrf <- randomForest(y = y, x = x)
badpred <- predict(badrf, newdata = x)
badtable <- table(badpred, y)
badtable
badaccuracy <- sum(diag(badtable)) / sum(badtable)
badaccuracy

# duplicate, say, column 1 of the x matrix
ndx <- 1
junk <- x[, ndx]
newx <- cbind(x, junk)

# re-analysis with the superfluous new variable
# gives good results
set.seed(323)
goodrf <- randomForest(y = y, x = newx)
goodpred <- predict(goodrf, newdata = newx)
goodtable <- table(goodpred, y)
goodtable
goodaccuracy <- sum(diag(goodtable)) / sum(goodtable)
goodaccuracy
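
To check the claim above that it does not much matter which predictor is duplicated, a small loop such as the following can be used (a sketch building on the objects defined in the code above; it simply repeats the experiment once per column):

# duplicate each predictor in turn and record the resulting
# resubstitution accuracy
accs <- sapply(1:ncol(x), function(ndx) {
  newx <- cbind(x, junk = x[, ndx])
  set.seed(323)
  rf <- randomForest(y = y, x = newx)
  pred <- predict(rf, newdata = newx)
  tab <- table(pred, y)
  sum(diag(tab)) / sum(tab)
})
names(accs) <- colnames(x)
accs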


Data:



"NUM","y","X1","X2","X3","X4","X5","X6","X7","X8"
1,"A","NO","NO","NO","NO","NO","NO","NO","NO"
69,"A","NO","NO","YES","NO","NO","NO","NO","NO"
2,"A","NO","NO","YES","NO","NO","NO","NO","YES"
4,"A","NO","YES","YES","NO","NO","NO","NO","NO"
6,"B","NO","NO","NO","NO","NO","NO","NO","NO"
119,"B","NO","NO","NO","NO","NO","NO","NO","YES"
2,"B","YES","NO","NO","NO","NO","NO","NO","YES"
155,"C","YES","NO","NO","NO","NO","NO","NO","NO"
1,"C","YES","YES","NO","NO","NO","NO","NO","NO"
73,"D","NO","NO","NO","NO","NO","NO","NO","NO"
5,"D","NO","NO","NO","NO","NO","NO","NO","YES"
1,"D","NO","NO","NO","NO","NO","NO","YES","NO"
1,"D","NO","NO","NO","NO","YES","NO","NO","NO"
3,"D","NO","YES","NO","NO","NO","NO","NO","NO"
1,"D","YES","NO","NO","NO","NO","NO","NO","NO"
4,"E","NO","NO","NO","NO","NO","NO","NO","NO"
158,"E","NO","NO","NO","NO","NO","YES","NO","NO"
10,"E","YES","NO","NO","NO","NO","YES","NO","NO"
3,"F","NO","NO","NO","NO","NO","NO","NO","NO"
90,"F","NO","NO","NO","YES","NO","NO","NO","NO"
31,"G","NO","NO","NO","NO","NO","NO","YES","NO"
1,"H","NO","NO","NO","NO","NO","NO","NO","NO"
83,"H","NO","NO","NO","NO","YES","NO","NO","NO"
1,"H","NO","YES","NO","NO","YES","NO","NO","NO"
1,"H","YES","NO","NO","NO","YES","NO","NO","NO"
1,"I","NO","NO","NO","NO","NO","NO","NO","NO"
102,"I","NO","YES","NO","NO","NO","NO","NO","NO"
3,"I","NO","YES","NO","NO","NO","NO","NO","YES"
1,"I","NO","YES","NO","NO","NO","NO","YES","NO"









r machine-learning random-forest prediction

asked Nov 15 '18 at 10:58 by Neal Oden
migrated from stackoverflow.com Nov 24 '18 at 14:10


















  • Tune the rf model and it will output similar predictions for the two cases. – missuse, Nov 15 '18 at 13:03
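
For reference, a minimal sketch of such tuning using the randomForest package's tuneRF helper (assuming the x and y objects from the question; the ntreeTry and improve values below are only illustrative choices):

# search over mtry by doubling it until the OOB error stops improving,
# then refit a forest with the best mtry found (doBest = TRUE)
set.seed(323)
tuned <- tuneRF(x, y, ntreeTry = 500, stepFactor = 2, improve = 0.01, doBest = TRUE)
tunedpred <- predict(tuned, newdata = x)
table(tunedpred, y)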
















1 Answer







For each node in a tree, the random forest algorithm does not pick the best dimension to split on; it picks the best dimension among a (small) random sample of them, in order to keep the individual trees diverse.

Here, the column you duplicate must be an important one: by adding a copy of it, you make it appear more often in the sample of dimensions among which the forest chooses, so it carries more weight in the prediction, and you do this without reducing the variability of your trees by much.

This behavior could explain your findings.

PS: Indeed, ideally this should be moved to stats.stackexchange.
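
If this is the mechanism at work, much of the improvement may be reproducible without the duplicated column simply by letting each split consider more candidate variables. A sketch (assuming the y and x objects from the question): randomForest's default mtry for classification is floor(sqrt(p)), i.e. 2 with the 8 original predictors, and appending the copy both raises that to 3 and doubles the duplicated variable's chance of being sampled at a split.

# same data, no duplicated column, but each split now samples 3 candidate
# variables instead of the default floor(sqrt(8)) = 2
set.seed(323)
rf_mtry3 <- randomForest(y = y, x = x, mtry = 3)
mtry3pred <- predict(rf_mtry3, newdata = x)
mtry3table <- table(mtry3pred, y)
mtry3table
sum(diag(mtry3table)) / sum(mtry3table)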



















answered Nov 15 '18 at 12:17 by PopPop





























