Random Forest Classifier accuracy doesn't get higher than 50%













I am very new to machine learning and I am trying to classify this UCI Heart Disease Dataset using sklearn's random forest classifier. My approach is very basic, and I wanted to ask how I could improve my accuracy with the algorithm (some tips, links, etc.). My accuracy tops out at about 50% every time. Here's my code:



import pandas as pd
import numpy as np
import random as random
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_excel('/Users/Mady/Documents/ClevelandData.xlsx')
df.replace('?', -99999, inplace=True)

labels = df.iloc[:,-1]
labels = labels.values

df.drop(df.columns[len(df.columns)-1], axis=1, inplace=True)
riskFactors = df.values

random.seed(123)
random.shuffle(labels)
random.seed(123)
random.shuffle(riskFactors)

labels_train = labels[:(int(len(labels) * 0.8))]
labels_test = labels[(int(len(labels) * 0.8)):]

riskFactors_train = riskFactors[:(int(len(riskFactors) * 0.8))]
riskFactors_test = riskFactors[(int(len(riskFactors) * 0.8)):]

model = RandomForestClassifier(n_estimators = 1000)
model.fit(riskFactors_train,labels_train)
predicted_labels = model.predict(riskFactors_test)
acc = accuracy_score(labels_test,predicted_labels)
print(acc)









  • Explore your data first. Look for patterns that you think your model should be able to estimate. What makes you think your dataset is estimable beyond a 50% accuracy rate?
    – John H
    Nov 12 '18 at 19:12










  • Hi, welcome to StackOverflow. This question may be too broad for this forum. I suggest posting to Code Review or Data Science. datascience.stackexchange.com
    – Evan
    Nov 12 '18 at 19:14










  • There is some fundamental problem with your data or labels. Could you please provide a sample of the data and the labels?
    – Geeocode
    Nov 12 '18 at 19:24






  • I think you messed up when shuffling the labels and riskFactors. For consistency, you should try the train_test_split provided by sklearn.
    – Yilun Zhang
    Nov 12 '18 at 19:38










  • Thank you so much! I am definitely a newbie to this: I got to 80% simply by using train_test_split and removing the random part (there must have been some error there).
    – Kasy Chakra
    Nov 12 '18 at 20:14
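
One likely explanation for the failed shuffle, which the thread does not confirm: random.shuffle swaps elements in place, and each "element" of a 2-D NumPy array is a row view, so a swap can overwrite one row with a copy of another. The 1-D labels array shuffles normally, so features and labels drift apart. The sketch below shows a manual alternative that applies a single index permutation to both arrays. The toy arrays and the names features, labels, and order are illustrative assumptions, not the asker's data.

```python
import numpy as np

# Toy stand-ins for riskFactors and labels (made-up values, not the real data).
features = np.arange(10).reshape(5, 2)   # row i is [2*i, 2*i + 1]
labels = np.arange(5)                    # label i belongs to row i

# Draw ONE permutation of the row indices and apply it to both arrays,
# so every feature row keeps its own label.
rng = np.random.default_rng(123)
order = rng.permutation(len(labels))
features_shuffled = features[order]
labels_shuffled = labels[order]

# Alignment check: each shuffled row still starts with 2 * its label.
print(np.array_equal(features_shuffled[:, 0], 2 * labels_shuffled))
```

Indexing with a permutation returns new arrays instead of swapping rows in place, which sidesteps the view problem altogether.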
















python machine-learning scikit-learn random-forest






asked Nov 12 '18 at 19:08









Kasy Chakra












1 Answer














Solved this by removing the manual random.shuffle step, as there must have been some error there.
As suggested by Yilun Zhang, I used the train_test_split provided by sklearn instead.
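
The fix described in this answer can be sketched roughly as follows. This is a minimal illustration, not the asker's final script: the synthetic array stands in for the Excel file (13 features and a binary label are assumptions), and test_size, random_state, stratify, and n_estimators=100 are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Cleveland data (assumed shape and binary label).
rng = np.random.default_rng(0)
risk_factors = rng.normal(size=(300, 13))
labels = (risk_factors[:, 0] + risk_factors[:, 1] > 0).astype(int)

# train_test_split shuffles features and labels together, so rows stay aligned.
X_train, X_test, y_train, y_test = train_test_split(
    risk_factors, labels, test_size=0.2, random_state=123, stratify=labels
)

model = RandomForestClassifier(n_estimators=100, random_state=123)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(acc)
```

With the real data, risk_factors and labels would come from the DataFrame as in the question, with the two random.shuffle calls and the manual slicing removed.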






answered Dec 5 '18 at 3:06
Kasy Chakra





























