RF model loses accuracy when I remove it from Pipeline

Hoping I'm overlooking something stupid here or maybe I don't understand how this is working...

I have an NLP pipeline that does basically the following:

rf_pipeline = Pipeline([
    ('vect', TfidfVectorizer(tokenizer=spacy_tokenizer)),
    ('fit', RandomForestClassifier())
])

I run it:

clf = rf_pipeline.fit(X_train, y_train)
preds = clf.predict(X_test)

When I optimize I get accuracy in the high 90's with the following:

confusion_matrix(y_test, preds)
accuracy_score(y_test, preds)
precision_score(y_test, preds)

The TfidfVectorizer is the bottleneck in my computations, so I wanted to break up the pipeline: run the vectorizer once, and then do a grid search on the classifier alone rather than on the whole pipeline. Here's how I broke it out:

# initialize
tfidf = TfidfVectorizer(tokenizer=spacy_tokenizer)

# fit and transform
vect = tfidf.fit_transform(X_train)
clf = rf_class.fit(vect, y_train)

# predict
clf.predict(tfidf.fit_transform(X_test))

When I checked the accuracy before running a full grid search, it had plummeted to just over 50%. When I tried increasing the number of trees, the score dropped by almost 10%.

Any ideas?

asked Nov 10 at 18:33

Oct

697

Could you make your example reproducible by using one of scikit-learn's included datasets? scikit-learn.org/stable/tutorial/text_analytics/…
– hellpanderr
Nov 11 at 11:13

scikit-learn nlp random-forest spacy tfidfvectorizer


1 Answer

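For the test set, you must not call fit_transform(); call only transform(). Otherwise the vectorizer is refit on the test data, and the elements of the TF-IDF vectors no longer have the same meaning as the ones the classifier was trained on.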
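To see why the meaning changes, here is a minimal sketch on a made-up toy corpus. It assumes the default TfidfVectorizer tokenizer instead of the spacy_tokenizer from the question, and the document lists are purely illustrative. It compares the vocabulary learned from the training text with the one learned when a vectorizer is fit on the test text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, standing in for X_train and X_test.
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["a bird sang", "the cat sang"]

# Fit on the training text only; this fixes the column <-> term mapping.
tfidf = TfidfVectorizer()
X_train_vec = tfidf.fit_transform(train_docs)
train_vocab = dict(tfidf.vocabulary_)

# Correct: reuse the fitted vocabulary and IDF weights for the test text.
X_test_ok = tfidf.transform(test_docs)

# Incorrect: fitting on the test text learns a new, unrelated mapping.
refit = TfidfVectorizer()
X_test_bad = refit.fit_transform(test_docs)
test_vocab = dict(refit.vocabulary_)

print("train columns:", sorted(train_vocab, key=train_vocab.get))
print("refit columns:", sorted(test_vocab, key=test_vocab.get))
print("shapes:", X_train_vec.shape, X_test_ok.shape, X_test_bad.shape)
```

With transform(), the test matrix has the same columns as the training matrix, so column 0 refers to the same term in both. After a refit, the same column index can refer to a different term, and the number of columns may differ as well, so the classifier is effectively reading the wrong features.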

Try this:

# predict
clf.predict(tfidf.transform(X_test))
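For the workflow described in the question (vectorize once, then grid search only the classifier), a rough sketch could look like the following. The toy data, the parameter grid, and the use of the default tokenizer are all assumptions made for illustration, not taken from the question.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data, standing in for the real X_train, y_train, X_test.
X_train = ["good movie", "great film", "loved it", "fine acting",
           "bad movie", "awful film", "hated it", "poor acting"]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]
X_test = ["great movie", "awful acting"]

# Run the expensive vectorizer once: fit on train, only transform on test.
tfidf = TfidfVectorizer()
train_vec = tfidf.fit_transform(X_train)
test_vec = tfidf.transform(X_test)

# Grid search over the classifier alone, reusing the precomputed features.
# The grid values here are arbitrary examples.
param_grid = {"n_estimators": [10, 50], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=2)
search.fit(train_vec, y_train)

preds = search.predict(test_vec)
print(search.best_params_, preds)
```

One trade-off to be aware of: because the vectorizer is fit on the whole training set before cross-validation, each validation fold has already contributed to the vocabulary and IDF weights. That is usually a small effect, but the cross-validated scores can be slightly optimistic compared with searching over the full pipeline.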

answered Nov 12 at 5:04

Tomáš Přinda

33317
