RF model loses accuracy when I remove it from Pipeline
Hoping I'm overlooking something stupid here, or maybe I don't understand how this works...
I have an NLP pipeline that does basically the following:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

rf_pipeline = Pipeline([
    ('vect', TfidfVectorizer(tokenizer=spacy_tokenizer)),
    ('fit', RandomForestClassifier())
])
I run it:
clf = rf_pipeline.fit(X_train, y_train)
preds = clf.predict(X_test)
When I optimize, I get accuracy in the high 90s, measured with the following:
confusion_matrix(y_test, preds)
accuracy_score(y_test, preds)
precision_score(y_test, preds)
The TfidfVectorizer is the bottleneck in my computations, so I wanted to break up the pipeline: run the vectorizer once, then grid search the classifier on its own rather than re-running the whole pipeline. Here's how I broke it out:
# initialize
tfidf = TfidfVectorizer(tokenizer=spacy_tokenizer)
rf_class = RandomForestClassifier()
# fit the vectorizer on the training text, then fit the classifier
vect = tfidf.fit_transform(X_train)
clf = rf_class.fit(vect, y_train)
# predict
clf.predict(tfidf.fit_transform(X_test))
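The classifier-only grid search I have in mind would be something like the sketch below; the parameter grid and CV settings are just placeholders, not values I've actually tuned.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Search over the random forest only; `vect` is the tf-idf matrix already
# computed from X_train above, so the vectorizer runs once, not once per candidate.
param_grid = {
    'n_estimators': [100, 300, 500],  # placeholder values
    'max_depth': [None, 20, 50],      # placeholder values
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, n_jobs=-1)
search.fit(vect, y_train)
print(search.best_params_, search.best_score_)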
When I looked at the accuracy before running a full grid search, it had plummeted to just over 50%. When I tried increasing the number of trees, the score dropped by almost another 10%.
Any ideas?
scikit-learn nlp random-forest spacy tfidfvectorizer
asked Nov 10 at 18:33 by Oct
Could you make your example reproducible by using one of scikit-learn's included datasets? scikit-learn.org/stable/tutorial/text_analytics/…
– hellpanderr
Nov 11 at 11:13
1 Answer
For the test set you can't call fit_transform(), only transform(); otherwise the elements of the tf-idf vectors have a different meaning, because the vectorizer relearns its vocabulary from the test data instead of reusing the one learned on the training data.
Try this:
# predict
clf.predict(tfidf.transform(X_test))
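To see why, here is a tiny sketch on made-up strings (not the data from the question): refitting a vectorizer on the test set learns a new vocabulary, so the columns of the resulting matrix no longer line up with the features the forest was trained on.

from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat", "the dog barked"]
test_docs = ["a dog sat on the mat quietly"]

tfidf = TfidfVectorizer()
X_train_vec = tfidf.fit_transform(train_docs)            # vocabulary learned here
X_test_ok = tfidf.transform(test_docs)                   # reuses that vocabulary
X_test_bad = TfidfVectorizer().fit_transform(test_docs)  # relearns a new vocabulary

print(X_train_vec.shape, X_test_ok.shape, X_test_bad.shape)
# (2, 5) (1, 5) (1, 6): the refitted matrix has a different number and ordering
# of columns, so the classifier's learned feature indices no longer match its input.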
answered Nov 12 at 5:04 by Tomáš Přinda