How to plot percentage with moving threshold in R
I am working on a project in which I use several language detection algorithms, such as Textcat and CLD3. I have a dataframe in which I recorded what language a piece of text was written in, what the guess of each algorithm was and whether that guess was correct.
Because the length of the strings varies greatly, I want to evaluate the performance of each algorithm over a moving threshold (for example, for all strings with more than 5 words, then more than 10 words, and so on).
The data look like this:
    Text     Language  CLD  Textcat  Word_count  CLD_correct  Textcat_correct
    String1  EN        en   en       20          1            1
    String2  EN        NA   fr        5          0            0
    String3  FR        fr   es       10          1            0
    String4  ES        ca   es        7          0            1
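For reproducibility, the four example rows could be entered as a data frame roughly like this. This is only a sketch: the object name `df` and the column types are assumptions, and the real data set is much larger. The last line shows the "by hand" calculation for a single threshold.

```r
# The four sample rows from the table above; `df` is an assumed object name.
df <- data.frame(
  Text            = c("String1", "String2", "String3", "String4"),
  Language        = c("EN", "EN", "FR", "ES"),
  CLD             = c("en", NA, "fr", "ca"),
  Textcat         = c("en", "fr", "es", "es"),
  Word_count      = c(20, 5, 10, 7),
  CLD_correct     = c(1, 0, 1, 0),
  Textcat_correct = c(1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Accuracy of CLD (in percent) for strings with at least 7 words:
# three rows qualify, two of them are correct.
cld_acc_7 <- mean(df$CLD_correct[df$Word_count >= 7]) * 100
```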
What I would dearly like to do is to plot the accuracy for each threshold in terms of the number of words. For example, I found that overall CLD labels the language correctly in 75% of cases. However, when considering only strings with 7 words or more, this goes up to 85%.
So on the x-axis I want to plot the number of words for the threshold, on the y-axis the percentage of correct guesses made by the algorithm.
I know how to do this by hand (subset the dataframe for Word_count > x, calculate the accuracy for each algorithm, store those values in a data frame, repeat for Word_count > y, and so on, and then plot it). But because my sample is very large, that would be a gargantuan amount of work, and there must be a more intelligent way to do this. I considered iterating over the thresholds with a for-loop, calculating the values for each and storing them, but many of the strings in this data set are over 100 words long, and I am considering doing the same for character length.
Does someone know how this could be done in a more automated fashion?
r ggplot2
Just to understand your data correctly, do CLD_correct and Textcat_correct mean 1 = correct and 0 = incorrect? Would you also like to group your data by language?
– Mr_Z
Nov 15 '18 at 15:33
Yes, both are binary for whether either algorithm was correct. It does not have to be grouped by language, I just want to plot the percentage for each threshold.
– MilanV
Nov 16 '18 at 14:41
asked Nov 14 '18 at 23:43 by MilanV
1 Answer
First define a vector with the names of the columns that record whether each algorithm was correct:

    algorithm <- c('Textcat_correct', 'CLD_correct')

Then create a vector with the word-count thresholds for which you want to see the accuracy:

    word.size <- seq(5, 20, 5)

Now you can use the package dplyr and lapply to get a list with one entry for each threshold and algorithm.

    library(dplyr)

    resultList <- lapply(word.size, function(y) {
      lapply(algorithm, function(x) {
        df %>%
          # keep only strings with at least y words
          filter(Word_count >= y) %>%
          # share of correct guesses (column x is 1 = correct, 0 = incorrect)
          summarise(accuracy = mean(.data[[x]]) * 100) %>%
          # label the row with the algorithm column and the threshold
          mutate(algorithm = x, words = y)
      })
    })

You can convert this list to a data frame:

    df2 <- as.data.frame(do.call(rbind, unlist(resultList, recursive = FALSE)))

And now you can plot your results:

    library(ggplot2)

    ggplot(df2, aes(words, accuracy, fill = algorithm)) +
      geom_bar(stat = "identity", position = "dodge")

As a result you get a grouped bar chart with one bar per algorithm at each word-count threshold.
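If you would rather see a continuous curve over every threshold than a handful of bars, a variation along the following lines might work. It is only a sketch under some assumptions: the toy `df` holds just the four sample rows from the question, the names `algorithms`, `thresholds` and `acc` are made up for illustration, and every observed word count is used as a threshold.

```r
# Toy data with the column names from the question (four sample rows only).
df <- data.frame(
  Word_count      = c(20, 5, 10, 7),
  CLD_correct     = c(1, 0, 1, 0),
  Textcat_correct = c(1, 0, 0, 1)
)

algorithms <- c("CLD_correct", "Textcat_correct")
thresholds <- sort(unique(df$Word_count))  # one threshold per observed length

# One row per threshold/algorithm pair.
acc <- expand.grid(words = thresholds, algorithm = algorithms,
                   stringsAsFactors = FALSE)

# Percentage of correct guesses among strings with at least `w` words.
acc$accuracy <- mapply(
  function(w, a) mean(df[[a]][df$Word_count >= w]) * 100,
  acc$words, acc$algorithm
)

# Number of strings behind each percentage, useful for judging reliability.
acc$n <- sapply(acc$words, function(w) sum(df$Word_count >= w))

# Plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(acc, aes(words, accuracy, colour = algorithm)) +
    geom_line() +
    geom_point() +
    labs(x = "Minimum number of words", y = "Correct guesses (%)")
  print(p)
}
```

For character length, the same code should apply after swapping Word_count for a character-count column (a hypothetical name would be Char_count). High thresholds leave few strings, so it is worth checking the n column before reading much into the right-hand end of the curve.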
Thank you for the clear explanation, this is a good way to solve this problem!
– MilanV
Nov 24 '18 at 1:53
answered Nov 21 '18 at 13:53 by Mr_Z (edited Nov 21 '18 at 17:41)