How to plot percentage with moving threshold in R



























I am working on a project in which I use several language detection algorithms, such as Textcat and CLD3. I have a dataframe in which I recorded what language a piece of text was written in, what the guess of each algorithm was and whether that guess was correct.



Because the length of the strings varies greatly, I want to evaluate the performance of each algorithm over a moving threshold (for example, for all strings with more than 5 words, then more than 10 words, and so on).



The data look like this:



Text     Language  CLD  Textcat  Word_count  CLD_correct  Textcat_correct
String1  EN        en   en       20          1            1
String2  EN        NA   fr       5           0            0
String3  FR        fr   es       10          1            0
String4  ES        ca   es       7           0            1
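For anyone who wants to reproduce the problem, the four sample rows can be typed in as a data frame. This is only a sketch: it assumes the two `_correct` columns are numeric 0/1 flags and that the `NA` in the CLD column is a missing value rather than the string "NA".

```r
# Reconstruction of the four sample rows (assumed column types, see above)
df <- data.frame(
  Text = c("String1", "String2", "String3", "String4"),
  Language = c("EN", "EN", "FR", "ES"),
  CLD = c("en", NA, "fr", "ca"),
  Textcat = c("en", "fr", "es", "es"),
  Word_count = c(20, 5, 10, 7),
  CLD_correct = c(1, 0, 1, 0),
  Textcat_correct = c(1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Overall accuracy per algorithm in percent: the mean of a 0/1 column
# is the share of correct guesses
overall <- colMeans(df[, c("CLD_correct", "Textcat_correct")]) * 100
print(overall)
```

Because the correctness columns are 0/1, the mean of a column over any subset of rows is the accuracy on that subset, which is what makes the thresholded calculation easy to automate.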


What I would dearly like to do is to plot the accuracy for each threshold in terms of the number of words. For example, I found that overall CLD labels the language correctly in 75% of cases. However, when considering only strings with 7 words or more, this goes up to 85%.



So on the x-axis I want to plot the number of words for the threshold, on the y-axis the percentage of correct guesses made by the algorithm.



I know how to do this by hand: subset the dataframe for Word_count > x, calculate the accuracy for each algorithm, store the results in a data frame, repeat for Word_count > y, and so on, and then plot it. But because my sample is very large, that would take a gargantuan amount of work, and there must be a more intelligent way to do this. I considered iterating over different thresholds with a for-loop, calculating and storing the values for each, but many of the strings in this data set are over 100 words long, and I am considering doing the same for character length.



Does anyone know how this could be done in a more automated fashion?

































  • Just to understand your data correctly: do CLD_correct and Textcat_correct mean 1 = correct and 0 = incorrect? Would you also like to group your data by language?

    – Mr_Z
    Nov 15 '18 at 15:33











  • Yes, both are binary flags for whether the algorithm was correct. It does not have to be grouped by language; I just want to plot the percentage for each threshold.

    – MilanV
    Nov 16 '18 at 14:41























r ggplot2
















asked Nov 14 '18 at 23:43









MilanV


























1 Answer
































First, define a vector naming the correctness columns of the algorithms you used:



algorithm <- c('Textcat_correct', 'CLD_correct')


Then create a vector with the number of words for which you want to see the accuracy



word.size <- seq(5, 20, 5)


Now you can use the dplyr package together with lapply to get one result for each word-count threshold and algorithm.



library(dplyr)
resultList <- lapply(word.size, function(y) {
  lapply(algorithm, function(x) {
    df %>%
      rename(algorithm = all_of(x)) %>%   # treat the current correctness column as `algorithm`
      filter(Word_count >= y) %>%         # keep strings at or above the threshold
      group_by(algorithm) %>%
      summarise(all = n()) %>%            # count strings (not words) per outcome
      mutate(accuracy = all / sum(all) * 100) %>%
      filter(algorithm == 1) %>%          # keep the share of correct guesses
      mutate(algorithm = x, words = y)    # label the row with column name and threshold
  })
})


You can convert this nested list to a single data frame:



df2 <- as.data.frame(do.call(rbind, unlist(resultList, recursive = FALSE)))
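The two steps above (one result per threshold and algorithm, then binding them together) can also be folded into a single base R helper. The sketch below is not part of the original answer: `accuracy_by_threshold` is a hypothetical function name, and the small `df` holds only the numeric columns of the question's four sample rows. Taking the length column as an argument means the same helper could be pointed at a character-length column instead of `Word_count`.

```r
# Hypothetical helper: accuracy (in percent) among rows whose length is
# at or above each threshold, for each 0/1 correctness column.
accuracy_by_threshold <- function(data, length_col, correct_cols, thresholds) {
  rows <- lapply(thresholds, function(th) {
    keep <- data[[length_col]] >= th
    data.frame(
      threshold = th,
      algorithm = correct_cols,
      n = sum(keep),  # number of strings left at this threshold
      accuracy = vapply(correct_cols,
                        function(col) mean(data[[col]][keep]) * 100,
                        numeric(1)),
      row.names = NULL
    )
  })
  do.call(rbind, rows)
}

# Numeric columns of the four sample rows from the question
df <- data.frame(
  Word_count = c(20, 5, 10, 7),
  CLD_correct = c(1, 0, 1, 0),
  Textcat_correct = c(1, 0, 0, 1)
)

# Use every observed word count as a threshold, so no subset is empty
res <- accuracy_by_threshold(
  df, "Word_count",
  c("CLD_correct", "Textcat_correct"),
  thresholds = sort(unique(df$Word_count))
)
print(res)
```

The `n` column is kept because accuracy at high thresholds may rest on very few strings, which is worth checking before reading much into the right-hand end of a plot.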


And now you can plot your results



library(ggplot2)
ggplot(df2, aes(words, accuracy, fill=algorithm)) +
geom_bar(stat="identity", position="dodge")
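With many thresholds, a line per algorithm may be easier to read than dodged bars. The following is a sketch of that variant, not what the answer itself produced. The `df2` here is a hypothetical stand-in with the same three columns (`words`, `algorithm`, `accuracy`); its values were worked out by hand from the four sample rows in the question, not taken from real results.

```r
library(ggplot2)

# Stand-in summary table: one row per threshold and algorithm.
# Values derived by hand from the question's four sample rows.
df2 <- data.frame(
  words = rep(c(5, 7, 10, 20), each = 2),
  algorithm = rep(c("CLD_correct", "Textcat_correct"), times = 4),
  accuracy = c(50, 50, 200 / 3, 200 / 3, 100, 50, 100, 100)
)

# Threshold on the x-axis, percentage of correct guesses on the y-axis
p <- ggplot(df2, aes(x = words, y = accuracy, colour = algorithm)) +
  geom_line() +
  geom_point() +
  labs(x = "Minimum word count", y = "Correct guesses (%)")
print(p)
```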


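The result is a grouped bar chart of accuracy against the word-count threshold, with one bar per algorithm. (The image from the original answer is not preserved here.)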
































  • Thank you for the clear explanation, this is a good way to solve this problem!

    – MilanV
    Nov 24 '18 at 1:53












































edited Nov 21 '18 at 17:41

























answered Nov 21 '18 at 13:53









Mr_Z

































