How to plot percentage with moving threshold in R

I am working on a project in which I use several language detection algorithms, such as Textcat and CLD3. I have a dataframe in which I recorded what language a piece of text was written in, what the guess of each algorithm was and whether that guess was correct.

Because the length of the strings varies greatly, I want to evaluate the performance of each algorithm over a moving threshold (for example, for all strings with more than 5 words, then more than 10 words, etc.).

The data look like this:

Text     Language  CLD  Textcat  Word_count  CLD_correct  Textcat_correct
String1  EN        en   en       20          1            1
String2  EN        NA   fr       5           0            0
String3  FR        fr   es       10          1            0
String4  ES        ca   es       7           0            1
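For anyone who wants to run code against this sample, the four rows can be typed in as a data frame. This is only a sketch: the name `df` and the column types are assumptions, and the `NA` guess is stored as a missing value.

```r
# Sample rows from the table above; `df` is an assumed name
df <- data.frame(
  Text            = c("String1", "String2", "String3", "String4"),
  Language        = c("EN", "EN", "FR", "ES"),
  CLD             = c("en", NA, "fr", "ca"),
  Textcat         = c("en", "fr", "es", "es"),
  Word_count      = c(20, 5, 10, 7),
  CLD_correct     = c(1, 0, 1, 0),
  Textcat_correct = c(1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Because the *_correct columns are 0/1, their mean is the share of correct guesses
overall_cld <- mean(df$CLD_correct) * 100
overall_cld
```

On these four rows the overall CLD accuracy is 50%, since two of the four guesses are marked correct.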

What I would dearly like to do is to plot the accuracy for each threshold in terms of the number of words. For example, I found that overall CLD labels the language correctly in 75% of cases. However, when considering only strings with 7 words or more, this goes up to 85%.

So on the x-axis I want to plot the number of words for the threshold, on the y-axis the percentage of correct guesses made by the algorithm.

I know how to do this by hand (subset the dataframe for Word_count > x, calculate the accuracy for each algorithm, store those in a data frame, repeat for Word_count > y, and so on, and then plot it), but because my sample is very large, that would take a gargantuan amount of work, and there must be a more intelligent way to do this. I considered iterating over different thresholds with a for-loop to calculate values for each and then storing those, but a large part of the strings in this data set can be over 100 words, and I am considering doing the same for character length.

Does anyone know how this could be done in a more automated fashion?

asked Nov 14 '18 at 23:43

MilanV

103

Just to understand your data correctly, do CLD_correct and Textcat_correct mean 1 = correct and 0 = incorrect? Would you also like to group your data by language?

– Mr_Z
Nov 15 '18 at 15:33

Yes, both are binary for whether either algorithm was correct. It does not have to be grouped by language, I just want to plot the percentage for each threshold.

– MilanV
Nov 16 '18 at 14:41

r ggplot2


1 Answer

First, define a vector with the names of the columns that record whether each algorithm was correct:

algorithm <- c('Textcat_correct', 'CLD_correct')

Then create a vector with the number of words for which you want to see the accuracy

word.size <- seq(5, 20, 5)

Now you can use the package dplyr and lapply to get a list for each word amount and algorithm.

library(dplyr)

resultList <- lapply(word.size, function(y) {
    lapply(algorithm, function(x) {
        df %>%
            # give the current 0/1 column a fixed name
            rename(algorithm = all_of(x)) %>%
            # keep only strings at or above the threshold
            filter(Word_count >= y) %>%
            group_by(algorithm) %>%
            # count strings per outcome (0 = wrong, 1 = correct)
            summarise(all = n()) %>%
            mutate(accuracy = all / sum(all) * 100) %>%
            filter(algorithm == 1) %>%
            # label the row with the column name and the threshold
            mutate(algorithm = x, words = y)
    })
})
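Since the correctness columns are 0/1, the accuracy at a threshold is simply the mean of the column over the strings that pass the filter. A minimal base R sketch of that idea follows; it is an alternative to the dplyr pipeline, not the approach used above, and the names `thresholds`, `cols` and `accuracy_by_threshold` are made up for illustration. The toy data reuses the four sample rows from the question.

```r
# Toy data with the same column names as in the question
df <- data.frame(
  Word_count      = c(20, 5, 10, 7),
  CLD_correct     = c(1, 0, 1, 0),
  Textcat_correct = c(1, 0, 0, 1)
)

thresholds <- sort(unique(df$Word_count))   # every observed length: 5, 7, 10, 20
cols <- c("CLD_correct", "Textcat_correct")

# For each threshold, keep rows at or above it and average the 0/1 columns
acc <- t(sapply(thresholds, function(y) {
  colMeans(df[df$Word_count >= y, cols, drop = FALSE]) * 100
}))

accuracy_by_threshold <- data.frame(words = thresholds, acc)
print(accuracy_by_threshold)
```

Using every observed word count as a threshold avoids having to pick a step size by hand, which may help when strings run past 100 words. The same pattern should work for character length by swapping in a different length column.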

You can convert this nested list to a single data frame:

df2 <- as.data.frame(do.call(rbind, unlist(resultList, recursive = FALSE)))

And now you can plot your results

library(ggplot2)

ggplot(df2, aes(words, accuracy, fill = algorithm)) +
    geom_bar(stat = "identity", position = "dodge")
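If the thresholds are closely spaced, a line per algorithm may be easier to read than dodged bars. The sketch below is one possible variant, not part of the original answer: `res` is a hypothetical long-format data frame shaped like `df2`, filled with accuracies worked out by hand from the four sample rows in the question (rounded to one decimal).

```r
library(ggplot2)

# Hypothetical long-format results: one row per threshold and algorithm
res <- data.frame(
  words     = rep(c(5, 7, 10, 20), times = 2),
  accuracy  = c(50, 66.7, 100, 100,    # CLD_correct
                50, 66.7, 50, 100),    # Textcat_correct
  algorithm = rep(c("CLD_correct", "Textcat_correct"), each = 4)
)

# One line per algorithm, threshold on the x-axis, accuracy on the y-axis
p <- ggplot(res, aes(words, accuracy, colour = algorithm)) +
  geom_line() +
  geom_point() +
  labs(x = "Minimum number of words", y = "Correct guesses (%)")
p
```

Mapping `algorithm` to `colour` rather than `fill` is what makes ggplot2 draw separate lines per algorithm.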

As a result, you get this:

[Image: bar chart of accuracy by word-count threshold, with one bar per algorithm]

edited Nov 21 '18 at 17:41

answered Nov 21 '18 at 13:53

Mr_Z

18616

Thank you for the clear explanation, this is a good way to solve this problem!

– MilanV
Nov 24 '18 at 1:53
