Remove duplicate docs with high similarity



























When downloading LexisNexis newspaper articles, there are often many duplicate articles in the corpus. I want to remove them, and I was thinking of doing so using cosine similarity statistics, but I'm not sure how to automate this. Any ideas?










r quanteda

asked Nov 14 '18 at 14:09
fritsvegters
























          2 Answers














          Your question is fairly thin on details - such as a reproducible example - but it's an interesting question and challenge. So here goes.



          Let's say we have a corpus consisting of two sets of similar documents, { (a1, a2, a3), (b1, b2) } where the letters indicate similarity. We want to keep just one document when the others are "duplicates", defined as similarity exceeding a threshold, say 0.80.



          We can use textstat_simil() to generate a similarity matrix, form pairwise sets directly from the returned dist object, and then keep just one document from each similar set.



          library("quanteda")
          # Loading required package: quanteda
          # Package version: 1.3.14

          mydocs <- c(a1 = "a a a a a b b c d w g j t",
          b1 = "l y y h x x x x x y y y y",
          a2 = "a a a a a b c s k w i r f",
          b2 = "p q w e d x x x x y y y y",
          a3 = "a a a a a b b x k w i r f")

          mydfm <- dfm(mydocs)

          (sim <- textstat_simil(mydfm))
          # a1 b1 a2 b2
          # b1 -0.22203788
          # a2 0.80492203 -0.23090513
          # b2 -0.23427416 0.90082239 -0.28140219
          # a3 0.81167608 -0.09065452 0.92242890 -0.12530944

          # create a data.frame of the unique pairs and their similarities
          sim_pair_names <- t(combn(docnames(mydfm), 2))
          sim_pairs <- data.frame(sim_pair_names,
          sim = as.numeric(sim),
          stringsAsFactors = FALSE)
          sim_pairs
          # X1 X2 sim
          # 1 a1 b1 -0.22203788
          # 2 a1 a2 0.80492203
          # 3 a1 b2 -0.23427416
          # 4 a1 a3 0.81167608
          # 5 b1 a2 -0.23090513
          # 6 b1 b2 0.90082239
          # 7 b1 a3 -0.09065452
          # 8 a2 b2 -0.28140219
          # 9 a2 a3 0.92242890
          # 10 b2 a3 -0.12530944


          Subsetting this on our threshold condition, we can extract the names of the unlucky documents to be dropped, and feed this to a logical condition in dfm_subset().



          # set the threshold for similarity
          threshold <- 0.80

          # discard one of the pair if similarity > threshold
          todrop <- subset(sim_pairs, select = X1, subset = sim > threshold, drop = TRUE)
          todrop
          # [1] "a1" "a1" "b1" "a2"

          # then subset the dfm, keeping only the "keepers"
          dfm_subset(mydfm, !docnames(mydfm) %in% todrop)
          # Document-feature matrix of: 2 documents, 20 features (62.5% sparse).
          # 2 x 20 sparse Matrix of class "dfm"
          # features
          # docs a b c d w g j t l y h x s k i r f p q e
          # b2 0 0 0 1 1 0 0 0 0 4 0 4 0 0 0 0 0 1 1 1
          # a3 5 2 0 0 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0


          Other approaches to this problem of near-duplicate documents would be to cluster them, or to reduce the document-feature matrix using principal components analysis, along the lines of latent semantic analysis.
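
          Not part of the original answer, but a minimal sketch of the clustering route it mentions, assuming the sim, threshold, and mydfm objects defined above (hierarchical clustering on 1 - similarity, cut at the same 0.80 threshold, keeping one document per cluster):

          # cluster on distance = 1 - similarity, then keep the first
          # document from each cluster of near-duplicates
          distances <- as.dist(1 - as.matrix(sim))
          clusters <- cutree(hclust(distances, method = "average"), h = 1 - threshold)
          tokeep <- names(clusters)[!duplicated(clusters)]
          dfm_subset(mydfm, docnames(mydfm) %in% tokeep)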






          edited Nov 14 '18 at 23:00
          answered Nov 14 '18 at 15:49
          Ken Benoit





            Wow, this is really awesome. I'm sorry for not coming up with a reproducible example, but I didn't know how to create a data frame with pairs. Again, thanks for this.
            – fritsvegters
            Nov 15 '18 at 14:40

































          If you have thousands of documents, storing all the similarity scores takes a lot of RAM, but you can set a minimum threshold in textstat_proxy(), the underlying function of textstat_simil().



          In this example, cosine similarity scores smaller than 0.9 are all ignored.



          library("quanteda")
          mydocs <- c(a1 = "a a a a a b b c d w g j t",
          b1 = "l y y h x x x x x y y y y",
          a2 = "a a a a a b c s k w i r f",
          b2 = "p q w e d x x x x y y y y",
          a3 = "a a a a a b b x k w i r f")
          mydfm <- dfm(mydocs)

          (sim <- textstat_proxy(mydfm, method = "cosine", min_proxy = 0.9))
          # 5 x 5 sparse Matrix of class "dsTMatrix"
          # a1 b1 a2 b2 a3
          # a1 1 . . . .
          # b1 . 1.0000000 . 0.9113423 .
          # a2 . . 1.0000000 . 0.9415838
          # b2 . 0.9113423 . 1.0000000 .
          # a3 . . 0.9415838 . 1.0000000

          matrix2list <- function(x) {
          names(x@x) <- rownames(x)[x@i + 1]
          split(x@x, factor(x@j + 1, levels = seq(ncol(x)), labels = colnames(x)))
          }

          matrix2list(sim)
          # $a1
          # a1
          # 1
          #
          # $b1
          # b1
          # 1
          #
          # $a2
          # a2
          # 1
          #
          # $b2
          # b1 b2
          # 0.9113423 1.0000000
          #
          # $a3
          # a2 a3
          # 0.9415838 1.0000000


          See https://koheiw.net/?p=839 for the performance differences.
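
          Not part of the original answer: a minimal sketch of turning that thresholded sparse matrix into a set of documents to drop. It reads the same triplet slots (@i, @j, @x) that matrix2list() uses, assumes the sim and mydfm objects above, and arbitrarily keeps the earlier document of each above-threshold pair:

          # pull the stored (above-threshold) pairs out of the sparse matrix,
          # skip the self-similarities on the diagonal, and drop the later
          # document of each remaining pair
          pairs <- data.frame(i = sim@i + 1, j = sim@j + 1, sim = sim@x)
          pairs <- subset(pairs, i != j)
          todrop <- unique(colnames(sim)[pmax(pairs$i, pairs$j)])
          dfm_subset(mydfm, !docnames(mydfm) %in% todrop)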






          answered Nov 17 '18 at 12:41
          Kohei Watanabe