Remove duplicate docs with high similarity



























When downloading LexisNexis newspaper articles, there are often many duplicate articles in the corpus. I want to remove them, and I was thinking of doing so using cosine similarity statistics, but I'm not sure how to automate this. Any ideas?










r quanteda

asked Nov 14 '18 at 14:09
fritsvegters
























          2 Answers














          Your question is fairly thin on details - such as a reproducible example - but it's an interesting question and challenge. So here goes.



          Let's say we have a corpus consisting of two sets of similar documents, { (a1, a2, a3), (b1, b2) } where the letters indicate similarity. We want to keep just one document when the others are "duplicates", defined as similarity exceeding a threshold, say 0.80.



          We can use textstat_simil() to generate a similarity matrix, form pairwise sets directly from the returned dist object, and then keep just one document from each similar set.



          library("quanteda")
          # Loading required package: quanteda
          # Package version: 1.3.14

          mydocs <- c(a1 = "a a a a a b b c d w g j t",
          b1 = "l y y h x x x x x y y y y",
          a2 = "a a a a a b c s k w i r f",
          b2 = "p q w e d x x x x y y y y",
          a3 = "a a a a a b b x k w i r f")

          mydfm <- dfm(mydocs)

          (sim <- textstat_simil(mydfm))
          # a1 b1 a2 b2
          # b1 -0.22203788
          # a2 0.80492203 -0.23090513
          # b2 -0.23427416 0.90082239 -0.28140219
          # a3 0.81167608 -0.09065452 0.92242890 -0.12530944

          # create a data.frame of the unique pairs and their similarities
          sim_pair_names <- t(combn(docnames(mydfm), 2))
          sim_pairs <- data.frame(sim_pair_names,
          sim = as.numeric(sim),
          stringsAsFactors = FALSE)
          sim_pairs
          # X1 X2 sim
          # 1 a1 b1 -0.22203788
          # 2 a1 a2 0.80492203
          # 3 a1 b2 -0.23427416
          # 4 a1 a3 0.81167608
          # 5 b1 a2 -0.23090513
          # 6 b1 b2 0.90082239
          # 7 b1 a3 -0.09065452
          # 8 a2 b2 -0.28140219
          # 9 a2 a3 0.92242890
          # 10 b2 a3 -0.12530944


          Subsetting this on our threshold condition, we can extract the names of the unlucky documents to be dropped, and feed this to a logical condition in dfm_subset().



          # set the threshold for similarity
          threshold <- 0.80

          # discard one of the pair if similarity > threshold
          todrop <- subset(sim_pairs, select = X1, subset = sim > threshold, drop = TRUE)
          todrop
          # [1] "a1" "a1" "b1" "a2"

          # then subset the dfm, keeping only the "keepers"
          dfm_subset(mydfm, !docnames(mydfm) %in% todrop)
          # Document-feature matrix of: 2 documents, 20 features (62.5% sparse).
          # 2 x 20 sparse Matrix of class "dfm"
          # features
          # docs a b c d w g j t l y h x s k i r f p q e
          # b2 0 0 0 1 1 0 0 0 0 4 0 4 0 0 0 0 0 1 1 1
          # a3 5 2 0 0 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0


          Other approaches to this problem of near-duplicate documents would be to cluster them, or to reduce the document-feature matrix using principal components analysis, along the lines of latent semantic analysis.
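
          Not part of the original answer, but a minimal sketch of the clustering route it mentions, assuming the sim, threshold, and mydfm objects defined above (hierarchical clustering on 1 - similarity, cut at the same 0.80 threshold, keeping one document per cluster):

          # cluster on distance = 1 - similarity, then keep the first
          # document from each cluster of near-duplicates
          distances <- as.dist(1 - as.matrix(sim))
          clusters <- cutree(hclust(distances, method = "average"), h = 1 - threshold)
          tokeep <- names(clusters)[!duplicated(clusters)]
          dfm_subset(mydfm, docnames(mydfm) %in% tokeep)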






          edited Nov 14 '18 at 23:00
          answered Nov 14 '18 at 15:49
          Ken Benoit





            Wow, this is really awesome. I'm sorry for not coming up with a reproducible example, but I didn't know how to create a data frame with pairs. Again, thanks for this.
            – fritsvegters
            Nov 15 '18 at 14:40

































          If you have thousands of documents, storing all the similarity scores takes a lot of RAM, but you can set a minimum threshold in textstat_proxy(), the underlying function of textstat_simil().



          In this example, cosine similarity scores smaller than 0.9 are all ignored.



          library("quanteda")
          mydocs <- c(a1 = "a a a a a b b c d w g j t",
          b1 = "l y y h x x x x x y y y y",
          a2 = "a a a a a b c s k w i r f",
          b2 = "p q w e d x x x x y y y y",
          a3 = "a a a a a b b x k w i r f")
          mydfm <- dfm(mydocs)

          (sim <- textstat_proxy(mydfm, method = "cosine", min_proxy = 0.9))
          # 5 x 5 sparse Matrix of class "dsTMatrix"
          # a1 b1 a2 b2 a3
          # a1 1 . . . .
          # b1 . 1.0000000 . 0.9113423 .
          # a2 . . 1.0000000 . 0.9415838
          # b2 . 0.9113423 . 1.0000000 .
          # a3 . . 0.9415838 . 1.0000000

          matrix2list <- function(x) {
          names(x@x) <- rownames(x)[x@i + 1]
          split(x@x, factor(x@j + 1, levels = seq(ncol(x)), labels = colnames(x)))
          }

          matrix2list(sim)
          # $a1
          # a1
          # 1
          #
          # $b1
          # b1
          # 1
          #
          # $a2
          # a2
          # 1
          #
          # $b2
          # b1 b2
          # 0.9113423 1.0000000
          #
          # $a3
          # a2 a3
          # 0.9415838 1.0000000


          See https://koheiw.net/?p=839 for the performance differences.
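
          Not part of the original answer: a minimal sketch of turning that thresholded sparse matrix into a set of documents to drop. It reads the same triplet slots (@i, @j, @x) that matrix2list() uses, assumes the sim and mydfm objects above, and arbitrarily keeps the earlier document of each above-threshold pair:

          # pull the stored (above-threshold) pairs out of the sparse matrix,
          # skip the self-similarities on the diagonal, and drop the later
          # document of each remaining pair
          pairs <- data.frame(i = sim@i + 1, j = sim@j + 1, sim = sim@x)
          pairs <- subset(pairs, i != j)
          todrop <- unique(colnames(sim)[pmax(pairs$i, pairs$j)])
          dfm_subset(mydfm, !docnames(mydfm) %in% todrop)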






          answered Nov 17 '18 at 12:41
          Kohei Watanabe