
Showing posts from February 16, 2019

Remove duplicate docs or docs with high similarity

When downloading LexisNexis newspaper articles, the corpus often ends up containing many duplicate articles. I want to remove them, and I was thinking of using cosine similarity to do so, but I'm not sure how to automate this. Any ideas?

Tags: r, quanteda

asked Nov 14 '18 at 14:09 by fritsvegters
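One way the automation could look, as a minimal sketch only. It is written in plain Python with the standard library rather than in R with quanteda, so that it is self-contained, and the same logic would have to be ported to the asker's setup. The helper names (`term_counts`, `cosine`, `drop_near_duplicates`), the sample articles, and the 0.9 threshold are all assumptions made for illustration, not something taken from the question. The idea is to turn each article into a bag-of-words count vector, then walk through the corpus and keep an article only if its cosine similarity to every article already kept stays below the threshold.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens (a crude bag of words)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def drop_near_duplicates(docs, threshold=0.9):
    """Return the indices of the documents to keep, scanning in order.

    A document is dropped when its cosine similarity to any already kept
    document is at or above `threshold`. The default of 0.9 is an arbitrary
    starting point (an assumption) and should be tuned on real data.
    """
    kept = []
    kept_vectors = []
    for i, text in enumerate(docs):
        vec = term_counts(text)
        if any(cosine(vec, other) >= threshold for other in kept_vectors):
            continue  # near-duplicate of something we already kept
        kept.append(i)
        kept_vectors.append(vec)
    return kept


# Hypothetical sample corpus: the second article is a lightly edited
# copy of the first, the third is unrelated.
articles = [
    "The city council approved the new budget on Monday evening.",
    "The city council approved the new budget on Monday evening, officials said.",
    "Heavy rain flooded several streets in the harbour district.",
]
keep = drop_near_duplicates(articles, threshold=0.9)
unique_articles = [articles[i] for i in keep]
```

With these sample articles, `keep` holds the indices of the first and third article, because the second one scores roughly 0.93 against the first. The scan compares every article with every kept article, so the cost grows quadratically with corpus size; for a large download it would probably be worth restricting comparisons, for example to articles published on the same day.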