Remove near-duplicate documents from a corpus based on high similarity
When downloading newspaper articles from LexisNexis, the corpus often ends up containing many duplicate or near-duplicate articles. I want to remove them, and I was thinking of using cosine similarity to detect them, but I'm not sure how to automate this. Any ideas?
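To make the idea concrete, here is a minimal sketch of how the filtering could be automated. It uses base R only so that it runs without extra packages. The toy texts, the 0.9 threshold, and the helper name `tokenize` are assumptions for illustration, not part of any library. With quanteda, `textstat_simil()` with `method = "cosine"` on a dfm should give the same kind of document-by-document matrix and would likely scale better on a real corpus.

```r
# Hypothetical toy corpus; in practice these would be the downloaded article texts.
docs <- c(
  a = "The central bank raised interest rates on Tuesday.",
  b = "The central bank raised interest rates on Tuesday!",
  c = "Local football club wins the national cup final."
)

# Illustrative helper: lowercase, split on non-letters, drop empty strings.
tokenize <- function(x) {
  toks <- strsplit(tolower(x), "[^a-z]+")[[1]]
  toks[nzchar(toks)]
}

# Build a document-term count matrix (rows = documents, columns = terms).
tok_list <- lapply(docs, tokenize)
vocab <- sort(unique(unlist(tok_list)))
dtm <- t(vapply(
  tok_list,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))

# Cosine similarity between every pair of documents.
norms <- sqrt(rowSums(dtm^2))
sim <- tcrossprod(dtm) / outer(norms, norms)

# Greedy filter: keep a document only if it is not too similar
# to any document that has already been kept.
threshold <- 0.9  # assumed cut-off; tune on a sample of known duplicates
keep <- logical(length(docs))
for (i in seq_along(docs)) {
  keep[i] <- !any(sim[i, which(keep)] > threshold)
}

deduped <- docs[keep]
print(names(deduped))
```

The greedy pass keeps the first article of each group of near-duplicates, so the order of the corpus decides which copy survives. Sorting by publication date beforehand would keep the earliest version.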
r quanteda
asked Nov 14 '18 at 14:09
fritsvegters
...