Text clustering/NLP [closed]












1















Imagine a dataset has a column representing a university. We need to group the values so that the number of groups is as close as possible to the real number of universities. The problem is that the same university may be named in different ways, for example: University of Stanford = Stanford University = Uni of Stanford. Is there an NLP method, function, or library for this in Python 3?



Let's consider both cases: the data might be tagged or untagged.



Thanks in advance.

























closed as too broad by cricket_007, usr2564301, Owen Pauling, jcubic, Janusz Nov 15 '18 at 14:16


Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.



















  • Speaking of Stanford, have you heard of the CoreNLP library? Tried it?

    – cricket_007
    Nov 15 '18 at 7:36













  • I will try it; I hadn't heard of it. Thanks for sharing.

    – BC1554
    Nov 17 '18 at 17:16













  • Hi, I am on Elasticsearch, not Python, so it is kind of different. I have had to spend lots of time trying to find a solution. Please see my problem in the comment below.

    – BC1554
    Dec 6 '18 at 6:04











  • Sure, your data is stored there. That doesn't mean you can't query for something, then use Python libraries to do something else to the data, then insert the results back into Elasticsearch.

    – cricket_007
    Dec 6 '18 at 14:07











  • @cricket_007 That's an option. I'm curious how that would behave if I have 5 million unique names from Elasticsearch and want to match them against the original list of 5 thousand university names. My first idea is to try matching using n-grams (probably comparing trigrams vs. bigrams, and stemmed vs. unstemmed). What are your thoughts?

    – BC1554
    Dec 6 '18 at 14:32
















python machine-learning nlp text-classification














edited Nov 17 '18 at 21:15







BC1554

















asked Nov 15 '18 at 7:22









BC1554





1 Answer


















2














A very simple unsupervised approach would be k-means clustering. The advantage here is that you know how many clusters (k) to expect, provided you know the number of universities in advance.



You could then use a package such as scikit-learn to create your feature vectors (most likely character n-grams, using a CountVectorizer with the option analyzer='char') and cluster them to group similarly written university names together.



There is no guarantee that the groups will match perfectly, but I think that it should work quite well, as long as the different spellings are somewhat similar.
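As a rough illustration of the idea above, here is a minimal, dependency-free sketch: it counts character trigrams per name and runs a small k-means over those counts. The helper names (normalize, char_ngrams, kmeans), the GENERIC_WORDS list, and the sample names are all assumptions made for this example. The hand-written helpers stand in for scikit-learn's CountVectorizer(analyzer='char') and KMeans, which you would normally use instead. Stripping generic words is an extra step not mentioned above; without it, names can end up grouped by the shared word "University" rather than by institution.

```python
import re
from collections import Counter

# Assumption: words that carry no identity and would dominate n-gram overlap.
GENERIC_WORDS = {"university", "uni", "of", "the"}


def normalize(name):
    """Lower-case a name and drop generic words such as 'university'."""
    words = re.findall(r"[a-z]+", name.lower())
    return " ".join(w for w in words if w not in GENERIC_WORDS)


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a space-padded string."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse count vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


def mean_vector(vectors):
    """Component-wise mean of sparse count vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {key: count / len(vectors) for key, count in total.items()}


def kmeans(vectors, k, max_iter=100):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from all centroids chosen so far.
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(dict(farthest))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels


# Hypothetical sample data: two universities, several spellings each.
names = [
    "Stanford University",
    "University of Stanford",
    "Uni of Stanford",
    "Standford University",
    "Harvard University",
    "University of Harvard",
    "Harvrd Uni",
]
vectors = [char_ngrams(normalize(name)) for name in names]
labels = kmeans(vectors, k=2)

for name, label in zip(names, labels):
    print(label, name)
```

On this toy input the misspelled entries ("Standford University", "Harvrd Uni") land in the same cluster as their correctly spelled counterparts. With real data the result depends heavily on the normalisation step and on how similar the spellings are, so the clusters should be spot-checked by hand.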






























  • @BC1554 Any update on whether this approach was useful?

    – Ivo Merchiers
    Nov 30 '18 at 8:55











  • Hi, I thought I would use Python in my new job. However, I am using Elasticsearch, so it is kind of different. I am trying to implement match_phrase + fuzziness, but it seems to be impossible (not many examples online, and no such cases described in the documentation). Does anybody have experience with phrase matching including fuzziness in Elasticsearch? Thanks in advance :)

    – BC1554
    Dec 6 '18 at 6:03


















answered Nov 15 '18 at 7:44









Ivo Merchiers


















