Fast removal of low frequency words in Python











up vote
1
down vote

favorite












In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.



doc=[w for w in doc if not w in lowFrequencyWords]


The problem is that this piece of code lasts forever.



I am note sure, but I believe that the problem is that the operation of removal of an intermediate element from a list costs O(n), where n is the size of the list. Since the number of lowFrequencyWords is giant, python has to repeat that many times. I looked for linked lists, but I believe that they are not available in Python.










share|improve this question


















  • 2




    make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
    – Patrick Artner
    Nov 10 at 20:06








  • 1




    also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
    – Patrick Artner
    Nov 10 at 20:10






  • 1




    because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
    – Patrick Artner
    Nov 10 at 20:15








  • 2




    @PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
    – Jon Clements
    Nov 10 at 20:15








  • 1




    @JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
    – Patrick Artner
    Nov 10 at 20:17

















up vote
1
down vote

favorite












In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.



doc=[w for w in doc if not w in lowFrequencyWords]


The problem is that this piece of code lasts forever.



I am note sure, but I believe that the problem is that the operation of removal of an intermediate element from a list costs O(n), where n is the size of the list. Since the number of lowFrequencyWords is giant, python has to repeat that many times. I looked for linked lists, but I believe that they are not available in Python.










share|improve this question


















  • 2




    make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
    – Patrick Artner
    Nov 10 at 20:06








  • 1




    also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
    – Patrick Artner
    Nov 10 at 20:10






  • 1




    because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
    – Patrick Artner
    Nov 10 at 20:15








  • 2




    @PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
    – Jon Clements
    Nov 10 at 20:15








  • 1




    @JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
    – Patrick Artner
    Nov 10 at 20:17















up vote
1
down vote

favorite









up vote
1
down vote

favorite











In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.



doc=[w for w in doc if not w in lowFrequencyWords]


The problem is that this piece of code lasts forever.



I am note sure, but I believe that the problem is that the operation of removal of an intermediate element from a list costs O(n), where n is the size of the list. Since the number of lowFrequencyWords is giant, python has to repeat that many times. I looked for linked lists, but I believe that they are not available in Python.










share|improve this question













In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.



doc=[w for w in doc if not w in lowFrequencyWords]


The problem is that this piece of code lasts forever.



I am note sure, but I believe that the problem is that the operation of removal of an intermediate element from a list costs O(n), where n is the size of the list. Since the number of lowFrequencyWords is giant, python has to repeat that many times. I looked for linked lists, but I believe that they are not available in Python.







python-3.x nlp nltk






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 10 at 20:03









DanielTheRocketMan

1,94242137




1,94242137








  • 2




    make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
    – Patrick Artner
    Nov 10 at 20:06








  • 1




    also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
    – Patrick Artner
    Nov 10 at 20:10






  • 1




    because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
    – Patrick Artner
    Nov 10 at 20:15








  • 2




    @PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
    – Jon Clements
    Nov 10 at 20:15








  • 1




    @JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
    – Patrick Artner
    Nov 10 at 20:17
















  • 2




    make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
    – Patrick Artner
    Nov 10 at 20:06








  • 1




    also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
    – Patrick Artner
    Nov 10 at 20:10






  • 1




    because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
    – Patrick Artner
    Nov 10 at 20:15








  • 2




    @PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
    – Jon Clements
    Nov 10 at 20:15








  • 1




    @JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
    – Patrick Artner
    Nov 10 at 20:17










2




2




make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
– Patrick Artner
Nov 10 at 20:06






make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
– Patrick Artner
Nov 10 at 20:06






1




1




also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
– Patrick Artner
Nov 10 at 20:10




also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
– Patrick Artner
Nov 10 at 20:10




1




1




because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15






because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15






2




2




@PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
– Jon Clements
Nov 10 at 20:15






@PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
– Jon Clements
Nov 10 at 20:15






1




1




@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17






@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17














1 Answer
1






active

oldest

votes

















up vote
1
down vote



accepted










from comments: @Patrick Artner
make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with






share|improve this answer























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














     

    draft saved


    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53242925%2ffast-removal-of-low-frequency-words-in-python%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    1
    down vote



    accepted










    from comments: @Patrick Artner
    make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with






    share|improve this answer



























      up vote
      1
      down vote



      accepted










      from comments: @Patrick Artner
      make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with






      share|improve this answer

























        up vote
        1
        down vote



        accepted







        up vote
        1
        down vote



        accepted






        from comments: @Patrick Artner
        make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with






        share|improve this answer














        from comments: @Patrick Artner
        make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with







        share|improve this answer














        share|improve this answer



        share|improve this answer








        answered Nov 10 at 20:31


























        community wiki





        Paritosh Singh































             

            draft saved


            draft discarded



















































             


            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53242925%2ffast-removal-of-low-frequency-words-in-python%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Florida Star v. B. J. F.

            Danny Elfman

            Retrieve a Users Dashboard in Tumblr with R and TumblR. Oauth Issues