Fast removal of low frequency words in Python

up vote
1
down vote

favorite

In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.

doc=[w for w in doc if not w in lowFrequencyWords]

The problem is that this piece of code lasts forever.

I am note sure, but I believe that the problem is that the operation of removal of an intermediate element from a list costs O(n), where n is the size of the list. Since the number of lowFrequencyWords is giant, python has to repeat that many times. I looked for linked lists, but I believe that they are not available in Python.

asked Nov 10 at 20:03

DanielTheRocketMan

1,94242137

2

make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
– Patrick Artner
Nov 10 at 20:06

1

also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
– Patrick Artner
Nov 10 at 20:10

1

because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15

2

@PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
– Jon Clements♦
Nov 10 at 20:15

1

@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17

|
show 5 more comments

up vote
1
down vote

favorite

In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.

doc=[w for w in doc if not w in lowFrequencyWords]

The problem is that this piece of code lasts forever.

asked Nov 10 at 20:03

DanielTheRocketMan

1,94242137

2

make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
– Patrick Artner
Nov 10 at 20:06

1

also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
– Patrick Artner
Nov 10 at 20:10

1

because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15

2

@PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
– Jon Clements♦
Nov 10 at 20:15

1

@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17

|
show 5 more comments

up vote
1
down vote

favorite

In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.

doc=[w for w in doc if not w in lowFrequencyWords]

The problem is that this piece of code lasts forever.

asked Nov 10 at 20:03

DanielTheRocketMan

1,94242137

In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.

doc=[w for w in doc if not w in lowFrequencyWords]

The problem is that this piece of code lasts forever.

python-3.x nlp nltk

asked Nov 10 at 20:03

DanielTheRocketMan

1,94242137

asked Nov 10 at 20:03

DanielTheRocketMan

1,94242137

asked Nov 10 at 20:03

DanielTheRocketMan

1,94242137

asked Nov 10 at 20:03

DanielTheRocketMan

1,94242137

asked Nov 10 at 20:03

DanielTheRocketMan

1,94242137

2

make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
– Patrick Artner
Nov 10 at 20:06

1

also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
– Patrick Artner
Nov 10 at 20:10

1

because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15

2

@PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
– Jon Clements♦
Nov 10 at 20:15

1

@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17

|
show 5 more comments

2

make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
– Patrick Artner
Nov 10 at 20:06

1

also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
– Patrick Artner
Nov 10 at 20:10

1

because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15

2

@PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
– Jon Clements♦
Nov 10 at 20:15

1

@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17

make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with ...
– Patrick Artner
Nov 10 at 20:06

also using doc[:] = [w for w in doc if w not in frozenset_of_lowFreq] might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
– Patrick Artner
Nov 10 at 20:10

because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15

@PatrickArtner The doc[:] = there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a... then do a = [4, 5, 6]... you just rebind the name a to a new list and b is still referring to the original.... if you do a[:] = [4, 5, 6] then you're mutating the underlying list that a and b refer to so b will also be referring to the updated list.
– Jon Clements♦
Nov 10 at 20:15

@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17

|
show 5 more comments

1 Answer
1

active

oldest

votes

up vote
1
down vote

accepted

from comments: @Patrick Artner
make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with

answered Nov 10 at 20:31

community wiki

Paritosh Singh

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53242925%2ffast-removal-of-low-frequency-words-in-python%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
1
down vote

accepted

from comments: @Patrick Artner
make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with

answered Nov 10 at 20:31

community wiki

Paritosh Singh

add a comment |

up vote
1
down vote

accepted

from comments: @Patrick Artner
make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with

answered Nov 10 at 20:31

community wiki

Paritosh Singh

add a comment |

up vote
1
down vote

accepted

from comments: @Patrick Artner
make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with

answered Nov 10 at 20:31

community wiki

Paritosh Singh

from comments: @Patrick Artner
make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with

answered Nov 10 at 20:31

community wiki

Paritosh Singh

answered Nov 10 at 20:31

community wiki

Paritosh Singh

community wiki

Paritosh Singh

community wiki

Paritosh Singh

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

lpUoYGQsWDuyb,UH1nZqm7hBJCODmdE CjHH2 eyzrqEigRULJ,p rC 1Arff8k9PCTO,CcVlFKq,zeLyAPzW Q

搜尋此網誌

Ndtyjky