Fast removal of low frequency words in Python
up vote
1
down vote
favorite
In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.
doc=[w for w in doc if not w in lowFrequencyWords]
The problem is that this piece of code lasts forever.
I am note sure, but I believe that the problem is that the operation of removal of an intermediate element from a list costs O(n), where n is the size of the list. Since the number of lowFrequencyWords is giant, python has to repeat that many times. I looked for linked lists, but I believe that they are not available in Python.
python-3.x nlp nltk
|
show 5 more comments
up vote
1
down vote
favorite
In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.
doc=[w for w in doc if not w in lowFrequencyWords]
The problem is that this piece of code lasts forever.
I am note sure, but I believe that the problem is that the operation of removal of an intermediate element from a list costs O(n), where n is the size of the list. Since the number of lowFrequencyWords is giant, python has to repeat that many times. I looked for linked lists, but I believe that they are not available in Python.
python-3.x nlp nltk
2
makelowFrequencyWords
afrozenset(lowFrequencyWords)
to begin with ...
– Patrick Artner
Nov 10 at 20:06
1
also usingdoc[:] = [w for w in doc if w not in frozenset_of_lowFreq]
might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
– Patrick Artner
Nov 10 at 20:10
1
because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15
2
@PatrickArtner Thedoc[:] =
there is nothing to do with speed... Just mutability...a = [1, 2, 3]; b = a
... then doa = [4, 5, 6]
... you just rebind the namea
to a new list andb
is still referring to the original.... if you doa[:] = [4, 5, 6]
then you're mutating the underlying list thata
andb
refer to sob
will also be referring to the updated list.
– Jon Clements♦
Nov 10 at 20:15
1
@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17
|
show 5 more comments
up vote
1
down vote
favorite
up vote
1
down vote
favorite
In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.
doc=[w for w in doc if not w in lowFrequencyWords]
The problem is that this piece of code lasts forever.
I am note sure, but I believe that the problem is that the operation of removal of an intermediate element from a list costs O(n), where n is the size of the list. Since the number of lowFrequencyWords is giant, python has to repeat that many times. I looked for linked lists, but I believe that they are not available in Python.
python-3.x nlp nltk
In the code below, lowFrequencyWords is a list with the low frequency words and doc is a list of tokens.
doc=[w for w in doc if not w in lowFrequencyWords]
The problem is that this piece of code lasts forever.
I am note sure, but I believe that the problem is that the operation of removal of an intermediate element from a list costs O(n), where n is the size of the list. Since the number of lowFrequencyWords is giant, python has to repeat that many times. I looked for linked lists, but I believe that they are not available in Python.
python-3.x nlp nltk
python-3.x nlp nltk
asked Nov 10 at 20:03
DanielTheRocketMan
1,94242137
1,94242137
2
makelowFrequencyWords
afrozenset(lowFrequencyWords)
to begin with ...
– Patrick Artner
Nov 10 at 20:06
1
also usingdoc[:] = [w for w in doc if w not in frozenset_of_lowFreq]
might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
– Patrick Artner
Nov 10 at 20:10
1
because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15
2
@PatrickArtner Thedoc[:] =
there is nothing to do with speed... Just mutability...a = [1, 2, 3]; b = a
... then doa = [4, 5, 6]
... you just rebind the namea
to a new list andb
is still referring to the original.... if you doa[:] = [4, 5, 6]
then you're mutating the underlying list thata
andb
refer to sob
will also be referring to the updated list.
– Jon Clements♦
Nov 10 at 20:15
1
@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17
|
show 5 more comments
2
makelowFrequencyWords
afrozenset(lowFrequencyWords)
to begin with ...
– Patrick Artner
Nov 10 at 20:06
1
also usingdoc[:] = [w for w in doc if w not in frozenset_of_lowFreq]
might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...
– Patrick Artner
Nov 10 at 20:10
1
because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15
2
@PatrickArtner Thedoc[:] =
there is nothing to do with speed... Just mutability...a = [1, 2, 3]; b = a
... then doa = [4, 5, 6]
... you just rebind the namea
to a new list andb
is still referring to the original.... if you doa[:] = [4, 5, 6]
then you're mutating the underlying list thata
andb
refer to sob
will also be referring to the updated list.
– Jon Clements♦
Nov 10 at 20:15
1
@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17
2
2
make
lowFrequencyWords
a frozenset(lowFrequencyWords)
to begin with ...– Patrick Artner
Nov 10 at 20:06
make
lowFrequencyWords
a frozenset(lowFrequencyWords)
to begin with ...– Patrick Artner
Nov 10 at 20:06
1
1
also using
doc[:] = [w for w in doc if w not in frozenset_of_lowFreq]
might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...– Patrick Artner
Nov 10 at 20:10
also using
doc[:] = [w for w in doc if w not in frozenset_of_lowFreq]
might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...– Patrick Artner
Nov 10 at 20:10
1
1
because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15
because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15
2
2
@PatrickArtner The
doc[:] =
there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a
... then do a = [4, 5, 6]
... you just rebind the name a
to a new list and b
is still referring to the original.... if you do a[:] = [4, 5, 6]
then you're mutating the underlying list that a
and b
refer to so b
will also be referring to the updated list.– Jon Clements♦
Nov 10 at 20:15
@PatrickArtner The
doc[:] =
there is nothing to do with speed... Just mutability... a = [1, 2, 3]; b = a
... then do a = [4, 5, 6]
... you just rebind the name a
to a new list and b
is still referring to the original.... if you do a[:] = [4, 5, 6]
then you're mutating the underlying list that a
and b
refer to so b
will also be referring to the updated list.– Jon Clements♦
Nov 10 at 20:15
1
1
@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17
@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17
|
show 5 more comments
1 Answer
1
active
oldest
votes
up vote
1
down vote
accepted
from comments: @Patrick Artner
make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
from comments: @Patrick Artner
make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with
add a comment |
up vote
1
down vote
accepted
from comments: @Patrick Artner
make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with
add a comment |
up vote
1
down vote
accepted
up vote
1
down vote
accepted
from comments: @Patrick Artner
make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with
from comments: @Patrick Artner
make lowFrequencyWords a frozenset(lowFrequencyWords) to begin with
answered Nov 10 at 20:31
community wiki
Paritosh Singh
add a comment |
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53242925%2ffast-removal-of-low-frequency-words-in-python%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
make
lowFrequencyWords
afrozenset(lowFrequencyWords)
to begin with ...– Patrick Artner
Nov 10 at 20:06
1
also using
doc[:] = [w for w in doc if w not in frozenset_of_lowFreq]
might help - not sure - someone told me thats some kind of inplace replacing which might be faster. dont ask me, just heard it...– Patrick Artner
Nov 10 at 20:10
1
because you mostly look into the list with O(n) lookup and never find your words (they are low freq). So you essentially iterate all of lowFreqWords all the time. sets have constant lookups - thats fast compared. frozensets are just the immutable cousins of sets.
– Patrick Artner
Nov 10 at 20:15
2
@PatrickArtner The
doc[:] =
there is nothing to do with speed... Just mutability...a = [1, 2, 3]; b = a
... then doa = [4, 5, 6]
... you just rebind the namea
to a new list andb
is still referring to the original.... if you doa[:] = [4, 5, 6]
then you're mutating the underlying list thata
andb
refer to sob
will also be referring to the updated list.– Jon Clements♦
Nov 10 at 20:15
1
@JonClements thanks for explaining - it was some kind of "obscure" tip I got 15 months back when starting on pyhton or so and never understood correctly it seems. Ok, banned from the "optimize speed" category of list comps :)
– Patrick Artner
Nov 10 at 20:17