Problems while trying to create a new id column based on three criteria?
I have a dataframe with conversations and timestamps like this:
timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob...
2018-10-05 23:07:13 01 a large text blob...
2018-10-05 23:07:23 01 a large text blob...
2018-10-05 23:07:36 01 a large text blob...
2018-10-05 23:08:02 01 a large text blob...
2018-10-05 23:09:16 01 a large text blob...
2018-10-05 23:09:21 01 a large text blob...
2018-10-05 23:09:39 01 a large text blob...
2018-10-05 23:09:47 01 a large text blob...
2018-10-05 23:10:01 01 a large text blob...
2018-10-05 23:10:11 01 a large text blob...
2018-10-05 23:10:23 01 restart
2018-10-05 23:10:59 01 a large text blob...
2018-10-05 23:11:03 01 a large text blob...
2018-10-08 23:11:32 02 a large text blob...
2018-10-08 23:12:58 02 a large text blob...
2018-10-08 23:13:16 02 a large text blob...
2018-10-08 23:14:04 02 a large text blob...
2018-10-08 03:38:36 02 a large text blob...
2018-10-08 03:38:42 02 a large text blob...
2018-10-08 03:38:52 02 a large text blob...
2018-10-08 03:38:57 02 a large text blob...
2018-10-08 03:39:10 02 a large text blob...
2018-10-08 03:39:27 02 Restart
2018-10-08 03:40:47 02 a large text blob...
2018-10-08 03:40:54 02 a large text blob...
2018-10-08 03:41:02 02 a large text blob...
2018-10-08 03:41:12 02 a large text blob...
2018-10-08 03:41:32 02 a large text blob...
2018-10-08 03:41:39 02 a large text blob...
2018-10-08 03:42:20 02 a large text blob...
2018-10-08 03:44:58 02 a large text blob...
2018-10-08 03:45:54 02 a large text blob...
2018-10-08 03:46:06 02 a large text blob...
2018-10-08 05:06:42 03 a large text blob...
2018-10-08 05:06:53 03 a large text blob...
2018-10-08 05:08:49 03 a large text blob...
2018-10-08 05:08:58 03 a large text blob...
2018-10-08 05:58:18 04 a large text blob...
2018-10-08 05:58:26 04 a large text blob...
2018-10-08 05:58:37 04 a large text blob...
2018-10-08 05:58:58 04 a large text blob...
2018-10-08 06:00:31 04 a large text blob...
2018-10-08 06:01:00 04 a large text blob...
2018-10-08 06:01:14 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:06:03 04 a large text blob...
2018-10-08 06:10:00 04 a large text blob...
2018-10-08 09:07:03 04 a large text blob...
2018-10-08 09:09:03 04 a large text blob...
2018-10-09 10:01:00 04 a large text blob...
2018-10-09 10:02:00 04 a large text blob...
2018-10-09 10:03:00 04 a large text blob...
2018-10-09 10:09:00 04 a large text blob...
2018-10-09 10:09:00 05 a large text blob...
At the moment I would like to identify the conversations inside the dataframe with an id. The problem is that a user can have several conversations (i.e. a userID can have multiple textBlobs associated). Thus, I would like to add a new_id in order to be able to identify the conversations inside the above dataframe.
For this, I would like to create a new_id column based on three criteria:
- 10-minute periods
- the occurrence of a keyword
- when a user doesn't have any more text blobs
The expected output looks like this (*):
timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob... 001
2018-10-05 23:07:13 01 a large text blob... 001
2018-10-05 23:07:23 01 a large text blob... 001
2018-10-05 23:07:36 01 a large text blob... 001
2018-10-05 23:08:02 01 a large text blob... 001
2018-10-05 23:09:16 01 a large text blob... 001
2018-10-05 23:09:21 01 a large text blob... 001
2018-10-05 23:09:39 01 a large text blob... 001
2018-10-05 23:09:47 01 a large text blob... 001
2018-10-05 23:10:01 01 a large text blob... 001
2018-10-05 23:10:11 01 a large text blob... 001
2018-10-05 23:10:23 01 restart 001 ---- (the word restart appeared, so a new id is created ↓)
2018-10-05 23:10:59 01 a large text blob... 002
2018-10-05 23:11:03 01 a large text blob... 002
2018-10-08 23:11:32 02 a large text blob... 002
2018-10-08 23:12:58 02 a large text blob... 002
2018-10-08 23:13:16 02 a large text blob... 002
2018-10-08 23:14:04 02 a large text blob... 002 ---- (the conversation ends because the 10-minute threshold was exceeded)
2018-10-08 03:38:36 02 a large text blob... 003
2018-10-08 03:38:42 02 a large text blob... 003
2018-10-08 03:38:52 02 a large text blob... 003
2018-10-08 03:38:57 02 a large text blob... 003
2018-10-08 03:39:10 02 a large text blob... 003
2018-10-08 03:39:27 02 Restart 003 ---- (the word restart appeared, so a new id is created ↓)
2018-10-08 03:40:47 02 a large text blob... 004
2018-10-08 03:40:54 02 a large text blob... 004
2018-10-08 03:41:02 02 a large text blob... 004
2018-10-08 03:41:12 02 a large text blob... 004
2018-10-08 03:41:32 02 a large text blob... 004
2018-10-08 03:41:39 02 a large text blob... 004
2018-10-08 03:42:20 02 a large text blob... 004
2018-10-08 03:44:58 02 a large text blob... 004
2018-10-08 03:45:54 02 a large text blob... 004
2018-10-08 03:46:06 02 a large text blob... 004 ---- (the 10-minute threshold is exceeded, so a new id is assigned ↓)
2018-10-08 05:06:42 03 a large text blob... 005
2018-10-08 05:06:53 03 a large text blob... 005
2018-10-08 05:08:49 03 a large text blob... 005
2018-10-08 05:08:58 03 a large text blob... 005 ---- (no more conversations from user id 03, thus a new id is assigned)
2018-10-08 05:58:18 04 a large text blob... 006
2018-10-08 05:58:26 04 a large text blob... 006
2018-10-08 05:58:37 04 a large text blob... 006
2018-10-08 05:58:58 04 a large text blob... 006
2018-10-08 06:00:31 04 a large text blob... 006
2018-10-08 06:01:00 04 a large text blob... 006
2018-10-08 06:01:14 04 a large text blob... 006
2018-10-08 06:02:03 04 a large text blob... 006 ---- (the 10-minute threshold is exceeded, so a new id is assigned ↓)
2018-10-08 06:02:03 04 a large text blob... 007
2018-10-08 06:06:03 04 a large text blob... 007
2018-10-08 06:10:00 04 a large text blob... 007
2018-10-08 09:07:03 04 a large text blob... 007
2018-10-08 09:09:03 04 a large text blob... 007 ---- (the 10-minute threshold is exceeded, so a new id is assigned ↓)
2018-10-09 10:01:00 04 a large text blob... 008
2018-10-09 10:02:00 04 a large text blob... 008
2018-10-09 10:03:00 04 a large text blob... 008
2018-10-09 10:09:00 04 a large text blob... 008 ---- (no more conversations from user id 04, thus a new id is assigned)
2018-10-09 10:09:00 05 a large text blob... 010
So far, I have tried:
searchfor = ['restart','Restart']
df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))
And
dif = df['timestamp'] - df['timestamp'].shift()
periods = dif > pd.Timedelta('10 min')
times = periods.cumsum().apply(lambda x: x+1)
df['time_id'] = times
However, I also need to consider the userID and I end up with several columns. Is there any way of fulfilling the three conditions and getting the expected output (*)?
python pandas dataframe
asked Nov 11 at 11:30, edited Nov 11 at 16:51 – tumbleweed
Should there be a new id assigned between the lines 2018-10-05 23:11:03 and 2018-10-08 03:11:32, since userID changes from 01 to 02? Also, why does the new ID jump from 005 to 007? – Peter Leimbigler, Nov 11 at 16:42
Thanks for the help @PeterLeimbigler; no, I made a mistake while generating the data. I fixed it, check again. – tumbleweed, Nov 11 at 16:48
Thanks. There is now a jump from 008 to 010, and the lines I mentioned above still don't have a userID increment. – Peter Leimbigler, Nov 11 at 17:23
2 Answers
Accepted answer – Peter Leimbigler, answered Nov 11 at 16:51, edited Nov 11 at 17:24
You're most of the way there. To put it all together, build a boolean mask for each condition, then convert the masks to int and take their cumulative sum:
mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm')      # more than 10 minutes since the previous message
mask2 = df['userID'].diff() != 0                          # the userID changed from the previous row
mask3 = df['textBlob'].shift().str.lower() == 'restart'   # the previous message was the restart keyword
df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()
# Result:
print(df.to_string(index=False))
timestamp userID textBlob new_id
2018-10-05 23:07:02 1 a_large_text_blob... 1
2018-10-05 23:07:13 1 a_large_text_blob... 1
2018-10-05 23:07:23 1 a_large_text_blob... 1
2018-10-05 23:07:36 1 a_large_text_blob... 1
2018-10-05 23:08:02 1 a_large_text_blob... 1
2018-10-05 23:09:16 1 a_large_text_blob... 1
2018-10-05 23:09:21 1 a_large_text_blob... 1
2018-10-05 23:09:39 1 a_large_text_blob... 1
2018-10-05 23:09:47 1 a_large_text_blob... 1
2018-10-05 23:10:01 1 a_large_text_blob... 1
2018-10-05 23:10:11 1 a_large_text_blob... 1
2018-10-05 23:10:23 1 restart 1
2018-10-05 23:10:59 1 a_large_text_blob... 2
2018-10-05 23:11:03 1 a_large_text_blob... 2
2018-10-08 03:11:32 2 a_large_text_blob... 3
2018-10-08 03:12:58 2 a_large_text_blob... 3
2018-10-08 03:13:16 2 a_large_text_blob... 3
2018-10-08 03:14:04 2 a_large_text_blob... 3
2018-10-08 03:38:36 2 a_large_text_blob... 4
2018-10-08 03:38:42 2 a_large_text_blob... 4
2018-10-08 03:38:52 2 a_large_text_blob... 4
2018-10-08 03:38:57 2 a_large_text_blob... 4
2018-10-08 03:39:10 2 a_large_text_blob... 4
2018-10-08 03:39:27 2 Restart 4
2018-10-08 03:40:47 2 a_large_text_blob... 5
2018-10-08 03:40:54 2 a_large_text_blob... 5
2018-10-08 03:41:02 2 a_large_text_blob... 5
2018-10-08 03:41:12 2 a_large_text_blob... 5
2018-10-08 03:41:32 2 a_large_text_blob... 5
2018-10-08 03:41:39 2 a_large_text_blob... 5
2018-10-08 03:42:20 2 a_large_text_blob... 5
2018-10-08 03:44:58 2 a_large_text_blob... 5
2018-10-08 03:45:54 2 a_large_text_blob... 5
2018-10-08 03:46:06 2 a_large_text_blob... 5
2018-10-08 05:06:42 3 a_large_text_blob... 6
2018-10-08 05:06:53 3 a_large_text_blob... 6
2018-10-08 05:08:49 3 a_large_text_blob... 6
2018-10-08 05:08:58 3 a_large_text_blob... 6
2018-10-08 05:58:18 4 a_large_text_blob... 7
2018-10-08 05:58:26 4 a_large_text_blob... 7
2018-10-08 05:58:37 4 a_large_text_blob... 7
2018-10-08 05:58:58 4 a_large_text_blob... 7
2018-10-08 06:00:31 4 a_large_text_blob... 7
2018-10-08 06:01:00 4 a_large_text_blob... 7
2018-10-08 06:01:14 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:06:03 4 a_large_text_blob... 7
2018-10-08 06:10:00 4 a_large_text_blob... 7
2018-10-08 09:07:03 4 a_large_text_blob... 8
2018-10-08 09:09:03 4 a_large_text_blob... 8
2018-10-09 10:01:00 4 a_large_text_blob... 9
2018-10-09 10:02:00 4 a_large_text_blob... 9
2018-10-09 10:03:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 5 a_large_text_blob... 10
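Two small caveats, both my own assumptions rather than part of the code above: df['userID'].diff() only works if userID is numeric, so if it is stored as a zero-padded string such as '01' you can compare against a shifted copy instead; and if you want zero-padded string ids like 001 as in the expected output, you can format the cumulative sum. A minimal sketch:
mask2 = df['userID'] != df['userID'].shift()  # works whether userID is numeric or a string like '01'
# zero-pad the conversation id to three digits: '001', '002', ...
df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum().astype(str).str.zfill(3)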
I updated the data, thanks for the help – tumbleweed, Nov 11 at 16:53
@tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything. – Peter Leimbigler, Nov 11 at 17:24
Answer – Franco Piccolo, answered Nov 11 at 17:04
OK, I thought the 10-minute period should count from the beginning of the conversation, not from the immediately preceding message; in that case you would need to iterate over the rows, like this:
df['timestamp'] = pd.to_datetime(df['timestamp'])
# flag keyword rows and user changes; each cumulative-sum segment gets a provisional id
restart = df.textBlob.str.contains('|'.join(['restart', 'Restart']))
user_change = df.userID == df.userID.shift().fillna(method='bfill')
df['new_id'] = (restart | ~user_change).cumsum()
# walk the rows, starting a new id whenever the provisional id changes or the
# conversation has been running for more than 10 minutes
current_id = 0
new_id_prev = 0
start_time = df.timestamp.iloc[0]
for i, new_id, timestamp in zip(range(len(df)), df.new_id, df.timestamp):
    timedelta = timestamp - start_time
    if new_id != new_id_prev or timedelta > pd.Timedelta(10, unit='m'):
        current_id += 1
        start_time = timestamp
    new_id_prev = new_id
    df.new_id.iloc[i] = current_id
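As a side note (my addition, not part of the answer above): assigning element by element through df.new_id.iloc[i] can raise pandas' SettingWithCopyWarning and is slow on large frames; collecting the ids in a plain list and assigning the column once avoids both. A sketch of the same loop:
ids = []
current_id = 0
new_id_prev = 0
start_time = df.timestamp.iloc[0]
for new_id, timestamp in zip(df.new_id, df.timestamp):
    if new_id != new_id_prev or timestamp - start_time > pd.Timedelta(10, unit='m'):
        current_id += 1
        start_time = timestamp
    new_id_prev = new_id
    ids.append(current_id)
df['new_id'] = ids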
Thanks for the help! – tumbleweed, Nov 11 at 17:54