Pandas DataFrame: amount of same values in different columns in sequence for each row

Let's assume I have a pandas DataFrame in Python which shows the name of the business unit leader for different units over time. It could look something like this:

      Leader_Jan Leader_Feb Leader_Mar Leader_Apr
Unit1       Nina       Nina       Nina       Nina
Unit2       Lena       Lena        NaN       Lena
Unit3       Alex      Maria       Alex       Alex
Unit4     Emilia        NaN        NaN        NaN
Unit5        NaN    Corinna      Petra        NaN


It can be recreated like this:

import pandas as pd
import numpy as np

# One list per unit; np.nan marks a month without a leader.
# (The np.NaN alias was removed in NumPy 2.0, so np.nan is used here.)
a = ['Nina', 'Nina', 'Nina', 'Nina']
b = ['Lena', 'Lena', np.nan, 'Lena']
c = ['Alex', 'Maria', 'Alex', 'Alex']
d = ['Emilia', np.nan, np.nan, np.nan]
e = [np.nan, 'Corinna', 'Petra', np.nan]
data = pd.DataFrame(
    data=[a, b, c, d, e],
    columns=['Leader_Jan', 'Leader_Feb', 'Leader_Mar', 'Leader_Apr'],
    index=['Unit1', 'Unit2', 'Unit3', 'Unit4', 'Unit5'],
)


Context: I would like to find out in which units leaders stay for very short or very long periods (in months) to later find out whether there are team conflicts in specific units of my company.



I would like to add to the DataFrame, per unit, the minimum and maximum length (in months) of a leader's uninterrupted stay. Because of possible interruptions (see Unit2 and Unit3), I cannot simply use value_counts on the names in each row. Instead, I need the lengths of the runs of identical leader names, where a run ends at a NaN value or at a different name. The colors in this image show what I consider a sequence:

[image: sequences_colored]

As you can see, an interruption as in Unit2 and Unit3 should result in multiple lengths of stay. NaN months should not be counted towards any sequence.



The result should look like:

      Leader_Jan Leader_Feb Leader_Mar Leader_Apr  Min_length_of_stay_leaders  Max_length_of_stay_leaders
Unit1       Nina       Nina       Nina       Nina                           4                           4
Unit2       Lena       Lena        NaN       Lena                           1                           2
Unit3       Alex      Maria       Alex       Alex                           1                           2
Unit4     Emilia        NaN        NaN        NaN                           1                           1
Unit5        NaN    Corinna      Petra        NaN                           1                           1


I know this might be rather complicated to follow, but I would love any kind of help/tips, etc. because I am a little lost here.

  • Please consider accepting an answer if it helped you

    – RunOrVeith
    Nov 15 '18 at 23:04

python pandas dataframe

asked Nov 13 '18 at 11:25 by monart
edited Nov 16 '18 at 10:33 by monart

2 Answers

This is actually quite easy using itertools.groupby:

from itertools import groupby

def min_max_durations(row):
    # the group object consumes the iterator, but we don't care about the values
    # so we just sum "1" to get the length.
    # Taken from https://stackoverflow.com/questions/44490079/how-to-turn-an-itertools-grouper-object-into-a-list
    durations = [sum(1 for _ in group) for key, group in groupby(row) if not isinstance(key, float)]
    return min(durations), max(durations)

data["min_lengths_of_stay"], data["max_lengths_of_stay"] = zip(*data.apply(min_max_durations, axis=1))


The isinstance check for float is just a quick way to exclude NaN values from the count; you can make this check as elaborate as you need.
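
To see what groupby yields for a single row, here is a minimal sketch. The helper name run_lengths and the use of pd.isna in place of the float check are assumptions for illustration, not part of the answer's code. It returns the raw list of run lengths, so an all-NaN row gives an empty list (on which min() and max() would raise ValueError).

```python
from itertools import groupby

import numpy as np
import pandas as pd


def run_lengths(row):
    """Return the lengths of runs of identical non-missing values in row."""
    # groupby starts a new group whenever the value changes; groups whose
    # key is a missing value are dropped.
    return [
        sum(1 for _ in group)
        for key, group in groupby(row)
        if not pd.isna(key)  # assumption: explicit missing-value test
    ]


print(run_lengths(["Lena", "Lena", np.nan, "Lena"]))   # [2, 1]
print(run_lengths(["Alex", "Maria", "Alex", "Alex"]))  # [1, 1, 2]
print(run_lengths([np.nan, np.nan]))                   # []
```

Taking min() and max() of these lists gives the two requested columns; a guard for the empty list is only needed if a unit can have no leader at all.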



This outputs the correct results. Note that your reproduction code, when copied and pasted, has three "Alex" entries in Unit3, unlike your example table:



      Leader_Jan Leader_Feb Leader_Mar Leader_Apr  min_lengths_of_stay  max_lengths_of_stay
Unit1       Nina       Nina       Nina       Nina                    4                    4
Unit2       Lena       Lena        NaN       Lena                    1                    2
Unit3      Maria       Alex       Alex       Alex                    1                    3
Unit4     Emilia        NaN        NaN        NaN                    1                    1
Unit5        NaN    Corinna      Petra        NaN                    1                    1

  • Thanks, you are right about the reproduction code vs. the picture, I have corrected it!

    – monart
    Nov 16 '18 at 10:37

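This should get you started:

# df is the DataFrame from the question (called data there).
# Running count within each run of consecutive identical values in a row.
temp = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1, axis=1)

mins = temp.min(axis=1)
maxs = temp.max(axis=1)
# Rows whose counts are strictly increasing consist of a single run.
mask = temp.apply(lambda x: x.is_monotonic_increasing and x.is_unique, axis=1)
mins.loc[mask] = maxs.loc[mask]
mins.name = 'Min_length_of_stay_leaders'
maxs.name = 'Max_length_of_stay_leaders'

df.join(mins).join(maxs)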


Output:

      Leader_Jan Leader_Feb Leader_Mar Leader_Apr  Min_length_of_stay_leaders  Max_length_of_stay_leaders
Unit1       Nina       Nina       Nina       Nina                           4                           4
Unit2       Lena       Lena        NaN       Lena                           1                           2
Unit3       Alex      Maria       Alex       Alex                           1                           2
Unit4     Emilia        NaN        NaN        NaN                           1                           1
Unit5        NaN    Corinna      Petra        NaN                           1                           1


Explanation



temp = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount()+1, axis=1)


This gives you a running count within each run of consecutive identical names (each NaN cell counts as a run of its own):

       Leader_Jan  Leader_Feb  Leader_Mar  Leader_Apr
Unit1           1           2           3           4
Unit2           1           2           1           1
Unit3           1           1           1           2
Unit4           1           1           1           1
Unit5           1           1           1           1
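
To make the one-liner easier to follow, here is a sketch that splits it into its intermediate steps for a single row (Unit2). The variable names starts, run_id and running_count are mine and only serve the illustration.

```python
import numpy as np
import pandas as pd

# One row of the example data (Unit2), as a Series indexed by month.
row = pd.Series(
    ["Lena", "Lena", np.nan, "Lena"],
    index=["Leader_Jan", "Leader_Feb", "Leader_Mar", "Leader_Apr"],
)

# True wherever the value differs from the previous month. A comparison
# with NaN counts as "different", so a NaN always starts a new run.
starts = row != row.shift()

# The cumulative sum of the run starts gives every run its own id.
run_id = starts.cumsum()

# Position inside each run, counted from 1.
running_count = row.groupby(run_id).cumcount() + 1

print(starts.tolist())         # [True, False, True, True]
print(run_id.tolist())         # [1, 1, 2, 3]
print(running_count.tolist())  # [1, 2, 1, 1]
```

The last line matches the Unit2 row of the table above.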


Then simply take the row-wise min and max:

mins = temp.min(axis=1)
maxs = temp.max(axis=1)


The problem then comes for Nina: she held the post for the whole period, so the min needs to be 4 too, but the running count always starts at 1.

So just for that edge case, the mask object detects rows whose counts are strictly increasing (monotonically increasing and unique), i.e. rows that consist of a single run, and replaces the min with the max in those rows.
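
A small sketch of what the mask test does on individual rows of running counts. The helper name is_single_run is an assumption for illustration; the two Series properties are the ones used in the code above.

```python
import pandas as pd


def is_single_run(counts):
    """True if the running counts never reset, i.e. the row is one run."""
    # is_monotonic_increasing alone also accepts 1, 1, 1, 1 (non-decreasing),
    # so is_unique is needed to require a strict increase.
    return bool(counts.is_monotonic_increasing and counts.is_unique)


print(is_single_run(pd.Series([1, 2, 3, 4])))  # True  (Unit1)
print(is_single_run(pd.Series([1, 2, 1, 1])))  # False (Unit2)
print(is_single_run(pd.Series([1, 1, 1, 1])))  # False (Unit4)
```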



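I am still not sure whether it will work for all cases, so please check. In particular, the row-wise min of a running count is 1 for every row that is not a single run, so a row such as A, A, B, B (where the shortest stay is 2) still gets a min of 1, and NaN cells are counted as runs of length 1.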
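
One way to check other cases is to compute the actual length of every run instead of the running count. The following is a sketch under my own assumptions, not the answer's method: the function name min_max_run_lengths, the "UnitX" row and the choice to return NaN for a row without any leader are all hypothetical.

```python
import numpy as np
import pandas as pd


def min_max_run_lengths(row):
    """Return the (min, max) run length of non-missing values in one row."""
    # Label runs the same way as above: a new id wherever the value changes.
    run_id = (row != row.shift()).cumsum()
    # count() ignores NaN, so a run of missing values has length 0.
    lengths = row.groupby(run_id).count()
    lengths = lengths[lengths > 0]
    if lengths.empty:  # assumption: report NaN for a row without any leader
        return pd.Series({"min": np.nan, "max": np.nan})
    return pd.Series({"min": lengths.min(), "max": lengths.max()})


df = pd.DataFrame(
    [
        ["Nina", "Nina", "Nina", "Nina"],
        ["Lena", "Lena", np.nan, "Lena"],
        ["Ann", "Ann", "Bob", "Bob"],  # hypothetical row: two runs of 2
    ],
    columns=["Leader_Jan", "Leader_Feb", "Leader_Mar", "Leader_Apr"],
    index=["Unit1", "Unit2", "UnitX"],
)

result = df.apply(min_max_run_lengths, axis=1)
print(result.loc["Unit1"].tolist())  # [4, 4]
print(result.loc["Unit2"].tolist())  # [1, 2]
print(result.loc["UnitX"].tolist())  # [2, 2]
```

Because the lengths are computed per run, no special case for a single uninterrupted run is needed here.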






First answer (itertools.groupby): answered Nov 13 '18 at 13:30 by RunOrVeith, edited Nov 13 '18 at 13:36
Second answer (shift and cumsum): answered Nov 13 '18 at 13:24 by Vivek Kalyanarangan