R Create new data frame for each unique id












I have created a feature data frame (data.frame) with the columns id, feat1, feat2, feat3 and boolean. The ids in this data frame are duplicated on purpose. As I iterate over this data frame, I want to build a new data frame per id.

For simplicity, let's assume I have the following three columns.

       X1         X2 X3
1  000000001 -1.4061361  1
2  000000001 -0.1973846  1
3  000000002 -0.4385071  1
4  000000001 -0.6593677  0
5  000000001 -1.2592415  0
6  000000001 -0.5463655  1
7  000000002  0.4231117  0
8  000000002 -0.1640883  1
9  000000002  0.7157506  0
10 000000002  2.3234110  1

I want to build a separate data frame for each value of X1; that is, I want all rows that share the same X1 to end up in their own data frame. I wrote this using multiple for loops, but it takes a very long time because this is a large data set. What is the best way to do this?







r dataframe






asked Sep 6 '13 at 0:06 by Null-Hypothesis








  • Use split with X1.
    – Tyler Rinker, Sep 6 '13 at 0:23

  • Note that creating all these copies will at least double your memory usage. So if you plan to do some analysis on each chunk and save only a small set of summary results, check out the function by().
    – Ferdinand.kraft, Sep 6 '13 at 0:44

  • @Ferdinand.kraft Yes, I plan on doing analysis. In fact, the reason I am doing this is that I want to run randomForest on each subset, so I was actually worried about the memory consumption. How do you suggest I use by() in this case?
    – Null-Hypothesis, Sep 6 '13 at 18:15

  • @find-missing-semicolon Sorry, I don't use randomForest... But by() accepts any function that works on a data frame chunk and returns summarized data.
    – Ferdinand.kraft, Sep 6 '13 at 20:48
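
The by() suggestion in the comments could look roughly like the sketch below. It assumes the question's sample data is stored in a data frame named mydf with X1 kept as character. The helper summarise_chunk is hypothetical and only stands in for whatever per-chunk analysis is actually needed.

```r
# Sample data from the question; X1 is kept as character to preserve leading zeros.
mydf <- data.frame(
  X1 = c("000000001", "000000001", "000000002", "000000001", "000000001",
         "000000001", "000000002", "000000002", "000000002", "000000002"),
  X2 = c(-1.4061361, -0.1973846, -0.4385071, -0.6593677, -1.2592415,
         -0.5463655, 0.4231117, -0.1640883, 0.7157506, 2.3234110),
  X3 = c(1, 1, 1, 0, 0, 1, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical per-chunk summary: number of rows and mean of X2.
summarise_chunk <- function(chunk) {
  c(n = nrow(chunk), mean_X2 = mean(chunk$X2))
}

# by() calls summarise_chunk once per unique X1 and returns one result per id.
res <- by(mydf, mydf$X1, summarise_chunk)

# Stack the per-id summaries into a matrix with one row per id.
summary_mat <- do.call(rbind, res)
summary_mat
```

Only the small summaries are kept in res, so the per-id subsets never have to be stored as separate long-lived objects.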















3 Answers
As suggested in the comments, use split. If you really want to have new objects created, use split in conjunction with list2env as follows:

    ## What is in the workspace presently?
    ls()
    # [1] "mydf"

    ## This is where most R users would probably stop
    split(mydf, mydf$X1)
    # $`1`
    #   X1         X2 X3
    # 1  1 -1.4061361  1
    # 2  1 -0.1973846  1
    # 4  1 -0.6593677  0
    # 5  1 -1.2592415  0
    # 6  1 -0.5463655  1
    #
    # $`2`
    #    X1         X2 X3
    # 3   2 -0.4385071  1
    # 7   2  0.4231117  0
    # 8   2 -0.1640883  1
    # 9   2  0.7157506  0
    # 10  2  2.3234110  1

The above command creates a list, which is a very convenient format to have if you are going to be doing similar calculations on each list item. Most R users would stop there. If you really need separate objects in your workspace, use list2env:

    list2env(split(mydf, mydf$X1), envir = .GlobalEnv)
    # <environment: R_GlobalEnv>

    ## How many objects do we have now?
    ls()
    # [1] "1"    "2"    "mydf"

Note that these names are not syntactically valid, so you need to use backticks (`) to access them (or, alternatively, get("1")).

    `1`
    #   X1         X2 X3
    # 1  1 -1.4061361  1
    # 2  1 -0.1973846  1
    # 4  1 -0.6593677  0
    # 5  1 -1.2592415  0
    # 6  1 -0.5463655  1
    `2`
    #    X1         X2 X3
    # 3   2 -0.4385071  1
    # 7   2  0.4231117  0
    # 8   2 -0.1640883  1
    # 9   2  0.7157506  0
    # 10  2  2.3234110  1
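
To illustrate why the list form is convenient, here is a minimal sketch that runs the same calculation on every piece with sapply. It also shows one possible way around the invalid object names: prefixing them before calling list2env. The names pieces, named and env, and the "id_" prefix, are illustrative assumptions rather than part of the approach above. The sketch also writes into a fresh environment instead of .GlobalEnv so that the workspace stays untouched.

```r
# Sample data from the question, with X1 as plain numbers as in the output above.
mydf <- data.frame(
  X1 = c(1, 1, 2, 1, 1, 1, 2, 2, 2, 2),
  X2 = c(-1.4061361, -0.1973846, -0.4385071, -0.6593677, -1.2592415,
         -0.5463655, 0.4231117, -0.1640883, 0.7157506, 2.3234110),
  X3 = c(1, 1, 1, 0, 0, 1, 0, 1, 0, 1)
)

# One data frame per unique X1, collected in a named list.
pieces <- split(mydf, mydf$X1)

# The same calculation applied to every list item.
rows_per_id <- sapply(pieces, nrow)
share_X3 <- sapply(pieces, function(d) mean(d$X3))

# Prefix the names so the resulting objects are syntactically valid,
# then create them in a separate environment rather than the workspace.
named <- setNames(pieces, paste0("id_", names(pieces)))
env <- list2env(named, envir = new.env())
ls(env)
```

With the prefix in place, the objects can be reached as env$id_1 and env$id_2 without backticks.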






answered Sep 6 '13 at 2:22 by A5C1D2H2I1M1N2O1R2T1

























This uses one for loop. Better?

    ids <- unique(df$X1)

    for (i in seq_along(ids)) {
      id <- ids[i]
      # rows that belong to the current id
      mini.df <- df[df$X1 == id, ]
      # creates mini.df.1, mini.df.2, ... in the current environment
      assign(paste("mini.df", i, sep = "."), mini.df)
      # or alternatively, if you wanted the data.frames to be assigned by id,
      # assign(as.character(id), mini.df)
    }






answered Sep 6 '13 at 0:35 by Hillary Sanders
edited Sep 6 '13 at 0:48 by Ferdinand.kraft























It sounds like you want to be able to fit models to each subset of data (and likely extract summaries of the models). You can use broom, dplyr, purrr and tidyr to do this functionally. Here's an example:

    library(broom)
    library(dplyr)
    library(purrr)
    library(tidyr)

    mtcars %>%
      group_by(cyl) %>%
      # one row per cyl, with that group's rows stored in the list column `data`
      nest() %>%
      # fit one lm per group, then tidy each fit into a coefficient table
      mutate(model = map(data, lm, formula = mpg ~ disp + hp),
             results = map(model, tidy)) %>%
      unnest(results)
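
For comparison, the same per-group model fitting can be sketched in base R with split and lapply, without any extra packages. This is an illustrative alternative, not part of the tidyverse approach above, and the names fits and coefs are assumptions made for the example.

```r
# Fit the same model separately for each value of cyl, using base R only.
fits <- lapply(split(mtcars, mtcars$cyl), function(d) {
  lm(mpg ~ disp + hp, data = d)
})

# Collect the coefficients into a matrix with one row per cyl group.
coefs <- t(sapply(fits, coef))
coefs
```

Each element of fits is a regular lm object, so summary(fits[["4"]]) or any other model function can be applied to an individual group.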






answered Nov 15 '18 at 20:41 by dmca





























