Keeping archived data schema up to date with running data warehouse












Recently, our 5-year-old MySQL data warehouse (used mostly for business reporting) has gotten quite full, and we need a way to archive old, infrequently accessed data to free up space.



I created a process that dumps old data from the DW into Parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.
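
For context, the archive side is essentially just an external table over the Parquet files, roughly like this (database, table, bucket, and column names are made up for illustration):

    -- Athena external table over the archived Parquet files in S3
    CREATE EXTERNAL TABLE archive.orders (
      order_id    bigint,
      customer_id bigint,
      total       decimal(10,2),
      created_at  timestamp
    )
    STORED AS PARQUET
    LOCATION 's3://my-dw-archive/orders/';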



However, we sometimes add, rename, or delete columns in existing tables. I'd like those changes to be reflected in the old, archived data as well, but I can't come up with a good way to do it without reprocessing the entire dataset.



Is there a canonical way to maintain structural compatibility between a live data warehouse and its file-based archive? I've searched for relevant literature and come up with nothing.



Should I just accept that if I need to actively maintain schemas, then the data is not really archived?










amazon-s3 data-warehouse amazon-athena






asked Nov 14 '18 at 20:41 by OhMyGawd
  • "Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?

    – Rick James
    Dec 30 '18 at 2:00











  • I am running low on disk space, yes.

    – OhMyGawd
    Jan 2 at 17:45



















  • "Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?

    – Rick James
    Dec 30 '18 at 2:00











  • I am running low on disk space, yes.

    – OhMyGawd
    Jan 2 at 17:45

















"Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?

– Rick James
Dec 30 '18 at 2:00





"Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?

– Rick James
Dec 30 '18 at 2:00













I am running low on disk space, yes.

– OhMyGawd
Jan 2 at 17:45





I am running low on disk space, yes.

– OhMyGawd
Jan 2 at 17:45












2 Answers
2






active

oldest

votes


















1














There is plenty of material on the internet if you search for "schema evolution" in the big-data space.



The Athena documentation has a chapter on schema updates, with case-by-case examples.



If you are reprocessing the whole archived dataset to handle a schema change, you are probably doing too much work.



Since you have Parquet files, and Athena resolves Parquet columns by name rather than by index by default, you are safe in almost all cases (adding new columns, dropping columns, etc.), except for column renames. To handle renamed columns (and also added or dropped columns), the fastest way is to use a view: in the view definition you can alias the renamed column. Also, if column renames make up most of your schema evolution and you do them a lot, you could consider Avro, which handles renames gracefully.
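
As a hypothetical example, suppose the live warehouse renamed total to order_total. A view over the archive table can expose the old column under the new name without rewriting any Parquet files (all names here are illustrative):

    -- The archived files still contain a column named `total`; the view
    -- presents it under the live schema's new name `order_total`.
    CREATE OR REPLACE VIEW archive.orders_v AS
    SELECT
      order_id,
      customer_id,
      total AS order_total,
      created_at
    FROM archive.orders;

Reporting queries can then target archive.orders_v instead of the raw archive table.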






answered Nov 23 '18 at 11:25 by Tanveer Uddin














    Plan A:



    It's too late to do this, but PARTITIONing is an excellent tool for getting the data out of the table.



    I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?



    If the table were partitioned by Year or Quarter or Month, you could




    • Every period, "Export tablespace" to remove the oldest from the partition scheme.

    • That tablespace will then be a separate table; you could copy/dump/whatever, then drop it.


    At about the same time, you would build a new partition to receive new data.



    (I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)



    A benefit of the method is that there is virtually zero impact on the big table during the processing.



    An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON).
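
    To make that concrete, here is a rough sketch assuming a hypothetical orders table with a created_at DATE column (MySQL requires the partitioning column to be part of every unique key, including the primary key). It uses EXCHANGE PARTITION to detach the oldest data; transportable tablespaces, as described above, are an alternative:

        -- One-time setup; rebuilding the table this way needs enough free
        -- space for a full copy of the table.
        ALTER TABLE orders
          PARTITION BY RANGE (YEAR(created_at)) (
            PARTITION p2014 VALUES LESS THAN (2015),
            PARTITION p2015 VALUES LESS THAN (2016),
            PARTITION p2016 VALUES LESS THAN (2017),
            PARTITION p2017 VALUES LESS THAN (2018),
            PARTITION p2018 VALUES LESS THAN (2019),
            PARTITION pmax  VALUES LESS THAN MAXVALUE
          );

        -- Each period: swap the oldest partition out into a standalone table,
        -- archive that table (e.g. dump it to Parquet/S3), then drop it.
        CREATE TABLE orders_2014 LIKE orders;
        ALTER TABLE orders_2014 REMOVE PARTITIONING;  -- exchange table must not be partitioned
        ALTER TABLE orders EXCHANGE PARTITION p2014 WITH TABLE orders_2014;
        -- ... export orders_2014 to the archive, then:
        DROP TABLE orders_2014;
        ALTER TABLE orders DROP PARTITION p2014;      -- now empty; frees the tablespace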



    Plan B:



    Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...




    • Don't archive the old data.

    • Summarize the to-be-removed data into new tables (see the sketch below). Since they will be perhaps one-tenth the size, you can keep them online 'forever'.
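
    For illustration, a rollup along these lines (table and column names are made up) keeps a compact, queryable summary while the detailed rows get deleted:

        -- Roll the rows that are about to be removed up into a much smaller
        -- summary table, then delete (or archive) the detailed rows.
        CREATE TABLE orders_monthly_summary (
          month        DATE          NOT NULL,
          customer_id  BIGINT        NOT NULL,
          order_count  INT           NOT NULL,
          total_sales  DECIMAL(14,2) NOT NULL,
          PRIMARY KEY (month, customer_id)
        );

        INSERT INTO orders_monthly_summary
        SELECT DATE_FORMAT(created_at, '%Y-%m-01') AS month,
               customer_id,
               COUNT(*)   AS order_count,
               SUM(total) AS total_sales
        FROM orders
        WHERE created_at < '2015-01-01'
        GROUP BY month, customer_id;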






    answered Jan 2 at 19:54 by Rick James