Keeping archived data schema up to date with running data warehouse
recently our 5-year old MySQL data warehouse (used mostly for business reporting) has gotten quite full and we need to come up with a way to archive old data which is not frequently accessed to clear up space.
I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.
however we sometimes add/rename/delete columns in existing tables. I'd like the changes to be reflected in the old, archived data as well, but I just can't come up with a good way to do it without reprocessing the entire dataset.
is there a 'canon' way to mantain structural compatibility between a live data warehouse and its file-based archived data? I've googled relevant literature and come up with nothing.
should I just accept the fact that if I need to actively maintain schemas then the data is not really archived?
amazon-s3 data-warehouse amazon-athena
add a comment |
recently our 5-year old MySQL data warehouse (used mostly for business reporting) has gotten quite full and we need to come up with a way to archive old data which is not frequently accessed to clear up space.
I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.
however we sometimes add/rename/delete columns in existing tables. I'd like the changes to be reflected in the old, archived data as well, but I just can't come up with a good way to do it without reprocessing the entire dataset.
is there a 'canon' way to mantain structural compatibility between a live data warehouse and its file-based archived data? I've googled relevant literature and come up with nothing.
should I just accept the fact that if I need to actively maintain schemas then the data is not really archived?
amazon-s3 data-warehouse amazon-athena
"Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?
– Rick James
Dec 30 '18 at 2:00
I am running low on disk space, yes.
– OhMyGawd
Jan 2 at 17:45
add a comment |
recently our 5-year old MySQL data warehouse (used mostly for business reporting) has gotten quite full and we need to come up with a way to archive old data which is not frequently accessed to clear up space.
I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.
however we sometimes add/rename/delete columns in existing tables. I'd like the changes to be reflected in the old, archived data as well, but I just can't come up with a good way to do it without reprocessing the entire dataset.
is there a 'canon' way to mantain structural compatibility between a live data warehouse and its file-based archived data? I've googled relevant literature and come up with nothing.
should I just accept the fact that if I need to actively maintain schemas then the data is not really archived?
amazon-s3 data-warehouse amazon-athena
recently our 5-year old MySQL data warehouse (used mostly for business reporting) has gotten quite full and we need to come up with a way to archive old data which is not frequently accessed to clear up space.
I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.
however we sometimes add/rename/delete columns in existing tables. I'd like the changes to be reflected in the old, archived data as well, but I just can't come up with a good way to do it without reprocessing the entire dataset.
is there a 'canon' way to mantain structural compatibility between a live data warehouse and its file-based archived data? I've googled relevant literature and come up with nothing.
should I just accept the fact that if I need to actively maintain schemas then the data is not really archived?
amazon-s3 data-warehouse amazon-athena
amazon-s3 data-warehouse amazon-athena
asked Nov 14 '18 at 20:41
OhMyGawdOhMyGawd
1069
1069
"Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?
– Rick James
Dec 30 '18 at 2:00
I am running low on disk space, yes.
– OhMyGawd
Jan 2 at 17:45
add a comment |
"Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?
– Rick James
Dec 30 '18 at 2:00
I am running low on disk space, yes.
– OhMyGawd
Jan 2 at 17:45
"Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?
– Rick James
Dec 30 '18 at 2:00
"Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?
– Rick James
Dec 30 '18 at 2:00
I am running low on disk space, yes.
– OhMyGawd
Jan 2 at 17:45
I am running low on disk space, yes.
– OhMyGawd
Jan 2 at 17:45
add a comment |
2 Answers
2
active
oldest
votes
There are tons of materials in internet if you search the term "Schema evolution" in big data space.
The Athena documentation has a chapter on schema updates case by case example here.
If you are re-processing the whole archived dataset to handle schema change, probably you are doing a bit too much.
Since you have parquet files and by default Athena parquet resolves the column by column name rather than by index, you are safe in almost all cases i.e. add new columns, drop columns etc except column rename. TO handle renamed columns (and to handle addition/dropping of columns), the fastest way is to use view. In the view definition you can alias the renamed column. Also, if column rename is mostly the case of your schema evolution and if you are doing it a lot, you can also consider AVRO to gracefully handle that.
add a comment |
Plan A:
It's too late to do this, but PARTITIONing
is an excellent tool for getting the data out of the table.
I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?
If the table were partitioned by Year or Quarter or Month, you could
- Every period, "Export tablespace" to remove the oldest from the partition scheme.
- That tablespace will the be a separate table; you could copy/dump/whatever, then drop it.
At about the same time, you would build a new partition to receive new data.
(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)
A benefit of the method is that there is virtually zero impact on the big table during the processing.
An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON
).
Plan B:
Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...
- Don't archive the old data.
- Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53308424%2fkeeping-archived-data-schema-up-to-date-with-running-data-warehouse%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
There are tons of materials in internet if you search the term "Schema evolution" in big data space.
The Athena documentation has a chapter on schema updates case by case example here.
If you are re-processing the whole archived dataset to handle schema change, probably you are doing a bit too much.
Since you have parquet files and by default Athena parquet resolves the column by column name rather than by index, you are safe in almost all cases i.e. add new columns, drop columns etc except column rename. TO handle renamed columns (and to handle addition/dropping of columns), the fastest way is to use view. In the view definition you can alias the renamed column. Also, if column rename is mostly the case of your schema evolution and if you are doing it a lot, you can also consider AVRO to gracefully handle that.
add a comment |
There are tons of materials in internet if you search the term "Schema evolution" in big data space.
The Athena documentation has a chapter on schema updates case by case example here.
If you are re-processing the whole archived dataset to handle schema change, probably you are doing a bit too much.
Since you have parquet files and by default Athena parquet resolves the column by column name rather than by index, you are safe in almost all cases i.e. add new columns, drop columns etc except column rename. TO handle renamed columns (and to handle addition/dropping of columns), the fastest way is to use view. In the view definition you can alias the renamed column. Also, if column rename is mostly the case of your schema evolution and if you are doing it a lot, you can also consider AVRO to gracefully handle that.
add a comment |
There are tons of materials in internet if you search the term "Schema evolution" in big data space.
The Athena documentation has a chapter on schema updates case by case example here.
If you are re-processing the whole archived dataset to handle schema change, probably you are doing a bit too much.
Since you have parquet files and by default Athena parquet resolves the column by column name rather than by index, you are safe in almost all cases i.e. add new columns, drop columns etc except column rename. TO handle renamed columns (and to handle addition/dropping of columns), the fastest way is to use view. In the view definition you can alias the renamed column. Also, if column rename is mostly the case of your schema evolution and if you are doing it a lot, you can also consider AVRO to gracefully handle that.
There are tons of materials in internet if you search the term "Schema evolution" in big data space.
The Athena documentation has a chapter on schema updates case by case example here.
If you are re-processing the whole archived dataset to handle schema change, probably you are doing a bit too much.
Since you have parquet files and by default Athena parquet resolves the column by column name rather than by index, you are safe in almost all cases i.e. add new columns, drop columns etc except column rename. TO handle renamed columns (and to handle addition/dropping of columns), the fastest way is to use view. In the view definition you can alias the renamed column. Also, if column rename is mostly the case of your schema evolution and if you are doing it a lot, you can also consider AVRO to gracefully handle that.
answered Nov 23 '18 at 11:25
Tanveer UddinTanveer Uddin
61629
61629
add a comment |
add a comment |
Plan A:
It's too late to do this, but PARTITIONing
is an excellent tool for getting the data out of the table.
I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?
If the table were partitioned by Year or Quarter or Month, you could
- Every period, "Export tablespace" to remove the oldest from the partition scheme.
- That tablespace will the be a separate table; you could copy/dump/whatever, then drop it.
At about the same time, you would build a new partition to receive new data.
(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)
A benefit of the method is that there is virtually zero impact on the big table during the processing.
An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON
).
Plan B:
Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...
- Don't archive the old data.
- Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.
add a comment |
Plan A:
It's too late to do this, but PARTITIONing
is an excellent tool for getting the data out of the table.
I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?
If the table were partitioned by Year or Quarter or Month, you could
- Every period, "Export tablespace" to remove the oldest from the partition scheme.
- That tablespace will the be a separate table; you could copy/dump/whatever, then drop it.
At about the same time, you would build a new partition to receive new data.
(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)
A benefit of the method is that there is virtually zero impact on the big table during the processing.
An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON
).
Plan B:
Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...
- Don't archive the old data.
- Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.
add a comment |
Plan A:
It's too late to do this, but PARTITIONing
is an excellent tool for getting the data out of the table.
I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?
If the table were partitioned by Year or Quarter or Month, you could
- Every period, "Export tablespace" to remove the oldest from the partition scheme.
- That tablespace will the be a separate table; you could copy/dump/whatever, then drop it.
At about the same time, you would build a new partition to receive new data.
(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)
A benefit of the method is that there is virtually zero impact on the big table during the processing.
An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON
).
Plan B:
Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...
- Don't archive the old data.
- Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.
Plan A:
It's too late to do this, but PARTITIONing
is an excellent tool for getting the data out of the table.
I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?
If the table were partitioned by Year or Quarter or Month, you could
- Every period, "Export tablespace" to remove the oldest from the partition scheme.
- That tablespace will the be a separate table; you could copy/dump/whatever, then drop it.
At about the same time, you would build a new partition to receive new data.
(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)
A benefit of the method is that there is virtually zero impact on the big table during the processing.
An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON
).
Plan B:
Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...
- Don't archive the old data.
- Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.
answered Jan 2 at 19:54
Rick JamesRick James
69k561101
69k561101
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53308424%2fkeeping-archived-data-schema-up-to-date-with-running-data-warehouse%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
"Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?
– Rick James
Dec 30 '18 at 2:00
I am running low on disk space, yes.
– OhMyGawd
Jan 2 at 17:45