Keeping archived data schema up to date with running data warehouse

recently our 5-year old MySQL data warehouse (used mostly for business reporting) has gotten quite full and we need to come up with a way to archive old data which is not frequently accessed to clear up space.

I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.

however we sometimes add/rename/delete columns in existing tables. I'd like the changes to be reflected in the old, archived data as well, but I just can't come up with a good way to do it without reprocessing the entire dataset.

is there a 'canon' way to mantain structural compatibility between a live data warehouse and its file-based archived data? I've googled relevant literature and come up with nothing.

should I just accept the fact that if I need to actively maintain schemas then the data is not really archived?

asked Nov 14 '18 at 20:41

OhMyGawd

1069

"Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?

– Rick James
Dec 30 '18 at 2:00

I am running low on disk space, yes.

– OhMyGawd
Jan 2 at 17:45

add a comment |

I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.

is there a 'canon' way to mantain structural compatibility between a live data warehouse and its file-based archived data? I've googled relevant literature and come up with nothing.

should I just accept the fact that if I need to actively maintain schemas then the data is not really archived?

asked Nov 14 '18 at 20:41

OhMyGawd

1069

"Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?

– Rick James
Dec 30 '18 at 2:00

I am running low on disk space, yes.

– OhMyGawd
Jan 2 at 17:45

add a comment |

I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.

is there a 'canon' way to mantain structural compatibility between a live data warehouse and its file-based archived data? I've googled relevant literature and come up with nothing.

should I just accept the fact that if I need to actively maintain schemas then the data is not really archived?

asked Nov 14 '18 at 20:41

OhMyGawd

1069

I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.

is there a 'canon' way to mantain structural compatibility between a live data warehouse and its file-based archived data? I've googled relevant literature and come up with nothing.

should I just accept the fact that if I need to actively maintain schemas then the data is not really archived?

amazon-s3 data-warehouse amazon-athena

asked Nov 14 '18 at 20:41

OhMyGawd

1069

asked Nov 14 '18 at 20:41

OhMyGawd

1069

asked Nov 14 '18 at 20:41

OhMyGawd

1069

asked Nov 14 '18 at 20:41

OhMyGawd

1069

asked Nov 14 '18 at 20:41

OhMyGawd

1069

"Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?

– Rick James
Dec 30 '18 at 2:00

I am running low on disk space, yes.

– OhMyGawd
Jan 2 at 17:45

add a comment |

"Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?

– Rick James
Dec 30 '18 at 2:00

I am running low on disk space, yes.

– OhMyGawd
Jan 2 at 17:45

"Clear up space" -- Meaning you are running low on disk space? Or you are worried about performance?

– Rick James
Dec 30 '18 at 2:00

I am running low on disk space, yes.

– OhMyGawd
Jan 2 at 17:45

add a comment |

2 Answers
2

active

oldest

votes

There are tons of materials in internet if you search the term "Schema evolution" in big data space.

The Athena documentation has a chapter on schema updates case by case example here.

If you are re-processing the whole archived dataset to handle schema change, probably you are doing a bit too much.

Since you have parquet files and by default Athena parquet resolves the column by column name rather than by index, you are safe in almost all cases i.e. add new columns, drop columns etc except column rename. TO handle renamed columns (and to handle addition/dropping of columns), the fastest way is to use view. In the view definition you can alias the renamed column. Also, if column rename is mostly the case of your schema evolution and if you are doing it a lot, you can also consider AVRO to gracefully handle that.

answered Nov 23 '18 at 11:25

Tanveer Uddin

61629

add a comment |

Plan A:

It's too late to do this, but PARTITIONing is an excellent tool for getting the data out of the table.

I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?

If the table were partitioned by Year or Quarter or Month, you could

Every period, "Export tablespace" to remove the oldest from the partition scheme.

That tablespace will the be a separate table; you could copy/dump/whatever, then drop it.

At about the same time, you would build a new partition to receive new data.

(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)

A benefit of the method is that there is virtually zero impact on the big table during the processing.

An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON).

Plan B:

Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...

Don't archive the old data.

Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.

answered Jan 2 at 19:54

Rick James

69k561101

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53308424%2fkeeping-archived-data-schema-up-to-date-with-running-data-warehouse%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

There are tons of materials in internet if you search the term "Schema evolution" in big data space.

The Athena documentation has a chapter on schema updates case by case example here.

If you are re-processing the whole archived dataset to handle schema change, probably you are doing a bit too much.

answered Nov 23 '18 at 11:25

Tanveer Uddin

61629

add a comment |

There are tons of materials in internet if you search the term "Schema evolution" in big data space.

The Athena documentation has a chapter on schema updates case by case example here.

If you are re-processing the whole archived dataset to handle schema change, probably you are doing a bit too much.

answered Nov 23 '18 at 11:25

Tanveer Uddin

61629

add a comment |

There are tons of materials in internet if you search the term "Schema evolution" in big data space.

The Athena documentation has a chapter on schema updates case by case example here.

If you are re-processing the whole archived dataset to handle schema change, probably you are doing a bit too much.

answered Nov 23 '18 at 11:25

Tanveer Uddin

61629

There are tons of materials in internet if you search the term "Schema evolution" in big data space.

The Athena documentation has a chapter on schema updates case by case example here.

If you are re-processing the whole archived dataset to handle schema change, probably you are doing a bit too much.

answered Nov 23 '18 at 11:25

Tanveer Uddin

61629

answered Nov 23 '18 at 11:25

Tanveer Uddin

61629

answered Nov 23 '18 at 11:25

Tanveer Uddin

61629

answered Nov 23 '18 at 11:25

Tanveer Uddin

61629

add a comment |

Plan A:

It's too late to do this, but PARTITIONing is an excellent tool for getting the data out of the table.

I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?

If the table were partitioned by Year or Quarter or Month, you could

Every period, "Export tablespace" to remove the oldest from the partition scheme.

That tablespace will the be a separate table; you could copy/dump/whatever, then drop it.

At about the same time, you would build a new partition to receive new data.

(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)

A benefit of the method is that there is virtually zero impact on the big table during the processing.

An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON).

Plan B:

Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...

Don't archive the old data.

Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.

answered Jan 2 at 19:54

Rick James

69k561101

add a comment |

Plan A:

It's too late to do this, but PARTITIONing is an excellent tool for getting the data out of the table.

I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?

If the table were partitioned by Year or Quarter or Month, you could

Every period, "Export tablespace" to remove the oldest from the partition scheme.

That tablespace will the be a separate table; you could copy/dump/whatever, then drop it.

At about the same time, you would build a new partition to receive new data.

(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)

A benefit of the method is that there is virtually zero impact on the big table during the processing.

An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON).

Plan B:

Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...

Don't archive the old data.

Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.

answered Jan 2 at 19:54

Rick James

69k561101

add a comment |

Plan A:

It's too late to do this, but PARTITIONing is an excellent tool for getting the data out of the table.

I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?

If the table were partitioned by Year or Quarter or Month, you could

Every period, "Export tablespace" to remove the oldest from the partition scheme.

That tablespace will the be a separate table; you could copy/dump/whatever, then drop it.

At about the same time, you would build a new partition to receive new data.

(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)

A benefit of the method is that there is virtually zero impact on the big table during the processing.

An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON).

Plan B:

Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...

Don't archive the old data.

Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.

answered Jan 2 at 19:54

Rick James

69k561101

Plan A:

It's too late to do this, but PARTITIONing is an excellent tool for getting the data out of the table.

I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?

If the table were partitioned by Year or Quarter or Month, you could

Every period, "Export tablespace" to remove the oldest from the partition scheme.

That tablespace will the be a separate table; you could copy/dump/whatever, then drop it.

At about the same time, you would build a new partition to receive new data.

(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)

A benefit of the method is that there is virtually zero impact on the big table during the processing.

An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON).

Plan B:

Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...

Don't archive the old data.

Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.

answered Jan 2 at 19:54

Rick James

69k561101

answered Jan 2 at 19:54

Rick James

69k561101

answered Jan 2 at 19:54

Rick James

69k561101

answered Jan 2 at 19:54

Rick James

69k561101

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

Bptgfag3GWlCk9vJw8Gr1UHlRmeUJ9VV 2o,c CLdF,C4CPwls9 hOO,S5Z0 fPr

搜尋此網誌

Ndtyjky