Writing partitioned dataset to HDFS/S3 with _SUCCESS file in each partition
When writing a partitioned dataset to HDFS/S3, a _SUCCESS file is written to the output directory upon successful completion. I'm curious whether there is a way to get a _SUCCESS file written to each partition directory instead.
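For context, a typical partitioned write looks something like the sketch below (the bucket, path, and column names are illustrative, not taken from the original post):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("2018-04-25", 1), ("2018-04-26", 2)],
        ["ds", "value"],
    )

    # Partitioned write: Spark creates one sub-directory per distinct `ds` value,
    # but the _SUCCESS marker is committed only once, at the root output path.
    df.write.mode("overwrite").partitionBy("ds").parquet("s3a://bucket/table")

    # Resulting layout (roughly):
    # s3a://bucket/table/_SUCCESS
    # s3a://bucket/table/ds=2018-04-25/part-....parquet
    # s3a://bucket/table/ds=2018-04-26/part-....parquet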
apache-spark pyspark hdfs
asked Apr 26 '18 at 20:09 by femibyte
Stack Overflow is a site for programming and development questions. This question appears to be off-topic because it is not about programming or development. See What topics can I ask about here in the Help Center. Perhaps Super User or Unix & Linux Stack Exchange would be a better place to ask.
– jww
May 2 '18 at 4:44
@jww It is a perfectly valid question and definitely not something that could be answered on Super User or Unix & Linux. The context might not be obvious at first glance, but it is clear if you consider the tags.
– user6910411
May 2 '18 at 16:31
@femibyte Why would you need that? _SUCCESS marks completion of the job, and no partition can be considered complete until the whole job is. Is there any particular use case here?
– user6910411
May 2 '18 at 16:33
I want to be able to use the _SUCCESS flag as an indicator in a Luigi workflow where the pipeline writes to a new daily S3 partition. Because the location is partitioned, the _SUCCESS flag is created in the "folder" above rather than in the newly created partition directory itself.
– femibyte
May 2 '18 at 18:15
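(A common workaround for this kind of use case is to have the workflow itself drop an empty marker object into the partition prefix once the job finishes, and point the Luigi target at that key. A minimal sketch with hypothetical bucket and prefix names:)

    import boto3

    def write_success_marker(bucket, partition_prefix):
        """Write an empty _SUCCESS-style marker into a specific partition prefix.

        Hypothetical helper: bucket/prefix names are placeholders, and the marker
        is written by the orchestrator after the Spark job reports success,
        not by Spark itself.
        """
        s3 = boto3.client("s3")
        key = partition_prefix.rstrip("/") + "/_SUCCESS"
        s3.put_object(Bucket=bucket, Key=key, Body=b"")

    # e.g. after the daily job finishes:
    # write_success_marker("my-bucket", "table/ds=2018-05-02")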
I'm facing this problem for a daily ETL. I need to be able to keep a record of which ETLs succeeded, even while multiple ETLs may run at the same time or out of chronological order. Would love to see an elegant solution.
– matmat
Nov 14 '18 at 0:28
1 Answer
For the time being, you may be able to get your desired result by writing out files directly to path/to/table/partition_key1=foo/partition_key2=bar and not using the Parquet writer's partitionBy argument.

FWIW, I also believe that _SUCCESS files should be written out to every partition, especially given that SPARK-13207 and SPARK-20236 have been resolved.

answered Nov 14 '18 at 0:49 by matmat
I've filed a bug report for this.
– matmat
Nov 14 '18 at 1:47
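A minimal sketch of the direct-write approach described in the answer above (the partition column and paths are hypothetical). Because each partition value is written by its own job targeting its own directory, each directory ends up with its own _SUCCESS marker:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3a://bucket/staging")  # source data (placeholder path)

    base = "s3a://bucket/table"
    dates = [r["ds"] for r in df.select("ds").distinct().collect()]

    for ds in dates:
        # One write per partition value, targeting the partition directory
        # directly instead of using partitionBy; each write commits its own
        # _SUCCESS file inside that directory.
        (df.where(F.col("ds") == ds)
           .drop("ds")
           .write.mode("overwrite")
           .parquet(base + "/ds=" + ds))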