Writing partitioned dataset to HDFS/S3 with _SUCCESS file in each partition
When writing a partitioned dataset to HDFS/S3, a _SUCCESS file is written to the output directory upon successful completion. Is there a way to have a _SUCCESS file written to each partition directory as well?
  • Stack Overflow is a site for programming and development questions. This question appears to be off-topic because it is not about programming or development. See What topics can I ask about here in the Help Center. Perhaps Super User or Unix & Linux Stack Exchange would be a better place to ask.

    – jww
    May 2 '18 at 4:44











  • @jww It is a perfectly valid question and definitely not something that could be answered on Super User or Unix & Linux. The context might not be obvious at first glance, but it is clear if you consider the tags.

    – user6910411
    May 2 '18 at 16:31











  • @femibyte Why would you need that? _SUCCESS marks completion of the job, and no partition can be considered complete until the whole job is. Is there any particular use case here?

    – user6910411
    May 2 '18 at 16:33











  • I want to be able to use the _SUCCESS flag as an indicator in a Luigi workflow where the pipeline writes to a new daily S3 partition. Because the output location is partitioned, the _SUCCESS flag is created in the "folder" above rather than in the newly created partition directory itself.

    – femibyte
    May 2 '18 at 18:15











  • I'm facing this problem for a daily ETL. I need to be able to keep a record of which ETLs succeeded, even while multiple ETLs may run at the same time or out of chronological order. Would love to see an elegant solution.

    – matmat
    Nov 14 '18 at 0:28
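As a stopgap for the Luigi use case described in the comments, the per-partition marker can be polled with a plain existence check. A minimal sketch, assuming a local filesystem path and a Hive-style ds=<date> partition layout; the function name is hypothetical, and an S3 path would need boto3 or s3fs instead of pathlib:

```python
from pathlib import Path

def partition_is_complete(table_root: str, ds: str) -> bool:
    # Hypothetical check: look for the _SUCCESS marker inside a
    # Hive-style daily partition directory (table_root/ds=<date>).
    return (Path(table_root) / f"ds={ds}" / "_SUCCESS").exists()

# Example against a local scratch directory:
root = Path("/tmp/events_demo")
(root / "ds=2018-05-02").mkdir(parents=True, exist_ok=True)
(root / "ds=2018-05-02" / "_SUCCESS").touch()
print(partition_is_complete(str(root), "2018-05-02"))  # True
```

A Luigi Target's exists() method could wrap a check like this so the scheduler treats each daily partition as its own completed output.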
apache-spark pyspark hdfs
asked Apr 26 '18 at 20:09









femibyte
1 Answer
For the time being, you may be able to get your desired result by writing out files directly to path/to/table/partition_key1=foo/partition_key2=bar and not using the Parquet writer's partitionBy argument.



FWIW, I also believe that _SUCCESS files should be written out to every partition, especially given that SPARK-13207 and SPARK-20236 have been resolved.
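If the direct-write route above is taken, the marker can also be dropped into each partition by hand once the job finishes. A minimal sketch, assuming a local path and a hypothetical helper name; on S3 the touch would be an empty put_object instead:

```python
from pathlib import Path

def mark_partition_success(table_root: str, **partition_keys: str) -> Path:
    # Hypothetical helper: build table_root/key1=val1/key2=val2/_SUCCESS,
    # mirroring the Hive-style layout Spark produces, and touch it.
    partition_dir = Path(table_root).joinpath(
        *(f"{k}={v}" for k, v in partition_keys.items())
    )
    partition_dir.mkdir(parents=True, exist_ok=True)
    marker = partition_dir / "_SUCCESS"
    marker.touch()
    return marker

# After writing data to /tmp/table_demo/ds=2018-05-02 directly:
marker = mark_partition_success("/tmp/table_demo", ds="2018-05-02")
print(marker)  # /tmp/table_demo/ds=2018-05-02/_SUCCESS
```

Calling this once per partition after the write keeps downstream existence checks working even though Spark itself only writes _SUCCESS at the output root.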






  • I've filed a bug report for this.

    – matmat
    Nov 14 '18 at 1:47
answered Nov 14 '18 at 0:49









matmat