Nested parallel processing with conditional logic error


























This one is a bit complicated, so I don't think it would be worthwhile to share the exact code I am working with, but I should be able to get the point across fairly well using pseudocode:



A little background:
Essentially I am trying to do parallel computing on a nested loop of operations.
I have two large functions: the first one needs to run and return TRUE in order for the second one to run, and if the second function runs, it needs to loop through several iterations.
This is a nested loop because I need to run the entire operation above several times, for various scenarios.
The pseudocode I am trying to use is below:



Output <- foreach(1 to I, .packages=packages, .combine=rbind) %:%
  Run the first function
  If the first function is false:
    Print and record
  Else:
    foreach(1 to J, .packages=packages, .combine=rbind) %dopar% {
      Run the second function
      Create df summarizing each loop of the second function
    }


Here is a simplified version of what I am trying to do and the error I am running into:



library(doParallel)
library(foreach)

func1 <- function(int1){
  results <- list(int1, TRUE)
  return(results)
}
func2 <- function(int2){
  return(int1/int2)
}

int1list <- seq(1,10)
int2list <- seq(1,15)

out <- foreach(i=1:length(int1list), .combine=rbind) %:%
  out1 <- func1(i)
  if(out1[[2]]==FALSE){
    print("fail")
    next
  } else{
    foreach(j=1:length(int2), .combine=rbind) %dopar% {
      int3 <- func2(j)
      data.frame("Scenario"=i, "Result"=int3)
    }
  }


Error: Error in func1(i) : object 'i' not found



When I run the above, it essentially tells me that it can't even find the object `i`, which I assume is happening because I am referencing `i` outside of the innermost loop. I have been able to get nested parallelized loops to work before, but I did not have anything that needed to run outside of the innermost loop, so I am assuming it is an issue with the package not knowing the order in which to perform things.



I have a workaround where I can just run the first function in parallel and then run the second function in parallel based on the results of the first loop (essentially two separate loops instead of a nested loop), but I was wondering if there was a way to get something like the nested loop to work because I think it would be more efficient. When run in production this code will likely take hours to run, so saving some time would be worthwhile.
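For reference, the two-separate-loops workaround described above can be sketched roughly like this (a minimal sketch; `runScenario` and `runIteration` are hypothetical stand-ins for the two real functions, which are not shown in the question):

```r
library(doParallel)
library(foreach)

cl <- makeCluster(2)
registerDoParallel(cl)

# Hypothetical stand-ins for the two real functions
runScenario  <- function(i) list(value = i, ok = i > 1)
runIteration <- function(v, j) v / j

# Phase 1: run the first function for every scenario in parallel
phase1 <- foreach(i = 1:3) %dopar% runScenario(i)

# Phase 2: run the second function only for scenarios that returned TRUE
phase2 <- foreach(i = which(sapply(phase1, `[[`, "ok")), .combine = rbind) %:%
  foreach(j = 1:5, .combine = rbind) %dopar% {
    data.frame(Scenario = i, Result = runIteration(phase1[[i]]$value, j))
  }

stopCluster(cl)
registerDoSEQ()
```

The barrier between the two phases is what makes this simpler than a true nested loop: by the time phase 2 starts, every scenario's pass/fail flag is already known.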


































  • Where does the conditional logic *error* part come into the picture? Also, parallelization inside already-parallelized code will most likely end up slowing down the whole code (the split and merge operations become very costly).
    – abhiieor
    Nov 12 '18 at 17:07












  • The error is coming into play when the first function is run, since "i" is part of the function call.
    – actuary_meets_data
    Nov 12 '18 at 17:10










  • The pseudocode may not be enough, and it's hard to address an R error when we don't have R code. I suspect this pseudocode is based heavily on actual code, so I suggest: come up with two trivial 1-2 line functions (in place of your more complex funcs) and a reproducible question including where I would be coming from. If this is based on subsetting a large dataset of some sort, well, it might help to give a sample (similarly structured or a sample from the actual data) as well.
    – r2evans
    Nov 12 '18 at 17:11












  • What would you recommend as the most algorithmically efficient way to perform the above? The first function needs to run and succeed before the second function can run, the inner loop will likely need to loop about 15-25 times. The outer loop will likely be looping anywhere between 10 and 500 times.
    – actuary_meets_data
    Nov 12 '18 at 17:12










    That is a good recommendation r2evans, I will edit with an update.
    – actuary_meets_data
    Nov 12 '18 at 17:12
















r parallel-processing







edited Nov 12 '18 at 17:23

























asked Nov 12 '18 at 17:02









actuary_meets_data

























2 Answers






































I'm not a pro at foreach, but there are a few things to this that stand out:





  • func2 references both int1 and int2 but it is only given the latter; this might be an artifact of your simplified example, maybe not?


  • your code here needs to be enclosed in a curly block, i.e., you need to change from



    out <- foreach(i=1:length(int1list),.combine=rbind) %:%
    out1 <- func1(i)
    if(out1[[2]]==FALSE) ...


    to



    out <- foreach(i=1:length(int1list),.combine=rbind) %:% {
    out1 <- func1(i)
    if(out1[[2]]==FALSE) ...
    }


  • the docs for foreach suggest that the binary operator %:% is a nesting operator that is used between two foreach calls, but you aren't doing that. I think I get it to work correctly with %do% (or %dopar%)

  • I don't think prints work well inside parallel foreach loops ... it might work fine on the master node but not on all others, ref: How can I print when using %dopar%

  • possibly again due to the simplified example, you define but don't actually use the contents of int1list (just its length); I'll remedy that in this example


  • next works in "normal" R loops, not in these specialized foreach loops; it isn't a problem, though, since your if/else structure provides the same effect
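To illustrate the third bullet: per the foreach docs, %:% must sit directly between two foreach calls, with the body attached to the innermost loop and no statements in between. A minimal correct use (just a sketch of the operator, not the fix for the question's code) looks like:

```r
library(doParallel)
library(foreach)

cl <- makeCluster(2)
registerDoParallel(cl)

# %:% merges the two foreach specifications into one stream of (i, j) tasks;
# no statements may appear between the two foreach calls.
grid <- foreach(i = 1:2, .combine = rbind) %:%
  foreach(j = 1:3, .combine = rbind) %dopar% {
    data.frame(i = i, j = j, product = i * j)
  }

stopCluster(cl)
registerDoSEQ()

grid  # 6 rows: one per (i, j) combination
```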


Here's your example, modified slightly to account for all of the above. I add a UsedJ column to indicate whether the inner loop was used:



library(doParallel)
library(foreach)

func1 <- function(int1){
  results <- list(int1, int1 > 2)
  return(results)
}
func2 <- function(int1, int2){
  return(int1/int2)
}

int1list <- seq(1,3)
int2list <- seq(1,5)

out <- foreach(i=1:length(int1list), .combine=rbind) %do% {
  out1 <- func1(int1list[i])
  if(!out1[[2]]){
    data.frame("Scenario"=i, "Result"=out1[[1]], UsedJ=FALSE)
    # next
  } else{
    foreach(j=1:length(int2list), .combine=rbind) %dopar% {
      int3 <- func2(out1[[1]], int2list[j])
      data.frame("Scenario"=i, "Result"=int3, UsedJ=TRUE)
    }
  }
}
out
#   Scenario Result UsedJ
# 1        1   1.00 FALSE
# 2        2   2.00 FALSE
# 3        3   3.00  TRUE
# 4        3   1.50  TRUE
# 5        3   1.00  TRUE
# 6        3   0.75  TRUE
# 7        3   0.60  TRUE



Edit



If you aren't seeing parallelization, perhaps it's because you have not set up a "cluster" yet. There are also a few other changes to the work flow to get it to parallelize well, based on foreach's method of nesting loops with the %:% operator.



In order to "prove" this is working in parallel, I've added some logging based on How can I print when using %dopar% (because parallel processes do not print as one might hope).



library(doParallel)
library(foreach)

Log <- function(text, ..., .port = 4000, .sock = make.socket(port = .port)) {
  msg <- sprintf(paste0(as.character(Sys.time()), ": ", text, "\n"), ...)
  write.socket(.sock, msg)
  close.socket(.sock)
}
func1 <- function(int1) {
  Log(paste("func1", int1))
  Sys.sleep(5)
  results <- list(int1, int1 > 2)
  return(results)
}
func2 <- function(int1, int2) {
  Log(paste("func2", int1, int2))
  Sys.sleep(1)
  return(int1 / int2)
}


The use of the logging code requires an external way to read from that socket. I'm using netcat (nc or Nmap's ncat) with ncat -k -l 4000 here. It is certainly not required for the job to work, but is handy here to see how things are progressing. (Note: this listener/server needs to be running before you try to use Log.)



I couldn't get the nested "foreach -> func1 -> foreach -> func2" to parallelize func2 correctly. Based on the sleeps, this should take 5 seconds for the three calls to func1, and 2 seconds (two batches: three, then two) for the five calls to func2, but it takes 10 seconds (three parallel calls to func1, then five sequential calls to func2):



system.time(
  out <- foreach(i=1:length(int1list), .combine=rbind, .packages="foreach") %dopar% {
    out1 <- func1(int1list[i])
    if (!out1[[2]]) {
      data.frame(Scenario=i, Result=out1[[1]], UsedJ=FALSE)
    } else {
      foreach(j=1:length(int2list), .combine=rbind) %dopar% {
        int3 <- func2(out1[[1]], int2list[j])
        data.frame(Scenario=i, Result=int3, UsedJ=TRUE)
      }
    }
  }
)
#    user  system elapsed
#    0.02    0.00   10.09


with the respective console output:



2018-11-12 11:51:17: func1 2
2018-11-12 11:51:17: func1 1
2018-11-12 11:51:17: func1 3
2018-11-12 11:51:23: func2 3 1
2018-11-12 11:51:24: func2 3 2
2018-11-12 11:51:25: func2 3 3
2018-11-12 11:51:26: func2 3 4
2018-11-12 11:51:27: func2 3 5


(note that the order is not guaranteed.)



So we can break it out into computing func1 stuff first:



system.time(
  out1 <- foreach(i = seq_along(int1list)) %dopar% {
    func1(int1list[i])
  }
)
#    user  system elapsed
#    0.02    0.01    5.03
str(out1)
# List of 3
#  $ :List of 2
#   ..$ : int 1
#   ..$ : logi FALSE
#  $ :List of 2
#   ..$ : int 2
#   ..$ : logi FALSE
#  $ :List of 2
#   ..$ : int 3
#   ..$ : logi TRUE


console:



2018-11-12 11:53:21: func1 2
2018-11-12 11:53:21: func1 1
2018-11-12 11:53:21: func1 3


then work on func2 stuff:



system.time(
  out2 <- foreach(i = seq_along(int1list), .combine="rbind") %:%
    foreach(j = seq_along(int2list), .combine="rbind") %dopar% {
      Log(paste("preparing", i, j))
      if (out1[[i]][[2]]) {
        int3 <- func2(out1[[i]][[1]], j)
        data.frame(i=i, j=j, Result=int3, UsedJ=FALSE)
      } else if (j == 1L) {
        data.frame(i=i, j=NA_integer_, Result=out1[[i]][[1]], UsedJ=FALSE)
      }
    }
)
#    user  system elapsed
#    0.03    0.00    2.05
out2
#   i  j Result UsedJ
# 1 1 NA   1.00 FALSE
# 2 2 NA   2.00 FALSE
# 3 3  1   3.00 FALSE
# 4 3  2   1.50 FALSE
# 5 3  3   1.00 FALSE
# 6 3  4   0.75 FALSE
# 7 3  5   0.60 FALSE


Two seconds (first batch of three is 1 second, second batch of two is 1 second) is what I expected. Console:



2018-11-12 11:54:01: preparing 1 2
2018-11-12 11:54:01: preparing 1 3
2018-11-12 11:54:01: preparing 1 1
2018-11-12 11:54:01: preparing 1 4
2018-11-12 11:54:01: preparing 1 5
2018-11-12 11:54:01: preparing 2 1
2018-11-12 11:54:01: preparing 2 2
2018-11-12 11:54:01: preparing 2 3
2018-11-12 11:54:01: preparing 2 4
2018-11-12 11:54:01: preparing 2 5
2018-11-12 11:54:01: preparing 3 1
2018-11-12 11:54:01: preparing 3 2
2018-11-12 11:54:01: func2 3 1
2018-11-12 11:54:01: preparing 3 3
2018-11-12 11:54:01: func2 3 2
2018-11-12 11:54:01: func2 3 3
2018-11-12 11:54:02: preparing 3 4
2018-11-12 11:54:02: preparing 3 5
2018-11-12 11:54:02: func2 3 4
2018-11-12 11:54:02: func2 3 5


You can see that func2 is called five times correctly. Unfortunately, you see that there is a lot of "spinning" internally in the loop. Granted, it's effectively a no-op (as evidenced by the 2.05 second runtime) so the load on the nodes is negligible.



If somebody has a method to preclude this needless spinning, I welcome comments or "competing" answers.
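One possibility (my own sketch, not part of the answer above) is to precompute only the (i, j) pairs that actually need the second function after phase 1, and iterate over that list with a single flat foreach, so no worker ever spins on an empty combination:

```r
library(doParallel)
library(foreach)

cl <- makeCluster(2)
registerDoParallel(cl)

# Assume out1 is the phase-1 result list: each element is list(value, ok-flag)
out1 <- list(list(1, FALSE), list(2, FALSE), list(3, TRUE))
int2list <- seq(1, 5)

# Build only the combinations that still need the second function
todo <- expand.grid(i = which(sapply(out1, `[[`, 2)), j = seq_along(int2list))

out2 <- foreach(k = seq_len(nrow(todo)), .combine = rbind) %dopar% {
  i <- todo$i[k]; j <- todo$j[k]
  data.frame(i = i, j = j, Result = out1[[i]][[1]] / int2list[j], UsedJ = TRUE)
}

stopCluster(cl)
registerDoSEQ()
```

Rows for the failed scenarios can then be appended sequentially afterwards, since there is no parallel work left for them.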





























  • Yes - apologies for the sloppy sample, all of your assumptions were appropriate. Your code works and fixes the issue I was having, but upon adapting this and running it, it is not parallelizing the work in the way I had intended. I was intending for it to try to utilize all 8 processors on my server in order to get the job done, but it appears this nested strategy results in the algorithm doing func1 followed by func2 a repeated number of times before moving onto the second iteration of i (which makes sense now that I think about it). Doing 2 separate loops should be more efficient for this.
    – actuary_meets_data
    Nov 12 '18 at 18:39












  • See my edit ... it's verbose as all get out with code you won't need, but I think it's clear what you can discard and what you may want to adapt into your own code.
    – r2evans
    Nov 12 '18 at 19:57










  • I wasn't able to get your code to run properly. I added a parallelization with 4 cores for testing (cl <- makeCluster(4) / registerDoParallel(cl)), and I am getting an error: Error in make.socket(port = .port) : socket not established. This seems to be related to the log function you wrote, and possibly the 4000 port number? I am extremely unfamiliar with this so I am not sure.
    – actuary_meets_data
    Nov 12 '18 at 22:25












  • I am working on a version similar to the second piece you posted (because I agree, the first version is going to do exactly what it is doing, and that is not ideal). However, I am only using one loop for func2, as there is some data that needs to be pulled in for func2 to work based off of the results of func1. I'm trying to determine whether this can be done in a nested parallel loop or if I should stick to my current single parallel loop for func2 within a sequential func1 loop.
    – actuary_meets_data
    Nov 12 '18 at 22:29












  • For your error, I should have mentioned here (and it is mentioned on the link provided) that the netcat listener (e.g., ncat -k -l 4000) needs to be started first. If that's the only problem, none of that code is required for production, just for explanation of performance and parallelism.
    – r2evans
    Nov 12 '18 at 23:00



































I appreciate the help provided by r2evans. While I wasn't actually able to replicate his work, due to my inexperience and inability to figure out how to get ncat working on my computer, he helped me realize that my original method wouldn't work as well as splitting the work into two separate parallelized foreach loops, which I have now gotten to a working production version.



This is the original proposed solution:



library(doParallel)
library(foreach)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

func1 <- function(int1){
  results <- list(int1, int1 > 2)
  return(results)
}
func2 <- function(int1, int2){
  return(int1/int2)
}

int1list <- seq(1,3)
int2list <- seq(1,5)

out <- foreach(i=1:length(int1list), .combine=rbind) %do% {
  out1 <- func1(int1list[i])
  if(!out1[[2]]){
    data.frame("Scenario"=i, "Result"=out1[[1]], UsedJ=FALSE)
    # next
  } else{
    foreach(j=1:length(int2list), .combine=rbind) %dopar% {
      int3 <- func2(out1[[1]], int2list[j])
      data.frame("Scenario"=i, "Result"=int3, UsedJ=TRUE)
    }
  }
}

stopCluster(cl)
registerDoSEQ()

out


However, this results in a loop that waits for the inner func2 iterations of the first func1 scenario to complete before beginning the second (and subsequent) func1 iterations. I elected to split this into two separate loops, like below:



library(doParallel)
library(foreach)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

func1 <- function(int1){
  results <- list(int1, int1 > 2)
  return(results)
}
func2 <- function(int1, int2){
  return(int1/int2)
}

int1list <- seq(1,3)
int2list <- seq(1,5)

# First parallel loop: run func1 for every scenario
out1 <- foreach(i=1:length(int1list)) %dopar% {
  func1(i)
}

finalOut <- data.frame("Scenario"=integer(), "UsedJ"=logical(), "Result"=double())

# Sequential outer loop; second parallel loop runs func2 only where func1 succeeded
for (i in 1:length(int1list)){
  if(out1[[i]][[2]]==FALSE){            # index into the i-th element of out1
    tempOut <- data.frame("Scenario"=i, "UsedJ"=FALSE, "Result"=NA)
  } else{
    tempOut <- foreach(j=1:length(int2list), .combine=rbind) %dopar% {
      Result <- func2(i, j)
      data.frame("Scenario"=i, "UsedJ"=TRUE, "Result"=Result)
    }
  }
  finalOut <- rbind(finalOut, tempOut)  # accumulate each scenario's results
}

stopCluster(cl)
registerDoSEQ()

finalOut


This algorithm seems to fit my purposes nicely. It isn't as efficient as it could be, but it should get the job done and not be too wasteful.



























  • For clarity ... ncat (or nc) needed to be run in a terminal, not in R ... I apologize if that was apparent, but newer users may not have made that leap based on my vague description. Regardless, the use of it was solely to provide indications of function-entry, it was by no means necessary for the parallelization strategy to work.
    – r2evans
    Nov 14 '18 at 16:40










  • I was trying to run it in terminal, but I either ran it in the terminal that came with the installation (where it immediately closed out) or from a clean terminal, where it did not recognize the command. Would I have needed to add this to the system path? I do not have administrative rights on my work computer to add programs to the system path (which is pretty annoying).
    – actuary_meets_data
    Nov 14 '18 at 16:48










    Admin rights are not required, but I understand the frustration. I do not understand what you mean by "ran it in the terminal that came with the installation" ... installation of netcat? It doesn't include a terminal. "Clean terminal"? If on Windows, then "Start > Run > cmd", go to the right dir, type in ncat -k -l 4000; it should "do nothing" and not return. If on Linux, open an xterm, type in path/to/ncat -k -l 4000; it should "do nothing" (and not return). Regardless, does the parallelization technique (without Log) work as intended? We can stop this thread, no need to force netcat. :-)
    – r2evans
    Nov 14 '18 at 16:57













2 Answers
I'm not a pro at foreach, but a few things stand out:





  • func2 references both int1 and int2 but it is only given the latter; this might be an artifact of your simplified example, maybe not?


  • your code here needs to be enclosed in a curly block, i.e., you need to change from



    out <- foreach(i=1:length(int1list),.combine=rbind) %:%
    out1 <- func1(i)
    if(out1[[2]]==FALSE) ...


    to



    out <- foreach(i=1:length(int1list),.combine=rbind) %:% {
    out1 <- func1(i)
    if(out1[[2]]==FALSE) ...
    }


  • the docs for foreach suggest that the binary operator %:% is a nesting operator that is used between two foreach calls, but you aren't doing that. I think I get it to work correctly with %do% (or %dopar%)

  • I don't think prints work well inside parallel foreach loops ... they might work fine on the master node but not on the other workers; ref: How can I print when using %dopar%

  • possibly again due to simplified example, you define but don't actually use the contents of int1list (just its length), I'll remedy in this example


  • next works in "normal" R loops, not in these specialized foreach loops; it isn't a problem, though, since your if/else structure provides the same effect
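To make the third bullet concrete, here is a minimal sketch (assuming standard foreach semantics) of how the %:% nesting operator is meant to be used: it sits directly between two foreach calls, and the body then sees both iteration variables.

```r
library(foreach)

# %:% nests two foreach iterators; the body sees both i and j.
# With %do% this runs sequentially; swap in %dopar% with a registered
# backend for parallel execution.
res <- foreach(i = 1:2, .combine = rbind) %:%
  foreach(j = 1:3, .combine = rbind) %do% {
    data.frame(i = i, j = j, prod = i * j)
  }
nrow(res)  # 6: one row per (i, j) combination
```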


Here's your example, modified slightly to account for all of the above. I add a UsedJ column to indicate whether the inner loop was used.



library(doParallel)
library(foreach)
func1 <- function(int1) {
  results <- list(int1, int1 > 2)
  return(results)
}
func2 <- function(int1, int2) {
  return(int1 / int2)
}
int1list <- seq(1, 3)
int2list <- seq(1, 5)
out <- foreach(i = 1:length(int1list), .combine = rbind) %do% {
  out1 <- func1(int1list[i])
  if (!out1[[2]]) {
    data.frame("Scenario" = i, "Result" = out1[[1]], UsedJ = FALSE)
    # next
  } else {
    foreach(j = 1:length(int2list), .combine = rbind) %dopar% {
      int3 <- func2(out1[[1]], int2list[j])
      data.frame("Scenario" = i, "Result" = int3, UsedJ = TRUE)
    }
  }
}
out
#   Scenario Result UsedJ
# 1        1   1.00 FALSE
# 2        2   2.00 FALSE
# 3        3   3.00  TRUE
# 4        3   1.50  TRUE
# 5        3   1.00  TRUE
# 6        3   0.75  TRUE
# 7        3   0.60  TRUE




Edit



If you aren't seeing parallelization, perhaps it's because you have not set up a "cluster" yet. There are also a few other changes to the work flow to get it to parallelize well, based on foreach's method of nesting loops with the %:% operator.
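For completeness, a typical doParallel cluster setup looks like the following; the worker count of 2 is an arbitrary choice for illustration, and you would adjust it to your core count.

```r
library(doParallel)

cl <- makeCluster(2)    # 2 workers chosen arbitrarily; match to your machine
registerDoParallel(cl)  # register so %dopar% dispatches to this cluster
# ... run your foreach loops here ...
# stopCluster(cl)       # shut the workers down when finished
```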



In order to "prove" this is working in parallel, I've added some logging based on How can I print when using %dopar% (because parallel processes do not print as one might hope).



library(doParallel)
library(foreach)
Log <- function(text, ..., .port = 4000, .sock = make.socket(port = .port)) {
  msg <- sprintf(paste0(as.character(Sys.time()), ": ", text, "\n"), ...)
  write.socket(.sock, msg)
  close.socket(.sock)
}
func1 <- function(int1) {
  Log(paste("func1", int1))
  Sys.sleep(5)
  results <- list(int1, int1 > 2)
  return(results)
}
func2 <- function(int1, int2) {
  Log(paste("func2", int1, int2))
  Sys.sleep(1)
  return(int1 / int2)
}


The use of the logging code requires an external way to read from that socket. I'm using netcat (nc or Nmap's ncat) with ncat -k -l 4000 here. It is certainly not required for the job to work, but is handy here to see how things are progressing. (Note: this listener/server needs to be running before you try to use Log.)



I couldn't get the nested "foreach -> func1 -> foreach -> func2" to parallelize func2 correctly. Based on the sleeps, this should take 5 seconds for the three calls to func1 and 2 seconds (two batches: three, then two) for the five calls to func2, but it takes 10 seconds (three parallel calls to func1, then five sequential calls to func2):



system.time(
  out <- foreach(i = 1:length(int1list), .combine = rbind, .packages = "foreach") %dopar% {
    out1 <- func1(int1list[i])
    if (!out1[[2]]) {
      data.frame(Scenario = i, Result = out1[[1]], UsedJ = FALSE)
    } else {
      foreach(j = 1:length(int2list), .combine = rbind) %dopar% {
        int3 <- func2(out1[[1]], int2list[j])
        data.frame(Scenario = i, Result = int3, UsedJ = TRUE)
      }
    }
  }
)
#    user  system elapsed
#    0.02    0.00   10.09


with the respective console output:



2018-11-12 11:51:17: func1 2
2018-11-12 11:51:17: func1 1
2018-11-12 11:51:17: func1 3
2018-11-12 11:51:23: func2 3 1
2018-11-12 11:51:24: func2 3 2
2018-11-12 11:51:25: func2 3 3
2018-11-12 11:51:26: func2 3 4
2018-11-12 11:51:27: func2 3 5


(note that the order is not guaranteed.)



So we can break it out into computing func1 stuff first:



system.time(
  out1 <- foreach(i = seq_along(int1list)) %dopar% {
    func1(int1list[i])
  }
)
#    user  system elapsed
#    0.02    0.01    5.03
str(out1)
# List of 3
#  $ :List of 2
#   ..$ : int 1
#   ..$ : logi FALSE
#  $ :List of 2
#   ..$ : int 2
#   ..$ : logi FALSE
#  $ :List of 2
#   ..$ : int 3
#   ..$ : logi TRUE

console:



2018-11-12 11:53:21: func1 2
2018-11-12 11:53:21: func1 1
2018-11-12 11:53:21: func1 3


then work on func2 stuff:



system.time(
  out2 <- foreach(i = seq_along(int1list), .combine = "rbind") %:%
    foreach(j = seq_along(int2list), .combine = "rbind") %dopar% {
      Log(paste("preparing", i, j))
      if (out1[[i]][[2]]) {
        int3 <- func2(out1[[i]][[1]], j)
        data.frame(i = i, j = j, Result = int3, UsedJ = FALSE)
      } else if (j == 1L) {
        data.frame(i = i, j = NA_integer_, Result = out1[[i]][[1]], UsedJ = FALSE)
      }
    }
)
#    user  system elapsed
#    0.03    0.00    2.05
out2
#   i  j Result UsedJ
# 1 1 NA   1.00 FALSE
# 2 2 NA   2.00 FALSE
# 3 3  1   3.00 FALSE
# 4 3  2   1.50 FALSE
# 5 3  3   1.00 FALSE
# 6 3  4   0.75 FALSE
# 7 3  5   0.60 FALSE


Two seconds (first batch of three is 1 second, second batch of two is 1 second) is what I expected. Console:



2018-11-12 11:54:01: preparing 1 2
2018-11-12 11:54:01: preparing 1 3
2018-11-12 11:54:01: preparing 1 1
2018-11-12 11:54:01: preparing 1 4
2018-11-12 11:54:01: preparing 1 5
2018-11-12 11:54:01: preparing 2 1
2018-11-12 11:54:01: preparing 2 2
2018-11-12 11:54:01: preparing 2 3
2018-11-12 11:54:01: preparing 2 4
2018-11-12 11:54:01: preparing 2 5
2018-11-12 11:54:01: preparing 3 1
2018-11-12 11:54:01: preparing 3 2
2018-11-12 11:54:01: func2 3 1
2018-11-12 11:54:01: preparing 3 3
2018-11-12 11:54:01: func2 3 2
2018-11-12 11:54:01: func2 3 3
2018-11-12 11:54:02: preparing 3 4
2018-11-12 11:54:02: preparing 3 5
2018-11-12 11:54:02: func2 3 4
2018-11-12 11:54:02: func2 3 5


You can see that func2 is called five times correctly. Unfortunately, you see that there is a lot of "spinning" internally in the loop. Granted, it's effectively a no-op (as evidenced by the 2.05 second runtime) so the load on the nodes is negligible.



If somebody has a method to preclude this needless spinning, I welcome comments or "competing" answers.
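One possible way to avoid that spinning is to enumerate only the (i, j) pairs that actually need work and run a single flat loop over them. This is a sketch, not tested against the real workload; the out1, int2list, and func2 objects below are toy stand-ins mirroring the ones defined above, and %do% would become %dopar% once a backend is registered.

```r
library(foreach)

# Toy stand-ins mirroring the structures used above
out1 <- list(list(1, FALSE), list(2, FALSE), list(3, TRUE))
int2list <- 1:5
func2 <- function(int1, int2) int1 / int2

# Build the task list up front: one row per unit of real work,
# plus a single placeholder row for each "skipped" scenario
tasks <- do.call(rbind, lapply(seq_along(out1), function(i) {
  if (out1[[i]][[2]]) data.frame(i = i, j = seq_along(int2list))
  else data.frame(i = i, j = NA_integer_)
}))

# One flat loop: every iteration does real work, no empty spins
out2 <- foreach(k = seq_len(nrow(tasks)), .combine = rbind) %do% {
  i <- tasks$i[k]; j <- tasks$j[k]
  if (is.na(j)) data.frame(i = i, j = NA_integer_, Result = out1[[i]][[1]], UsedJ = FALSE)
  else data.frame(i = i, j = j, Result = func2(out1[[i]][[1]], j), UsedJ = TRUE)
}
```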






  • Yes - apologies for the sloppy sample, all of your assumptions were appropriate. Your code works and fixes the issue I was having, but upon adapting this and running it, it is not parallelizing the work in the way I had intended. I was intending for it to try to utilize all 8 processors on my server in order to get the job done, but it appears this nested strategy results in the algorithm doing func1 followed by func2 a repeated number of times before moving onto the second iteration of i (which makes sense now that I think about it). Doing 2 separate loops should be more efficient for this.
    – actuary_meets_data
    Nov 12 '18 at 18:39

  • See my edit ... it's verbose as all get out with code you won't need, but I think it's clear what you can discard and what you may want to adapt into your own code.
    – r2evans
    Nov 12 '18 at 19:57

  • I wasn't able to get your code to run properly. I added a parallelization with 4 cores for testing (cl <- makeCluster(4) / registerDoParallel(cl)), and I am getting an error: Error in make.socket(port = .port) : socket not established. This seems to be related to the log function you wrote, and possibly the 4000 port number? I am extremely unfamiliar with this so I am not sure.
    – actuary_meets_data
    Nov 12 '18 at 22:25

  • I am working on a version similar to the second piece you posted (because I agree, the first version is going to do exactly what it is doing, and that is not ideal). However, I am only using one loop for func2, as there is some data that needs to be pulled in for func2 to work based off of the results of func1. I'm trying to determine whether this can be done in a nested parallel loop or if I should stick to my current single parallel loop for func2 within a sequential func1 loop.
    – actuary_meets_data
    Nov 12 '18 at 22:29

  • For your error, I should have mentioned here (and it is mentioned on the link provided) that the netcat listener (e.g., ncat -k -l 4000) needs to be started first. If that's the only problem, none of that code is required for production, just for explanation of performance and parallelism.
    – r2evans
    Nov 12 '18 at 23:00


















0














I'm not a pro at foreach, but there are a few things to this that stand out:





  • func2 references both int1 and int2 but it is only given the latter; this might be an artifact of your simplified example, maybe not?


  • your code here needs to be enclosed in a curly block, i.e., you need to change from



    out <- foreach(i=1:length(int1list),.combine=rbind) %:%
    out1 <- func1(i)
    if(out1[[2]]==FALSE) ...


    to



    out <- foreach(i=1:length(int1list),.combine=rbind) %:% {
    out1 <- func1(i)
    if(out1[[2]]==FALSE) ...
    }


  • the docs for foreach suggest that the binary operator %:% is a nesting operator that is used between two foreach calls, but you aren't doing that. I think I get it to work correctly with %do% (or %dopar%)

  • I don't think prints work well inside parallel foreach loops ... it might work find on the master node but not on all others, ref: How can I print when using %dopar%

  • possibly again due to simplified example, you define but don't actually use the contents of int1list (just its length), I'll remedy in this example


  • next works in "normal" R loops, not in these specialized foreach loops; it isn't a problem, though, since your if/else structure provides the same effect


Here's your example, modified slightly to account for all of the above. I add UsedJ to indicate



library(doParallel)
library(foreach)
func1 <- function(int1){
results <- list(int1,int1>2)
return(results)
}
func2 <- function(int1,int2){
return(int1/int2)
}
int1list <- seq(1,3)
int2list <- seq(1,5)
out <- foreach(i=1:length(int1list),.combine=rbind) %do% {
out1 <- func1(int1list[i])
if(!out1[[2]]){
data.frame("Scenario"=i, "Result"=out1[[1]], UsedJ=FALSE)
# next
} else{
foreach(j=1:length(int2list),.combine=rbind) %dopar% {
int3 <- func2(out1[[1]], int2list[j])
data.frame("Scenario"=i,"Result"=int3, UsedJ=TRUE)
}
}
}
out
# Scenario Result UsedJ
# 1 1 1.00 FALSE
# 2 2 2.00 FALSE
# 3 3 3.00 TRUE
# 4 3 1.50 TRUE
# 5 3 1.00 TRUE
# 6 3 0.75 TRUE
# 7 3 0.60 TRUE




Edit



If you aren't seeing parallelization, perhaps it's because you have not set up a "cluster" yet. There are also a few other changes to the work flow to get it to parallelize well, based on foreach's method of nesting loops with the %:% operator.



In order to "prove" this is working in parallel, I've added some logging based on How can I print when using %dopar% (because parallel processes do not print as one might hope).



library(doParallel)
library(foreach)
Log <- function(text, ..., .port = 4000, .sock = make.socket(port=.port)) {
msg <- sprintf(paste0(as.character(Sys.time()), ": ", text, "n"), ...)
write.socket(.sock, msg)
close.socket(.sock)
}
func1 <- function(int1) {
Log(paste("func1", int1))
Sys.sleep(5)
results <- list(int1, int1 > 2)
return(results)
}
func2 <- function(int1, int2) {
Log(paste("func2", int1, int2))
Sys.sleep(1)
return(int1 / int2)
}


The use of the logging code requires an external way to read from that socket. I'm using netcat (nc or Nmap's ncat) with ncat -k -l 4000 here. It is certainly not required for the job to work, but is handy here to see how things are progressing. (Note: this listener/server needs to be running before you try to use Log.)



I couldn't get the nested "foreach -> func1 -> foreach -> func2" to parallelize func2 correctly. Based on the sleeps, this should take 5 seconds for the three calls to func1, and 2 seconds (two batches of three each) for the five calls to func2, but it takes 10 seconds (three parallel calls to func1, then five sequential calls to func2):



system.time(
out <- foreach(i=1:length(int1list), .combine=rbind, .packages="foreach") %dopar% {
out1 <- func1(int1list[i])
if (!out1[[2]]) {
data.frame(Scenario=i, Result=out1[[1]], UsedJ=FALSE)
} else {
foreach(j=1:length(int2list), .combine=rbind) %dopar% {
int3 <- func2(out1[[1]], int2list[j])
data.frame(Scenario=i, Result=int3, UsedJ=TRUE)
}
}
}
)
# user system elapsed
# 0.02 0.00 10.09


with the respective console output:



2018-11-12 11:51:17: func1 2
2018-11-12 11:51:17: func1 1
2018-11-12 11:51:17: func1 3
2018-11-12 11:51:23: func2 3 1
2018-11-12 11:51:24: func2 3 2
2018-11-12 11:51:25: func2 3 3
2018-11-12 11:51:26: func2 3 4
2018-11-12 11:51:27: func2 3 5


(note that the order is not guaranteed.)



So we can break it out into computing func1 stuff first:



system.time(
out1 <- foreach(i = seq_along(int1list)) %dopar% {
func1(int1list[i])
}
)
# user system elapsed
# 0.02 0.01 5.03
str(out1)
# List of 3
# $ :List of 2
# ..$ : int 1
# ..$ : logi FALSE
# $ :List of 2
# ..$ : int 2
# ..$ : logi FALSE
# $ :List of 2
# ..$ : int 3
# ..$ : logi TRUE


console:



2018-11-12 11:53:21: func1 2
2018-11-12 11:53:21: func1 1
2018-11-12 11:53:21: func1 3


then work on func2 stuff:



system.time(
out2 <- foreach(i = seq_along(int1list), .combine="rbind") %:%
foreach(j = seq_along(int2list), .combine="rbind") %dopar% {
Log(paste("preparing", i, j))
if (out1[[i]][[2]]) {
int3 <- func2(out1[[i]][[1]], j)
data.frame(i=i, j=j, Result=int3, UsedJ=FALSE)
} else if (j == 1L) {
data.frame(i=i, j=NA_integer_, Result=out1[[i]][[1]], UsedJ=FALSE)
}
}
)
# user system elapsed
# 0.03 0.00 2.05
out2
# i j Result UsedJ
# 1 1 NA 1.00 FALSE
# 2 2 NA 2.00 FALSE
# 3 3 1 3.00 FALSE
# 4 3 2 1.50 FALSE
# 5 3 3 1.00 FALSE
# 6 3 4 0.75 FALSE
# 7 3 5 0.60 FALSE


Two seconds (first batch of three is 1 second, second batch of two is 1 second) is what I expected. Console:



2018-11-12 11:54:01: preparing 1 2
2018-11-12 11:54:01: preparing 1 3
2018-11-12 11:54:01: preparing 1 1
2018-11-12 11:54:01: preparing 1 4
2018-11-12 11:54:01: preparing 1 5
2018-11-12 11:54:01: preparing 2 1
2018-11-12 11:54:01: preparing 2 2
2018-11-12 11:54:01: preparing 2 3
2018-11-12 11:54:01: preparing 2 4
2018-11-12 11:54:01: preparing 2 5
2018-11-12 11:54:01: preparing 3 1
2018-11-12 11:54:01: preparing 3 2
2018-11-12 11:54:01: func2 3 1
2018-11-12 11:54:01: preparing 3 3
2018-11-12 11:54:01: func2 3 2
2018-11-12 11:54:01: func2 3 3
2018-11-12 11:54:02: preparing 3 4
2018-11-12 11:54:02: preparing 3 5
2018-11-12 11:54:02: func2 3 4
2018-11-12 11:54:02: func2 3 5


You can see that func2 is called five times correctly. Unfortunately, you see that there is a lot of "spinning" internally in the loop. Granted, it's effectively a no-op (as evidenced by the 2.05 second runtime) so the load on the nodes is negligible.



If somebody has a method to preclude this needless spinning, I welcome comments or "competing" answers.






share|improve this answer























  • Yes - apologies for the sloppy sample, all of your assumptions were appropriate. Your code works and fixes the issue I was having, but upon adapting this and running it, it is not parallelizing the work in the way I had intended. I was intending for it to try to utilize all 8 processors on my server in order to get the job done, but it appears this nested strategy results in the algorithm doing func1 followed by func2 a repeated number of times before moving onto the second iteration of i (which makes sense now that I think about it). Doing 2 separate loops should be more efficient for this.
    – actuary_meets_data
    Nov 12 '18 at 18:39












  • See my edit ... it's verbose as all get out with code you won't need, but I think it's clear what you can discard and what you may want to adapt into your own code.
    – r2evans
    Nov 12 '18 at 19:57










  • I wasn't able to get your code to run properly. I added a parallelization with 4 cores for testing (cl <- makeCluster(4) / registerDoParallel(cl)), and I am getting an error: Error in make.socket(port = .port) : socket not established. This seems to be related to the log function you wrote, and possibly the 4000 port number? I am extremely unfamiliar with this so I am not sure.
    – actuary_meets_data
    Nov 12 '18 at 22:25












  • I am working on a version similar to the second piece you posted (because I agree, the first version is going to do exactly what it is doing, and that is not ideal). However, I am only using one loop for func2, as there is some data that needs to be pulled in for func2 to work based off of the results of func1. I'm trying to determine whether this can be done in a nested parallel loop or if I should stick to my current single parallel loop for func2 within a sequential func1 loop.
    – actuary_meets_data
    Nov 12 '18 at 22:29












  • For your error, I should have mentioned here (and it is mentioned on the link provided) that the netcat listener (e.g., ncat -k -l 4000) needs to be started first. If that's the only problem, none of that code is required for production, just for explanation of performance and parallelism.
    – r2evans
    Nov 12 '18 at 23:00
















0












0








0






I'm not a pro at foreach, but there are a few things to this that stand out:





  • func2 references both int1 and int2 but it is only given the latter; this might be an artifact of your simplified example, maybe not?


  • your code here needs to be enclosed in a curly block, i.e., you need to change from



    out <- foreach(i=1:length(int1list),.combine=rbind) %:%
    out1 <- func1(i)
    if(out1[[2]]==FALSE) ...


    to



    out <- foreach(i=1:length(int1list),.combine=rbind) %:% {
    out1 <- func1(i)
    if(out1[[2]]==FALSE) ...
    }


  • the docs for foreach suggest that the binary operator %:% is a nesting operator that is used between two foreach calls, but you aren't doing that. I think I get it to work correctly with %do% (or %dopar%)

  • I don't think prints work well inside parallel foreach loops ... it might work find on the master node but not on all others, ref: How can I print when using %dopar%

  • possibly again due to simplified example, you define but don't actually use the contents of int1list (just its length), I'll remedy in this example


  • next works in "normal" R loops, not in these specialized foreach loops; it isn't a problem, though, since your if/else structure provides the same effect


Here's your example, modified slightly to account for all of the above. I add UsedJ to indicate



library(doParallel)
library(foreach)
func1 <- function(int1){
results <- list(int1,int1>2)
return(results)
}
func2 <- function(int1,int2){
return(int1/int2)
}
int1list <- seq(1,3)
int2list <- seq(1,5)
out <- foreach(i=1:length(int1list),.combine=rbind) %do% {
out1 <- func1(int1list[i])
if(!out1[[2]]){
data.frame("Scenario"=i, "Result"=out1[[1]], UsedJ=FALSE)
# next
} else{
foreach(j=1:length(int2list),.combine=rbind) %dopar% {
int3 <- func2(out1[[1]], int2list[j])
data.frame("Scenario"=i,"Result"=int3, UsedJ=TRUE)
}
}
}
out
# Scenario Result UsedJ
# 1 1 1.00 FALSE
# 2 2 2.00 FALSE
# 3 3 3.00 TRUE
# 4 3 1.50 TRUE
# 5 3 1.00 TRUE
# 6 3 0.75 TRUE
# 7 3 0.60 TRUE




Edit



If you aren't seeing parallelization, perhaps it's because you have not set up a "cluster" yet. There are also a few other changes to the work flow to get it to parallelize well, based on foreach's method of nesting loops with the %:% operator.



In order to "prove" this is working in parallel, I've added some logging based on How can I print when using %dopar% (because parallel processes do not print as one might hope).



library(doParallel)
library(foreach)
Log <- function(text, ..., .port = 4000, .sock = make.socket(port=.port)) {
msg <- sprintf(paste0(as.character(Sys.time()), ": ", text, "n"), ...)
write.socket(.sock, msg)
close.socket(.sock)
}
func1 <- function(int1) {
Log(paste("func1", int1))
Sys.sleep(5)
results <- list(int1, int1 > 2)
return(results)
}
func2 <- function(int1, int2) {
Log(paste("func2", int1, int2))
Sys.sleep(1)
return(int1 / int2)
}


The use of the logging code requires an external way to read from that socket. I'm using netcat (nc or Nmap's ncat) with ncat -k -l 4000 here. It is certainly not required for the job to work, but is handy here to see how things are progressing. (Note: this listener/server needs to be running before you try to use Log.)



I couldn't get the nested "foreach -> func1 -> foreach -> func2" to parallelize func2 correctly. Based on the sleeps, this should take 5 seconds for the three calls to func1, and 2 seconds (two batches of three each) for the five calls to func2, but it takes 10 seconds (three parallel calls to func1, then five sequential calls to func2):



system.time(
out <- foreach(i=1:length(int1list), .combine=rbind, .packages="foreach") %dopar% {
out1 <- func1(int1list[i])
if (!out1[[2]]) {
data.frame(Scenario=i, Result=out1[[1]], UsedJ=FALSE)
} else {
foreach(j=1:length(int2list), .combine=rbind) %dopar% {
int3 <- func2(out1[[1]], int2list[j])
data.frame(Scenario=i, Result=int3, UsedJ=TRUE)
}
}
}
)
# user system elapsed
# 0.02 0.00 10.09


with the respective console output:



2018-11-12 11:51:17: func1 2
2018-11-12 11:51:17: func1 1
2018-11-12 11:51:17: func1 3
2018-11-12 11:51:23: func2 3 1
2018-11-12 11:51:24: func2 3 2
2018-11-12 11:51:25: func2 3 3
2018-11-12 11:51:26: func2 3 4
2018-11-12 11:51:27: func2 3 5


(note that the order is not guaranteed.)



So we can break it out into computing func1 stuff first:



system.time(
out1 <- foreach(i = seq_along(int1list)) %dopar% {
func1(int1list[i])
}
)
# user system elapsed
# 0.02 0.01 5.03
str(out1)
# List of 3
# $ :List of 2
# ..$ : int 1
# ..$ : logi FALSE
# $ :List of 2
# ..$ : int 2
# ..$ : logi FALSE
# $ :List of 2
# ..$ : int 3
# ..$ : logi TRUE


console:



2018-11-12 11:53:21: func1 2
2018-11-12 11:53:21: func1 1
2018-11-12 11:53:21: func1 3


then work on func2 stuff:



system.time(
out2 <- foreach(i = seq_along(int1list), .combine="rbind") %:%
foreach(j = seq_along(int2list), .combine="rbind") %dopar% {
Log(paste("preparing", i, j))
if (out1[[i]][[2]]) {
int3 <- func2(out1[[i]][[1]], j)
data.frame(i=i, j=j, Result=int3, UsedJ=FALSE)
} else if (j == 1L) {
data.frame(i=i, j=NA_integer_, Result=out1[[i]][[1]], UsedJ=FALSE)
}
}
)
# user system elapsed
# 0.03 0.00 2.05
out2
# i j Result UsedJ
# 1 1 NA 1.00 FALSE
# 2 2 NA 2.00 FALSE
# 3 3 1 3.00 FALSE
# 4 3 2 1.50 FALSE
# 5 3 3 1.00 FALSE
# 6 3 4 0.75 FALSE
# 7 3 5 0.60 FALSE


Two seconds (first batch of three is 1 second, second batch of two is 1 second) is what I expected. Console:



2018-11-12 11:54:01: preparing 1 2
2018-11-12 11:54:01: preparing 1 3
2018-11-12 11:54:01: preparing 1 1
2018-11-12 11:54:01: preparing 1 4
2018-11-12 11:54:01: preparing 1 5
2018-11-12 11:54:01: preparing 2 1
2018-11-12 11:54:01: preparing 2 2
2018-11-12 11:54:01: preparing 2 3
2018-11-12 11:54:01: preparing 2 4
2018-11-12 11:54:01: preparing 2 5
2018-11-12 11:54:01: preparing 3 1
2018-11-12 11:54:01: preparing 3 2
2018-11-12 11:54:01: func2 3 1
2018-11-12 11:54:01: preparing 3 3
2018-11-12 11:54:01: func2 3 2
2018-11-12 11:54:01: func2 3 3
2018-11-12 11:54:02: preparing 3 4
2018-11-12 11:54:02: preparing 3 5
2018-11-12 11:54:02: func2 3 4
2018-11-12 11:54:02: func2 3 5


You can see that func2 is called five times correctly. Unfortunately, you see that there is a lot of "spinning" internally in the loop. Granted, it's effectively a no-op (as evidenced by the 2.05 second runtime) so the load on the nodes is negligible.



If somebody has a method to preclude this needless spinning, I welcome comments or "competing" answers.






share|improve this answer














I'm not a pro at foreach, but there are a few things to this that stand out:





  • func2 references both int1 and int2 but it is only given the latter; this might be an artifact of your simplified example, maybe not?


  • your code here needs to be enclosed in a curly block, i.e., you need to change from



    out <- foreach(i=1:length(int1list),.combine=rbind) %:%
    out1 <- func1(i)
    if(out1[[2]]==FALSE) ...


    to



    out <- foreach(i=1:length(int1list),.combine=rbind) %:% {
    out1 <- func1(i)
    if(out1[[2]]==FALSE) ...
    }


  • the docs for foreach suggest that the binary operator %:% is a nesting operator that is used between two foreach calls, but you aren't doing that. I think I get it to work correctly with %do% (or %dopar%)

  • I don't think prints work well inside parallel foreach loops ... it might work find on the master node but not on all others, ref: How can I print when using %dopar%

  • possibly again due to simplified example, you define but don't actually use the contents of int1list (just its length), I'll remedy in this example


  • next works in "normal" R loops, not in these specialized foreach loops; it isn't a problem, though, since your if/else structure provides the same effect


Here's your example, modified slightly to account for all of the above. I add UsedJ to indicate



library(doParallel)
library(foreach)
func1 <- function(int1){
results <- list(int1,int1>2)
return(results)
}
func2 <- function(int1,int2){
return(int1/int2)
}
int1list <- seq(1,3)
int2list <- seq(1,5)
out <- foreach(i=1:length(int1list),.combine=rbind) %do% {
out1 <- func1(int1list[i])
if(!out1[[2]]){
data.frame("Scenario"=i, "Result"=out1[[1]], UsedJ=FALSE)
# next
} else{
foreach(j=1:length(int2list),.combine=rbind) %dopar% {
int3 <- func2(out1[[1]], int2list[j])
data.frame("Scenario"=i,"Result"=int3, UsedJ=TRUE)
}
}
}
out
# Scenario Result UsedJ
# 1 1 1.00 FALSE
# 2 2 2.00 FALSE
# 3 3 3.00 TRUE
# 4 3 1.50 TRUE
# 5 3 1.00 TRUE
# 6 3 0.75 TRUE
# 7 3 0.60 TRUE




Edit



If you aren't seeing parallelization, perhaps it's because you have not set up a "cluster" yet. There are also a few other changes to the work flow to get it to parallelize well, based on foreach's method of nesting loops with the %:% operator.



In order to "prove" this is working in parallel, I've added some logging based on How can I print when using %dopar% (because parallel processes do not print as one might hope).



library(doParallel)
library(foreach)
Log <- function(text, ..., .port = 4000, .sock = make.socket(port=.port)) {
msg <- sprintf(paste0(as.character(Sys.time()), ": ", text, "n"), ...)
write.socket(.sock, msg)
close.socket(.sock)
}
func1 <- function(int1) {
Log(paste("func1", int1))
Sys.sleep(5)
results <- list(int1, int1 > 2)
return(results)
}
func2 <- function(int1, int2) {
Log(paste("func2", int1, int2))
Sys.sleep(1)
return(int1 / int2)
}


The use of the logging code requires an external way to read from that socket. I'm using netcat (nc or Nmap's ncat) with ncat -k -l 4000 here. It is certainly not required for the job to work, but is handy here to see how things are progressing. (Note: this listener/server needs to be running before you try to use Log.)



I couldn't get the nested "foreach -> func1 -> foreach -> func2" to parallelize func2 correctly. Based on the sleeps, this should take 5 seconds for the three calls to func1, and 2 seconds (two batches of three each) for the five calls to func2, but it takes 10 seconds (three parallel calls to func1, then five sequential calls to func2):



system.time(
out <- foreach(i=1:length(int1list), .combine=rbind, .packages="foreach") %dopar% {
out1 <- func1(int1list[i])
if (!out1[[2]]) {
data.frame(Scenario=i, Result=out1[[1]], UsedJ=FALSE)
} else {
foreach(j=1:length(int2list), .combine=rbind) %dopar% {
int3 <- func2(out1[[1]], int2list[j])
data.frame(Scenario=i, Result=int3, UsedJ=TRUE)
}
}
}
)
# user system elapsed
# 0.02 0.00 10.09


with the respective console output:



2018-11-12 11:51:17: func1 2
2018-11-12 11:51:17: func1 1
2018-11-12 11:51:17: func1 3
2018-11-12 11:51:23: func2 3 1
2018-11-12 11:51:24: func2 3 2
2018-11-12 11:51:25: func2 3 3
2018-11-12 11:51:26: func2 3 4
2018-11-12 11:51:27: func2 3 5


(note that the order is not guaranteed.)



So we can break it out into computing func1 stuff first:



system.time(
out1 <- foreach(i = seq_along(int1list)) %dopar% {
func1(int1list[i])
}
)
# user system elapsed
# 0.02 0.01 5.03
str(out1)
# List of 3
# $ :List of 2
# ..$ : int 1
# ..$ : logi FALSE
# $ :List of 2
# ..$ : int 2
# ..$ : logi FALSE
# $ :List of 2
# ..$ : int 3
# ..$ : logi TRUE


console:



2018-11-12 11:53:21: func1 2
2018-11-12 11:53:21: func1 1
2018-11-12 11:53:21: func1 3


then work on func2 stuff:



system.time(
out2 <- foreach(i = seq_along(int1list), .combine="rbind") %:%
foreach(j = seq_along(int2list), .combine="rbind") %dopar% {
Log(paste("preparing", i, j))
if (out1[[i]][[2]]) {
int3 <- func2(out1[[i]][[1]], j)
data.frame(i=i, j=j, Result=int3, UsedJ=FALSE)
} else if (j == 1L) {
data.frame(i=i, j=NA_integer_, Result=out1[[i]][[1]], UsedJ=FALSE)
}
}
)
# user system elapsed
# 0.03 0.00 2.05
out2
# i j Result UsedJ
# 1 1 NA 1.00 FALSE
# 2 2 NA 2.00 FALSE
# 3 3 1 3.00 FALSE
# 4 3 2 1.50 FALSE
# 5 3 3 1.00 FALSE
# 6 3 4 0.75 FALSE
# 7 3 5 0.60 FALSE


Two seconds (first batch of three is 1 second, second batch of two is 1 second) is what I expected. Console:



2018-11-12 11:54:01: preparing 1 2
2018-11-12 11:54:01: preparing 1 3
2018-11-12 11:54:01: preparing 1 1
2018-11-12 11:54:01: preparing 1 4
2018-11-12 11:54:01: preparing 1 5
2018-11-12 11:54:01: preparing 2 1
2018-11-12 11:54:01: preparing 2 2
2018-11-12 11:54:01: preparing 2 3
2018-11-12 11:54:01: preparing 2 4
2018-11-12 11:54:01: preparing 2 5
2018-11-12 11:54:01: preparing 3 1
2018-11-12 11:54:01: preparing 3 2
2018-11-12 11:54:01: func2 3 1
2018-11-12 11:54:01: preparing 3 3
2018-11-12 11:54:01: func2 3 2
2018-11-12 11:54:01: func2 3 3
2018-11-12 11:54:02: preparing 3 4
2018-11-12 11:54:02: preparing 3 5
2018-11-12 11:54:02: func2 3 4
2018-11-12 11:54:02: func2 3 5


You can see that func2 is called five times correctly. Unfortunately, you see that there is a lot of "spinning" internally in the loop. Granted, it's effectively a no-op (as evidenced by the 2.05 second runtime) so the load on the nodes is negligible.



If somebody has a method to preclude this needless spinning, I welcome comments or "competing" answers.







edited Nov 12 '18 at 23:01
answered Nov 12 '18 at 17:43
r2evans
  • Yes - apologies for the sloppy sample, all of your assumptions were appropriate. Your code works and fixes the issue I was having, but upon adapting this and running it, it is not parallelizing the work in the way I had intended. I was intending for it to try to utilize all 8 processors on my server in order to get the job done, but it appears this nested strategy results in the algorithm doing func1 followed by func2 a repeated number of times before moving onto the second iteration of i (which makes sense now that I think about it). Doing 2 separate loops should be more efficient for this.
    – actuary_meets_data
    Nov 12 '18 at 18:39

  • See my edit ... it's verbose as all get out with code you won't need, but I think it's clear what you can discard and what you may want to adapt into your own code.
    – r2evans
    Nov 12 '18 at 19:57

  • I wasn't able to get your code to run properly. I added a parallelization with 4 cores for testing (cl <- makeCluster(4) / registerDoParallel(cl)), and I am getting an error: Error in make.socket(port = .port) : socket not established. This seems to be related to the log function you wrote, and possibly the 4000 port number? I am extremely unfamiliar with this so I am not sure.
    – actuary_meets_data
    Nov 12 '18 at 22:25

  • I am working on a version similar to the second piece you posted (because I agree, the first version is going to do exactly what it is doing, and that is not ideal). However, I am only using one loop for func2, as there is some data that needs to be pulled in for func2 to work based off of the results of func1. I'm trying to determine whether this can be done in a nested parallel loop or if I should stick to my current single parallel loop for func2 within a sequential func1 loop.
    – actuary_meets_data
    Nov 12 '18 at 22:29

  • For your error, I should have mentioned here (and it is mentioned on the link provided) that the netcat listener (e.g., ncat -k -l 4000) needs to be started first. If that's the only problem, none of that code is required for production, just for explanation of performance and parallelism.
    – r2evans
    Nov 12 '18 at 23:00


I appreciate the help provided by r2evans. While I wasn't able to replicate his work (due to my inexperience and my inability to get ncat working on my computer), he helped me realize that my original method wouldn't work as well as splitting the job into two separate parallelized foreach loops, which I have now developed into a working production version.



This is the original proposed solution:



library(doParallel)
library(foreach)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

func1 <- function(int1){
  results <- list(int1, int1 > 2)
  return(results)
}
func2 <- function(int1, int2){
  return(int1/int2)
}

int1list <- seq(1, 3)
int2list <- seq(1, 5)

out <- foreach(i = 1:length(int1list), .combine = rbind) %do% {
  out1 <- func1(int1list[i])
  if (!out1[[2]]) {
    data.frame("Scenario" = i, "Result" = out1[[1]], "UsedJ" = FALSE)
    # next  (next cannot be used inside a foreach body)
  } else {
    foreach(j = 1:length(int2list), .combine = rbind) %dopar% {
      int3 <- func2(out1[[1]], int2list[j])
      data.frame("Scenario" = i, "Result" = int3, "UsedJ" = TRUE)
    }
  }
}

stopCluster(cl)
registerDoSEQ()

out


However, this waits for all of func2's iterations under the first func1 scenario to finish before the second func1 scenario can begin. I elected to split the work into two separate loops instead, like below:



library(doParallel)
library(foreach)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

func1 <- function(int1){
  results <- list(int1, int1 > 2)
  return(results)
}
func2 <- function(int1, int2){
  return(int1/int2)
}

int1list <- seq(1, 3)
int2list <- seq(1, 5)

# First parallel loop: run func1 for every scenario
out1 <- foreach(i = 1:length(int1list)) %dopar% {
  func1(int1list[i])
}

finalOut <- data.frame("Scenario" = integer(), "UsedJ" = logical(), "Result" = double())

# Second loop: run func2 only for the scenarios where func1 returned TRUE
for (i in 1:length(int1list)){
  if (!out1[[i]][[2]]) {
    tempOut <- data.frame("Scenario" = i, "UsedJ" = FALSE, "Result" = NA)
  } else {
    tempOut <- foreach(j = 1:length(int2list), .combine = rbind) %dopar% {
      Result <- func2(out1[[i]][[1]], int2list[j])
      data.frame("Scenario" = i, "UsedJ" = TRUE, "Result" = Result)
    }
  }
  finalOut <- rbind(finalOut, tempOut)
}

stopCluster(cl)
registerDoSEQ()

finalOut


This algorithm fits my purposes nicely. It isn't as efficient as it could be, but it gets the job done without being too wasteful.
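One small refinement worth noting (a sketch under the same setup as above): growing finalOut with rbind inside the for loop copies the entire data frame on every iteration. Collecting the per-scenario results in a list and binding once at the end avoids that quadratic copying:

```r
# Sketch: accumulate per-scenario results in a list, bind once at the end.
pieces <- vector("list", length(int1list))
for (i in seq_along(int1list)) {
  if (!out1[[i]][[2]]) {
    pieces[[i]] <- data.frame(Scenario = i, UsedJ = FALSE, Result = NA)
  } else {
    pieces[[i]] <- foreach(j = seq_along(int2list), .combine = rbind) %dopar% {
      data.frame(Scenario = i, UsedJ = TRUE,
                 Result = func2(out1[[i]][[1]], int2list[j]))
    }
  }
}
finalOut <- do.call(rbind, pieces)
```

The cost difference is negligible for a handful of scenarios but matters when the scenario count grows large.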






answered Nov 13 '18 at 20:00
actuary_meets_data

  • For clarity ... ncat (or nc) needed to be run in a terminal, not in R ... I apologize if that was apparent, but newer users may not have made that leap based on my vague description. Regardless, the use of it was solely to provide indications of function-entry, it was by no means necessary for the parallelization strategy to work.
    – r2evans
    Nov 14 '18 at 16:40

  • I was trying to run it in terminal, but I either ran it in the terminal that came with the installation (where it immediately closed out) or from a clean terminal, where it did not recognize the command. Would I have needed to add this to the system path? I do not have administrative rights on my work computer to add programs to the system path (which is pretty annoying).
    – actuary_meets_data
    Nov 14 '18 at 16:48

  • Admin rights are not required, but I understand the frustration. I do not understand what you mean "ran it in the terminal that came with the installation" ... installation of netcat? It doesn't include a terminal. "Clean terminal"? If on windows, then "Start > Run > cmd", go to the right dir, type in ncat -k -l 4000, it should "do nothing" and not return. If on linux, open an xterm, type in path/to/ncat -k -l 4000, should "do nothing" (and not return). Regardless, does the parallelization technique (without Log) work as intended? We can stop this thread, no need to force netcat. :-)
    – r2evans
    Nov 14 '18 at 16:57