Nested parallel processing with conditional logic error
This one is a bit complicated, so I don't think it would be worthwhile to share the exact code I am working with, but I should be able to get the point across with pseudocode.
A little background: I am trying to parallelize a nested loop of operations. I have two large functions; the first must run and return TRUE before the second can run, and when the second function runs, it loops through several iterations. This is a nested loop because the entire operation above must be repeated for various scenarios.
The pseudocode I am trying to use is below:
Output <- foreach(1 to I, .packages=packages, .combine=rbind) %:%
  Run the first function
  If the first function is false:
    Print and record
  Else:
    foreach(1 to J, .packages=packages, .combine=rbind) %dopar% {
      Run the second function
      Create df summarizing each loop of second function
    }
Here is a simplified version of what I am trying to do and the error I am running into:
library(doParallel)
library(foreach)

func1 <- function(int1){
  results <- list(int1,TRUE)
  return(results)
}
func2 <- function(int2){
  return(int1/int2)
}

int1list <- seq(1,10)
int2list <- seq(1,15)

out <- foreach(i=1:length(int1list),.combine=rbind) %:%
  out1 <- func1(i)
  if(out1[[2]]==FALSE){
    print("fail")
    next
  } else{
    foreach(j=1:length(int2),.combine=rbind) %dopar% {
      int3 <- func2(j)
      data.frame("Scenario"=i,"Result"=int3)
    }
  }
Error: Error in func1(i) : object 'i' not found
When I run the above, it tells me that it can't even find the object "i", which I assume happens because I am calling "i" outside of the innermost loop. I have gotten nested parallelized loops to work before, but never with anything that had to run outside the innermost loop, so I assume the package doesn't know what order to perform things in.
I have a workaround: run the first function in parallel, then run the second function in parallel based on the results of the first loop (essentially two separate loops instead of a nested loop). But I was wondering whether something like the nested loop could be made to work, because I think it would be more efficient. When run in production this code will likely take hours, so saving some time would be worthwhile.
r parallel-processing
Where does the conditional-logic *ERROR* come into the picture? Also, parallelization inside already-parallelized code will most likely end up slowing down the whole thing (the split and merge operations become very costly).
– abhiieor
Nov 12 '18 at 17:07
The error comes into play when the first function runs, since "i" is part of the function call.
– actuary_meets_data
Nov 12 '18 at 17:10
The pseudocode may not be enough, and it's hard to address an R error when we don't have R code. I suspect this pseudocode is based heavily on actual code, so I suggest: come up with two trivial 1-2 line functions (in place of your more complex funcs) and a reproducible question including where i would be coming from. If this is based on subsetting a large dataset of some sort, it might help to give a sample (similarly structured, or a sample from the actual data) as well.
– r2evans
Nov 12 '18 at 17:11
What would you recommend as the most algorithmically efficient way to perform the above? The first function needs to run and succeed before the second function can run; the inner loop will likely iterate about 15-25 times, and the outer loop anywhere between 10 and 500 times.
– actuary_meets_data
Nov 12 '18 at 17:12
That is a good recommendation, r2evans; I will edit with an update.
– actuary_meets_data
Nov 12 '18 at 17:12
edited Nov 12 '18 at 17:23
asked Nov 12 '18 at 17:02
actuary_meets_data
2 Answers
I'm not a pro at foreach, but a few things stand out:
- func2 references both int1 and int2 but is only given the latter; this might be an artifact of your simplified example, maybe not?
- your code here needs to be enclosed in a curly block, i.e., you need to change from

out <- foreach(i=1:length(int1list),.combine=rbind) %:%
  out1 <- func1(i)
  if(out1[[2]]==FALSE) ...

to

out <- foreach(i=1:length(int1list),.combine=rbind) %:% {
  out1 <- func1(i)
  if(out1[[2]]==FALSE) ...
}

- the docs for foreach say that the binary operator %:% is a nesting operator used between two foreach calls, but you aren't doing that. I think I got it to work correctly with %do% (or %dopar%)
- I don't think prints work well inside parallel foreach loops ... they might work fine on the master node but not on the others, ref: How can I print when using %dopar%
- possibly again due to the simplified example, you define but don't actually use the contents of int1list (just its length); I'll remedy that in this example
- next works in "normal" R loops, not in these specialized foreach loops; it isn't a problem here, though, since your if/else structure provides the same effect
Here's your example, modified slightly to account for all of the above. I added a UsedJ column to indicate whether the inner loop was used.
library(doParallel)
library(foreach)

func1 <- function(int1){
  results <- list(int1, int1 > 2)
  return(results)
}
func2 <- function(int1, int2){
  return(int1/int2)
}

int1list <- seq(1,3)
int2list <- seq(1,5)

out <- foreach(i=1:length(int1list), .combine=rbind) %do% {
  out1 <- func1(int1list[i])
  if(!out1[[2]]){
    data.frame("Scenario"=i, "Result"=out1[[1]], UsedJ=FALSE)
    # next
  } else{
    foreach(j=1:length(int2list), .combine=rbind) %dopar% {
      int3 <- func2(out1[[1]], int2list[j])
      data.frame("Scenario"=i, "Result"=int3, UsedJ=TRUE)
    }
  }
}
out
# Scenario Result UsedJ
# 1 1 1.00 FALSE
# 2 2 2.00 FALSE
# 3 3 3.00 TRUE
# 4 3 1.50 TRUE
# 5 3 1.00 TRUE
# 6 3 0.75 TRUE
# 7 3 0.60 TRUE
Edit
If you aren't seeing parallelization, perhaps it's because you have not set up a "cluster" yet. There are also a few other changes to the workflow needed to get it to parallelize well, based on foreach's method of nesting loops with the %:% operator.
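As a minimal sketch of that missing setup (the worker count here is an arbitrary choice for illustration), registering a cluster before any %dopar% loop looks like:

```r
library(doParallel)  # also loads foreach and parallel

cl <- makeCluster(3)    # e.g., 3 workers; detectCores() is common in practice
registerDoParallel(cl)  # without this, %dopar% falls back to sequential execution

# ... run the foreach loops here ...

stopCluster(cl)         # release the workers when done
registerDoSEQ()         # restore the sequential backend
```

If no backend is registered, %dopar% runs sequentially and foreach emits a warning, which can look like "parallelization isn't working."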
In order to "prove" this is working in parallel, I've added some logging based on How can I print when using %dopar% (because parallel processes do not print as one might hope).
library(doParallel)
library(foreach)

Log <- function(text, ..., .port = 4000, .sock = make.socket(port = .port)) {
  msg <- sprintf(paste0(as.character(Sys.time()), ": ", text, "\n"), ...)
  write.socket(.sock, msg)
  close.socket(.sock)
}
func1 <- function(int1) {
  Log(paste("func1", int1))
  Sys.sleep(5)
  results <- list(int1, int1 > 2)
  return(results)
}
func2 <- function(int1, int2) {
  Log(paste("func2", int1, int2))
  Sys.sleep(1)
  return(int1 / int2)
}
The use of the logging code requires an external way to read from that socket. I'm using netcat (nc, or Nmap's ncat) with ncat -k -l 4000 here. It is certainly not required for the job to work, but it is handy for seeing how things are progressing. (Note: this listener/server needs to be running before you try to use Log.)
I couldn't get the nested "foreach -> func1 -> foreach -> func2" to parallelize func2 correctly. Based on the sleeps, this should take 5 seconds for the three calls to func1 and 2 seconds (two batches: three, then two) for the five calls to func2, but it takes 10 seconds (three parallel calls to func1, then five sequential calls to func2):
system.time(
  out <- foreach(i=1:length(int1list), .combine=rbind, .packages="foreach") %dopar% {
    out1 <- func1(int1list[i])
    if (!out1[[2]]) {
      data.frame(Scenario=i, Result=out1[[1]], UsedJ=FALSE)
    } else {
      foreach(j=1:length(int2list), .combine=rbind) %dopar% {
        int3 <- func2(out1[[1]], int2list[j])
        data.frame(Scenario=i, Result=int3, UsedJ=TRUE)
      }
    }
  }
)
# user system elapsed
# 0.02 0.00 10.09
with the respective console output:
2018-11-12 11:51:17: func1 2
2018-11-12 11:51:17: func1 1
2018-11-12 11:51:17: func1 3
2018-11-12 11:51:23: func2 3 1
2018-11-12 11:51:24: func2 3 2
2018-11-12 11:51:25: func2 3 3
2018-11-12 11:51:26: func2 3 4
2018-11-12 11:51:27: func2 3 5
(note that the order is not guaranteed.)
So we can break it out, computing the func1 stuff first:

system.time(
  out1 <- foreach(i = seq_along(int1list)) %dopar% {
    func1(int1list[i])
  }
)
# user system elapsed
# 0.02 0.01 5.03
str(out1)
# List of 3
# $ :List of 2
# ..$ : int 1
# ..$ : logi FALSE
# $ :List of 2
# ..$ : int 2
# ..$ : logi FALSE
# $ :List of 2
# ..$ : int 3
# ..$ : logi TRUE
console:
2018-11-12 11:53:21: func1 2
2018-11-12 11:53:21: func1 1
2018-11-12 11:53:21: func1 3
then work on the func2 stuff:
system.time(
  out2 <- foreach(i = seq_along(int1list), .combine="rbind") %:%
    foreach(j = seq_along(int2list), .combine="rbind") %dopar% {
      Log(paste("preparing", i, j))
      if (out1[[i]][[2]]) {
        int3 <- func2(out1[[i]][[1]], j)
        data.frame(i=i, j=j, Result=int3, UsedJ=FALSE)
      } else if (j == 1L) {
        data.frame(i=i, j=NA_integer_, Result=out1[[i]][[1]], UsedJ=FALSE)
      }
    }
)
# user system elapsed
# 0.03 0.00 2.05
out2
# i j Result UsedJ
# 1 1 NA 1.00 FALSE
# 2 2 NA 2.00 FALSE
# 3 3 1 3.00 FALSE
# 4 3 2 1.50 FALSE
# 5 3 3 1.00 FALSE
# 6 3 4 0.75 FALSE
# 7 3 5 0.60 FALSE
Two seconds (first batch of three is 1 second, second batch of two is 1 second) is what I expected. Console:
2018-11-12 11:54:01: preparing 1 2
2018-11-12 11:54:01: preparing 1 3
2018-11-12 11:54:01: preparing 1 1
2018-11-12 11:54:01: preparing 1 4
2018-11-12 11:54:01: preparing 1 5
2018-11-12 11:54:01: preparing 2 1
2018-11-12 11:54:01: preparing 2 2
2018-11-12 11:54:01: preparing 2 3
2018-11-12 11:54:01: preparing 2 4
2018-11-12 11:54:01: preparing 2 5
2018-11-12 11:54:01: preparing 3 1
2018-11-12 11:54:01: preparing 3 2
2018-11-12 11:54:01: func2 3 1
2018-11-12 11:54:01: preparing 3 3
2018-11-12 11:54:01: func2 3 2
2018-11-12 11:54:01: func2 3 3
2018-11-12 11:54:02: preparing 3 4
2018-11-12 11:54:02: preparing 3 5
2018-11-12 11:54:02: func2 3 4
2018-11-12 11:54:02: func2 3 5
You can see that func2 is called five times, correctly. Unfortunately, there is a lot of "spinning" internally in the loop. Granted, it's effectively a no-op (as evidenced by the 2.05-second runtime), so the load on the nodes is negligible.
If somebody has a method to preclude this needless spinning, I welcome comments or "competing" answers.
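One possible way to avoid the spinning, sketched here under the same toy setup (the lapply stand-in replaces the parallel func1 phase, and "passes when i > 2" mirrors the example's func1), is to filter out the failed scenarios before entering the nested loop, so the workers only ever see real work:

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# Stand-in for the func1 phase: scenario i "passes" when i > 2
out1 <- lapply(1:3, function(i) list(i, i > 2))

# Keep only the scenarios whose func1 result was TRUE
keep <- which(vapply(out1, `[[`, logical(1), 2))

# The nested loop now iterates only over passing scenarios: no wasted tasks
out2 <- foreach(i = keep, .combine = rbind) %:%
  foreach(j = 1:5, .combine = rbind) %dopar% {
    data.frame(Scenario = i, j = j, Result = out1[[i]][[1]] / j)
  }

stopCluster(cl)
```

The failed scenarios would then be recorded separately (e.g., with a small sequential rbind), which costs essentially nothing compared to the parallel work.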
Yes - apologies for the sloppy sample, all of your assumptions were appropriate. Your code works and fixes the issue I was having, but upon adapting this and running it, it is not parallelizing the work in the way I had intended. I was intending for it to try to utilize all 8 processors on my server in order to get the job done, but it appears this nested strategy results in the algorithm doing func1 followed by func2 a repeated number of times before moving onto the second iteration of i (which makes sense now that I think about it). Doing 2 separate loops should be more efficient for this.
– actuary_meets_data
Nov 12 '18 at 18:39
See my edit ... it's verbose as all get out with code you won't need, but I think it's clear what you can discard and what you may want to adapt into your own code.
– r2evans
Nov 12 '18 at 19:57
I wasn't able to get your code to run properly. I added a parallelization with 4 cores for testing (cl <- makeCluster(4) / registerDoParallel(cl)), and I am getting an error: Error in make.socket(port = .port) : socket not established. This seems to be related to the log function you wrote, and possibly the 4000 port number? I am extremely unfamiliar with this so I am not sure.
– actuary_meets_data
Nov 12 '18 at 22:25
I am working on a version similar to the second piece you posted (because I agree, the first version is going to do exactly what it is doing, and that is not ideal). However, I am only using one loop for func2, as there is some data that needs to be pulled in for func2 to work based off of the results of func1. I'm trying to determine whether this can be done in a nested parallel loop or if I should stick to my current single parallel loop for func2 within a sequential func1 loop.
– actuary_meets_data
Nov 12 '18 at 22:29
For your error, I should have mentioned here (and it is mentioned in the link provided) that the netcat listener (e.g., ncat -k -l 4000) needs to be started first. If that's the only problem, none of that code is required for production, just for explanation of performance and parallelism.
– r2evans
Nov 12 '18 at 23:00
I appreciate the help provided by r2evans. While I wasn't able to replicate his work (due to my inexperience and inability to get ncat working on my computer), he helped me realize that my original method wouldn't work as well as splitting the job into two separate parallelized foreach loops, which I have now gotten to a working production version.
This is the original proposed solution:
library(doParallel)
library(foreach)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

func1 <- function(int1){
  results <- list(int1, int1 > 2)
  return(results)
}
func2 <- function(int1, int2){
  return(int1/int2)
}

int1list <- seq(1,3)
int2list <- seq(1,5)

out <- foreach(i=1:length(int1list), .combine=rbind) %do% {
  out1 <- func1(int1list[i])
  if(!out1[[2]]){
    data.frame("Scenario"=i, "Result"=out1[[1]], UsedJ=FALSE)
    # next
  } else{
    foreach(j=1:length(int2list), .combine=rbind) %dopar% {
      int3 <- func2(out1[[1]], int2list[j])
      data.frame("Scenario"=i, "Result"=int3, UsedJ=TRUE)
    }
  }
}

stopCluster(cl)
registerDoSEQ()
out
However, this results in a loop that waits for the first iteration's func2 calls to complete before beginning the second iteration of func1. I elected to split this into two separate loops, like below:
library(doParallel)
library(foreach)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

func1 <- function(int1){
  results <- list(int1, int1 > 2)
  return(results)
}
func2 <- function(int1, int2){
  return(int1/int2)
}

int1list <- seq(1,3)
int2list <- seq(1,5)

# First parallel loop: run func1 for every scenario
out1 <- foreach(i=1:length(int1list)) %dopar% {
  func1(int1list[i])
}

# Second stage: for each scenario that passed, run func2 in parallel
finalOut <- data.frame("Scenario"=integer(), "UsedJ"=logical(), "Result"=double())
for (i in 1:length(int1list)){
  if(out1[[i]][[2]] == FALSE){
    tempOut <- data.frame("Scenario"=i, "UsedJ"=FALSE, "Result"=NA)
  } else{
    tempOut <- foreach(j=1:length(int2list), .combine=rbind) %dopar% {
      Result <- func2(out1[[i]][[1]], int2list[j])
      data.frame("Scenario"=i, "UsedJ"=TRUE, "Result"=Result)
    }
  }
  finalOut <- rbind(finalOut, tempOut)  # accumulate each scenario's results
}

stopCluster(cl)
registerDoSEQ()
finalOut
This algorithm seems to fit my purposes nicely. It isn't as efficient as it could be, but it should get the job done and not be too wasteful.
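A possible refinement (a sketch under the same toy setup, not tested against the production workload): flatten the valid scenario/iteration pairs into one task list, so a single %dopar% loop keeps all workers busy instead of launching one small inner loop per scenario.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# Stand-in for the parallel func1 phase (scenario passes when its value > 2)
out1 <- lapply(seq(1, 3), function(i) list(i, i > 2))

# Build every (scenario, j) pair that actually needs func2
passed <- which(vapply(out1, `[[`, logical(1), 2))
grid <- expand.grid(i = passed, j = seq(1, 5))

# One flat parallel loop over all the real work
finalOut <- foreach(k = seq_len(nrow(grid)), .combine = rbind) %dopar% {
  i <- grid$i[k]
  j <- grid$j[k]
  data.frame(Scenario = i, UsedJ = TRUE, Result = out1[[i]][[1]] / j)
}

stopCluster(cl)
```

With 15-25 inner iterations per scenario and up to 500 scenarios, one flat task list of this kind amortizes the split/merge overhead better than many short inner loops.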
For clarity ... ncat (or nc) needed to be run in a terminal, not in R ... I apologize if that wasn't apparent; newer users may not have made that leap based on my vague description. Regardless, the use of it was solely to provide indications of function entry; it was by no means necessary for the parallelization strategy to work.
– r2evans
Nov 14 '18 at 16:40
I was trying to run it in terminal, but I either ran it in the terminal that came with the installation (where it immediately closed out) or from a clean terminal, where it did not recognize the command. Would I have needed to add this to the system path? I do not have administrative rights on my work computer to add programs to the system path (which is pretty annoying).
– actuary_meets_data
Nov 14 '18 at 16:48
Admin rights are not required, but I understand the frustration. I do not understand what you mean by "ran it in the terminal that came with the installation" ... installation of netcat? It doesn't include a terminal. "Clean terminal"? If on Windows, then "Start > Run > cmd", go to the right dir, type in ncat -k -l 4000; it should "do nothing" and not return. If on Linux, open an xterm and type in path/to/ncat -k -l 4000; it should "do nothing" (and not return). Regardless, does the parallelization technique (without Log) work as intended? We can stop this thread, no need to force netcat. :-)
– r2evans
Nov 14 '18 at 16:57
I'm not a pro at foreach
, but there are a few things to this that stand out:
func2
references bothint1
andint2
but it is only given the latter; this might be an artifact of your simplified example, maybe not?
your code here needs to be enclosed in a curly block, i.e., you need to change from
out <- foreach(i=1:length(int1list),.combine=rbind) %:%
out1 <- func1(i)
if(out1[[2]]==FALSE) ...
to
out <- foreach(i=1:length(int1list),.combine=rbind) %:% {
out1 <- func1(i)
if(out1[[2]]==FALSE) ...
}
- the docs for
foreach
suggest that the binary operator%:%
is a nesting operator that is used between twoforeach
calls, but you aren't doing that. I think I get it to work correctly with%do%
(or%dopar%
) - I don't think
print
s work well inside parallelforeach
loops ... it might work find on the master node but not on all others, ref: How can I print when using %dopar%
- possibly again due to simplified example, you define but don't actually use the contents of
int1list
(just its length), I'll remedy in this example
next
works in "normal" R loops, not in these specializedforeach
loops; it isn't a problem, though, since yourif
/else
structure provides the same effect
Here's your example, modified slightly to account for all of the above. I add UsedJ
to indicate
library(doParallel)
library(foreach)
func1 <- function(int1){
results <- list(int1,int1>2)
return(results)
}
func2 <- function(int1,int2){
return(int1/int2)
}
int1list <- seq(1,3)
int2list <- seq(1,5)
out <- foreach(i=1:length(int1list),.combine=rbind) %do% {
out1 <- func1(int1list[i])
if(!out1[[2]]){
data.frame("Scenario"=i, "Result"=out1[[1]], UsedJ=FALSE)
# next
} else{
foreach(j=1:length(int2list),.combine=rbind) %dopar% {
int3 <- func2(out1[[1]], int2list[j])
data.frame("Scenario"=i,"Result"=int3, UsedJ=TRUE)
}
}
}
out
# Scenario Result UsedJ
# 1 1 1.00 FALSE
# 2 2 2.00 FALSE
# 3 3 3.00 TRUE
# 4 3 1.50 TRUE
# 5 3 1.00 TRUE
# 6 3 0.75 TRUE
# 7 3 0.60 TRUE
Edit
If you aren't seeing parallelization, perhaps it's because you have not set up a "cluster" yet. There are also a few other changes to the workflow to get it to parallelize well, based on foreach's method of nesting loops with the %:% operator.
In order to "prove" this is working in parallel, I've added some logging based on How can I print when using %dopar% (because parallel processes do not print as one might hope).
library(doParallel)
library(foreach)
Log <- function(text, ..., .port = 4000, .sock = make.socket(port=.port)) {
msg <- sprintf(paste0(as.character(Sys.time()), ": ", text, "\n"), ...)
write.socket(.sock, msg)
close.socket(.sock)
}
func1 <- function(int1) {
Log(paste("func1", int1))
Sys.sleep(5)
results <- list(int1, int1 > 2)
return(results)
}
func2 <- function(int1, int2) {
Log(paste("func2", int1, int2))
Sys.sleep(1)
return(int1 / int2)
}
The use of the logging code requires an external way to read from that socket. I'm using netcat (nc, or Nmap's ncat) with ncat -k -l 4000 here. It is certainly not required for the job to work, but it is handy to see how things are progressing. (Note: this listener/server needs to be running before you try to use Log.)
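If running a socket listener is inconvenient, a file-based variant (my own sketch, not part of the answer) avoids the external dependency, at the cost of possibly interleaved lines when many workers write at once:

```r
# Hypothetical file-based alternative to Log(): append each timestamped
# message to a shared log file instead of writing to a socket.
LogFile <- function(text, ..., .file = "foreach.log") {
  msg <- sprintf(paste0(as.character(Sys.time()), ": ", text, "\n"), ...)
  cat(msg, file = .file, append = TRUE)
}
```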
I couldn't get the nested "foreach -> func1 -> foreach -> func2" to parallelize func2 correctly. Based on the sleeps, this should take 5 seconds for the three calls to func1 and 2 seconds (two batches: three calls, then two) for the five calls to func2, but it takes 10 seconds (three parallel calls to func1, then five sequential calls to func2):
system.time(
out <- foreach(i=1:length(int1list), .combine=rbind, .packages="foreach") %dopar% {
out1 <- func1(int1list[i])
if (!out1[[2]]) {
data.frame(Scenario=i, Result=out1[[1]], UsedJ=FALSE)
} else {
foreach(j=1:length(int2list), .combine=rbind) %dopar% {
int3 <- func2(out1[[1]], int2list[j])
data.frame(Scenario=i, Result=int3, UsedJ=TRUE)
}
}
}
)
# user system elapsed
# 0.02 0.00 10.09
with the respective console output:
2018-11-12 11:51:17: func1 2
2018-11-12 11:51:17: func1 1
2018-11-12 11:51:17: func1 3
2018-11-12 11:51:23: func2 3 1
2018-11-12 11:51:24: func2 3 2
2018-11-12 11:51:25: func2 3 3
2018-11-12 11:51:26: func2 3 4
2018-11-12 11:51:27: func2 3 5
(note that the order is not guaranteed.)
So we can break it out into computing the func1 stuff first:
system.time(
out1 <- foreach(i = seq_along(int1list)) %dopar% {
func1(int1list[i])
}
)
# user system elapsed
# 0.02 0.01 5.03
str(out1)
# List of 3
# $ :List of 2
# ..$ : int 1
# ..$ : logi FALSE
# $ :List of 2
# ..$ : int 2
# ..$ : logi FALSE
# $ :List of 2
# ..$ : int 3
# ..$ : logi TRUE
console:
2018-11-12 11:53:21: func1 2
2018-11-12 11:53:21: func1 1
2018-11-12 11:53:21: func1 3
then work on the func2 stuff:
system.time(
out2 <- foreach(i = seq_along(int1list), .combine="rbind") %:%
foreach(j = seq_along(int2list), .combine="rbind") %dopar% {
Log(paste("preparing", i, j))
if (out1[[i]][[2]]) {
int3 <- func2(out1[[i]][[1]], j)
data.frame(i=i, j=j, Result=int3, UsedJ=TRUE)
} else if (j == 1L) {
data.frame(i=i, j=NA_integer_, Result=out1[[i]][[1]], UsedJ=FALSE)
}
}
)
# user system elapsed
# 0.03 0.00 2.05
out2
# i j Result UsedJ
# 1 1 NA 1.00 FALSE
# 2 2 NA 2.00 FALSE
# 3 3 1 3.00 TRUE
# 4 3 2 1.50 TRUE
# 5 3 3 1.00 TRUE
# 6 3 4 0.75 TRUE
# 7 3 5 0.60 TRUE
Two seconds (first batch of three is 1 second, second batch of two is 1 second) is what I expected. Console:
2018-11-12 11:54:01: preparing 1 2
2018-11-12 11:54:01: preparing 1 3
2018-11-12 11:54:01: preparing 1 1
2018-11-12 11:54:01: preparing 1 4
2018-11-12 11:54:01: preparing 1 5
2018-11-12 11:54:01: preparing 2 1
2018-11-12 11:54:01: preparing 2 2
2018-11-12 11:54:01: preparing 2 3
2018-11-12 11:54:01: preparing 2 4
2018-11-12 11:54:01: preparing 2 5
2018-11-12 11:54:01: preparing 3 1
2018-11-12 11:54:01: preparing 3 2
2018-11-12 11:54:01: func2 3 1
2018-11-12 11:54:01: preparing 3 3
2018-11-12 11:54:01: func2 3 2
2018-11-12 11:54:01: func2 3 3
2018-11-12 11:54:02: preparing 3 4
2018-11-12 11:54:02: preparing 3 5
2018-11-12 11:54:02: func2 3 4
2018-11-12 11:54:02: func2 3 5
You can see that func2 is called five times, correctly. Unfortunately, there is a lot of "spinning" internally in the loop. Granted, it's effectively a no-op (as evidenced by the 2.05-second runtime), so the load on the nodes is negligible.
If somebody has a method to preclude this needless spinning, I welcome comments or "competing" answers.
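One way to preclude that spinning (my own sketch, not from the answer) is to materialize the (i, j) job list after the func1 pass and run a single flat loop over only the jobs that actually need func2; swap %do% for %dopar% once a backend is registered:

```r
library(foreach)

# Stand-ins matching the example above.
func1 <- function(int1) list(int1, int1 > 2)
func2 <- function(int1, int2) int1 / int2
int1list <- 1:3
int2list <- 1:5

out1 <- lapply(int1list, func1)  # the func1 pass (parallelizable as shown above)

# Build the (i, j) pairs up front: only scenarios where func1 returned TRUE.
jobs <- do.call(rbind, lapply(seq_along(out1), function(i) {
  if (out1[[i]][[2]]) data.frame(i = i, j = seq_along(int2list)) else NULL
}))

# One flat loop over real work only; no conditional no-ops inside the workers.
out2 <- foreach(k = seq_len(nrow(jobs)), .combine = rbind) %do% {
  i <- jobs$i[k]; j <- jobs$j[k]
  data.frame(i = i, j = j, Result = func2(out1[[i]][[1]], int2list[j]))
}
```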
Yes - apologies for the sloppy sample, all of your assumptions were appropriate. Your code works and fixes the issue I was having, but upon adapting this and running it, it is not parallelizing the work in the way I had intended. I was intending for it to try to utilize all 8 processors on my server in order to get the job done, but it appears this nested strategy results in the algorithm doing func1 followed by func2 a repeated number of times before moving onto the second iteration of i (which makes sense now that I think about it). Doing 2 separate loops should be more efficient for this.
– actuary_meets_data
Nov 12 '18 at 18:39
See my edit ... it's verbose as all get out with code you won't need, but I think it's clear what you can discard and what you may want to adapt into your own code.
– r2evans
Nov 12 '18 at 19:57
I wasn't able to get your code to run properly. I added a parallelization with 4 cores for testing (cl <- makeCluster(4) / registerDoParallel(cl)), and I am getting an error: Error in make.socket(port = .port) : socket not established. This seems to be related to the log function you wrote, and possibly the 4000 port number? I am extremely unfamiliar with this so I am not sure.
– actuary_meets_data
Nov 12 '18 at 22:25
I am working on a version similar to the second piece you posted (because I agree, the first version is going to do exactly what it is doing, and that is not ideal). However, I am only using one loop for func2, as there is some data that needs to be pulled in for func2 to work based off of the results of func1. I'm trying to determine whether this can be done in a nested parallel loop or if I should stick to my current single parallel loop for func2 within a sequential func1 loop.
– actuary_meets_data
Nov 12 '18 at 22:29
For your error, I should have mentioned here (and it is mentioned in the link provided) that the netcat listener (e.g., ncat -k -l 4000) needs to be started first. If that's the only problem, none of that code is required for production, just for explanation of performance and parallelism.
– r2evans
Nov 12 '18 at 23:00
answered Nov 12 '18 at 17:43 (edited Nov 12 '18 at 23:01) – r2evans
Yes - apologies for the sloppy sample, all of your assumptions were appropriate. Your code works and fixes the issue I was having, but upon adapting this and running it, it is not parallelizing the work in the way I had intended. I was intending for it to try to utilize all 8 processors on my server in order to get the job done, but it appears this nested strategy results in the algorithm doing func1 followed by func2 a repeated number of times before moving onto the second iteration of i (which makes sense now that I think about it). Doing 2 separate loops should be more efficient for this.
– actuary_meets_data
Nov 12 '18 at 18:39
See my edit ... it's verbose as all get out with code you won't need, but I think it's clear what you can discard and what you may want to adapt into your own code.
– r2evans
Nov 12 '18 at 19:57
I wasn't able to get your code to run properly. I added a parallelization with 4 cores for testing (cl <- makeCluster(4) / registerDoParallel(cl)), and I am getting an error: Error in make.socket(port = .port) : socket not established. This seems to be related to the log function you wrote, and possibly the 4000 port number? I am extremely unfamiliar with this so I am not sure.
– actuary_meets_data
Nov 12 '18 at 22:25
I am working on a version similar to the second piece you posted (because I agree, the first version is going to do exactly what it is doing, and that is not ideal). However, I am only using one loop for func2, as there is some data that needs to be pulled in for func2 to work based off of the results of func1. I'm trying to determine whether this can be done in a nested parallel loop or if I should stick to my current single parallel loop for func2 within a sequential func1 loop.
– actuary_meets_data
Nov 12 '18 at 22:29
For your error, I should have mentioned here (and it is mentioned on the link provided) that the netcat listener (e.g.,ncat -k -l 4000
) needs to be started first. If that's the only problem, none of that code is required for production, just for explanation of performance and parallelism.
– r2evans
Nov 12 '18 at 23:00
|
show 1 more comment
Yes - apologies for the sloppy sample, all of your assumptions were appropriate. Your code works and fixes the issue I was having, but upon adapting this and running it, it is not parallelizing the work in the way I had intended. I was intending for it to try to utilize all 8 processors on my server in order to get the job done, but it appears this nested strategy results in the algorithm doing func1 followed by func2 a repeated number of times before moving onto the second iteration of i (which makes sense now that I think about it). Doing 2 separate loops should be more efficient for this.
– actuary_meets_data
Nov 12 '18 at 18:39
See my edit ... it's verbose as all get out with code you won't need, but I think it's clear what you can discard and what you may want to adapt into your own code.
– r2evans
Nov 12 '18 at 19:57
I wasn't able to get your code to run properly. I added a parallelization with 4 cores for testing (cl <- makeCluster(4) / registerDoParallel(cl)), and I am getting an error: Error in make.socket(port = .port) : socket not established. This seems to be related to the log function you wrote, and possibly the 4000 port number? I am extremely unfamiliar with this so I am not sure.
– actuary_meets_data
Nov 12 '18 at 22:25
I am working on a version similar to the second piece you posted (because I agree, the first version is going to do exactly what it is doing, and that is not ideal). However, I am only using one loop for func2, as there is some data that needs to be pulled in for func2 to work based off of the results of func1. I'm trying to determine whether this can be done in a nested parallel loop or if I should stick to my current single parallel loop for func2 within a sequential func1 loop.
– actuary_meets_data
Nov 12 '18 at 22:29
I appreciate the help provided by r2evans. While I wasn't able to replicate his work due to my inexperience and inability to get ncat working on my computer, he helped me realize that my original method wouldn't work as well as splitting the job into two separate parallelized foreach loops, which I have now gotten to a working production version.
This is the original proposed solution:
library(doParallel)
library(foreach)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
func1 <- function(int1){
results <- list(int1,int1>2)
return(results)
}
func2 <- function(int1,int2){
return(int1/int2)
}
int1list <- seq(1,3)
int2list <- seq(1,5)
out <- foreach(i=1:length(int1list),.combine=rbind) %do% {
out1 <- func1(int1list[i])
if(!out1[[2]]){
data.frame("Scenario"=i, "Result"=out1[[1]], UsedJ=FALSE)
# next
} else{
foreach(j=1:length(int2list),.combine=rbind) %dopar% {
int3 <- func2(out1[[1]], int2list[j])
data.frame("Scenario"=i,"Result"=int3, UsedJ=TRUE)
}
}
}
stopCluster(cl)
registerDoSEQ()
out
However, this results in a loop that waits for all of func2's iterations under the first func1 scenario to complete before beginning the second and subsequent iterations of func1. I elected to split this into two separate loops, like below:
library(doParallel)
library(foreach)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
func1 <- function(int1){
results <- list(int1,int1>2)
return(results)
}
func2 <- function(int1,int2){
return(int1/int2)
}
int1list <- seq(1,3)
int2list <- seq(1,5)
out1 <- foreach(i=1:length(int1list)) %dopar% {
  func1(int1list[i])
}
finalOut <- data.frame("Scenario"=integer(),"UsedJ"=logical(),"Result"=double())
for (i in 1:length(int1list)){
  # out1 is a list of func1 results, so index by scenario first
  if(!out1[[i]][[2]]){
    tempOut <- data.frame("Scenario"=i,"UsedJ"=FALSE,"Result"=NA)
  } else{
    tempOut <- foreach(j=1:length(int2list),.combine=rbind) %dopar% {
      Result <- func2(out1[[i]][[1]], int2list[j])
      data.frame("Scenario"=i,"UsedJ"=TRUE,"Result"=Result)
    }
  }
  finalOut <- rbind(finalOut, tempOut)  # accumulate each scenario's rows
}
stopCluster(cl)
registerDoSEQ()
finalOut
This algorithm seems to fit my purposes nicely. It isn't as efficient as it could be, but it should get the job done and not be too wasteful.
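For completeness, a fully nested version is also possible using foreach's %:% nesting operator, which flattens the i-by-j grid into a single stream of tasks so all workers can stay busy across scenarios. The sketch below is my own adaptation of the toy functions above, not production code; note the trade-off that func1 is re-evaluated for every (i, j) task, which is wasteful when func1 is expensive (the two-loop version above avoids that by computing func1 once per scenario).

```r
library(doParallel)
library(foreach)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

func1 <- function(int1) list(int1, int1 > 2)
func2 <- function(int1, int2) int1 / int2

int1list <- seq(1, 3)
int2list <- seq(1, 5)

# %:% turns the nested foreach into one task per (i, j) pair, so the
# scheduler is not blocked waiting for one scenario's inner loop.
out <- foreach(i = 1:length(int1list), .combine = rbind) %:%
  foreach(j = 1:length(int2list), .combine = rbind) %dopar% {
    out1 <- func1(int1list[i])      # re-run per task: cheap here, maybe not in production
    if (!out1[[2]]) {
      # emit a single "failed scenario" row (rbind drops the NULLs)
      if (j == 1) data.frame(Scenario = i, UsedJ = FALSE, Result = NA) else NULL
    } else {
      data.frame(Scenario = i, UsedJ = TRUE, Result = func2(out1[[1]], int2list[j]))
    }
  }

stopCluster(cl)
registerDoSEQ()
out
```

Whether this beats the two-loop version depends on how expensive func1 is relative to func2 and how unbalanced the scenarios are.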
For clarity ... ncat (or nc) needed to be run in a terminal, not in R ... I apologize if that wasn't apparent, but newer users may not have made that leap based on my vague description. Regardless, it was used solely to provide indications of function entry; it was by no means necessary for the parallelization strategy to work.
– r2evans
Nov 14 '18 at 16:40
I was trying to run it in terminal, but I either ran it in the terminal that came with the installation (where it immediately closed out) or from a clean terminal, where it did not recognize the command. Would I have needed to add this to the system path? I do not have administrative rights on my work computer to add programs to the system path (which is pretty annoying).
– actuary_meets_data
Nov 14 '18 at 16:48
Admin rights are not required, but I understand the frustration. I don't understand what you mean by "ran it in the terminal that came with the installation" ... installation of netcat? It doesn't include a terminal. "Clean terminal"? If on Windows, then "Start > Run > cmd", go to the right dir, and type in ncat -k -l 4000; it should "do nothing" and not return. If on Linux, open an xterm and type in path/to/ncat -k -l 4000; it should "do nothing" (and not return). Regardless, does the parallelization technique (without Log) work as intended? We can stop this thread, no need to force netcat. :-)
– r2evans
Nov 14 '18 at 16:57
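For readers who hit the same "socket not established" error: the Log helper under discussion presumably looked something like the sketch below. This is a reconstruction, not r2evans' actual code; the function name Log and port 4000 are taken from the comments. It only works if ncat -k -l 4000 is already listening in a separate terminal, which is exactly why make.socket fails otherwise.

```r
# Reconstructed sketch (assumed, not the original): each call connects to a
# local ncat listener and writes a timestamped line, so progress messages
# from parallel workers all land in one terminal window.
Log <- function(msg, .port = 4000) {
  s <- make.socket("localhost", port = .port)  # fails if no listener is running
  on.exit(close.socket(s), add = TRUE)
  write.socket(s, paste0(format(Sys.time(), "%H:%M:%OS3"), " ", msg, "\n"))
}

# usage inside a %dopar% body, e.g.:
# Log(sprintf("func2 start: i=%d j=%d", i, j))
```

make.socket, write.socket, and close.socket are in base R's utils package, so no extra R packages are needed; only the external listener is.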
answered Nov 13 '18 at 20:00 by actuary_meets_data
Where is that conditional logic *ERROR* part coming into the picture? Also, parallelization inside already-parallelized code will most likely end up slowing down the whole code (due to split and merge operations becoming very costly).
– abhiieor
Nov 12 '18 at 17:07
The error is coming into play when the first function is ran, since "i" is part of the function call.
– actuary_meets_data
Nov 12 '18 at 17:10
The pseudocode may not be enough, and it's hard to address an R error when we don't have R code. I suspect this pseudocode is based heavily on actual code, so I suggest: come up with two trivial 1-2 line functions (in place of your more complex funcs) and a reproducible question, including where i would be coming from. If this is based on subsetting a large dataset of some sort, it might help to give a sample (similarly structured, or a sample from the actual data) as well.
– r2evans
Nov 12 '18 at 17:11
What would you recommend as the most algorithmically efficient way to perform the above? The first function needs to run and succeed before the second function can run, the inner loop will likely need to loop about 15-25 times. The outer loop will likely be looping anywhere between 10 and 500 times.
– actuary_meets_data
Nov 12 '18 at 17:12
That is a good recommendation, r2evans; I will edit with an update.
– actuary_meets_data
Nov 12 '18 at 17:12