R - return row indexes using grepl
I would like to return the row index for a text match. The text match is done via the adist function in R. The code below creates some text data, calculates the edit distances, and returns the 5 best matches.
name <- c("holiday inn", "geico", "zgf", "morton phillips")
address <- c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md",
"811 quincy st washington dc", "1911 1st st rockville md")
source1 <- data.frame(name, address)
name <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
"joes crag shack", "mike lowry place", "holiday inn", "zummer")
name2 <- c(NA, NA, NA, NA, NA, NA, "hi express", "zummer gunsul frasca")
address <- c("2 reads way new castle de", "248 w 4th st newark de",
"1100 21st st nw washington dc", "1804 w 5th st wilmington de",
"1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
"400 lafayette pl tupelo ms", "811 quincy st washington dc")
source2 <- data.frame(name, name2, address)
#calculate edit distance for name and address
dist.mat.nm <- adist(source1$name, source2$name, partial = T, ignore.case = TRUE)
dist.mat.ad <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)
#assemble data frame
imat <- apply(dist.mat.nm, 1, order)[1:5, ]
top.nm <- data.frame(name = source1$name)
tmp <- apply(imat, 1, function(i) source2$name[i])
colnames(tmp) <- paste("top", 1:5, sep = ".")
top.nm <- cbind(top.nm, tmp)
imat <- apply(dist.mat.ad, 1, order)[1:5, ]
top.ad <- data.frame(address = source1$address)
tmp <- apply(imat, 1, function(i) source2$address[i])
colnames(tmp) <- paste("top", 1:5, sep = ".")
top.ad <- cbind(top.ad, tmp)
What I would like to do is: 1. for each top.nm and top.ad column, add a column index.match for the row index where the match was found, and 2. add a column distance containing the edit distance for that match (i.e. the value of adist).
So, for top.nm.1, the index.match column would be: c(7, 6, 4, 1).
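One possible way to get both columns is to reuse the ordering that already drives the top-5 lookup, since order() returns the row positions in source2 and the distance matrix can be indexed with those positions. The following is only a minimal sketch under that assumption; src1, src2, d, idx, dst, k and top1 are names introduced here for illustration and are not part of the original code. The question expects c(7, 6, 4, 1) for the first index column.

```r
# Names from the question; src1/src2 are shorthand introduced for this sketch.
src1 <- c("holiday inn", "geico", "zgf", "morton phillips")
src2 <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
          "joes crag shack", "mike lowry place", "holiday inn", "zummer")

# Same call as in the question: rows follow src1, columns follow src2.
d <- adist(src1, src2, partial = TRUE, ignore.case = TRUE)

k <- 5  # number of best matches to keep

# idx[i, j] is the position in src2 of the j-th best match for src1[i].
idx <- t(apply(d, 1, order))[, seq_len(k), drop = FALSE]

# dst[i, j] is the edit distance of that match, looked up in d by (row, column).
dst <- matrix(d[cbind(as.vector(row(idx)), as.vector(idx))], nrow = nrow(idx))

# Best match only: matched name, its row index in src2, and its distance.
top1 <- data.frame(name = src1,
                   top.1 = src2[idx[, 1]],
                   index.match = idx[, 1],
                   distance = dst[, 1],
                   stringsAsFactors = FALSE)
print(top1)
```

The remaining matches follow the same pattern with idx[, j] and dst[, j], and the address table can presumably be built the same way by passing the address vectors to adist.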
I have tried using which(grepl(paste(source1$name, collapse = "|"), source2$name, fixed = F)) to return row indexes, but for some reason it only returns a single value, not a whole vector. Any help or advice would be appreciated.
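A likely explanation for the single value: grepl does regular expression matching, not approximate matching, and the collapsed pattern is tested once against each element of source2$name. The result therefore lists the positions in source2 that literally contain one of the source1 names, rather than one index per source1 row. The sketch below reproduces this with the names from the question; src1, src2, pattern, hits and exact are illustrative names, and match() is shown only as a contrast for exact lookups.

```r
# Names from the question; src1/src2 are shorthand introduced for this sketch.
src1 <- c("holiday inn", "geico", "zgf", "morton phillips")
src2 <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
          "joes crag shack", "mike lowry place", "holiday inn", "zummer")

# One regular expression with four alternatives.
pattern <- paste(src1, collapse = "|")

# One logical per element of src2; only "holiday inn" occurs verbatim.
hits <- which(grepl(pattern, src2))
print(hits)   # 7

# Exact lookup per src1 entry: one value per row, NA where nothing matches.
exact <- match(src1, src2)
print(exact)  # 7 NA NA NA
```

Neither call uses the edit distances, so the fuzzy matches found by adist ("geico", "zgf", "morton phillips") cannot be recovered this way.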
r levenshtein-distance grepl stringdist
asked Nov 14 '18 at 18:06
jvalenti
I think you want to try this:
sapply(top.nm$top.1, function(x){grep(x, source2$name)})
– Dave2e
Nov 14 '18 at 19:15
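The suggestion in the comment can be tried in a self-contained form as follows. This is a sketch, not the commenter's exact code: top1 is rebuilt here from the distance matrix instead of top.nm$top.1, and fixed = TRUE is an addition so that the names are treated as literal strings rather than regular expressions. src1, src2, d, top1 and found are illustrative names.

```r
# Names from the question; src1/src2 are shorthand introduced for this sketch.
src1 <- c("holiday inn", "geico", "zgf", "morton phillips")
src2 <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
          "joes crag shack", "mike lowry place", "holiday inn", "zummer")

d <- adist(src1, src2, partial = TRUE, ignore.case = TRUE)

# Best-matching name for each src1 entry (first minimum per row).
top1 <- src2[apply(d, 1, which.min)]

# Look each matched name up again in src2, as the comment proposes.
# fixed = TRUE (an addition here) avoids regex interpretation of the names.
found <- sapply(top1, function(x) grep(x, src2, fixed = TRUE))
print(found)
```

Note that grep returns every element containing the search string, so with duplicated or nested names sapply would return a list instead of a vector; taking the index directly from order() or which.min() avoids the second lookup.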