NodeJS: How can I scrape two different tables, that are visually part of the same table, into one JSON...
Here's an example of the table of data I'm scraping:

The elements in red are in <th> tags, while the elements in green are in <td> tags; the <tr> tags follow how the cells are grouped (i.e. '1' is in its own <tr>). HTML snippet:
EDIT: I forgot to add the surrounding div
<div class="table-cont">
<table class="tg-1">
<thead>
<tr>
<th class="tg-phtq">ID</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0pky">1</td>
<td class="tg-0pky">2</td>
<td class="tg-0pky">3</td>
</tr>
</tbody>
</table>
<table class="tg-2">
<thead>
<tr>
<th class="tg-phtq">Sample1</th>
<th class="tg-phtq">Sample2</th>
<...the rest of the table code matches the pattern...>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0pky">Swimm</td>
<td class="tg-dvpl">1:30</td>
<...>
</tr>
</tbody>
<...the rest of the table code...>
</table>
</div>
As you can see, in the HTML they're actually two different tables, although they're displayed as one in the example above. I want to generate JSON whose keys and values combine the data from both tables as if they were one, output as a single JSON object.
Right now I'm scraping it with slightly modified JavaScript code I found in a tutorial:
EDIT: In the code below, I've been trying to select all relevant <th> tags from both tables into a single array, and to do the same for the <tr> tags in the table bodies. I'm fairly sure that for the <th> I can just insert the element separately before the rest, but only because there's a single one. I've been having problems figuring out how to do that for both arrays while making sure the items in the two arrays map correctly to each other.
EDIT 2: Possible solution? I tried XPath selectors, and I can use them in DevTools to select everything I want, but page.evaluate doesn't accept them, and page.$x('XPath') returns JSHandle@node values rather than the array I'm trying to build. I don't know where to go from there.
let scrapeMemberTable = async (page) => {
  // Return the evaluated value so the caller receives the scraped rows
  return await page.evaluate(() => {
    let ths = Array.from(document.querySelectorAll('div.table-cont > table.tg-2 > thead > tr > th'));
    let trs = Array.from(document.querySelectorAll('div.table-cont > table.tg-2 > tbody > tr'));
    // The two lines above are the main problem area: I haven't been able to
    // select all the head/body elements I want in just those two lines of code.
    // Just removing the table class "tg-2" seems to deselect the whole thing.
    const headers = ths.map(th => th.textContent);
    let results = [];
    trs.forEach(tr => {
      let r = {};
      let tds = Array.from(tr.querySelectorAll('td')).map(td => td.textContent);
      headers.forEach((k, i) => r[k] = tds[i]);
      results.push(r);
    });
    return results; // array of plain objects, serializable as JSON
  });
};
...
results = results.concat( //merge into one array OBJ
await scrapeMemberTable(page)
);
...
Intended Result:
[
{
"ID": "1", <-- this is the goal
"Sample1": "Swimm",
"Sample2": "1:30",
"Sample3": "2:05",
"Sample4": "1:15",
"Sample5": "1:41"
}
]
Actual Result:
[
{
"Sample1": "Swimm",
"Sample2": "1:30",
"Sample3": "2:05",
"Sample4": "1:15",
"Sample5": "1:41"
}
]
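One way to reach the intended result is to read the two tables separately and join them by row position. The sketch below is a DOM-free version of that join, assuming the ID cells and the rows of the second table appear in the same order. `mergeTables` and its parameter names are hypothetical; they are not part of Puppeteer or of the code above.

```javascript
// Hypothetical helper: joins the ID column (from table.tg-1) with the rows of
// table.tg-2 by position. Assumes both tables list their rows in the same order.
function mergeTables(idHeader, ids, headers, rows) {
  return rows.map((cells, rowIndex) => {
    // Start each record with the ID so it appears first in the JSON output
    const record = { [idHeader]: ids[rowIndex] };
    headers.forEach((key, colIndex) => {
      record[key] = cells[colIndex];
    });
    return record;
  });
}

// Sample data shaped like the text content of the two tables in the question
const merged = mergeTables(
  'ID',
  ['1'],
  ['Sample1', 'Sample2', 'Sample3', 'Sample4', 'Sample5'],
  [['Swimm', '1:30', '2:05', '1:15', '1:41']]
);
console.log(JSON.stringify(merged, null, 2));
```

Inside page.evaluate, the four arguments could be collected with querySelectorAll on each table and mapped to textContent, for example 'table.tg-1 tbody td' for the IDs and 'table.tg-2 tbody tr' for the rows. Whether those selectors match the real page is an untested assumption.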
javascript html node.js json html-table
and what precisely is wrong with how you're scraping it now? You stated an intent but forgot to describe the issue with your existing attempt. Don't make people figure it out, or have to start from scratch, instead make it clear what the deficiency is. Thanks.
– ADyson
Nov 15 '18 at 13:00
@ADyson Hi, thanks for pointing that out. I was half asleep when I finished writing this up, so evidently I missed some points; I've added clarification and additional class selectors as a better representation of what's going on. Hope that helps. Cheers :)
– gg93
Nov 15 '18 at 22:38
I don't see any thead or tbody containers in your tables, so the selectors seem problematic.
– Garr Godfrey
Nov 15 '18 at 23:07
@GarrGodfrey - I completely missed that, thank you; it should be consistent now
– gg93
Nov 15 '18 at 23:27
The other thing is you filter on "table.tg-2" but the first table is tg-1, so it won't be included. Maybe just remove the '.tg-2' from the selector string.
– Garr Godfrey
Nov 24 '18 at 1:35
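Dropping the class from the selector widens both queries to every table in the div, but the row-mapping loop then pairs the combined header list with each table's rows separately. A small DOM-free sketch of that effect follows, assuming the markup is exactly as posted (all ID cells in a single row of the first table). `rowsToObjects` is a hypothetical stand-in for the loop in the question, and the header list is shortened for brevity.

```javascript
// Hypothetical stand-in for the question's loop: maps each row's cells onto
// the header names by column index.
function rowsToObjects(headers, rows) {
  return rows.map(cells => {
    const record = {};
    headers.forEach((key, i) => {
      record[key] = cells[i];
    });
    return record;
  });
}

// What a class-less selector would collect from the posted markup:
// headers from both tables, then one row per <tr> across both tables.
const headers = ['ID', 'Sample1', 'Sample2'];
const rows = [
  ['1', '2', '3'],     // the single row of table.tg-1
  ['Swimm', '1:30']    // the first row of table.tg-2
];

const misaligned = rowsToObjects(headers, rows);
// The ID key ends up holding "Swimm" for the second record
console.log(misaligned);
```

This suggests the two tables need to be read separately and joined by position, rather than selected with one broader selector.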
edited Nov 16 '18 at 1:01
gg93
asked Nov 15 '18 at 12:57
gg93