NodeJS: How can I scrape two different tables, that are visually part of the same table, into one JSON...












Here's an example of the table of data I'm scraping:



Sample Table



The elements in red are in <th> tags, while the elements in green are in <td> tags. The <tr> tags follow the visual grouping (i.e. '1' is in its own <tr>). HTML snippet:



EDIT: I forgot to add the surrounding div



<div class="table-cont">
<table class="tg-1">
<thead>
<tr>
<th class="tg-phtq">ID</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0pky">1</td>
<td class="tg-0pky">2</td>
<td class="tg-0pky">3</td>
</tr>
</tbody>
</table>
<table class="tg-2">
<thead>
<tr>
<th class="tg-phtq">Sample1</th>
<th class="tg-phtq">Sample2</th>
<...the rest of the table code matches the pattern...>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0pky">Swimm</td>
<td class="tg-dvpl">1:30</td>
<...>
</tr>
</tbody>
<...the rest of the table code...>
</table>
</div>


As you can see, in the HTML these are actually two separate tables, even though they are displayed as one in the example above. I want to generate a single JSON object whose keys and values combine the data from both tables as if they were one.



I'm currently scraping it with some modified JavaScript code I found in a tutorial:



EDIT: In the code below, I've been trying to find a way to select all relevant <th> tags from both tables and put them into a single array, and to do the same for the <tr> tags in the table bodies. For the <th> tags, I'm fairly sure I can just insert the extra element before the rest, but only because there's a single one. I've been having trouble figuring out how to do that for both arrays while making sure the items in the two arrays map correctly to each other.



EDIT 2: Possible solution? I tried using XPath selectors. I can use them in DevTools to select everything I want, but page.evaluate doesn't accept them, and page.$x('XPath') returns JSHandle@node. Since I'm trying to build an array, I don't know where to go from there.
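On the XPath attempt in EDIT 2, here is a minimal sketch of one way to turn the handles into an array of strings. It assumes a Puppeteer version that still provides page.$x (it was available when this was asked; later releases deprecated it, so check your version). The helper name textsFromXPath is hypothetical. The idea is that page.evaluate accepts a handle as an extra argument and passes the underlying DOM node to the callback:

```javascript
// Resolves an XPath expression in the page and returns the trimmed text
// of every matching node as a plain array of strings.
// Assumption: `page` is a Puppeteer Page object that provides page.$x.
async function textsFromXPath(page, xpath) {
  // page.$x resolves to an array of element handles, not DOM nodes
  const handles = await page.$x(xpath);
  const texts = [];
  for (const handle of handles) {
    // A handle passed as an argument arrives in the page as the real node,
    // so its text can be read there and returned as a serializable string
    texts.push(await page.evaluate(node => node.textContent.trim(), handle));
  }
  return texts;
}
```

Something like `await textsFromXPath(page, '//div[@class="table-cont"]//th')` should then give the header texts from both tables in document order, provided the markup matches the snippet above.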



let scrapeMemberTable = async (page) => {
  // Return the evaluated value so the caller receives the results array
  return await page.evaluate(() => {
    let ths = Array.from(document.querySelectorAll('div.table-cont > table.tg-2 > thead > tr > th'));
    let trs = Array.from(document.querySelectorAll('div.table-cont > table.tg-2 > tbody > tr'));
    // the above two lines of code are the main problem area - I haven't been
    // able to select all the head/body elements I want in just those two lines of code
    // just removing the table class "tg-2" seems to deselect the whole thing
    const headers = ths.map(th => th.textContent);

    let results = [];

    trs.forEach(tr => {
      let r = {};
      let tds = Array.from(tr.querySelectorAll('td')).map(td => td.textContent);

      headers.forEach((k,i) => r[k] = tds[i]);
      results.push(r);
    });

    return results; // results is an array of objects (serializable to JSON)
  });
};

...

results = results.concat( //merge into one array OBJ
await scrapeMemberTable(page)
);

...


Intended Result:



[
{
"ID": "1", <-- this is the goal
"Sample1": "Swimm",
"Sample2": "1:30",
"Sample3": "2:05",
"Sample4": "1:15",
"Sample5": "1:41"
}
]


Actual Result:



[
{
"Sample1": "Swimm",
"Sample2": "1:30",
"Sample3": "2:05",
"Sample4": "1:15",
"Sample5": "1:41"
}
]
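One possible way to get from the actual result to the intended one, sketched under a few assumptions: both tables sit directly inside div.table-cont, each table has its own <thead> and <tbody>, each table contributes one <tr> per logical row, and row i of the first table belongs with row i of the second. The helper names mergeTables and scrapeMergedTable are hypothetical and not from the tutorial. Because the function passed to page.evaluate runs in the page and cannot see Node-side helpers, the sketch only collects plain arrays in the page and does the merging in Node:

```javascript
// Pure helper: zips several tables together row by row.
// Each table is { headers: string[], rows: string[][] }.
function mergeTables(tables) {
  const rowCount = Math.max(0, ...tables.map(t => t.rows.length));
  const merged = [];
  for (let i = 0; i < rowCount; i++) {
    const record = {};
    for (const { headers, rows } of tables) {
      const cells = rows[i] || []; // a table may have fewer rows than the others
      headers.forEach((key, j) => { record[key] = cells[j]; });
    }
    merged.push(record);
  }
  return merged;
}

// Collects headers and cell texts from every table inside div.table-cont.
// Only the arrow function given to page.evaluate runs in the browser.
async function scrapeMergedTable(page) {
  const tables = await page.evaluate(() =>
    Array.from(document.querySelectorAll('div.table-cont > table')).map(table => ({
      headers: Array.from(table.querySelectorAll('thead th'))
        .map(th => th.textContent.trim()),
      rows: Array.from(table.querySelectorAll('tbody tr')).map(tr =>
        Array.from(tr.querySelectorAll('td')).map(td => td.textContent.trim())),
    }))
  );
  return mergeTables(tables);
}
```

Dropping the class from the table selector is what lets both tg-1 and tg-2 match. If the real page groups rows differently from the assumption above, only mergeTables needs to change, since the in-page part just returns raw text.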









  • And what precisely is wrong with how you're scraping it now? You stated an intent but forgot to describe the issue with your existing attempt. Don't make people figure it out or start from scratch; instead, make it clear what the deficiency is. Thanks.

    – ADyson
    Nov 15 '18 at 13:00

  • @ADyson Hi, thanks for pointing that out. I was half asleep when I finished writing this up, so evidently I missed some points. I've added clarification and additional class selectors as a better representation of what's going on; hope that helps. Cheers :)

    – gg93
    Nov 15 '18 at 22:38

  • I don't see any thead or tbody containers in your tables, so the selectors seem problematic.

    – Garr Godfrey
    Nov 15 '18 at 23:07

  • @GarrGodfrey - I completely missed that, thank you; it should be consistent now.

    – gg93
    Nov 15 '18 at 23:27

  • The other thing is that you filter on "table.tg-2", but the first table is tg-1, so it won't be included. Maybe just remove the '.tg-2' from the selector string.

    – Garr Godfrey
    Nov 24 '18 at 1:35
















javascript html node.js json html-table






edited Nov 16 '18 at 1:01 by gg93
asked Nov 15 '18 at 12:57 by gg93












