
I have an array of URLs to scrape data from:

urls = ['url','url','url'...]

This is what I'm doing:

urls.map(async (url)=>{
  await page.goto(url);
  await page.waitForNavigation({ waitUntil: 'networkidle' });
})

This doesn't seem to wait for the page to load and visits all the URLs quite rapidly (I even tried using page.waitFor).

I wanted to know whether I'm doing something fundamentally wrong or whether this type of functionality is not advised/supported.

  • map is unnecessary in this case regardless; it's for when you want to produce an array by processing an input array. What you would normally want is forEach, since you are not returning anything. That keeps the intent of the code clear.
    – Dexygen
    Commented Nov 5, 2024 at 18:43

5 Answers


map, forEach, reduce, etc., do not wait for the asynchronous operation within them before they proceed to the next element of the iterable they are iterating over.

There are multiple ways of going through each item of an iterable sequentially while performing an asynchronous operation, but the easiest in this case, I think, is to simply use a normal for loop, in which await does pause until each operation finishes.

const urls = [...]

for (let i = 0; i < urls.length; i++) {
    const url = urls[i];
    await page.goto(`${url}`);
    await page.waitForNavigation({ waitUntil: 'networkidle2' });
}

This will visit one URL after another, as you are expecting. If you are curious about iterating serially using async/await, you can have a peek at this answer: https://stackoverflow.com/a/24586168/791691
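
If you want a complete, runnable version of the same idea, here is a minimal sketch with placeholder URLs. Outside an ES module, the loop has to live inside an async function, and passing waitUntil directly to goto makes a separate waitForNavigation call unnecessary:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const urls = ['https://example.com/a', 'https://example.com/b']; // placeholders

    for (const url of urls) {
        // goto resolves once the navigation satisfies waitUntil,
        // so no separate waitForNavigation is needed here
        await page.goto(url, { waitUntil: 'networkidle2' });
        // ...scrape the current page here...
    }

    await browser.close();
})();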

  • Weird, this gives await page.goto(${url}); Unexpected identifier SyntaxError. Commented Nov 1, 2017 at 10:42
  • @user2875289 Which version of Node are you using? You need 7.6 or higher for async/await to work without transpiling.
    – tomahaug
    Commented Nov 1, 2017 at 11:22
  • @tomahaug I'm using Node 8.9. The problem was solved: I was mixing async/await with promises, which caused the SyntaxError. It works now after switching to async/await only. Thanks! Commented Nov 1, 2017 at 11:28
  • @user2875289 the template literal syntax on the url variable seems superfluous here anyway, so you should be good to go with just page.goto(url). I don't think await page.waitForNavigation({ waitUntil: 'networkidle2' }); is necessary here--goto already waits for navigation, so I would use page.goto(url, {waitUntil: "networkidle2"}) and skip the waitForNavigation call.
    – ggorlen
    Commented Sep 29, 2022 at 1:50
  • @ggorlen Thanks! The question and this answer are 5 years old. I think your recent answers and comments (both under this answer and other answers) are more valuable to 2022 users. Thank you. Commented Sep 30, 2022 at 2:59

The accepted answer shows how to serially visit each page one at a time. However, you may want to visit multiple pages simultaneously when the task is embarrassingly parallel, that is, scraping a particular page isn't dependent on data extracted from other pages.

A tool that can help achieve this is Promise.allSettled, which lets us fire off a bunch of promises at once, determine which were successful, and harvest the results.
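
For reference, here's a small, Puppeteer-free sketch of the shape Promise.allSettled produces; each entry reports its own status, which is what the later filter by status relies on:

(async () => {
  const results = await Promise.allSettled([
    Promise.resolve("ok"),
    Promise.reject(new Error("boom")),
  ]);
  // Each entry is { status: "fulfilled", value } or { status: "rejected", reason },
  // so one rejection doesn't throw away the other results.
  const values = results
    .filter(r => r.status === "fulfilled")
    .map(r => r.value);
  console.log(values); // ["ok"]
})();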

For a basic example, let's say we want to scrape usernames for Stack Overflow users given a series of ids.

Serial code:

import puppeteer from "puppeteer"; // ^22.7.1

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const baseURL = "https://stackoverflow.com/users";
  const startId = 6243352;
  const qty = 5;
  const usernames = [];

  for (let i = startId; i < startId + qty; i++) {
    await page.goto(`${baseURL}/${i}`, {
      waitUntil: "domcontentloaded"
    });
    const sel = ".flex--item.mb12.fs-headline2.lh-xs";
    const el = await page.waitForSelector(sel);
    usernames.push(await el.evaluate(el => el.textContent.trim()));
  }

  console.log(usernames);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Parallel code:

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const baseURL = "https://stackoverflow.com/users";
  const startId = 6243352;
  const qty = 5;

  const usernames = (await Promise.allSettled(
    [...Array(qty)].map(async (_, i) => {
      let page;
      try {
        page = await browser.newPage();
        await page.goto(`${baseURL}/${i + startId}`, {
          waitUntil: "domcontentloaded"
        });
        const sel = ".flex--item.mb12.fs-headline2.lh-xs";
        const el = await page.waitForSelector(sel);
        // finally still runs after this return, closing the page
        return await el.evaluate(el => el.textContent.trim());
      } finally {
        await page?.close();
      }
    })))
    .filter(e => e.status === "fulfilled")
    .map(e => e.value);
  console.log(usernames);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Quick benchmark. Serial:

real 0m2.922s
user 0m1.300s
sys  0m0.320s

Parallel:

real 0m1.636s
user 0m1.171s
sys  0m0.408s

Remember that this is a technique, not a silver bullet that guarantees a speed increase on all workloads. It will take some experimentation to find the optimal balance between the cost of creating more pages and the gains from parallelizing network requests for a particular task and system.

The example here is contrived since it's not interacting with the page dynamically, so there's not as much room for gain as in a typical Puppeteer use case that involves network requests and blocking waits per page.

Of course, beware of rate limiting and any other restrictions imposed by sites (running the code above may anger Stack Overflow's rate limiter).

For tasks where creating a page per task is prohibitively expensive, or where you'd like to cap the number of parallel request dispatches, consider using a task queue or combining the serial and parallel code shown above to send requests in chunks. This answer shows a generic, Puppeteer-agnostic pattern for this.
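
As a rough illustration of the chunked approach (not the linked answer's code; scrapeOne, scrapeInChunks, and chunkSize are made-up names, and page.title() stands in for real extraction logic):

// Assumes a launched Puppeteer Browser instance and an array of URLs
// are supplied by the caller, e.g. const browser = await puppeteer.launch();
const scrapeOne = async (browser, url) => {
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: "domcontentloaded" });
    return await page.title(); // stand-in for real extraction logic
  } finally {
    await page.close();
  }
};

// Process the URLs a few at a time so at most chunkSize pages are open at once.
const scrapeInChunks = async (browser, urls, chunkSize = 3) => {
  const results = [];
  for (let i = 0; i < urls.length; i += chunkSize) {
    const chunk = urls.slice(i, i + chunkSize);
    const settled = await Promise.allSettled(
      chunk.map(url => scrapeOne(browser, url))
    );
    results.push(...settled);
  }
  return results;
};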

These patterns can be extended to handle the case when certain pages depend on data from other pages, forming a dependency graph.

See this answer, which illustrates a common pattern: scraping a series of links on a main page, then scraping data from each sub-page.
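
A hedged sketch of that main-page/sub-page pattern (the selectors "a.item-link" and ".detail-title" are invented for illustration, not taken from the linked answer):

// Collect links from a listing page, then visit each sub-page serially with the same tab.
// Assumes the caller passes in an already-created Puppeteer Page.
const scrapeListing = async (page, listingURL) => {
  await page.goto(listingURL, { waitUntil: "domcontentloaded" });
  const links = await page.$$eval("a.item-link", els => els.map(el => el.href));

  const details = [];
  for (const link of links) {
    await page.goto(link, { waitUntil: "domcontentloaded" });
    const el = await page.waitForSelector(".detail-title");
    details.push(await el.evaluate(el => el.textContent.trim()));
  }
  return details;
};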

See also Using async/await with a forEach loop which explains why the original attempt in this thread using map fails to wait for each promise.


If you find that you are waiting on your promise indefinitely, the proposed solution is to use the following:

const urls = [...]

for (let i = 0; i < urls.length; i++) {
    const url = urls[i];
    const promise = page.waitForNavigation({ waitUntil: 'networkidle0' });
    await page.goto(`${url}`);
    await promise;
}

As referenced in this GitHub issue.

  • Why not just use page.goto(urls[i], {waitUntil: "networkidle0"})? The issue you reference deals with a .click() that triggers navigation. goto doesn't need an explicit waitForNavigation because it's essentially built into the goto call.
    – ggorlen
    Commented Sep 29, 2022 at 1:37
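
The click-triggered case mentioned in the comment above is where waitForNavigation genuinely earns its keep: start the wait before the click, then await both together. A sketch, assuming this runs inside an async function with a page in scope and 'a.next-page' as a made-up selector:

// Navigation caused by a click: begin waiting before clicking so the
// navigation isn't missed, then await both promises together.
await Promise.all([
  page.waitForNavigation({ waitUntil: 'networkidle0' }),
  page.click('a.next-page'),
]);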

The best way I found to achieve this:

const puppeteer = require('puppeteer');

(async () => {
    const urls = ['https://www.google.com/', 'https://www.google.com/'];
    for (let i = 0; i < urls.length; i++) {
        const url = urls[i];
        const browser = await puppeteer.launch({ headless: false });
        const page = await browser.newPage();
        await page.goto(`${url}`, { waitUntil: 'networkidle2' });
        await browser.close();
    }
})();
  • I don't see a point in all these browser.newPage() and browser.close() calls. Since you're working serially, you can make one page before the loop and navigate it from one page to the next using gotos, then close the browser after the loop ends.
    – ggorlen
    Commented Nov 25, 2020 at 7:40

Something no one else mentions is that if you are fetching multiple pages using the same page object, it is crucial that you set its timeout to 0. Otherwise, once it has fetched the default 30 seconds' worth of pages, it will time out.

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.setDefaultNavigationTimeout(0);
  • I don't think this is crucial. In fact, it's dangerous, because you should almost never wait forever for something and cause a script to hang without a clear log of the problem being recorded. Whenever a navigation takes longer than a minute or two, you almost certainly have a bug, for example, using networkidle when the page opens multiple long-running requests. The 30-second timeout is specific to each navigation, not some persistent value that each navigation subtracts from. Navigation timeouts aren't really relevant to the looping problem that OP is asking about in any case.
    – ggorlen
    Commented Sep 29, 2022 at 1:38
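
Along the lines of the comment above, a sketch of the alternative it suggests: keep a finite per-navigation timeout and log failures rather than disabling the timeout entirely. page and urls are assumed to exist as in the earlier answers, and 60000 ms is an arbitrary example value:

for (const url of urls) {
  try {
    // page.goto accepts a per-call timeout (in milliseconds)
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
    // ...scrape...
  } catch (err) {
    console.error(`Navigation to ${url} failed:`, err.message);
  }
}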
