9,674 questions
-2
votes
0
answers
14
views
Advice on Efficiently Tracking Product Data Updates (Python Web Scraping)
I’m working on a Python-based web scraping project that involves tracking updates to product table fields such as:
{
{
price_history,
chat_count_history,
like_count_history,
view_count_history,
...
0
votes
0
answers
31
views
Crawling with Selenium: pagination doesn't work
I am running a crawler to collect data.
Everything works fine without pagination.
The code below is the problem code.
I need your help.
When I run the code, an error message appears after about 30 seconds.
...
0
votes
1
answer
36
views
Web crawling in .NET 8.0 for an Angular website
I want to crawl an Angular website, https://v16.angular.io/docs. I have written the following code for this:
var playwright = await Playwright.CreateAsync();
var browser = await playwright.Chromium.LaunchAsync(new ...
0
votes
0
answers
50
views
Scrapy loses some items when crawling a website
Here is my main spider class:
import scrapy
from xxx.items import WorkItem
class XXXSpider(scrapy.Spider):
    name = "xxx"
    allowed_domains = ["example.com"]
...
0
votes
1
answer
59
views
Scrape the HTML page after clicking on a div tag using BeautifulSoup
I ran into some trouble scraping the questions and answers from this website:
https://tech12h.com/bai-hoc/trac-nghiem-lich-su-12-bai-1-su-hinh-thanh-trat-tu-gioi-moi-sau-chien-tranh-gioi-thu-hai
The ...
0
votes
0
answers
21
views
Simple crawler to scrape ICD-11 database using API requests
I tried to write a simple crawler to crawl the entire ICD-11 database (https://icd.who.int/browse/2024-01/foundation/en#455013390) and collect the titles and descriptions of all diseases, ...
1
vote
0
answers
28
views
Run spider programmatically integrated with a crawl lib
I have the following code that runs a spider programmatically:
import asyncio
from scrapy.crawler import CrawlerProcess
from scrapy_webcrawler.spiders.spider import WebCrawlerSpider
class ...
0
votes
0
answers
36
views
The specified selector is not loading in Apify
I'm building an Apify scraper to target transaction data on a dynamic webpage. The table containing these transactions loads asynchronously via AJAX, taking under 30 seconds.
This is the page when ...
-1
votes
1
answer
38
views
Python script to crawl an ADO project for a specific file and download it
I am trying to create a Python script that will crawl an Azure DevOps project for a file and download it locally. However, I'm running into an issue where making the request to download the file isn't ...
0
votes
0
answers
40
views
How to crawl data using Selenium and split cell values in Google Sheets
Currently, I am retrieving all the <td> values and saving them to Google Sheets.
On the page being crawled, the date and rank are located at td > div > div > span, b.
So, when I ...
0
votes
0
answers
64
views
Scrapy: Preventing Data Persistence and Cross-Request Contamination
Intro
I've edited the post to simplify and clarify the content and included the proposed solutions.
All issues were resolved using download delays and dupefilter (as suggested by @wRAR).
However, with ...
0
votes
0
answers
15
views
Bing Search Engine Indexing
Has anyone encountered and fixed the issue where the Bing search engine shows the wrong website name in the header of the search result?
How can I fix this issue?
Google Chrome shows it correctly, ...
0
votes
1
answer
66
views
Querying AWS Athena the right way
I get a timeout when querying https://commoncrawl.org/overview data with Athena, and if the query succeeds it will cost me $1000 each time ($5 per TB with 200 TB?), which is far too much.
This is, ...
0
votes
0
answers
42
views
How can I keep the Xvfb screen alive for a crawler using Chrome and Selenium WebDriver after the SSH session is closed
I made a crawler program with Selenium WebDriver and Chrome.
The target website blocks Chrome headless mode, so I needed to use Xvfb.
The crawler runs on AWS EC2 (Amazon Linux 2023).
While SSH session ...
0
votes
1
answer
76
views
Scraping/crawling a website with multiple tabs using Python
I am seeking assistance in extracting data from a website with multiple tabs and saving it in a .csv format using Python and Selenium. The website in question is: https://www.amfiindia.com/research-...