Unanswered 'web-crawler' Questions

13 votes

2 answers

5k views

crawl dynamic web page using htmlunit

I am crawling data using HtmlUnit from a dynamic webpage that uses infinite scrolling to fetch data, just like Facebook's news feed. I used the following statement to simulate the ...

Marcopolo Soc

131

asked Aug 25, 2012 at 5:58

9 votes

4 answers

2k views

How to detect if a site lets you upload files?

I would like to be able to tell if a site lets you upload files. I can think of two main ways sites do it, and ideally I'd like to be able to detect both: (1) Button, (2) Drag & Drop. PhantomJS ...

rudolfovic

3,276

asked Dec 16, 2021 at 12:10

9 votes

0 answers

3k views

How to web crawl a DeepZoom image from IIPImage server?

How to get all tiles and metadata of a DeepZoom image hosted on an IIPImage server? IIPImage supports the IIP protocol (not well documented), MS DeepZoom, and Zoomify.

Vladimir Alexiev

2,601

asked Aug 4, 2011 at 23:41

7 votes

0 answers

216 views

Nutch problems executing crawl on Windows

I am trying to get Nutch 1.11 to execute a crawl. I am using Cygwin to run these commands in Windows 8. I have put the hadoop-core jar into the lib folder, but when I try to run a crawl I obtain: Exception ...

Daniel Z.

71

asked May 12, 2016 at 8:48

7 votes

0 answers

2k views

Solr dedup error Failed with exit value 255

I am crawling some data from the web using Apache Nutch 2.3. My Solr version is 4.10.3. The data is crawled successfully into HBase and also indexed in Solr, but at the end (dedup stage) the following error appears in ...

Hafiz Muhammad Shafiq

8,660

asked Jan 28, 2015 at 5:53

7 votes

0 answers

893 views

Adsense with dynamic content

I know that this topic has been discussed before to varying extents, but I have some specific queries. I will use an example for this case and would like to ask for your views. Example: A ...

Mridul Kanti Roy Chowdhury

139

asked Jun 29, 2013 at 16:59

7 votes

0 answers

9k views

Python urllib2 and [errno 10054] An existing connection was forcibly closed by the remote host and a few urllib2 problems

I've written a crawler that uses urllib2 to fetch URLs. Every few requests I get some weird behavior. I've tried analyzing it with Wireshark but couldn't understand the problem. getPAGE() is ...

YSY

1,236

asked Jul 25, 2011 at 19:15

6 votes

0 answers

6k views

Scraping data from Flightradar24

I’m trying to make a scraper that returns data for daily flights between airports in Europe for a list of European airlines. For KLM, the data can be found on the following website by clicking on the ...

R. Plate

131

asked Aug 15, 2018 at 15:09

6 votes

0 answers

105 views

Keywords status tracking for different scrapers maintaining retrieval speed

For a better understanding, here are the tables/models of my scraping app in Ruby on Rails with MySQL: Scraper (a scraper searches a given site for all keywords); Keyword (contains term to ...

Hassan Akram

652

asked Dec 13, 2017 at 9:09

6 votes

0 answers

832 views

Docker Scrapinghub/splash exited with 139

I'm using Scrapy to do some crawling with Splash using the Scrapinghub/splash Docker container; however, the container exits by itself after a while with exit code 139. I'm running the scraper on an AWS ...

MtziSam

130

asked Aug 16, 2017 at 19:59

6 votes

1 answer

4k views

Select option from dropdown and submit request using nodejs

I am using Node.js to scrape a website and I am very new to Node.js. The website's initial page is a popup in which one has to select an option from a select box and submit; only then later pages can be ...

Java_begins

1,679

asked Jun 14, 2015 at 12:16

6 votes

0 answers

91k views

How do I extract data from a website using JavaScript?

Hi, complete newbie here, so bear with me. This seems like a simple job, but I can't find an easy way to do it. I need to extract a particular text from a webpage "www.example.com/index.php". I ...

Vivek

153

asked Oct 4, 2013 at 13:02

6 votes

1 answer

1k views

How to make sure web crawler works for site hosted on AWS S3 and uses AJAX

The Google webmaster guide explains that the web server should handle requests for URLs that contain _escaped_fragment_ (the crawler modifies www.example.com/ajax.html#!mystate to www.example.com/ajax.html?...

tomerlic

61

asked Oct 9, 2012 at 12:38

5 votes

2 answers

1k views

Custom BCS indexing connector with changelog incremental crawl is not working properly

I am writing a custom indexing connector using the changelog incremental crawl approach. I'm using the sample from http://msdn.microsoft.com/en-us/library/ff625800%28v=office.14%29.aspx and trying to change ...

Mitka

51

asked Jun 6, 2013 at 9:00

5 votes

1 answer

3k views

guide to setup crawler4j

I would like to set up the crawler to crawl a website, let's say a blog, fetch only the links in the website, and write the links to a text file. Can you guide me step by step through setting up the ...

Wai Loon II

259

asked Feb 16, 2011 at 5:17
