3,274 questions with no upvoted or accepted answers
13 votes
2 answers
5k views

Crawl a dynamic web page using HtmlUnit

I am crawling data with HtmlUnit from a dynamic web page that uses infinite scrolling to fetch data, just like Facebook's news feed. I used the following statement to simulate the ...
Marcopolo Soc
9 votes
4 answers
2k views

How to detect if a site lets you upload files?

I would like to be able to tell if a site lets you upload files. I can think of two main ways sites do it, and ideally I'd like to be able to detect both: (1) Button, (2) Drag & Drop. PhantomJS ...
rudolfovic
  • 3,276
9 votes
0 answers
3k views

How to web crawl a DeepZoom image from IIPImage server?

How to get all tiles and metadata of a DeepZoom image hosted on an IIPImage server? IIPImage supports the IIP protocol (not well documented), MS DeepZoom, and Zoomify.
Vladimir Alexiev
7 votes
0 answers
216 views

Nutch problems executing crawl on Windows

I am trying to get Nutch 1.11 to execute a crawl. I am using Cygwin to run these commands on Windows 8. I have put the hadoop-core jar into the lib folder, but when I try to run a crawl I get: Exception ...
Daniel Z.
7 votes
0 answers
2k views

Solr dedup error Failed with exit value 255

I am crawling some data from the web using Apache Nutch 2.3. My Solr version is 4.10.3. The data is crawled successfully into HBase and also indexed in Solr, but at the end (dedup stage) the following error appears in ...
Hafiz Muhammad Shafiq
7 votes
0 answers
893 views

AdSense with dynamic content

I know that this topic has been discussed before to varying extents, but I have some specific queries. I will use an example for this case and would like to ask for your views. Example: A ...
Mridul Kanti Roy Chowdhury
7 votes
0 answers
9k views

Python urllib2 and [errno 10054] An existing connection was forcibly closed by the remote host and a few urllib2 problems

I've written a crawler that uses urllib2 to fetch URLs. Every few requests I get some weird behavior. I've tried analyzing it with Wireshark but couldn't understand the problem. getPAGE() is ...
YSY
  • 1,236
6 votes
0 answers
6k views

Scraping data from Flightradar24

I’m trying to make a scraper that returns data for daily flights between airports in Europe for a list of European airlines. For KLM, the data can be found on the following website by clicking on the ...
R. Plate
  • 131
6 votes
0 answers
105 views

Keywords status tracking for different scrapers maintaining retrieval speed

To give a better understanding, here are the tables/models of my scraping app in Ruby on Rails with MySQL: Scraper (A scraper searches a given site for all keywords) Keyword (Contains term to ...
Hassan Akram
6 votes
0 answers
832 views

Docker Scrapinghub/splash exited with 139

I'm using Scrapy with Splash, via the Scrapinghub/splash Docker container, to do some crawling. However, the container exits by itself after a while with exit code 139. I'm running the scraper on an AWS ...
MtziSam
  • 130
6 votes
1 answer
4k views

Select option from dropdown and submit request using nodejs

I am using Node.js to scrape a website, and I am very new to Node.js. The website's initial page is a popup in which one has to select an option from a select box and submit; only then can later pages be ...
Java_begins
  • 1,679
6 votes
0 answers
91k views

How do I extract data from a website using JavaScript?

Hi, complete newbie here, so bear with me. This seems like a simple job, but I can't find an easy way to do it. I need to extract a particular text from a webpage "www.example.com/index.php". I ...
Vivek
  • 153
6 votes
1 answer
1k views

How to make sure web crawler works for site hosted on AWS S3 and uses AJAX

Google's webmaster guide explains that the web server should handle requests for URLs that contain _escaped_fragment_ (the crawler modifies www.example.com/ajax.html#!mystate to www.example.com/ajax.html?...
tomerlic
5 votes
2 answers
1k views

Custom BCS indexing connector with changelog incremental crawl is not working properly

I am writing a custom indexing connector using the changelog incremental crawl approach. I'm using the sample from http://msdn.microsoft.com/en-us/library/ff625800%28v=office.14%29.aspx and trying to change ...
Mitka
  • 51
5 votes
1 answer
3k views

Guide to set up crawler4j

I would like to set up the crawler to crawl a website, let's say a blog, fetch only the links in the website, and write the links to a text file. Can you guide me step by step to set up the ...
Wai Loon II
