1 vote
1 answer
41 views

Fetch Page Content from Common Crawl

I have thousands of web pages from different websites. Is there a fast way to get the content of all those web pages using Common Crawl and Python? Below is the code I am trying, but this process is slow. async ...
asked by ALTAF HUSSAIN
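A common pattern for this kind of bulk fetch (a sketch, not the asker's code): look each URL up in the CDX index, then download only that record's bytes with an HTTP Range request, fanning out over threads. The crawl id `CC-MAIN-2023-50` and both host names below are assumptions for illustration; substitute the index of the crawl you actually need.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote
from urllib.request import Request, urlopen

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"  # assumed crawl id
DATA_HOST = "https://data.commoncrawl.org"

def build_range_header(offset: int, length: int) -> str:
    """HTTP Range header for one WARC record; byte ranges are inclusive on both ends."""
    return f"bytes={offset}-{offset + length - 1}"

def lookup(url: str) -> dict:
    """Ask the CDX index for one capture of `url` (JSON-lines output)."""
    query = f"{CDX_API}?url={quote(url)}&output=json&limit=1"
    with urlopen(query) as resp:
        return json.loads(resp.read().splitlines()[-1])

def fetch_record(capture: dict) -> bytes:
    """Download just this capture's gzipped WARC record via a byte-range request."""
    req = Request(
        f"{DATA_HOST}/{capture['filename']}",
        headers={"Range": build_range_header(int(capture["offset"]),
                                             int(capture["length"]))},
    )
    with urlopen(req) as resp:
        return resp.read()

def fetch_many(urls, workers=16):
    """Threaded pipeline: index lookup, then ranged download, per URL."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch_record(lookup(u)), urls))
```

The speed win comes from the Range requests (you never download a whole multi-gigabyte WARC file) plus the thread pool; an asyncio version of the same idea would work too.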
0 votes
1 answer
66 views

Querying AWS Athena the right way

I get a timeout querying https://commoncrawl.org/overview data with Athena ... and if it succeeds it will cost me $1000 per query ... $5 for each TB scanned, with 200 TB (?) ... actually too much. This is, ...
asked by fass33443423
3 votes
3 answers
799 views

Querying HTML Content in Common Crawl Dataset Using Amazon Athena

I am currently exploring the massive Common Crawl dataset hosted on Amazon S3 and am attempting to use Amazon Athena to query this dataset. My objective is to search within the HTML content of the web ...
asked by Cauder • 2,537
1 vote
1 answer
402 views

Is there any way to check if a certain domain exists in Common Crawl?

I want to know if a certain domain exists in Common Crawl's crawl data. Is there an API or any other way to check that? I couldn't find any way to achieve this in their documentation. In my solution ...
asked by Avishka Balasuriya
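One way to answer this question (a sketch under assumptions, not an official API walkthrough): the CDX index supports a `matchType=domain` query, so asking for a single capture anywhere under the domain tells you whether it was crawled. The index URL is a placeholder; the CDX server typically answers with 404 when there are no captures, which the helper below treats as "not found".

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def domain_query_url(index_api: str, domain: str) -> str:
    """CDX query that returns at most one capture anywhere under the domain."""
    params = urlencode({"url": domain, "matchType": "domain",
                        "limit": 1, "output": "json"})
    return f"{index_api}?{params}"

def domain_in_crawl(index_api: str, domain: str) -> bool:
    """True if the index returns any capture for the domain.

    A 404 (or any other error) from the CDX server is taken to mean
    'no captures' -- an assumption in this sketch.
    """
    try:
        with urlopen(domain_query_url(index_api, domain)) as resp:
            return bool(resp.read().strip())
    except Exception:
        return False
```

Each crawl has its own index, so checking "ever crawled" means repeating the query against each index of interest.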
1 vote
1 answer
106 views

Python's zlib doesn't work on CommonCrawl file

I was trying to unzip a file using Python's zlib and it doesn't seem to work. The file is 100MB from Common Crawl and I downloaded it as wet.gz. When I unzip it on the terminal with gunzip, everything ...
asked by 157 239n • 369
-1 votes
1 answer
433 views

Unknown archive format! How can I extract URLs from a WARC file in Jupyter?

I'm trying to extract website URLs from a .WARC (Web ARChive) file from the Common Crawl dataset (commoncrawl.org). After decompressing the file and writing code to read it, I attached the code: ...
asked by Jawaher
4 votes
1 answer
960 views

Common Crawl requirement to power a decent search engine

Common Crawl releases massive data loads every month, each nearly hundreds of terabytes in size. This has been going on for the last 8-9 years. Are these snapshots independent (probably not)? Or do we have to ...
asked by NedStarkOfWinterfell
0 votes
1 answer
247 views

How to access Columnar URL INDEX using Amazon Athena

I am new to AWS and I'm following this tutorial to access the columnar dataset in Common Crawl. I executed this query: SELECT COUNT(*) AS count, url_host_registered_domain FROM "ccindex"."...
asked by Gladiator
2 votes
2 answers
2k views

Extracting the payload of a single Common Crawl WARC

I can query all occurrences of a certain base URL within a given Common Crawl index, saving them all to a file, and get a specific article (test_article_num) using the code below. However, I have not come ...
asked by js16 • 63
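Once the ranged bytes for a single record are in hand, splitting them into WARC headers, HTTP headers, and payload takes only the stdlib, because the three parts are separated by blank lines (CRLF CRLF). This is a stdlib-only sketch; a library such as warcio does the same job more robustly, and the sample record in the test is synthetic.

```python
import gzip

def split_warc_record(record_gz: bytes):
    """Split one gzipped WARC response record into its three parts.

    Returns (warc_headers, http_headers, payload); each boundary is the
    first blank line (CRLF CRLF) in what remains.
    """
    raw = gzip.decompress(record_gz)
    warc_headers, _, rest = raw.partition(b"\r\n\r\n")
    http_headers, _, payload = rest.partition(b"\r\n\r\n")
    return warc_headers, http_headers, payload
```

The payload returned at the end is the HTML (or other content) the original server sent.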
3 votes
0 answers
604 views

Common Crawl Request returns 403 WARC

I am trying to fetch some WARC files from the Common Crawl archives, but I do not seem to get successful requests through to the server. A minimal Python example is provided below to replicate ...
asked by presa • 105
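One common cause of 403s here (an assumption, since the question's code is truncated) is fetching from the retired `commoncrawl.s3.amazonaws.com` host; newer crawl data is served over HTTPS from `data.commoncrawl.org`. A small helper to normalize either form, with a hypothetical file path in the test:

```python
def to_data_url(path_or_url: str) -> str:
    """Map a WARC path (or an old s3.amazonaws.com URL) to the current HTTPS host.

    Assumes data.commoncrawl.org is the host serving crawl data; anonymous
    S3 access via the old hostname is what tends to produce 403s.
    """
    old = "https://commoncrawl.s3.amazonaws.com/"
    if path_or_url.startswith(old):
        path_or_url = path_or_url[len(old):]
    return "https://data.commoncrawl.org/" + path_or_url.lstrip("/")
```

If the 403 persists with the new host, rate limiting is the next thing to rule out.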
0 votes
1 answer
420 views

Common crawl request with node-fetch, axios or got

I am trying to port my C# Common Crawl code to Node.js and getting an error with all HTTP libraries (node-fetch, axios, or got) when fetching a single page's HTML from the Common Crawl S3 archive. const offset ...
asked by Vikash Rathee
2 votes
1 answer
284 views

Which block represents a WARC-Block-Digest?

At Line 09 below there is this line: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ Line 01: WARC/1.0 Line 02: WARC-Type: request Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-...
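For context on the question above: per the WARC specification, `WARC-Block-Digest` covers the record's block, i.e. everything after the blank line that terminates the WARC header (for a request record, the HTTP request line, headers, and body). The value is an SHA-1 hash, Base32-encoded. A minimal sketch of how such a digest is computed:

```python
import base64
import hashlib

def warc_block_digest(block: bytes) -> str:
    """SHA-1 of the record's block, Base32-encoded, in WARC-Block-Digest form."""
    return "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
```

Recomputing the digest over a candidate block and comparing it with the header value is a quick way to confirm which bytes the digest actually covers.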
4 votes
1 answer
2k views

Common Crawl data search all pages by keyword

I am wondering if it is possible to look up a keyword using the Common Crawl API in Python and retrieve pages that contain that keyword. For example, if I look up "stack overflow" it will ...
asked by Python 123
0 votes
1 answer
388 views

How to get a listing of WARC files using HTTP for Common Crawl News Dataset?

I can obtain a listing for Common Crawl via: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz How can I do this with the Common Crawl News Dataset? I tried different options, but ...
asked by Andrey • 6,367
0 votes
1 answer
196 views

Getting date of first crawl of URL by Common Crawl?

In Common Crawl, the same URL can be harvested multiple times. For instance, a Reddit blog post can be crawled when it was created and then again when subsequent comments were added. Is there a way to find when a ...
asked by dzieciou • 4,514
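A sketch of one approach to the question above (assuming the CDX index's JSON-lines output, where each capture carries a 14-digit `timestamp` field): query each crawl's index for the URL, collect the capture lines, and take the minimum timestamp as the first crawl date.

```python
import json
from datetime import datetime

def earliest_capture(cdx_lines):
    """Earliest capture time from CDX JSON lines (timestamp: YYYYMMDDhhmmss).

    `cdx_lines` is an iterable of JSON strings as returned by a CDX query
    with output=json; returns None when there are no captures.
    """
    stamps = [json.loads(line)["timestamp"] for line in cdx_lines if line.strip()]
    return datetime.strptime(min(stamps), "%Y%m%d%H%M%S") if stamps else None
```

Because each monthly crawl has its own index, finding the true first capture means running the query across all the indexes that might contain the URL and taking the overall minimum.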