75 questions
1 vote · 1 answer · 41 views
Fetch Page Content from Common Crawl
I have thousands of web pages from different websites. Is there a fast way to get the content of all those pages using Common Crawl and Python?
Below is the code I am trying, but the process is slow.
async ...
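A common way to speed up bulk lookups is to resolve each page through the Common Crawl CDX index API and fan the queries out over a thread pool. The sketch below assumes the public index server at index.commoncrawl.org; the crawl ID `CC-MAIN-2024-10` is only an example (pick a real one from the index listing), and `lookup_all` is a hypothetical helper, not a library function.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

# Example crawl ID; substitute one listed at https://index.commoncrawl.org/
CDX_API = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

def cdx_query_url(page_url: str) -> str:
    """Build a CDX index query URL for one page (JSON output, latest capture)."""
    return CDX_API + "?" + urlencode({"url": page_url, "output": "json", "limit": "1"})

def lookup_all(urls, workers=16):
    """Resolve many pages concurrently; fetch_one is where the HTTP call goes."""
    import urllib.request

    def fetch_one(u):
        with urllib.request.urlopen(cdx_query_url(u), timeout=30) as r:
            return r.read().decode("utf-8", "replace")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))

# Pure part, safe to run offline:
print(cdx_query_url("https://example.com/"))
```

Each index hit returns the WARC filename, offset, and length needed to fetch the page body itself with a ranged request.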
0 votes · 1 answer · 66 views
Querying AWS Athena the right way
I get a timeout querying https://commoncrawl.org/overview data with Athena ... and if it succeeds it will cost me $1000 per query ... $5 for each TB with 200 TB (?) ... which is really too much
This is, ...
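Athena's cost is driven entirely by bytes scanned, so the usual fix is partition pruning: filtering on the `crawl` and `subset` partition columns keeps Athena from scanning the whole table. A rough sketch of the arithmetic, assuming Athena's published rate of $5 per TB scanned (the crawl ID and table names follow the ccindex setup; the byte figures are illustrative):

```python
ATHENA_USD_PER_TB = 5.0  # published Athena rate; check current pricing

def scan_cost_usd(bytes_scanned: int) -> float:
    """Approximate Athena bill: dollars per TB of data scanned."""
    return bytes_scanned / 1e12 * ATHENA_USD_PER_TB

# Restricting to one crawl and one subset prunes partitions, so Athena scans
# gigabytes instead of the full index:
QUERY = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2024-10'   -- partition column
  AND subset = 'warc'             -- partition column
  AND url_host_registered_domain = 'example.com'
"""

print(scan_cost_usd(200e12))  # unpruned scan of ~200 TB -> 1000.0
print(scan_cost_usd(50e9))    # one pruned slice of ~50 GB -> 0.25
```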
3 votes · 3 answers · 799 views
Querying HTML Content in Common Crawl Dataset Using Amazon Athena
I am currently exploring the massive Common Crawl dataset hosted on Amazon S3 and am attempting to use Amazon Athena to query this dataset. My objective is to search within the HTML content of the web ...
1 vote · 1 answer · 402 views
Is there any way to check if a certain domain exists in Common Crawl?
I want to know whether a certain domain exists in Common Crawl's crawl data. Is there an API or any other way to check that?
I couldn't find a way to achieve this in their documentation.
In my solution ...
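One lightweight approach is a CDX index query with `matchType=domain` and `limit=1`: a single hit proves the domain appears in that crawl. A sketch, assuming the public index server at index.commoncrawl.org (the crawl ID is an example, and each crawl has its own index, so loop over crawls if you need full coverage):

```python
from urllib.parse import urlencode
import urllib.error
import urllib.request

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"  # example crawl

def domain_query_url(domain: str) -> str:
    """One matching capture is enough to prove the domain is present."""
    params = {"url": domain, "matchType": "domain", "limit": "1", "output": "json"}
    return CDX_API + "?" + urlencode(params)

def domain_in_crawl(domain: str) -> bool:
    """HTTP call; the CDX server answers 404 when nothing matches."""
    try:
        with urllib.request.urlopen(domain_query_url(domain), timeout=30) as r:
            return bool(r.read().strip())
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False
        raise

# Pure part, safe to run offline:
print(domain_query_url("example.com"))
```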
1 vote · 1 answer · 106 views
Python's zlib doesn't work on CommonCrawl file
I was trying to unzip a file using Python's zlib and it doesn't seem to work. The file is 100MB from Common Crawl and I downloaded it as wet.gz. When I unzip it in the terminal with gunzip, everything ...
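The usual cause is twofold: `zlib.decompress` defaults to raw zlib framing and rejects the gzip header, and Common Crawl's `.wet.gz`/`.warc.gz` files are concatenations of many gzip members, one per record, so decompression must also loop over members. A self-contained sketch of both fixes:

```python
import gzip
import zlib

# Build a two-member gzip stream in memory, like a WARC/WET .gz file,
# which stores each record as its own gzip member.
data = gzip.compress(b"record one\n") + gzip.compress(b"record two\n")

# Plain zlib.decompress expects a raw zlib stream, so a .gz file fails:
try:
    zlib.decompress(data)
except zlib.error as e:
    print("zlib.decompress failed:", e)

# Option 1: the gzip module reads concatenated members transparently.
print(gzip.decompress(data))  # b'record one\nrecord two\n'

# Option 2: zlib with wbits=MAX_WBITS | 16 (gzip framing), looping over members.
out, rest = b"", data
while rest:
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)
    out += d.decompress(rest)
    rest = d.unused_data  # bytes left after the current member's end
print(out)
```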
-1 votes · 1 answer · 433 views
Unknown archive format! How can I extract URLs from the WARC file in Jupyter?
I'm trying to extract website URLs from a .WARC (Web ARChive) file from the Common Crawl dataset (commoncrawl.org).
After decompressing the file and writing the code to read this file, I attached the code:...
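For real work, the warcio or fastwarc libraries are the robust way to read WARC files (and "Unknown archive format" usually means the file is still gzip-compressed or was truncated during download). As a stdlib-only illustration of where the URLs live, every record header carries a `WARC-Target-URI` line; this sketch parses one from an in-memory sample:

```python
import io

# A minimal WARC record header, as found (one gzip member per record) in
# Common Crawl .warc.gz files.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://example.com/page\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n\r\n"
)

def target_uris(stream):
    """Yield the WARC-Target-URI of every record header in a decompressed WARC."""
    for raw in stream:
        if raw.lower().startswith(b"warc-target-uri:"):
            yield raw.split(b":", 1)[1].strip().decode()

print(list(target_uris(io.BytesIO(sample))))  # ['https://example.com/page']
```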
4 votes · 1 answer · 960 views
Common Crawl requirement to power a decent search engine
Common Crawl releases massive data loads every month, each nearly hundreds of terabytes in size. This has been going on for the last 8-9 years.
Are these snapshots independent (probably not)? Or do we have to ...
0 votes · 1 answer · 247 views
How to access Columnar URL INDEX using Amazon Athena
I am new to AWS and I'm following this tutorial to access Columnar dataset in Common Crawl. I executed this query:
SELECT COUNT(*) AS count,
url_host_registered_domain
FROM "ccindex"."...
2 votes · 2 answers · 2k views
Extracting the payload of a single Common Crawl WARC
I can query all occurrences of a certain base URL within a given Common Crawl index, saving them all to a file, and get a specific article (test_article_num) using the code below. However, I have not come ...
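Given the `filename`, `offset`, and `length` fields from an index lookup, a single record can be fetched with an HTTP `Range` request against data.commoncrawl.org and decompressed as one gzip member. A sketch (the function names are mine, not a library API):

```python
import gzip
import urllib.request

def record_range_header(offset: int, length: int) -> str:
    """HTTP Range header covering exactly one gzip member in the WARC file."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(warc_filename: str, offset: int, length: int) -> bytes:
    """Network call: filename, offset, and length come from a CDX index lookup."""
    req = urllib.request.Request(
        "https://data.commoncrawl.org/" + warc_filename,
        headers={"Range": record_range_header(offset, length)},
    )
    with urllib.request.urlopen(req, timeout=60) as r:
        return gzip.decompress(r.read())  # one member -> one WARC record

# Pure part, safe to run offline:
print(record_range_header(1000, 500))  # bytes=1000-1499
```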
3 votes · 0 answers · 604 views
Common Crawl Request returns 403 WARC
I am trying to crawl some WARC files from the Common Crawl archives, but I do not seem to get successful requests through to the server. A minimal Python example is provided below to replicate ...
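One common cause of 403s is requesting the old `commoncrawl.s3.amazonaws.com` URLs, which are no longer served anonymously; the same data is available over plain HTTPS at data.commoncrawl.org, and sending a descriptive User-Agent is good practice. A sketch (using a known listing-file path; the User-Agent string is a placeholder):

```python
import urllib.request

# A real listing-file path pattern; swap in the crawl you need.
path = "crawl-data/CC-MAIN-2017-09/wet.paths.gz"

def make_request(path: str) -> urllib.request.Request:
    """Request against the public HTTPS endpoint with an identifying UA."""
    return urllib.request.Request(
        "https://data.commoncrawl.org/" + path,
        headers={"User-Agent": "cc-fetch-example/0.1 (contact@example.com)"},
    )

req = make_request(path)
print(req.full_url)  # fetch with urllib.request.urlopen(req) when online
```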
0 votes · 1 answer · 420 views
Common Crawl request with node-fetch, axios, or got
I am trying to port my C# Common Crawl code to Node.js and am getting errors with all the HTTP libraries (node-fetch, axios, or got) when fetching a single page's HTML from the Common Crawl S3 archive.
const offset ...
2 votes · 1 answer · 284 views
Which block represents a WARC-Block-Digest?
At Line 09 below there is this line: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Line 01: WARC/1.0
Line 02: WARC-Type: request
Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-...
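The block, for digest purposes, is everything between the blank line that ends the WARC header and the two blank lines that terminate the record; the digest value is the SHA-1 of those bytes, base32-encoded. A sketch of the computation (the sample request bytes are illustrative, not the record from the question):

```python
import base64
import hashlib

def block_digest(block: bytes) -> str:
    """WARC-Block-Digest value: base32-encoded SHA-1 of the record block."""
    return "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode()

# For a request record, the block is the HTTP request itself.
print(block_digest(b"GET /vital-signs HTTP/1.1\r\nHost: climate.nasa.gov\r\n\r\n"))
```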
4 votes · 1 answer · 2k views
Common Crawl data search all pages by keyword
I am wondering if it is possible to look up a keyword using the Common Crawl API in Python and retrieve pages that contain the keyword. For example, if I look up "stack overflow" it will ...
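The CDX index API matches on URLs only, so keyword search means fetching the page text (e.g. from WET files) and filtering it yourself, or running Athena over the columnar index. A minimal sketch of the filtering step, with made-up sample records:

```python
def pages_with_keyword(records, keyword):
    """records: iterable of (url, text) pairs, e.g. parsed from WET files.
    Case-insensitive substring match; the index itself cannot do this."""
    kw = keyword.lower()
    return [url for url, text in records if kw in text.lower()]

sample = [
    ("https://a.example/", "All about Stack Overflow and Python"),
    ("https://b.example/", "Nothing relevant here"),
]
print(pages_with_keyword(sample, "stack overflow"))  # ['https://a.example/']
```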
0 votes · 1 answer · 388 views
How to get a listing of WARC files using HTTP for Common Crawl News Dataset?
I can obtain listing for Common Crawl by:
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz
How can I do this with the Common Crawl News Dataset?
I tried different options, but ...
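CC-NEWS is organized per month rather than per crawl, so its listing lives under a year/month path instead of a crawl ID. A sketch of the URL pattern as I understand the layout (verify against the dataset's documentation before relying on it):

```python
def news_paths_url(year: int, month: int) -> str:
    """Monthly WARC listing for the CC-NEWS dataset (per month, not per crawl)."""
    return (
        "https://data.commoncrawl.org/crawl-data/CC-NEWS/"
        f"{year}/{month:02d}/warc.paths.gz"
    )

print(news_paths_url(2017, 9))
# https://data.commoncrawl.org/crawl-data/CC-NEWS/2017/09/warc.paths.gz
```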
0 votes · 1 answer · 196 views
Getting date of first crawl of URL by Common Crawl?
In Common Crawl, the same URL can be harvested multiple times.
For instance, a Reddit blog post can be crawled when it was created and then again when subsequent comments were added.
Is there a way to find when a ...
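Each crawl has its own CDX index, so finding the first capture means querying every index for the URL and taking the minimum of the 14-digit `YYYYMMDDhhmmss` timestamps. The parsing step can be sketched offline with made-up CDX JSON lines:

```python
import json
from datetime import datetime

def earliest_capture(cdx_json_lines):
    """cdx_json_lines: result lines from CDX queries (output=json), possibly
    collected across several crawl indexes; returns the oldest capture time."""
    stamps = [json.loads(line)["timestamp"] for line in cdx_json_lines if line.strip()]
    return min(datetime.strptime(t, "%Y%m%d%H%M%S") for t in stamps)

# Made-up sample lines in the CDX JSON shape:
sample = [
    '{"url": "https://reddit.com/r/x/post", "timestamp": "20200315120000"}',
    '{"url": "https://reddit.com/r/x/post", "timestamp": "20180102030405"}',
]
print(earliest_capture(sample))  # 2018-01-02 03:04:05
```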