75 questions
1 vote · 1 answer · 41 views
Fetch Page Content from Common Crawl
I have thousands of web pages from different websites. Is there a fast way to get the content of all those pages using Common Crawl and Python?
Below is the code I am trying, but the process is slow.
async ...
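A common way to speed up bulk lookups is to resolve each page through the Common Crawl CDX index API and fan the queries out over a thread pool. The sketch below assumes the public index server at index.commoncrawl.org; the crawl ID `CC-MAIN-2024-10` is only an example (pick a real one from the index listing), and `lookup_all` is a hypothetical helper, not a library function.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

# Example crawl ID; substitute one listed at https://index.commoncrawl.org/
CDX_API = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

def cdx_query_url(page_url: str) -> str:
    """Build a CDX index query URL for one page (JSON output, latest capture)."""
    return CDX_API + "?" + urlencode({"url": page_url, "output": "json", "limit": "1"})

def lookup_all(urls, workers=16):
    """Resolve many pages concurrently; fetch_one is where the HTTP call goes."""
    import urllib.request

    def fetch_one(u):
        with urllib.request.urlopen(cdx_query_url(u), timeout=30) as r:
            return r.read().decode("utf-8", "replace")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))

# Pure part, safe to run offline:
print(cdx_query_url("https://example.com/"))
```

Each index hit returns the WARC filename, offset, and length needed to fetch the page body itself with a ranged request.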
0 votes · 1 answer · 66 views
Querying AWS Athena the right way
I get a timeout querying https://commoncrawl.org/overview data with Athena ... and if it succeeds it will cost me $1000 per query ... $5 for each TB with 200 TB (?) ... which is really too much
This is, ...
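Athena's cost is driven entirely by bytes scanned, so the usual fix is partition pruning: filtering on the `crawl` and `subset` partition columns keeps Athena from scanning the whole table. A rough sketch of the arithmetic, assuming Athena's published rate of $5 per TB scanned (the crawl ID and table names follow the ccindex setup; the byte figures are illustrative):

```python
ATHENA_USD_PER_TB = 5.0  # published Athena rate; check current pricing

def scan_cost_usd(bytes_scanned: int) -> float:
    """Approximate Athena bill: dollars per TB of data scanned."""
    return bytes_scanned / 1e12 * ATHENA_USD_PER_TB

# Restricting to one crawl and one subset prunes partitions, so Athena scans
# gigabytes instead of the full index:
QUERY = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2024-10'   -- partition column
  AND subset = 'warc'             -- partition column
  AND url_host_registered_domain = 'example.com'
"""

print(scan_cost_usd(200e12))  # unpruned scan of ~200 TB -> 1000.0
print(scan_cost_usd(50e9))    # one pruned slice of ~50 GB -> 0.25
```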
3 votes · 3 answers · 799 views
Querying HTML Content in Common Crawl Dataset Using Amazon Athena
I am currently exploring the massive Common Crawl dataset hosted on Amazon S3 and am attempting to use Amazon Athena to query this dataset. My objective is to search within the HTML content of the web ...
1 vote · 1 answer · 402 views
Is there any way to check if a certain domain exists in Common Crawl?
I want to know whether a certain domain exists in Common Crawl's crawl data. Is there an API or any other way to check that?
I couldn't find a way to achieve this in their documentation.
In my solution ...
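One lightweight approach is a CDX index query with `matchType=domain` and `limit=1`: a single hit proves the domain appears in that crawl. A sketch, assuming the public index server at index.commoncrawl.org (the crawl ID is an example, and each crawl has its own index, so loop over crawls if you need full coverage):

```python
from urllib.parse import urlencode
import urllib.error
import urllib.request

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"  # example crawl

def domain_query_url(domain: str) -> str:
    """One matching capture is enough to prove the domain is present."""
    params = {"url": domain, "matchType": "domain", "limit": "1", "output": "json"}
    return CDX_API + "?" + urlencode(params)

def domain_in_crawl(domain: str) -> bool:
    """HTTP call; the CDX server answers 404 when nothing matches."""
    try:
        with urllib.request.urlopen(domain_query_url(domain), timeout=30) as r:
            return bool(r.read().strip())
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False
        raise

# Pure part, safe to run offline:
print(domain_query_url("example.com"))
```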
1 vote · 1 answer · 106 views
Python's zlib doesn't work on CommonCrawl file
I was trying to unzip a file using Python's zlib and it doesn't seem to work. The file is 100MB from Common Crawl and I downloaded it as wet.gz. When I unzip it in the terminal with gunzip, everything ...
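The usual cause is twofold: `zlib.decompress` defaults to raw zlib framing and rejects the gzip header, and Common Crawl's `.wet.gz`/`.warc.gz` files are concatenations of many gzip members, one per record, so decompression must also loop over members. A self-contained sketch of both fixes:

```python
import gzip
import zlib

# Build a two-member gzip stream in memory, like a WARC/WET .gz file,
# which stores each record as its own gzip member.
data = gzip.compress(b"record one\n") + gzip.compress(b"record two\n")

# Plain zlib.decompress expects a raw zlib stream, so a .gz file fails:
try:
    zlib.decompress(data)
except zlib.error as e:
    print("zlib.decompress failed:", e)

# Option 1: the gzip module reads concatenated members transparently.
print(gzip.decompress(data))  # b'record one\nrecord two\n'

# Option 2: zlib with wbits=MAX_WBITS | 16 (gzip framing), looping over members.
out, rest = b"", data
while rest:
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)
    out += d.decompress(rest)
    rest = d.unused_data  # bytes left after the current member's end
print(out)
```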
-1 votes · 1 answer · 433 views
Unknown archive format! How can I extract URLs from the WARC file in Jupyter?
I'm trying to extract website URLs from a .WARC (Web ARChive) file from the Common Crawl dataset (commoncrawl.org).
After decompressing the file and writing the code to read this file, I attached the code:...
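For real work, the warcio or fastwarc libraries are the robust way to read WARC files (and "Unknown archive format" usually means the file is still gzip-compressed or was truncated during download). As a stdlib-only illustration of where the URLs live, every record header carries a `WARC-Target-URI` line; this sketch parses one from an in-memory sample:

```python
import io

# A minimal WARC record header, as found (one gzip member per record) in
# Common Crawl .warc.gz files.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://example.com/page\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n\r\n"
)

def target_uris(stream):
    """Yield the WARC-Target-URI of every record header in a decompressed WARC."""
    for raw in stream:
        if raw.lower().startswith(b"warc-target-uri:"):
            yield raw.split(b":", 1)[1].strip().decode()

print(list(target_uris(io.BytesIO(sample))))  # ['https://example.com/page']
```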
4 votes · 1 answer · 960 views
Common Crawl requirement to power a decent search engine
Common Crawl releases massive data loads every month, each nearly hundreds of terabytes in size. This has been going on for the last 8-9 years.
Are these snapshots independent (probably not)? Or do we have to ...
0 votes · 1 answer · 247 views
How to access Columnar URL INDEX using Amazon Athena
I am new to AWS and I'm following this tutorial to access Columnar dataset in Common Crawl. I executed this query:
SELECT COUNT(*) AS count,
url_host_registered_domain
FROM "ccindex"."...
2 votes · 2 answers · 2k views
Extracting the payload of a single Common Crawl WARC
I can query all occurrences of a certain base URL within a given Common Crawl index, saving them all to a file, and get a specific article (test_article_num) using the code below. However, I have not come ...
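Given the `filename`, `offset`, and `length` fields from an index lookup, a single record can be fetched with an HTTP `Range` request against data.commoncrawl.org and decompressed as one gzip member. A sketch (the function names are mine, not a library API):

```python
import gzip
import urllib.request

def record_range_header(offset: int, length: int) -> str:
    """HTTP Range header covering exactly one gzip member in the WARC file."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(warc_filename: str, offset: int, length: int) -> bytes:
    """Network call: filename, offset, and length come from a CDX index lookup."""
    req = urllib.request.Request(
        "https://data.commoncrawl.org/" + warc_filename,
        headers={"Range": record_range_header(offset, length)},
    )
    with urllib.request.urlopen(req, timeout=60) as r:
        return gzip.decompress(r.read())  # one member -> one WARC record

# Pure part, safe to run offline:
print(record_range_header(1000, 500))  # bytes=1000-1499
```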
3 votes · 0 answers · 604 views
Common Crawl Request returns 403 WARC
I am trying to crawl some WARC files from the Common Crawl archives, but I do not seem to get successful requests through to the server. A minimal Python example is provided below to replicate ...
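One common cause of 403s is requesting the old `commoncrawl.s3.amazonaws.com` URLs, which are no longer served anonymously; the same data is available over plain HTTPS at data.commoncrawl.org, and sending a descriptive User-Agent is good practice. A sketch (using a known listing-file path; the User-Agent string is a placeholder):

```python
import urllib.request

# A real listing-file path pattern; swap in the crawl you need.
path = "crawl-data/CC-MAIN-2017-09/wet.paths.gz"

def make_request(path: str) -> urllib.request.Request:
    """Request against the public HTTPS endpoint with an identifying UA."""
    return urllib.request.Request(
        "https://data.commoncrawl.org/" + path,
        headers={"User-Agent": "cc-fetch-example/0.1 (contact@example.com)"},
    )

req = make_request(path)
print(req.full_url)  # fetch with urllib.request.urlopen(req) when online
```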
0 votes · 1 answer · 420 views
Common Crawl request with node-fetch, axios, or got
I am trying to port my C# Common Crawl code to Node.js and am getting errors with all the HTTP libraries (node-fetch, axios, or got) when fetching a single page's HTML from the Common Crawl S3 archive.
const offset ...
2 votes · 1 answer · 284 views
Which block represents a WARC-Block-Digest?
At Line 09 below there is this line: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Line 01: WARC/1.0
Line 02: WARC-Type: request
Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-...
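The block, for digest purposes, is everything between the blank line that ends the WARC header and the two blank lines that terminate the record; the digest value is the SHA-1 of those bytes, base32-encoded. A sketch of the computation (the sample request bytes are illustrative, not the record from the question):

```python
import base64
import hashlib

def block_digest(block: bytes) -> str:
    """WARC-Block-Digest value: base32-encoded SHA-1 of the record block."""
    return "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode()

# For a request record, the block is the HTTP request itself.
print(block_digest(b"GET /vital-signs HTTP/1.1\r\nHost: climate.nasa.gov\r\n\r\n"))
```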
4 votes · 1 answer · 2k views
Common Crawl data search all pages by keyword
I am wondering if it is possible to look up a keyword using the Common Crawl API in Python and retrieve pages that contain the keyword. For example, if I look up "stack overflow" it will ...
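The CDX index API matches on URLs only, so keyword search means fetching the page text (e.g. from WET files) and filtering it yourself, or running Athena over the columnar index. A minimal sketch of the filtering step, with made-up sample records:

```python
def pages_with_keyword(records, keyword):
    """records: iterable of (url, text) pairs, e.g. parsed from WET files.
    Case-insensitive substring match; the index itself cannot do this."""
    kw = keyword.lower()
    return [url for url, text in records if kw in text.lower()]

sample = [
    ("https://a.example/", "All about Stack Overflow and Python"),
    ("https://b.example/", "Nothing relevant here"),
]
print(pages_with_keyword(sample, "stack overflow"))  # ['https://a.example/']
```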
0 votes · 1 answer · 388 views
How to get a listing of WARC files using HTTP for Common Crawl News Dataset?
I can obtain listing for Common Crawl by:
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz
How can I do this with the Common Crawl News Dataset?
I tried different options, but ...
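CC-NEWS is organized per month rather than per crawl, so its listing lives under a year/month path instead of a crawl ID. A sketch of the URL pattern as I understand the layout (verify against the dataset's documentation before relying on it):

```python
def news_paths_url(year: int, month: int) -> str:
    """Monthly WARC listing for the CC-NEWS dataset (per month, not per crawl)."""
    return (
        "https://data.commoncrawl.org/crawl-data/CC-NEWS/"
        f"{year}/{month:02d}/warc.paths.gz"
    )

print(news_paths_url(2017, 9))
# https://data.commoncrawl.org/crawl-data/CC-NEWS/2017/09/warc.paths.gz
```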
0 votes · 1 answer · 196 views
Getting date of first crawl of URL by Common Crawl?
In Common Crawl, the same URL can be harvested multiple times.
For instance, a Reddit blog post can be crawled when it was created and then again when subsequent comments were added.
Is there a way to find when a ...
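Each crawl has its own CDX index, so finding the first capture means querying every index for the URL and taking the minimum of the 14-digit `YYYYMMDDhhmmss` timestamps. The parsing step can be sketched offline with made-up CDX JSON lines:

```python
import json
from datetime import datetime

def earliest_capture(cdx_json_lines):
    """cdx_json_lines: result lines from CDX queries (output=json), possibly
    collected across several crawl indexes; returns the oldest capture time."""
    stamps = [json.loads(line)["timestamp"] for line in cdx_json_lines if line.strip()]
    return min(datetime.strptime(t, "%Y%m%d%H%M%S") for t in stamps)

# Made-up sample lines in the CDX JSON shape:
sample = [
    '{"url": "https://reddit.com/r/x/post", "timestamp": "20200315120000"}',
    '{"url": "https://reddit.com/r/x/post", "timestamp": "20180102030405"}',
]
print(earliest_capture(sample))  # 2018-01-02 03:04:05
```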