Commons:Bots/Requests/WebArchiveBOT
Operator: Amitie 10g (talk · contributions · Statistics · Recent activity · block log · User rights log · uploads · Global account information)
Bot's tasks for which permission is being sought: Use the MediaWiki API at a high rate to retrieve the latest uploaded files (in fact, file pages) and extract the external links from them. Source code is available at GitHub (work in progress, constantly updated).
I will use this account for security purposes: the tool is hosted at WMF Labs, and I don't want to compromise my main account.
Automatic or manually assisted: Automatic.
Edit type: None, this bot does not edit the Wiki.
Maximum query rate: 100 queries (or even more) every 5 minutes.
Bot flag requested: No, unless it affects the maximum query rate.
Programming language(s): PHP
Amitie 10g (talk) 07:36, 18 December 2015 (UTC)
Discussion
What is your intention regarding the gathered data? --Krd 09:57, 18 December 2015 (UTC)
- I find this idea useful because many external links (mainly the source) could become unavailable at any time. Archiving them keeps the source of the files always available (especially for DRs claiming lack of source). --Amitie 10g (talk) 17:14, 18 December 2015 (UTC)
- Where is the actual data (archive) being stored? --Krd 07:33, 19 December 2015 (UTC)
- The data (pagename and external links) is stored in my tool home on the WMF Labs server; blacklisted domains will not be saved. Two files are stored: a compressed JSON with all the queries done, and a plain JSON used as a cache for the page. Is this a problem (e.g. security or privacy)? --Amitie 10g (talk) 13:17, 19 December 2015 (UTC)
- You say that you want to retrieve external links from new file pages. Do you collect the links only, or also their content? How can the collected data be accessed? --Krd 16:14, 21 December 2015 (UTC)
I'll explain how the tool works and interacts with the MediaWiki API in order:
- Get the latest files with the Title and Timestamp:
?action=query&list=allimages&format=php&aisort=timestamp&aidir=older&aiprop=timestamp%7Ccanonicaltitle
- Iterate the list of pages, and get just the external links:
?action=parse&format=php&prop=externallinks&page=$page
- Then, filter these external links and remove blacklisted domains (including the WMF websites, Creative Commons sites, etc.) with a regex.
- Iterate over the external links and check whether each one has already been archived at the Internet Archive within the last 48 hours (by comparing timestamps):
https://archive.org/wayback/available?url=$url
- The above query returns a JSON with the results. If the link was already archived, the JSON contains the link to the latest archived version. If not, the JSON contains an error message. Then, if the link was not archived in the last 48 hours, ask the Internet Archive to save it, and read just the HTTP headers of the response:
https://web.archive.org/save/$link
- The headers contain the final link, like the following (unless errors occurred):
https://web.archive.org/web/XXXXXXXXXXXXX/https://example.com/index.html
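The steps above can be sketched in Python (the bot itself is written in PHP, so this is an illustration and not the bot's code). The API endpoint, the blacklist pattern, and all function names are assumptions made for the example; the real regex is not shown in this discussion. The check assumes the availability response carries `archived_snapshots.closest.timestamp` in `YYYYMMDDhhmmss` form. No network requests are made.

```python
import re
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode, urlsplit

API = "https://commons.wikimedia.org/w/api.php"  # assumed endpoint

# Hypothetical blacklist; the bot's actual regex is not shown in the discussion.
BLACKLIST = re.compile(r"(^|\.)(wikimedia\.org|wikipedia\.org|creativecommons\.org)$")

MAX_AGE = timedelta(hours=48)  # the 48-hour window mentioned above


def latest_files_url():
    """Step 1: URL that lists the newest uploads with title and timestamp."""
    params = {
        "action": "query",
        "list": "allimages",
        "format": "json",  # the bot uses format=php; JSON is easier from Python
        "aisort": "timestamp",
        "aidir": "older",
        "aiprop": "timestamp|canonicaltitle",
    }
    return API + "?" + urlencode(params)


def external_links_url(page):
    """Step 2: URL that returns only the external links of one page."""
    params = {"action": "parse", "format": "json", "prop": "externallinks", "page": page}
    return API + "?" + urlencode(params)


def keep_link(url):
    """Step 3: drop links whose host matches the blacklist."""
    host = urlsplit(url).hostname or ""
    return BLACKLIST.search(host) is None


def needs_archiving(availability, now):
    """Step 4: True if there is no snapshot, or the closest one is older than 48 hours."""
    closest = availability.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return True
    taken = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    taken = taken.replace(tzinfo=timezone.utc)
    return now - taken > MAX_AGE
```

A link that passes `keep_link` and for which `needs_archiving` returns True would then be sent to the save URL shown above; that request is left out of the sketch on purpose.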
The archive links returned by the Internet Archive replace the original external links taken from the file page, and the rest of the data is discarded. The following is an example (from an actual query) of the JSON that is actually stored:
{
  "File:Naturalis Biodiversity Center - RMNH.MOL.299590 - Lysinoe ghiesbreghti (Nyst, 1841) - Xanthonychidae - Mollusc shell.jpeg": {
    "timestamp": 1450735044,
    "urls": [
      "https:\/\/web.archive.org\/web\/20151221215811\/http:\/\/data.biodiversitydata.nl\/naturalis\/specimen\/RMNH.MOL.299590",
      "https:\/\/web.archive.org\/web\/20151221215813\/http:\/\/bioportal.naturalis.nl\/nba\/result?nba_request=specimen%2Fget-specimen%2F%3FunitID%3DRMNH.MOL.299590"
    ]
  },
  "File:Naturalis Biodiversity Center - RMNH.MOL.299564 1 - Leptarionta trigonostoma (Pfeiffer, 1844) - Xanthonychidae - Mollusc shell.jpeg": {
    "timestamp": 1450735043,
    "urls": [
      "https:\/\/web.archive.org\/web\/20151221215811\/http:\/\/data.biodiversitydata.nl\/naturalis\/specimen\/RMNH.MOL.299590",
      "https:\/\/web.archive.org\/web\/20151221215813\/http:\/\/bioportal.naturalis.nl\/nba\/result?nba_request=specimen%2Fget-specimen%2F%3FunitID%3DRMNH.MOL.299590"
    ]
  }
}
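As a rough illustration of the stored structure, here is a Python sketch that builds one such record. The helper names and the example title are hypothetical; only the key names (`timestamp`, `urls`) and the `https://web.archive.org/web/<14-digit timestamp>/<original URL>` link shape are taken from the example above.

```python
import json
import re

# Shape of the archive links seen in the stored example.
ARCHIVE_LINK = re.compile(r"^https://web\.archive\.org/web/(\d{14})/(.+)$")


def split_archive_link(link):
    """Return (snapshot timestamp, original URL), or None for any other shape."""
    m = ARCHIVE_LINK.match(link)
    return (m.group(1), m.group(2)) if m else None


def add_record(store, title, upload_timestamp, archive_links):
    """Add one file page to the store, keeping only well-formed archive links."""
    store[title] = {
        "timestamp": upload_timestamp,
        "urls": [u for u in archive_links if split_archive_link(u)],
    }
    return store


store = add_record(
    {},
    "File:Example.jpeg",  # hypothetical title
    1450735044,
    [
        "https://web.archive.org/web/20151221215811/"
        "http://data.biodiversitydata.nl/naturalis/specimen/RMNH.MOL.299590"
    ],
)
print(json.dumps(store, indent=2))
```

PHP's `json_encode` escapes `/` as `\/` by default, which is why the stored example shows `\/`; Python's `json.dumps` does not, and both forms decode to the same string.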
Note that I have paused the tool to allow for more discussion (like this), but the source code is still available at GitHub and works correctly; most of your questions can be answered by reading the source code. --Amitie 10g (talk) 22:07, 21 December 2015 (UTC)
- Ok, understood. (Internet Archive was the missing link for me.)
- Technically this looks OK to me, but I can imagine legal concerns, e.g. if the source contains copyvio or other forbidden material and by requesting archiving you help distribute it. IMO this should be investigated further before starting. --Krd 18:21, 22 December 2015 (UTC)
- I also considered the non-free files issue, and the Blacklist could help to combat these legal concerns. Also, consider that several sites have configured robots.txt to forbid crawling (and the webmasters are responsible for that).
- For now, the testing stage was successful (running the script from WMF Labs several times and appending the results to the existing JSON) and bugs were resolved. The script can be adjusted to retrieve external links only under certain conditions, including a delay to allow other bots to check the files (though this may require increasing both the number of files retrieved per query and the time between executions), and to discard files lacking valid license tags.
- Thanks for your feedback. --Amitie 10g (talk) 21:38, 22 December 2015 (UTC)
- I would like to keep this open for a while to gather more opinions. --Krd 15:10, 28 December 2015 (UTC)
- I agree. Meanwhile, I'm making some updates and running the tool once or twice a week.
- The source is still on GitHub, so any developer can give opinions by reading the source code, too. --Amitie 10g (talk) 02:19, 29 December 2015 (UTC)
I consider this provisionally approved and suggest keeping the request open for a while longer, in case more discussion is required. --Krd 08:46, 31 December 2015 (UTC)
- Thanks. I'll keep the tool running. Users can analyze the links in the JSON and report which of them should be blacklisted. --Amitie 10g (talk) 19:31, 31 December 2015 (UTC)
Approved. --Krd 14:06, 15 January 2016 (UTC)