User talk:DemonDays64/Bot Archive 1

From Wikipedia, the free encyclopedia

BRFA for DemonDays64 Bot

Your BRFA for DemonDays64 Bot has been approved for trial. Please see the request page for more details. Enterprisey (talk!) 18:37, 3 December 2019 (UTC)

Just as a note, when you use the {{BotTrialComplete}} template AnomieBOT will automatically move the BRFA transclusion to the proper section; while you are of course welcome to move it yourself, I figured I'd let you know and save yourself some edits in the future. Primefac (talk) 14:43, 7 December 2019 (UTC)
@Primefac: oh! Didn't know that. The guide said to do both, so I did both. BTW, any estimate on when my bot will probably be approved/rejected by? There's a lot of time where I'm just waiting and thinking it could be weeks, days, or like an hour until the BAG says anything new about the bot, because I'm new to the process so I don't know the timeframe. Thanks! DemonDays64 (talk) 14:49, 7 December 2019 (UTC)
Depends on who is around and how complex the task. It can be any or all of the above time frames. Primefac (talk) 03:19, 8 December 2019 (UTC)

Removing www

Hi! The bot also removed the "www." parts of a couple of links [1]. I see you have an explicit table of sites, but I don't see why removing "www." from sites that then redirect back to "www." is valid (like https://ign.com/games/midi-maze)? I don't see this mentioned in the BRFA (@Primefac:). —  HELLKNOWZ   ▎TALK 16:28, 11 December 2019 (UTC)

Same observation here. E.g.: http://www.gamesradar.com/ is replaced with https://gamesradar.com/, which again redirects to https://www.gamesradar.com/. Lordtobi () 18:30, 11 December 2019 (UTC)
@Lordtobi: @Hellknowz: As with User:Bender the Bot, the bot removes the www.; it doesn’t affect any of the specific links that are edited, and it is much easier to remove the www. than to keep it: preserving it would take two separate RegExes for each site. Programming that would be pointless, with literally no benefit. DemonDays64 (talk) 18:44, 11 December 2019 (UTC)
Couldn't you just group the URL as (?:http:\/\/)?((?:www\.)?gamesradar\.com) and use https://$1 as replacement? Lordtobi () 18:51, 11 December 2019 (UTC)
@Lordtobi: it’s very convenient for me to have just one kind of RegEx—if something can be improved, I can just find and replace in the JS file. Using your method to preserve the www for some links would stop that. IMHO it’d be pointless; the system works fine as-is. DemonDays64 (talk) 19:54, 11 December 2019 (UTC)
Remember that some webservers might not be able to handle such redirects, so I would argue that preserving the original link is not pointless as you say. The method I proposed allows for using one kind of regex and still handles both cases. Lordtobi () 19:59, 11 December 2019 (UTC)
It's not pointless, because www is a subdomain and a website can choose to not redirect it or to serve different content, even if the websites in question don't. And the websites in question redirect the URL back, which means it's not the default URL. Regardless, I don't see community consensus to perform this change en masse. And I don't see it mentioned in your BRFA. Also, can you point out where Bender the Bot removes the www subdomain? —  HELLKNOWZ   ▎TALK 20:13, 11 December 2019 (UTC)
  • You have had multiple editors tell you that the changes you are making are problematic. Please implement their suggestions (i.e. keep the www if there is one) or I will have to rescind the approval of the bot. Primefac (talk) 02:37, 12 December 2019 (UTC) (please ping on reply)

@Primefac: ok I’ll try to make it not do this. DemonDays64 (talk) 03:16, 12 December 2019 (UTC)

@Lordtobi: hi! I’ll implement this. However, I’m very inexperienced at RegEx; could you please instruct me a little bit on how you’d recommend creating a thing that will work to not remove the www? Thanks! DemonDays64 (talk) 03:18, 12 December 2019 (UTC)

Couple problems:

  1. It should not remove www as noted above. I did not see that in the BRFA. Hostnames sometimes lead to different content, and even if it doesn't today it might in the future.
  2. It should not modify |url= when a |archiveurl= is present. See [2] for the newyorker. The archived link is http://www.newyorker.com/archive/2006/08/21/060821fa_fact but it changed the primary URL to http://newyorker.com/archive/2006/08/21/060821fa_fact (note the missing www). This will set off alarms in the archive bots: when the primary URL and archive URL are mismatched, the bots will attempt to reconcile the difference, leading to bot wars.

Modifying URLs with regex search-replace is deceptively easy and rarely a good idea. This type of work should be done by a full bot that can test that the new URL is working before making the change, that can detect archive URLs and {{webarchive}} templates, etc. Other problems that can come up include websites whose https URLs are actually redirects back to http (MSNBC did this for tens of thousands of links), and verifying that the content at https is the same as at http (they are sometimes different). The work you're doing here is important, but more sophisticated than a regex with AWB. -- GreenC 03:21, 12 December 2019 (UTC)

  • GreenC, is there enough of a concern to raise this at WP:BOTN? In the BRFA it sounded like the links were being checked manually to ensure the new URL would work properly, but I'm willing to re-evaluate the situation at BOTN if you think it's necessary. Primefac (talk) 12:05, 12 December 2019 (UTC) (please ping on reply)
@Primefac: I don't check the individual links, but I check the site in general; it's not like the sites have some HTTPS pages and some HTTP pages, so I don't think it is a problem. The thing with the web archives could be, though. DemonDays64 (talk) 14:45, 12 December 2019 (UTC)
MSNBC was a case where they had some https and some http. They had some servers (hostnames) with certs and some without; they are hosted across data centers globally, run by different branches, almost like different companies internally. They had 100s of hostnames.
I want to help you get the bot working again. If you can predetermine that a given hostname (e.g. "www") has a working https, and the http and https versions resolve to the same IP (or IP-routed block), then a blind conversion should be safe. If they go to different IPs, they may be serving different content from different servers (this is not common but it happens), or there is some other unknown unknown that needs a closer look: for example, they migrated from one server to another and didn't migrate all of the paths, creating 404s in the ssl space. For the archives, so long as it's only modifying the scheme (http -> https) it should be OK; the problem arises when modifying the hostname, domain name, path, query or fragment. -- GreenC 16:11, 12 December 2019 (UTC)
@GreenC: Hi! Sorry that I'm replying late. Anyway, how would one verify those things about the server using the same IP? Thanks! DemonDays64 (talk) 04:06, 26 December 2019 (UTC)
@GreenC: Hi! Using a modification of what Lordtobi said to do, I have made a new RegEx and approval page at Wikipedia:Bots/Requests for approval/DemonDays64 Bot 2. Please comment there on it. Thanks! DemonDays64 (talk) 04:55, 30 December 2019 (UTC)

Wikipedia:Bots/Requests for approval/DemonDays64 Bot 2 has been approved! Happy editing and thanks for taking on this task. --TheSandDoctor Talk 07:22, 17 February 2020 (UTC)

Bot edits in https on links such as [[allmusic.com]]

This edit[3] was not helpful -- the .com link is supposed to go to a WP article (allmusic.com), not a .com web link. —2606:A000:1126:28D:B5B6:B7C1:7A7:18D0 (talk) 04:23, 20 February 2020 (UTC)

oh! I will fix the regex right away to not change links with double square brackets. Thank you! DemonDays64 (talk) 04:37, 20 February 2020 (UTC)
 Done DemonDays64 (talk) 04:54, 20 February 2020 (UTC)

This was also a problem here. Noting for the record, assuming it's now fixed. -Pmffl (talk) 14:09, 20 February 2020 (UTC)

@Pmffl: yup, that shouldn't happen now. Thanks! DemonDays64 (talk) 16:08, 20 February 2020 (UTC)