User talk:NicDumZ/Archive 1

This page is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Welcome

Hello, NicDumZ, and welcome to Wikipedia. Thank you for your contributions. I hope you like it here and decide to stay. If you are looking for help, please do any of the following:

visit the new contributors' help page, where experienced Wikipedians can answer any queries you have
type {{helpme}} on your user page, and someone will answer your questions shortly
visit the directory of help pages

There are a lot of standards and policies here, but as long as you are editing in good faith, you are encouraged to be bold in updating pages. Here are a few links you might find useful:

I hope you enjoy editing here and being a Wikipedian! Please sign your name on talk and vote pages using four tildes (~~~~), which produces your name and the current date. Also, it would be a huge help if you could explain each of your edits with an edit summary. Again, welcome!--NAHID 17:49, 25 July 2007 (UTC)

User:NicDumZ/Bir Hakeim

User:NicDumZ/Bir Hakeim moved. Anthony Appleyard 16:41, 27 July 2007 (UTC)

Smile

Connell66 has smiled at you! Smiles promote WikiLove and hopefully this one has made your day better. Spread the WikiLove by smiling at someone else, whether it be someone you have had disagreements with in the past or a good friend. Happy editing!
Smile at others by adding {{subst:Smile}} to their talk page with a friendly message.

Proof. Request

You just sent me a message. Yes, I will proofread it. I can't translate, though. Laleena^{talk to me contributions to Wikipedia} 12:18, 28 July 2007 (UTC)

"French apartheid" article

Hi NicDumZ. I have encountered similar problems, and I don't have a simple answer. Disruptive editors will often tie up talk-page debates with red herrings, strawman arguments and question-begging, so it's best to address your arguments to an ideal intelligent editor, rather than a specific disruptive one. Also consider filing an RfC. In the meantime I will take a look at the specific issue you're referring to.--G-Dett 16:30, 28 July 2007 (UTC)

Hi. Besides what G-Dett says, a way is writing a good article on the subject; I linked some material here--Victor falk 19:52, 30 July 2007 (UTC)

Your note

Thanks for your note. Discussion is good, and I think we've managed to have a reasonable one, even though disagreeing on many points. Jayjg ^(talk) 21:18, 30 July 2007 (UTC)

No problem

Ah, I didn't even see you'd changed it. I just tested it and assumed I'd mixed them up. Thanks for letting me know. Mackan79 15:02, 3 August 2007 (UTC)

AN/I thread

I didn't mean to imply, in my comments on WP:AN/I, that you were in any way "the bad guy". I'm sorry if it came across that way. MastCell ^Talk 20:52, 6 August 2007 (UTC)

I really apologize. I was not referring to you as a WP:SPA. I was referring to Jeemde (talk · contribs), whose comment apparently sparked Greg Park Avenue's response. Jeemde seems to be a single-purpose account created solely to participate in the French apartheid AfD and talk page - which is what I was addressing. I don't think you were being especially provocative (though Jeemde was). I don't consider you a single-purpose account, and I wasn't referring to you, but to Jeemde. I apologize for not making myself clearer, and for unintentionally causing offense. MastCell ^Talk 15:28, 7 August 2007 (UTC)

A Question

Excuse me, I read your message on my discussion page and I wanted to know about something. What kind of articles can I create? I really want to create an article, but I can't think of one. Can you help me, please? KiaraFan13 18:17, 7 August 2007 (UTC)

Articles Situation

Yes, you can look at my contributions, but I'll try to wait for a few months until I can create an article. I was going to create an article about a fictional character in my series of plays called The 23 Kids, but the idea was off because she is modeled after Hannah Montana and she wouldn't want to be featured on television. KiaraFan13 19:20, 7 August 2007 (UTC)

Battle of Bir Hakeim

Hi NicDumZ,
Good work on translating the Battle of Bir Hakeim article, and don't worry about the mistakes, translation is always a bit tricky.

If you want to improve it further, I have only one thing to say to you: inline citations! Your article won't get past start class if all the important facts aren't appropriately cited. To know which points need a cite, and how to do it see WP:MILHIST#CITE. In short:

Use footnotes (<ref> </ref>)
Give a ref. for precise figures (troop strengths, casualties, etc.), quotations, and anything that looks vaguely controversial or open to debate.
If your source is a book don't forget to give the page number for each point you're citing.

Ideally, as this is English wikipedia, you should use sources in English, but if none are available, you'll have to use some in French.

There you go, I hope this helped, and if you need more advice, you can also ask me in French.
See you later,
Raoulduke47 19:25, 7 August 2007 (UTC)

Wikipedia:Requests for arbitration/Allegations of apartheid

Hello,

An Arbitration case involving you has been opened: Wikipedia:Requests for arbitration/Allegations of apartheid. Please add any evidence you may wish the Arbitrators to consider to the evidence sub-page, Wikipedia:Requests for arbitration/Allegations of apartheid/Evidence. You may also contribute to the case on the workshop sub-page, Wikipedia:Requests for arbitration/Allegations of apartheid/Workshop.

On behalf of the Arbitration Committee, Newyorkbrad 18:04, 12 August 2007 (UTC)

Wikipedia:Requests for arbitration/Allegations of apartheid

The Arbitration Committee has adopted a motion in the above arbitration case, providing: "As the Committee has been unable to determine which actions in this matter, if any, were undertaken in bad faith, and as the community appears to be satisfactorily dealing with the underlying content dispute, the case is dismissed with no further action being taken." This notice is given by a Clerk on behalf of the Arbitration Committee. Newyorkbrad 19:13, 26 October 2007 (UTC)

Re: BJBot & Afds

Sure, I'm almost done with the new version that handles prods, after that I'll post it. BJ^Talk 14:54, 28 December 2007 (UTC)

Your bot (probably ;))

Just to be sure there is no foul play, can you please edit the user page of your bot's account with your "real" account? Thanks a lot! -- lucasbfr ^{ho ho ho} 16:09, 29 December 2007 (UTC)

EditorBot

While I appreciate your suggestion, it really doesn't make any sense to me. Why should there be inclusion guidelines for French communes? Shouldn't we have articles on all towns? The Rambot created all the articles on US towns back in 2001, and the Eubot created articles on all Italian communes in 2006. I also just noticed that all Swiss and almost all German and Austrian municipalities have articles. So do other English-speaking countries like UK and Australia. To be frank, Nick, I think your idea is rubbish. Why should we deny "any country town" in France an article when on our homefront we have one for every? Editorofthewiki (talk) 19:05, 6 January 2008 (UTC)

I mainly edit on fr:, and was just stating that I merely consider that having an article per every small French town is not necessary, since not every town is notable to me. It was not really about French town articles on en.wiki, but about French town articles on fr.wiki. :) NicDumZ ~ 01:11, 7 January 2008 (UTC)

In any case, something has to be done about the serious lack of French towns on Wikipedia. Editorofthewiki (talk) 22:03, 8 January 2008 (UTC)

Your bot request

Hi NicDumZ, I wanted to let you know that Wikipedia:Bots/Requests for approval/DumZiBoT has been approved. Please visit the above link for more information. Thanks! BAGBot (talk) 03:30, 3 February 2008 (UTC)

thanks

Glad to see a bot doing this.

Does it also convert inline exlinks to refs?

I made such a change here. (It's the first change, the second just needed a different name.) --Jtir (talk) 14:16, 3 February 2008 (UTC)

Thanks !

No, it does not convert inline links! I'm afraid this would cause too much trouble: how could a bot be sure that an inline link should be converted into a reference? NicDumZ ~ 14:19, 3 February 2008 (UTC)

OK. I just noticed the helpful link in the bot's edit summary. --Jtir (talk) 14:23, 3 February 2008 (UTC)

Yes. Please feel free to edit that page if you think that it needs some improvements. Some things are obvious to me, but might not be that obvious to others... Also, my English is not that good :) NicDumZ ~ 14:31, 3 February 2008 (UTC)

"He runs every time that a new XML dump is available."

Would it be better to say "He usually runs every time ..."?

I'm not so happy with the phrase "... please check that you can access the pages", but that's the best I could come up with. --Jtir (talk) 15:37, 3 February 2008 (UTC)

Well, I do appreciate your help ! The best that you could come up with is way better than my not-so-good academic English... Thanks a lot !

I've added usually: You are right, the run frequency depends on my availability :)

NicDumZ ~ 15:49, 3 February 2008 (UTC)

Glad to help out. --Jtir (talk) 16:04, 3 February 2008 (UTC)

possibly missed exlinks

This edit possibly missed the exlinks in a named ref (<ref name="RFC3092">). The named ref had the bare exlink repeated in three places. I made the corrections in these two edits [1] [2] (it took two edits because I didn't realize the exlink had been repeated in three places). And, yes, the page can be accessed. :-) --Jtir (talk) 16:45, 3 February 2008 (UTC)

Nice catch !

I've just corrected this :)

NicDumZ ~ 16:57, 3 February 2008 (UTC)

Thanks! That's the fastest bug fix I have ever seen when I wasn't also the programmer. :-) (I guess this is your test suite.) --Jtir (talk) 17:09, 3 February 2008 (UTC)

Yes it is ! Feel free to add links overthere !! NicDumZ ~ 17:11, 3 February 2008 (UTC)

Jolly clever bot.--Wetman (talk) 17:18, 3 February 2008 (UTC)
I thought this behavior was intentional. Which is why I never brought it up in BRfA. — Dispenser 19:01, 3 February 2008 (UTC)
mmh... At some point of the dev, I remembered that I had to add this, and eventually forgot. Anyway, that's not a big problem :) NicDumZ ~ 19:11, 3 February 2008 (UTC)
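
The missed case in this thread (a bare link inside a named ref such as <ref name="RFC3092">) can be illustrated with a small pattern. This is only a sketch, not DumZiBoT's actual code: the regular expression, the name BARE_REF and the function find_bare_refs are hypothetical, and the pattern only covers a name attribute.

```python
import re

# Hypothetical pattern, not taken from the bot's source: a <ref> tag,
# optionally carrying a name attribute, whose whole content is a bare URL
# (with or without square brackets) and no link label.
BARE_REF = re.compile(
    r'<ref(?P<attrs>\s+name\s*=\s*(?:"[^"]*"|[^\s>]+))?\s*>'
    r'\s*\[?(?P<url>https?://[^\s\]<]+)\]?\s*</ref>',
    re.IGNORECASE,
)


def find_bare_refs(wikitext):
    """Return the URLs of references that consist of a bare link only."""
    return [match.group("url") for match in BARE_REF.finditer(wikitext)]
```

A reference that already has a label, such as <ref>[http://example.com/c Title]</ref>, is deliberately not matched, so only untitled links would be candidates for a title lookup.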

In da middle

It looks good to me, so long as the titles are accurate. On the other hand, I can see concerns about bots such as those raised above. In any case, so long as you're amenable to receiving feedback when and if problems arise, I think both you and the bot will be happy together. Clever, by the way. :) As a used-to-be programmer I wouldn't mind seeing the code. Cheers. Jim62sch^dissera! 21:08, 3 February 2008 (UTC)

Thanks! The code (slightly out of date) is available for now here. I believe that the above problems will be fixed once my bot gets flagged, though... :) NicDumZ ~ 21:10, 3 February 2008 (UTC)

Very, very nice. Logical and well-referenced. Congrats, well done! Jim62sch^dissera! 21:21, 3 February 2008 (UTC)

Your bot is wonderful!

What a great idea for a bot. This is something that I manually do all of the time. Your bot is very helpful, and I cannot believe that no one had thought of it sooner. Kudos! нмŵוτн τ 18:01, 3 February 2008 (UTC)

I just don't know what to answer. It makes me xD ! Thanks :) NicDumZ ~ 19:26, 3 February 2008 (UTC)

Ditto. I think DumZiBoT is doing fine. Descriptive text as a link label sure beats just a number ([2]) in the references section. -Fnlayson (talk) 20:01, 3 February 2008 (UTC)

More support: even if it picks the wrong phrase it's still an improvement, and any human follow-up is easier than before. --Old Moonraker (talk) 07:42, 4 February 2008 (UTC)

Your bot is a pain...

...but I hope that, when it has caught up with all the untitled references, I might find some reason to look at my watchlist again! TINY MARK 23:08, 3 February 2008 (UTC)

It's now flagged. I think that this should be better now :) NicDumZ ~ 00:20, 4 February 2008 (UTC)

Great bot but...

Hi. Your bot is really useful but it needs some tuning, I think. Can you please exclude JSTOR links? Check here. For non-registered users JSTOR gives the message: "JSTOR: Accessing JSTOR" and doesn't show the real HTML. -- Magioladitis (talk) 01:56, 4 February 2008 (UTC)

Exception added !

Thanks ;) NicDumZ ~ 02:02, 4 February 2008 (UTC)

Is there a way that editors could be informed that such links are present in an article and may need their attention? In analogy with the image fair use notifications, perhaps a brief message on the talk page could say that DumZiBoT had not changed such links. --Jtir (talk) 09:08, 4 February 2008 (UTC)

After looking at Andrew Sullivan, I would say that the links needing attention are obvious. --Jtir (talk) 09:41, 4 February 2008 (UTC)

What a good little bot

Thank you. --Duncan (talk) 09:46, 4 February 2008 (UTC)

Great bot

Keep it up. --Arcadian (talk) 13:09, 4 February 2008 (UTC)

Your bot is awesome

Your bot edited two pages and cleaned up the reference sections, a job that I really don't like doing. Thank you, your bot is very useful. EconomistBR (talk) 15:40, 3 February 2008 (UTC)

Can I just add to that "Yippee!!!"? This is wonderful to see! Thank you! -- SatyrTN (talk / contribs) 15:56, 3 February 2008 (UTC)

+1. I see there are some issues, but I think bot is doing a great job. utcursch | talk 04:42, 4 February 2008 (UTC)

Your bot rules, DumZiBoT just edited the GT Interactive article, over 60 references. Now the reference list looks so elegant, clean and easy on the eyes.

I can't wait to see DumZiBoT editing the Vale (mining company) (80 bare references) and Infogrames (58 bare references), the changes on those pages will be huge. EconomistBR (talk) 07:45, 4 February 2008 (UTC)

Amen, amen. Replacing cryptic ref URLs with the corresponding <title> element via a bot is a fantastic idea!! Kudos. — ¾-10 01:53, 5 February 2008 (UTC)

Your bot is not helpful

You need to turn off this bot, especially on science articles. You are making it difficult if not impossible to watch science articles for trolls, vandals and POV-warriors, because all I see on my watchlist is your useless bot. You are making Wikipedia worse off, not better, because once the POV warriors know how your bot works, they'll just put in links without titles, and your bot will format it, making yours the last change in history. This will take more work using Twinkle or other vandal fighting tools. Either turn the thing off, or I will ask for administrative assistance. OrangeMarlin ^{Talk• Contributions} 17:50, 3 February 2008 (UTC)

Can't you ask nicely, mmh ?

NicDumZ ~ 17:57, 3 February 2008 (UTC)

"especially on science articles": Could you cite a specific example? --Jtir (talk) 17:55, 3 February 2008 (UTC)

I think the bot is great, but Orangemarlin has a point -- shouldn't there be an option to ignore bot edits on watchlists? Just like you can ignore edits marked "minor"? csloat (talk) 18:37, 3 February 2008 (UTC)

There is an option to ignore bots. However, my bot has not been flagged yet, hence is not considered by Mediawiki as a bot. Just wait a few hours :) NicDumZ ~ 19:00, 3 February 2008 (UTC)

~~I will resume my edits once DumZiBoT gets flagged. But seriously Orangemarlin, adopting such a condescending tone is not the way around. I expect some excuses. NicDumZ ~ 18:38, 3 February 2008 (UTC)~~

I've changed my mind: per [3], "as your bot is listed as approved by WP:BAG, you may operate it, just keep it under 3-4 edits per min until you are flagged". DumZiBoT has been approved (hence is considered useful), and I have reduced the edit rate a bit: I see no reason to stop. NicDumZ ~ 18:57, 3 February 2008 (UTC)
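
The "3-4 edits per min" limit quoted above amounts to a minimum pause between edits. The following is a minimal sketch of such a throttle, assuming a limit of 3 edits per minute; the class EditThrottle and its parameters are hypothetical and are not taken from the bot's code.

```python
import time


class EditThrottle:
    """Hypothetical helper that spaces edits at least 60/max_per_minute seconds apart."""

    def __init__(self, max_per_minute=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self.clock = clock      # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last_edit = None   # time at which the previous edit was allowed

    def wait(self):
        """Block until the next edit is allowed; return the seconds slept."""
        now = self.clock()
        delay = 0.0
        if self.last_edit is not None:
            delay = max(0.0, self.last_edit + self.min_interval - now)
        if delay:
            self.sleep(delay)
        self.last_edit = now + delay
        return delay
```

Calling wait() before each save keeps the edit rate at or below the configured limit, whatever the speed of the rest of the script.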

Orangemarlin : Next time, please feed me with some diffs... NicDumZ ~ 18:57, 3 February 2008 (UTC)

I don't feed diffs, because frankly I'd rather edit articles than try to prove anything, since I specifically stated, your bot makes my life difficult on Wikipedia. But so do incompetent admins, anti-science editors, and trolls. You got my opinion, you ignored my opinion, I'm fine with that decision. OrangeMarlin ^{Talk• Contributions} 21:47, 3 February 2008 (UTC)

As another editor said below, once it gets caught up with all the untitled references, everything should be fine. --Jim Butler (t) 07:52, 5 February 2008 (UTC)

How can it be a bad thing for the bot to fetch the URL title? Now you will be spared from having to look at the URL to decide whether or not it is useful. It makes my life easier, and I edit science articles too. So what then? --Adoniscik (talk) 03:16, 6 February 2008 (UTC)

Andrew Sullivan

Hi, can your linkbot be set loose on Andrew Sullivan? Benji boi 08:14, 4 February 2008 (UTC)

Done NicDumZ ~ 08:17, 4 February 2008 (UTC)

Thank you! Benji boi 14:12, 4 February 2008 (UTC)

Bot

Here is another problem: "[http://www.medscape.com/viewarticle/554347?sssdmh=dm1.259053&src=ddd Log In Problems]" http://en.wikipedia.org/w/index.php?title=Pergolide&curid=622942&diff=188999410&oldid=150088654 Maybe you should add a bad word list that contains error message words... Сасусlе 12:42, 4 February 2008 (UTC)

mmhh, turned that function on from now on. NicDumZ ~ 14:38, 4 February 2008 (UTC)

We have a blacklist... and it matches when I run the regex on the title; I'm just not sure why it's still adding it. The title comes from the cookie error page at [4]. — Dispenser 14:45, 4 February 2008 (UTC)

Because the feature was disabled. I did that for a test (yesterday ?), and I, sigh, forgot to uncomment it. NicDumZ ~ 14:54, 4 February 2008 (UTC)
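
The "bad word list" suggested in this thread can be sketched as a single regular expression applied to the fetched title. This is an illustration under assumptions, not the bot's real blacklist: only the titles "Log In Problems" and "JSTOR: Accessing JSTOR" come from this page, while the other entries and the names BAD_TITLE and is_usable_title are hypothetical.

```python
import re

# Hypothetical blacklist. Only the first two entries are taken from the
# titles reported on this page; the remaining ones are assumed examples.
BAD_TITLE = re.compile(
    r"log\s*in problems"
    r"|accessing jstor"
    r"|page not found"
    r"|access denied",
    re.IGNORECASE,
)


def is_usable_title(title):
    """Return False for empty titles and for titles that look like error pages."""
    title = title.strip()
    return bool(title) and BAD_TITLE.search(title) is None
```

A link whose title fails this check would simply be left as a bare URL rather than labeled with an error message.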

list of exlinks that are excluded

Do you have a link to a list of exlinks that are excluded? (a blacklist?) I am thinking of adding a third reason to the section in User:DumZiBoT/refLinks that lists reasons an exlink might not be changed. --Jtir (talk) 12:51, 4 February 2008 (UTC)

I've commented it out, and added a banner. Tell me what you think :) NicDumZ ~ 14:19, 4 February 2008 (UTC)

Thanks, that looks good. I widened the banner, because the text was wrapping just before the last word on my display. Maybe there is a better way. (center?) --Jtir (talk) 18:23, 4 February 2008 (UTC)

Thames dumb barge?

Your bot fixed a bare reference in Landing craft, but it included the gratuitous word "dumb." Is that your idea of humor, or a flaw in your bot, or what? I have removed the word "dumb." Lou Sander (talk) 15:34, 4 February 2008 (UTC)

xD

Look at the title shown by your browser when opening this page: http://www.naval-history.net/WW2MiscRNLandingBarges.htm Get it :) ? My bot only copies the title from the page, no less, no more. And I have no idea why there is "dumb" in the title of this page ?!

NicDumZ ~ 15:37, 4 February 2008 (UTC)

From the page: "one of his first tasks was to requisition 1000 ‘dumb’ (unpowered) Thames barges". — Dispenser 16:12, 4 February 2008 (UTC)

Got it! The word wasn't in the title of the article as printed, or very visible when skimming it. I saw "dumb" and "dum" and feared the worst. Thanks for responding. Good bot. Not broken. Lou Sander (talk) 17:13, 4 February 2008 (UTC)
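
As described above, the bot copies the page's <title> element and uses it as the link label. Below is a minimal standard-library sketch of that step, assuming the HTML has already been downloaded and decoded; TitleParser, page_title and titled_link are hypothetical names, not the bot's own functions.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element of a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def page_title(html):
    """Return the page title with whitespace collapsed, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or None


def titled_link(url, html):
    """Build a labeled external link, falling back to the bare URL."""
    title = page_title(html)
    return "[%s %s]" % (url, title) if title else url
```

Because the label is copied verbatim, any odd wording in the page's own title (such as "dumb" here) ends up in the article unchanged.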

DumZ bot

Hi, I noticed that your bot introduced a hidden comment into East Mountain that looks like spam: " TopoZone - The Web's Topographic Map, and more!" Can you explain this?--Pgagnon999 (talk) 18:24, 4 February 2008 (UTC)

Hello Pgagnon !

You'll find your answer at User:DumZiBoT/refLinks :)

Cheers !

NicDumZ ~ 21:38, 4 February 2008 (UTC)

For the record, this is the link. I have reannotated. --Jtir (talk) 19:00, 4 February 2008 (UTC)

Thanks ! :) NicDumZ ~ 21:38, 4 February 2008 (UTC)

Hmmm....interesting. Not sure how I feel about the opportunity for a free hidden advertising plug for companies with clever URL titles. . .or (in this case anyway) if the bot introduced anything of value that wasn't already inherent in the URL syntax itself, but it is what it is. . .and, at the end of the day, not a super big deal.--Pgagnon999 (talk) 23:49, 4 February 2008 (UTC)

I think that it's a Wikipedian's responsibility to add titles to external links. When this hasn't been done, I do my best to fix that. If, however, the fix is not that good, well: 1) it's better than a plain hideous URL; 2) someone would have had to edit the link to add a good title anyway; after DumZiBoT, this someone just has to *fix* the title, which is less work than checking the link and adding a title... NicDumZ ~ 15:16, 5 February 2008 (UTC)

Another issue with the bot: When a URL redirects, the bot is following it to its new destination and blithely listing the title of the new URL. Where I observed this: In List of unaccredited institutions of higher learning, http://www.asiaweek.com/asiaweek/features/universities2000/artic_online.html redirected to the current issue of TIME Magazine, so the bot left a link title of "TIME Magazine - Asia Edition - February 11, 2008 Vol. 171, No. 5". That misdirection was fairly innocuous (although the current issue of the magazine would be useless as a source, at least it's clean), and I've fixed that particular misdirection with a link to the archive.org version of the original AsiaWeek article, but I think that as a general policy the bot process should be generating a list of domains that redirect, rather than generating new titles. --Orlady (talk) 15:02, 5 February 2008 (UTC)

Hmm... Right. However, a lot of websites use soft redirects if, for example, the content has moved, or if you linked to a frame when the navigation menu is in another frame. I'm afraid that logging the redirects would not do, as there would be too many of them. However, I might add some sort of exception for Time... NicDumZ ~ 15:12, 5 February 2008 (UTC)

I hadn't thought of the frames issue... I see multiple problems with following a nonframe-related redirect. One is when the domain registration has expired and a new owner has redirected it to unsavory content. Another is that the bare URL is actually more informative to a user than the description of the new target. A user who clicks on an Asiaweek URL that has the year 2000 in its name will quickly recognize what happened when they see the current issue of Time magazine, but a user who sees a link to Time magazine that makes no sense in the context is likely to assume that the Wikipedia contributors were idiots. This problem is by no means unique to Time magazine -- many domains do that kind of thing with old URLs. --Orlady (talk) 15:29, 5 February 2008 (UTC)

Of course this is exactly the reason why I created my tool. By the way that article is a horrid mess with its external links. — Dispenser 01:24, 6 February 2008 (UTC)
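
One way to act on the redirect concern raised in this thread is to compare the URL that was requested with the URL that was finally served, and to leave the reference untouched when they differ too much. This is a sketch of one possible heuristic, not something the bot is known to do; the function redirect_is_suspicious and its two rules are assumptions. The final URL could be obtained, for instance, from urllib.request.urlopen(url).geturl().

```python
from urllib.parse import urlsplit


def _host(url):
    """Lower-cased host name without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def redirect_is_suspicious(requested_url, final_url):
    """Hypothetical heuristic for redirects that probably lost the original content.

    A redirect is flagged when it lands on another host, or when a deep link
    ends up on the site's front page.
    """
    if _host(requested_url) != _host(final_url):
        return True
    requested_path = urlsplit(requested_url).path.rstrip("/")
    final_path = urlsplit(final_url).path.rstrip("/")
    return bool(requested_path) and not final_path
```

Such a check would still flag legitimate moves between domains, so flagged links might be better listed for human review than silently skipped.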

Perhaps you can turn your attention to the article "Malleus"

While you are about it, NicDumZ, perhaps you can turn your attention to the article malleus. The info box ref to the image of the gestation stage indicated needs fixing as it directs you straight to the UNC University Wiki article, unless you as the potential reader know what you are doing. Not many of our readers might know that though. Unless he (the reader) knows to home in on the template used he is going to be nonplussed. Many thanks, and congratulations on your work. Do you actually look out for unsourced articles, too? Dieter Simon (talk) 01:41, 5 February 2008 (UTC) Dieter Simon (talk) 01:43, 5 February 2008 (UTC)

...?! I'm sorry, I really don't understand what you are trying to do. I looked at the history of malleus, and couldn't understand either. {{EmbryologyUNC}} looks fine to me, so I really don't understand ?! NicDumZ ~ 11:38, 5 February 2008 (UTC)

I believe he means that it is confusing to have the two links in the UNC ref conjoined without explanation. That is a problem with {{EmbryologyUNC}} and could be fixed by writing a sentence using the link names. Unfortunately, the external links have uninformative names like "subject #231 1044" and "hednk-023". I'm not sure how to fix that. --Jtir (talk) 14:58, 5 February 2008 (UTC)

Okay, understood. I tried fixing that. How is it now, Dieter ? Better ? NicDumZ ~ 15:04, 5 February 2008 (UTC)

Much better. A third argument could be an optional external link name that overrides the default "hednk-023". --Jtir (talk) 15:32, 5 February 2008 (UTC)

Yes,NicDumZ, that's what was needed, as far as I am concerned. Many thanks. Dieter Simon (talk) 23:22, 5 February 2008 (UTC)

More praise for your bot

Hey, I just saw the edits made by your bot at Krav Maga -- great bot, in both concept and performance! Kudos! JDoorjam JDiscourse 19:32, 5 February 2008 (UTC)

Absolutely brilliant - congratulations from me too --Matilda ^talk 22:41, 5 February 2008 (UTC)

Love the bot, too. Thanks for the edits to Mono (software). :) Mahanga^Talk 22:45, 5 February 2008 (UTC)

Wow...

I just figured out what your bot does, after quite a bit of confusion. As soon as I figured it out, I was quite impressed. Thanks for making such a useful addition to the Wikimunity. Darkage7 (talk) 07:20, 6 February 2008 (UTC)

What a Brilliant Idea Barnstar

You are awarded this barnstar for programming DumZiBoT to expand bare references. Thanks for helping make Wikipedia a well-referenced resource. Flibirigit (talk) 07:34, 6 February 2008 (UTC)

One more voice in the crowd

I've seen probably fifty pages on my Watched list get (slightly) improved by this bot in the past two days - keep up the work, it's a great idea. Sherurcij ^{(Speaker for the Dead)} 08:52, 6 February 2008 (UTC)

DumZiBoT

Nice BOT - can you change it to use a basic citation template though ? eg

<ref> {{Citation | title = | url = }} </ref>

Cheers -- John (Daytona2 · Talk · Contribs) 23:02, 5 February 2008 (UTC)

Seconded. Mahanga^Talk 23:12, 5 February 2008 (UTC)

Thirded :-) --Matilda ^talk 23:15, 5 February 2008 (UTC)

Fourthded :) vıd ıoman 23:23, 5 February 2008 (UTC)

(edit conflict) Well, I'm not used to the English style guidelines... But it seems to me that {{citation|title=example|url=http://example.com}} gives exactly the same result as [http://example.com example], or am I missing something? If so, why should I complicate things, for the users and for the servers, by using this intricate template? :þ NicDumZ ~ 23:24, 5 February 2008 (UTC)
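
The two output styles being compared here differ only in the wikitext that is written back into the article. A small sketch of both, with hypothetical function names and without any escaping of special characters in the title:

```python
def as_bracket_link(url, title):
    """Plain external-link syntax, the style the bot is described as using."""
    return "<ref>[%s %s]</ref>" % (url, title)


def as_citation_template(url, title):
    """The template form requested in this thread, as a stub to be expanded later."""
    return "<ref>{{Citation | title = %s | url = %s }}</ref>" % (title, url)
```

The template form leaves room for further parameters to be added by hand, which is the argument made in favor of it in the replies.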

The template can be expanded. It provides the groundwork. I'm pretty sure we're supposed to have "retrieved on" tags for references as well. vıd ıoman 23:38, 5 February 2008 (UTC)

It is very good groundwork and promotes the use of additional parameters, such as who published, which is very useful for judging the reliability of the source.--Matilda ^talk 23:45, 5 February 2008 (UTC)

I'm reluctant about that idea (but some could say that I'm always reluctant about others' ideas; please be bold if you think that it is worth it):

DumZiBoT is dealing with tens of thousands of links. I don't think that all these references that have been left alone for so long will be given any further information in the next few days. Even if 30% of these links get modified in the future, that would leave something like 50,000 unnecessary templates? I don't really think that the servers need that, do they ?! And leaving the technical complaint aside, I personally try to use the simplest syntax I can when editing articles. I don't think that using templates when the standard syntax simply works is the way to a "newcomer-friendly" encyclopedia.
Also, while I really understand that this might ease the work of some contributors in the future, I'm not sure that every contributor would like to use this template. And reading Wikipedia:Citation templates confirmed my doubts: "They may be used at the discretion of individual editors, subject to agreement with the other editors on the article. Some editors find them helpful, while other editors find them annoying."
Finally, the same page states "Because they are optional, editors should not change articles from one style to another without consensus." As I really don't think that there is a consensus on that question, I'm not going to do this, since it could be considered some sort of "original style guideline pushing", if you see what I mean, despite my poor English...

NicDumZ ~ 00:03, 6 February 2008 (UTC)

I commonly use cite templates to ensure a consistent reference style, but I know of one editor who is an experienced librarian and he never uses them; indeed, he removes them and does what he calls "scratch cataloging". And I agree that "access dates" are not always needed — published scholarly works (e.g. JSTOR) are not going to change, nor are court decisions, newspaper articles more than a few days old, The Bible, the works of Shakespeare, etc. Further, if sources are changing after they are "accessed", they are not verifiable. --Jtir (talk) 00:43, 6 February 2008 (UTC)

It is exactly for reasons of verifiability that access dates are recommended, in the hope that deleted information may be retrieved again using, for example, the Wayback Machine. I am a great fan of citation templates for the above reason of consistency, but I fear it would be expecting too much for a bot to intelligently retrieve the information necessary (how would it decide whether a particular name is the author or the subject of an article?) TINY MARK 01:14, 6 February 2008 (UTC)

Please read the BRfA, as these questions are redundant, although the answers there are a bit more in-depth. And I had come up with an idea of getting metadata into the links. — Dispenser 01:40, 6 February 2008 (UTC)

I couldn't quite fathom what was the consensus on the BRfA regarding the citation template issue, but I too would be overjoyed to see a bot do the grunt work of creating {{cite}} template stubs, but what we have now is great too. Thank you very much! Adoniscik (talk) 03:10, 6 February 2008 (UTC)

My opinion is that people who are experienced in adding references are welcome to ignore the format of the {{cite}} templates. However, since these links are already poorly cited, the article can't be high in the priority of these editors. By using the {{cite}} templates, this bot could both lay the groundwork for a well-formatted citation, and also bring attention to these templates to inexperienced editors. Bluap (talk) 04:55, 6 February 2008 (UTC)

Thanks for the replies, and the pointer towards the BRfA discussion on the issue. I didn't express myself at all well, but others understood my thinking, which was flawed, because I now realise that I was advocating pushing citation templates because I think that they act to encourage high quality referencing and I appreciate the work put in by their constructors. Which method is likely to encourage the highest quality reference information from the Wikipedia user base? I say citation templates. Since we're not allowed to push them, my argument is with that ruling, and I will see what I can do to challenge it. Cheers -- John (Daytona2 · Talk · Contribs) 21:13, 6 February 2008 (UTC)

And another

Your bot is doing great work! Thank you so much! Aleta (Sing) 14:23, 6 February 2008 (UTC)

Bug report

Here is a bug for you to fix. The text is supposed to be in Russian, but it is gibberish due to incorrect encoding.—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); 16:51, 4 February 2008 (UTC)

Thanks. Stopped the bot, looking for a fix right now. NicDumZ ~ 16:54, 4 February 2008 (UTC)

I'm afraid there's not much that I can do :

The server is not sending any character encoding information :

Server: nginx/0.6.25
Date: Mon, 04 Feb 2008 16:56:46 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
X-Powered-By: PHP/4.4.7
Content-Language: ru

The HTML source of the page does not have any charset information neither, therefore the page is not meeting any standards.
The generated characters are fine, so I can't really detect when a page is in Russian or not...

As a side note, my Firefox is completely lost when opening the page: I have to tell it explicitly that the page is in Russian, or else it won't display the page correctly.

I've added an exception: the script now tries the encoding windows-1251 when the domain name ends with .ru. It works here [5], but I'm afraid that this might not work everywhere, or might cause some collateral problems... NicDumZ ~ 17:14, 4 February 2008 (UTC)
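For readers following along, the exception described above could look roughly like this. It is only a sketch, assuming Python 3: the function names and the `DOMAIN_FALLBACKS` table are hypothetical, and only the `.ru` to windows-1251 pairing comes from this thread.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-domain fallbacks; only the ".ru" entry comes from the discussion.
DOMAIN_FALLBACKS = {".ru": "windows-1251"}

# Matches "charset=..." inside a Content-Type header value.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type value, or None."""
    match = CHARSET_RE.search(content_type or "")
    return match.group(1).lower() if match else None


def guess_encoding(url, content_type):
    """Prefer the charset sent by the server, else fall back on the domain."""
    declared = charset_from_content_type(content_type)
    if declared:
        return declared
    host = urlparse(url).hostname or ""
    for suffix, encoding in DOMAIN_FALLBACKS.items():
        if host.endswith(suffix):
            return encoding
    return None  # nothing to go on
```

With the headers quoted above (`Content-Type: text/html`, no charset), only the domain rule can fire, which is why it may misfire on `.ru` sites that are not in Russian.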

I don't really know how to fix it myself (or I would have told you :)), but I'll keep an eye on the changes the bot does to Russia-related articles (since I've got about 5,000 of them watchlisted, I am reasonably sure I'll be able to catch cases like this).

On a different note, let me express my gratitude—this kind of bot is something we've been needing for quite a while now! Great job, keep it up. Cheers,—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); 17:29, 4 February 2008 (UTC)

I've examined the issue. Firefox 3 Beta 2 will detect the proper encoding if the Character Encoding > Auto-Detect > Universal option is set. The page also renders correctly in Opera 9.23. IE6 renders it as Shift-JIS. Safari 3 doesn't have an auto-detect option. I tried saving the files to disk to get rid of the metadata in the HTTP headers: Firefox still somehow works, but Opera fails. My conclusion is that Opera has a different default encoding based on header information, most likely the Content-Language (Update: this appears not to be the case, they're both apparently using statistical methods on the language to determine the encoding — Dispenser 00:51, 5 February 2008 (UTC)). Also, can't we include one of those nifty Language icons if Content-Language is specified? — Dispenser 22:44, 4 February 2008 (UTC)

Firefox 2.0.0.10, with "Auto-Detect Russian" enabled, detected the encoding as "Cyrillic (Windows-1251)" and rendered the page correctly. --Jtir (talk) 23:15, 4 February 2008 (UTC)

Looking at the source again, it seems to me that it never passes the HTTP header encoding to UnicodeDammit, only the encoding declared in the page's meta tags, which UnicodeDammit looks for anyway. And as it contains chardet, it should be able to identify the encoding correctly. — Dispenser 02:49, 5 February 2008 (UTC)

Also, can't we include one of those nifty Language icons if Content-Language is specified? Very nice idea, again. I'm working on that. NicDumZ ~ 21:09, 7 February 2008 (UTC)

Seems operational : http://en.wikipedia.org/w/index.php?title=Future_French_aircraft_carrier&diff=prev&oldid=189814408 :) NicDumZ ~ 21:41, 7 February 2008 (UTC)

Idea for your bot

Perhaps you could set up a requests page where people could post articles they would like the bot to fix the references. I was trying to get your bot to have a go at February 2008 tornado outbreak, but there doesn't seem to be a way to add requests. Great job on the bot btw! Cheers, JACO PLANE • 2008-02-6 17:26

It's working through the database and will in time fix the bare references in all articles. I may set up something on the Toolserver that will operate similarly to my other tools (which don't actually edit) if NicDumZ thinks it's a good idea. — Dispenser 19:49, 6 February 2008 (UTC)

Ah, come on Dispenser, you have proved many times that you have very good ideas. If you think that something has to be done, just do it :)

An external tool might be an efficient interim way to process bare references that have been added after the last dump, or after my last pass. DumZiBoT will process every bare reference eventually, but that might take some time...

NicDumZ ~ 09:33, 7 February 2008 (UTC)

OK, I needed to change the main() function around and add a stub function to the web wikipedia.py, but it works. Sort of: I still need to get the edit form into shape. — Dispenser 21:06, 7 February 2008 (UTC)

"" unncessecary

Each reference that is modified gets "" added to it; that's about 29 bytes for each reference modified. It may not seem like much, but that can add up quickly. Shouldn't that kind of comment just be put into the edit summary? Gh5046 (talk) 20:35, 6 February 2008 (UTC)

Oh, and I forgot to say, thanks for creating this bot. It's very helpful. Gh5046 (talk) 20:36, 6 February 2008 (UTC)

Well, automatically retrieved titles might be nonsensical, and some editors might not understand why without having to check deeply in the history of the article; that's why I add this comment : If no one actually catch a diff like this, weeks later, someone finding [http://www.youmeishi.com/contents/product/paper.html @–¼Žh ˆó?ü?ê–åWEBƒVƒ‡ƒbƒv ‹ž“s–¼•¨ u‚ä [‚ß‚¢‚µ v @—pŽ†?à–¾] can easily know that this was inserted by a Bot, and easily understand that DumZiBoT has been mistaken. It also allows easy bug reports... ! (this diff was actually reported just above)

NicDumZ ~ 09:40, 7 February 2008 (UTC)

WP:AWB could munch down on these comments like a kid with a bag of cookies. :-) It might, however, be useful to put the name of the bot in the comment so that an edit by DumZiBoT could be distinguished from other bot edits: "". --Jtir (talk) 17:15, 7 February 2008 (UTC)

Untitled Document

http://en.wikipedia.org/w/index.php?title=Pointy_hat&diff=189487992&oldid=182653326

Look at the diff line around "Gomer". Is "Untitled Document" more useful than the bare URL? --Damian Yerrick (talk | stalk) 21:09, 6 February 2008 (UTC)

Fixed here. I don't know why DumZiBoT didn't convert this link. --Jtir (talk) 21:31, 6 February 2008 (UTC)

Thanks for the fix; I still need to fix my code, though. I was thinking about adding an exception directly to the title blacklist. What do you think, Dispenser? Actually, the title blacklist is intended for inaccessible links, and adding an exception for an untitled page/document might seem messy, but that'd work :)

NicDumZ ~ 09:50, 7 February 2008 (UTC)

Maybe you don't. While this case has some merit, ISTM, that once DumZiBoT has done the conversions, an editor should review them and make any further changes. A compromise might be to include both the URL and the title in the link name: http://www.editionhutter.de/german.htm — Untitled Document. --Jtir (talk) 16:32, 7 February 2008 (UTC)

I honestly think that the URL blacklist is more of a hack, as the site might give a valid title sometimes. The title blacklist is more refined and allows specific variations to be covered. Ultimately we implemented it to improve the quality of the titles produced by the bot. A few days ago I thought of adding this, but doing a google:allintitle:untitled search shows that my matching was too broad. I now recommend using untitled *(document|page|$). — Dispenser 21:39, 7 February 2008 (UTC)
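To illustrate what the recommended pattern would and would not catch, here is a small sketch. It assumes Python 3 and case-insensitive matching against the whole title; `is_blacklisted_title` is a hypothetical helper, not the bot's actual code.

```python
import re

# Pattern recommended in the discussion; IGNORECASE is an assumption.
UNTITLED_RE = re.compile(r"untitled *(document|page|$)", re.IGNORECASE)


def is_blacklisted_title(title):
    """Return True when the title looks like a placeholder such as 'Untitled Document'."""
    return UNTITLED_RE.search(title.strip()) is not None
```

A title such as "Untitled Document" or a bare "Untitled" is rejected, while a title that merely contains the word, such as "The Untitled Symphony", is kept.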

converting multiple bare links in one reference

In Meishi, DumZiBoT did not convert three bare exlinks in one of the references.

<ref>See, e.g., http://www.adobe.com/jp/special/creativesuite/portal/guides/cs2_01_52.html, http://www.washiya.com/shop/namecard/index.html, http://www.kenseido.co.jp/shop/kps/namecard.html</ref>

--Jtir (talk) 21:13, 6 February 2008 (UTC)

Well, you read the FAQ :þ

I actually don't convert links with text around, I just convert references made of one link.

This could be some work for DumZiBoT2, along with some external links (those contained in a External links section) processing.

NicDumZ ~ 09:46, 7 February 2008 (UTC)

OK. The first sentence of the documentation is misleading then. Maybe it should say something like: "He is converting single bare external links in references …". Converting exlinks in other contexts would be a nice future enhancement. --Jtir (talk) 14:53, 7 February 2008 (UTC)
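The distinction between a reference made of one link and a reference with text around the links can be sketched with a regular expression. This is an illustration only, assuming Python 3: the pattern, the optional square brackets and the name `find_single_bare_refs` are assumptions, and named references (`<ref name="...">`) are deliberately not handled.

```python
import re

# A <ref> whose whole content is one URL, optionally wrapped in [ ].
SINGLE_BARE_REF_RE = re.compile(
    r"<ref>\s*\[?\s*(https?://[^\s\[\]<>]+)\s*\]?\s*</ref>",
    re.IGNORECASE,
)


def find_single_bare_refs(wikitext):
    """Return the URLs of references that contain nothing but one bare link."""
    return SINGLE_BARE_REF_RE.findall(wikitext)
```

The Meishi reference quoted above starts with "See, e.g.," and lists three URLs, so it does not match and would be left alone.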

How do I request that your bot visit a page?

Regulation of acupuncture, as well as acupuncture, could use his talents... again, thank you, very nice work! best regards, Jim Butler (t) 05:46, 7 February 2008 (UTC)

Processed both articles. Eventually, DumZiBoT will fix every bare reference; it just takes time.

NicDumZ ~ 10:39, 7 February 2008 (UTC)

Super, thanks again --Jim Butler (t) 08:44, 8 February 2008 (UTC)

Suggestion for determining web site name

Hi—First, kudos on a most excellent bot. I was reading your discussion with Dispenser about filling in more of the parameters of template:cite web, and I have a suggestion. The basic idea is to slog through a dump examining occurrences of template:cite web, and correlating the values for the url= and work= parameters. For instance, if 99% of the time, url values with a prefix of http://nytimes.com/ co-occur with work=The New York Times, then you can reliably add the latter to the references you generate for similar urls. You can build up a dictionary of these relationships in a first pass of the bot (or with a separate script). Make sense? —johndburger 01:41, 7 February 2008 (UTC)

But WP says we should not push the optional use of citation templates - see my earlier request. When I get some time I'm looking to investigate/challenge this, as I believe there is a greater likelihood of getting higher-quality reference info by using the templates and hence educating people about their existence. -- John (Daytona2 · Talk · Contribs) 12:08, 7 February 2008 (UTC)

What WP:CITE says is Because templates are optional and can be contentious, editors should not change an article with a distinctive citation format to another without gaining consensus. If taken literally, that suggests that the bot should not change anything in an article full of nothing but bare links—it already has a "distinctive citation format". But most articles with bare links are, in fact, a mix of formats. If there are any instance of the template:cite family in an article, I think you could make the argument that it's perfectly reasonable to change a bare link to cite web.

But, in fact, my suggestion is actually independent of how the bot inserts the reference—I should have made that more clear. Whether DumZiBoT uses a template, or raw wikimarkup, it can still add the name of the web site in many cases using the approach I described above. —johndburger 01:09, 8 February 2008 (UTC)
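A rough sketch of the dictionary-building pass suggested above, assuming Python 3 and assuming the (url, work) pairs have already been extracted from a dump. The function name, the grouping by host name and the `min_count` threshold are assumptions; the 99% share is the figure from the suggestion.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def build_work_table(pairs, min_share=0.99, min_count=20):
    """Map a host name to its dominant work= value, when one clearly dominates.

    pairs: iterable of (url, work) tuples taken from existing cite web templates.
    """
    counts = defaultdict(Counter)
    for url, work in pairs:
        host = urlparse(url).hostname
        if host and work:
            counts[host][work.strip()] += 1

    table = {}
    for host, works in counts.items():
        total = sum(works.values())
        work, hits = works.most_common(1)[0]
        # Keep only hosts seen often enough and with one dominant value.
        if total >= min_count and hits / total >= min_share:
            table[host] = work
    return table
```

Hosts seen only a handful of times are dropped, since a high share over three or four references says very little.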

I had wanted to add PDF conversion and remembered that I had seen it once somewhere. A quick grep in pywikipedia came up with the old m:standardize_notes.py, which uses {{ref}} templates instead of the newer m:cite.php system. Because of the change of templates, it's been in a quasi-block on en. The script itself does a lot of things. From the quick glance I've taken at the source, it doesn't do as many checks on the titles as reflinks.py does. However, it does the news cite referencing. In any case it was a good source of code to parse titles from PDF files. — Dispenser 05:51, 8 February 2008 (UTC)

I just tried adding that feature, using the code from standardize_notes. But apparently it just doesn't work: I can't find a PDF that gives me a title with that code?! NicDumZ ~ 07:31, 8 February 2008 (UTC)

Subprocess doesn't seem to accept streams, only files. Either write to a temp file or reopen using urlretrieve (hackish), like it does in the program. — Dispenser 08:03, 8 February 2008 (UTC)

Posted an implementation using tempfiles, I've also added a dead link check with a list from over a year ago for tagging purposes. — Dispenser 20:08, 8 February 2008 (UTC)
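The temp file approach mentioned above can be sketched as follows. This is not the posted implementation, only an illustration assuming Python 3.7 or later; `run_on_temp_file` is a hypothetical name, and `command` stands for whatever external PDF tool is used, which this thread does not name.

```python
import os
import subprocess
import tempfile


def run_on_temp_file(command, data, suffix=".pdf"):
    """Write data to a temporary file, run command on it and return its stdout.

    command: list of program arguments; the temp file path is appended to it.
    data: the already downloaded bytes (for example the body of a PDF).
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        result = subprocess.run(command + [path], capture_output=True, text=True)
        return result.stdout
    finally:
        os.remove(path)  # clean up even if the external tool fails
```

Closing the file before running the command avoids problems on platforms where a file cannot be opened twice at the same time.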

Suggestion

Would it be possible for the bot to convert bare refs to refs using {{cite web}} instead? Instead of a lead "[" it would add "<ref>{{cite web | url =" then before the new title it would add "|title =" and after the title instead of "]" it would add "|accessdate=2008-02-08}}</ref>"? It would also have to add reflist at the bottom if it were not already there. Just an idea. Thanks Ruhrfisch _><>°^° 17:09, 8 February 2008 (UTC)

See this discussion. ;) vıd ıoman 17:16, 8 February 2008 (UTC)
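For reference, the output format proposed above can be sketched in a few lines. This assumes Python 3; `to_cite_web` is a hypothetical helper, and replacing `|` in titles with `{{!}}` is an added precaution that is not part of the suggestion.

```python
def to_cite_web(url, title, accessdate):
    """Format one bare link as a {{cite web}} reference, as proposed above."""
    # A raw "|" in the title would be read as a parameter separator.
    safe_title = title.replace("|", "{{!}}")
    return "<ref>{{cite web | url = %s | title = %s | accessdate = %s}}</ref>" % (
        url,
        safe_title,
        accessdate,
    )
```

Whether the bot should emit templates at all is a policy question, as the linked discussion shows; the formatting itself is the easy part.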

Diffs with problems

[6] - Pages are labeled with {{ru icon}} while they're in English and are declared in the HTML as <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

— Dispenser 17:39, 8 February 2008 (UTC)

[7] very long title. Neither good nor bad, just unusual. — Dispenser 17:39, 8 February 2008 (UTC)
[8] bad title "WebCite query result", suggest blacklisting (query result[s]?|query)$. Possibly check for duplicate titles. — Dispenser 17:39, 8 February 2008 (UTC)

Not much time tonight. I've fixed the (in Russian) issue. I simply assumed that pages using windows-1251 were Russian, which is obviously false. I'll check the rest later. NicDumZ ~ 19:01, 8 February 2008 (UTC)

Periods

Your bot puts refs before periods, they should be after periods. — Rlevse • Talk • 12:14, 9 February 2008 (UTC)

Could you provide a diff ? My bot is not supposed to move refs... NicDumZ ~ 12:21, 9 February 2008 (UTC)

Your bot is most excellent

Very nice edit at acupuncture. A much-needed service. rock on, Jim Butler (t) 10:15, 3 February 2008 (UTC)

Ah ! Thanks a lot xD NicDumZ ~ 10:16, 3 February 2008 (UTC)

Indeed — good work! (And after reading the approval process page, I really do mean work!) Thanks! — the Sidhekin (talk) 11:09, 3 February 2008 (UTC)

Hear hear! I agree. :) Jmlk 1 7 11:11, 3 February 2008 (UTC)

hey, just adding in my little note of praise. I want to marry your bot and have its babies! :D Mathmo ^Talk 19:09, 6 February 2008 (UTC)

Great bot! Fig (talk) 19:53, 9 February 2008 (UTC)

Retrieved date

hi, is it possible to add this info also like: "Retrieved on 2008-02-09." --— Typ932^{T | C} 15:16, 9 February 2008 (UTC)

I was thinking the same - the bot could generate a "cite web" with the accessdate parameter as well as url and title.

BTW, impressive idea - a very useful bot, skillfully executed. -- Euchiasmus (talk) 08:18, 10 February 2008 (UTC)

Gender

Your description of the robot uses he references—the masculine pronoun. I understand English isn't your first language and respectfully suggest you refer to it as either she or it. Most properly, use of it is appropriate since robots are usually considered gender neutral. However, considering the fuss above, she wouldn't be out of line. Ships and boats are normally referred to as she, perhaps for their unpredictable temperaments. Your use of he is amusing, and presents absolutely no difficulty in understanding or communication. The choice is yours of course. —EncMstr 23:29, 8 February 2008 (UTC)

considering the fuss above, she wouldn't be out of line. Ships and boats are normally referred to as she, perhaps for their unpredictable temperaments.

Umm, that is sexism!!! :< vıd ıoman 23:51, 8 February 2008 (UTC)

Yes it is. But the useful kind. —EncMstr 00:00, 9 February 2008 (UTC)

Well, I guess we can't argue with that? :\ vıd ıoman 00:11, 9 February 2008 (UTC)

NicDumZ invited me to do copyediting on the article, so I take responsibility for retaining that usage, which I simply interpreted as personifying the bot for a light effect. If this were documentation for a general audience, I would probably have changed it. BTW, does anyone object to NicDumZ being described as the bot's "owner"? That could be offensive to the dignity of the bot, who is certainly no one's slave. Click here to free the bot. :-) --Jtir (talk) 19:24, 9 February 2008 (UTC)

Really, I like these messages. :)

DumZiBoT might turn into a she, after all...!

NicDumZ ~ 22:36, 10 February 2008 (UTC)

Barnstars

You can use the gallery tag to organize them. It's probably easier than making a table. :) vıd ıoman 22:31, 10 February 2008 (UTC)

Well, thanks for the suggestion, but for now, I'm fine with the table : It's not this hard to use :þ NicDumZ ~ 22:54, 10 February 2008 (UTC)

I know, but when you get more the gallery might be easier. ;) vıd ıoman 12:29, 11 February 2008 (UTC)

Again strange edits

What is this [9]? Please fix the handling of non latin scripts before running the bot again. --jergen (talk) 08:27, 7 February 2008 (UTC)

The source of http://www.chinascout.org/ claims to be encoded in GB 2312. However, on line 1726, there is the character "深", encoded C389, which is not part of GB 2312 : http://demo.icu-project.org/icu-bin/convexp?conv=ibm-1383_P110-1999&b=C3&s=ALL#layout

In other words: this page is not encoded properly, hence DumZiBoT is not able to convert it into Unicode using GB 2312. It then tries, without success, ASCII, then UTF-8, and eventually defaults to windows 1252, which renders these ugly characters. This is not a bug in DumZiBoT.

NicDumZ ~ 10:26, 7 February 2008 (UTC)

This is a bug, and you really should do something about it. For your information, the "illegal" character in this case is not 0xC389 but 0x89C3. It may not appear in all GB2312 standards, but it renders perfectly well in my browser. The equivalent Unicode glyph is 0x5169, as you can see here.

You have absolutely no justification for using windows 1252 encoding to convert a page that explicitly declares itself to be GB 2312. Your bot is broken. Please fix it. -- Sakurambo 桜ん坊 12:06, 7 February 2008 (UTC)

Give me a break. This character is not a standard GB 2312 character, I have no reason to support it. NicDumZ ~ 12:15, 7 February 2008 (UTC)

I'm not saying your bot has to support these characters. It's quite obvious that it doesn't support them.

I'm just saying it should stop trying to interpret them as windows 1252 without any justification.

Is that really so difficult to understand? -- Sakurambo 桜ん坊 12:27, 7 February 2008 (UTC)

I'm just saying that I will not support these characters. Is that really so difficult to understand?

Using windows 1252 is a way to convert every character to some printable character, which avoids inserting junk control characters in articles.
Also, windows 1252 is a common American/European charset that works well when there are no special characters in the string: if, for some reason, the title was made of standard unaccented Latin characters, the conversion would have worked.
Finally, a lot of Windows-made web pages use windows 1252 as a charset without specifying it in the meta tags.

That makes three very good reasons to use it, three very good reasons for you to move along.

NicDumZ ~ 12:36, 7 February 2008 (UTC)

Fixed, when GB 2312 handling fails, I now try GBK : [10] NicDumZ ~ 14:38, 7 February 2008 (UTC)

And when that fails? When invalid characters are found, why not just ignore the character (which was nowhere near the title tag) and turn it into a question mark or � instead of assuming that it's lying about the encoding and falling back on windows-1252? Why even attempt to convert anything outside the title tag? —Random832 19:30, 7 February 2008 (UTC)

Because, again, a lot of pages specify the wrong encoding. Consider a wrongly encoded file. If I only convert the title into Unicode, then statistically, chances are that I will be able to convert it without raising any error: byte values mean different things from one charset to another, but the title's bytes are still likely to be valid in the specified charset. The title won't make any sense, because I converted it using the wrong charset, but I would still believe that I had converted the document correctly. If I instead try to convert the whole document using a wrong charset, I'm more likely to hit a byte sequence that has no counterpart in that charset and raise an error, hence I have a better chance of detecting a wrong charset.

NicDumZ ~ 20:38, 7 February 2008 (UTC)

Please excuse my kibitzing, but I think the (most excellent) bot needs to be extremely resistant to screwed up content, of which there is a lot on the web. I think your "transcode the whole document" heuristic is a very good idea, but I'd suggest that if there is any evidence that the title may not be correctly extracted and transcoded, the bot should bail and not put the title in the reference. —johndburger 01:24, 8 February 2008 (UTC)

The thing is, you're _not_ raising an error. You're silently inserting garbage characters. That the garbage characters are random windows-1252 characters instead of being in the declared character set of the document is not a positive aspect of your method. In other words, why not have the choice be "fall back on GB 2312 with question marks for anything that doesn't fit" instead of "fall back on windows-1252", if you're going to fall back at all? —Random832 16:59, 11 February 2008 (UTC)
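The alternative discussed in this thread (keep the declared charset, mark what does not fit, and give up on the title if it is damaged) could be sketched like this. It is an illustration assuming Python 3, not what the bot does; `title_or_none` and the title regex are hypothetical.

```python
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_or_none(raw, declared):
    """Decode with the declared charset and return the title only if it is intact.

    Undecodable bytes become U+FFFD instead of triggering another charset.
    """
    text = raw.decode(declared, errors="replace")
    match = TITLE_RE.search(text)
    if not match:
        return None
    title = match.group(1).strip()
    if "\ufffd" in title:
        return None  # the title itself is damaged: bail out, keep the bare link
    return title
```

A stray bad byte far from the title then costs nothing, while a damaged title is never inserted into an article. The trade-off raised earlier in the thread remains: a page that declares the wrong charset altogether may decode without any U+FFFD and still yield a meaningless title.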

Good job + idea

		What a Brilliant Idea Barnstar
		I've got nothing but praise for this bot idea. – sgeureka ^t•c 15:27, 8 February 2008 (UTC)

And I have another improvement suggestion, but I know another bot already takes care of this, so don't feel pushed to do it. I became aware of your bot with this edit, and {{reflist}} or <references /> was missing on the page to display the <ref> at all. Would it be too much coding effort to also have the bot check this? – sgeureka ^t•c 15:27, 8 February 2008 (UTC)

Thanks for the idea :)

I've added the feature :)

NicDumZ ~ 07:59, 12 February 2008 (UTC)
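The check suggested above can be sketched as follows. This is only an illustration assuming Python 3: the patterns and the name `needs_reference_list` are hypothetical, and the sketch says nothing about where the bot actually inserts the missing tag.

```python
import re

# Either {{reflist}} (with optional parameters) or <references />.
REFLIST_RE = re.compile(
    r"\{\{\s*reflist\b[^{}]*\}\}|<\s*references\b[^>]*/?>",
    re.IGNORECASE,
)
# An opening <ref> or <ref name=...>; does not match <references />.
REF_RE = re.compile(r"<ref[\s>]", re.IGNORECASE)


def needs_reference_list(wikitext):
    """True when the page uses <ref> but has nothing to display the footnotes."""
    return bool(REF_RE.search(wikitext)) and not REFLIST_RE.search(wikitext)
```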

Character set problems

Your bot made rather a mess of the Meishi article by converting the anchor text for one of the references into unintelligible garbage. Does this thing understand Shift JIS? -- Sakurambo 桜ん坊 14:28, 6 February 2008 (UTC)

Well, my bot handles Shift JIS as any other encoding.

Problem is, the page http://www.youmeishi.com/contents/product/paper.html contains a badly encoded character, probably on line 365 of the HTML source, which causes the Python codecs module to raise an error (character #19563 of the HTML source, but since the codecs parser failed, I don't think that this number is reliable). I can't do anything to solve this kind of problem; that's really not my fault, sorry. NicDumZ ~ 16:34, 6 February 2008 (UTC)

The Shift_JIS source is not invalid. The character you're blaming this problem on is the "mm" glyph highlighted in this screenshot of the page's HTML source. This is equivalent to the Unicode character 0x339c ("SQUARE MM", part of the CJK compatibility code block).

The web page in question also clearly identifies itself as Shift_JIS, so it makes no sense to use any other encoding. If your software can't recognise the encoding of a web page, wouldn't it make more sense to just leave it alone? Or do you really think it's better to blame the problem on other people and carry on regardless? -- Sakurambo 桜ん坊 17:19, 6 February 2008 (UTC)

Firefox 2.0.0.11 auto-detects the encoding of this page as Shift JIS, yet the name of the page is still displayed as a string of question marks: "【名刺用紙】名刺用紙販売所". --Jtir (talk) 20:25, 6 February 2008 (UTC)

That was in Windows XP, where I do not have the Japanese language pack installed. With Firefox 2.0.0.10 in Linux, all but one character is displayed correctly, instead of question marks. So never mind, it is my problem with fonts. --Jtir (talk) 20:49, 6 February 2008 (UTC)

Well, feel free to try by yourself, instead of assuming that I'm deliberately using another encoding :

import urllib2

# Python 2 snippet: fetch the raw page and decode it with the charset it declares
url = 'http://www.youmeishi.com/contents/product/paper.html'
handler = urllib2.urlopen(url)
source = handler.read()
handler.close()
to_uni = unicode(source, "shift_jis")  # raises UnicodeDecodeError (illegal multibyte sequence)

There must be some problem in the encoding of the HTML source. What you have to understand is that my script first tries to convert using the encoding specified in the "meta" markup of the page. When no UnicodeDecodeError is raised, it assumes that it works, and uses that encoding. But when an error is raised, it goes on and tries other encodings. When a "fine" codec is found, i.e. a codec that does not raise an error during the conversion, I use it. But there's no way for an automated script to determine whether a character sequence makes sense or not... (Also, some pages declare one encoding in their meta tags while actually using another, and a lot of pages do not send any encoding at all: that's why I try other encodings.)

NicDumZ ~ 09:29, 7 February 2008 (UTC)
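The procedure described above can be sketched like this, for anyone who wants to reproduce the behaviour under discussion. It assumes Python 3; the function names and the meta regex are hypothetical, and the fallback order (ASCII, UTF-8, windows-1252) is the one mentioned elsewhere on this page.

```python
import re

# Finds charset=... inside a <meta> tag of the raw (undecoded) page.
META_CHARSET_RE = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([\w-]+)", re.IGNORECASE
)


def declared_charset(raw):
    """Return the charset named in a meta tag, or None."""
    match = META_CHARSET_RE.search(raw)
    return match.group(1).decode("ascii") if match else None


def first_working_encoding(raw, fallbacks=("ascii", "utf-8", "windows-1252")):
    """Try the declared charset first, then each fallback, strictly.

    Returns (encoding, text) for the first codec that raises no error.
    """
    candidates = []
    declared = declared_charset(raw)
    if declared:
        candidates.append(declared)
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # this codec rejects the bytes, or is unknown
    return None, None
```

Note that "raises no error" is the only success criterion here, which is exactly the point being argued: a codec can accept the bytes and still produce a meaningless title.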

OK, I got the same error (UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 19563-19564: illegal multibyte sequence).

I guess there are just some characters in Shift JIS that can't be successfully mapped to Unicode. Simply looking for another encoding that doesn't raise any errors isn't going to be safe because any random binary data will work with some 8-bit encodings such as KOI8-R and Mac OS Roman.

So instead of banging square pegs into round holes, I think it would be safer to halt the process as soon as you encounter a UnicodeDecodeError condition. -- Sakurambo 桜ん坊 11:27, 7 February 2008 (UTC)
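The remark about 8-bit encodings is easy to check. A tiny sketch, assuming Python 3, with `accepts_every_byte` as a hypothetical helper: it decodes each of the 256 possible byte values once, so a True result for a single-byte codec means that no input can ever raise an error with it.

```python
def accepts_every_byte(encoding):
    """True if decoding the byte values 0x00 to 0xFF raises no error."""
    try:
        bytes(range(256)).decode(encoding)
        return True
    except UnicodeDecodeError:
        return False
```

For a codec such as KOI8-R the check passes, so "no error was raised" tells you nothing about whether the guess was right; stricter codecs such as ASCII or UTF-8 reject some bytes and can therefore rule themselves out.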

nah, you're not listening :) What should I do then when no encoding is being given in the headers, or in the meta tags ? Nothing ? That would exclude a lot of pages... ! Much, much, much more than the hundred or so links that are being given weird titles because of an encoding error...

NicDumZ ~ 11:52, 7 February 2008 (UTC)

No information in the meta tags? Please go back to http://www.youmeishi.com/contents/product/paper.html and take a look at the HTML source. Can you see any meta tags in there? What about this one:

<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">

(Hint: It's on line 4) -- Sakurambo 桜ん坊 12:12, 7 February 2008 (UTC)

Okay. Stop this. Read again: my point was not about that particular page, but about others. If I stop at the first UnicodeDecodeError that I get, I will not be able to detect any encoding for pages that do not specify one. And I was saying that pages not specifying their encoding are far more common than pages that specify an encoding but are badly encoded, hence I made the implementation choice to try to detect an encoding, since the false positives are very rare. (Out of more than 25,000 contributions, fewer than 10 errors have been reported to me. You could say that some errors remain undetected, but even assuming, with exaggeration, that 500 errors remain, that would make a 2% error rate. Come on, give me some space.)

NicDumZ ~ 12:27, 7 February 2008 (UTC)

You're just not paying attention are you? Let me spell it all out once again:

1. I'm not talking about pages that fail to provide encoding information (either in the HTTP headers or in a meta HTML tag).
2. The problem with your bot is that it ignores encoding information that has already been provided if it has difficulties converting pages into unicode.
3. If your bot cannot successfully convert a web page from its declared encoding into unicode, then it should admit defeat.
4. It makes absolutely no sense to use windows 1252 encoding to convert a page that has explicitly declared itself to use a different encoding.

Now apparently you're having trouble understanding one of the above points (1-4). If you could let me know which one, then I'll try to make the explanation a bit simpler for you. -- Sakurambo 桜ん坊 13:08, 7 February 2008 (UTC)

Just... calm down, would you ? Answer for #4 is here. I don't understand #2. When you can't convert a text into unicode using a charset, you just... can't. That means that one, or several byte sequences have no equivalent in the charset, or, better, that the charset does not specify any corresponding unicode character for that byte sequence. Knowing that I can't use this charset for a particular character, what should I do ? Try to convert this byte sequence apart, using another charset ? That makes no sense ! The only thing that you can do is to try something else, another encoding for the whole text, because it must have been encoded differently, that's it. And no, definitely, it should not admit defeat in this case : I saw, several times, pages that were declaring a charset while they were actually using another charset. NicDumZ ~ 13:19, 7 February 2008 (UTC)

I'm perfectly calm, thank you. The bold text was just there to hold your attention, which seems to be in rather short supply.

Anyway, since you are having problems understanding #4, let me elaborate:

If a web page contains a meta tag specifying the character set as "Shift_JIS" (or "GB2312"), then you can be fairly certain that it contains Japanese (or Chinese) characters.
There are some code points in Shift JIS and GB2312 that render correctly but are apparently not directly compatible with the Unicode standard. I have already pointed out two examples for you.
Windows 1252 encoding does not support Japanese or Chinese.
It is therefore meaningless to use Windows 1252 to convert pages encoded as Shift JIS or GB2312.

Again, if you don't understand any of these points, just let me know. I would also appreciate it if you could provide the URLs for some of these "pages that were declaring a charset while they were actually using another charset". -- Sakurambo 桜ん坊 13:35, 7 February 2008 (UTC)

I just noticed you're having problems with #2 as well. What this means is that if your bot has problems converting a page into unicode using the character set stated in a meta tag, it ignores the meta tag and uses windows 1252 instead. Is this not correct? -- Sakurambo 桜ん坊 13:40, 7 February 2008 (UTC)

Again, if you don't understand any of these points, just let me know. I would also appreciate it if you could provide the URLs for some of these "pages that were declaring a charset while they were actually using another charset". -- Sakurambo 桜ん坊 13:35, 7 February 2008 (UTC)

There are some code points in Shift JIS and GB2312 that render correctly but are apparently not directly compatible with the Unicode standard. No, you're wrong. Yes, of course, you can print them in Unicode. I too, like you, can see these characters perfectly well in Firefox, and that's because it's the point of Unicode: it can print anything. Every codepoint has a meaning in Unicode, but not every codepoint has a meaning in a charset, and that's precisely why it causes problems! A charset is a dictionary: a codepoint, a number, mapped to a Unicode character. But the fact is that it is an encoding, i.e. the same codepoint in Unicode and in a charset doesn't render the same character. For example, C006 is 쀆 in Unicode, while it is ∑ in GB2312. And when you try to use such a dictionary, if a key is missing, i.e. no Unicode character is given for that codepoint, well... you just don't know what it is. I understand perfectly what you mean: Firefox *can* actually print these characters, using some tricks or complex heuristics to guess which character it is. But the fact is that I don't know what these heuristics are. That's it: I'm not ignoring the charset, or lazily giving it up, I just know of no way to find which Unicode character this two-byte sequence is. Do you get this? Because the more we talk, the more it seems that you have trouble understanding how a charset works, and that might be why we just can't find the right questions, or the right answers... NicDumZ ~ 13:52, 7 February 2008 (UTC)

This is getting very tiresome. For someone who claims to be so knowledgeable about character encodings, you do sound rather uninformed.

Take a look at this PDF file (it's in Japanese; the title "Windows と Mac OS X 間でのシフト JIS コード非互換文字一覧" translates as "List of Shift JIS code incompatibilities between Windows and Mac OS"). On page 3 you'll see an entry for the Shift JIS code point 0x876F, which corresponds to the "mm" character that caused your bot to fail at youmeishi.com. In case your PDF reader is incompatible with Japanese, here's a screenshot of the relevant section. I've already provided you with a screenshot of the "View Source" window for this page in Firefox, where this character is displayed correctly.

The inability of your bot to successfully convert these code points into Unicode is not a good enough reason for defaulting to Windows 1252. Especially for pages that identify themselves as using non-Roman character sets. -- Sakurambo 桜ん坊 15:07, 7 February 2008 (UTC)

Hey, come on. You have tried by yourself to convert the text using Shift JIS, haven't you? It failed, didn't it? Isn't that the proof that the text is not Shift JIS-compliant? Then what are you saying? I know that the mm symbol is part of the Shift JIS set... Did I ever say that it wasn't part of it? No! I really don't get why you're writing this here... NicDumZ ~ 15:18, 7 February 2008 (UTC)

*Sigh*

No, it doesn't prove anything of the sort.

It proves that some Shift JIS encoded pages are difficult to convert into Unicode. That's all. -- Sakurambo 桜ん坊 15:24, 7 February 2008 (UTC)

Well, why would it be difficult? If each byte sequence is in the Shift JIS table, and hence has a Unicode equivalent, there is no problem. unicode() basically maps this byte sequence to the corresponding Unicode codepoint. The only reason it fails is that somewhere, there is a byte sequence that IS NOT Shift JIS-compliant. Remember the error message: UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 19563-19564: illegal multibyte sequence. ILLEGAL MULTIBYTE SEQUENCE. Get it?

It seems that A) you didn't really read what I wrote above about charsets or B) You're showing some bad faith here.

NicDumZ ~ 15:34, 7 February 2008 (UTC)
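
For readers following the argument, the error quoted above can be reproduced with a minimal Python 3 sketch. The thread itself used Python 2's unicode(); bytes.decode is used here as the modern equivalent, which is an assumption about how one would test this today. The helper name try_decode is hypothetical, and the two bytes are the 0x876F code point mentioned in this discussion.

```python
# The two-byte sequence discussed in this thread (Shift JIS code point 0x876F).
MM_BYTES = b"\x87\x6f"


def try_decode(data, encoding):
    """Return the decoded text, or None if `data` is not valid in `encoding`.

    `try_decode` is a hypothetical helper written for this illustration.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # This is where "illegal multibyte sequence" would surface.
        return None


# Compare the strict codec with the Windows extension (code page 932).
for name in ("shift_jis", "cp932"):
    print(name, "->", repr(try_decode(MM_BYTES, name)))
```

The strict codec rejects the sequence while the wider one accepts it, which is consistent with both positions above: the bytes are outside the table Python calls shift_jis, yet they are meaningful in the extended table that browsers apply.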

Wow, it really is difficult getting through to you, isn't it?

I've provided you with ample evidence that the code point 0x876F really does exist in Shift JIS. Yet again, you've missed the point. Characters in this range are frequently used in Japanese web pages, and are handled quite happily by Japanese web browsers.

You are continuing to make this assertion that the inability of your bot to successfully convert between these encodings somehow "proves" that these pages are invalid and would be better off being converted using Windows-1252. That is a ridiculous position to take. All your error messages prove is that the Python character encoding library is inadequate in some cases. Shall I put that in capitals for you? IT'S INADEQUATE. IN SOME CASES.

Take another look at the PDF file I linked to. You'll note that the "Windows" and "Mac OS X" columns often specify different Unicode values. Sometimes these values are absent. Python's inability to process these codes correctly is not a sufficient reason for churning out garbage in Windows-1252. -- Sakurambo 桜ん坊 16:06, 7 February 2008 (UTC)

Python's shift_jis codec does not contain that character; shift_jis_2004 does, so I now try shift_jis, then shift_jis_2004, then cp932. NicDumZ ~ 09:13, 8 February 2008 (UTC)
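
The fallback order described in this comment could look roughly like the sketch below. This is not the bot's actual code: the function name decode_with_fallbacks, the tuple JAPANESE_FALLBACKS, and the choice to return None when nothing matches are all assumptions made for illustration.

```python
# Order mirrors the comment above: strict codec first, wider tables afterwards.
JAPANESE_FALLBACKS = ("shift_jis", "shift_jis_2004", "cp932")


def decode_with_fallbacks(data, encodings=JAPANESE_FALLBACKS):
    """Try each encoding in turn.

    Returns a (text, encoding) tuple for the first codec that accepts the
    bytes, or None if every candidate fails.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing matched: report failure instead of guessing another charset.
    return None


result = decode_with_fallbacks(b"\x87\x6f")
if result is not None:
    text, used = result
    print("decoded with", used)
```

Returning None rather than falling back to an unrelated charset lets the caller skip the link entirely when no candidate fits.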

Apparently GBK nicely extends GB 2312. I've just tried replacing GB2312 with GBK, and it appears to work. I don't know if there's any similar solution for Shift JIS... NicDumZ ~ 14:15, 7 February 2008 (UTC)
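
A small Python 3 sketch of why swapping GB2312 for GBK can help. The sample strings are assumptions picked for illustration: 中文 is representable in both charsets, while 龍 is a traditional-form character that GBK covers and GB2312 does not. The helper decodes_as is hypothetical.

```python
SAMPLE_COMMON = "中文"    # representable in both GB2312 and GBK
SAMPLE_GBK_ONLY = "龍"    # assumed example: covered by GBK, not by GB2312


def decodes_as(data, encoding):
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Text that fits in GB2312 gets the same bytes under GBK.
same_bytes = SAMPLE_COMMON.encode("gb2312") == SAMPLE_COMMON.encode("gbk")

# Bytes for a GBK-only character are rejected by the narrower codec.
gbk_bytes = SAMPLE_GBK_ONLY.encode("gbk")
print(same_bytes, decodes_as(gbk_bytes, "gb2312"), decodes_as(gbk_bytes, "gbk"))
```

Because GBK assigns the same bytes to every character it shares with GB2312, decoding a page declared as GB2312 with the GBK codec should not change text that was already valid.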

I'm now using Code page 932 to extend Shift JIS. It works for this link ([11]), but there might be other problems... NicDumZ ~ 14:36, 7 February 2008 (UTC)

GBK is a superset of GB2312, so it should be safe to use. For Shift JIS, you could try using Windows-31J (Code page 932), which includes this character. But the point I'm trying to make is this: if your bot is unable to work with the information it's given, then it should do nothing instead of generating garbage for other people to clear up. If your bot really has encountered lots of "pages that were declaring a charset while they were actually using another charset", then please provide some examples. -- Sakurambo 桜ん坊 15:07, 7 February 2008 (UTC)

Chiming in, the bot messed up the title for this link in Vii as well. Jappalang (talk) 01:35, 14 February 2008 (UTC)

Great Bot

Thanks for your recent work on chess articles. The bot is doing a fine job! Voorlandt (talk) 08:48, 13 February 2008 (UTC)

Good Bot!

Request for DumZiBoT2

Would it be hard to make a bot that consolidated references with <cite name=X>? Just making a suggestion! --Adoniscik (talk) 06:43, 7 February 2008 (UTC)

erm... I don't know how that tag works, actually. Some documentation might help :þ

Also, how would you retrieve the "X" value ?

Thanks for trying to help,

NicDumZ ~ 09:53, 7 February 2008 (UTC)

It possibly means a single ref definition that you give a name, <ref name=BBC080207>, which you then refer to at any other occurrences using only <ref name=BBC080207 />. See Wikipedia:Footnotes#Naming_a_ref_tag_so_it_can_be_used_more_than_once. It would be a sensible addition, although you need to avoid name conflicts. I use dates: SourceYYMMDD. Cheers -- John (Daytona2 · Talk · Contribs) 12:03, 7 February 2008 (UTC)
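
A rough Python sketch of the consolidation idea described here, under several assumptions: only exact duplicates of unnamed <ref>...</ref> bodies are merged, generated names follow a hypothetical "autoref" plus number scheme (not the SourceYYMMDD convention mentioned above), and names already present on the page are skipped to avoid conflicts. The function name consolidate_refs is made up for this example.

```python
import re

# Unnamed references only: <ref>body</ref>
REF_RE = re.compile(r"<ref>(.*?)</ref>", re.DOTALL)
# Names already used on the page, quoted or not, e.g. <ref name=BBC080207 />
NAMED_REF_RE = re.compile(r"<ref\s+name\s*=\s*\"?([^\"/>]+?)\"?\s*/?>")


def consolidate_refs(wikitext, prefix="autoref"):
    """Merge identical unnamed refs into one named ref plus short back-references."""
    counts = {}
    for body in REF_RE.findall(wikitext):
        counts[body] = counts.get(body, 0) + 1

    taken = set(NAMED_REF_RE.findall(wikitext))  # existing names to avoid
    names = {}                                   # ref body -> generated name

    def new_name():
        n = 1
        while prefix + str(n) in taken:
            n += 1
        name = prefix + str(n)
        taken.add(name)
        return name

    def replace(match):
        body = match.group(1)
        if counts[body] < 2:
            return match.group(0)                # unique ref: leave untouched
        if body in names:
            return '<ref name="%s" />' % names[body]
        names[body] = new_name()                 # first occurrence keeps the body
        return '<ref name="%s">%s</ref>' % (names[body], body)

    return REF_RE.sub(replace, wikitext)
```

Matching only byte-identical bodies is deliberately conservative; two references to the same URL with different formatting would be left alone.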

Ah, okay ! You used "cite" instead of "ref" in your first message, so I was lost.

Seems a bit complex to do that, but that's really a good idea :)

NicDumZ ~ 12:06, 7 February 2008 (UTC)

Correct...I meant citations with the <ref> tag. Sorry, I was typing late at night. --Adoniscik (talk) 15:42, 7 February 2008 (UTC)

No pressure, since this is way over my head, but I think this would be a great idea also. I've come behind the bot sometimes and done it myself, but if it could be automated, that would really cut the load.--Esprit15d • talk • contribs 14:38, 14 February 2008 (UTC)

Ugly bot-generated title (to be avoided)

I just removed the following bot-generated title from an article: C:\Documents and Settings\wabalber\Local Settings\Temporary Internet Files\OLK1D0\rptApprovedSchoolsWeb.snp . . . (That apparently is the automatically generated "title" of a PDF file on a US government website.) Can the bot be trained to ignore "titles" that are file names? --Orlady (talk) 14:43, 13 February 2008 (UTC)

Thanks for the report ! :)

I could "train" him to ignore file paths, but... I would need an automated way to detect a file path, and... how would you automatically detect a pathname? It's not that easy. ([LETTER]:\[somethingelse], maybe?)

NicDumZ ~ 14:49, 13 February 2008 (UTC)

That might work. If not, you definitely could train him to ignore "titles" that include backslashes followed by standard windows directory names such as "Temporary Internet Files" and "Documents and Settings." --Orlady (talk) 17:13, 13 February 2008 (UTC)

Windows standard folder names are not so standard. The above assumes that an English installation of Windoze is present. TINY MARK 19:25, 13 February 2008 (UTC)

A path usually looks like the following (someone correct me if I missed something): [A-Z]\:([/\|/][\w| ]+)+(.[\w+| ])? But what I came here to say: Excellent bot! Martijn Hoekstra (talk) 19:04, 13 February 2008 (UTC)

Probably a good idea to blacklist URIs too: file://, ftp://, http://, nfs://, and smb:// (so ^\w{3,}://\w+ and ^[A-Za-z]:\\\w+). — Dispenser 20:56, 13 February 2008 (UTC)
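
The two patterns suggested in this comment can be tried out with a short Python sketch. The function name looks_like_path_or_uri is hypothetical, and whether the bot applies such a filter is not stated in this thread; the regular expressions themselves are copied from the comment above.

```python
import re

# Patterns quoted from the comment above.
URI_RE = re.compile(r"^\w{3,}://\w+")            # file://, ftp://, http://, ...
WINDOWS_PATH_RE = re.compile(r"^[A-Za-z]:\\\w+")  # C:\Something


def looks_like_path_or_uri(title):
    """Return True if a retrieved "title" is really a file path or a URI."""
    title = title.strip()
    return bool(URI_RE.match(title) or WINDOWS_PATH_RE.match(title))


reported = (r"C:\Documents and Settings\wabalber\Local Settings"
            r"\Temporary Internet Files\OLK1D0\rptApprovedSchoolsWeb.snp")
print(looks_like_path_or_uri(reported))
```

Anchoring both patterns at the start of the string keeps ordinary titles that merely contain a colon or a URL further along from being rejected, and it avoids depending on localized Windows folder names.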

Found another issue in a bot edit from several days ago (this may already be fixed). Issue: The bot should be trained not to record "flash_AS_detection" as a title... 'Nuf said. --Orlady (talk) 18:47, 14 February 2008 (UTC)

Amazing

I would also like to thank you for making a bot that really helps out on Wikipedia. From what I read it has its hiccups, but you try getting 1000 contributors to add that text to their URLs in the same timeframe. Keep up the good work NicDumZ! (Tries to think of something better than motor oil) -- Riffsyphon1024 (talk) 04:16, 14 February 2008 (UTC)

Yes, this has been a real help on all the pages that I watch (only English pages). Keep up the good work. Vincecate (talk) 14:12, 14 February 2008 (UTC)

Amazingly stupid

Why would you let a bot loose to make substantial edits on wikipedia unless it has been extraordinarily well tested?

The bot made two substantial, and completely incorrect, edits to the ITA_Software entry, substituting the name of some other organization for the company's actual name, and adding a sentence about that other organization's members.

That's pretty destructive, and given the probability of such kinds of mistakes (1, by my calculation) by any robot with such grand plans, seems kind of obvious such robots should not be running around wikipedia. —Preceding unsigned comment added by 67.165.122.220 (talk) 06:32, 14 February 2008 (UTC)

Erm. Actually it made only one edit, [12], and I'm perfectly fine with it. ?!?! NicDumZ ~ 06:45, 14 February 2008 (UTC)

Methinks Mr Anonymous IP has looked at this diff and misread the headings of the left column. :) — the Sidhekin (talk) 06:50, 14 February 2008 (UTC)

Sheesh.--Esprit15d • talk • contribs 14:45, 14 February 2008 (UTC)

Amazingly stupid, indeed. vıd ıoman 18:56, 14 February 2008 (UTC)

Peer Online Reference Editing

Thanks for enhancing the raw references I put in most of my edits! :-) Reminds me of a project I used to propose on my user page:

Peer Online-Reference Editing: to enhance the overall quality of references, I suggest separating the work of finding references from the work of enhancing them. The reference finder would just write <ref>http://someurl</ref>. The reference enhancer would then come, open the reference link, check that the source is authoritative, check that it indeed confirms the fact, and format it with a nicely filled template. First benefit: adding a reference is faster, leading to more references. Second benefit: references are peer-reviewed, leading to better reference quality. Not enhancing your own online references could become a guideline. The enhancers' work could be made faster with a web application that would automatically suggest the kind of template and some fields, such as the article's name, the author's name, or the access date.

What do you think about it ? Nicolas1981 (talk) 10:08, 14 February 2008 (UTC)

The online thing exists already. It's impossible for a machine to distinguish a good reference source from a bad one. And please read the FAQ about the templates. — Dispenser 11:18, 14 February 2008 (UTC)

Indeed, there is a Fact and Reference Check WikiProject. --Jtir (talk) 18:05, 14 February 2008 (UTC)

Good job on the bot

Excellent work with this bot. You have greatly contributed toward a better wikipedia. Fredsmith2 (talk) 02:22, 15 February 2008 (UTC)

Are you sure this is a good idea?

Take this diff: [13] Most of the "titles" it generates really aren't helpful: [http://myweb.tiscali.co.uk/celynog/Brittany/kermario.htm Kermario] - it's more helpful to the reader to see the bare URL. In this case, it hints that the site is a personal website for some amateur based in Britain, and even without going there, suggests that it's not a strong reference. What does captioning it as "Kermario" add? How does this benefit the reader? Stevage 04:38, 10 February 2008 (UTC)

I agree - A few pages I watch have had similar changes. (I think) I understand what it's trying to do, but I don't think this is a good job for a bot to be doing (at least, not a bot without some significant English language parsing). Natebailey (talk) 06:59, 10 February 2008 (UTC)

I think the work that is being done is good. I understand some of the requests/complaints here, but overall it is probably worth it. The most important part of the work (and the initial request for a bot) is being done wonderfully. It is changing:

<ref>[http://www.yahoo.com]</ref> which shows like this: [1]
...into
<ref>[http://www.yahoo.com - Yahoo.com]</ref>. I think anyone can see that this is helpful; the comments above have merit, also.
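
As a rough illustration of the transformation described here, the Python sketch below rewrites bare-URL references and leaves everything else alone. It is not DumZiBoT's code: the function name title_bare_refs and the get_title callback are hypothetical, and fetching the page title is deliberately left to the caller.

```python
import re

# A reference whose whole body is one URL, with or without square brackets.
BARE_REF_RE = re.compile(r"<ref>\[?(https?://[^\s\]<]+)\]?</ref>")


def title_bare_refs(wikitext, get_title):
    """Turn <ref>[URL]</ref> into <ref>[URL Title]</ref>.

    `get_title(url)` is a caller-supplied function returning a title string,
    or None when no trustworthy title is available.
    """
    def replace(match):
        url = match.group(1)
        title = get_title(url)
        if not title:
            return match.group(0)  # unsure: keep the reference unchanged
        return "<ref>[%s %s]</ref>" % (url, title)

    return BARE_REF_RE.sub(replace, wikitext)


known_titles = {"http://www.yahoo.com": "Yahoo.com"}
print(title_bare_refs("<ref>[http://www.yahoo.com]</ref>", known_titles.get))
```

Leaving a reference unchanged whenever get_title returns nothing is one way to keep unhelpful titles out of articles.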

On an unrelated note, you could be using {{archive banner}} on this page, instead of your hand-drawn template. :-) Timneu22 (talk) 13:02, 10 February 2008 (UTC)

Well, the last time that I tried using {{archive banner}}, the image parameter was just not working, so I subst'd it, and modified the image :) NicDumZ ~ 22:44, 10 February 2008 (UTC)

Yeah, there's no image param for archive banner. Yet. Timneu22 (talk) 01:38, 11 February 2008 (UTC)

For the first comments: I sort of understand. Similar extended comments have been made on de:, and people over there tend to think that nothing should be done rather than slightly improving the pages: DumZiBoT, though still flagged, is no longer allowed to run the reflinks task on de:. Still, there are tens of thousands of articles that need fixing, with several links in every article. I don't think that anyone is going to do this manually, so I just get a bot to do it as best I can... Not every change is perfect, but most of the retrieved titles seem relevant or, at least, better than the URL. NicDumZ ~ 22:44, 10 February 2008 (UTC)

I think you could have thought this through a little bit more and discussed it with some people. Rather than launching into "change every reflink no matter what, except for a couple of known exceptions", it would have been better to start with "analyse reflink, change it if the bot is certain that the change is an improvement". The yahoo example given is very borderline - though I would have expected it to be linked as "Yahoo!"

So, maybe 95% of the changes are good. But 5% of a million is still a lot of annoying changes....Stevage 05:46, 16 February 2008 (UTC)