Talk:Lemmatization

From Wikipedia, the free encyclopedia

lemma vs lexeme

Quote from the article:

The combination of the base form with the part of speech is often called the 'lexeme' of the word.

I don't know who uses this terminology, but e.g. Laurie Bauer uses it slightly differently in his textbook Introducing Linguistic Morphology: "A lexeme is a dictionary word, an abstract unit of vocabulary. It is realised [...] by word-forms, in such a way that the word-form represents the lexeme and any inflectional endings [...] that are required." So I think that base form + POS are properties of a lexeme but not the lexeme itself. I'd say that 'lexeme' is just a word for 'word', in the sense of 'word' where tree and trees are both the same word. -- Mumpitz 20:49, 28 November 2006 (UTC)

Lexicology

Just a small point -- the sentence (used to cover the whole alphabet) was originally "The quick brown fox jumps over the lazy dog", so the plural "dogs" is unnecessary. FYI, Julia Rossi (talk) 07:49, 30 January 2009 (UTC)

Needs better examples

The typing example has only one inflected term (jumps). This page deserves some better examples. I'll try to find some myself. --Avirr (talk) 20:10, 16 October 2009 (UTC)

"...might not be valid (see lazy below)"

What's wrong with 'lazy' below? Maybe it's actually stemmed to 'laz'? -- IdS (talk) 01:19, 12 March 2010 (UTC)

I think the current example ([The] [quick]...) should be modified or deleted altogether. It does not illustrate the difference between stemming and lemmatisation. It only illustrates the difference between using a Lucene Analyzer (SnowballAnalyzer) with a StopFilter, which removes the stop words, and some other Analyzer which does not filter out the stop words but otherwise does the same thing. Gblarp (talk) 13:41, 22 March 2010 (UTC)

A lot of stemmers stem many words that end in -y so that they end in -i instead, for various reasons. The Porter2 stemmer - often considered the newest/best stemmer - does this less often than the original Porter stemmer, but it still happens. 'lazy' is, in fact, stemmed to 'lazi' by the Porter2 stemmer (just tried it out now). The person doing the example probably just made a typo. I'm changing it, and giving a bit of an explanation. Sir Tobek (talk) 08:19, 19 August 2011 (UTC)
Alright, I made the example a little better. There's still room for improvement. I can't promise that the Lucene Snowball Analyzer does actually stem it that way - I'm pretty certain it does from what I know about Snowball, but I can't currently get it to work on my computer. I don't have much in the way of sources for what I wrote but... I'm pretty sure it's right. I'm doing a lot of stemming in my dissertation at this very moment. What good procrastination Wikipedia is! Sir Tobek (talk) 08:58, 19 August 2011 (UTC)
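
For anyone wanting to check the 'lazy' → 'lazi' behaviour without installing a stemmer, here is a minimal sketch of just the one Porter2 rule involved (step 1c of the English Snowball algorithm, as I understand it: a final y becomes i when it follows a non-vowel that is not the first letter of the word). It is not a full stemmer and says nothing about what Lucene's analyzer does; the function name step_1c is made up for illustration, and lowercase input is assumed.

```python
# Simplified illustration of Porter2 (English Snowball) step 1c only.
# Assumption: input is a single lowercase word; no other stemming steps are applied.

VOWELS = set("aeiouy")  # Porter2 treats y as a vowel for this test


def step_1c(word: str) -> str:
    """Replace a final y/Y with i if the preceding letter is a non-vowel
    and that letter is not the first letter of the word."""
    if len(word) > 2 and word[-1] in "yY" and word[-2] not in VOWELS:
        return word[:-1] + "i"
    return word


if __name__ == "__main__":
    for w in ["lazy", "cry", "by", "say"]:
        print(w, "->", step_1c(w))
```

A lemmatizer, by contrast, would be expected to return the dictionary form 'lazy' unchanged, which is the difference the article's example is trying to show.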

Requested move

The following discussion is an archived discussion of the proposal. Please do not modify it. Subsequent comments should be made in a new section on the talk page. No further edits should be made to this section.

The result of the proposal was not moved per WP:ENGVAR.--Fuhghettaboutit (talk) 09:59, 28 May 2011 (UTC)


Lemmatisation → Lemmatization – I would like to switch the Lemmatization and Lemmatisation articles, so the spelling would be consistent within the text. However, I am not sure how exactly (technically) to switch two articles where one is a redirect to the other, so I am listing it here. Have a nice day. Running 14:40, 21 May 2011 (UTC)

    • Oppose. First, a point of order; there is only one article involved here; Lemmatization is a redirect, not an article. Anyway, I oppose per WP:RETAIN; the Commonwealth spelling has been used since the article was started in 2004, and with no strong national ties to the subject, there's no reason to switch to the North American spelling. Powers T 14:52, 22 May 2011 (UTC)
    • Strong Oppose. The concept of first major contributor should be enforced here. A unilateral edit by an IP changed the spelling into the American English form. Because edits have taken place since then, a simple revert can't be used, but it would be simple to quickly return the spelling to British English and then tag the article on its discussion page to note that it is in BrE. Shatter Resistance (talk) 11:37, 24 May 2011 (UTC)
The above discussion is preserved as an archive of the proposal. Please do not modify it. Subsequent comments should be made in a new section on this talk page. No further edits should be made to this section.

deleted Implementations section

This section contained only broken links and spammish links to commercial services. I deleted it.--2603:8000:8901:F00:B59F:7584:B15F:CD8 (talk) 20:41, 16 April 2021 (UTC)

Requested move 22 August 2023

The following is a closed discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. Editors desiring to contest the closing decision should consider a move review after discussing it on the closer's talk page. No further edits should be made to this discussion.

The result of the move request was: moved, given the evidence presented showing that the Z-spelling is the usual spelling even in varieties where you would expect the S-spelling. (closed by non-admin page mover) Ceso femmuin mbolgaig mbung, mellohi! (投稿) 16:58, 6 September 2023 (UTC)


Lemmatisation → Lemmatization – Reopening this because the last discussion failed to mention that the -ize spelling is also favoured by Oxford spelling, which is commonly seen in the UK in academic contexts. Every source cited uses the -ize spelling (except for 6 and 7, which do not use the word at all). This includes the cited Collins Dictionary, which also adheres to Oxford spelling. Ngrams also shows the -ize spelling to be roughly 7x as frequent. jajaperson (talk) 11:12, 22 August 2023 (UTC) – Relisting. BilledMammal (talk) 11:39, 30 August 2023 (UTC)

  • Oppose per WP:RETAIN. The current spelling has been long established here. Oxford spelling is not the standard common spelling in the UK, and its use in some academic writing is not a convincing reason to make such a change. Timrollpickering (talk) 07:16, 23 August 2023 (UTC)
  • Comment. Is it possible that Lemmatization is the standard in British English? That's the standard that would need to be met here, or at least a half & half split to make a MOS:COMMONALITY argument. If someone wants to investigate usage strictly in Commonwealth/British publications, this might be helpful. SnowFire (talk) 16:57, 24 August 2023 (UTC)
    Comment. A Scopus advanced search for AFFILCOUNTRY ( united AND kingdom ) ALL ( lemmatization ) shows 153 results, whereas AFFILCOUNTRY ( united AND kingdom ) ALL ( lemmatisation ) yields 29. Obviously -ize is favoured in the US, but it's also favoured by academics in the UK, and this isn't really a word that's used outside of academic contexts. The -ize spelling is nearly universal. jajaperson (talk) 11:57, 25 August 2023 (UTC)
  • Support per jajaperson's comment (which probably should have been part of the nomination). One of the rare reasons to overturn RETAIN is if a spelling isn't the majority even in its "home" context, so it seems like British English prefers the 'z' here. SnowFire (talk) 14:33, 25 August 2023 (UTC)
  • Oppose per WP:RETAIN. "-isation" is always far commoner in British English than "-ization", whatever the OED may prefer. -- Necrothesp (talk) 13:47, 29 August 2023 (UTC)
That doesn't seem to be accurate according to the British Google Ngrams. Rreagan007 (talk) 16:43, 30 August 2023 (UTC)
Please see my reply above. Linguistics papers contributed to by British linguists are about 5.6x more likely to use -ize than -ise. jajaperson (talk) 14:12, 29 August 2023 (UTC)
The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.