Talk:Naive Bayes classifier/Archives/2013


Feature Discretization

From the Parameter estimation section: "Non-discrete features need to be discretized first. Discretization can be unsupervised (ad-hoc selection of bins) or supervised (binning guided by information in training data)."

The quoted statement is false. Continuous features (e.g., 1, 1.5, 2) can also be handled by maximum likelihood estimation without binning. Assuming a Gaussian distribution of the data, the maximum likelihood estimator of the mean is simply the mean of the set (i.e. add up the numbers and divide by the size of the set, which equals 1.5 in the example set). Joseagonzalez (talk) 03:58, 11 August 2010 (UTC)
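For reference, with a Gaussian class-conditional density the maximum likelihood estimates are just the sample mean and the (biased) sample variance of the feature values observed in that class: μ̂ = (x_1 + ... + x_n)/n and σ̂² = ((x_1 − μ̂)² + ... + (x_n − μ̂)²)/n. For the example set above, μ̂ = (1 + 1.5 + 2)/3 = 1.5.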

Now this was changed, which is good. But I am still a bit puzzled by this whole paragraph:
"Another common technique for handling continuous values is to use binning to discretize the values. In general, the distribution method is a better choice if there is a small amount of training data, or if the precise distribution of the data is known. The discretization method tends to do better if there is a large amount of training data because it will learn to fit the distribution of the data. Since naive Bayes is typically used when a large amount of data is available (as more computationally expensive models can generally achieve better accuracy), the discretization method is generally preferred over the distribution method."
There are no citations for this, and I find the whole paragraph a bit hand-waving. The binning method has the additional complication that a certain number and distribution of bins have to be chosen. Also, if the "precise distribution of the data is known", then surely I don't need a learning algorithm, right? Plus, if the bins are coarse, the approximated distribution will be coarse, and if they are fine, the memory requirements rise, which may collide with the large-data statement above. Also, I'd say that the typical use-case of the algorithm is as a low-cost baseline? So, it would be great if the author of this paragraph could provide some more evidence, or if others would come in and discuss whether these statements are really valid and worth having on the page in this generality. — Preceding unsigned comment added by 95.166.251.153 (talk) 16:47, 29 July 2012 (UTC)
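To make the two options being discussed concrete, here is a minimal illustrative sketch in Python (not from the article; the function names, the toy data and the choice of ten equal-width bins are made up) of the "distribution" approach versus ad-hoc binning for a single continuous feature of one class:

```python
import numpy as np

def gaussian_loglik(x, values):
    # "Distribution" option: fit a Gaussian to the class's feature values, score a new value x.
    mu, var = values.mean(), values.var()
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def binned_loglik(x, values, n_bins=10):
    # "Binning" option: ad-hoc equal-width bins, with add-one smoothing so that
    # empty bins do not produce log(0).
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    counts, _ = np.histogram(values, bins=edges)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    probs = (counts + 1) / (counts.sum() + n_bins)
    return np.log(probs[idx])

heights = np.array([5.9, 5.7, 6.0, 5.8, 5.95])   # toy training values for one class
# Note: the first result is a log-density, the second a log bin-probability.
print(gaussian_loglik(5.85, heights), binned_loglik(5.85, heights))
```

With only five training points the Gaussian fit is the safer choice; the binned estimate only starts to track the true shape of the distribution once each bin has enough data, which is essentially the trade-off the quoted paragraph describes.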

Statistical error in the worked example spam filter

In the "Multi-variate Bernoulli Model" section of [A Comparison of Event Models for Naive Bayes Text Classification], you will note that documents have to be modeled as being drawn from a particular distribution (multivariate bernoulli or multinomial being the most common). See also Andrew Moore's tutorial slides: the occurrence of an instance must be drawn from the joint distribution of mutually independent random variables.

The worked example, however, only multiplies over the words present in the document, i.e. it computes p(D|C) as the product of p(w_i|C) over only those words w_i that actually occur in D.

Add up these "likelihoods" over all possible documents and the sum will be greater than 1 (making it an invalid distribution). Attempt to use a normalising constant, and the feature occurrences will be revealed to be non-independent (the length of the document is fixed, forcing a precise number of features to occur, which cannot be the case if feature occurrence are bernoulli trials, as implied by p(w_i|C)).

To be formally correct, the likelihood should multiply over the probabilities of the failed Bernoulli trials as well (for the words that did not occur).

Let F be a multivariate Bernoulli random variable of dimension V (the size of the vocabulary), with individual Bernoulli random variables F_1, F_2, ..., F_V. A particular document D = (f_1, f_2, ..., f_V) represents the outcomes of the independent trials, with success (1) for the presence of each word and failure (0) for its absence. Then the likelihood of D given class C is p(D|C) = p(F_1 = f_1|C) × p(F_2 = f_2|C) × ... × p(F_V = f_V|C), where each factor is p(F_i = 1|C) if the word is present (f_i = 1) and 1 − p(F_i = 1|C) if it is absent (f_i = 0).

This multiplies over all words in the vocabulary, and results in a somewhat lower likelihood (because the non-occurrence of a feature will, in real-world situations, have probability a bit less than 1.0).

Winter Breeze (talk) 13:46, 20 November 2007 (UTC)
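As a tiny sanity check of the point above, with a made-up two-word vocabulary (not taken from the article): let V = {a, b} and p(a|C) = p(b|C) = 0.5. Multiplying only over the present words assigns 1, 0.5, 0.5 and 0.25 to the documents {}, {a}, {b} and {a, b}, which sums to 2.25; the full Bernoulli product assigns 0.5 × 0.5 = 0.25 to each of the four documents, which sums to 1, as a proper distribution must.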

Link to peer reviewed paper

Hi, I recently added an external link to the source code of a Matlab Naive Bayes classifier implementation, but the link has repeatedly been removed. I am currently doing my PhD in pattern recognition and I know the link is of high quality and very relevant.

Are the reference and the external link http://www.pattenrecognition.co.za suitable for this site? If not, what can I do so that this information is not repeatedly removed?

cvdwalt

I don't know, but probably because the link is dead. I think you have misspelled it. Is http://www.patternrecognition.co.za/ the correct one? --222.124.122.192 12:56, 17 September 2007 (UTC)
As of Nov 27, 2008, the link is not working; it goes to a hosting web site instead. Martincamino (talk) 22:17, 27 November 2008 (UTC)
I would say "no". The website has no author information on it. It appears to be one anonymous person's collection of others' classification tutorials. Link to the original tutorials, not to the copies on patternrecognition.co.za. As for the papers, they are all by "C.M. van der Walt" (cvdwalt); I don't think it's appropriate to rate one's own site as being "of high quality and very relevant". Winter Breeze (talk) 08:04, 1 June 2009 (UTC)

2002-2004

Could someone please add an introduction which explains comprehensibly, to someone who is not a mathematician, what this thing is? --Elian 22:01 Sep 24, 2002 (UTC)

Just as soon as someone adds an introduction which explains comprehensibly, to someone who is a mathematician (well, technically physics and computing stuff, only just started school, but anyway...), what those symbols are supposed to mean...

D is an object of type document and C is an object of type class, so how can both p(D|C) and p(C|D) be meaningful (with two different values, even)? I can only guess whether "p(D and C)" is supposed to be something like boolean operations, set operations, surgical operations, or CIA operations... Cyp 19:42 Feb 10, 2003 (UTC)

I think the notation is clear to persons familiar with probability theory, but it could probably be explained more clearly for those who are not. Michael Hardy 19:44 Feb 10, 2003 (UTC)

Under the assumption that Probability axiom is right and meaningful, I've added a "(see Probability axiom)" and used that particular "and" symbol (∧). There was an edit conflict; someone else LaTeXed the last two lines before I could submit the new ∧ symbol... (The person put the text "and"; hope I was right to replace it with ∧.) Cyp 20:12 Feb 10, 2003 (UTC)

Aaargh... Now that I know what the notation means... Either I'm going mad, or all the fractions in the entire article are upside down... Cyp 20:41 Feb 10, 2003 (UTC)

From the article:

Important: Either I'm going mad, or the following formula, along with the rest of the formulas, is upside down (D/C instead of C/D)... If I wasn't considering the possibility of me going mad, I would correct this article myself. (Triple-checked that I didn't accidentally reverse them myself when adding the ∧ symbol.) If I'm mad, just remove this line. If I'm not, let me know, or correct it yourself (and remove this line anyway).

Fixed the upside-down equations. Please review my changes to make sure I've made the right changes.

Seems like what I'd have done... So I guess I wasn't mad, then. Cyp 17:20 Feb 11, 2003 (UTC)

Calling this page Naive Bayesian is extremely misleading. It is more generally known as a Naive Bayes classifier. For something to be Bayesian the parameters are treated as random variables. In the Naive Bayes Classifier this doesn't happen. I strongly suggest that the name is changed. Note that Google has 22,600 hits for "Naive Bayes" and 6,400 hits for "Naive Bayesian". A naive Bayesian is a Bayesian who is naive. Naive Bayes is a simple independence assumption. --Lawrennd 16:49 Sep 20, 2004

For something to be Bayesian the parameters are treated as random variables. This is simply not so. "Bayesian" has a much broader meaning than treating parameters as random variables. I agree that "naive Bayes classifier" is more commonly used (and therefore it's a more appropriate title), but the current name is not "extremely misleading". Wile E. Heresiarch 21:39, 20 Sep 2004 (UTC)
I agree that the title is somewhat inappropriate. "Naive Bayes" is clearly the more common name, which is sufficient motivation for changing the title. However, the very first paragraph of the article in fact points out that NB classification does not require any Bayesian methods. While that discussion could be improved, it is not deficient to the point of being misleading. The term "Bayesian" is often vague and can refer to something as generic as automatically trained methods: cf. "Bayesian spam filtering", which is usually not Bayesian in the sense of treating parameters (but not hyperparameters) as random variables. --MarkSweep 18:49, 21 Sep 2004 (UTC)
I'd prefer "Naïve Bayesian classification" to the current title. Κσυπ Cyp   23:00, 21 Sep 2004 (UTC)

please translate anybody

Oh no... Can anybody who speaks mathematics translate this into a more common language? Shouldn't it be English? :-) No, really: Bayesian networks are used in programming, so it would be useful to talk about them in a common programming language: C++, Java, (Perl?). I don't think there are a lot of people who can understand this math notation, and it doesn't help if you are a programmer who wants to work with Bayesian networks. So please translate this, or add programming-language examples with loops and simple calculations. (Even the word "class" is very confusing, because it is quite different from the word as used in the world of programming.)

Why do you think that C++ is a more common language than mathematics? Sympleko (Συμπλεκω) 15:12, 17 April 2008 (UTC)
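For what it's worth, here is a minimal illustrative sketch in Python of how the classification rule in the article translates into code. It is only a sketch under simplifying assumptions (categorical features, crude add-one smoothing), and all function names and toy data are made up:

```python
from collections import Counter, defaultdict
import math

def train(samples):
    """samples: list of (feature_dict, class_label) pairs."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (class, feature name) -> counts of observed values
    for features, label in samples:
        class_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
    return class_counts, value_counts

def classify(features, class_counts, value_counts):
    """Return the class maximising log p(C) + sum_i log p(F_i = f_i | C)."""
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for name, value in features.items():
            counts = value_counts[(label, name)]
            # crude add-one smoothing so unseen values do not give log(0)
            score += math.log((counts[value] + 1) / (sum(counts.values()) + len(counts) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

data = [({"colour": "red", "shape": "round"}, "apple"),
        ({"colour": "yellow", "shape": "long"}, "banana"),
        ({"colour": "red", "shape": "round"}, "apple")]
cc, vc = train(data)
print(classify({"colour": "yellow", "shape": "long"}, cc, vc))  # -> banana
```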

Example PHP script

It seems a bit silly to me that the example script chosen for this page (that PHP script) links to a resource that, whilst apparently free, is compiled into some fantastic Zend-optimised format and is therefore no good to read whatsoever. Are there any other examples out there? —Preceding unsigned comment added by 82.33.75.53 (talk) 21:03, 28 May 2006

I agree. I just added a Visual Basic implementation with source code, and out of interest checked out the PHP script. First of all, if it belongs anywhere, it belongs in Bayesian spam filtering. Second of all, since the source is missing it doesn't do anybody any good. I removed it. --Stimpy 13:47, 28 June 2006 (UTC)

proposed merge with Bayesian spam filtering

Not much of a discussion three months after the merge was suggested. I'd rather keep them separate. Rl 15:44, 18 June 2006 (UTC)

They should be kept separate. Bayesian spam filtering is a topic of its own, and would clutter the relatively straightforward article about Naive Bayes. --Stimpy 13:49, 28 June 2006 (UTC)
Now it's been five months and people agree it shouldn't happen. I removed the proposal. ~a (usertalkcontribs) 16:41, 16 August 2006 (UTC)

naive Bayes conditional independence assumption

P(F_i|C,F_j) = P(F_i|C) only defines pairwise conditional independence, which is not equivalent to the mutual conditional independence that is actually needed, namely p(F_1,...,F_n|C) = p(F_1|C) × p(F_2|C) × ... × p(F_n|C)!

regards

You're right, I made a correction, hope it reads well 158.143.77.29 (talk) 11:43, 26 March 2013 (UTC)

Skipped some steps?

I'm completely flummoxed by something here. In the article it says:

Using Bayes' theorem, we write p(C|F_1,...,F_n) = p(C) p(F_1,...,F_n|C) / p(F_1,...,F_n)

Then the article says:

...The numerator is equivalent to the joint probability model p(C, F_1,...,F_n)

What happened here? This is non-obvious. --Herdrick 18:49, 16 February 2007 (UTC).

afaik this follows directly from the definition of the _conditional_ probability, which is:
p(A|B)=p(A,B)/p(B)
greets Will
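Spelling the step out using only that definition (same notation as the article): p(C|F_1,...,F_n) = p(C, F_1,...,F_n) / p(F_1,...,F_n), and applying the definition once more gives p(C, F_1,...,F_n) = p(C) p(F_1,...,F_n|C), so the numerator in the Bayes' theorem form is exactly the joint probability model.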
We don't want the joint probability - we want the probability of the features conditional on the class variable, so it's perfectly sensible. 137.158.205.123 (talk) 13:15, 20 November 2007 (UTC)

Naive Bayes != Idiot's Bayes

Someone used this term once in a paper, surprisingly. It's not a recognized second name for the algorithm. (D.J. Hand and K. Yu, Idiot's Bayes Not so Stupid after All? Int'l Statistical Rev., vol. 69, no. 3, pp. 385-398, 2001)

- 128.252.5.115 02:07, 23 March 2007 (UTC)

(Probabilityislogic (talk) 11:59, 18 March 2011 (UTC))

This classifier is certainly a "naive" one, but it is "naive" only if one actually has knowledge of connections between the different attributes. The "naivety" is in throwing away potentially important information that probability theory can take into account.

If you do not know whether any relationships or dependencies exist, then it is actually more conservative to assume that they do not (and certainly not naive). This is because the presence of correlations places additional constraints on the data, lowering the number of ways a particular set of data can be produced (see the principle of maximum entropy for details).

Well, if you're not utterly starved for training samples, you usually do in fact know that dependencies are likely to exist... unless you are an idiot, of course. — Preceding unsigned comment added by 99.109.17.32 (talk) 05:39, 30 March 2012 (UTC)

I think it would be worthwhile to include this point in the article, as it is "hinted at" a few times, along with the "bewilderment" of why the results are so accurate, given that independence may not necessarily hold in the real world.

(Probabilityislogic (talk) 11:59, 18 March 2011 (UTC))

Removed Statement

> The Naive Bayes classifier performs better than all other classifiers under very specific conditions.

This sentence is so non-specific that it is useless. —The preceding unsigned comment was added by 128.2.16.65 (talk) 15:00, 11 May 2007 (UTC).

True, it's pretty useless like that. I will consult some of my notes from a class I took and maybe I can correct it with specifics. HebrewHammerTime 07:03, 2 August 2007 (UTC)
Worst statement ever -- thanks for removing. Every classifier in existence performs better than all other classifiers under very specific conditions. — Preceding unsigned comment added by 99.109.17.32 (talk) 05:30, 30 March 2012 (UTC)

Formulas dividing by zero?

I'm confused by some of the formulas where, for example, p(w_i|S) is used in the denominator of an expression. If a word never appears in a spam message, then that term would be zero and the division undefined. Likewise when it appears in a numerator and the log of it is taken, since log(0) is also undefined.

Kevin 15:04, 1 August 2007 (UTC)

I'm not an expert on this topic, but I've done some programming with naive Bayes, and clearly dividing by 0 won't work. In theory, if you analyzed lots and lots of spam you'd very rarely have a 0 in the denominator, but naturally these programs can't analyze that much and will sometimes have a 0 there. The solution I used was to substitute a constant larger than the other values, so that a divide by 0 would still be represented sensibly. I actually found these instances very useful for my classification: giving them a bit of extra weight sometimes increased my classifier's accuracy. HebrewHammerTime 07:01, 2 August 2007 (UTC)
The probability can be zero if you use the maximum likelihood estimate of p(w_i|S). If zeros occur, you are better off with a pseudocount or a posterior estimate of the frequency, which is non-zero even if the word occurs zero times. 137.158.205.123 (talk)
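For concreteness, a pseudocount (additive, or Laplace, smoothing) estimate looks like p̂(w_i|S) = (count(w_i, S) + α) / (N_S + 2α), where count(w_i, S) and N_S are illustrative names for the number of spam messages containing w_i and the total number of spam messages, and α > 0 is the pseudocount. This estimate is strictly between 0 and 1 even when count(w_i, S) = 0, so neither the division nor the logarithm blows up; α = 1 gives Laplace's rule of succession.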

Apparent Plagiarism

The worked example seems to be an exact copy of this work (pdf). --Vince | Talk 08:33, 10 April 2008 (UTC)

At the top of that paper, it says "General Mathematics Vol. 14, No. 4 (2006), 135–138". Yet, when I took a cursory glance at the revision history, I thought I saw the corresponding examples in Wikipedia prior to 2006. Is it possible that it was copied in the other direction? Because it is so taboo in academia to cite Wikipedia, it can be difficult for scholarly papers to properly give attribution. Further, these "examples" consist of rather well-established formulas and don't really express any creativity, so it might also be argued that copyright may not be relevant here. --Headlessplatter (talk) 15:48, 18 March 2011 (UTC)

How to join the different probabilities?

How is the total spam probability calculated? For example, if pr(spam | "viagra") = 0.9 and pr(spam | "hello") = 0.2, how is pr(spam | {"viagra", "hello"}) calculated? —Preceding unsigned comment added by 82.155.78.196 (talk) 17:16, 18 August 2008 (UTC)
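Assuming the words are treated as conditionally independent given the class and the two classes have equal priors (the usual setting in the spam-filtering example), the per-word probabilities p1 = 0.9 and p2 = 0.2 combine as pr(spam | {"viagra", "hello"}) = p1·p2 / (p1·p2 + (1 − p1)(1 − p2)) = (0.9 × 0.2) / (0.9 × 0.2 + 0.1 × 0.8) = 0.18 / 0.26 ≈ 0.69. With unequal priors, the prior odds would be multiplied into both products accordingly.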

Commercial external examples

Is there any reason we shouldn't be able to add an external link to a website that performs Bayesian classification? I tried to add DiscoverText.com but Wikipedia removed it. — Preceding unsigned comment added by 24.9.167.226 (talk) 17:33, 27 June 2011 (UTC)

C# implementation

https://sites.google.com/a/wmail.fi/ukramdata/bayes

Here is my C# implementation of the algorithm. I got different values for the variance, and I think the error is not in my code. — Preceding unsigned comment added by 213.243.135.48 (talk) 12:53, 9 July 2012 (UTC)

I believe you are right. This is how I calculate it: mean of male foot size: (12+11+12+10)/4 = 11.25. Variance: ((12-11.25)²+(11-11.25)²+(12-11.25)²+(10-11.25)²)/4 = (0.75²+(-0.25)²+0.75²+(-1.25)²)/4 = 2.75/4 = 0.6875. — Preceding unsigned comment added by 119.75.27.50 (talk) 07:21, 13 July 2013 (UTC)
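If the article's number differs, the discrepancy may simply be the choice of estimator: the calculation above is the population (maximum likelihood) variance, dividing the sum of squared deviations 2.75 by n = 4, whereas dividing by n − 1 gives the unbiased sample variance, 2.75/3 ≈ 0.9167, which appears to be what the worked example uses.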

Example

Sex classification seems to be a bad example, since height and weight are likely to be strongly correlated, which violates the "naive" assumption. — Preceding unsigned comment added by 108.199.130.44 (talk) 07:17, 11 September 2013 (UTC)