Talk:Spreadsort

Page contents not supported in other languages.
From Wikipedia, the free encyclopedia

Talk[edit]

Issues with Spreadsort:

Sorting outside of RAM[edit]

How much should this be described, if at all? It's a significant strength of the algorithm.


Worst-case performance[edit]

As cited in the Spreadsort paper, Markku Tamminen's paper "Two is as good as any" proves that sorting algorithms of this type are O(n) for continuous integrable functions. It requires effort to design a distribution that makes this hit its worst-case performance. On evenly distributed random data, Spreadsort (like Tamminen's approach) breaks up the problem into tiny pieces which comparison-based algorithms can finish quickly. On data with a lot of overlap, it often can solve the problem quickly in a distributional fashion.

Parallel version[edit]

Anyone interested? It hasn't been developed, but should work well.


Value vs. "-"[edit]

Which is more general and/or usable? A Value function that returns a value to be subtracted, or a "-" function? Obviously a "-" function can be constructed as Value() - Value(). I left this alone since that's the way it is described in the paper.

Two are as Good as Any[edit]

Markku Tamminen proved that two recursive iterations of this type of algorithm will have O(n) time on any continuous integrable function in his paper "Two is as Good as Any". For performance reasons, this isn't always optimal, but it was an interesting result. Real data often isn't in the form of such a function, but a bell curve is a common example of a function that is integrable. It is trivial to make Spreadsort always do two iterations if the first iteration doesn't cut the problem down to a very small size, to attain this result. I'm not sure how much mention to make of this in the article; I consider it an important result.

Personal communication[edit]

I'm afraid we can't cite a personal communication. The whole point of citation on Wikipedia is verifiability, and citing a personal communication that's not available for verification doesn't qualify. --P3d0 19:54, 6 May 2007 (UTC)[reply]

Okay, good point, feel free to remove that info. I'm a bit annoyed about citing the paper's result though, which is compared to C's qsort() and so extremely optimistic. Dcoetzee 07:29, 12 May 2007 (UTC)[reply]
Well, frankly I don't know what to do with this article. It seems to be advocating the algorithm rather than describing it. --P3d0 13:26, 17 May 2007 (UTC)[reply]

It describes the basics of the algorithm, based upon my publication. Also, comparisons were eventually made to std::sort, which is quite fast (do you know of a faster general-case library sorting call?), though Spreadsort was consistently faster than it in my testing on large random data sets with different distributions on multiple platforms and in multiple compilers. I am interested in demonstrating the performance of Spreadsort, but yet to find a place where I can do so without open-sourcing the fully optimized source code. If you feel it is advocating the algorithm, you are welcome to modify the description in an accurate fashion to attain NPOV.Cuberoot31 17:56, 7 June 2007 (UTC)[reply]

It reads like a paper, but without any actual description of the algorithm. I don't think it's suitable encyclopedia material at all. Above all other objections... why don't we describe the algorithm here? --P3d0 20:14, 18 June 2007 (UTC)[reply]

The Paper has a description of the algorithm. I would happily send you a copy of the paper (which describes the algorithm) if you like. I could get in trouble for providing any kind of detailed algorithmic description, so here's a basic one that describes how it operates:

       1) Scan the data for the maximum and minimum key values;
       2) Determine the number of bins n/c to cover the range found in 1 based upon the number of data items n.
       3) move each datum (or a pointer to it) into its correct bin
       4) Sort each bin that isn't fully sorted by recursive application of either Spreadsort or a fast comparison-based algorithm, using a quick heuristic to calculate which has better worst-case performance.

2 and 4 are both significantly different from pigeonsort, and 3 likely is.

I will happily provide verifiable results if anyone will tell me where this can be done without violating an NDA on the source.Cuberoot31 22:27, 18 June 2007 (UTC)[reply]

Pigeonsort[edit]

This sort method appears to be very similar if not identical to pigeonsort which was first published during the 1980s. What's the difference ? -- Derek Ross | Talk 05:03, 12 May 2007 (UTC)[reply]

I've never heard of pigeonsort. Reference? Dcoetzee 07:28, 12 May 2007 (UTC)[reply]
I think we have an article on it called Pigeonhole sort. However having read that article it misses the point and describes something more like count sort than pigeonsort. In particular the pseudocode in that article is just count sort. I first saw pigeonsort described in a copmputer magazine, Personal Computer World or perhaps PC Magazine, during the late 1980s. It was notable that Knuth wrote a letter to the magazine the following month commenting on it. Personally I found its speed very impressive when I tried implementing it myself (originally in BBC BASIC). I still have a QBASIC version of it somewhere that I can look out but there don't seem to be too many references to it on the 'net. A google search for pigeonsort algorithm just returns one result and that is a mention of pigeonsort rather than a description! -- Derek Ross | Talk 18:07, 12 May 2007 (UTC)[reply]
As far as I recall the pigeonsort steps are
1) Scan the data for the maximum and minimum values;
2) Map the values to numerics and use the results together with the number of pigeonholes (ie partitions) to calculate the pigeonhole partition values;
3) Scan the data to count the number of slots required for each pigeonhole;
4) Scan the data to move each datum (or a pointer to it) into a slot in the appropriate pigeonhole;
5) Sort each pigeonhole with more than one entry in it.
For large pigeonholes step 5 can be accomplished by recursive application of pigeonsort but for smaller pigeonholes another sort may be more appropriate. -- Derek Ross | Talk 18:27, 12 May 2007 (UTC)[reply]

I appreciate the input. I had never heard of Pigeonsort, and if I had, I probably wouldn't have started on the path of designing and optimizing Spreadsort. With that said, Spreadsort is different, and in ways I've found make its performance superior to conventional approaches. Fundamental differences between Spreadsort and Pigeonsort: 1) The heuristic test which determines how to recurse, which is described in the publication 2) It aims at more than one item per partition if n >> 2k, and works quickly even if there are many identical or nearly identical items. Also Spreadsort's step 4 is likely significantly different from pigeonsort; I could compare if I had a copy of pigeonsort. If any of you have references to Pigeonsort, I am strongly interested. Finally Spreadsort is demonstrably faster than std::sort on every type of (large) data set I've found to feed it, of many different types and distribution profiles.Cuberoot31 17:49, 7 June 2007 (UTC)[reply]

Algorithm[edit]

Is it possible to have an algorithmic implementation of the spreadsort on the article? TCMREastwood 05:17, 17 May 2007 (UTC)[reply]

What kind of implementation (C, C++, Pseudocode) are you looking for, and what are you interested in it for?Cuberoot31 17:59, 7 June 2007 (UTC)[reply]

I was thinking C/Pseudocode would do. I'm just interested for the sake of curiosity :P. Could you send me a copy of the paper or anything that doesn't violate NDA? Cheers. —Preceding unsigned comment added by TCMREastwood (talkcontribs) 15:33, 22 September 2007 (UTC)[reply]

I've uploaded it recently. See: Spreadsort Cuberoot31 (talk) 20:37, 8 June 2008 (UTC)[reply]

Performance Comparisons[edit]

As these are hard to verify, when/how should they be included, if at all?

Ross.sj (talk) 14:57, 30 August 2017 (UTC) Please see http://www.boost.org/doc/libs/1_65_0/libs/sort/doc/html/index.html#sort.overview.performance There are also benchmarks and a performance proof in the documentation (now referenced).[reply]

COI[edit]

(moved from my talk User talk:Widefox)
Ross.sj (talk) 15:12, 30 August 2017 (UTC) What is necessary to remove the RefImprove link from the spreadsort page? I added links to the Boost.Sort open-source implementation and documentation, which includes benchmarks and graphs comparing algorithms.[reply]

As the author of the library, I would rather someone else remove or confirm that it's reasonable to remove the RefImprove link as soon as it meets the required standards. If you think more is needed, please specify what you need.

Please can you read and disclose your COI per instructions at WP:COI. I've removed the refimprove as it doesn't go far enough, and replaced it with primary and COI tags. Does the topic meet WP:GNG, ie multiple indi secondary sources? Widefox; talk 18:43, 30 August 2017 (UTC)[reply]

Ross.sj (talk) 01:20, 31 August 2017 (UTC) I've reverted both my and your changes today, and returned it back to the state before any of these changes (the state of your edit in July). It still has the RefImprove link now. I'd be happy to point you to the performance results Francisco Tapia found (which largely match mine) if you doubt accuracy, or you could just look at the Boost discuss list. I have no financial interest in this open-source library that is now part of Boost, but if you don't want corrections from the author, I guess the article will remain less useful and accurate than it could be.[reply]

Please read WP:COI, the article was better with the changes and the additional refs, and the maintenance tags should not be removed without addressing the issues or discussion. Best practice is to refrain from editing the article directly, and make COI edit suggestions here (per that link). What would help is finding independent WP:RS to satisfy WP:GNG. Widefox; talk 11:05, 31 August 2017 (UTC)[reply]