Subject: Sorting
Hi Kieran

KH> I was wondering if someone might be able to point me in the right
KH> direction of how to numerically and alphabetically sort a large
KH> file, say ranging from 1k to over 4meg in size...

RM> I'd suggest finding a book on sorting or general algorithms. Simple
RM> sorting I can explain, but when you introduce the limitation that
RM> the array to be sorted can't fit into memory in one piece, then the
RM> best in-memory sorting algorithms can run into funny limitations
RM> (if, for example, they need to access widely spaced records
RM> sequentially).

KH> I suppose so, but I wouldn't have the first clue on a book which
KH> would cover it.. at the moment the best idea I can think of is
KH> loading as many lines as possible into an array, sorting that, then
KH> loading the rest until you hit the end of the file, sorting as you
KH> go.. then passing over the file numerous times at different line
KH> offsets, but I don't think it would work in practice too well :(

I'll give you a condensed summary of what my book says, to try to
convince you to go get a book. Sorting is a large subject.

CRASH COURSE IN EXTERNAL SORTING
taken from Algorithms by Sedgewick, extras by me.

External sorting is different from in-memory sorting because it is very
expensive to access individual records. The wide variety of external
storage devices makes it impossible to pick one method that would be
Best For Everything. You might, for example, be trying to keep the
array inside cache RAM, or the array may be on a tape drive.

Whatever, most external methods work like this: make a first pass
through the array, splitting it up into blocks small enough to fit in
memory (say x bytes long). Sort these blocks individually, using
whatever in-memory technique is suitable. Then merge the sorted blocks
together, usually over several passes - creating successively larger
blocks by merging (say) y blocks at a time, and merging the resulting
larger blocks y at a time until the whole array is sorted. A multiway
merge like this can be implemented efficiently using priority queues
(there's a sketch of the whole scheme at the end of this message).

Virtual memory systems that allow arbitrarily large arrays to be
treated as if they exist in memory shouldn't be overlooked - an
in-memory sort that doesn't often exceed the "cache size" when it does
successive seeks might perform well. Careful testing is needed; the VM
system could easily be swamped.

The quantity x under Turbo C++ without a DOS extender will be 64k or
less. If the data items are larger than a few bytes, e.g. name and
address records, you can sort larger blocks by sorting a 64k array of
pointers (use the large memory model!). There's usually a speed
improvement here: swapping two pointers is much faster than swapping
two big records (see the second sketch below).

The quantity y is trickier, because this is the number of input files
you will have open at once when you're merging. DOS allows lots of
file handles, but individual programs are often restricted to 20
unless they use tricks that I haven't attempted yet. (You think that's
rough - think of the tape drive situation: the number of open files is
limited by the number of tape drives...) This number will give you
real portability problems if you're looking for the fastest possible
sort.

Have I convinced you to go find a book yet?

Cheers
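Here is a minimal sketch of the split/sort/merge scheme described
above, written in portable modern C++ rather than Turbo C++. The run
size, temporary file names, and line-as-record format are assumptions
for illustration; it also merges all runs in one multiway pass, so it
assumes enough file handles are available - with a handle limit you
would merge y runs at a time over several passes, as described above.

// External merge sort of a text file, one line per record.
#include <algorithm>
#include <fstream>
#include <queue>
#include <string>
#include <vector>

// Phase 1: read runs of at most maxLines lines, sort each run in
// memory, and write it to its own temporary file ("run0.tmp", ...).
std::vector<std::string> makeRuns(const std::string& input,
                                  std::size_t maxLines) {
    std::ifstream in(input);
    std::vector<std::string> runNames, lines;
    std::string line;
    auto flush = [&]() {
        if (lines.empty()) return;
        std::sort(lines.begin(), lines.end()); // any in-memory sort works
        std::string name = "run" + std::to_string(runNames.size()) + ".tmp";
        std::ofstream out(name);
        for (const auto& l : lines) out << l << '\n';
        runNames.push_back(name);
        lines.clear();
    };
    while (std::getline(in, line)) {
        lines.push_back(line);
        if (lines.size() >= maxLines) flush();
    }
    flush();
    return runNames;
}

// Phase 2: multiway merge. The priority queue always holds the current
// front line of each run, smallest on top, so emitting each output
// line costs O(log y) comparisons for y runs.
void mergeRuns(const std::vector<std::string>& runNames,
               const std::string& output) {
    struct Head { std::string line; std::size_t run; };
    auto cmp = [](const Head& a, const Head& b) { return a.line > b.line; };
    std::priority_queue<Head, std::vector<Head>, decltype(cmp)> pq(cmp);

    std::vector<std::ifstream> runs;
    for (const auto& name : runNames) runs.emplace_back(name);

    std::string line;
    for (std::size_t i = 0; i < runs.size(); ++i)
        if (std::getline(runs[i], line)) pq.push({line, i});

    std::ofstream out(output);
    while (!pq.empty()) {
        Head head = pq.top();
        pq.pop();
        out << head.line << '\n';
        if (std::getline(runs[head.run], line)) pq.push({line, head.run});
    }
}

To sort a file: mergeRuns(makeRuns("big.txt", 100000), "sorted.txt").
This sorts lines alphabetically (byte order); a numeric sort just needs
a different comparison in both phases.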
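And a second sketch, of the pointer trick from the paragraph on large
records: sort an array of pointers into the block instead of the
records themselves. The Person record here is made up for illustration.

#include <algorithm>
#include <cstring>
#include <vector>

struct Person {
    char name[64];
    char address[192]; // 256-byte records are expensive to swap directly
};

// Returns pointers to the records in name order; the records themselves
// never move.
std::vector<const Person*> sortByName(const std::vector<Person>& block) {
    std::vector<const Person*> ptrs;
    ptrs.reserve(block.size());
    for (const Person& p : block) ptrs.push_back(&p);
    std::sort(ptrs.begin(), ptrs.end(),
              [](const Person* a, const Person* b) {
                  // Comparisons still read the records, but each swap
                  // moves a pointer-sized value, not a 256-byte record.
                  return std::strcmp(a->name, b->name) < 0;
              });
    return ptrs;
}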
--- PPoint 1.88
 * Origin: Silicon Heaven (3:711/934.16)
SEEN-BY: 711/809 934
SOURCE: echomail via fidonet.ozzmosis.com