echo: aust_c_here
to: Kieran Haughey
from: Roy McNeill
date: 1996-01-11 23:44:04
subject: Sorting

Hi Kieran



 KH> I was wondering if someone might be able to point me in the right
 KH> direction of how to numerically and alphabetically sort a large
 KH> file, say ranging from 1k to over 4meg in size...



 RM> I'd suggest finding a book on sorting or general algorithms. Simple
 RM> sorting I can explain, but when you introduce the limitation that
 RM> the array to be sorted can't fit into memory in one piece, then the
 RM> best in-memory sorting algorithms can run into funny limitations
 RM> (if, for example, they need to access widely spaced records
 RM> sequentially).



 KH> I suppose so, but I wouldn't have the first clue on a book which
 KH> would cover it.. at the moment the best idea I can think of is
 KH> loading as many lines as possible into an array, sorting that, then
 KH> loading the rest until you hit the end of the file, sorting as you
 KH> go.. then passing over the file numerous times at different line
 KH> offsets, but I don't think it would work too well in practice :(



I'll give you a condensed summary of what my book says, to try to
convince you to go get a copy. Sorting is a large subject.







CRASH COURSE IN EXTERNAL SORTING - taken from Algorithms by
Sedgewick, extras by me.





External sorting is different from in-memory sorting because it is
very expensive to access individual records. The wide variety of
external storage devices makes it impossible to pick one method
that would be Best For Everything. You might, for example, be
trying to keep the array inside cache RAM, or the array may be on a
tape drive.



Whatever the device, most external methods work like this:



 Make a first pass through the array, splitting it up into blocks
 small enough to fit in memory (say x bytes long).

 Sort these blocks individually, using whatever in-memory technique
 is suitable.

 Merge these sorted blocks together, usually in several passes -
 creating successively larger blocks by merging (say) y blocks at a
 time, and merging the resulting larger blocks y at a time until
 the whole array is sorted.
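To make the first two steps concrete, here's a minimal C sketch of the run-creation pass, assuming fixed-size int records and tmpfile() for the run files. The function name make_runs and the chunk size are my own inventions, not from the book:

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Read ints from `in`, sort chunks of at most `chunk` items in memory,
   and write each sorted chunk ("run") to its own temporary file.
   Returns the number of runs created, or -1 on failure; each run file
   is rewound and ready to read back for the merge pass. */
int make_runs(FILE *in, FILE *runs[], int max_runs, size_t chunk)
{
    int *buf = malloc(chunk * sizeof *buf);
    int nruns = 0;
    size_t n;

    if (!buf)
        return -1;
    for (;;) {
        n = fread(buf, sizeof *buf, chunk, in);
        if (n == 0)
            break;
        qsort(buf, n, sizeof *buf, cmp_int);   /* in-memory sort of one block */
        if (nruns == max_runs) { free(buf); return -1; }
        runs[nruns] = tmpfile();
        if (!runs[nruns]) { free(buf); return -1; }
        fwrite(buf, sizeof *buf, n, runs[nruns]);
        rewind(runs[nruns]);
        nruns++;
    }
    free(buf);
    return nruns;
}
```

With 10 records and a chunk of 4 you'd get three runs (4 + 4 + 2); real code would pick the chunk so one buffer fills the available memory.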



A multiway merge like this can be implemented efficiently using
priority queues.
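Here's a sketch of that multiway merge using a binary min-heap as the priority queue. For clarity it merges k sorted arrays rather than files; with run files you'd refill a heap node by reading the next record from its run instead of advancing an index. The names and the fixed heap capacity are assumptions of mine:

```c
#include <stddef.h>

struct head { int val; int src; };   /* smallest unread item of one run */

static void sift_down(struct head h[], int n, int i)
{
    for (;;) {
        int small = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && h[l].val < h[small].val) small = l;
        if (r < n && h[r].val < h[small].val) small = r;
        if (small == i) return;
        struct head t = h[i]; h[i] = h[small]; h[small] = t;
        i = small;
    }
}

/* Merge k sorted arrays arr[0..k-1] (lengths len[0..k-1]) into out.
   Each step pops the smallest head and pushes that run's next item,
   so the whole merge costs O(total * log k) comparisons. */
void kway_merge(const int *arr[], const size_t len[], int k, int *out)
{
    struct head heap[64];            /* assumes k <= 64 for this sketch */
    size_t pos[64] = {0};
    int n = 0;

    for (int i = 0; i < k; i++)
        if (len[i] > 0) { heap[n].val = arr[i][0]; heap[n].src = i; n++; }
    for (int i = n / 2 - 1; i >= 0; i--)
        sift_down(heap, n, i);       /* heapify the k run heads */

    for (size_t o = 0; n > 0; o++) {
        int s = heap[0].src;
        out[o] = heap[0].val;        /* emit the smallest head */
        pos[s]++;
        if (pos[s] < len[s])
            heap[0].val = arr[s][pos[s]];  /* refill from the same run */
        else
            heap[0] = heap[--n];           /* run exhausted: shrink heap */
        sift_down(heap, n, 0);
    }
}
```

The point of the heap is that picking the smallest of y run heads takes log y comparisons instead of y, which matters when y is large.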



Virtual memory systems that allow arbitrarily large arrays to be
treated as if they exist in memory shouldn't be overlooked - an
in-memory sort that doesn't often exceed the "cache size" when it
does successive seeks might perform well. Careful testing is
needed, though; the VM system could easily be swamped.



The quantity x under Turbo C++ without a DOS extender will be 64K
or less. If the data items are larger than a few bytes, e.g. name
and address records, you can sort larger blocks by sorting a 64K
array of pointers (use the large memory model!). There's usually a
speed improvement here too: swapping two pointers is much faster
than swapping two big records.
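A small sketch of that pointer trick, assuming a made-up record layout; qsort() then shuffles pointers (4 bytes each in the large model) while the records themselves never move:

```c
#include <stdlib.h>
#include <string.h>

struct rec { char name[32]; char addr[64]; };   /* hypothetical record */

static int cmp_rec_ptr(const void *a, const void *b)
{
    /* a and b point at the pointers, not at the records themselves */
    const struct rec *ra = *(const struct rec * const *)a;
    const struct rec *rb = *(const struct rec * const *)b;
    return strcmp(ra->name, rb->name);
}

/* Build a pointer array over n records and sort the pointers.
   recs[] is left untouched; byname[] gives the sorted order. */
void sort_by_pointer(struct rec recs[], struct rec *byname[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        byname[i] = &recs[i];
    qsort(byname, n, sizeof byname[0], cmp_rec_ptr);
}
```

Writing the records out in sorted order is then one sequential pass over byname[].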



The quantity y is trickier, because it is the number of input
files you will have open at once when you're merging. DOS allows
lots of file handles, but individual programs are often restricted
to 20 unless they use tricks that I haven't attempted yet. (You
think that's rough - think of the tape drive situation: the number
of open files is limited by the number of tape drives...) This
number will give you real portability problems if you're looking
for the fastest possible sort.







Have I convinced you to go find a book yet?



Cheers



--- PPoint 1.88


* Origin: Silicon Heaven (3:711/934.16)
SEEN-BY: 711/809 934

SOURCE: echomail via fidonet.ozzmosis.com