TIP: Click on subject to list as thread! ANSI
echo: public_domain
to: Kieran Haughey
from: Lewin Edwards
date: 1996-01-29 22:06:44
subject: Sorting

KH> A 1Gig hard drive on a 386DX16.. not on your nelly I will :).. that's like

I've got two 1Gb drives running on an i386SX-25 in one corner of my network.

 KH> Anyway, I am planning on getting a new computer anyway, so I don't really

Funny how -every- programmer -always- says this. Catch a programmer walking
out of a computer shop with a new friend under his arm and ask him about
it, and he'll say "It's a ...[blah]... yeah, just, like a fill-in
machine really, I'm buying a better one soon, like y'know man".
(Translation courtesy of UCLA Phreno-Linguistics and Surfboard Repair
Center).

 LE>> First, read the first letter of every word. Create a temporary file for
 LE>> each start-letter you find. (I mean, create a temporary file for each
 LE>> starting letter, but don't create a temporary file if there are no
 KH> Damned good idea there, the only problem is that to be safe you could only
 KH> keep 10 files open..  and you'd have to keep opening
and closing

DOS ??? Ho, ho, ho. How quaint. But there are good counter arguments, anyway :

1. Use a disk cache with write-behind caching.
2. Grow the JFT manually or use various DOS foolery/kludgery/insanity to
increase the number of available files.
3. Use some other OS (a *nix variant might be best for this sort of thing)
or a timeslicing operating environment on top of DOS (such as Windows)
which allows large numbers of files to remain open.

You can do other cunning things, but put your thinking cap on.

Number each record in the source file, starting at serial number 1. (I'll
assume you've got fewer than 4 billion records to deal with, and can use a
dword for the serial number). Now start to build an ordered (not
necessarily sorted) list of serial numbers.

Your list format is :

[dword] 0 -- that's why you've got to start numbering records at #1 8-)
[ASCIIZ string] sorting bin identifier ("A", "B", ...
or "AAAAA", "AAAAB", or however deep you have to go to
sort the input database).
[dword] serial number  } in no particular order
[dword] serial number  }
[.......]
[dword] 0
[ASCIIZ string]

Go down the file once. For each record, toss its serial number into the
area of the serial number list which it belongs to. So if your records were
:

 1   Fred
 2   James
 3   Albert
 4   Doglet

your list would look like :

dword 0
byte  'A',0
dword 3       ;Albert was the 3rd record in the input list

dword 0       ;end of section/beginning of new section
byte  'D',0
dword 4       ;Doglet was the 4th record in the input list

dword 0       ;end of section/beginning of new section
byte  'F',0
dword 1       ;Fred was the 1st record in the input list

dword 0       ;end of section/beginning of new section
byte  'J',0
dword 2       ;James was the 2nd record in the input list

dword 0       ;use two dwords 0 to mark the end of the list
dword 0

The size of this list will not grow very much each time you add an extra
sort digit, i.e. the list "A".."Z" is not much larger
than the list "AA".."ZZ". Depending on how much RAM is
available, this may let you do the whole thing opening only one file - the
source file - with no virtual memory.

Once you've completed the list to your satisfaction, open the source file
for reading and an output file for writing. Descend your ORDERED serial
number list. Transcribe the records from one file to the other according to
the positions of their serial numbers within the ordered list. Ignore the
tags which were used to identify the start and end points of the
"bins" in your serial file. In the example above, you descend the
list I described above thusly :

* ignore bin header
* seek to record 3 in source file. Copy to output file.
* ignore bin header
* seek to record 4 in source file. Copy to output file.
* ignore bin header
* seek to record 1 in source file. Copy to output file.
* ignore bin header
* seek to record 2 in source file. Copy to output file.
* null header - list ends - job complete.

If you need to sort using multiple passes, you can change the output
filename each time you hit a bin header - since you KNOW that all 'A'
records are in the 'A' bin, you can close and forget about the 'A' file as
soon as the 'B' bin starts. You can then recursively call the sort routine
over the sub-files that were generated during this process - AND you only
ever need at most two files open at once, and only one record in memory at
once. Again, it is best to employ this method only until the sub-file size
is reduced to manageable proportions. For a small list, other sort
algorithms are more efficient. This technique is designed specifically for
the situation where you don't have enough physical or virtual memory to
read the source file, and you need an intelligent way of splitting it into
useful chunks.

This method also only works easily if the record size is fixed, though it
is modifiable to work on variable-length records quite easily, it doubles
the list size.

I used a refinement of the technique above [yes, it was very complex 8-)]
in a comp sci project in '91 or so. We were developing in INTERPRETED
Macintosh Pascal on 8MHz 68000 machines with 1Mb RAM, and the interpreter
was running over a LAN (!!!! Talk about nightmares). We had to sort about
4,000 records, each just under 1K in size. I cut the sort time from 90-odd
minutes (simple bubblesort of the records themselves) to just over 4
minutes. Totally wasted on the brain dead instructors of course, I would
still have got full marks if I'd done the bubblesort, but it was a cool
program and a reusable idea 8-)

Basically, the idea is to apply a bit of lateral thinking - you are NOT
looking for a way to sort the list, you are looking for a way to deduce the
order into which the records would be placed if they were sorted according
to your desired criteria. By manipulating a list of tokens rather than the
records themselves, you drastically reduce memory and CPU gas-guzzling.

-- Lewin A.R.W. Edwards [Team OS/2]  Tel 0419320415 * 0412809805 * 0414927056

--- GoldED 2.41+
* Origin: ZWSBBS +61-3-98276881/+61-3-98276277/Sysop Private Node (3:634/396)
SEEN-BY: 50/99 632/348 998 633/371 634/384 396 635/402 503 544 639/252
SEEN-BY: 640/230 690/718 711/401 410 413 430 808 809 934 713/888 800/1
SEEN-BY: 7877/2809
@PATH: 634/396 384 635/503 50/99 711/808 809 934

SOURCE: echomail via fidonet.ozzmosis.com

Email questions or comments to sysop@ipingthereforeiam.com
All parts of this website painstakingly hand-crafted in the U.S.A.!
IPTIA BBS/MUD/Terminal/Game Server List, © 2025 IPTIA Consulting™.