BW> FB> Perhaps you missed the original messages, but I'm
BW> FB> reading records from largeish files (~130 megabytes).
BW> FB> Each record consists of a char array and some other
BW> FB> stuff, and I need to determine how many times each
BW> FB> unique char array has been encountered and then
BW> FB> print a list of the unique char arrays in order of
BW> FB> descending frequency.
BW> OK, now I remember.
BW> It was finding the duplicates that was killing you.
BW> FB> Folks in here suggested a hash table, in that it would
BW> FB> completely eliminate all that searching.
BW> Yes, I think I was the first to suggest it.
BW> I just forgot the details.
I think that sorting an array of indices would do it.
You start with a simple struct
typedef struct {
unsigned long file_index;
int hits;
} Hit_Counter;
Then you allocate space for 40,000 or so structs.
Go through the file, one record at a time, using an
insertion sort, going from the middle of the index array and
reading the data to be compared from the source file. The
number of compares increases by one for every power of two
in the number of sorted indices. If a match is found, then
you just increase the number of hits in the index struct.
Otherwise, the new struct is inserted into the array, with
its hit counter set to 1. After all elements are processed,
the index array may be sorted by how many hits are in each.
Then the file may be again accessed, to retrieve the char
arrays, as the list is output to any chosen stream.
I think that would do it rather nicely, and the index could
saved, if desired, against future need.
It should be fairly quick, and do what is required.
If we assume a maximum of 64 bits for long and int alike, and
a minimum of 32 bits for a long and 16 bits for an int, then,
assuming that 40,000 unique char arrays are more that
required, ( you mentioned less than 30,000 initially ), we
can envision an array of structs requiring anywhere from
240,000 bytes in a typical DOS application to 640,000 bytes
in a 64-bit environment. This seems reasonable, so the final
sort, by number of repetitions, could take place entirely in
memory, saving a great deal of time.
And Soylent Green is STILL people...
> ] So far, nobody has ever wanted to ride the Unicorn twice....
---
---------------
* Origin: *YOPS ]I[* 8.4 GIG * RA/FD/FE * Milwaukee, WI (1:154/750)
|