TIP: Click on subject to list as thread! ANSI
echo: c_echo
to: FRANK BLACK
from: KURT KUZBA
date: 1998-05-26 00:19:00
subject: hash tables

BW> FB>   Perhaps you missed the original messages, but I'm
BW> FB>   reading records from largeish files (~130 megabytes).
BW> FB>   Each record consists of a char array and some other
BW> FB>   stuff, and I need to determine how many times each
BW> FB>   unique char array has been encountered and then
BW> FB>   print a list of the unique char arrays in order of
BW> FB>   descending frequency.
BW>   OK, now I remember.
BW>   It was finding the duplicates that was killing you.
BW> FB>   Folks in here suggested a hash table, in that it would
BW> FB>   completely eliminate all that searching.
BW>   Yes, I think I was the first to suggest it.
BW>   I just forgot the details.
   I think that sorting an array of indices would do it.
   You start with a simple struct
typedef struct {
   unsigned long file_index;
   int hits;
}  Hit_Counter;
   Then you allocate space for 40,000 or so structs.
   Go through the file, one record at a time, using an
   insertion sort, going from the middle of the index array and
   reading the data to be compared from the source file. The
   number of compares increases by one for every power of two
   in the number of sorted indices. If a match is found, then
   you just increase the number of hits in the index struct.
   Otherwise, the new struct is inserted into the array, with
   its hit counter set to 1. After all elements are processed,
   the index array may be sorted by how many hits are in each.
   Then the file may be again accessed, to retrieve the char
   arrays, as the list is output to any chosen stream.
   I think that would do it rather nicely, and the index could
   saved, if desired, against future need.
   It should be fairly quick, and do what is required.
   If we assume a maximum of 64 bits for long and int alike, and
   a minimum of 32 bits for a long and 16 bits for an int, then,
   assuming that 40,000 unique char arrays are more that
   required, ( you mentioned less than 30,000 initially ), we
   can envision an array of structs requiring anywhere from
   240,000 bytes in a typical DOS application to 640,000 bytes
   in a 64-bit environment. This seems reasonable, so the final
   sort, by number of repetitions, could take place entirely in
   memory, saving a great deal of time.
   And Soylent Green is STILL people...   
> ] So far, nobody has ever wanted to ride the Unicorn twice....
---
---------------
* Origin: *YOPS ]I[* 8.4 GIG * RA/FD/FE * Milwaukee, WI (1:154/750)

SOURCE: echomail via exec-pc

Email questions or comments to sysop@ipingthereforeiam.com
All parts of this website painstakingly hand-crafted in the U.S.A.!
IPTIA BBS/MUD/Terminal/Game Server List, © 2025 IPTIA Consulting™.