Here's something on hashing that (as I recall) was put together by
Jamshid Koshrangi (aka Quinn Tyler Jackson) and posted in the QUIK_BAS
echo a while ago. It's about 4 messages long and it should help:
------------------------------------------------------------
Q3.0 Advanced Topics -- "Hashing in QuickBASIC"
Q3.1 That's pretty fast! I was so used to doing a sequential search on
an unsorted list. Now that I have the QuickSort and the BiSearch
routines, I can use them as a pair for faster list searches. The
thing is, as soon as I want to add something to the list, it
puts everything out of order by only one entry, and that hardly
seems worth sorting all over again, even with something as fast
as Cornel Huth's iterative QuickSort algorithm. Are there any
alternatives to this way of doing things? I've heard talk of
something called 'hashing' but I don't have any idea of what that
is all about. How would I use hashing to avoid having to either
re-sort the list or use a slow insertion algorithm? Insertion is
horrendously slow with disk files.
A3.1 Hashing is a very efficient method of record access, be it in RAM
or in a disk file. Basically, hashed arrays or data files
can be quickly searched for a given item by a key index. Whenever
you have to add an item to the list, you can do so at lightning speed,
and since hashing "sorts" the array on-the-fly, as it were, there is
no need to push records around to add new items to a hashed table.
The first concept you must understand with hashing is the key index.
Every data structure you design with hashing in mind has to have
one field that is unique. This is a prerequisite that you just can't
get around. Of course, you could actually combine several fields
to generate this unique key, which effectively serves the same
purpose. A good application of this is a Fidonet nodelist that uses
the node address as the hashing key. No two alike in theory.
But just how does this key work? First of all, let's take a look
at the Fidonet example. Every full Fidonet address is unique to
one node. Assume that the full nodelist has about 15000 entries.
Okay, if you want a hashing table to hold 15000 unique entries, then
research has shown that the table should be at least 30% greater
than the number of entries in it. That would make 19500 table
entries. This means that 4500 entries in the table will be left
empty for best hashing results.
Now, another problem comes up. How does the key come into play?
Well, let's look at a simple key: 1153999. Since the table is only
19500 entries long, we certainly can't just put this in record
1153999. Hashing involves dividing the key by the table size, taking
the remainder, and using that as the record number:
             59  R 3499
        ---------------
19500 ) 1153999
Okay, 3499 is the record number in which we would put the data.
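In QuickBASIC terms, that remainder is just the MOD operator (the
variable names here are only for illustration):

nodeKey& = 1153999                 ' the unique key
tblSize& = 19500                   ' the hashing table size
bucket& = nodeKey& MOD tblSize&    ' remainder of the division
PRINT bucket&                      ' prints 3499
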
This is the basic idea behind hashing. There is a problem, however.
A collision occurs whenever another node address, when divided by
19500, also has a remainder of 3499. That 'bucket' is already full!
So, what to do? Generate another bucket number, see if that bucket
is full, and if it is, keep generating new bucket numbers until we
find an empty bucket.
To find an item in a hashed table, we get its key, divide by the
table size, and look at the bucket that is represented by the
remainder. If that isn't the one, we generate the next bucket
address, until we arrive at an empty bucket. If we encounter
the correct key BEFORE we arrive at an empty bucket, then we've
found our entry. If we arrive at an empty bucket, the record is
not in the table. And there you have hashing.
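Here is a bare-bones sketch of both operations in QuickBASIC, using
simple linear probing (just step to the next bucket) to resolve
collisions. The names, the table size, and the use of key 0 to mark
an empty bucket are assumptions for illustration, not code from these
articles:

DECLARE SUB HashInsert (k AS LONG, dat AS STRING)
DECLARE FUNCTION HashFind& (k AS LONG)

CONST TBLSIZE = 19507                      ' a prime of the form 4K+3
DIM SHARED tblKey(TBLSIZE - 1) AS LONG     ' 0 marks an empty bucket
DIM SHARED tblData(TBLSIZE - 1) AS STRING

' Store a record under a unique, non-zero numeric key.
SUB HashInsert (k AS LONG, dat AS STRING)
    bucket& = k MOD TBLSIZE
    DO WHILE tblKey(bucket&) <> 0             ' collision: bucket taken
        IF tblKey(bucket&) = k THEN EXIT SUB  ' key is already stored
        bucket& = (bucket& + 1) MOD TBLSIZE   ' try the next bucket
    LOOP                                      ' (table must never fill up)
    tblKey(bucket&) = k
    tblData(bucket&) = dat
END SUB

' Return the bucket holding key k, or -1 if k is not in the table.
FUNCTION HashFind& (k AS LONG)
    bucket& = k MOD TBLSIZE
    DO WHILE tblKey(bucket&) <> 0             ' empty bucket ends the hunt
        IF tblKey(bucket&) = k THEN
            HashFind& = bucket&
            EXIT FUNCTION
        END IF
        bucket& = (bucket& + 1) MOD TBLSIZE
    LOOP
    HashFind& = -1
END FUNCTION

With something like that in place, HashInsert 1153999, "node data"
files the record, and HashFind&(1153999) comes straight back with the
bucket it landed in.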
A well designed hashing table will yield this number of collisions
per insertion or search:
T1.0 Hashing Collision Table
   TABLE FULLNESS          COLLISIONS
   ==================================
        50%                    2.0
        60%                    2.5
        70%                    3.3
        90%                   10.0
=======>8 TABLE 1.0 ENDS HERE 8<=========
That shows better results than even a binary search once the list
gets large: a binary search of 30,000 sorted items takes about 15
probes, while even a 90% full hashing table averages only 10
collisions!
Research has shown that the most efficient hashing tables, that is,
the ones with the fewest collisions, have a prime number of entries.
A table size of 1019 should produce fewer collisions than one of
1000. Research has also shown that if the prime is of the form 4K+3,
where K is any positive integer, then collisions are reduced even
further. 1019 also meets this second requirement. But since a table
twice the size of the maximum number of entries it will ever hold is
inefficient, the 4K+3 criterion should be abandoned at a certain
point in favor of any prime number. Since most of us aren't idiot
savants who can just come up with that number to suit our needs, here
is a FUNCTION, written by Charles Graham, that accepts the maximum
number of entries a table will have and returns the proper type of
prime number to be used as a hashing table size:
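(Graham's listing isn't reproduced in this excerpt, so what follows is
only a rough sketch of the idea, not his code; the 30% head room and
the twice-the-size cut-off come from the paragraph above, and the
names are made up.)

DECLARE FUNCTION HashTblSize& (maxEntries AS LONG)
DECLARE FUNCTION IsPrime% (n AS LONG)

' Given the most entries the table will ever hold, return a table
' size at least 30% bigger, preferring a prime of the form 4K+3 but
' settling for any prime rather than pass twice maxEntries.
FUNCTION HashTblSize& (maxEntries AS LONG)
    n& = INT(maxEntries * 1.3) + 1
    DO
        IF IsPrime%(n&) THEN
            IF (n& MOD 4) = 3 OR n& > maxEntries * 2 THEN
                HashTblSize& = n&
                EXIT FUNCTION
            END IF
        END IF
        n& = n& + 1
    LOOP
END FUNCTION

FUNCTION IsPrime% (n AS LONG)
    IsPrime% = 0
    IF n < 2 THEN EXIT FUNCTION
    IF n MOD 2 = 0 THEN
        IsPrime% = (n = 2)                    ' 2 is the only even prime
        EXIT FUNCTION
    END IF
    FOR d& = 3 TO INT(SQR(n)) STEP 2
        IF n MOD d& = 0 THEN EXIT FUNCTION
    NEXT d&
    IsPrime% = -1                             ' -1 is "true" in BASIC
END FUNCTION

For example, HashTblSize&(15000) should settle on 19507, skipping
19501, which is prime but of the form 4K+1.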
Q3.2 How do I know when to use sequential searches, when to use
binary searches, and when to use hashing? Are there any sort
of guidelines?
A3.2 Well, first let's consider where hashing is in its prime. (You'll
pardon that one, okay?) It is best suited to dynamic list
generation where items need to be added on a regular basis, but
not deleted, since deletion is fairly difficult to implement on
a hashed list. The main strength of a hashing system is its
ability to quickly insert new items into the table in such a
manner that they can be located quickly "on-the-fly." (See T1.0
for the average number of collisions before locating the correct
entry.) Since the collisions increase with the ratio of full
buckets to empty buckets, and not with the size of the actual
table involved, hashing is more efficient than even binary
searches when lists start to become huge. Also, because the
binary method of searching demands a sorted list, insertion of
items at a later time becomes very cumbersome, even with such
techniques as the QuickSort and pushing all entries after the
insertion up by one. (Try that technique on a list of 30,000
items, when you only want to add two new items that land near
the beginning of the list, and you'll know what disk wear and
tear is all about!)
Typical applications of the hashing algorithm involve word
distribution counts, dictionary table generators that involve
dictionaries that will be added to dynamically, and things of
that nature.
Consider the word distribution count problem. Each word is a
unique key, and so is perfect for hashing. Sequential methods
only work well up until the table has so many entries in it that
looking up entries in the table becomes a real effort. Remember,
words already in the list do not need to be added twice. Binary
methods allow for quick searching, but each case of a new word
being added to the list requires a re-sort or a cumbersome insertion.
This takes time if the text file is of even average length.
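As a sketch of how that might look in practice (the character-folding
step, the 1019 table size, and the file name are all just assumptions,
not code from these articles), a word-count table in QuickBASIC could
be as simple as this:

DECLARE SUB CountWord (w AS STRING)

CONST TBLSIZE = 1019                        ' a small 4K+3 prime
DIM SHARED wordTbl(TBLSIZE - 1) AS STRING   ' "" marks an empty bucket
DIM SHARED wordCnt(TBLSIZE - 1) AS LONG

OPEN "SAMPLE.TXT" FOR INPUT AS #1           ' file name is made up
DO WHILE NOT EOF(1)
    LINE INPUT #1, txt$
    txt$ = txt$ + " "
    DO                                      ' peel off one word at a time
        p% = INSTR(txt$, " ")
        IF p% = 0 THEN EXIT DO
        w$ = LEFT$(txt$, p% - 1)
        txt$ = MID$(txt$, p% + 1)
        IF LEN(w$) > 0 THEN CountWord UCASE$(w$)
    LOOP
LOOP
CLOSE #1

SUB CountWord (w AS STRING)
    h& = 0                                  ' fold the characters into a number
    FOR i% = 1 TO LEN(w)
        h& = (h& * 31 + ASC(MID$(w, i%, 1))) MOD TBLSIZE
    NEXT i%
    DO WHILE wordTbl(h&) <> ""              ' probe until a hit or an empty bucket
        IF wordTbl(h&) = w THEN
            wordCnt(h&) = wordCnt(h&) + 1   ' already listed: just count it
            EXIT SUB
        END IF
        h& = (h& + 1) MOD TBLSIZE
    LOOP
    wordTbl(h&) = w                         ' empty bucket: a brand new word
    wordCnt(h&) = 1
END SUB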
>>> Continued to next message