TIP: Click on subject to list as thread! ANSI
echo: power_bas
to: GARY PRICE
from: BRIAN MCLAUGHLIN
date: 1995-10-26 08:27:00
subject: Hashing in PowerBASIC 1/4

Here's something on hashing that (as I recall) was put together by
Jamshid Koshrangi (aka Quinn Tyler jackson) and posted in the QUIK_BAS
echo a while ago. It about 4 messages long and it should help:
   ------------------------------------------------------------
Q3.0    Advanced Topics -- "Hashing in QuickBASIC"
Q3.1    That's pretty fast!  I was so used to doing a sequential search on
        an unsorted list.  Now that I have the QuickSort and the BiSearch
        routines, I can use them as a pair for faster list searches.  The
        thing is, as soon as I want to add something to the list, it
        puts everything out of order by only one entry, and that hardly
        seems worth sorting all over again, even with something as fast
        as Cornel Huth's iterative QuickSort algorithm.  Are there any
        alternatives to this way of doing things?  I've heard talk of
        something called 'hashing' but I don't have any idea of what that
        is all about.  How would I use hashing to avoid having to either
        resort the list, or use a slow insertion algorithm?  Insertion is
        horrendously slow with disk files.
A3.1    Hashing is a very efficient method of record access, be it in RAM
        or be it with a disk file.  Basically, hashed arrays or data files
        can be quickly searched for a given item by a key index.  Whenever
        you have to add an item to the list, you can at lightening speed,
        and since hashing "sorts" the array on-the-fly, as it were, there is
        no need to push records around to add new items to a hashed record.
        The first concept you must understand with hashing is the key index.
        Every data structure you design with hashing in mind has to have
        one field that is unique.  This is a prerequisite that you just can't
        get around.  Of course, you could actually combine several fields
        to generate this unique key, which effectively serves the same
        purpose.  A good application of this is a Fidonet nodelist that uses
        the node address as the hashing key.  No two alike in theory.
        But just how does this key work?  First of all, let's take a look
        at the Fidonet example.  Every full Fidonet address is unique to
        one node.  Assume that the full nodelist has about 15000 entries.
        Okay, if you want a hashing table to hold 15000 unique entries, then
        research has shown that the table should be at least 30% greater
        than the number of entries in it.  That would make 19500 table
        entries.  This means that 4500 entries in the list will be left
        empty for best hashing results.
        Now, another problem comes up.  How does the key come into play?
        Well, let's look at a simple key: 1153999.  Since the list is 19500
        long, we certainly can't just put this in record 1153999.  Hashing
        involves dividing the key by the table size and taking the remainder
        and using that as the record number:
                           59
                    ----------  R 3499
               19500) 1153999
        Okay, 3499 is the record number in which we would put the data.
        This is the basic idea behind hashing.  There is a trouble, however.
        Collision occurs whenever a node address, when divided by 19500 has
        a remainder of 3499.  That 'bucket' is already full!  So, what to
        do?  Generate another bucket number, see if that bucket is full,
        and if it is, keep generating new buckets until we find an empty
        bucket.
        To find an item in a hashed table, we get its key, divide by the
        table size, and look at the bucket that is represented by the
        remainder.  If that isn't the one, we generate the next bucket
        address, until we arrive at an empty bucket.  If we encounter
        the correct key BEFORE we arrive at an empty bucket, then we've
        found our entry.  If we arrive at an empty bucket, the record is
        not in the table.  And there you have hashing.
        A well designed hashing table will yield this number of collisions
        per insertion or search:
T1.0    Hashing Collision Table
        TABLE FULLNESS          COLLISIONS
        ==================================
             50%                   2.0
             60%                   2.5
             70%                   3.3
             90%                  10.0
=======>8 TABLE 1.0 ENDS HERE 8<=========
        That shows better results than even the binary search, with large
        lists!
        Research has shown that the most efficient hashing tables, that is,
        the ones with the least number of collisions, have a prime
        number of entries.  A table size of 1019 should produce less
        collisions than one of 1000.  Research has also shown that if the
        prime is of the form 4K+3, where K is any positive integer, then
        collisions are reduced even further.  1019 also meets this second
        requirement.  But, since a table size twice the size of the maximum
        number of entries it will ever hold is inefficient, the 4K+3
        criterion should be abandoned at a certain point in favor of any
        prime number.  Since most of us aren't idiot savants who can just
        come up with that number to suit our needs, here is a FUNCTION,
        written by Charles Graham, that accepts the maximum number of
        entries a table will have, and returns the proper type of prime
        number, to be used as a hashing table size:
Q3.1    How do I know when to use sequential searches, when to use
        binary searches, and when to use hashing?  Are there any sort
        of guidelines?
A3.1    Well, first let's consider where hashing is in its prime.  (You'll
        pardon that one, okay?)  It is best suited to dynamic list
        generation where items need to be added on a regular basis, but
        not deleted, since deletion is fairly difficult to implement on
        a hashed list.  The main strength of a hashing system is its
        ability to quickly insert new items into the table in such a
        manner that they can be located quickly "on-the-fly."   (See T1.0
        for the average number of collisions before locating the correct
        entry.)  Since the collisions increase with the ratio of full
        buckets to empty buckets, and not with the size of the actual
        table involved, hashing is more efficient than even binary
        searches when lists start to become huge.  Also, because the
        binary method of searching demands a sorted list, insertion of
        items at a later time becomes very cumbersome, even with such
        techniques as the QuickSort and pushing all entries after the
        insertion up by one.  (Try that technique on a list of 30,000
        items, when you only want to add two new items that land near
        the beginning of the list, and you'll know what disk wear and
        tear is all about!)
        Typical applications of the hashing algorithm involve word
        distribution counts, dictionary table generators that involve
        dictionaries that will be added to dynamically, and things of
        that nature.
        Consider the word distribution count problem.  Each word is a
        unique key, and so is perfect for hashing.  Sequential methods
        only work well up until the table has so many entries in it that
        looking up entries in the table becomes a real effort.  Remember,
        words already in the list do not need to be added twice.  Binary
        methods allow for quick searching, but each case of a new word
        being added to the list requires a sort or cumbersome insertion.
        This takes time, if a text file is of even average length.
>>> Continued to next message
 * SLMR 2.1a * MAXLIB For PB v1.2 - Access arrays and files in EMS/XMS!
--- WILDMAIL!/WC v4.12 
---------------
* Origin: Com-Dat BBS - Hillsboro, OR. HST DS (1:105/314.0)

SOURCE: echomail via exec-pc

Email questions or comments to sysop@ipingthereforeiam.com
All parts of this website painstakingly hand-crafted in the U.S.A.!
IPTIA BBS/MUD/Terminal/Game Server List, © 2025 IPTIA Consulting™.