echo: pascal
to: DAVID CHORD
from: JOHAN ZWIEKHORST
date: 1998-04-13 12:13:00
subject: Detecting archiver type

Hi David!
In your message to Jasen Betts, dated , you 
wrote:
 >DC: DOS     Extension       PAK
 >DC: DOS     Ident           -2,fe
 >DC: The '-2' shows that the *last* byte in the file is FE, which is the
 >DC: way to determine archives of this type.
 >DC: Not the best way to end a file I guess, one char - what happens if a
 >DC: .ZIP file ends with that as well?
This is how GUS (General Unpack Shell) identifies archives (excerpt from 
GUS.DOC):
=== Cut ===
      6. HOW GUS IDENTIFIES ARCHIVES
      GUS recognizes archives by searching for well-defined patterns in the
      archive file. Such a pattern can be from 1 to 7 bytes in length and
      it is extremely important that they be checked in the PROPER ORDER!
      That is what distinguishes GUS from all its competitors: most
      programs do search for the right patterns (with the exception of the
      pattern for ZOO, which is almost always wrong), but don't do this in
      the proper order. That can result in faulty identifications,
      specifically when encountering nested archives (archives within
      archives).
      6.1. Recognition patterns as used by GUS
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ArcType Offset  Pattern         Comment
      =--=--- =--=--  =--=--=--=--    =--=--=--=--=--=--=--=--=--=---
      ARC     0       0x1A
      ARC+    0       0x1A            Method byte (offset 1) of all
      PAK                             entries needs to be scanned: if
      HYP                             >= 0x0A then PAK;
                                      >= 0x48 then HYP;
                                      == 0x14 then ARC+
                                      Note: PAK can also be recognized
                                      by locating the byte 0xFE at offset
                                      EOF-2, but GUS doesn't use that
                                      because it is less accurate than
                                      scanning the method bytes, which
                                      has to be done anyway for identi-
                                      fying ARC+ and HYP.
                                      For completeness, the record layout
                                      of an ARC archive will be given in
                                      paragraph 6.2.
      ARJ     0       0x60 0xEA
      HA      0       'HA'            Offset 4 binary ANDed with 0xFC should
                                      yield 0x20. This is an additional check
                                      that GUS performs.
      DWC     -3      'DWC'           Offset -3 means the third byte from
                                      the END of the archive file.
                                      It is possible that some junk is
                                      present at the end of an archive,
                                      because of Xmodem transmissions for
                                      example.
                                      In order to avoid GUS not recognizing
                                      the archive because of this, the last
                                      1028 bytes (or 343 triplets) are read
                                      into a buffer and if that buffer
                                      contains the string 'DWC', then we
                                      have a DWC archive.
                                      An additional check will be done,
                                      however. The `DWC' string will have
                                      to be the last item in a 27 byte
                                      structure of which the first two
                                      items are ArcStrucSize=27 (word size:
                                      2 bytes) and DirStrucSize=34 (byte
                                      size) before GUS will accept the file
                                      to be a DWC archive.
      LZH     2       '-l??-'         The '?' specifies a wildcard
                                      character.
      HAP     0       0x91 '3HF'
      HPK     0       'HPAK'
      UC2     0       'UC2' 0x1A
      ZIP     0       'PK' 0x03 0x04
      ZOO     20      0xDC 0xA7 0xC4 0xFD
                                      Most other programs search for the
                                      string 'ZOO' at the front of the
                                      archive, but that is wrong! Only
                                      the ZOO archives made using Rahul
                                      Dhesi's program would be recognized
                                      this way. ZOO archives made by an
                                      Amiga or a computer running Unix
                                      would not necessarily be recognized
                                      this way.
      SQZ     0       'HLSQZ'
      RAR     0       'Rar!' 0x1A 0x07 0x00
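The table rows above boil down to "compare a few bytes at a fixed offset, with '?' standing for any byte". A minimal sketch of such a comparison follows. It works on an in-memory string instead of a file to stay short, and the name MatchesAt and the sample buffers are my own illustration, not taken from the GUS source. It also assumes no pattern needs a literal '?' byte, which holds for every row in the table.

```pascal
program PatternDemo;
{ Sketch only: MatchesAt and the sample buffers are illustrative names,
  not taken from the GUS source. }

{ Returns True if Pattern occurs in Buf at the 0-based Offset.
  A '?' in Pattern matches any byte, as in the LZH row of the table. }
function MatchesAt(const Buf: string; Offset: Integer;
                   const Pattern: string): Boolean;
var
  I: Integer;
begin
  MatchesAt := False;
  { Reject offsets that would read outside the buffer }
  if (Offset < 0) or (Offset + Length(Pattern) > Length(Buf)) then
    Exit;
  { Buf[1] is offset 0, so offset Offset is Buf[Offset + 1] }
  for I := 1 to Length(Pattern) do
    if (Pattern[I] <> '?') and (Buf[Offset + I] <> Pattern[I]) then
      Exit;
  MatchesAt := True;
end;

begin
  { ZIP row: 'PK' 0x03 0x04 at offset 0 }
  WriteLn(MatchesAt('PK'#3#4'rest', 0, 'PK'#3#4));
  { LZH row: '-l??-' at offset 2 }
  WriteLn(MatchesAt('!!-lh5-data', 2, '-l??-'));
  { The same LZH pattern does not match a ZIP header }
  WriteLn(MatchesAt('PK'#3#4'rest', 2, '-l??-'));
end.
```

The negative offset of the DWC row and the extra checks for HA and DWC are left out here; they would need the file size and a few more reads.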
      6.2. Record layout of ARC/ARC+/PAK
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      The record which describes each archive entry:
            .=--=--=--=--=--=--=--=--=--=--=--=--=--=--=--=----.
            |var                                               |
            |  ArcHeader : record                              |
            |                Marker: Byte;                     |
            |                Method: Byte;                     |
            |                Name  : array [1..13] of char;    |
            |                Size  : DWord;                    |
            |                Stamp : DWord;                    |
            |                CRC   : Word;                     |
            |                Length: DWord;                    |
            |              end;                                |
            `=--=--=--=--=--=--=--=--=--=--=--=--=--=--=--=----'
      Procedure to scan all archive entries:
            .=--=--=--=--=--=--=--=--=--=--=--=--=--=--=--=----.
            |begin                                             |
            | seek(F, 0);                                      |
            | Done := false;                                   |
            | YieldARC := ARC;                                 |
            | repeat                                           |
            |   {$I-}                                          |
            |   blockread(F, ArcHeader, sizeof(ArcHeader));    |
            |   {$I+}                                          |
            |   if IOresult = 0                                |
            |    then begin                                    |
            |          if ArcHeader.Method >= PAKid            |
            |           then begin                             |
            |                 Done := true;                    |
            |                 YieldARC := PAK;                 |
            |                 if ArcHeader.Method >= HYPid     |
            |                  then YieldARC := HYP            |
            |                  else if ArcHeader.Method = ARPid|
            |                        then YieldARC := ARp      |
            |                end                               |
            |           else MoveFilePtr(F, ArcHeader.Size);   |
            |         end                                      |
            |    else Done := true                             |
            | until Done                                       |
            |end;                                              |
            `=--=--=--=--=--=--=--=--=--=--=--=--=--=--=--=----'
      This is of course all in Turbo Pascal, the language in which GUS was
      written. The above are in fact literal excerpts from GUS's source
      code.
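The scan procedure calls MoveFilePtr, which the excerpt does not show. A minimal sketch of what such a helper could look like is given below, assuming it simply skips forward relative to the current position in an untyped file opened with a record size of 1. The record is repeated with LongInt in place of DWord and marked packed so that its size comes out as 29 bytes under Free Pascal as well; the type name TArcHeader and the temporary file name are my own.

```pascal
program ArcSkipDemo;
{ Sketch only. MoveFilePtr is called in the excerpt but not shown there;
  the body below is an assumption about what such a helper would do. }

type
  TArcHeader = packed record
    Marker: Byte;                    { 0x1A, see the pattern table }
    Method: Byte;                    { the byte tested against PAKid etc. }
    Name  : array [1..13] of Char;
    Size  : LongInt;                 { passed to MoveFilePtr in the scan loop }
    Stamp : LongInt;
    CRC   : Word;
    Length: LongInt;
  end;

{ Advance the file position by Count bytes relative to where it is now.
  F must have been opened with a record size of 1, e.g. Reset(F, 1). }
procedure MoveFilePtr(var F: file; Count: LongInt);
begin
  Seek(F, FilePos(F) + Count);
end;

var
  F  : file;
  Buf: array [0..9] of Byte;
begin
  { 1 + 1 + 13 + 4 + 4 + 2 + 4 bytes }
  WriteLn('Header size: ', SizeOf(TArcHeader));

  { Small demonstration on a scratch file of 10 bytes }
  FillChar(Buf, SizeOf(Buf), 0);
  Assign(F, 'movedemo.tmp');
  Rewrite(F, 1);
  BlockWrite(F, Buf, SizeOf(Buf));
  Seek(F, 2);
  MoveFilePtr(F, 5);                 { from position 2 to position 7 }
  WriteLn('Position: ', FilePos(F));
  Close(F);
  Erase(F);
end.
```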
      6.3. How GUS identifies SFX (self-extracting) archives
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      The basic principle is simple. A self-extracting archive consists of an
      extraction program in EXE form followed by the archive itself as
      appended data.
      The header of an EXE file contains information to determine the size of
      the EXE portion of the file and hence the offset where the appended
      data starts.
      This proved to be true for all archive types, except for SFXs made by
      MKSARC, the ZIP/sfx as used in PKLTE115.EXE and the ZIP/sfx for OS/2.
      GUS has those offset values hardcoded.
      Should you encounter other self-extracting archive types which GUS
      doesn't recognize, please let me know. Don't forget to mention,
      however, by which program those self-extractors were made.
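For readers who want to try the basic principle: a sketch of the usual size calculation from a DOS 'MZ' EXE header is shown below. It assumes the standard header layout (word at offset 2 = bytes used in the last 512-byte page, word at offset 4 = number of pages) and does not cover the hardcoded exceptions mentioned above. The function name ExeImageSize is my own, not from the GUS source.

```pascal
program SfxOffsetDemo;
{ Sketch only: ExeImageSize is an illustrative name, not from GUS. It applies
  the usual DOS 'MZ' header rule and ignores the special cases GUS hardcodes. }

{ LastPageBytes is the word at header offset 2, PageCount the word at
  offset 4. Pages are 512 bytes; 0 in LastPageBytes means a full last page.
  The result is the size of the EXE portion, i.e. the offset at which
  appended archive data would begin. }
function ExeImageSize(LastPageBytes, PageCount: Word): LongInt;
begin
  if LastPageBytes = 0 then
    ExeImageSize := LongInt(PageCount) * 512
  else
    ExeImageSize := (LongInt(PageCount) - 1) * 512 + LastPageBytes;
end;

begin
  WriteLn(ExeImageSize(0, 4));       { four full pages }
  WriteLn(ExeImageSize(100, 4));     { three full pages plus 100 bytes }
end.
```

The recognition patterns would then be checked relative to that offset instead of relative to the start of the file.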
      6.4. Mandatory order for scanning recognition patterns
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
              1 - RAR
              2 - SQZ
              3 - ZIP
              4 - HPK
              5 - UC2
              6 - HAP
              7 - ZOO
              8 - LZH
              9 - HA
             10 - ARJ
             11 - DWC
             12 - ARC/ARC+/PAK/HYP
      This order is mandatory because it guarantees the greatest chance
      for a correct recognition.
      Every other order would increase the chance for a faulty result.
      This is also the reason why the archive specifications are still
      built into GUS and not given in a separate configuration file
      (like the one used by Jeffrey Nonken's PolyXarc, for example):
      I still haven't found a good method to have GUS determine auto-
      matically in which order the patterns have to be scanned, if a
      possibility exists that new patterns would be added to the list.
      I can't expect the users to include new patterns in the proper
      order themselves, can I? Therefore, I don't think providing GUS
      with a CFG file is very important at this time. I see no problem
      in providing a new GUS when a new and exciting archiver is
      released.
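To illustrate why the order matters, here is a minimal sketch of a first-match-wins scan over a table kept in the mandatory order. It uses only a subset of the patterns (those with non-negative offsets and no additional checks) and an in-memory string instead of a file; TSig, Sigs, MatchesAt and Identify are names I made up for this example. The first sample buffer starts with 0x1A and carries '-lh5-' at offset 2, so both the ARC and the LZH pattern fit it, and only the order decides the answer.

```pascal
program OrderDemo;
{ Sketch only: the table holds a subset of the patterns, in the mandatory
  order; all identifiers are illustrative and not from the GUS source. }

type
  TSig = record
    Name   : string[4];
    Offset : Integer;
    Pattern: string[7];
  end;

const
  SigCount = 5;
  Sigs: array [1..SigCount] of TSig = (
    (Name: 'RAR'; Offset: 0;  Pattern: 'Rar!'#$1A#7#0),
    (Name: 'ZIP'; Offset: 0;  Pattern: 'PK'#3#4),
    (Name: 'ZOO'; Offset: 20; Pattern: #$DC#$A7#$C4#$FD),
    (Name: 'LZH'; Offset: 2;  Pattern: '-l??-'),
    (Name: 'ARC'; Offset: 0;  Pattern: #$1A)
  );

{ True if Pattern occurs in Buf at the 0-based Offset; '?' is a wildcard. }
function MatchesAt(const Buf: string; Offset: Integer;
                   const Pattern: string): Boolean;
var
  I: Integer;
begin
  MatchesAt := False;
  if (Offset < 0) or (Offset + Length(Pattern) > Length(Buf)) then
    Exit;
  for I := 1 to Length(Pattern) do
    if (Pattern[I] <> '?') and (Buf[Offset + I] <> Pattern[I]) then
      Exit;
  MatchesAt := True;
end;

{ Returns the name of the first table entry that matches, or '' if none. }
function Identify(const Buf: string): string;
var
  I: Integer;
begin
  Identify := '';
  for I := 1 to SigCount do
    if MatchesAt(Buf, Sigs[I].Offset, Sigs[I].Pattern) then
    begin
      Identify := Sigs[I].Name;
      Exit;
    end;
end;

begin
  { Fits both ARC (0x1A at 0) and LZH ('-l??-' at 2); LZH is tried first }
  WriteLn(Identify(#$1A'!-lh5-'));
  { Fits only the ARC pattern }
  WriteLn(Identify(#$1A#2'FILE.TXT'));
end.
```

With ARC moved to the top of the table, the first buffer would be reported as ARC, which is the kind of faulty result the fixed order is meant to prevent.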
      That's it folks! If you're curious: the Borland Pascal source for
      GUS is about 1100 lines in length. Those lines are `filled' in the
      same way as those of the procedure quoted above.
      *** NOTE: you may use the scanning and identification method as
      used by GUS and as described above in your own programs, but
      please be so kind as to include a reference indicating where
      you got the information!
=== Cut ===
That was from GUS 1.95. Since then, I've made a v1.96 which can handle the 
JAR type as well.
 ._|~/_
        johanzw@club.innet.be
        http://www.club.innet.be/~year1355/
--- Maximus-CBCS v3.01
(2:292/100)
---------------
* Origin: Tripod BBS Belgium - bortaS bIr jablu'DI'reH QaQqu' nay'

SOURCE: echomail via exec-pc
