Hi David!
In your message to Jasen Betts, you wrote:
>DC: DOS Extension PAK
>DC: DOS Ident -2,fe
>DC: The '-2' shows that the *last* byte in the file is FE, which is the
>DC: way to determine archives of this type.
>DC: Not the best way to end a file I guess, one char - what happens if a
>DC: .ZIP file ends with that as well?
This is how GUS (General Unpack Shell) identifies archives (excerpt from
GUS.DOC):
=== Cut ===
6. HOW GUS IDENTIFIES ARCHIVES
GUS recognizes archives by searching for well-defined patterns in the
archive file. Such a pattern can be from 1 to 7 bytes in length and
it is extremely important that they be checked in the PROPER ORDER!
That is what distinguishes GUS from all its competitors: most
programs do search for the right patterns (with the exception of the
pattern for ZOO, which is almost always wrong), but don't do this in
the proper order. That can result in faulty identifications,
specifically when encountering nested archives (archives within
archives).
6.1. Recognition patterns as used by GUS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ArcType Offset  Pattern                Comment
------- ------  ---------------------  --------------------------------------
ARC     0       0x1A
ARC+    0       0x1A                   Method byte (offset 1) of all
PAK                                    entries needs to be scanned: if
HYP                                      >= 0x0A then PAK;
                                         >= 0x48 then HYP;
                                         == 0x14 then ARC+
                                       Note: PAK can also be recognized
                                       by locating the byte 0xFE at offset
                                       EOF-2, but GUS doesn't use that
                                       because it is less accurate than
                                       scanning the method bytes, which
                                       has to be done anyway for
                                       identifying ARC+ and HYP.
                                       For completeness, the record layout
                                       of an ARC archive will be given in
                                       paragraph 6.2.
ARJ     0       0x60 0xEA
HA      0       'HA'                   Offset 4 binary ANDed with 0xFC should
                                       yield 0x20. This is an additional check
                                       that GUS performs.
DWC     -3      'DWC'                  Offset -3 means the third LAST byte
                                       of the archive file.
                                       It is possible that some junk is
                                       present at the end of an archive,
                                       because of Xmodem transmissions for
                                       example.
                                       In order to avoid GUS not recognizing
                                       the archive because of this, the last
                                       1028 bytes (or 343 triplets) are read
                                       into a buffer and if that buffer
                                       contains the string 'DWC', then we
                                       have a DWC archive.
                                       An additional check will be done,
                                       however. The `DWC' string will have
                                       to be the last item in a 27 byte
                                       structure of which the first two
                                       items are ArcStrucSize=27 (word size:
                                       2 bytes) and DirStrucSize=34 (byte
                                       size) before GUS will accept the file
                                       to be a DWC archive.
LZH     2       '-l??-'                The '?' specifies a wildcard
                                       character.
HAP     0       0x91 '3HF'
HPK     0       'HPAK'
UC2     0       'UC2' 0x1A
ZIP     0       'PK' 0x03 0x04
ZOO     20      0xDC 0xA7 0xC4 0xFD    Most other programs search for the
                                       string 'ZOO' at the front of the
                                       archive, but that is wrong! Only
                                       the ZOO archives made using Rahul
                                       Dhesi's program would be recognized
                                       this way. ZOO archives made by an
                                       Amiga or a computer running Unix
                                       would not necessarily be recognized
                                       this way.
SQZ     0       'HLSQZ'
RAR     0       'Rar!' 0x1A 0x07 0x00
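
The table above translates fairly directly into code. What follows is
only a rough sketch in Python, not GUS's own Turbo Pascal. It assumes the
whole file is held in memory as a bytes object. The names matches_at,
is_ha and is_dwc_tail are made up for this illustration, None stands for
the '?' wildcard, and "offset 4" in the HA row is read here as "the byte
at offset 4". The DWC function does only the buffer search, not the
additional 27-byte structure check.

```python
def matches_at(data, offset, pattern):
    """Compare pattern against data at offset; None entries match any byte.
    A negative offset counts from the end of the file, as in the table."""
    start = offset if offset >= 0 else len(data) + offset
    if start < 0 or start + len(pattern) > len(data):
        return False
    return all(p is None or data[start + i] == p
               for i, p in enumerate(pattern))

LZH = [ord("-"), ord("l"), None, None, ord("-")]   # '-l??-' at offset 2
ZOO = [0xDC, 0xA7, 0xC4, 0xFD]                     # at offset 20

def is_ha(data):
    # 'HA' at offset 0, plus the extra check on the byte at offset 4
    # (assumption: the table means the single byte at that offset).
    return (len(data) > 4 and matches_at(data, 0, list(b"HA"))
            and data[4] & 0xFC == 0x20)

def is_dwc_tail(data, window=1028):
    # Search the last `window` bytes so trailing junk does not hide 'DWC'.
    return b"DWC" in data[-window:]
```

For example, matches_at(data, 2, LZH) tests the LZH row and
matches_at(data, -3, list(b"DWC")) tests the strict form of the DWC row.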
6.2. Record layout of ARC/ARC+/PAK
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The record which describes each archive entry:
.--------------------------------------------------.
|var                                               |
|  ArcHeader : record                              |
|    Marker: Byte;                                 |
|    Method: Byte;                                 |
|    Name  : array [1..13] of char;                |
|    Size  : DWord;                                |
|    Stamp : DWord;                                |
|    CRC   : Word;                                 |
|    Length: DWord;                                |
|  end;                                            |
`--------------------------------------------------'
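
For readers not using Pascal, the same record can be unpacked with
Python's struct module. This is a sketch under a few assumptions that the
text does not state explicitly: the fields are little-endian (as on DOS),
the record has no padding between fields, and the name is NUL-terminated.
The names ARC_HEADER and parse_arc_header are invented for the example.

```python
import struct

# Marker, Method, Name[13], Size, Stamp, CRC, Length.
# "<" selects little-endian with no padding: 1+1+13+4+4+2+4 = 29 bytes.
ARC_HEADER = struct.Struct("<BB13sIIHI")

def parse_arc_header(raw):
    """Unpack one 29-byte entry header into a dict of its fields."""
    marker, method, name, size, stamp, crc, length = ARC_HEADER.unpack(raw)
    return {
        "marker": marker,
        "method": method,
        # Assumption: the name is NUL-terminated inside its 13 bytes.
        "name": name.split(b"\0", 1)[0].decode("ascii", "replace"),
        "size": size,
        "stamp": stamp,
        "crc": crc,
        "length": length,
    }
```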
Procedure to scan all archive entries:
.--------------------------------------------------.
|begin                                             |
|  seek(F, 0);                                     |
|  Done := false;                                  |
|  YieldARC := ARC;                                |
|  repeat                                          |
|    {$I-}                                         |
|    blockread(F, ArcHeader, sizeof(ArcHeader));   |
|    {$I+}                                         |
|    if IOresult = 0                               |
|    then begin                                    |
|      if ArcHeader.Method >= PAKid                |
|      then begin                                  |
|        Done := true;                             |
|        YieldARC := PAK;                          |
|        if ArcHeader.Method >= HYPid              |
|        then YieldARC := HYP                      |
|        else if ArcHeader.Method = ARPid          |
|             then YieldARC := ARp                 |
|      end                                         |
|      else MoveFilePtr(F, ArcHeader.Size);        |
|    end                                           |
|    else Done := true                             |
|  until Done                                      |
|end;                                              |
`--------------------------------------------------'
This is of course all in Turbo Pascal, the language in which GUS was
written. The above are in fact literal excerpts from GUS's source
code.
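
The scanning loop can be restated in Python roughly as follows. This is
a loose translation for illustration, not GUS's code. It assumes the
29-byte little-endian header layout of the record above, takes the three
method thresholds from the table in 6.1, and treats a short read as the
end of the archive (where the Pascal version relies on IOresult). The
names classify_arc, PAK_ID, ARP_ID and HYP_ID are made up here.

```python
import io
import struct

HEADER = struct.Struct("<BB13sIIHI")        # same layout as the record above
PAK_ID, ARP_ID, HYP_ID = 0x0A, 0x14, 0x48   # thresholds from the table in 6.1

def classify_arc(f):
    """Walk the entry headers of a binary file object and name the type."""
    f.seek(0)
    result = "ARC"
    while True:
        raw = f.read(HEADER.size)
        if len(raw) < HEADER.size:          # short read: end of archive
            break
        _marker, method, _name, size, _stamp, _crc, _length = HEADER.unpack(raw)
        if method >= PAK_ID:
            result = "PAK"
            if method >= HYP_ID:
                result = "HYP"
            elif method == ARP_ID:
                result = "ARC+"
            break
        f.seek(size, io.SEEK_CUR)           # skip this entry's packed data
    return result
```

The function expects a file opened in binary mode, for example
classify_arc(open(path, "rb")), where path is whatever file is under test.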
6.3. How GUS identifies SFX (self-extracting) archives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The basic principle is simple. A self-extracting archive consists of an
extraction program in EXE form followed by the archive itself as
appended data.
The header of an EXE file contains information to determine the size of
the EXE portion of the file and hence the offset where the appended data
starts.
This proved to be true for all archive types, except for SFXs made by
MKSARC, the ZIP/sfx as used in PKLTE115.EXE and the ZIP/sfx for OS/2.
GUS has those offset values hardcoded.
Should you encounter other self-extracting archive types which GUS
doesn't recognize, please let me know. Don't forget to mention,
however, by which program those self-extractors were made.
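
The text does not say which header fields GUS reads, so the following
Python sketch is an assumption: it uses the usual DOS "MZ" header
calculation, where the word at offset 2 holds the number of bytes used in
the last 512-byte page and the word at offset 4 holds the page count. The
name exe_image_size is invented for the example. It does not cover the
hardcoded exceptions mentioned above.

```python
import struct

def exe_image_size(header):
    """Return the size of the EXE portion, or None if not an MZ executable.

    The result is also the offset at which appended archive data would
    start in a self-extracting file.
    """
    if len(header) < 6 or header[:2] not in (b"MZ", b"ZM"):
        return None
    last_page_bytes, pages = struct.unpack_from("<HH", header, 2)
    if last_page_bytes == 0:
        return pages * 512                  # last page is completely used
    return (pages - 1) * 512 + last_page_bytes
```

Once the offset is known, the recognition patterns of 6.1 can be applied
to the data starting at that offset instead of at the start of the file.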
6.4. Mandatory order for scanning recognition patterns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 - RAR
2 - SQZ
3 - ZIP
4 - HPK
5 - UC2
6 - HAP
7 - ZOO
8 - LZH
9 - HA
10 - ARJ
11 - DWC
12 - ARC/ARC+/PAK/HYP
This order is mandatory because it guarantees the greatest chance
for a correct recognition.
Every other order would increase the chance for a faulty result.
This is also the reason why the archive specifications are still
built into GUS and not given in a separate configuration file
(like the one used by Jeffrey Nonken's PolyXarc, for example):
I still haven't found a good method to have GUS determine auto-
matically in which order the patterns have to be scanned, if a
possibility exists that new patterns would be added to the list.
I can't expect the users to include new patterns in the proper
order themselves, can I? Therefore, I don't think providing GUS
with a CFG file is very important at this time. I see no problem
for providing a new GUS when a new and exciting archiver is
released.
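
One way to picture the fixed order is a list that is walked from top to
bottom, so that the first match wins. The Python sketch below is only an
illustration of that idea and not how GUS is structured internally. It
lists just the simple head-of-file signatures; DWC and the ARC family are
left out because they need the extra logic described in 6.1, and the
additional HA check is omitted too. SIGNATURES and identify are made-up
names, and None stands for the '?' wildcard.

```python
# (name, offset, pattern) in the mandatory order of 6.4.
SIGNATURES = [
    ("RAR", 0, list(b"Rar!\x1a\x07\x00")),
    ("SQZ", 0, list(b"HLSQZ")),
    ("ZIP", 0, list(b"PK\x03\x04")),
    ("HPK", 0, list(b"HPAK")),
    ("UC2", 0, list(b"UC2\x1a")),
    ("HAP", 0, list(b"\x91" + b"3HF")),
    ("ZOO", 20, [0xDC, 0xA7, 0xC4, 0xFD]),
    ("LZH", 2, [0x2D, 0x6C, None, None, 0x2D]),   # '-l??-'
    ("HA", 0, list(b"HA")),
    ("ARJ", 0, [0x60, 0xEA]),
]

def identify(data):
    """Return the first archive type whose signature matches, else None."""
    for name, offset, pattern in SIGNATURES:
        chunk = data[offset:offset + len(pattern)]
        if len(chunk) == len(pattern) and all(
                p is None or c == p for c, p in zip(chunk, pattern)):
            return name
    return None
```

With this layout a file that happens to satisfy two patterns is reported
as whichever type comes first in the list, which is why the position of
each entry matters.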
That's it folks! If you're curious: the Borland Pascal source for
GUS is about 1100 lines in length. Those lines are `filled' in the
same way as those of the procedure quoted above.
*** NOTE: you may use the scanning and identification method as
used by GUS and as described above in your own programs, but
please be so kind and don't forget the reference indicating where
you got the information!
=== Cut ===
That was from GUS 1.95. Since then, I've made a v1.96 which can handle the
JAR type as well.
._|~/_
johanzw@club.innet.be
http://www.club.innet.be/~year1355/
--- Maximus-CBCS v3.01
(2:292/100)
---------------
* Origin: Tripod BBS Belgium - bortaS bIr jablu'DI'reH QaQqu' nay'