FM> have taken the *shortest* time for each test. BTW, where's
FM> the bug in my timing calculation? I discarded 4 results
FM> (out of 54) because they were ridiculous - either huge or
FM> negative.) Each test iterates the relevant code 5 million
I don't know about pascal, but in a C implementation, a calculation
is kept at the same data size until it needs to be promoted. You
have gone M2 * 6000, with M2 being a word. In C, that would remain
an "int" calculation, which would cause a number bigger than 32767
or 65535 to be truncated et al. In C, I would go either M2 * 6000L
or (long)M2 * 6000, to force a long multiplication.
FM> Test Time Comments
FM> 1 1.37 empty
FM> 2 42.40 SCASB/MOVSB
FM> 3 39.38 SCASB/MOVSD with fixup of last 1..3 bytes
FM> 4 38.61 SCASB/MOVSD unconditionally moving up to 4 extra bytes
FM> 5 45.53 MOV AL loop with $0a in a register
FM> 6 37.02 MOV AL loop with $0a in memory
FM> 7 44.98 MOV AL loop with slightly better initialisation
FM> 8 62.23 LODSB/STOSB loop
FM> 9 34.99 LODSB/STOSD loop (may move up to 3 extra bytes)
FM> 10 30.32 MOV EAX loop (also may move up to 3 extra bytes)
FM> Analysis:
FM> Well! Just drop all your preconceived ideas! For example, Test 6 being
FM> faster than Test 5 is amazing. And therefore I should have run Test 7
FM> again with the $0a in memory.
That isn't a memory operand, that is an immediate operand, ie the
number is part of the instruction, ie the data is already available,
as opposed to having to fetch it out of a register.
FM> If you don't mind moving a few extra bytes (past the $0a), it's worth
FM> doing so unconditionally rather than testing to see if you have to (Test
FM> 4 vs Test 3).
For a 2% improvement, I wouldn't say "worth it".
FM> In your application I should think that's perfectly OK.
Actually, it's perfectly unOK. You see, I can protect against
reading too far by using a sentinel, and I can even protect against
the double-word read, by making sure the INPUT buffer is always 3,
4, 5, 7, whatever bytes extra in length. But what I have no control
over is the size of the OUTPUT buffer. I cannot write more bytes
than the user has given me permission to do. Failure to observe
this will cause either a trap, or more likely, the first (up to)
3 bytes of the next variable to be clobbered.
FM> Double word moves are much better than byte moves (3 vs 2, 9 vs 8,
FM> 10 vs 7).
Well that certainly seems true enough. Let me code that in C and
post the assembler...
First of all, I couldn't get any of my C compilers to generate
exactly what you had written, but GCC came closest...
#include
int main(void)
{
char *t;
char *u;
register unsigned int z;
register unsigned int *x;
register unsigned int *y;
x = (unsigned int *)t;
y = (unsigned int *)u;
while (1)
{
z = *x++ = *y++;
if ((z & 0xffU) == '\n') break;
if ((z & 0xff00U) == 0x0a00U) break;
z = z >> 16;
if ((z & 0xffU) == '\n') break;
if ((z & 0xff00U) == 0x0a00U) break;
}
return (0);
}
000a 89 f1 mov ecx,esi
000c 8b 11 L1 mov edx,[ecx]
000e 89 13 mov [ebx],edx
0010 83 c1 04 add ecx,00000004H
0013 83 c3 04 add ebx,00000004H
0016 80 fa 0a cmp dl,0aH
0019 74 23 je L2
001b 89 d0 mov eax,edx
001d 25 00 ff 00 00 and eax,0000ff00H
0022 3d 00 0a 00 00 cmp eax,00000a00H
0027 74 15 je L2
0029 89 d0 mov eax,edx
002b c1 e8 10 shr eax,10H
002e 3c 0a cmp al,0aH
0030 74 0c je L2
0032 25 00 ff 00 00 and eax,0000ff00H
0037 3d 00 0a 00 00 cmp eax,00000a00H
003c 75 ce jne L1
003e 31 c0 L2 xor eax,eax
I then recoded it as this...
#include
int main(void)
{
char *t;
char *u;
register unsigned int z;
register unsigned int *x;
register unsigned int *y;
x = (unsigned int *)t;
y = (unsigned int *)u;
while (1)
{
z = *x++ = *y++;
if ((z & 0xffU) == '\n') break;
z = z >> 8;
if ((z & 0xffU) == '\n') break;
z = z >> 8;
if ((z & 0xffU) == '\n') break;
z = z >> 8;
if ((z & 0xffU) == '\n') break;
}
return (0);
}
And from gcc got...
000a 89 f1 mov ecx,esi
000c 8b 01 L1 mov eax,[ecx]
000e 89 03 mov [ebx],eax
0010 83 c1 04 add ecx,00000004H
0013 83 c3 04 add ebx,00000004H
0016 3c 0a cmp al,0aH
0018 74 1c je L2
001a 89 c2 mov edx,eax
001c c1 ea 08 shr edx,08H
001f 80 fa 0a cmp dl,0aH
0022 74 12 je L2
0024 89 c2 mov edx,eax
0026 c1 ea 10 shr edx,10H
0029 80 fa 0a cmp dl,0aH
002c 74 08 je L2
002e c1 e8 18 shr eax,18H
0031 83 f8 0a cmp eax,0000000aH
0034 75 d6 jne L1
0036 31 c0 L2 xor eax,eax
Which is actually no worse, no better. I tried a few more
styles, but I was unable to eliminate the unnecessary
reloading of the register all the time. So it is an extra
4 instructions to every loop.
It was interesting to see some of the C compilers rearranging the
code order, presumably to give the 80386 et al more chance of
processing the instructions in advance. It threw me, actually!
e.g. with Watcom...
#include
int main(void)
{
char *t;
char *u;
register unsigned int z;
register unsigned int *x;
register unsigned int *y;
x = (unsigned int *)t;
y = (unsigned int *)u;
while (1)
{
z = *x++ = *y++;
if ((z & 0xffU) == 0x0aU) break;
if (((z >>= 8) & 0xffU) == 0x0aU) break;
if (((z >>= 8) & 0xffU) == 0x0aU) break;
if (((z >>= 8) & 0xffU) == 0x0aU) break;
}
return (0);
}
0004 bf ff 00 00 00 mov edi,000000ffH
0009 8b 03 L1 mov eax,[ebx]
000b 83 c2 04 add edx,00000004H
000e 89 c1 mov ecx,eax
0010 83 c3 04 add ebx,00000004H
0013 21 f9 and ecx,edi
0015 89 42 fc mov -4H[edx],eax
0018 83 f9 0a cmp ecx,0000000aH
001b 74 22 je L2
001d c1 e8 08 shr eax,08H
0020 89 c1 mov ecx,eax
0022 21 f9 and ecx,edi
0024 83 f9 0a cmp ecx,0000000aH
0027 74 16 je L2
0029 c1 e8 08 shr eax,08H
002c 89 c1 mov ecx,eax
002e 21 f9 and ecx,edi
0030 83 f9 0a cmp ecx,0000000aH
0033 74 0a je L2
0035 c1 e8 08 shr eax,08H
0038 21 f8 and eax,edi
003a 83 f8 0a cmp eax,0000000aH
003d 75 ca jne L1
003f 31 c0 L2 xor eax,eax
FM> To my surprise the SCASB/MOVSB (or MOVSD) single instructions are *not*
FM> faster than a coded loop, even though you've got to do those extra
FM> register INCs in the loop. This is also the conclusion you came to.
Yeah, I couldn't see any mistake in my arithmetic (adding mostly
1's together!).
FM> Test 6 is basically what your compiler generated. It's good, but can be
FM> improved with hand-crafted asm. I didn't use CX & DX because this
FM> compiler doesn't support that addressing mode.
Hmmm, can be improved as above by changing the C code too, I
would expect.
FM> Test 10 can be improved further, given the results of 6 vs 5 -
FM> don't put $0a in BL or 4 in DX, make them memory operands.
True.
FM> Tests 9 & 10 are I suppose a variant of the optimisation technique
FM> called loop unrolling. As you can see, it's very effective.
Yeah. One thing though, I would expect that you could keep the
same tight code, but make it NOT copy the extra bytes. It is
the first time I've actually seen any advantage in little-endian
format, but you conveniently have the data just where you want it,
as so...
l2 mov 2[di],ax
cmp al,bl
je {at}l4
cmp ah,bl
je {at}l5
mov [di],ax
shr eax,16
cmp al,bl
je {at}l4
cmp ah,bl
jne {at}l2
l4 mov [di],ah
jmp {at}l7
l5 mov [di],ax
jmp {at}l7
l6 mov [di],ah
Ok, so there's one extra instruction used - so sue me! Actually,
you could even get that extra instruction out of the loop if you
wanted to, and move it to the termination cleanup. That would
be best.
FM> The code follows my signoff. It's not particularly elegant but I think
FM> it demonstrates what you want to know.
BTW, this wouldn't compile with Turbo Pascal 5.5. It complained
about the asm instruction. BFN. Paul.
@EOT:
---
* Origin: X (3:711/934.9)
|