TIP: Click on subject to list as thread! ANSI
echo: public_domain
to: Frank Malcolm
from: Paul Edwards
date: 1996-01-03 21:41:06
subject: movsb

FM>           have taken the *shortest* time for each test. BTW, where's
FM>           the bug in my timing calculation? I discarded 4 results
FM>           (out of 54) because they were ridiculous - either huge or
FM>           negative.) Each test iterates the relevant code 5 million

I don't know about pascal, but in a C implementation, a calculation
is kept at the same data size until it needs to be promoted.  You
have gone M2 * 6000, with M2 being a word.  In C, that would remain
an "int" calculation, which would cause a number bigger than 32767
or 65535 to be truncated et al.  In C, I would go either M2 * 6000L
or (long)M2 * 6000, to force a long multiplication.

FM> Test Time  Comments
FM>   1   1.37 empty
FM>   2  42.40 SCASB/MOVSB
FM>   3  39.38 SCASB/MOVSD with fixup of last 1..3 bytes
FM>   4  38.61 SCASB/MOVSD unconditionally moving up to 4 extra bytes
FM>   5  45.53 MOV AL loop with $0a in a register
FM>   6  37.02 MOV AL loop with $0a in memory
FM>   7  44.98 MOV AL loop with slightly better initialisation
FM>   8  62.23 LODSB/STOSB loop
FM>   9  34.99 LODSB/STOSD loop (may move up to 3 extra bytes)
FM>  10  30.32 MOV EAX loop (also may move up to 3 extra bytes)

FM> Analysis:
FM> Well! Just drop all your preconceived ideas! For example, Test 6 being
FM> faster than Test 5 is amazing. And therefore I should have run Test 7
FM> again with the $0a in memory.

That isn't a memory operand, that is an immediate operand, ie the
number is part of the instruction, ie the data is already available,
as opposed to having to fetch it out of a register.

FM> If you don't mind moving a few extra bytes (past the $0a), it's worth
FM> doing so unconditionally rather than testing to see if you have to (Test
FM> 4 vs Test 3). 

For a 2% improvement, I wouldn't say "worth it".

FM> In your application I should think that's perfectly OK.

Actually, it's perfectly unOK.  You see, I can protect against
reading too far by using a sentinel, and I can even protect against
the double-word read, by making sure the INPUT buffer is always 3,
4, 5, 7, whatever bytes extra in length.  But what I have no control
over is the size of the OUTPUT buffer.  I cannot write more bytes
than the user has given me permission to do.  Failure to observe
this will cause either a trap, or more likely, the first (up to)
3 bytes of the next variable to be clobbered.

FM> Double word moves are much better than byte moves (3 vs 2, 9 vs 8,
FM> 10 vs 7).

Well that certainly seems true enough.  Let me code that in C and
post the assembler...

First of all, I couldn't get any of my C compilers to generate
exactly what you had written, but GCC came closest...

#include 

int main(void)
{
    char *t;
    char *u;
    register unsigned int z;
    register unsigned int *x;
    register unsigned int *y;
    
    x = (unsigned int *)t;
    y = (unsigned int *)u;
    while (1)
    {
        z = *x++ = *y++;
        if ((z & 0xffU) == '\n') break;
        if ((z & 0xff00U) == 0x0a00U) break;
        z = z >> 16;
        if ((z & 0xffU) == '\n') break;
        if ((z & 0xff00U) == 0x0a00U) break;
    }    
    return (0);
}

 000a  89 f1                             mov     ecx,esi
 000c  8b 11             L1              mov     edx,[ecx]
 000e  89 13                             mov     [ebx],edx
 0010  83 c1 04                          add     ecx,00000004H
 0013  83 c3 04                          add     ebx,00000004H
 0016  80 fa 0a                          cmp     dl,0aH
 0019  74 23                             je      L2
 001b  89 d0                             mov     eax,edx
 001d  25 00 ff 00 00                    and     eax,0000ff00H
 0022  3d 00 0a 00 00                    cmp     eax,00000a00H
 0027  74 15                             je      L2
 0029  89 d0                             mov     eax,edx
 002b  c1 e8 10                          shr     eax,10H
 002e  3c 0a                             cmp     al,0aH
 0030  74 0c                             je      L2
 0032  25 00 ff 00 00                    and     eax,0000ff00H
 0037  3d 00 0a 00 00                    cmp     eax,00000a00H
 003c  75 ce                             jne     L1
 003e  31 c0             L2              xor     eax,eax


I then recoded it as this...

#include 

int main(void)
{
    char *t;
    char *u;
    register unsigned int z;
    register unsigned int *x;
    register unsigned int *y;
    
    x = (unsigned int *)t;
    y = (unsigned int *)u;
    while (1)
    {
        z = *x++ = *y++;
        if ((z & 0xffU) == '\n') break;
        z = z >> 8;
        if ((z & 0xffU) == '\n') break;
        z = z >> 8;
        if ((z & 0xffU) == '\n') break;
        z = z >> 8;
        if ((z & 0xffU) == '\n') break;
    }    
    return (0);
}

And from gcc got...

 000a  89 f1                             mov     ecx,esi
 000c  8b 01             L1              mov     eax,[ecx]
 000e  89 03                             mov     [ebx],eax
 0010  83 c1 04                          add     ecx,00000004H
 0013  83 c3 04                          add     ebx,00000004H
 0016  3c 0a                             cmp     al,0aH
 0018  74 1c                             je      L2
 001a  89 c2                             mov     edx,eax
 001c  c1 ea 08                          shr     edx,08H
 001f  80 fa 0a                          cmp     dl,0aH
 0022  74 12                             je      L2
 0024  89 c2                             mov     edx,eax
 0026  c1 ea 10                          shr     edx,10H
 0029  80 fa 0a                          cmp     dl,0aH
 002c  74 08                             je      L2
 002e  c1 e8 18                          shr     eax,18H
 0031  83 f8 0a                          cmp     eax,0000000aH
 0034  75 d6                             jne     L1
 0036  31 c0             L2              xor     eax,eax


Which is actually no worse, no better.  I tried a few more
styles, but I was unable to eliminate the unnecessary
reloading of the register all the time.  So it is an extra
4 instructions to every loop.

It was interesting to see some of the C compilers rearranging the
code order, presumably to give the 80386 et al more chance of
processing the instructions in advance.  It threw me, actually!

e.g. with Watcom...

#include 

int main(void)
{
    char *t;
    char *u;
    register unsigned int z;
    register unsigned int *x;
    register unsigned int *y;
    
    x = (unsigned int *)t;
    y = (unsigned int *)u;
    while (1)
    {
        z = *x++ = *y++;
        if ((z & 0xffU) == 0x0aU) break;
        if (((z >>= 8) & 0xffU) == 0x0aU) break;
        if (((z >>= 8) & 0xffU) == 0x0aU) break;
        if (((z >>= 8) & 0xffU) == 0x0aU) break;
    }    
    return (0);
}


 0004  bf ff 00 00 00                    mov     edi,000000ffH
 0009  8b 03             L1              mov     eax,[ebx]
 000b  83 c2 04                          add     edx,00000004H
 000e  89 c1                             mov     ecx,eax
 0010  83 c3 04                          add     ebx,00000004H
 0013  21 f9                             and     ecx,edi
 0015  89 42 fc                          mov     -4H[edx],eax
 0018  83 f9 0a                          cmp     ecx,0000000aH
 001b  74 22                             je      L2
 001d  c1 e8 08                          shr     eax,08H
 0020  89 c1                             mov     ecx,eax
 0022  21 f9                             and     ecx,edi
 0024  83 f9 0a                          cmp     ecx,0000000aH
 0027  74 16                             je      L2
 0029  c1 e8 08                          shr     eax,08H
 002c  89 c1                             mov     ecx,eax
 002e  21 f9                             and     ecx,edi
 0030  83 f9 0a                          cmp     ecx,0000000aH
 0033  74 0a                             je      L2
 0035  c1 e8 08                          shr     eax,08H
 0038  21 f8                             and     eax,edi
 003a  83 f8 0a                          cmp     eax,0000000aH
 003d  75 ca                             jne     L1
 003f  31 c0             L2              xor     eax,eax

FM> To my surprise the SCASB/MOVSB (or MOVSD) single instructions are *not*
FM> faster than a coded loop, even though you've got to do those extra
FM> register INCs in the loop. This is also the conclusion you came to.

Yeah, I couldn't see any mistake in my arithmetic (adding mostly
1's together!).

FM> Test 6 is basically what your compiler generated. It's good, but can be
FM> improved with hand-crafted asm. I didn't use CX & DX because this
FM> compiler doesn't support that addressing mode.

Hmmm, can be improved as above by changing the C code too, I
would expect.

FM> Test 10 can be improved further, given the results of 6 vs 5 -
FM> don't put $0a in BL or 4 in DX, make them memory operands. 

True.

FM> Tests 9 & 10 are I suppose a variant of the optimisation technique
FM> called loop unrolling. As you can see, it's very effective.

Yeah.  One thing though, I would expect that you could keep the
same tight code, but make it NOT copy the extra bytes.  It is
the first time I've actually seen any advantage in little-endian
format, but you conveniently have the data just where you want it,
as so...

l2 mov 2[di],ax
   cmp al,bl
   je {at}l4
   cmp ah,bl
   je {at}l5
   mov [di],ax
   shr eax,16
   cmp al,bl
   je {at}l4
   cmp ah,bl
   jne {at}l2
l4 mov [di],ah
   jmp {at}l7
l5 mov [di],ax
   jmp {at}l7
l6 mov [di],ah

Ok, so there's one extra instruction used - so sue me!  Actually,
you could even get that extra instruction out of the loop if you
wanted to, and move it to the termination cleanup.  That would
be best.

FM> The code follows my signoff. It's not particularly elegant but I think
FM> it demonstrates what you want to know.

BTW, this wouldn't compile with Turbo Pascal 5.5.  It complained
about the asm instruction.  BFN.  Paul.
@EOT:

---
* Origin: X (3:711/934.9)

SOURCE: echomail via fidonet.ozzmosis.com

Email questions or comments to sysop@ipingthereforeiam.com
All parts of this website painstakingly hand-crafted in the U.S.A.!
IPTIA BBS/MUD/Terminal/Game Server List, © 2025 IPTIA Consulting™.