TIP: Click on subject to list as thread! ANSI
echo: public_domain
to: Paul Edwards
from: Frank Malcolm
date: 1996-01-05 07:13:44
subject: movsb

Hi, Paul.

PE> FM>           have taken the *shortest* time for each test. BTW, where's
PE> FM>           the bug in my timing calculation? I discarded 4 results
PE> FM>           (out of 54) because they were ridiculous - either huge or
PE> FM>           negative.) Each test iterates the relevant code 5 million

PE> I don't know about pascal, but in a C implementation, a calculation
PE> is kept at the same data size until it needs to be promoted.  You
PE> have gone M2 * 6000, with M2 being a word.  In C, that would remain
PE> an "int" calculation, which would cause a number bigger than 32767
PE> or 65535 to be truncated et al.  In C, I would go either M2 * 6000L
PE> or (long)M2 * 6000, to force a long multiplication.

Of course. I must have been asleep when I wrote that. I need to use
longint (6000) etc.

PE> FM> Test Time  Comments
PE> FM>   1   1.37 empty
PE> FM>   2  42.40 SCASB/MOVSB
PE> FM>   3  39.38 SCASB/MOVSD with fixup of last 1..3 bytes
PE> FM>   4  38.61 SCASB/MOVSD unconditionally moving up to 4 extra bytes
PE> FM>   5  45.53 MOV AL loop with $0a in a register
PE> FM>   6  37.02 MOV AL loop with $0a in memory
PE> FM>   7  44.98 MOV AL loop with slightly better initialisation
PE> FM>   8  62.23 LODSB/STOSB loop
PE> FM>   9  34.99 LODSB/STOSD loop (may move up to 3 extra bytes)
PE> FM>  10  30.32 MOV EAX loop (also may move up to 3 extra bytes)

PE> FM> Analysis:
PE> FM> Well! Just drop all your preconceived ideas! For example, Test 6 being
PE> FM> faster than Test 5 is amazing. And therefore I should have run Test 7
PE> FM> again with the $0a in memory.

PE> That isn't a memory operand, that is an immediate operand, ie the
PE> number is part of the instruction, ie the data is already available,
PE> as opposed to having to fetch it out of a register.

I meant an immediate operand. But it's still in memory rather than being
in a register - that byte has to be fetched as part of the instruction
and therefore I expected it to be slower than getting it from a
register. I guess with the cache it doesn't matter, it's not accessing
physical RAM.

PE> FM> If you don't mind moving a few extra bytes (past the $0a), it's worth
PE> FM> doing so unconditionally rather than testing to see if you have to
PE> (Test
PE> FM> 4 vs Test 3).

PE> For a 2% improvement, I wouldn't say "worth it".

Well, if you're going for absolute raw speed it is.

PE> FM> In your application I should think that's perfectly OK.

PE> Actually, it's perfectly unOK.  You see, I can protect against
PE> reading too far by using a sentinel, and I can even protect against
PE> the double-word read, by making sure the INPUT buffer is always 3,
PE> 4, 5, 7, whatever bytes extra in length.  But what I have no control
PE> over is the size of the OUTPUT buffer.  I cannot write more bytes
PE> than the user has given me permission to do.  Failure to observe
PE> this will cause either a trap, or more likely, the first (up to)
PE> 3 bytes of the next variable to be clobbered.

OK, if that's the circumstance. But tell me a couple of things. How are
you putting your \n sentinel into the input buffer? Isn't *that* going
to clobber stuff? I'm assuming that you're processing a text file, with
lines terminated by \n. If you've got an input buffer with some of these
lines in, where are you putting the sentinel without clobbering a char
in the next line? Maybe my assumption's wrong.

Secondly, it sounds like you're saying that the user of your function
will know exactly how long the input is, and potentially allocate a
buffer of just the right size. But you don't know, otherwise you
wouldn't have to scan for \n. If he knows, why doesn't he just move that
many bytes rather than doing any scan? Where am I missing the point
here?

PE> FM> Double word moves are much better than byte moves (3 vs 2, 9 vs 8,
PE> FM> 10 vs 7).

PE> Well that certainly seems true enough.  Let me code that in C and
PE> post the assembler...

PE> First of all, I couldn't get any of my C compilers to generate
PE> exactly what you had written, but GCC came closest...

PE> #include 

PE> int main(void)
PE> {
PE>     char *t;
PE>     char *u;
PE>     register unsigned int z;
PE>     register unsigned int *x;
PE>     register unsigned int *y;

PE>     x = (unsigned int *)t;
PE>     y = (unsigned int *)u;
PE>     while (1)
PE>     {
PE>         z = *x++ = *y++;
PE>         if ((z & 0xffU) == '\n') break;
PE>         if ((z & 0xff00U) == 0x0a00U) break;
PE>         z = z >> 16;
PE>         if ((z & 0xffU) == '\n') break;
PE>         if ((z & 0xff00U) == 0x0a00U) break;
PE>     }
PE>     return (0);
PE> }

PE>  000a  89 f1                             mov     ecx,esi
PE>  000c  8b 11             L1              mov     edx,[ecx]
PE>  000e  89 13                             mov     [ebx],edx
PE>  0010  83 c1 04                          add     ecx,00000004H
PE>  0013  83 c3 04                          add     ebx,00000004H
PE>  0016  80 fa 0a                          cmp     dl,0aH
PE>  0019  74 23                             je      L2
PE>  001b  89 d0                             mov     eax,edx
PE>  001d  25 00 ff 00 00                    and     eax,0000ff00H
PE>  0022  3d 00 0a 00 00                    cmp     eax,00000a00H
PE>  0027  74 15                             je      L2
PE>  0029  89 d0                             mov     eax,edx
PE>  002b  c1 e8 10                          shr     eax,10H
PE>  002e  3c 0a                             cmp     al,0aH
PE>  0030  74 0c                             je      L2
PE>  0032  25 00 ff 00 00                    and     eax,0000ff00H
PE>  0037  3d 00 0a 00 00                    cmp     eax,00000a00H
PE>  003c  75 ce                             jne     L1
PE>  003e  31 c0             L2              xor     eax,eax


PE> I then recoded it as this...

PE> #include 

PE> int main(void)
PE> {
PE>     char *t;
PE>     char *u;
PE>     register unsigned int z;
PE>     register unsigned int *x;
PE>     register unsigned int *y;

PE>     x = (unsigned int *)t;
PE>     y = (unsigned int *)u;
PE>     while (1)
PE>     {
PE>         z = *x++ = *y++;
PE>         if ((z & 0xffU) == '\n') break;
PE>         z = z >> 8;
PE>         if ((z & 0xffU) == '\n') break;
PE>         z = z >> 8;
PE>         if ((z & 0xffU) == '\n') break;
PE>         z = z >> 8;
PE>         if ((z & 0xffU) == '\n') break;
PE>     }
PE>     return (0);
PE> }

PE> And from gcc got...

PE>  000a  89 f1                             mov     ecx,esi
PE>  000c  8b 01             L1              mov     eax,[ecx]
PE>  000e  89 03                             mov     [ebx],eax
PE>  0010  83 c1 04                          add     ecx,00000004H
PE>  0013  83 c3 04                          add     ebx,00000004H
PE>  0016  3c 0a                             cmp     al,0aH
PE>  0018  74 1c                             je      L2
PE>  001a  89 c2                             mov     edx,eax
PE>  001c  c1 ea 08                          shr     edx,08H
PE>  001f  80 fa 0a                          cmp     dl,0aH
PE>  0022  74 12                             je      L2
PE>  0024  89 c2                             mov     edx,eax
PE>  0026  c1 ea 10                          shr     edx,10H
PE>  0029  80 fa 0a                          cmp     dl,0aH
PE>  002c  74 08                             je      L2
PE>  002e  c1 e8 18                          shr     eax,18H
PE>  0031  83 f8 0a                          cmp     eax,0000000aH
PE>  0034  75 d6                             jne     L1
PE>  0036  31 c0             L2              xor     eax,eax

PE> Which is actually no worse, no better.  I tried a few more
PE> styles, but I was unable to eliminate the unnecessary
PE> reloading of the register all the time.  So it is an extra
PE> 4 instructions to every loop.

And there are a couple of unnecessary SHRs in there too, as it hasn't
worked out that it can look at the byte registers to get that char from
the dword.

PE> It was interesting to see some of the C compilers rearranging the
PE> code order, presumably to give the 80386 et al more chance of
PE> processing the instructions in advance.  It threw me, actually!

Yes, me too, as I mentioned when you posted that code some months ago
comparing 3 compilers, some of the optimisations ended up with code
miles away, presumably to take absolute maximum advantage of the
pipeline.

PE> e.g. with Watcom...

PE> #include 

PE> int main(void)
PE> {
PE>     char *t;
PE>     char *u;
PE>     register unsigned int z;
PE>     register unsigned int *x;
PE>     register unsigned int *y;

PE>     x = (unsigned int *)t;
PE>     y = (unsigned int *)u;
PE>     while (1)
PE>     {
PE>         z = *x++ = *y++;
PE>         if ((z & 0xffU) == 0x0aU) break;
PE>         if (((z >>= 8) & 0xffU) == 0x0aU) break;
PE>         if (((z >>= 8) & 0xffU) == 0x0aU) break;
PE>         if (((z >>= 8) & 0xffU) == 0x0aU) break;
PE>     }
PE>     return (0);
PE> }

PE>  0004  bf ff 00 00 00                    mov     edi,000000ffH
PE>  0009  8b 03             L1              mov     eax,[ebx]
PE>  000b  83 c2 04                          add     edx,00000004H
PE>  000e  89 c1                             mov     ecx,eax
PE>  0010  83 c3 04                          add     ebx,00000004H
PE>  0013  21 f9                             and     ecx,edi
PE>  0015  89 42 fc                          mov     -4H[edx],eax
PE>  0018  83 f9 0a                          cmp     ecx,0000000aH
PE>  001b  74 22                             je      L2
PE>  001d  c1 e8 08                          shr     eax,08H
PE>  0020  89 c1                             mov     ecx,eax
PE>  0022  21 f9                             and     ecx,edi
PE>  0024  83 f9 0a                          cmp     ecx,0000000aH
PE>  0027  74 16                             je      L2
PE>  0029  c1 e8 08                          shr     eax,08H
PE>  002c  89 c1                             mov     ecx,eax
PE>  002e  21 f9                             and     ecx,edi
PE>  0030  83 f9 0a                          cmp     ecx,0000000aH
PE>  0033  74 0a                             je      L2
PE>  0035  c1 e8 08                          shr     eax,08H
PE>  0038  21 f8                             and     eax,edi
PE>  003a  83 f8 0a                          cmp     eax,0000000aH
PE>  003d  75 ca                             jne     L1
PE>  003f  31 c0             L2              xor     eax,eax

Er, what rearrangement of code order? That looks like a pretty
straightforward implementation of your code to me.

PE> FM> To my surprise the SCASB/MOVSB (or MOVSD) single instructions are *not*
PE> FM> faster than a coded loop, even though you've got to do those extra
PE> FM> register INCs in the loop. This is also the conclusion you came to.

PE> Yeah, I couldn't see any mistake in my arithmetic (adding mostly
PE> 1's together!).

I wasn't questioning your arithmetic, just that (at least my) timing
reference comes with several conditions about when those timings will
be true - like the instruction has already been prefetched and decoded.
I thought the advantage would have been with the string instructions.

PE> FM> Test 6 is basically what your compiler generated. It's good, but can be
PE> FM> improved with hand-crafted asm. I didn't use CX & DX because this
PE> FM> compiler doesn't support that addressing mode.

PE> Hmmm, can be improved as above by changing the C code too, I
PE> would expect.

Yes, by changing it from a char to an unsigned int, as in the above.
(BTW, is an int always 32 bits? What do you call 16 bit ints and 8 bit
ints? In TP you have longint, integer and shortint for 32, 16, 8
respectively.)

PE> FM> Test 10 can be improved further, given the results of 6 vs 5 -
PE> FM> don't put $0a in BL or 4 in DX, make them memory operands.

PE> True.

PE> FM> Tests 9 & 10 are I suppose a variant of the optimisation technique
PE> FM> called loop unrolling. As you can see, it's very effective.

PE> Yeah.  One thing though, I would expect that you could keep the
PE> same tight code, but make it NOT copy the extra bytes.  It is
PE> the first time I've actually seen any advantage in little-endian
PE> format, but you conveniently have the data just where you want it,
PE> as so...

PE> l2 mov 2[di],ax
PE>    cmp al,bl
PE>    je {at}l4
PE>    cmp ah,bl
PE>    je {at}l5
PE>    mov [di],ax
PE>    shr eax,16
PE>    cmp al,bl
PE>    je {at}l4
PE>    cmp ah,bl
PE>    jne {at}l2
PE> l4 mov [di],ah
PE>    jmp {at}l7
PE> l5 mov [di],ax
PE>    jmp {at}l7
PE> l6 mov [di],ah

Er, I think you're still potentially moving one extra byte, in line 1
and line 6. Plus, you're doing word moves instead of dwords. So
presumably the timing would be somewhere between my test 7 and test 10.
Or I'm misreading the above code 'cos it's not complete.

PE> Ok, so there's one extra instruction used - so sue me!  Actually,
PE> you could even get that extra instruction out of the loop if you
PE> wanted to, and move it to the termination cleanup.  That would
PE> be best.

You certainly can get the advantage of dword moves without the problem
of overrunning your output buffer, but it takes a little bit more work.
I didn't do so as I didn't think it was a problem. However, you've now
said it is. Do you want to see code, and timing?

PE> FM> The code follows my signoff. It's not particularly elegant but I think
PE> FM> it demonstrates what you want to know.

PE> BTW, this wouldn't compile with Turbo Pascal 5.5.  It complained
PE> about the asm instruction.

No, asm was introduced with TP6. Before that the only thing you could do
was create so-called 'inline' functions, and you had to specify the code
as hex values!

Regards, FIM.

 * * Dammit, where'd I leave that tagline?
@EOT:

---
* Origin: Pedants Inc. (3:711/934.24)
SEEN-BY: 690/718 711/809 934

SOURCE: echomail via fidonet.ozzmosis.com

Email questions or comments to sysop@ipingthereforeiam.com
All parts of this website painstakingly hand-crafted in the U.S.A.!
IPTIA BBS/MUD/Terminal/Game Server List, © 2025 IPTIA Consulting™.