TIP: Click on subject to list as thread! ANSI
echo: public_domain
to: Paul Edwards
from: Frank Malcolm
date: 1996-01-02 08:53:28
subject: movsb

Hi, Paul.

PE> Thanks for the info.  So, ignoring the setup stuff, we have:

PE> SCASB/MOVSD = 5+3/4n = 5.75n

I think so.

PE> SCASB/MOVSB = 8n

Yes.

PE> assembler loop = 14n

Yes.

PE> C loop = 8n?

See below.

PE> FM> Basically, yes; there's no combination MOVSB/SCASB instruction. I was
PE> FM> about to write that "anyway, it'd be much faster doing it with the
PE> FM> single instructions", but it looks like it would be
close. REPE SCASB
PE> is
PE> FM> 7 + 5n clocks on a 486, REP MOVSB is 12 + 3n. (I think - the timings

PE> It is interesting that SCASB only needs to do a memory read, whilst
PE> MOVSB is a read + write, yet the former is faster!

I think you mean the latter, and yes, it is interesting. Remember that
the SCASB also has to do a compare to AL, set the flag and potentially
quit the instruction.

BUT. For a start there at least 6 footnotes to the timing info in my
reference for those instructions. ALSO, there's a whole lot of
assumptions mentioned at the beginning of the chapter.

IOW, unless you are intimately acquainted with the hardware (which I'm
certainly not) you can't really say what's going to happen in this sort
of pipelined, cached processor. I'd call the nominated timings a good
first guess, nothing better. You'd be best off timing a few hundred
million typical strings.

PE> FM> Then for a loop, say

PE> FM> L1: LODSB
PE> FM>     STOSB
PE> FM>     CMP AH ;where you've put $0a
PE> FM>     JNE L1

PE>  002c  8a 01             L2              mov     al,[ecx]
PE>  002e  88 02                             mov     [edx],al
PE>  0030  41                                inc     ecx
PE>  0031  42                                inc     edx
PE>  0032  3c 0a                             cmp     al,0aH
PE>  0034  75 f6                             jne     L2

PE> FM> it's 5+5+1+3 = 14n for the loop, once again ignoring setting up the
PE> FM> registers.

PE> Did I add up the numbers wrong or something?  The C loop looks
PE> like 8n to me.  Timings on a 486, since I determined that that
PE> was what figures you were using.  BFN.  Paul.

Yes, I was. Your compiler has generated the

mov     al,[ecx]
mov     [edx],al

pair, followed by INCs of the 2 registers. That's a logical thing for it
to do given the general case of what could be in the while loop. I'd
suggest that the MOVSB would be faster, but that relies on the compiler
picking that specific case of how you're using the WHILE loop.

I don't think you'd find, in practice, that it really was x+8n.

If speed is really critical I'd suggest timing a few hundred million
examples of 3 cases...

a) SCASB/MOVSD with special handling at the end of the loop, as
   I'm sure that would be faster than SCASB/MOVSB for typical
   strings.

b) The code your compiler generated.

c) LODSB/STOSB/CMP

or the variant

c1) LODSD/check each byte in EAX/STOSD, once again with special
    handling at the end.

Maybe I could do that, if you like. Of course it would be in Turbo
Pascal with the built-in asm, but the results should translate to
'C' as well.

Oh, all right, I've done it.

Baseline: 486DX/50. All tests run in native DOS from the command line.
          Each test was run 5 times and the results averaged for the
          table below. (Well, that was the intention. When I ran the
          tests, many (including the empty loop) resulted in just 2
          times, precisely six hundredths of a second different. I
          conclude that that is perhaps SMARTDRV doing it's thing, and
          have taken the *shortest* time for each test. BTW, where's
          the bug in my timing calculation? I discarded 4 results
          (out of 54) because they were ridiculous - either huge or
          negative.) Each test iterates the relevant code 5 million
          times. The test string was designed to be 'typical', and
          specifically to represent the worst case for the SCASB/MOVSD
          test (3 "leftover" bytes). If your strings are 'typically'
          different, do some more tests. Extended registers have not
          been used for addressing or counting - doing so should have
          minimal effect as this only affects the initialisation
          instructions, not the loop body. Look at the program code
          below to see what each test tests.

Test Time  Comments
  1   1.37 empty
  2  42.40 SCASB/MOVSB
  3  39.38 SCASB/MOVSD with fixup of last 1..3 bytes
  4  38.61 SCASB/MOVSD unconditionally moving up to 4 extra bytes
  5  45.53 MOV AL loop with $0a in a register
  6  37.02 MOV AL loop with $0a in memory
  7  44.98 MOV AL loop with slightly better initialisation
  8  62.23 LODSB/STOSB loop
  9  34.99 LODSB/STOSD loop (may move up to 3 extra bytes)
 10  30.32 MOV EAX loop (also may move up to 3 extra bytes)

Analysis:
Well! Just drop all your preconceived ideas! For example, Test 6 being
faster than Test 5 is amazing. And therefore I should have run Test 7
again with the $0a in memory.

If you don't mind moving a few extra bytes (past the $0a), it's worth
doing so unconditionally rather than testing to see if you have to (Test
4 vs Test 3). In your application I should think that's perfectly OK.

Double word moves are much better than byte moves (3 vs 2, 9 vs 8,
10 vs 7).

To my surprise the SCASB/MOVSB (or MOVSD) single instructions are *not*
faster than a coded loop, even though you've got to do those extra
register INCs in the loop. This is also the conclusion you came to.

Test 6 is basically what your compiler generated. It's good, but can be
improved with hand-crafted asm. I didn't use CX & DX because this
compiler doesn't support that addressing mode.

If you *know* what's in your segment registers, you can save a little
bit by a MOV vs LDS and/or LES (7 vs 5).

Test 10 can be improved further, given the results of 6 vs 5 -
don't put $0a in BL or 4 in DX, make them memory operands. And, from
7 vs 5, use MOV rather than the second LDS in the initialisation.

Tests 9 & 10 are I suppose a variant of the optimisation technique
called loop unrolling. As you can see, it's very effective.

The final "winning" case, Test 10, should also be a winner on shorter
strings as it has minimal setup, especially with the suggestions above.

The code follows my signoff. It's not particularly elegant but I think
it demonstrates what you want to know.

Regards, FIM.

program PaulTest;
uses Dos; { for GetTime }
const LoopMax = 5000000;
var LoopCount: longint;
    H1, M1, S1, Hund1: word;
    H2, M2, S2, Hund2: word;
    Elapsed: longint;
    SourceString, DestString: string;
    SSP, DSP: pointer;
    Test: byte;
begin
WriteLn;
SourceString := 'This is a test string of length 43 bytes**'#$0a;
DestString   := '*******************************************';
SSP := Addr (SourceString [1]);
DSP := Addr (DestString [1]);
Test := 0;
GetTime (H1, M1, S1, Hund1);
for LoopCount := 1 to LoopMax do
  begin
  { Comment out all tests except the one you're testing.
    You can easily do this by adding or removing the closing
    brace at the end of the line describing the test. Also
    set the 'Test' variable, so the output makes sense. }

  { Test 1 - nothing in the loop at all }
  asm
  end; {}

  { Test 2 - SCASB/MOVSB
  asm
  les di,ssp
  mov cx,$ffff
  mov al,$0a
  repne scasb
  lds si,ssp
  les di,dsp
  not cx
  rep movsb
  end; {}

  { Test 3 - SCASB/MOVSD
  asm
  les di,ssp
  mov cx,$ffff
  mov al,$0a
  repne scasb
  lds si,ssp
  les di,dsp
  not cx
  mov dx,cx
  shr cx,2
  db $66; rep movsw (* fake a MOVSD *)
  mov cx,dx         (* handle the last 0..3 bytes *)
  and cx,3
  rep movsb
  end; {}

  { Test 4 - SCASB/MOVSD with maybe 4 bytes extra moved
  asm
  les di,ssp
  mov cx,$ffff
  mov al,$0a
  repne scasb
  lds si,ssp
  les di,dsp
  not cx
  shr cx,2
  db $66; rep movsw (* fake a MOVSD *)
  db $66; movsw     (* unconditionally move 4 more bytes *)
  end; {}

  { Test 5 - MOV AL loop with $0a in register
  asm
  lds si,ssp
  lds di,dsp
  mov ah,$0a
  {at}l2:
  mov al,[si]
  mov [di],al
  inc si
  inc di
  cmp al,ah
  jne {at}l2
  end; {}

  { Test 6 - MOV AL loop with $0a as memory operand
  asm
  lds si,ssp
  lds di,dsp
  {at}l2:
  mov al,[si]
  mov [di],al
  inc si
  inc di
  cmp al,$0a
  jne {at}l2
  end; {}

  { Test 7 - MOV AL loop with MOV di,dsp rather than LDS di,dsp
  asm
  lds si,ssp
  mov di, word(dsp)  (* DS already has the segment *)
  mov ah,$0a
  {at}l2:
  mov al,[si]
  mov [di],al
  inc si
  inc di
  cmp al,ah
  jne {at}l2
  end; {}

  { Test 8 - LODSB/STOSB loop
  asm
  lds si,ssp
  mov di, word(dsp)
  mov ah,$0a
  {at}l2:
  lodsb
  stosb
  cmp al,ah
  jne {at}l2
  end; {}

  { Test 9 - LODSD/STOSD loop
    NB: this could move up to 3 extra bytes unnecessarily
  asm
  lds si,ssp
  mov di, word(dsp)
  mov bl,$0a
  {at}l2:
  db $66; lodsw (* fake a LODSD *)
  db $66; stosw (* and a STOSD  *)
  cmp al,bl
  je {at}l3
  cmp ah,bl
  je {at}l3
  db $66; shr ax,16 (* fake a SHR EAX *)
  cmp al,bl
  je {at}l3
  cmp ah,bl
  jne {at}l2
  {at}l3:
  end; {}

  { Test 10 - loop with MOV EAX,[SI] etc. Maybe 3 extra bytes moved }
  asm
  lds si,ssp
  lds di,dsp
  mov bl,$0a
  mov dx,4
  {at}l2:
  db $66; mov ax,[si] (* fake a MOV EAX *)
  db $66; mov [di],ax (* and again      *)
  add si,dx
  add di,dx
  cmp al,bl
  je {at}l3
  cmp ah,bl
  je {at}l3
  db $66; shr ax,16 (* fake a SHR EAX *)
  cmp al,bl
  je {at}l3
  cmp ah,bl
  jne {at}l2
  {at}l3:
  end; {}

  end;
GetTime (H2, M2, S2, Hund2);
WriteLn (deststring); { to make sure we really did it }
Elapsed := (H2 * 360000 + M2 * 6000 + S2 * 100 + Hund2) -
           (H1 * 360000 + M1 * 6000 + S1 * 100 + Hund1);
WriteLn ('Test ', Test, ': ', Elapsed / 100:0:2, ' seconds for ',
         LoopMax, ' iterations');
end.
@EOT:

---
* Origin: Pedants Inc. (3:711/934.24)
SEEN-BY: 690/718 711/809 934

SOURCE: echomail via fidonet.ozzmosis.com

Email questions or comments to sysop@ipingthereforeiam.com
All parts of this website painstakingly hand-crafted in the U.S.A.!
IPTIA BBS/MUD/Terminal/Game Server List, © 2025 IPTIA Consulting™.