| TIP: Click on subject to list as thread! | ANSI |
| echo: | |
|---|---|
| to: | |
| from: | |
| date: | |
| subject: | movsb |
Hi, Paul.
PE> Thanks for the info. So, ignoring the setup stuff, we have:
PE> SCASB/MOVSD = 5+3/4n = 5.75n
I think so.
PE> SCASB/MOVSB = 8n
Yes.
PE> assembler loop = 14n
Yes.
PE> C loop = 8n?
See below.
PE> FM> Basically, yes; there's no combination MOVSB/SCASB instruction. I was
PE> FM> about to write that "anyway, it'd be much faster doing it with the
PE> FM> single instructions", but it looks like it would be
close. REPE SCASB
PE> is
PE> FM> 7 + 5n clocks on a 486, REP MOVSB is 12 + 3n. (I think - the timings
PE> It is interesting that SCASB only needs to do a memory read, whilst
PE> MOVSB is a read + write, yet the former is faster!
I think you mean the latter, and yes, it is interesting. Remember that
the SCASB also has to do a compare to AL, set the flag and potentially
quit the instruction.
BUT. For a start there at least 6 footnotes to the timing info in my
reference for those instructions. ALSO, there's a whole lot of
assumptions mentioned at the beginning of the chapter.
IOW, unless you are intimately acquainted with the hardware (which I'm
certainly not) you can't really say what's going to happen in this sort
of pipelined, cached processor. I'd call the nominated timings a good
first guess, nothing better. You'd be best off timing a few hundred
million typical strings.
PE> FM> Then for a loop, say
PE> FM> L1: LODSB
PE> FM> STOSB
PE> FM> CMP AH ;where you've put $0a
PE> FM> JNE L1
PE> 002c 8a 01 L2 mov al,[ecx]
PE> 002e 88 02 mov [edx],al
PE> 0030 41 inc ecx
PE> 0031 42 inc edx
PE> 0032 3c 0a cmp al,0aH
PE> 0034 75 f6 jne L2
PE> FM> it's 5+5+1+3 = 14n for the loop, once again ignoring setting up the
PE> FM> registers.
PE> Did I add up the numbers wrong or something? The C loop looks
PE> like 8n to me. Timings on a 486, since I determined that that
PE> was what figures you were using. BFN. Paul.
Yes, I was. Your compiler has generated the
mov al,[ecx]
mov [edx],al
pair, followed by INCs of the 2 registers. That's a logical thing for it
to do given the general case of what could be in the while loop. I'd
suggest that the MOVSB would be faster, but that relies on the compiler
picking that specific case of how you're using the WHILE loop.
I don't think you'd find, in practice, that it really was x+8n.
If speed is really critical I'd suggest timing a few hundred million
examples of 3 cases...
a) SCASB/MOVSD with special handling at the end of the loop, as
I'm sure that would be faster than SCASB/MOVSB for typical
strings.
b) The code your compiler generated.
c) LODSB/STOSB/CMP
or the variant
c1) LODSD/check each byte in EAX/STOSD, once again with special
handling at the end.
Maybe I could do that, if you like. Of course it would be in Turbo
Pascal with the built-in asm, but the results should translate to
'C' as well.
Oh, all right, I've done it.
Baseline: 486DX/50. All tests run in native DOS from the command line.
Each test was run 5 times and the results averaged for the
table below. (Well, that was the intention. When I ran the
tests, many (including the empty loop) resulted in just 2
times, precisely six hundredths of a second different. I
conclude that that is perhaps SMARTDRV doing it's thing, and
have taken the *shortest* time for each test. BTW, where's
the bug in my timing calculation? I discarded 4 results
(out of 54) because they were ridiculous - either huge or
negative.) Each test iterates the relevant code 5 million
times. The test string was designed to be 'typical', and
specifically to represent the worst case for the SCASB/MOVSD
test (3 "leftover" bytes). If your strings are 'typically'
different, do some more tests. Extended registers have not
been used for addressing or counting - doing so should have
minimal effect as this only affects the initialisation
instructions, not the loop body. Look at the program code
below to see what each test tests.
Test Time Comments
1 1.37 empty
2 42.40 SCASB/MOVSB
3 39.38 SCASB/MOVSD with fixup of last 1..3 bytes
4 38.61 SCASB/MOVSD unconditionally moving up to 4 extra bytes
5 45.53 MOV AL loop with $0a in a register
6 37.02 MOV AL loop with $0a in memory
7 44.98 MOV AL loop with slightly better initialisation
8 62.23 LODSB/STOSB loop
9 34.99 LODSB/STOSD loop (may move up to 3 extra bytes)
10 30.32 MOV EAX loop (also may move up to 3 extra bytes)
Analysis:
Well! Just drop all your preconceived ideas! For example, Test 6 being
faster than Test 5 is amazing. And therefore I should have run Test 7
again with the $0a in memory.
If you don't mind moving a few extra bytes (past the $0a), it's worth
doing so unconditionally rather than testing to see if you have to (Test
4 vs Test 3). In your application I should think that's perfectly OK.
Double word moves are much better than byte moves (3 vs 2, 9 vs 8,
10 vs 7).
To my surprise the SCASB/MOVSB (or MOVSD) single instructions are *not*
faster than a coded loop, even though you've got to do those extra
register INCs in the loop. This is also the conclusion you came to.
Test 6 is basically what your compiler generated. It's good, but can be
improved with hand-crafted asm. I didn't use CX & DX because this
compiler doesn't support that addressing mode.
If you *know* what's in your segment registers, you can save a little
bit by a MOV vs LDS and/or LES (7 vs 5).
Test 10 can be improved further, given the results of 6 vs 5 -
don't put $0a in BL or 4 in DX, make them memory operands. And, from
7 vs 5, use MOV rather than the second LDS in the initialisation.
Tests 9 & 10 are I suppose a variant of the optimisation technique
called loop unrolling. As you can see, it's very effective.
The final "winning" case, Test 10, should also be a winner on shorter
strings as it has minimal setup, especially with the suggestions above.
The code follows my signoff. It's not particularly elegant but I think
it demonstrates what you want to know.
Regards, FIM.
program PaulTest;
uses Dos; { for GetTime }
const LoopMax = 5000000;
var LoopCount: longint;
H1, M1, S1, Hund1: word;
H2, M2, S2, Hund2: word;
Elapsed: longint;
SourceString, DestString: string;
SSP, DSP: pointer;
Test: byte;
begin
WriteLn;
SourceString := 'This is a test string of length 43 bytes**'#$0a;
DestString := '*******************************************';
SSP := Addr (SourceString [1]);
DSP := Addr (DestString [1]);
Test := 0;
GetTime (H1, M1, S1, Hund1);
for LoopCount := 1 to LoopMax do
begin
{ Comment out all tests except the one you're testing.
You can easily do this by adding or removing the closing
brace at the end of the line describing the test. Also
set the 'Test' variable, so the output makes sense. }
{ Test 1 - nothing in the loop at all }
asm
end; {}
{ Test 2 - SCASB/MOVSB
asm
les di,ssp
mov cx,$ffff
mov al,$0a
repne scasb
lds si,ssp
les di,dsp
not cx
rep movsb
end; {}
{ Test 3 - SCASB/MOVSD
asm
les di,ssp
mov cx,$ffff
mov al,$0a
repne scasb
lds si,ssp
les di,dsp
not cx
mov dx,cx
shr cx,2
db $66; rep movsw (* fake a MOVSD *)
mov cx,dx (* handle the last 0..3 bytes *)
and cx,3
rep movsb
end; {}
{ Test 4 - SCASB/MOVSD with maybe 4 bytes extra moved
asm
les di,ssp
mov cx,$ffff
mov al,$0a
repne scasb
lds si,ssp
les di,dsp
not cx
shr cx,2
db $66; rep movsw (* fake a MOVSD *)
db $66; movsw (* unconditionally move 4 more bytes *)
end; {}
{ Test 5 - MOV AL loop with $0a in register
asm
lds si,ssp
lds di,dsp
mov ah,$0a
{at}l2:
mov al,[si]
mov [di],al
inc si
inc di
cmp al,ah
jne {at}l2
end; {}
{ Test 6 - MOV AL loop with $0a as memory operand
asm
lds si,ssp
lds di,dsp
{at}l2:
mov al,[si]
mov [di],al
inc si
inc di
cmp al,$0a
jne {at}l2
end; {}
{ Test 7 - MOV AL loop with MOV di,dsp rather than LDS di,dsp
asm
lds si,ssp
mov di, word(dsp) (* DS already has the segment *)
mov ah,$0a
{at}l2:
mov al,[si]
mov [di],al
inc si
inc di
cmp al,ah
jne {at}l2
end; {}
{ Test 8 - LODSB/STOSB loop
asm
lds si,ssp
mov di, word(dsp)
mov ah,$0a
{at}l2:
lodsb
stosb
cmp al,ah
jne {at}l2
end; {}
{ Test 9 - LODSD/STOSD loop
NB: this could move up to 3 extra bytes unnecessarily
asm
lds si,ssp
mov di, word(dsp)
mov bl,$0a
{at}l2:
db $66; lodsw (* fake a LODSD *)
db $66; stosw (* and a STOSD *)
cmp al,bl
je {at}l3
cmp ah,bl
je {at}l3
db $66; shr ax,16 (* fake a SHR EAX *)
cmp al,bl
je {at}l3
cmp ah,bl
jne {at}l2
{at}l3:
end; {}
{ Test 10 - loop with MOV EAX,[SI] etc. Maybe 3 extra bytes moved }
asm
lds si,ssp
lds di,dsp
mov bl,$0a
mov dx,4
{at}l2:
db $66; mov ax,[si] (* fake a MOV EAX *)
db $66; mov [di],ax (* and again *)
add si,dx
add di,dx
cmp al,bl
je {at}l3
cmp ah,bl
je {at}l3
db $66; shr ax,16 (* fake a SHR EAX *)
cmp al,bl
je {at}l3
cmp ah,bl
jne {at}l2
{at}l3:
end; {}
end;
GetTime (H2, M2, S2, Hund2);
WriteLn (deststring); { to make sure we really did it }
Elapsed := (H2 * 360000 + M2 * 6000 + S2 * 100 + Hund2) -
(H1 * 360000 + M1 * 6000 + S1 * 100 + Hund1);
WriteLn ('Test ', Test, ': ', Elapsed / 100:0:2, ' seconds for ',
LoopMax, ' iterations');
end.
@EOT:
---
* Origin: Pedants Inc. (3:711/934.24)SEEN-BY: 690/718 711/809 934 |
|
| SOURCE: echomail via fidonet.ozzmosis.com | |
Email questions or comments to sysop@ipingthereforeiam.com
All parts of this website painstakingly hand-crafted in the U.S.A.!
IPTIA BBS/MUD/Terminal/Game Server List, © 2025 IPTIA Consulting™.