Subject: movsb
Hi, Paul.
PE> FM> have taken the *shortest* time for each test. BTW, where's
PE> FM> the bug in my timing calculation? I discarded 4 results
PE> FM> (out of 54) because they were ridiculous - either huge or
PE> FM> negative.) Each test iterates the relevant code 5 million
PE> I don't know about pascal, but in a C implementation, a calculation
PE> is kept at the same data size until it needs to be promoted. You
PE> have gone M2 * 6000, with M2 being a word. In C, that would remain
PE> an "int" calculation, which would cause a number bigger than 32767
PE> or 65535 to be truncated et al. In C, I would go either M2 * 6000L
PE> or (long)M2 * 6000, to force a long multiplication.
Of course. I must have been asleep when I wrote that. I need to write
longint(6000) etc. to force the multiplication up to longint.
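The truncation can be sketched in C (a sketch using C99 fixed-width types to stand in for TP's word and longint; not the original code):

```c
#include <stdint.h>

/* Simulating 16-bit "word" arithmetic: the product wraps modulo 65536,
   which is exactly the truncation described above. */
uint16_t word_mul(uint16_t m2)
{
    return (uint16_t)(m2 * 6000u);      /* forced back to 16 bits */
}

/* Widening one operand first keeps the full 32-bit product,
   as with 6000L or (long)M2 * 6000 in C. */
uint32_t long_mul(uint16_t m2)
{
    return (uint32_t)m2 * 6000u;
}
```

For example, word_mul(100) gives 10176 rather than 600000, because 600000 mod 65536 is 10176.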
PE> FM> Test Time Comments
PE> FM> 1 1.37 empty
PE> FM> 2 42.40 SCASB/MOVSB
PE> FM> 3 39.38 SCASB/MOVSD with fixup of last 1..3 bytes
PE> FM> 4 38.61 SCASB/MOVSD unconditionally moving up to 4 extra bytes
PE> FM> 5 45.53 MOV AL loop with $0a in a register
PE> FM> 6 37.02 MOV AL loop with $0a in memory
PE> FM> 7 44.98 MOV AL loop with slightly better initialisation
PE> FM> 8 62.23 LODSB/STOSB loop
PE> FM> 9 34.99 LODSB/STOSD loop (may move up to 3 extra bytes)
PE> FM> 10 30.32 MOV EAX loop (also may move up to 3 extra bytes)
PE> FM> Analysis:
PE> FM> Well! Just drop all your preconceived ideas! For example, Test 6 being
PE> FM> faster than Test 5 is amazing. And therefore I should have run Test 7
PE> FM> again with the $0a in memory.
PE> That isn't a memory operand, that is an immediate operand, ie the
PE> number is part of the instruction, ie the data is already available,
PE> as opposed to having to fetch it out of a register.
I meant an immediate operand. But it's still in memory rather than in a
register - that byte has to be fetched as part of the instruction - so I
expected it to be slower than getting it from a register. I guess with
the cache it doesn't matter, since the instruction isn't being fetched
from physical RAM.
PE> FM> If you don't mind moving a few extra bytes (past the $0a), it's worth
PE> FM> doing so unconditionally rather than testing to see if you have to (Test
PE> FM> 4 vs Test 3).
PE> For a 2% improvement, I wouldn't say "worth it".
Well, if you're going for absolute raw speed it is.
PE> FM> In your application I should think that's perfectly OK.
PE> Actually, it's perfectly unOK. You see, I can protect against
PE> reading too far by using a sentinel, and I can even protect against
PE> the double-word read, by making sure the INPUT buffer is always 3,
PE> 4, 5, 7, whatever bytes extra in length. But what I have no control
PE> over is the size of the OUTPUT buffer. I cannot write more bytes
PE> than the user has given me permission to do. Failure to observe
PE> this will cause either a trap, or more likely, the first (up to)
PE> 3 bytes of the next variable to be clobbered.
OK, if that's the circumstance. But tell me a couple of things. How are
you putting your \n sentinel into the input buffer? Isn't *that* going
to clobber stuff? I'm assuming that you're processing a text file, with
lines terminated by \n. If you've got an input buffer with some of these
lines in, where are you putting the sentinel without clobbering a char
in the next line? Maybe my assumption's wrong.
Secondly, it sounds like you're saying that the user of your function
will know exactly how long the input is, and potentially allocate a
buffer of just the right size. But you don't know, otherwise you
wouldn't have to scan for \n. If he knows, why doesn't he just move that
many bytes rather than doing any scan? Where am I missing the point
here?
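For what it's worth, the sentinel trick I have in mind looks roughly like this in C (a sketch with hypothetical names; it assumes the caller reserves one spare byte past the valid data, so nothing real gets clobbered):

```c
#include <string.h>

/* Sketch: scan for '\n' with no length check in the loop, by planting
   a sentinel one byte past the valid data.  Assumes buf[len] is ours
   to scribble on; we save and restore it anyway. */
const char *find_nl(char *buf, size_t len)
{
    char saved = buf[len];          /* remember whatever is there */
    buf[len] = '\n';                /* sentinel: loop must terminate */
    char *p = buf;
    while (*p != '\n')
        p++;
    buf[len] = saved;               /* restore the spare byte */
    return (p == buf + len) ? 0 : p;    /* 0: no newline in the data */
}
```

If the '\n' found is the sentinel itself, there was no newline in the real data, hence the final check against buf + len.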
PE> FM> Double word moves are much better than byte moves (3 vs 2, 9 vs 8,
PE> FM> 10 vs 7).
PE> Well that certainly seems true enough. Let me code that in C and
PE> post the assembler...
PE> First of all, I couldn't get any of my C compilers to generate
PE> exactly what you had written, but GCC came closest...
PE> #include
PE> int main(void)
PE> {
PE> char *t;
PE> char *u;
PE> register unsigned int z;
PE> register unsigned int *x;
PE> register unsigned int *y;
PE> x = (unsigned int *)t;
PE> y = (unsigned int *)u;
PE> while (1)
PE> {
PE> z = *x++ = *y++;
PE> if ((z & 0xffU) == '\n') break;
PE> if ((z & 0xff00U) == 0x0a00U) break;
PE> z = z >> 16;
PE> if ((z & 0xffU) == '\n') break;
PE> if ((z & 0xff00U) == 0x0a00U) break;
PE> }
PE> return (0);
PE> }
PE> 000a 89 f1 mov ecx,esi
PE> 000c 8b 11 L1 mov edx,[ecx]
PE> 000e 89 13 mov [ebx],edx
PE> 0010 83 c1 04 add ecx,00000004H
PE> 0013 83 c3 04 add ebx,00000004H
PE> 0016 80 fa 0a cmp dl,0aH
PE> 0019 74 23 je L2
PE> 001b 89 d0 mov eax,edx
PE> 001d 25 00 ff 00 00 and eax,0000ff00H
PE> 0022 3d 00 0a 00 00 cmp eax,00000a00H
PE> 0027 74 15 je L2
PE> 0029 89 d0 mov eax,edx
PE> 002b c1 e8 10 shr eax,10H
PE> 002e 3c 0a cmp al,0aH
PE> 0030 74 0c je L2
PE> 0032 25 00 ff 00 00 and eax,0000ff00H
PE> 0037 3d 00 0a 00 00 cmp eax,00000a00H
PE> 003c 75 ce jne L1
PE> 003e 31 c0 L2 xor eax,eax
PE> I then recoded it as this...
PE> #include
PE> int main(void)
PE> {
PE> char *t;
PE> char *u;
PE> register unsigned int z;
PE> register unsigned int *x;
PE> register unsigned int *y;
PE> x = (unsigned int *)t;
PE> y = (unsigned int *)u;
PE> while (1)
PE> {
PE> z = *x++ = *y++;
PE> if ((z & 0xffU) == '\n') break;
PE> z = z >> 8;
PE> if ((z & 0xffU) == '\n') break;
PE> z = z >> 8;
PE> if ((z & 0xffU) == '\n') break;
PE> z = z >> 8;
PE> if ((z & 0xffU) == '\n') break;
PE> }
PE> return (0);
PE> }
PE> And from gcc got...
PE> 000a 89 f1 mov ecx,esi
PE> 000c 8b 01 L1 mov eax,[ecx]
PE> 000e 89 03 mov [ebx],eax
PE> 0010 83 c1 04 add ecx,00000004H
PE> 0013 83 c3 04 add ebx,00000004H
PE> 0016 3c 0a cmp al,0aH
PE> 0018 74 1c je L2
PE> 001a 89 c2 mov edx,eax
PE> 001c c1 ea 08 shr edx,08H
PE> 001f 80 fa 0a cmp dl,0aH
PE> 0022 74 12 je L2
PE> 0024 89 c2 mov edx,eax
PE> 0026 c1 ea 10 shr edx,10H
PE> 0029 80 fa 0a cmp dl,0aH
PE> 002c 74 08 je L2
PE> 002e c1 e8 18 shr eax,18H
PE> 0031 83 f8 0a cmp eax,0000000aH
PE> 0034 75 d6 jne L1
PE> 0036 31 c0 L2 xor eax,eax
PE> Which is actually no worse, no better. I tried a few more
PE> styles, but I was unable to eliminate the unnecessary
PE> reloading of the register all the time. So it is an extra
PE> 4 instructions to every loop.
And there are a couple of unnecessary SHRs in there too, as it hasn't
worked out that it can look at the byte registers to get that char from
the dword.
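In C terms, the byte-register idea corresponds to looking at the dword through a byte pointer instead of shifting it down each time; on little-endian x86 a compiler can then test AL, AH and so on directly. A sketch (not what either compiler actually emitted):

```c
#include <stdint.h>

/* Sketch: test each byte of a 32-bit word for '\n' by aliasing it with
   a byte pointer, rather than shifting the word right repeatedly.
   Accessing any object through a char-type pointer is well-defined. */
int has_newline(uint32_t z)
{
    const uint8_t *b = (const uint8_t *)&z;     /* alias the 4 bytes */
    return b[0] == '\n' || b[1] == '\n' ||
           b[2] == '\n' || b[3] == '\n';
}
```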
PE> It was interesting to see some of the C compilers rearranging the
PE> code order, presumably to give the 80386 et al more chance of
PE> processing the instructions in advance. It threw me, actually!
Yes, me too. As I mentioned when you posted that code some months ago
comparing three compilers, some of the optimisations ended up with code
miles away from the source order, presumably to take maximum advantage
of the pipeline.
PE> e.g. with Watcom...
PE> #include
PE> int main(void)
PE> {
PE> char *t;
PE> char *u;
PE> register unsigned int z;
PE> register unsigned int *x;
PE> register unsigned int *y;
PE> x = (unsigned int *)t;
PE> y = (unsigned int *)u;
PE> while (1)
PE> {
PE> z = *x++ = *y++;
PE> if ((z & 0xffU) == 0x0aU) break;
PE> if (((z >>= 8) & 0xffU) == 0x0aU) break;
PE> if (((z >>= 8) & 0xffU) == 0x0aU) break;
PE> if (((z >>= 8) & 0xffU) == 0x0aU) break;
PE> }
PE> return (0);
PE> }
PE> 0004 bf ff 00 00 00 mov edi,000000ffH
PE> 0009 8b 03 L1 mov eax,[ebx]
PE> 000b 83 c2 04 add edx,00000004H
PE> 000e 89 c1 mov ecx,eax
PE> 0010 83 c3 04 add ebx,00000004H
PE> 0013 21 f9 and ecx,edi
PE> 0015 89 42 fc mov -4H[edx],eax
PE> 0018 83 f9 0a cmp ecx,0000000aH
PE> 001b 74 22 je L2
PE> 001d c1 e8 08 shr eax,08H
PE> 0020 89 c1 mov ecx,eax
PE> 0022 21 f9 and ecx,edi
PE> 0024 83 f9 0a cmp ecx,0000000aH
PE> 0027 74 16 je L2
PE> 0029 c1 e8 08 shr eax,08H
PE> 002c 89 c1 mov ecx,eax
PE> 002e 21 f9 and ecx,edi
PE> 0030 83 f9 0a cmp ecx,0000000aH
PE> 0033 74 0a je L2
PE> 0035 c1 e8 08 shr eax,08H
PE> 0038 21 f8 and eax,edi
PE> 003a 83 f8 0a cmp eax,0000000aH
PE> 003d 75 ca jne L1
PE> 003f 31 c0 L2 xor eax,eax
Er, what rearrangement of code order? That looks like a pretty
straightforward implementation of your code to me.
PE> FM> To my surprise the SCASB/MOVSB (or MOVSD) single instructions are *not*
PE> FM> faster than a coded loop, even though you've got to do those extra
PE> FM> register INCs in the loop. This is also the conclusion you came to.
PE> Yeah, I couldn't see any mistake in my arithmetic (adding mostly
PE> 1's together!).
I wasn't questioning your arithmetic, just that (at least my) timing
reference comes with several conditions about when those timings will
be true - like the instruction has already been prefetched and decoded.
I thought the advantage would have been with the string instructions.
PE> FM> Test 6 is basically what your compiler generated. It's good, but can be
PE> FM> improved with hand-crafted asm. I didn't use CX & DX because this
PE> FM> compiler doesn't support that addressing mode.
PE> Hmmm, can be improved as above by changing the C code too, I
PE> would expect.
Yes, by changing it from a char to an unsigned int, as in the above.
(BTW, is an int always 32 bits? What do you call 16 bit ints and 8 bit
ints? In TP you have longint, integer and shortint for 32, 16, 8
respectively.)
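(For the record, C only guarantees minimum sizes: int is "at least 16 bits", and on the 32-bit compilers discussed here it is 32. C99 later added exact-width names in <stdint.h> - int8_t, int16_t, int32_t - which line up with TP's shortint, integer and longint. A quick check of any given compiler, as a sketch:)

```c
#include <limits.h>
#include <stdio.h>

/* Print the actual widths this compiler uses.  On typical 32-bit x86:
   char = 8, short = 16, int = 32, long = 32. */
void show_sizes(void)
{
    printf("char  : %u bits\n", (unsigned)(sizeof(char)  * CHAR_BIT));
    printf("short : %u bits\n", (unsigned)(sizeof(short) * CHAR_BIT));
    printf("int   : %u bits\n", (unsigned)(sizeof(int)   * CHAR_BIT));
    printf("long  : %u bits\n", (unsigned)(sizeof(long)  * CHAR_BIT));
}
```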
PE> FM> Test 10 can be improved further, given the results of 6 vs 5 -
PE> FM> don't put $0a in BL or 4 in DX, make them memory operands.
PE> True.
PE> FM> Tests 9 & 10 are I suppose a variant of the optimisation technique
PE> FM> called loop unrolling. As you can see, it's very effective.
PE> Yeah. One thing though, I would expect that you could keep the
PE> same tight code, but make it NOT copy the extra bytes. It is
PE> the first time I've actually seen any advantage in little-endian
PE> format, but you conveniently have the data just where you want it,
PE> as so...
PE> l2 mov 2[di],ax
PE> cmp al,bl
PE> je @l4
PE> cmp ah,bl
PE> je @l5
PE> mov [di],ax
PE> shr eax,16
PE> cmp al,bl
PE> je @l4
PE> cmp ah,bl
PE> jne @l2
PE> l4 mov [di],ah
PE> jmp @l7
PE> l5 mov [di],ax
PE> jmp @l7
PE> l6 mov [di],ah
Er, I think you're still potentially moving one extra byte, in line 1
and line 6. Plus, you're doing word moves instead of dwords. So
presumably the timing would be somewhere between my test 7 and test 10.
Or I'm misreading the above code 'cos it's not complete.
PE> Ok, so there's one extra instruction used - so sue me! Actually,
PE> you could even get that extra instruction out of the loop if you
PE> wanted to, and move it to the termination cleanup. That would
PE> be best.
You certainly can get the advantage of dword moves without the problem
of overrunning your output buffer, but it takes a little bit more work.
I didn't do so as I didn't think it was a problem. However, you've now
said it is. Do you want to see code, and timing?
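The sort of thing I mean - dword reads with byte-sized stores at the tail, so the output buffer is never overrun - would look roughly like this in C (a sketch, assuming the input buffer is padded past the '\n' as you described, so the 4-byte reads are safe):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: read 4 bytes at a time (input assumed padded), store 4 bytes
   only while the word holds no '\n', then finish with byte stores so
   nothing past the newline is ever written to dst.
   Returns bytes copied, newline included. */
size_t copy_line(char *dst, const char *src)
{
    size_t n = 0;
    for (;;) {
        uint32_t w;
        memcpy(&w, src + n, 4);             /* one dword read */
        const uint8_t *b = (const uint8_t *)&w;
        if (b[0] == '\n' || b[1] == '\n' || b[2] == '\n' || b[3] == '\n') {
            size_t i = 0;                   /* tail: byte stores only */
            while (src[n + i] != '\n') {
                dst[n + i] = src[n + i];
                i++;
            }
            dst[n + i] = '\n';
            return n + i + 1;
        }
        memcpy(dst + n, &w, 4);             /* one dword write */
        n += 4;
    }
}
```

The extra byte loop runs at most three times per call, once, at the end, so the per-iteration cost stays that of the dword loop.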
PE> FM> The code follows my signoff. It's not particularly elegant but I think
PE> FM> it demonstrates what you want to know.
PE> BTW, this wouldn't compile with Turbo Pascal 5.5. It complained
PE> about the asm instruction.
No, asm was introduced with TP6. Before that the only thing you could do
was create so-called 'inline' functions, and you had to specify the code
as hex values!
Regards, FIM.
* * Dammit, where'd I leave that tagline?
---
* Origin: Pedants Inc. (3:711/934.24)
SOURCE: echomail via fidonet.ozzmosis.com