PE> FM> Analysis:
PE> FM> Well! Just drop all your preconceived ideas! For example, Test 6 being
PE> FM> faster than Test 5 is amazing. And therefore I should have run Test 7
PE> FM> again with the $0a in memory.
PE> That isn't a memory operand, that is an immediate operand, ie the
PE> number is part of the instruction, ie the data is already available,
PE> as opposed to having to fetch it out of a register.
FM> I meant an immediate operand. But it's still in memory rather than being
FM> in a register - that byte has to be fetched as part of the instruction
The byte would have already been fetched, as part of the instruction,
is what you should be saying. The slowdown comes when the instruction
wants to access memory, and has to go back to memory to get the data.
In this case, it doesn't need to.
FM> and therefore I expected it to be slower than getting it from a
FM> register. I guess with the cache it doesn't matter, it's not accessing
FM> physical RAM.
And even without a cache, it would be accessing the instructions
32 bits at a time anyway, on a 386DX. ie just the one memory
fetch for the instruction, and you've got a complete instruction.
PE> FM> In your application I should think that's perfectly OK.
PE> Actually, it's perfectly unOK. You see, I can protect against
PE> reading too far by using a sentinel, and I can even protect against
PE> the double-word read, by making sure the INPUT buffer is always 3,
PE> 4, 5, 7, whatever bytes extra in length. But what I have no control
PE> over is the size of the OUTPUT buffer. I cannot write more bytes
PE> than the user has given me permission to do. Failure to observe
PE> this will cause either a trap, or more likely, the first (up to)
PE> 3 bytes of the next variable to be clobbered.
FM> OK, if that's the circumstance. But tell me a couple of things. How are
FM> you putting your \n sentinel into the input buffer? Isn't *that* going
FM> to clobber stuff? I'm assuming that you're processing a text file, with
FM> lines terminated by \n. If you've got an input buffer with some of these
FM> lines in, where are you putting the sentinel without clobbering a char
FM> in the next line? Maybe my assumption's wrong.
I save the value in the input buffer, then clobber it, then do my
search, and then unclobber it. I wrote THAT code right from the
start, and then made the rest of the code have to work its way
around this bit of code!
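In C terms it is roughly this (a sketch only - the name is made up and
the real code has more going on around it):

    /* Scan [start, end) for '\n' using a sentinel: save the byte at
       end, overwrite it with '\n', scan with no length test inside
       the loop, then restore the byte. end must point at a writable
       byte (it does - it is inside my own input buffer). Returns a
       pointer to the first '\n', or end if there wasn't one. */
    static char *scan_nl(char *start, char *end)
    {
        char saved = *end;      /* the byte we are about to clobber */
        char *p = start;

        *end = '\n';            /* sentinel: the loop MUST terminate */
        while (*p != '\n')
            p++;
        *end = saved;           /* unclobber */

        return p;
    }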
FM> Secondly, it sounds like you're saying that the user of your function
FM> will know exactly how long the input is, and potentially allocate a
FM> buffer of just the right size. But you don't know, otherwise you
He has defined a variable, such as char buf[100], which is the
maximum length of a line he can handle. He expects to only read
up to the first '\n' though, but if for some reason he doesn't
get a '\n', he doesn't want a data overrun.
FM> wouldn't have to scan for \n. If he knows, why doesn't he just move that
FM> many bytes rather than doing any scan? Where am I missing the point
FM> here?
That is the maximum, not the amount to read.
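The calling side is basically the fgets() contract (illustration only,
not his actual code):

    #include <stdio.h>

    void caller(FILE *fp)       /* fp already open for reading */
    {
        char buf[100];  /* the MOST he can cope with on one line */

        /* fgets copies up to and including the first '\n', but never
           more than sizeof buf - 1 chars plus the '\0', however long
           the line really is */
        if (fgets(buf, sizeof buf, fp) == NULL)
            return;     /* EOF or error */

        /* ... use the line in buf ... */
    }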
PE> First of all, I couldn't get any of my C compilers to generate
PE> exactly what you had written, but GCC came closest...
PE> 001f 80 fa 0a cmp dl,0aH
PE> 0029 80 fa 0a cmp dl,0aH
FM> And there are a couple of unnecessary SHRs in there too, as it hasn't
FM> worked out that it can look at the byte registers to get that char from
FM> the dword.
It did in one of the cases.
PE> It was interesting to see some of the C compilers rearranging the
PE> code order, presumably to give the 80386 et al more chance of
PE> processing the instructions in advance. It threw me, actually!
PE> z = *x++ = *y++;
PE> if ((z & 0xffU) == 0x0aU) break;
PE> 0004 bf ff 00 00 00 mov edi,000000ffH
PE> 0009 8b 03 L1 mov eax,[ebx]
PE> 000b 83 c2 04 add edx,00000004H
PE> 000e 89 c1 mov ecx,eax
PE> 0010 83 c3 04 add ebx,00000004H
PE> 0013 21 f9 and ecx,edi
PE> 0015 89 42 fc mov -4H[edx],eax
PE> 0018 83 f9 0a cmp ecx,0000000aH
FM> Er, what rearrangement of code order? That looks like a pretty
FM> straightforward implementation of your code to me.
It is mucking around with ECX before it has finished the
z = *x++ = *y++.
FM> Yes, by changing it from a char to an unsigned int, as in the above.
FM> (BTW, is an int always 32 bits? What do you call 16 bit ints and 8 bit
FM> ints? In TP you have longint, integer and shortint for 32, 16, 8
FM> respectively.)
No it isn't. "int" is the natural wordsize of a machine. It is
guaranteed to be a MINIMUM of 16 bits. "long" is guaranteed to
be a minimum of 32 bits. "short" is guaranteed to be a minimum
of 16 bits, and also it is guaranteed to be <= sizeof int. "char"
is guaranteed to be a minimum of 8 bits.
In this case, I was happy to write code targeted to my platform - the
actual sizes are part of the spec of a C compiler!
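If you want to see what a given compiler actually gives you, this
prints it (nothing clever, just sizeof and limits.h):

    #include <stdio.h>
    #include <limits.h>

    int main(void)
    {
        /* the guarantees are minimums: char >= 8 bits, short and int
           >= 16 bits, long >= 32 bits; the real sizes are whatever
           your compiler's documentation says */
        printf("char  %u byte(s), CHAR_BIT = %d\n",
               (unsigned)sizeof(char), CHAR_BIT);
        printf("short %u byte(s)\n", (unsigned)sizeof(short));
        printf("int   %u byte(s)\n", (unsigned)sizeof(int));
        printf("long  %u byte(s)\n", (unsigned)sizeof(long));
        return 0;
    }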
PE> Yeah. One thing though, I would expect that you could keep the
PE> same tight code, but make it NOT copy the extra bytes. It is
PE> the first time I've actually seen any advantage in little-endian
PE> format, but you conveniently have the data just where you want it,
PE> as so...
PE> l2 mov 2[di],ax
PE> cmp al,bl
PE> je @l4
PE> cmp ah,bl
PE> je @l5
PE> mov [di],ax
PE> shr eax,16
PE> cmp al,bl
PE> je @l4
PE> cmp ah,bl
PE> jne @l2
PE> l4 mov [di],ah
PE> jmp @l7
PE> l5 mov [di],ax
PE> jmp @l7
PE> l6 mov [di],ah
FM> Er, I think you're still potentially moving one extra byte, in line 1
FM> and line 6.
FM> Or I'm misreading the above code 'cos it's not complete.
Line 6 is the easiest to demonstrate. There is a compare of both
al and ah before doing the halfword move, so there's no problem.
PE> Ok, so there's one extra instruction used - so sue me! Actually,
FM> Plus, you're doing word moves instead of dwords. So
Two halfword *stores* instead of one dword store is where the
extra instruction comes from.
PE> you could even get that extra instruction out of the loop if you
PE> wanted to, and move it to the termination cleanup. That would
PE> be best.
FM> You certainly can get the advantage of dword moves without the problem
FM> of overrunning your output buffer, but it takes a little bit more work.
Actually, I've changed my mind about that. I don't think it is
possible to reduce the timing of instructions within that loop,
without causing an overrun. I was basically moving the testing
before the move, and then special-casing after the loop, but the
testing I was doing would destroy the register.
HOWEVER, what could be done is if I give you the maximum length
as WELL as just the '\n' terminator, then it could be written
just as fast.
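Something along these lines for the interface (name and argument
order are just for illustration):

    /* copy from src to dst, stopping after the first '\n' has been
       copied or after maxlen bytes, whichever comes first; knowing
       maxlen up front lets the dword loop test BEFORE it stores */
    unsigned copyline(char *dst, const char *src, unsigned maxlen);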
One other thing that hasn't been taken into account here is that
dword accesses are meant to be faster if they're dword aligned.
There is no logic in here to access stuff on a dword boundary,
and do something else for the odd <= 3 bytes either end. Do
you think that would make a difference?
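The shape I have in mind is the usual one - peel off odd bytes until
the source is on a 4-byte boundary, do the bulk with dwords, then mop
up the tail. A plain copy to show the shape (no '\n' test here, and it
assumes a 32-bit unsigned long and x86, where an unaligned access only
costs cycles rather than trapping):

    /* copy n bytes from src to dst, doing the aligned middle part a
       dword at a time */
    void copy_aligned(char *dst, const char *src, unsigned n)
    {
        /* head: odd bytes until src sits on a dword boundary */
        while (n > 0 && ((unsigned long)src & 3) != 0) {
            *dst++ = *src++;
            n--;
        }
        /* middle: whole dwords (32-bit longs assumed) */
        while (n >= 4) {
            *(unsigned long *)dst = *(const unsigned long *)src;
            dst += 4;
            src += 4;
            n -= 4;
        }
        /* tail: the last <= 3 bytes */
        while (n > 0) {
            *dst++ = *src++;
            n--;
        }
    }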
FM> I didn't do so as I didn't think it was a problem. However, you've now
FM> said it is. Do you want to see code, and timing?
I think what I'll take out of this discussion is that I can get
a worthwhile (10-25% or something) improvement by using dword
accesses even with my C code, which I should implement in my
fgets() function. I can add the speedups we have canvassed above.
The only thing I can't get is the fancier code to match your
assembler. I'm not sure what the penalty for that will be. That
would determine what gain I could get from going assembler.
Let me rework the C code first, and see if I can frame the problem
in terms of the new code.
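For reference, the shape I am aiming at in C (a sketch only - the real
fgets() has the buffering and '\0' termination around it, and this
assumes a 32-bit unsigned long and little-endian byte order, ie the
386, plus an input buffer padded/sentinelled as above so the dword
read can safely go a few bytes long):

    /* copy at most max bytes from src to dst, stopping after the
       first '\n' has been copied; reads src a dword at a time but
       stores one byte at a time, so it never writes past dst[max-1]
       or past the '\n'. Returns the number of bytes stored. */
    unsigned copy_to_nl(char *dst, const char *src, unsigned max)
    {
        unsigned n = 0;

        while (n + 4 <= max) {
            unsigned long w = *(const unsigned long *)(src + n);
            unsigned i;

            /* little-endian: the first byte of the line is the LOW
               byte of w, which is the convenient bit */
            for (i = 0; i < 4; i++) {
                char c = (char)(w >> (8 * i));
                dst[n + i] = c;
                if (c == '\n')
                    return n + i + 1;
            }
            n += 4;
        }
        while (n < max) {               /* tail, a byte at a time */
            char c = src[n];
            dst[n] = c;
            n++;
            if (c == '\n')
                break;
        }
        return n;
    }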
BFN. Paul.
---
* Origin: X (3:711/934.9)