FM> fetch). On an 8086, it's 9 clocks for an immediate operand vs 2 for a
FM> register; even on a 286 it's 3/2. As you've seen, for a 386+ it's the
FM> same.
My book tells me 4 vs 2 on an 8086 and 2 vs 2 on an 80286.
PE> I save the value in the input buffer, then clobber it, then do my
PE> search, and then unclobber it. I wrote THAT code right from the
PE> start, and then made the rest of the code have to work its way
PE> around this bit of code!
FM> You save *that* byte, or the whole buffer? (I assume the former.) How
The former.
FM> far along do you put the sentinel byte? How do you know the user's
The maximum length of the user's buffer, or the amount remaining
in my input buffer, whichever is the lesser.
FM> input buffer is that long?
It's my input buffer, his output buffer. He told me the length of
his output buffer.
He makes the call:
char buf[80];
fgets(buf, 80, filePointer);
Obviously the C library doesn't want to read a byte-at-a-time from
the physical disk, so I have an input buffer, say 6144 bytes in
size, and then I read 6144 bytes at a time, and then copy from the
input buffer to the user's output buffer on request.
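
A minimal sketch of the save/clobber/search/unclobber idea described
above. This is not the actual library code: copy_line and its
parameters are hypothetical names, and it assumes max >= 2 and that
the input buffer has one writable spare byte at in[avail].

```c
#include <stddef.h>

/*
 * Hypothetical helper (illustration only).
 * Copies from in[] to out[], stopping after the first '\n' or after
 * max - 1 characters, whichever comes first, then appends '\0'.
 * Assumes in[avail] is writable (one spare byte after the valid data).
 * Returns the number of characters copied, not counting the '\0'.
 */
size_t copy_line(char *out, size_t max, char *in, size_t avail)
{
    size_t limit = max - 1;          /* leave room for the '\0' */
    if (avail < limit)
        limit = avail;               /* whichever is the lesser */

    char saved = in[limit];          /* save the byte about to be clobbered */
    in[limit] = '\n';                /* sentinel: the scan must stop here */

    size_t n = 0;
    while (in[n] != '\n') {          /* no separate length check in the loop */
        out[n] = in[n];
        n++;
    }

    in[limit] = saved;               /* unclobber */

    if (n < limit) {                 /* stopped at a real newline: keep it */
        out[n] = '\n';
        n++;
    }
    out[n] = '\0';
    return n;
}
```

With char buf[80], a call such as copy_line(buf, 80, inputPtr, bytesLeft)
would never write past buf[79]. If the scan stops because the input
buffer ran dry, a real fgets() would still have to refill and carry on;
that part is left out here.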
PE> He has defined a variable, such as char buf[100], which is the
PE> maximum length of a line he can handle. He expects to only read
PE> up to the first '\n' though, but if for some reason he doesn't
PE> get a '\n', he doesn't want a data overrun.
FM> You've explained this more fully in another message, to which I replied
FM> this morning. It's a valid consideration but I wasn't thinking that this
FM> was required to be compliant with an ISO-defined function of the same
FM> name. I thought you might be in full control, in the sense of being able
FM> to require the user (probably you) to allocate an output buffer at least
FM> 4 bytes longer than where you put your sentinel.
I can do that. Or more to the point, he tells me the maximum
length, and I can subtract 4 from that before doing my stuff.
However, what I can't do is copy even 1 extra character more
than he asked for. Actually, since I need to put an extra
character on the end anyway (the '\0'), I can afford to copy
1 extra character, and then potentially replace it with a '\0'.
PE> PE> 001f 80 fa 0a cmp dl,0aH
PE> PE> 0029 80 fa 0a cmp dl,0aH
PE> FM> And there are a couple of unnecessary SHRs in there too, as it hasn't
PE> FM> worked out that it can look at the byte registers to get that char from
PE> FM> the dword.
PE> It did in one of the cases.
FM> I don't think so. I've just reread the code you posted, and in *all*
FM> examples it does shifts or masks or both. What I meant was looking at
To access one of the bytes, it figured out it just needed to do
a cmp dl,0ah, rather than do a mask. The code is above.
FM> say dh for the second-rightmost byte, rather than either shifting edx 8
FM> bits to the right, or masking edx with $0000ff00 and testing for
FM> $00000a00.
Yes, none of them figured it out for dh, just dl.
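
For reference, here is roughly what the three ways of testing the
second-lowest byte look like when written in C. The function names
are made up for illustration. Whether a given compiler turns any of
them into a plain cmp dh,0aH is up to that compiler; this only shows
that the three forms ask the same question.

```c
#include <stdint.h>

/* Shift the dword right 8 bits, then look at the low byte. */
int second_byte_by_shift(uint32_t x)
{
    return ((x >> 8) & 0xffu) == 0x0au;
}

/* Mask with 0000ff00 and compare against 00000a00, no shift. */
int second_byte_by_mask(uint32_t x)
{
    return (x & 0x0000ff00u) == 0x00000a00u;
}

/* Narrow to a byte; a compiler is free to use a byte register here. */
int second_byte_by_cast(uint32_t x)
{
    return (unsigned char)(x >> 8) == 0x0a;
}
```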
PE> PE> 0004 bf ff 00 00 00 mov edi,000000ffH
PE> PE> 0009 8b 03 L1 mov eax,[ebx]
PE> PE> 000b 83 c2 04 add edx,00000004H
PE> PE> 000e 89 c1 mov ecx,eax
PE> PE> 0010 83 c3 04 add ebx,00000004H
PE> PE> 0013 21 f9 and ecx,edi
PE> PE> 0015 89 42 fc mov -4H[edx],eax
PE> PE> 0018 83 f9 0a cmp ecx,0000000aH
FM> Oh yes, I see what you mean. And they fucked up - there's no need for
FM> that move into ecx at all. Just use eax and move the and to just before
FM> the cmp.
You need it if you are planning on doing a destructive test of ecx,
which they are.
PE> No it isn't. "int" is the natural wordsize of a machine. It is
PE> guaranteed to be a MINIMUM of 16 bits. "long" is guaranteed to
PE> be a minimum of 32 bits. "short" is guaranteed to be a minimum
PE> of 16 bits, and also it is guaranteed to be <= sizeof int.
PE> "char" is guaranteed to be a minimum of 8 bits.
FM> Yes of course. But given that you are writing for a particular hardware
FM> platform, is there no way of specifying that you specifically want any
FM> of the natural memory access granularities? I mean according to your
FM> definition above, a future C compiler could have int, short and char all
FM> 32 bits, which might stuff up your careful optimisations FOR THAT
FM> PLATFORM.
A compiler is targeted for a machine. On a machine where the
hardware stores characters in 32-bit variables, and only has
32-bit registers & variables, yes, I would expect the C compiler
to match that. That machine is not an 80386.
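
If you want the code to check what it has been given, something
like the following would do. It is only a sketch with made-up
function names: the first function tests the minimums quoted above
using <limits.h>, and the second tests the sizes I would assume
for a 32-bit 80386 compiler (that part is an assumption about the
target, not something the standard promises).

```c
#include <limits.h>

/* Returns 1 if the compiler meets the minimum ranges ISO C requires. */
int meets_iso_minimums(void)
{
    return CHAR_BIT >= 8
        && SHRT_MAX >= 32767
        && INT_MAX  >= 32767
        && LONG_MAX >= 2147483647L
        && SHRT_MAX <= INT_MAX      /* short is never wider than int */
        && INT_MAX  <= LONG_MAX;    /* int is never wider than long */
}

/* Returns 1 if the sizes match what is assumed here for a 32-bit
   80386 target: 8-bit char, 2-byte short, 4-byte int. */
int looks_like_32bit_x86(void)
{
    return CHAR_BIT == 8 && sizeof(short) == 2 && sizeof(int) == 4;
}
```

Code tuned for dword accesses could check looks_like_32bit_x86()
once at startup, or do the same test at compile time with #if on
the <limits.h> macros, and fall back to a plain byte loop otherwise.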
PE> PE> l2 mov 2[di],ax
PE> PE> cmp al,bl
PE> PE> je {at}l4
PE> PE> cmp ah,bl
PE> PE> je {at}l5
PE> PE> mov [di],ax
PE> PE> shr eax,16
PE> PE> cmp al,bl
PE> PE> je {at}l4
PE> PE> cmp ah,bl
PE> PE> jne {at}l2
PE> PE> l4 mov [di],ah
PE> PE> jmp {at}l7
PE> PE> l5 mov [di],ax
PE> PE> jmp {at}l7
PE> PE> l6 mov [di],ah
PE> FM> Er, I think you're still potentially moving one extra byte, in line 1
PE> FM> and line 6.
PE> FM> Or I'm misreading the above code 'cos it's not complete.
PE> Line 6 is the easiest to demonstrate. There is a compare of both
PE> al and ah before doing the halfword move, so there's no problem.
FM> But haven't you already moved a full word in line 1?
Hmm, terminology problem. I'm used to byte, halfword and fullword,
but here I have to talk byte, fullword, dword. Anyway, using the
latter (proper) terminology...
The idea is to move a dword safely. I do this by moving two
fullwords (16 bits each). Before I move each fullword, I test each
byte in that fullword. You can count the 4 CMPs to see the
tests being done. Each test is done before the corresponding move.
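
In C the same pattern looks roughly like this. It is a sketch, not
the code I posted: copy_until is a made-up name, it assumes the stop
byte is guaranteed to turn up (for example the sentinel), and it
assumes src has at least 3 readable bytes after the stop byte,
because it loads 4 bytes at a time.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch. Copies bytes from src to dst up to, but not
 * including, the first `stop` byte, and returns how many were copied.
 * Each 2-byte store happens only after both of its bytes have been
 * tested, so nothing at or beyond the stop byte is ever written.
 */
size_t copy_until(unsigned char *dst, const unsigned char *src,
                  unsigned char stop)
{
    size_t n = 0;
    for (;;) {
        unsigned char b[4];
        memcpy(b, src + n, 4);            /* one 4-byte load */

        if (b[0] == stop) return n;
        if (b[1] == stop) { dst[n] = b[0]; return n + 1; }
        memcpy(dst + n, b, 2);            /* first 2-byte store */

        if (b[2] == stop) return n + 2;
        if (b[3] == stop) { dst[n + 2] = b[2]; return n + 3; }
        memcpy(dst + n + 2, b + 2, 2);    /* second 2-byte store */

        n += 4;
    }
}
```

Four compares and two stores per dword, the same count as in the
assembler, although the order of the halves here is simply low then
high.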
PE> PE> Ok, so there's one extra instruction used - so sue me! Actually,
PE> FM> Plus, you're doing word moves instead of dwords. So
PE> Two halfword *stores* instead of one dword store is where the
PE> extra instruction comes from.
FM> I meant store. And what I'm saying is that I think you should do a dword
FM> store when you can, if not then a word, if not then a byte. Although
FM> given that those last 2 possibilities only happen for the leftovers,
FM> storing up to 3 bytes individually ain't gonna make much difference.
And I'm saying you can't do a dword store ever. Because you need
a destructive test, which means you need to reload your register.
Or else you need to overrun your buffer, which is not allowed.
PE> Actually, I've changed my mind about that. I don't think it is
PE> possible to reduce the timing of instructions within that loop,
PE> without causing an overrun. I was basically moving the testing
PE> before the move, and then special case after the loop, but the
PE> testing I was doing would destroy the register.
FM> That's what I think you should do, and not use a destructive test. But
FM> you may not be able to persuade your compiler to generate that. But
FM> then, given that you're trying to write code optimised for a particular
FM> hardware platform, need you have any objection to switching to the built
FM> in assembler? Or making it an external function written in asm, as you
FM> want for the maths library?
Remember that memcpy() function I posted? I had a devil of a
problem trying to get that to work, because the calling
conventions are different all the time. I am going to have to
investigate/document/query that, because (roughly):
CSET has 2 calling conventions
BCC has 4 calling conventions
GCC has unknown calling conventions
WCC has 2/3 calling conventions
PE> HOWEVER, what could be done is if I give you the maximum length
PE> as WELL as just the '\n' terminator, then it could be written
PE> just as fast.
FM> And, as I said in another message, I think that's how the function
FM> should be specified for maximum safety, protecting users from
FM> themselves. But if you're implementing an ISO-specified function maybe
FM> you don't have any choice about the interface.
I don't have a choice, but it is already safe. I used a sentinel
to maintain the safety, which is why I reformulated the problem
with "apparently" no protection.
PE> I think what I'll take out of this discussion is that I can get
PE> a worthwhile (10-25% or something) improvement by using dword
PE> accesses even with my C code, which I should implement in my
PE> fgets() function. I can add the speedups we have canvassed above.
PE> The only thing I can't get is the fancier code to match your
PE> assembler. I'm not sure what the penalty for that will be. That
PE> would determine what gain I could get from going assembler.
PE> Let me rework the C code first, and see if I can form the problem
PE> in terms of the new code.
FM> Well don't you need to implement whatever the ISO definition of fgets()
FM> is? (I don't know what fgets() is/does.)
Yes I do. But I can still get the code that I posted before,
whilst in C. I have done so, in fact. I might be making more
mods though, I'm not sure. BFN. Paul.
---
* Origin: X (3:711/934.9)