Subject: movsb
Hi, Paul.
PE> PE> FM> Analysis:
PE> PE> FM> Well! Just drop all your preconceived ideas! For example, Test 6
PE> PE> FM> being faster than Test 5 is amazing. And therefore I should have
PE> PE> FM> run Test 7 again with the $0a in memory.
PE> PE> That isn't a memory operand, that is an immediate operand, ie the
PE> PE> number is part of the instruction, ie the data is already available,
PE> PE> as opposed to having to fetch it out of a register.
It still has to get it from memory; that byte is in memory, immediately
following the instruction. On a 386+, this may make no difference
whatsoever as you say - both because of the cache and because that byte
has probably already been fetched in the 32-bit read (but it might not
have been, if the opcode happened to be the last byte of a 32-bit
fetch). On an 8086, it's 9 clocks for an immediate operand vs 2 for a
register, even on a 286 it's 3/2. As you've seen for a 386+ it's the
same.
I bet I could come up with code which was slower for an immediate
operand than a register operand, even on a 486. But it would be quite
artificial.
PE> FM> I meant an immediate operand. But it's still in memory rather than
PE> FM> being in a register - that byte has to be fetched as part of the
PE> FM> instruction
PE> The byte would have already been fetched, as part of the instruction,
PE> is what you should be saying. The slowdown comes when the instruction
PE> wants to access memory, and has to go back to memory to get the data.
PE> In this case, it doesn't need to.
See above.
PE> FM> and therefore I expected it to be slower than getting it from a
PE> FM> register. I guess with the cache it doesn't matter, it's not accessing
PE> FM> physical RAM.
PE> And even without a cache, it would be accessing the instructions
PE> 32-bits at a time anyway, on a 386DX. ie just the one memory
PE> fetch for the instruction, and you've got a complete instruction.
See above.
PE> PE> FM> In your application I should think that's perfectly OK.
PE> PE> Actually, it's perfectly unOK. You see, I can protect against
PE> PE> reading too far by using a sentinel, and I can even protect against
PE> PE> the double-word read, by making sure the INPUT buffer is always 3,
PE> PE> 4, 5, 7, whatever bytes extra in length. But what I have no control
PE> PE> over is the size of the OUTPUT buffer. I cannot write more bytes
PE> PE> than the user has given me permission to do. Failure to observe
PE> PE> this will cause either a trap, or more likely, the first (up to)
PE> PE> 3 bytes of the next variable to be clobbered.
PE> FM> OK, if that's the circumstance. But tell me a couple of things. How are
PE> FM> you putting your \n sentinel into the input buffer? Isn't *that* going
PE> FM> to clobber stuff? I'm assuming that you're processing a text file, with
PE> FM> lines terminated by \n. If you've got an input buffer with some of
PE> FM> these lines in, where are you putting the sentinel without
PE> FM> clobbering a char in the next line? Maybe my assumption's wrong.
PE> I save the value in the input buffer, then clobber it, then do my
PE> search, and then unclobber it. I wrote THAT code right from the
PE> start, and then made the rest of the code have to work it's way
PE> around this bit of code!
You save *that* byte, or the whole buffer? (I assume the former.) How
far along do you put the sentinel byte? How do you know the user's
input buffer is that long?
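The save/clobber/search/restore idea being discussed might be sketched in C as below. This is my illustration, not PE's actual code, and the function name and interface are mine; it assumes the caller guarantees `buf` is at least `len+1` bytes, which is exactly the question raised above.

```c
#include <stddef.h>

/* Find the first '\n' in buf[0..len-1] without a length test in the
   inner loop: temporarily plant a '\n' sentinel at buf[len], scan,
   then restore the clobbered byte. Requires buf to be len+1 bytes. */
static size_t find_newline(char *buf, size_t len)
{
    char saved = buf[len];     /* save the byte we are about to clobber */
    char *p = buf;

    buf[len] = '\n';           /* plant the sentinel */
    while (*p != '\n')         /* no bounds check needed in the loop */
        p++;
    buf[len] = saved;          /* unclobber */
    return (size_t)(p - buf);  /* == len if no real '\n' was found */
}
```

Note the restore step means the buffer is unchanged on return, even when the sentinel position held live data from the next line.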
PE> FM> Secondly, it sounds like you're saying that the user of your function
PE> FM> will know exactly how long the input is, and potentially allocate a
PE> FM> buffer of just the right size. But you don't know, otherwise you
PE> He has defined a variable, such as char buf[100], which is the
PE> maximum length of a line he can handle. He expects to only read
PE> up to the first '\n' though, but if for some reason he doesn't
PE> get a '\n', he doesn't want a data overrun.
You've explained this more fully in another message, to which I replied
this morning. It's a valid consideration but I wasn't thinking that this
was required to be compliant with an ISO-defined function of the same
name. I thought you might be in full control, in the sense of being able
to require the user (probably you) to allocate an output buffer at least
4 bytes longer than where you put your sentinel.
PE> FM> wouldn't have to scan for \n. If he knows, why doesn't he just move
PE> FM> that many bytes rather than doing any scan? Where am I missing the
PE> FM> point here?
PE> That is the maximum, not the amount to read.
I now understand what I was missing.
PE> PE> First of all, I couldn't get any of my C compilers to generate
PE> PE> exactly what you had written, but GCC came closest...
PE> PE> 001f 80 fa 0a cmp dl,0aH
PE> PE> 0029 80 fa 0a cmp dl,0aH
PE> FM> And there are a couple of unnecessary SHRs in there too, as it hasn't
PE> FM> worked out that it can look at the byte registers to get that char from
PE> FM> the dword.
PE> It did in one of the cases.
I don't think so. I've just reread the code you posted, and in *all*
examples it does shifts or masks or both. What I meant was looking at
say dh for the second-rightmost byte, rather than either shifting edx 8
bits to the right, or masking edx with $0000ff00 and testing for
$00000a00.
PE> PE> It was interesting to see some of the C compilers rearranging the
PE> PE> code order, presumably to give the 80386 et al more chance of
PE> PE> processing the instructions in advance. It threw me, actually!
PE> PE> z = *x++ = *y++;
PE> PE> if ((z & 0xffU) == 0x0aU) break;
PE> PE> 0004 bf ff 00 00 00 mov edi,000000ffH
PE> PE> 0009 8b 03 L1 mov eax,[ebx]
PE> PE> 000b 83 c2 04 add edx,00000004H
PE> PE> 000e 89 c1 mov ecx,eax
PE> PE> 0010 83 c3 04 add ebx,00000004H
PE> PE> 0013 21 f9 and ecx,edi
PE> PE> 0015 89 42 fc mov -4H[edx],eax
PE> PE> 0018 83 f9 0a cmp ecx,0000000aH
PE> FM> Er, what rearrangement of code order? That looks like a pretty
PE> FM> straightforward implementation of your code to me.
PE> It is mucking around with ECX before it has finished the
PE> z = *x++ = *y++.
Oh yes, I see what you mean. And they fucked up - there's no need for
that move into ecx at all. Just use eax, and move the AND down to just
before the CMP.
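For reference, the loop the listings above were compiled from might be written out in full like this. This is my reconstruction, not the code PE posted: the function name is mine, it assumes `uint32_t` and a little-endian target, and - like the versions discussed above - it may copy up to 3 bytes past the '\n', so both buffers need padding to a 4-byte multiple.

```c
#include <stdint.h>

/* Copy 32-bit words from src to dst until a word containing '\n'
   (0x0a) has been copied, then return a pointer just past that word
   in dst. Assumes a '\n' is present and both buffers are padded to a
   multiple of 4 bytes. Byte positions assume little-endian layout. */
static uint32_t *copy_to_newline(uint32_t *dst, const uint32_t *src)
{
    for (;;) {
        uint32_t z = *dst++ = *src++;            /* z = *x++ = *y++ */
        if ((z & 0xffU)       == 0x0aU)   break; /* byte 0 */
        if ((z & 0xff00U)     == 0x0a00U) break; /* byte 1 */
        if ((z & 0xff0000U) == 0x0a0000U) break; /* byte 2 */
        if ((z >> 24)         == 0x0aU)   break; /* byte 3 */
    }
    return dst;
}
```

Written this way there is no spare register for the compiler to waste: each test masks and compares `z` directly, which is the ecx-free shape suggested above.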
PE> FM> Yes, by changing it from a char to an unsigned int, as in the above.
PE> FM> (BTW, is an int always 32 bits? What do you call 16 bit ints and 8 bit
PE> FM> ints? In TP you have longint, integer and shortint for 32, 16, 8
PE> FM> respectively.)
PE> No it isn't. "int" is the natural wordsize of a machine. It is
PE> guaranteed to be a MINIMUM of 16 bits. "long" is guaranteed to
PE> be a minimum of 32 bits. "short" is guaranteed to be a minimum
PE> of 16 bits, and also it is guaranteed to be <= sizeof int.
PE> "char" is guaranteed to be a minimum of 8 bits.
PE> In this case, I was happy to write code targeted to my platform,
PE> it is part of the spec of a C compiler!
Yes of course. But given that you are writing for a particular hardware
platform, is there no way of specifying that you specifically want any
of the natural memory access granularities? I mean according to your
definition above, a future C compiler could have int, short and char all
32 bits, which might stuff up your careful optimisations FOR THAT
PLATFORM.
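As it happens, later C standards answered exactly this question: C99's `<stdint.h>` (which postdates this exchange) provides exact-width types that cannot silently change size under a future compiler. The TP comparisons below are my gloss, matching the longint/integer/shortint mapping FM asked about.

```c
#include <stdint.h>

/* C99 <stdint.h> pins down exactly the widths discussed above,
   independent of what a compiler picks for plain int/short/char: */
typedef uint8_t  byte8;    /* exactly 8 bits  (cf. TP shortint, unsigned) */
typedef uint16_t word16;   /* exactly 16 bits (cf. TP integer, unsigned)  */
typedef uint32_t dword32;  /* exactly 32 bits (cf. TP longint, unsigned)  */
```

With these, an optimisation written for a particular memory-access granularity keeps that granularity even if the platform's "natural" int grows.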
PE> PE> Yeah. One thing though, I would expect that you could keep the
PE> PE> same tight code, but make it NOT copy the extra bytes. It is
PE> PE> the first time I've actually seen any advantage in little-endian
PE> PE> format, but you conveniently have the data just where you want it,
PE> PE> as so...
PE> PE> l2 mov 2[di],ax
PE> PE>    cmp al,bl
PE> PE>    je @l4
PE> PE>    cmp ah,bl
PE> PE>    je @l5
PE> PE>    mov [di],ax
PE> PE>    shr eax,16
PE> PE>    cmp al,bl
PE> PE>    je @l4
PE> PE>    cmp ah,bl
PE> PE>    jne @l2
PE> PE> l4 mov [di],ah
PE> PE>    jmp @l7
PE> PE> l5 mov [di],ax
PE> PE>    jmp @l7
PE> PE> l6 mov [di],ah
PE> FM> Er, I think you're still potentially moving one extra byte, in line 1
PE> FM> and line 6.
PE> FM> Or I'm misreading the above code 'cos it's not complete.
PE> Line 6 is the easiest to demonstrate. There is a compare of both
PE> al and ah before doing the halfword move, so there's no problem.
But haven't you already moved a full word in line 1?
PE> PE> Ok, so there's one extra instruction used - so sue me! Actually,
PE> FM> Plus, you're doing word moves instead of dwords. So
PE> Two halfword *stores* instead of one dword store is where the
PE> extra instruction comes from.
I meant store. And what I'm saying is that I think you should do a dword
store when you can, if not then a word, if not then a byte. Although
given that those last 2 possibilities only happen for the leftovers,
storing up to 3 bytes individually ain't gonna make much difference.
PE> PE> you could even get that extra instruction out of the loop if you
PE> PE> wanted to, and move it to the termination cleanup. That would
PE> PE> be best.
PE> FM> You certainly can get the advantage of dword moves without the problem
PE> FM> of overrunning your output buffer, but it takes a little bit more work.
PE> Actually, I've changed my mind about that. I don't think it is
PE> possible to reduce the timing of instructions within that loop,
PE> without causing an overrun. I was basically moving the testing
PE> before the move, and then special case after the loop, but the
PE> testing I was doing would destroy the register.
That's what I think you should do, using a non-destructive test. You
may not be able to persuade your compiler to generate that - but then,
given that you're trying to write code optimised for a particular
hardware platform, need you have any objection to switching to the
built-in assembler? Or to making it an external function written in
asm, as you want for the maths library?
PE> HOWEVER, what could be done is if I give you the maximum length
PE> as WELL as just the '\n' terminator, then it could be written
PE> just as fast.
And, as I said in another message, I think that's how the function
should be specified for maximum safety, protecting users from
themselves. But if you're implementing an ISO-specified function maybe
you don't have any choice about the interface.
PE> One other thing that hasn't been taken into account here is that
PE> dword accesses are meant to be faster if they're dword aligned.
PE> There is no logic in here to access stuff on a dword boundary,
PE> and do something else for the odd <= 3 bytes either end. Do
PE> you think that would make a difference?
Yes, very much so. As you have since discovered, and as my message
yesterday demonstrates.
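The alignment prologue being discussed might look like the sketch below. This is my illustration, not code from the thread: the function name is mine, and it only handles the front end - copy bytes one at a time until the source pointer reaches a 4-byte boundary, after which the caller can switch to an aligned dword loop (with a similar byte loop for the <= 3 leftover bytes at the other end).

```c
#include <stddef.h>
#include <stdint.h>

/* Copy bytes until src+n lands on a 4-byte boundary, so the main
   loop can run dword-aligned. Returns the number of prefix bytes
   copied (0..3), stopping early if '\n' turns up in the prefix. */
static size_t align_prefix(char *dst, const char *src)
{
    size_t n = 0;
    while (((uintptr_t)(src + n) & 3u) != 0) {
        dst[n] = src[n];
        if (src[n] == '\n')
            return n + 1;   /* hit the terminator in the prefix */
        n++;
    }
    return n;
}
```

Only the *source* can be forced onto a boundary this way; if source and destination have different low bits, one side's dword accesses stay misaligned, which is part of why the measured gain varies.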
PE> FM> I didn't do so as I didn't think it was a problem. However, you've now
PE> FM> said it is. Do you want to see code, and timing?
PE> I think what I'll take out of this discussion is that I can get
PE> a worthwhile (10-25% or something) improvement by using dword
PE> accesses even with my C code, which I should implement in my
PE> fgets() function. I can add the speedups we have canvassed above.
PE> The only thing I can't get is the fancier code to match your
PE> assembler. I'm not sure what the penalty for that will be. That
PE> would determine what gain I could get from going assembler.
PE> Let me rework the C code first, and see if I can form the problem
PE> in terms of the new code.
Well don't you need to implement whatever the ISO definition of fgets()
is? (I don't know what fgets() is/does.)
Regards, FIM.
* * Just what does that mean?
---
* Origin: Pedants Inc. (3:711/934.24)
SOURCE: echomail via fidonet.ozzmosis.com