The Natural Philosopher writes:
> On 11/07/18 14:01, Richard Kettlewell wrote:
>> writes:
>>> With oodles (Soon to be tera-oodles?) of RAM available
>>> on the RPi3, and in future releases likely to be even more,
>>> is there any point to continuing to cater for bytes, halfwords
>>> and words, when everything, including CHAR, can be a 64 bit
>>> quantity?
>>
>> Yes, performance.
>
> OK. I'll bite, how would performance be affected?
>
> I've watched int go from 16 to 64 bits and stuff just got faster..:-)
>
> How is 'load a byte' DONE on a 64 bit processor other than load
> [aligned] and shift/mask?
>
> How does a compiler treat
> char *p,c;
> for (i=0;i<327;i++)
> {
> c=*p++;
> echo(c);
> }
>
> Is it fetching *p as a 64 bit chunk and manipulating it, or is it
> retrieving the same 64 bits of memory over and over and taking a
> different byte? Or is it cached and cache aware? Or does the processor
> itself have some magic whereby repeated calls to a pointer
> incrementing a byte at a time are dealt with differently for 64 bit
> aligned and non-aligned addresses?
Main memory is _very slow_ compared to the CPU - the latency of a single
read can be 100 CPU cycles or more, during which time your CPU could,
at worst, be completely idle. (https://gist.github.com/jboner/2841832
gives 2012 numbers, but the Pi isn’t exactly bleeding edge hardware so
they don’t seem inappropriate...)
Since, as you’ve noticed, our computers have got substantially faster
since the 1980s, there must be something addressing this problem, and
you’re right that it involves caching.
The effect of a memory read, even if only a single byte is requested, is
to fill (depending on the technology) up to 64 bytes in the cache[1]. So
a subsequent read (of any size) at a nearby address will be much faster
than the initial read.
[1] in fact there are usually several levels of cache
In the current world, where each ASCII character is represented by 1
byte, that means that when processing a nontrivial amount of data, you
only need to pay that 100+-cycle cost once every 64 characters - so you
could run at roughly 1.6 cycles per character. If each character were 8
bytes instead then your best case is 12.5 cycles per character.
That’s one effect. Another is that the cache is relatively small (for
instance the Pi 3 has a 32KB L1 cache). If you make each character 8
times as big as it needs to be then the effect is (roughly speaking) to
divide the effectiveness of the cache by the same factor.
The exact size of these effects will depend on what kind of data you’re
dealing with (there’s more to life than ASCII) and what you’re doing
with it (if you’re doing 100s of cycles per character of work anyway
then a bit of extra latency is neither here nor there, though the cache
occupancy effects may well still be significant).
Elsewhere:
| Not really an issue, for, if you're chasing execution time on
| a 1GHz processor, then get yourself a 2GHz processor.
Won’t help. The speed of the CPU is not the problem.
--
https://www.greenend.org.uk/rjk/