echo: rberrypi
to: THE NATURAL PHILOSOPHER
from: DRUCK
date: 2021-01-05 21:20:00
subject: Re: AI and decompilation?

On 05/01/2021 10:28, The Natural Philosopher wrote:
> Yes. I am certain that certain compilers and certain languages leave a
> fingerprint. Always THAT register, used to do THAT job, always that
> particular sequence of assembly to mimic that high level construct.

They certainly do. I wrote !ARMalyser to analyse RISC OS executables and
to aid the conversion from the old 26-bit ARM mode to modern AArch32. It
was very obvious whether Norcroft C, GCC or handwritten assembly had been
used by looking at any chunk of the code, not just the obvious file headers.

> I think it is up to a limited point entirely possible to make an AI that
> could replace machine code with editable and compilable source code.
> But there will always be the Problem Of Induction. Many many possible
> constructs in source using an infinite number of random variable and
> function names, could compile to the same object code. And there is no
> way to reinstate the comments either, so it becomes an exercise
> ultimately in hand editing and reinstating the comments manually -
> almost as big a job as writing from scratch.

I was not attempting to turn the executable into a high level language,
but to give the user as much help understanding the assembler code as
possible, to aid the conversion.

At the lowest level that meant identifying what was code and what was
data. This is easy in well defined executable formats produced by
compilers, but hard in handwritten assembler, which had often used every
trick in the book to squeeze out performance on an 8MHz ARM2 with 512KB
of RAM.
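
To illustrate the code/data problem, here is a minimal Python sketch of
one common approach: start at a known entry point and follow branches,
treating anything never reached as probable data. This is not
ARMalyser's actual code. The names trace_code, branch_target and RETURNS
are hypothetical, and stopping only at an unconditional B or at
MOV/MOVS pc, lr is a simplifying assumption.

```python
import struct

# Assumption: only these two unconditional return idioms end a code run.
RETURNS = {0xE1A0F00E, 0xE1B0F00E}  # MOV pc, lr and MOVS pc, lr


def branch_target(addr, word):
    """Return the target of an ARM B/BL instruction at addr, else None."""
    if (word >> 25) & 0b111 != 0b101:      # bits 27-25 = 101 means B or BL
        return None
    offset = word & 0x00FFFFFF             # signed 24-bit word offset
    if offset & 0x00800000:
        offset -= 0x01000000
    # The PC reads as addr + 8; wrap to the 26-bit address space.
    return (addr + 8 + (offset << 2)) & 0x03FFFFFC


def trace_code(image, base, entry):
    """Return the set of word addresses reachable from entry."""
    code = set()
    todo = [entry]
    end = base + len(image)
    while todo:
        addr = todo.pop()
        while base <= addr <= end - 4 and addr not in code:
            (word,) = struct.unpack_from("<I", image, addr - base)
            code.add(addr)
            target = branch_target(addr, word)
            if target is not None:
                todo.append(target)
                unconditional = (word >> 28) == 0xE
                with_link = bool(word & (1 << 24))
                if unconditional and not with_link:
                    break                  # plain B: nothing falls through
            elif word in RETURNS:
                break                      # return: stop this linear run
            addr += 4
    return code
```

Hand-written assembler defeats this sort of tracing with computed jumps,
jump tables and data placed inline after a call, which is where the
extra heuristics come in.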

The next step was using knowledge of the Standard C Library functions
and SWI APIs to annotate the registers passed to and returned from the
APIs and, where those registers contain static addresses, the data
blocks they point to.
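
The SWI annotation step might look something like the Python sketch
below. It is an illustration only: the SWI_INFO table holds just three
well-known RISC OS calls, and the names SWI_INFO and annotate_swi are
hypothetical rather than anything taken from ARMalyser.

```python
# Tiny illustrative table: SWI number -> (name, meaning of entry registers).
SWI_INFO = {
    0x00: ("OS_WriteC", {"R0": "character to write"}),
    0x02: ("OS_Write0", {"R0": "pointer to NUL-terminated string"}),
    0x05: ("OS_CLI", {"R0": "pointer to command string"}),
}
X_BIT = 0x20000  # bit 17 selects the error-returning "X" form of a SWI


def annotate_swi(word):
    """Return an annotated listing line for a SWI instruction, else None."""
    if (word >> 24) & 0xF != 0xF:          # bits 27-24 = 1111 means SWI
        return None
    number = word & 0x00FFFFFF
    entry = SWI_INFO.get(number & ~X_BIT)
    if entry is None:
        return f"SWI &{number:X}"          # unknown call: show raw number
    name, regs = entry
    if number & X_BIT:
        name = "X" + name
    notes = ", ".join(f"{reg} = {desc}" for reg, desc in regs.items())
    return f"SWI {name} ; {notes}"
```

For example, annotate_swi(0xEF000002) gives a line for OS_Write0 noting
that R0 points to a string. If R0 was loaded with a static address just
before the call, that address can then be marked as string data.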

To allow code to be modified with additional instructions to recreate
flag preserving behaviour of the 26-bit code (in the few cases it is
actually necessary) and data added to make the larger 32-bit file
headers, all code and data addresses are identified and converted into
labels.
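
A rough Python sketch of the address-to-label idea follows, assuming
only B/BL targets are labelled and every other word is emitted as a DCD.
The function name list_with_labels and the L<address> label scheme are
made up for illustration. A real tool would also have to label
PC-relative loads, ADR sequences and address constants held in data.

```python
CONDS = ["EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
         "HI", "LS", "GE", "LT", "GT", "LE", "", "NV"]


def is_branch(word):
    return (word >> 25) & 0b111 == 0b101   # B or BL


def branch_target(addr, word):
    offset = word & 0x00FFFFFF             # signed 24-bit word offset
    if offset & 0x00800000:
        offset -= 0x01000000
    return (addr + 8 + (offset << 2)) & 0x03FFFFFC


def list_with_labels(words, base):
    """Emit assembler-style lines with branch targets replaced by labels."""
    # Pass 1: collect every branch target and give it a label name.
    targets = {branch_target(base + 4 * i, w)
               for i, w in enumerate(words) if is_branch(w)}
    labels = {t: f"L{t:08X}" for t in targets}
    # Pass 2: emit the listing, defining labels where they fall.
    # Targets outside the image get a name but no definition here.
    lines = []
    for i, w in enumerate(words):
        addr = base + 4 * i
        if addr in labels:
            lines.append(labels[addr])
        if is_branch(w):
            mnemonic = "BL" if w & (1 << 24) else "B"
            target = labels[branch_target(addr, w)]
            lines.append(f"        {mnemonic}{CONDS[w >> 28]} {target}")
        else:
            lines.append(f"        DCD     &{w:08X}")
    return lines
```

Once branches refer to labels rather than fixed offsets, instructions
can be inserted anywhere and the assembler recalculates the offsets.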

ARMalyser outputs in the standard Object Assembler syntax so it can be 
reassembled to produce an identical executable, and subsequently 
modified. It can also add syntax colouring in various formats such as 
XML, HTML/CSS for viewing.

If you were in marketing you could say the code which does this is 'AI',
but it's really a huge chunk of tangled heuristics, which works well most
of the time, but occasionally misidentifies code or data. It's a bit
too eager to identify code, due to the tricks assembler programmers
used. If I ripped all that out and only worked on compiler generated
executables, it would be a lot more reliable.

---druck
