Software Analysis by Reverse Engineering

The thesis here at the Geoff Chappell, Software Analyst website is that software can feasibly be subjected to a process analogous to literary criticism. Much as a literary critic may read the text of a novel sufficiently closely to discern weaknesses in the plot or to make out in-jokes contrived for the special enjoyment of the author’s friends, so may a software analyst study the instructions of a computer program sufficiently closely to spot errors in the code and to learn of features that the program’s manufacturer does not disclose.

We are all familiar with the work of literary critics. Most of us are comfortable with it, though authors and their publishers are sometimes less so. Most of us do not need, and ordinarily would not want, that the critic should be helped by the author, let alone that the critic’s understanding of the work should depend on contact with the author. Yet with software, the craft of analysis is so undeveloped that even among experts in computer science, only very few expect that useful analysis is practicable, or even possible, without help from the manufacturer and specifically without access to the source code.

This notion that the source code is all-important has become especially well established at the hands of those who promote open-source licensing. I say that whatever the merits of publishing software with or without its source code (though please note that I always give you the source code for my demonstrations), we can and ought to do better at developing techniques for examining software without needing the source code. Instead, we have software manufacturers insisting that their products can be used only in conformance with so-called license agreements that tell the consumer not to study the software too closely. Society is failed while the computer programming industry sells its wares but denies the buyers any realistic means of independent inspection. Do we tolerate this for anything else but software?

Reverse Engineering

To me, reverse engineering is a process of reading the software’s binary code to find what the software can make the computer do. It is this code that the computer reads and obeys, not the source code. Though binary code is not easily read by humans, a representation in assembly-language mnemonics, as shown in a debugger, is exactly equivalent (on all but some very obscure theoretical points) to what the computer reads. Indeed, reverse engineering as I practise it is essentially debugging in advance of having a bug to debug.
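As a minimal sketch of that equivalence, and not of any technique used for the analyses on this site, the following Python fragment maps a few bytes of 32-bit x86 machine code to assembly-language mnemonics and back again. The table is a hypothetical illustration that covers only the five encodings in the example; the names `TABLE`, `disassemble` and `assemble` are assumptions made for the sketch, and a real disassembler must handle vastly more.

```python
# Hypothetical, deliberately tiny table: 32-bit x86 encodings for five instructions.
# It is nowhere near a real decoder; it exists only to show the byte/mnemonic mapping.
TABLE = {
    b"\x55": "push ebp",
    b"\x8b\xec": "mov ebp, esp",
    b"\x33\xc0": "xor eax, eax",
    b"\x5d": "pop ebp",
    b"\xc3": "ret",
}


def disassemble(code: bytes) -> list[str]:
    """Translate machine-code bytes into mnemonics using TABLE only."""
    lines = []
    i = 0
    while i < len(code):
        # Try the two-byte encodings first, then the one-byte encodings.
        for length in (2, 1):
            chunk = code[i:i + length]
            if len(chunk) == length and chunk in TABLE:
                lines.append(TABLE[chunk])
                i += length
                break
        else:
            raise ValueError(f"byte {code[i]:#04x} at offset {i} is not in the table")
    return lines


def assemble(lines: list[str]) -> bytes:
    """Translate mnemonics back into the bytes that TABLE associates with them."""
    reverse = {text: encoding for encoding, text in TABLE.items()}
    return b"".join(reverse[line] for line in lines)


if __name__ == "__main__":
    code = bytes.fromhex("558bec33c05dc3")
    for line in disassemble(code):
        print(line)
    # The round trip recovers the original bytes exactly.
    print(assemble(disassemble(code)) == code)
```

The round trip is exact here only because the table holds one encoding per mnemonic. Some x86 instructions have more than one valid encoding (the bytes 89 E5 also mean `mov ebp, esp`), so in general the mnemonics determine what the computer does but not always the precise bytes.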

If your notions of right and wrong would have it that reverse engineering, even in the limited sense I just described, can only be wrong, then this website is not for you. Since pretty much everything you can learn here about any software will be something you would never permit yourself to use, you may as well not read on.

And while I’m presuming to suggest how you approach my work, we may as well deal with the question of bias. Because reverse engineering is an intensive exercise, I have long specialised in software that I perceive as offering the best return on my effort, mostly for being software that has a large ecology of dependent programmers. For better or worse, this means my studies are directed almost entirely at Microsoft Windows and at Microsoft’s tools for Windows programming. Some of you apparently take my criticism of Microsoft as indicating that I am biased against Microsoft and even that I’m a Microsoft hater (whatever that might be). Certainly, if you’re reading an analysis and you feel I haven’t made my case by reference to what’s in the code, then infer what bias you want. Better yet, write to me to alert me to what you’ve spotted as a defect in my argument. But if all you’re doing is weighing criticism against praise as some rough-and-ready technique for detecting bias, then please ask yourself why you expect these to balance. This site doesn’t set out to review software for its usability, even by programmers, or to instruct about software design. The whole purpose is to demonstrate that inspection from outside can show, in more detail than most think possible, that what is in the software is not exactly what is said to be. Where I write about discrepancies, take me to task if you think I don’t allow enough for the human frailty of programmers, but please don’t expect that for each fault I find in the software I should write something positive for balance. Where there’s no discrepancy between the software and what its manufacturer says about it, there’s no cause even for comment, let alone for praise.

What Good Can Come?

As our society relies ever more on computer software sold as a consumer product, I cannot be the only one who is troubled that we have hardly any means of inspecting the product independently of its manufacturer. Of course, a bug in Windows is not a matter of life and death, nor even is the possibility that any software manufacturer might mislead its customers, or our courts, having calculated that there is no realistic chance of being exposed. Yet for no consumer product other than computer software has our society accepted anything like so much dependence on the manufacturer to be open and truthful about the product’s behaviour and especially about defects. Arguably the only mechanism through which we might hope to attain openness and truthfulness in the manufacture of software is competition. This generally has been a very effective process for raising standards, but how much it can sensibly be relied on for software is at best debatable. Software seems unusually susceptible to the development of monopolies but the complexity of the product means that regulators who might ordinarily hope to redress a lack of competition, whether natural or contrived, plainly don’t have practical means of knowing the products well enough to be effective, let alone wise.

As we look to a future of new technology, especially in genetics where we did not invent the machinery that we will nonetheless try to program, we surely ought not let it become our custom to trust so much in the manufacturers. For the programming of software on electronic technology, the best that our society seems to have managed is some notion of manufacturers disclosing their source code, whether voluntarily as so-called open source or compulsorily at the direction of a court. To me, this is analogous to requiring that manufacturers of food disclose their recipes or at least list their ingredients. It’s fine enough as far as it goes—and I am one who always reads those details on packaged food—but it’s no substitute for independent chemical analysis. Our food supply surely has more integrity, and the ingredients lists and nutritional details more credibility, because the manufacturers of food know that their wares can be analysed independently, even if such analysis is hardly ever done. We ought to have something like that for software, but we are nowhere near it. Indeed, we are so far from having it that few have ever thought to expect it and most of them gave up long ago on ever seeing it.

Arrested Development

Given that mass-market computing has been with us for 30 years, and has developed in scale and sophistication by orders of magnitude, it is conspicuous that reverse engineering of computer software is almost as primitive now as then. This lack of development is in one sense a measure of how good computer software has become. As much as we all get frustrated by some programs in some circumstances, the fact is that commercial software mostly does work as advertised. If reverse engineering hasn’t grown as fast as programming and marketing and everything else that makes an industry, the simple reason may be that nobody needs it to.

Yet if reverse engineering is something that nobody wants, it certainly generates a lot of talk. Cryptographers have long established the notion that if a cipher depends on the secrecy of its algorithm then it is no good at all because the algorithm will just be reverse engineered from whatever program uses it. The growing field of computer forensics relies heavily on gleaning information from caches and logs, whose binary data is often in a proprietary format which becomes known more widely (if inexactly) only because of reverse engineering. The part of the computer security industry that would defend us all from malware is forever in the news with talk of the good work being done to reverse engineer this virus or that worm. And there is seemingly no end of Windows programmers who have not just tried their hand at reverse engineering to find their way round some quirk in Windows, but claim to be very good at it. This website is arguably just an extreme manifestation of this last example.

Whatever the talk, there’s very little sign, at least in public and outside of the computer security industry, of reverse engineering being anyone’s specialised work at which they become ever better experienced through full-time study, practice and application. Unsurprisingly, there’s also very little sign, again maybe only in public, but even in the computer security industry, of any reverse engineering that anyone has good cause to be proud of. Picking through the instructions that will be executed by a dumb computer isn’t undemanding intellectually—but it’s not String Theory. This is not work that’s fundamentally too difficult for all but the brightest minds. That there’s so little to show is presumably not because people aren’t bright enough or don’t work hard enough but because something causes them not to aim very high or perhaps even not to see how high they might aim.

Ambition

This website exists as the main record of my contribution to showing what can be aimed for in the reverse engineering of software, especially of Windows, and to bringing it within reach.

Though I have long intended that techniques must some day be taught, this is expressly not a website that teaches reverse engineering. You will see very little sign here of how anything was discovered. In part, that’s because I believe it must first be shown that an awful lot can be discovered. But there is also that the how is boring. I don’t mean that pejoratively. While you’re learning how to do something, or teaching it to others, the how had better be interesting. But once learnt, it should be nothing more than technique, subsumed into the subconscious. As vital as it may have been to learn the alphabet all those years ago, we mostly do not as adults have more than the slightest awareness of individual letters in the words we read. That so many malware analyses and other reports of reverse engineering dwell so much on how the information was obtained is, to me, a sign of how primitive is the reverse engineering and how low are the expectations of readers’ abilities. If you’re spending your time explaining your basic algebra or automating your arithmetic, then although I don’t say that what you do is without merit, I reckon you’re probably not doing higher mathematics.

With sufficient will—and, importantly, sufficient support—to develop the skill, software can feasibly be studied without having the source code, without assistance from the manufacturer beyond the generally published documentation, and even without running the software. Indeed, if reverse engineering were supported as full-time work, with a career structure that programmers might pursue instead of programming, then much of the debate about open versus closed source code might even go away. When everything you can want to know about a program’s behaviour can be found by reverse engineering, then source code for software becomes more like blueprints for a building: a record at best of what was intended or specified, but not necessarily of what got built.

Public Good

To some extent, it already doesn’t matter whether source code is open or closed. With or without source code, the threshold for understanding complex software more deeply than supported by the manufacturer’s documentation has long been too high for an ordinary cost of doing business. It takes time and care to read and (properly) understand zillions of lines of source code. In many cases, a proficient reverse engineer isn’t significantly encumbered for not having the source code, and may actually have the advantage.

Only very rarely can some information about someone else’s software be worth enough to any one software company to bear the whole cost of finding the information, whether by reverse engineering from binary code or by reading the source code. Yet when the someone else’s software is something like Windows, with which much other software must inter-operate, an awful lot of time and money is wasted in total by software companies for not having enough information known with enough certainty. Reverse engineering Windows looks like a classic case of something that can only be funded collectively as a public good.

The Goal

The aim at this site is to show that practicable techniques of reverse engineering are at least within reach, albeit as one man’s lone demonstration. Practicality is demonstrated by producing lots of information that would otherwise not be known, at least not with the same precision and comprehensiveness.

In principle, a given aspect of a program’s behaviour should be discoverable to any desired accuracy by studying to sufficient depth the set of instructions that constitute the program’s binary code. Analysis conducted according to this principle is essentially an exercise in deducing how the computer would behave if executing the program.

Note the subjunctives. The analytic techniques sought here have no need for access to whatever source code was used to produce the program. Neither is any assistance needed from the software’s manufacturer. Indeed, it is not necessary even to run the software—not until the analysis produces predictions to be tested by experiments.

The Benefits

Direct benefits of this sort of analysis flow to three broad groups: