Why "collapse" (not "rot") is the way to think about software problems


For decades, programmers have talked about the tendency of software to become less reliable over time as "rot," but Konrad Hinsen makes a compelling case that the right metaphor is "collapse," because the reason software degrades is that the ground underneath it (hardware, operating systems, libraries, programming languages) has shifted, like the earth moving under your house.


Building on this metaphor, Hinsen identifies the strategies we use to keep our houses standing: building only on stable ground; building in reinforcements to counteract the expected degree of shaking; fixing the house after every quake; or giving up and rebuilding the house every time it falls down.


These strategies are of limited use to software developers, though: building in a risk-free environment means using systems that don't change, which severely limits your options (some large fraction of ATM transactions today loop through a system running COBOL!); we don't really know how to make software that remains reliable when its underlying substrates change; and rebuilding software from scratch over and over again only works for very trivial code.


Which really leaves us with only option 3: constant repairs.


I love this analysis but I wonder where "technology debt" fits in (the idea that you shave a corner or ignore a problem, then have to devote ever-larger amounts of resources to shoring up this weak spot, until, eventually, the amount of work needed to keep the thing running exceeds all available resources and it collapses).


As a first step, consider the time scale of change in your own project. Do you develop software that implements well-known and trusted methods for use by a large number of researchers? In that case, your software will evolve very slowly, fulfilling the same role for decades. At the other extreme, if your software is developed as part of research in a fast-moving field like machine learning or bioinformatics, it will evolve rapidly, and last year's release may be of interest only for the history of science. As a rule of thumb, the time scale of layer-4 software is the duration of the project it serves plus the length of time you expect your computations to remain reproducible. For layer-3 software, it's the time scale of methodological advance in its research domain that matters. Check, for example, how old the methodological papers that you tend to cite are. Infrastructure software, i.e. layers 1 and 2, can fulfill its role only if it is more conservative than anything that depends on it, so its time scale of change is defined by its intended application domains.

Next, you must estimate the time scale of change of your dependencies. For layer-3 dependencies, that should be rather straightforward, as they are likely to evolve in the same research community as yourself, and thus on time scales similar to your own work. For infrastructure software, the task is more difficult. The fact that you are considering adopting package X as a dependency does not mean that the developers of X have your needs in mind. So you have to look at the past evolution of X, and perhaps at the time scales of the major clients of X, to get an idea of what to expect for the future. For young projects, there isn't much past to study, so you should estimate their time scale by their age.

Once you have all these time scale estimates, you can identify the most risky dependencies: those whose time scales of change are faster than your own. If you go for strategy number 3, i.e. adapting your code rapidly to changes in the dependencies, then you might have to invest a lot of effort into catching up with those fast-moving projects.
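To make that comparison concrete, here's a minimal sketch of the heuristic the excerpt describes (this is not Hinsen's code; the dependency names, layer labels, and year estimates below are invented for illustration): estimate a rough time scale of change for your project and for each dependency, then flag anything that shifts faster than you do.

```python
# Minimal sketch of the "compare time scales" heuristic from the excerpt.
# All names and numbers are hypothetical, for illustration only.

from dataclasses import dataclass


@dataclass
class Dependency:
    name: str
    layer: int                 # 1-2 = infrastructure, 3 = domain methods
    change_time_scale: float   # rough estimate, in years, between breaking changes


# Your own project's expected time scale of change, in years.
PROJECT_TIME_SCALE = 10.0

dependencies = [
    Dependency("array/infrastructure library", layer=2, change_time_scale=8.0),
    Dependency("domain analysis toolkit", layer=3, change_time_scale=3.0),
    Dependency("young deep-learning framework", layer=3, change_time_scale=1.0),
]


def risky(deps, project_time_scale):
    """Return the dependencies whose ground shifts faster than the project itself."""
    return [d for d in deps if d.change_time_scale < project_time_scale]


for dep in sorted(risky(dependencies, PROJECT_TIME_SCALE),
                  key=lambda d: d.change_time_scale):
    print(f"risky: {dep.name} (layer {dep.layer}) changes every "
          f"~{dep.change_time_scale:g} years vs. the project's "
          f"~{PROJECT_TIME_SCALE:g}")
```

Per the excerpt, for a young dependency with little history you'd simply use its age as the estimate; the risky list is then where most of your "constant repair" effort will go.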

Dealing With Software Collapse [Konrad Hinsen/IEEE Computing in Science and Engineering]


(via Four Short Links)