Week 3: Software Heritage for Code Studies: A Guide Based on Programming Historian

Titaÿna · January 26

Posted by Titaÿna Kauffmann, C²DH, University of Luxembourg

For those who want to work with Software Heritage, the best practical guide available is a 2024 Programming Historian tutorial by Sabrina Granger, Baptiste Mélès, and Frédéric Santos: "Préserver et rendre identifiables les logiciels de recherche avec Software Heritage". One problem: it's only in French. This post summarizes its key points for non-French speakers, with examples from the exhibition.

A note on the name: Software Heritage's founding documents speak of preserving "the technical and cultural heritage of humanity." In French, heritage is patrimoine — a term that carries connotations of inheritance, of what we receive from and pass to generations. It frames code not just as functional artifact but as something worth transmitting.

This framing has institutional weight. In 2003, UNESCO's Charter on the Preservation of Digital Heritage explicitly listed software among the "born-digital" materials requiring preservation — alongside texts, databases, images, and web pages. The Charter notes that "where resources are 'born digital', there is no other format but the digital object." Unlike a manuscript that can be photographed or a painting that can be reproduced, source code has no analogue original to fall back on. If the digital form is lost, there is nothing else.

In 2016, UNESCO and Inria signed an agreement specifically recognizing Software Heritage's mission. At the signing, François Hollande captured the stakes: "What is expected of us is to be able to control, to be able to transmit, is to be able to put these technologies, this information, these elements that become of the heritage at the service of humanity." The verb transmit — transmettre — is key. Heritage is not just what we keep but what we pass on.

What is Software Heritage?

Software Heritage is a nonprofit initiative launched in 2016 by Inria (French National Institute for Research in Digital Science and Technology) with UNESCO's support. Its mission: collect, preserve, and share all publicly available source code. The archive currently contains over 18 billion unique source files from more than 300 million projects.

The archive addresses a real problem. As the tutorial authors note, URLs in academic publications have a lifespan somewhere between that of a hamster (2 years) and a penguin (15-20 years). Personal websites disappear when people change jobs or retire. Institutional pages break when organizations rename themselves. Even major forges offer no guarantee — remember Google Code?

Software Heritage provides a stable alternative: code archived once remains accessible, with permanent identifiers that won't break.

Why Source Code (Not Executables)?

The tutorial emphasizes a key distinction. Software exists in three forms: source code (human-readable instructions), compilation (the translation process), and executable (machine-readable binary).

Software Heritage archives only source code, not executables. The reason is practical: compilation is largely a one-way process. When source code becomes a binary, comments, variable names, and documentation are stripped away. What remains runs on machines but can't be read or reconstructed by humans. Archiving the executable would preserve the program's function; archiving the source preserves how its creators thought.

The tutorial points to the Apollo guidance computer code as an example. The source contains comments like explanations of why certain routines exist. These are stripped away entirely in the executable. You can explore this yourself in Software Heritage by searching for the repository URL https://github.com/virtualagc/virtualagc.

How Software Heritage Archives Code

The tutorial describes two archiving methods:

Automatic harvesting: Software Heritage regularly crawls major forges (GitHub, GitLab, Bitbucket, etc.) and package archives (npm, PyPI, etc.). Most public open-source code is already archived.

Manual archiving: Anyone can trigger archiving of a public repository using the "Save Code Now" feature. You don't need to be the code's author — you just need the repository URL.

Beyond the code itself, Software Heritage preserves development histories. Every commit is archived with its author, timestamp, and message. This matters for understanding how software evolved. The tutorial cites a 2008 Linux kernel commit by Matthew Wilcox explaining why he simplified a section of code — that reasoning, captured in the commit message, helps future readers understand the codebase.

Gabriel Alcaras's exhibition panel on git-stash makes similar use of commit history. Nanako Shiraishi's original commit from June 30, 2007 includes her explanation: "When my boss has something to show me and I have to update, for some reason I am always in the middle of doing something else." The timestamp tells us it was a Saturday afternoon in Japan. The commit message tells us she wrote it to solve a personal workflow problem. Software Heritage preserves all of this.

Finding Code in the Archive

The tutorial outlines several search strategies depending on what you know:

If you know the project name: Search for the repository in a search engine, find its URL (e.g., https://github.com/torvalds/linux), then enter that URL in Software Heritage's search.

If you have a code snippet: Search for the exact text in a search engine to find which repository contains it, then locate that repository in Software Heritage.

If you have a file: Drag and drop it onto the Software Heritage homepage. The system will tell you if identical content exists in the archive.

If you have a SWHID: Paste it directly into the search bar.

The SWHID Identifier System

This is the core of the tutorial's practical guidance. SWHIDs (SoftWare Hash Identifiers) are permanent identifiers designed specifically for software. Unlike DOIs, which are assigned by a registry, SWHIDs are intrinsic — computed directly from the content through cryptographic hashing. The identifier is tied to the exact content: change one character, and the SWHID changes.

As of April 2025, SWHID is an ISO standard (ISO/IEC 18670).

There are five types, suited to different needs:

Type	What it identifies	When to use
`snapshot`	A complete state of a repository at harvest time	Citing a project's overall state
`release`	A tagged version	Citing a specific software release
`revision`	A single commit	Citing a specific change or commit message
`directory`	A folder of files at a point in time	Citing a project's source tree
`content`	A specific file (can include line numbers)	Citing exact code you're analyzing

The tutorial gives concrete examples. If you're a researcher who used version 1.1.1 of a specific R package, you'd want a release SWHID. If you're writing about a specific algorithm implementation, you'd want a content SWHID pointing to that file, possibly with line numbers.

Obtaining a SWHID: Step by Step

The tutorial walks through the process:

Navigate to the code in Software Heritage's archive
For files, you can select specific lines by clicking line numbers (shift-click for ranges)
Open the "Permalinks" tab on the right side of the interface
Choose the appropriate identifier type
Optionally check "Add contextual information" to include qualifiers (origin repository, path, etc.)
Copy the SWHID

The contextual qualifiers are useful for readers. A bare SWHID like swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 points to content but doesn't explain where it came from. A qualified SWHID includes the origin repository, the snapshot, and the file path.

When to Use Which Identifier

The tutorial distinguishes between two main use cases:

For citation (attributing intellectual credit, pointing readers to a version): Use DOI or HAL-ID if available, which link to descriptive metadata. These are like an identity card — they tell you about the software.

For technical verification (ensuring reproducibility, examining specific code): Use SWHID, which points directly to content. These are like fingerprints — they identify the exact artifact.

The two are complementary. A HAL deposit might include a SWHID that points to the precise archived code.

Adding Metadata with CodeMeta

The tutorial's final section covers CodeMeta, a standard for software metadata. A codemeta.json file in a repository provides structured information (authors, license, dependencies, related publications) in a format that bridges different platforms' metadata vocabularies.

The tutorial links to CodeMeta Generator, a web tool for creating these files without writing JSON manually.

Connecting to the Exhibition

Several exhibition panels feature code accessible through Software Heritage:

Stockfish (Mathieu Acher's panel): Browse the repository, navigate to src/position.cpp, find the line PRNG rng(1070372);

Git-stash (Gabriel Alcaras's panel): The original commit can be accessed directly via its SWHID: swh:1:rev:f2c66ed196d1d1410d014e4ee3e2b585936101f5

Wenyan (Baptiste Mélès's panel): Browse the repository

Other panels feature code that required archival work outside Software Heritage — ELIZA's original MAD-SLIP source was transcribed from MIT archive printouts, the Latin American World Model survives as printed code sheets. This distinction matters for thinking about what gets automatically preserved and what requires deliberate recovery.

Resources

Software Heritage archive: archive.softwareheritage.org
Save Code Now: archive.softwareheritage.org/save
SWHID documentation: docs.softwareheritage.org
Full French tutorial: programminghistorian.org/fr/lecons/preserver-logiciels-recherche
Exhibition website: www.sourcecode-exhibition.softwareheritage.org

jshrager · January 26

Thank you, @Titaÿna for this detailed explanation of this incredibly important project. I'm going to express a slightly controversial position on this, but i want to be clear that I don't think that the project is useless, just that it isn't as useful as it may seem.

One of my best friends is a ... I'm not sure what the term is so I'll call her a "Heritage Linguist". She tries to preserve language that are dying out (and although she doesn't work on this, there is another thread that tries to figure out languages that are already lost.) I am a professional computer scientist. As such, software being the focus of my career, I have backups of backups of my backups in the house, in a safe deposit box, and all over the web. Backups are the most important thing in my life, next to my family (and it's close as to which is more important! ... jk! :-)

I am dubious of both of our projects in exactly the same way that I am dubious of the Software Heritage project. The part that concerns me is this: "The source contains comments like explanations of why certain routines exist. These are stripped away entirely in the executable." I do NOT think that the project should also store executables -- quite the opposite, having a huge amount of personal experience with backing up my own code, the executables are, indeed, quite useless (with rare exception), however, code is not nearly as important by itself as this project makes it out to be.

Code runs in a context -- the data, the use cases, the computing environment, the "meaning" of the code in the setting -- you might call it the cultural surround of the program. In the "computational world" (meaning approx. since the enlightenment, but primarily the past 75 years) most code would be meaningless -- or at least only trivially meaningful without that, and all that is very very hard -- mostly actually impossible to store!

Let me give you a precise example: I currently work on methane removal technologies. There is, of course, a lot of involved, but that code runs on a device that is embedded in a device that is embedded in another device that is embedded in a lab that is embedded in a company that puts these nested systems upon systems on cargo ships to mitigate methane. The code, even with the comments, means essentially nothing without all that context. I back it up, of course, but I don't expect to run it again -- the context would be gone. Why do I back it up then? Two reasons: First, is the obvious one that it might get instantly lost. Then there is the secondary usage, which is that I can use old code moving forward at adapt it to new contexts as our devices change, saving me rewriting, and rediscovering, for example, that you can't turn on all the lamps at once or a pulse fries the relays. But why do I then why do carry my backups upon backups of code around with me after any number of many application contexts are long gone and the code is useless.

There is the additional reason, as well, that it's legally required for published papers. But this is in large part a fantasy in the same sense: I always retain code that goes with my papers, and hand it over freely if asked. But I've also had a lot of experience getting other scientists' "preserved" code, and even with those scientists (sometimes grudging) help, it is incredibly hard to get any code to work, or even to understand it. Why? Because ... and here's the point of this whole post ... the context of the science isn't the code!

The reason I have my backups upon backups upon backups for long dead context is that ... well, I guess I could maybe use a few lines in future projects, but TBH it's easier and in the end faster to recreate most code than to scour backups trying to find three lines of code that could be helpful now. No, the reason I keep all my code is the same reason I have pictures of my family: I can't part with my own history. Or, putting it more positively, I like looking at it. It's my creation -- my output -- my family, and I want to have pictures of them. But when I'm gone, and the context is long gone, that stuff should all be thrown in a shredder in order to same future generations from wasting their time looking at my old useless code.

This brings me back to my friend the Heritage Linguist. I've had this argument with her. There is, of course, some reasons to retain dying languages, esp. in order to communicate with elders, and perhaps to read elder writings (or understand oral histories). But what I tell her (and we argue about) is that the number of dead languages is enormous -- probably far larger than the number of extinct animal species since humans have had language. And the number of git pushes that there are each year isn't even imaginable. I have backups upon backups upon backups of my useless software -- from my professional standpoint, the software heritage project sounds insane, but even more, given the contextual argument, pretty useless.

Now, this isn't to say that it isn't fun to reanimate old code -- or to look at pictures of my family. I engage in both myself! And paleontology is fun too, although folks argue about it utility as well. (At least in paleontology we're know what the context was like in most cases.)

Okay, see, I warned you I was going to take a controversial position.

markcmarino · January 26

I just want to take this moment to congratulate Titaÿna Kauffmann (@titayna) and the whole organizing team! The inclusion of this exhibition of code by the Software Heritage Foundation as a UNESCO event marks a significant development in our understanding of code as a cultural artifact. The exhibition delivers a clear announcement that code is culture.

Although not the first effort by the Software Heritage Foundation, this exhibition brings "code as culture" to the global stage. The work that began with the partnership in 2016 marks an acknowledgement of a fundamental premise of Critical Code Studies: that the study of computer source code can help us to access the history of people, communities, organizations, and ideas.

When we started these working groups and the efforts of Critical Code Studies, talking about code as a cultural object sounded strange to most people, even the people developing it. Code was seen as something provisional, something to be used but not studied, something that operated within the walls of the building, not meant to be preserved like other cultural artifacts.

Code was seen as the domain of the programmer, not acknowledging the many cultures and communities in which programmers operate. What could code mean for the rest of the world?

UNESCO offers us much richer understanding of world heritage, preserving dances, food, rituals and festivals, and even languages. They preserve manuscripts, archives, film and audio, and rare books. They are invested in indigenous languages and practices, preserving and documenting what otherwise could be lost.

I treasure Hollande's call "to put these technologies, this information, these elements that become of the heritage at the service of humanity." I would hope that the work that we do here in the working group helps us put this technology in service of humanity through our readings, interpretations, and discussions. A key part of that transmission is drawing out the meaning of these objects, which requires interdisciplinary dialogues.

I look forward to our discussions of the exhibition and hearing about the gathering in Paris this week.

What other code do you think should be preserved? What code objects from the exhibition light your imagination?

Titaÿna · January 27

Thank you both for these opening reflections.

A quick note on my position: I'm not a Software Heritage employee working on archival infrastructure—I'm a PhD student in history at the University of Luxembourg, collaborating with them on this exhibition. My interest is in how historians might use preserved code, not in the technical systems that preserve it.

@markcmarino — I share your sense that this exhibition marks something significant. The UNESCO framing matters: it positions code within a broader understanding of heritage that already encompasses manuscripts, oral traditions, ritual practices.

@jshrager — I genuinely appreciate the provocation. You're identifying something important: the fantasy that if we save the artifact, we've saved the history. Your methane removal example is vivid. The code alone won't tell someone in fifty years what you were actually doing.

But I come at this from a historian's vantage, and for historians, the calculus looks different. We're accustomed to working with fragments. The medieval historian doesn't have the village, the smells, the social relations—she has a charter, maybe a chronicle, possibly some archaeological remains. She cross-references, triangulates, acknowledges what's lost. The fragment is valuable not because it's sufficient, but because it's something where otherwise there would be nothing.

I'll stay with Apollo since we all know it, but I want to be precise about what that case actually demonstrates. The code survived through paper listings that ended up at the MIT Museum; dedicated individuals later undertook digitisation and transcription. Crucially, the Virtual AGC project didn't stop at preservation—they built an emulator. The code can run again. This pairing of archived source and functional emulation opens possibilities that neither alone provides: we can study the static text and observe execution behaviour, test hypotheses about how the system responded under specific conditions.

What can we learn from the code itself? Technical architecture. Memory management strategies. Commenting conventions. We find lines like "TEMPORARY, I HOPE HOPE HOPE"—which tells us something about pressure, about the gap between intention and implementation, about what programmers chose to record in their annotations. But the code doesn't tell us who wrote that line, under what circumstances, or what "temporary" meant in the project's timeline.

For other dimensions, we turn to other sources. Margaret Hamilton's 2004 oral history describes being frequently the only woman in professional settings, coining "software engineering"—a term hardware engineers initially dismissed—and bringing her daughter to the lab because the work demanded it and support structures didn't exist. This testimony reveals labour conditions, professional recognition, and gender dynamics at a major software project. It provides a parallel account of the conditions of production that we can place alongside the technical artifact.

The historian's task is to hold these sources in relation without collapsing one into the other. The code tells us what decisions were made; Hamilton's testimony tells us something about the environment in which decisions were made. Neither explains the other directly. We triangulate, remaining attentive to what each source can and cannot support.

So I'd reframe your point slightly. You're right that code alone is radically insufficient. But the question is whether having the code, alongside other sources, enables historical work that would otherwise be impossible. For Apollo, certain analyses—of software architecture, of engineering tradeoffs, of what the programmers actually built versus what retrospective accounts claim—depend on access to the code itself. For most 1960s software, we cannot even attempt such work.

There's a recent example that illustrates what becomes possible when code survives at scale. Aycock et al.'s study of code re-use in Atari 2600 games analysed nearly 2,000 ROM images to trace how routines spread across games and companies. They could identify specific code sequences appearing in games by different developers, track a programmer's evolving practice across titles, and detect corporate boundaries in re-use patterns. But they didn't stop at code analysis—they combined it with oral history from a developer who kept his 8-inch floppy disks and printouts for forty years. The code alone didn't explain why certain routines spread; the testimony and preserved artifacts filled those gaps. This is exactly the triangulation I'm describing: preserved code made the technical analysis possible, but understanding required multiple source types working together.

Your backups-upon-backups might be more than sentiment. Fifty years from now, a historian might want to understand early methane removal technology—not just the physics, but the software architecture, the embedded systems relationships. They probably won't run it without emulation work. But if the code survives alongside your papers, your lab's records, perhaps an interview, they'll have fragments to triangulate. That's better than nothing. And sometimes fragments enable work no one anticipated.

jshrager · January 27

@Titaÿna Again, let me be clear that I'd be the last to say that saving code isn't useful. But since the claim is that source code is the critical form, I think that this is misleading; the thing to save is context. When you are reading code you are actually more reading context than code. Try reading the Apollo guidance code without the comments and even more important, the commentary; without knowing what the display looked, like or if you came across that code without knowing that it was the Apollo guidance code, it isn't much better than the made-up-code mentioned in another thread from Jurassic Park. So, again, I'm not against saving the code, but I think that what we ought to be doing in addition, for anything that we think is important, is at least interviewing the authors of the code while we can. There's already some (not, IMHO enough) of that in the case of the Apollo code, but there's pretty much nothing but code comments for probably 99.999% of the code in the repo (actually, the commit comments are often more useful than the code itself because it tells us what the authors were thinking, whereas the code just tells us what the authors thought might be obscure enough to comment on). Fortunately, like the Apollo code, I'm in good shape a regards IPL-V and LT, because there is extremely good documentation for them both. The LT code would be utterly uninterpretable without knowing IPL-V, and IPL-V would be utterly uninterpretable if all I had was a non-working IPL-V interpreter, but an hour of an interview with Simon and Newell about it would be worth more than the code itself. (I could have done this, because I worked with them at CMU, but I wasn't on about this project than, alas. Again, fortunately it was extremely well documented by them and their colleagues at the time.) BTW, I often don't even read my own code when I'm re-using it in a new context -- I mostly either know what it was intended to do, or read the block comments (or often just the function name) to understand what the code was intended to do, and then sometimes adapt the code, or more often than not rewrite it in the current context.

Titaÿna · January 28

@jshrager — You're right, and I want to agree more fully: code without context is often nearly useless for understanding what actually happened. Your point about reading context rather than code—even with your own work—captures something fundamental about how software knowledge lives in people, not just in files.

For me, the aim of CCS—at least from a historian's perspective—is to work within that dialectic: acknowledging that code doesn't exist in a vacuum outside of society, but also recognising that code is itself part of the context that shapes society. Historical research, at its core, tries to connect those two movements together. But Software Heritage is not a historian's initiative. It's an archival one.

Gabriel Alcaras's exhibition panel on git-stash illustrates this well. Software Heritage preserves commit f2c66ed196d1d1410d014e4ee3e2b585936101f5—Nanako Shiraishi's contribution from a Saturday afternoon in June 2007. The commit message tells us she wrote it because her boss kept interrupting her work. That's already more context than most code carries. But as Alcaras notes, we know almost nothing else about her: a name, an email address, the timestamp. The code is preserved; the person behind it remains largely inaccessible. The archival layer did its job. The research layer—the oral history that might have told us more—was never built.

I'd distinguish preservation practices from research practices. Software Heritage isn't claiming that code alone suffices for historical understanding—they're making a more modest bet: that code is part of humanity's legacy and worth preserving systematically, before we know what will matter. All preservation strategies have limits; theirs trades depth for scale. The strengths are systematic collection (letting future researchers decide what's important), durability (forges disappear, personal backups fail), and citability for academic purposes. What we do with preserved code—the triangulation with oral history, documentation, context—that's the research layer built on top. CCS operates in that layer. But without the archival layer beneath it, we can't even begin. And you're absolutely right that the oral history side is the urgent one—code will wait; people won't.

JoeyJones · January 28

I think this is a great project. When it comes to code, we can think of it in different ways. We can consider it like the wiring of a house: indispensable but hidden from view and rarely interesting in itself. Or we can also consider it like a play script: a set of instructions that allow us to have a performance, an artefact to preserve. Whether any given bit of code really is interesting or useful is often hard to say in the short term.

I realise this is outside the scope of the software heritage archive, but I think in some domains, saving the executable is also valuable. For example, the Interactive Fiction Archive saves text-based games and (and sometimes their code where available). This allows people to still play them, as the pool of people who can run an old file format in an interpreter is much larger than the number of people who can compile an old game from its source.

Becky · February 5

I am often thinking about how to address the gaps in the critical code archive so to speak - what to do when the code is not available, and how one reads (and executes) those gaps, and how the constitution of an archive informs this. This project, and the conversation here, are so stimulating for that thinking; thank you!

Howdy, Stranger!

Categories

In this Discussion