Title: Ice-Air Data Appendix
Author/s: University of Washington Center for Human Rights
Language/s: pmd -- python + markdown
Year/s of development: 2019
The Ice-Air Data Appendix is a notebook-style document that accompanied the University of Washington Center for Human Rights' report "Hidden in Plain Sight: ICE Air and the Machinery of Mass Deportation." It processes almost 2 million passenger records from the deportation flight database of ICE (U.S. Immigration and Customs Enforcement), obtained through FOIA. The pmd file is audit code that loads the cleaned ICE Air dataset, enforces the schema, runs consistency checks, and generates groupings to verify the credibility of the data used in the appendix.
Source: ARTS (Alien Repatriation Tracking System), ICE's internal database for managing deportation charter flights, tracking passengers, flights, passenger traits, and pickup and drop-off locations.
The data work follows the "Principled Data Processing" approach developed by the Human Rights Data Analysis Group (HRDAG).
What aspects of ICE operations and ideology (deport/punish/oppress/purify) can we critique through this code? What is the value of critiquing the pipeline around code we cannot see? How does this code model ways of monitoring and auditing the State?
Source File: ice-air-data-appendix.pmd
This is an appendix to the report Hidden in Plain Sight: ICE Air and the Machinery of Mass Deportation, which uses data from ICE's Alien Repatriation Tracking System (ARTS) released by ICE Enforcement and Removal Operations pursuant to a Freedom of Information Act request by the University of Washington Center for Human Rights. This appendix is intended to provide readers with greater detail on the contents, structure, and limitations of this dataset, and the process our researchers performed to render it suitable for social scientific analysis. The appendix is a living document that will be updated over time in order to make ICE Air data as widely accessible and transparently documented as possible.
The project repository contains all the data and code used for the production of the report.
import pandas as pd
import yaml

# Get optimal data types before reading in the ARTS dataset
with open('input/dtypes.yaml', 'r') as yamlfile:
    column_types = yaml.load(yamlfile, Loader=yaml.SafeLoader)  # Loader arg required by newer PyYAML

read_csv_opts = {'sep': '|',
                 'quotechar': '"',
                 'compression': 'gzip',
                 'encoding': 'utf-8',
                 'dtype': column_types,
                 'parse_dates': ['MissionDate'],
                 'infer_datetime_format': True}
df = pd.read_csv('input/ice-air.csv.gz', **read_csv_opts)
# The ARTS Data Dictionary as released by ICE
data_dict = pd.read_csv('input/ARTS_Data_Dictionary.csv.gz', compression='gzip', sep='|')
data_dict.columns = ['Field', 'Definition']
# A YAML file containing the field names in the original ARTS dataset
with open('hand/arts_cols.yaml', 'r') as yamlfile:
    arts_cols = yaml.load(yamlfile, Loader=yaml.SafeLoader)
# Asserting characteristics of key fields
assert sum(df['AlienMasterID'].isnull()) == 0  # no passenger record is missing an ID
assert len(df) == len(set(df['AlienMasterID']))  # each AlienMasterID appears exactly once
assert sum(df['MissionID'].isnull()) == 0  # every record has a MissionID
assert sum(df['MissionNumber'].isnull()) == 0  # every record has a MissionNumber
assert len(set(df['MissionID'])) == len(set(df['MissionNumber']))  # as many distinct IDs as numbers
Comments
Mark, thank you so much for this. There is a strange subtlety to this code: it is notebook-paradigm, and it resembles IPython / Jupyter notebooks, but it isn't one. I am orienting myself to it with a kind of reader's walking guide leading from the report, to the appendix, through the rendering and execution tool, to the code:
> Published from src/ice-air-data-appendix.pmd using Pweave 0.30.3 on 24-04-2019.
> "a scientific report generator and a literate programming tool for Python. Pweave can capture the results and plots from data analysis and works well with NumPy, SciPy and matplotlib. It is able to run python code from source document and include the results and capture matplotlib plots in the output." It takes a custom
.pmdtext file as input (indicating PWeave Markdown), which contains its own mix of Python code for executing, Markdown for presentation, and template tags such as<%print %>for dynamic rendering -- all in a much thinner format than an iPython or Jupyter.ipynbfile. It outputs HTML such as the Data Appendix web page.It is worth noting that the team appears to use PWeave only for writing and presentation, and doesn't seem to be using it for live analysis and data exploration. This makes sense as PWeave doesn't appear to support an interactive editing mode -- they have report generation set up to be triggered from a shell Make command. Nevetheless, they they are also using interactive notebooks in analysis. For example,
installment1/analyze/note/contains a collection of nine .ipynb iPython notebooks files.Looking at the PWeave / pmd format, I notice two things - first, the
echo=True/echo=Falseon selected python fenced code blocks, such as the first two.Code is executed (e.g. the python libraries are imported) when the PWeave report is generated. At the same time, the code block is also printed (what was executed is transparent) -- in general. However, the very next fenced code block has
echo=False-- not because it is secretly manipulating the data, but because it would hinder readability in the output. So the .pmd file is demonstrating transparency, but as a principle, not an ironclad rule or law: code still isn't shown if it would make the Appendix harder to read and understand.Another thing that I notice is how PWeave supports templating elements so that the results from a code block may be displayed below it or incorporated into text, for example:
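A minimal sketch of what such a fragment could look like in a `.pmd` file (the chunk contents and variable name here are hypothetical; the actual Appendix source will differ):

````
```python, echo=False
n_passengers = len(df)
```

The cleaned ARTS dataset contains <%= n_passengers %> passenger records.
````

When Pweave renders the document, the chunk runs silently (`echo=False`) and the `<%= %>` tag is replaced by the computed value, so the prose always matches the current state of the data.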
In the case of rolling rounds of data updates and data cleaning, this keeps parts of the report cleanly dynamic, so that out-of-date summary numbers won't slip through editing. This feature may have been extremely useful in the rounds of revision leading up to the release of the report and Appendix -- at which point the report, appendix, and backing notebook may have become fixed. Strangely, moving the code into a subfolder later that September, during a repo cleanup, truncated the file's visible version history (by default, git log doesn't follow a file across renames; git log --follow recovers it).
From here I imagine that one could keep pulling on these threads (starting with those `assert` statements!) ... and, of course, think about the bigger picture: how the ethics of a network of human rights violations "hidden in plain sight" is addressed through the data, software, and code practices in this code snippet, repo, and project.
This is amazing! One thing that has always fascinated me in a CCS sort of way (even though, as anyone on Team ELIZA will tell you, I really don't have my head around what CCS is about) is variable names, because they almost always reveal the specific meaning intended for that variable, and, unlike the code itself, the variables are a completely free choice of the author -- that is, you could change all the names of all the variables to random strings, and the program would (at least for most programs) work exactly the same, which isn't true of, for example, randomly changing the additions to subtractions, the ands to ors, etc. So we have in variable names a pure window into the thoughts of the authors. For example, even in just this simple short snippet above we have terms like "mission" and "alien". The ... what's the right critical term? ... complexities, implications, hege-something-or-other of "alien" is, of course, obvious, and discussed in public in great detail because it shows up in the news. But the term "mission" has always struck me as interesting: it's not the neutral "flight" (I'm a pilot; we don't call our flights "missions"!) -- it militarizes what is actually just a flight (often on a standard airline!) into something that seems like military pilots flying fighter jets into enemy airspace at tree level, and then picking up, or in this case dropping off, our valiant soldiers from the bowels of the battle.
@jshrager: Indeed, attending to variable names as unconstrained, overdetermined carriers of meaning might be a core move in CCS: "What are all the interesting names in the relevant namespaces?" (variables, functions, arguments, modules, data columns, etc.). David Berry's current experiments with the CCS Workbench could perhaps add a button for that....
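As a toy sketch of such a "names button" (purely hypothetical -- not part of the UWCHR repo or, as far as I know, the Workbench), Python's standard ast module can harvest the identifiers and string constants from a source fragment for exactly this kind of critical reading:

```python
import ast

# Parse a fragment of the appendix code (as a string) and harvest its names.
src = """
df = pd.read_csv('input/ice-air.csv.gz', **read_csv_opts)
assert len(df) == len(set(df['AlienMasterID']))
assert sum(df['MissionID'].isnull()) == 0
"""
tree = ast.parse(src)

# Identifiers chosen by the authors: variables, functions, modules
idents = sorted({n.id for n in ast.walk(tree) if isinstance(n, ast.Name)})

# String constants: in pandas code, this is where the column names live
strings = sorted({n.value for n in ast.walk(tree)
                  if isinstance(n, ast.Constant) and isinstance(n.value, str)})

print(idents)   # ['df', 'len', 'pd', 'read_csv_opts', 'set', 'sum']
print(strings)  # ['AlienMasterID', 'MissionID', 'input/ice-air.csv.gz']
```

Notably, in pandas code the most charged names ("Alien", "Mission") live in string constants rather than in Python identifiers, so a namespace survey has to scan both.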
Digging a bit deeper into the role of Pweave in the code+data journalism pipeline that we are looking at here, I at first wondered whether someone working with the Center for Human Rights had home-rolled Pweave. But no: it was actually created by Matti Pastell, Professor of Future Farming Technologies at Natural Resources Institute Finland. It is a Python-inspired adaptation of the approach of Sweave, "a function in the statistical programming language R that enables integration of R code into LaTeX documents" -- that is, it weaves together code chunks and formatting. Sweave, I would guess, was originally named for weaving in "S", the statistical language that R directly descends from; keeping the old name in R might be an artifact of Sweave being a longstanding function in both.
Like Sweave, Pweave is conceptually based on the `noweb` format, which is like code notebook documents, yet also unlike them. I think this is interesting in part because it aligns this conversation with a deeper conversation during this working group about histories of document formatting, e.g. Scribe.

So, I want to point this conversation in a different direction. Rather than focusing on this code as the object of study in itself, what happens if we think of this code as part of a tool for getting at the system we cannot see, the system of mass deportations? What if we read the code in light of this quote from the appendix:
So if ICE processes are in a black box, this code tries to get at what's in that black box by examining the outputs. This code encodes hypotheses about how ICE functions. We cannot see inside ARTS (itself a grotesque acronym) or Palantir. Too often, software and systems of control -- whether related to policing, immigration, the distribution of resources, or other crucial functions -- are hidden from those affected by them and those governed by the political leaders who put the systems in place. But we can write code to examine the functioning of these systems, to try to hold them accountable by examining the pipeline.
Take the assertions, which, if I understand correctly, are typically checks for debugging. But in this case, `assert len(df) == len(set(df['AlienMasterID']))` makes an assertion that each person is assigned a unique number -- testing out a hypothesis about a system we are trying to understand through this code. The hypothesis: "Every deportee has a unique ID that is used exactly once." If that is the case, repeat deportees cannot be tracked, at least not in this released data -- which really just points to how incomplete the data is. Surely ICE uses additional forms of identification to track its deportees. But that information has not been released. The information is not very free.
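For what it's worth, the same hypothesis can be phrased more directly in pandas; this is a paraphrase of the appendix's checks, not the appendix's own code:

```python
# Every passenger record carries an AlienMasterID, and no AlienMasterID
# appears twice anywhere in the released data.
assert df['AlienMasterID'].notnull().all()
assert df['AlienMasterID'].is_unique
```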
Of course, assigning humans numbers in this instance also reminds us of what a poor, partial, dehumanized view we get through this data.
At the same time, the University of Washington team makes its code available for scrutiny, performing a level of transparency as political praxis.