Title: Ice-Air Data Appendix
Author/s: University of Washington Center for Human Rights
Language/s: pmd -- python + markdown
Year/s of development: 2019
The Ice-Air Data Appendix is a notebook-style document that accompanied the University of Washington Center for Human Rights' report "Hidden in Plain Sight: ICE Air and the Machinery of Mass Deportation." It processes almost 2 million passenger records from the deportation flight database of ICE (U.S. Immigration and Customs Enforcement), obtained through FOIA. The pmd file is audit code that loads the cleaned ICE Air dataset, enforces the schema, runs consistency checks, and generates groupings to verify the credibility of the data used in the appendix.
Source: ARTS (Alien Repatriation Tracking System), ICE's internal database for managing deportation charter flights, tracking passengers, flights, passenger traits, and pickup and drop-off locations.
The data work follows the "Principled Data Processing" approach developed by the Human Rights Data Analysis Group (HRDAG).
What aspects of ICE operations and ideology (deport/punish/oppress/purify) can we critique through this code? What is the value of critiquing the pipeline around code we cannot see? How does this code model ways of monitoring and auditing the State?
Source File: ice-air-data-appendix.pmd
This is an appendix to the report Hidden in Plain Sight: ICE Air and the Machinery of Mass Deportation, which uses data from ICE's Alien Repatriation Tracking System (ARTS) released by ICE Enforcement and Removal Operations pursuant to a Freedom of Information Act request by the University of Washington Center for Human Rights. This appendix is intended to provide readers with greater detail on the contents, structure, and limitations of this dataset, and the process our researchers performed to render it suitable for social scientific analysis. The appendix is a living document that will be updated over time in order to make ICE Air data as widely accessible and transparently documented as possible.
The project repository contains all the data and code used for the production of the report.
import pandas as pd
import yaml

# Get optimal data types before reading in the ARTS dataset
with open('input/dtypes.yaml', 'r') as yamlfile:
    column_types = yaml.load(yamlfile, Loader=yaml.SafeLoader)  # Loader arg required by newer PyYAML

read_csv_opts = {'sep': '|',
                 'quotechar': '"',
                 'compression': 'gzip',
                 'encoding': 'utf-8',
                 'dtype': column_types,
                 'parse_dates': ['MissionDate'],
                 'infer_datetime_format': True}
df = pd.read_csv('input/ice-air.csv.gz', **read_csv_opts)
# The ARTS Data Dictionary as released by ICE
data_dict = pd.read_csv('input/ARTS_Data_Dictionary.csv.gz', compression='gzip', sep='|')
data_dict.columns = ['Field', 'Definition']
# A YAML file containing the field names in the original ARTS dataset
with open('hand/arts_cols.yaml', 'r') as yamlfile:
    arts_cols = yaml.load(yamlfile, Loader=yaml.SafeLoader)
# Asserting characteristics of key fields
assert sum(df['AlienMasterID'].isnull()) == 0  # no passenger record is missing an ID
assert len(df) == len(set(df['AlienMasterID']))  # each AlienMasterID appears exactly once
assert sum(df['MissionID'].isnull()) == 0  # every record has a MissionID
assert sum(df['MissionNumber'].isnull()) == 0  # every record has a MissionNumber
assert len(set(df['MissionID'])) == len(set(df['MissionNumber']))  # as many distinct IDs as numbers
Comments
Mark, thank you so much for this. There is a strange subtlety to this code: it is notebook-paradigm, and it resembles IPython / Jupyter notebooks, but it isn't one. I am orienting myself to it with a kind of reader's walking guide leading from the report, to the appendix, through the rendering and execution tool, to the code:
> Published from src/ice-air-data-appendix.pmd using Pweave 0.30.3 on 24-04-2019.
> "a scientific report generator and a literate programming tool for Python. Pweave can capture the results and plots from data analysis and works well with NumPy, SciPy and matplotlib. It is able to run python code from source document and include the results and capture matplotlib plots in the output." It takes a custom
.pmdtext file as input (indicating PWeave Markdown), which contains its own mix of Python code for executing, Markdown for presentation, and template tags such as<%print %>for dynamic rendering -- all in a much thinner format than an iPython or Jupyter.ipynbfile. It outputs HTML such as the Data Appendix web page.It is worth noting that the team appears to use PWeave only for writing and presentation, and doesn't seem to be using it for live analysis and data exploration. This makes sense as PWeave doesn't appear to support an interactive editing mode -- they have report generation set up to be triggered from a shell Make command. Nevetheless, they they are also using interactive notebooks in analysis. For example,
installment1/analyze/note/contains a collection of nine .ipynb iPython notebooks files.Looking at the PWeave / pmd format, I notice two things - first, the
echo=True/echo=Falseon selected python fenced code blocks, such as the first two.Code is executed (e.g. the python libraries are imported) when the PWeave report is generated. At the same time, the code block is also printed (what was executed is transparent) -- in general. However, the very next fenced code block has
echo=False-- not because it is secretly manipulating the data, but because it would hinder readability in the output. So the .pmd file is demonstrating transparency, but as a principle, not an ironclad rule or law: code still isn't shown if it would make the Appendix harder to read and understand.Another thing that I notice is how PWeave supports templating elements so that the results from a code block may be displayed below it or incorporated into text, for example:
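A minimal sketch of what such a fragment could look like in a `.pmd` file (the chunk contents and variable name here are hypothetical; the actual Appendix source will differ):

````
```python, echo=False
n_passengers = len(df)
```

The cleaned ARTS dataset contains <%= n_passengers %> passenger records.
````

When Pweave renders the document, the chunk runs silently (`echo=False`) and the `<%= %>` tag is replaced by the computed value, so the prose always matches the current state of the data.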
In the case of rolling rounds of data updates and data cleaning, this keeps parts of the report cleanly dynamic, so that out-of-date summary numbers won't slip through editing. This feature may have been extremely useful in the rounds of revision leading up to the release of the report and Appendix -- at which point the report, appendix, and backing notebook may have become fixed. Strangely, moving the code into a subfolder later that September, during a repo cleanup, truncated the file's visible version history (by default, git log doesn't follow a file across renames; git log --follow recovers it).
From here I imagine that one could keep pulling on these threads (starting with those `assert` statements!) ... and, of course, think about the bigger picture: how the ethics of a network of human rights violations "hidden in plain sight" is addressed through the data, software, and code practices in this code snippet, repo, and project.
This is amazing! One thing that has always fascinated me in a CCS sort of way (even though, as anyone on Team ELIZA will tell you, I really don't have my head around what CCS is about) is variable names, because they almost always reveal the specific meaning intended for that variable, and, unlike the code itself, the variables are a completely free choice of the author -- that is, you could change all the names of all the variables to random strings, and the program would (at least for most programs) work exactly the same, which isn't true of, for example, randomly changing the additions to subtractions, the ands to ors, etc. So we have in variable names a pure window into the thoughts of the authors. For example, even in just this simple short snippet above we have terms like "mission" and "alien". The ... what's the right critical term? ... complexities, implications, hege-something-or-other of "alien" is, of course, obvious, and discussed in public in great detail because it shows up in the news. But the term "mission" has always struck me as interesting: it's not the neutral "flight" (I'm a pilot; we don't call our flights "missions"!) -- it militarizes what is actually just a flight (often on a standard airline!) into something that seems like military pilots flying fighter jets into enemy airspace at tree level, and then picking up, or in this case dropping off, our valiant soldiers from the bowels of the battle.
@jshrager: Indeed, attending to variable names as unconstrained, overdetermined carriers of meaning might be a core move in CCS: "What are all the interesting names in the relevant namespaces?" (variables, functions, arguments, modules, data columns, etc.). David Berry's current experiments with the CCS Workbench could perhaps add a button for that....
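As a toy sketch of such a "names button" (purely hypothetical -- not part of the UWCHR repo or, as far as I know, the Workbench), Python's standard ast module can harvest the identifiers and string constants from a source fragment for exactly this kind of critical reading:

```python
import ast

# Parse a fragment of the appendix code (as a string) and harvest its names.
src = """
df = pd.read_csv('input/ice-air.csv.gz', **read_csv_opts)
assert len(df) == len(set(df['AlienMasterID']))
assert sum(df['MissionID'].isnull()) == 0
"""
tree = ast.parse(src)

# Identifiers chosen by the authors: variables, functions, modules
idents = sorted({n.id for n in ast.walk(tree) if isinstance(n, ast.Name)})

# String constants: in pandas code, this is where the column names live
strings = sorted({n.value for n in ast.walk(tree)
                  if isinstance(n, ast.Constant) and isinstance(n.value, str)})

print(idents)   # ['df', 'len', 'pd', 'read_csv_opts', 'set', 'sum']
print(strings)  # ['AlienMasterID', 'MissionID', 'input/ice-air.csv.gz']
```

Notably, in pandas code the most charged names ("Alien", "Mission") live in string constants rather than in Python identifiers, so a namespace survey has to scan both.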
Digging a bit deeper into the role of Pweave in the code+data journalism pipeline that we are looking at here, I at first wondered whether someone working with the Center for Human Rights had home-rolled Pweave. But no: it was actually created by Matti Pastell, Professor of Future Farming Technologies at Natural Resources Institute Finland. It is a Python-inspired adaptation of the approach of Sweave, "a function in the statistical programming language R that enables integration of R code into LaTeX documents" -- that is, it weaves together code chunks and formatting. Sweave, I would guess, was originally named for weaving in "S", the statistical language that R directly descends from; keeping the old name in R might be an artifact of Sweave being a longstanding function in both.
Like Sweave, Pweave is conceptually based on the `noweb` format, which is like code notebook documents, yet also unlike them. I think this is interesting in part because it aligns this conversation with a deeper conversation during this working group about histories of document formatting, e.g. Scribe.

So, I want to point this conversation in a different direction. Rather than focusing on this code as the object of study in itself, what happens if we think of this code as part of a tool for getting at the system we cannot see, the system of mass deportations? What if we read the code in light of this quote from the appendix:
So if ICE processes are in a black box, this code tries to get at what's in that black box by examining the outputs. This code encodes hypotheses about how ICE functions. We cannot see inside ARTS (itself a grotesque acronym) or Palantir. Too often, software and systems of control -- whether related to policing, immigration, the distribution of resources, or other crucial functions -- are hidden from those affected by them and those governed by the political leaders who put the systems in place. But we can write code to examine the functioning of these systems, to try to hold them accountable by examining the pipeline.
Take the assertions, which, if I understand correctly, are typically checks for debugging. But in this case, `assert len(df) == len(set(df['AlienMasterID']))` makes an assertion that each person is assigned a unique number -- testing out a hypothesis about a system we are trying to understand through this code. The hypothesis: "Every deportee has a unique ID that is used exactly once." If that is the case, repeat deportees cannot be tracked, at least not in this released data -- which really just points to how incomplete the data is. Surely ICE uses additional forms of identification to track its deportees. But that information has not been released. The information is not very free.
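For what it's worth, the same hypothesis can be phrased more directly in pandas; this is a paraphrase of the appendix's checks, not the appendix's own code:

```python
# Every passenger record carries an AlienMasterID, and no AlienMasterID
# appears twice anywhere in the released data.
assert df['AlienMasterID'].notnull().all()
assert df['AlienMasterID'].is_unique
```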
Of course, assigning humans numbers in this instance also reminds us of what a poor, partial, dehumanized view we get through this data.
At the same time, the University of Washington team makes its code available for scrutiny, performing a level of transparency as political praxis.