Hi all! I'll be around all week to chat about any thoughts, ideas, and questions you may have concerning my 2022 DHQ article on Amazon's Alexa. Please also engage with each other.
Here's the original abstract, with key words and thoughts italicized.
"Reverse Engineering the Gendered Design of Amazon’s Alexa: Methods in Testing Closed-Source Code in Grey and Black Box Systems"
This article examines the gendered design of Amazon Alexa’s voice-driven capabilities, or, “skills,” in order to better understand how Alexa, as an AI assistant, mirrors traditionally feminized labour and sociocultural expectations. While Alexa’s code is closed source — meaning that the code is not available to be viewed, copied, or edited — certain features of the code architecture may be identified through methods akin to reverse engineering and black box testing. This article will examine what is available of Alexa’s code — the official software developer console through the Alexa Skills Kit, code samples and snippets of official Amazon-developed skills on Github, and the code of an unofficial, third-party user-developed skill on Github — in order to demonstrate that Alexa is designed to be female presenting, and that, as a consequence, expectations of gendered labour and behaviour have been built into the code and user experiences of various Alexa skills. In doing so, this article offers methods in critical code studies toward analyzing code to which we do not have access. It also provides a better understanding of the inherently gendered design of AI that is designated for care, assistance, and menial labour, outlining ways in which these design choices may affect and influence user behaviours.
*
Here are some discussion questions to get us started:
How do humanities- and social sciences-based theories and methods augment CCS? For instance, in this article, I describe Anne Balsamo's method of "hermeneutic reverse engineering," through which I used methods of close reading and discourse analysis to analyze Alexa's responses, the Alexa skills kit console, and some of the code used to build Alexa skills.
What other examples of closed-source and proprietary code and software do you think would benefit from a mixed method approach to reverse engineering their design?
What does CCS lend to analyses of gendered, racialized, and class-based biases in code and software?
Comparisons are sometimes made between language-based AI such as Alexa and ChatGPT, but in addition to these AI systems stemming from different types of NLP design, they also create different user expectations about what they should be used for and how users are expected to interact with each interface. Discuss.
Comments
First off, @Lai-Tze Fan, this is a wonderful essay and an excellent example of how to apply CCS to closed-source or proprietary software. Researchers like Safiya Noble and Joy Buolamwini have called for an examination of the encoded bias of software systems, particularly AI. However, finding a rigorous and reproducible methodology can be challenging, and basing critiques on what are essentially individual encounters or anecdotal use cases risks all sorts of incorrect generalizations about the software. In your study, by taking on the code made for Alexa that is available, you have found a way to augment your reading.
There is a little more code available, mostly of a demonstration variety, in the GitHub repository for Alexa, designed to help people write new Alexa skills. However, demo code has that generic quality: it is stripped of key context, like its authorship or the occasion for which it was created.
I find it interesting that the UNESCO report doesn't discuss the code directly or offer any examples. You would think such a research entity might be granted a bit of access.
If it's all right, I'd like to bring in your two examples from the article for further discussion here:
This historic code for when Alexa does not understand the user
And this sample code for the "Make me a Sandwich" demo skill.
Can you talk a little bit about the challenges of finding and choosing this code and any concerns you had about using it in your critique?
Thanks so much for the lovely support and feedback, Mark!
To answer your question--Can you talk a little bit about the challenges of finding and choosing this code and any concerns you had about using it in your critique?--I will discuss each of the examples separately!
First, I came across the code for when Alexa doesn't understand the user, aka "RandomConfusionMessage," in Amazon's GitHub repository for Alexa. Back then (2021), they pinned several repositories as introductory and fundamental templates to build Alexa Skills, so I took this structural direction as signalling where they placed their values (lol this word) for programming and customizing Skills. I chose this specific code example because it made me realize that any gibberish would return a RandomConfusion response from Alexa. In turn, this meant that any meaningful response--including to problematic language and requests to which Alexa seemed to have a clever answer, such as sassy remarks to "make me a sandwich"--was included in the larger code scripts that I could not see. In other words, somewhere in that unseen code, "make me a sandwich" is included, or else I would have gotten one of the RandomConfusionMessages. This was really a process of deduction: realizing that the presence of something on the level of the interface meant the presence of a corresponding part on the level of the backend, or code.
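To make that deduction concrete, here is a minimal sketch of the routing logic I'm describing. This is a hypothetical reconstruction, not Amazon's actual source: the function names and the message pool are illustrative stand-ins, though one of the phrases is quoted from Alexa's observed responses.

```ts
// Illustrative pool of "confusion" messages; one is quoted from Alexa's
// observed responses, the others are stand-ins.
const confusionMessages: string[] = [
  "Sorry, I didn't get that.",
  "Hmm, I don't know that one.",
  "I'll add that to my list of things to learn soon!",
];

// Pick one confusion message at random, as getRandomConfusionMessage() appears to do.
function getRandomConfusionMessage(): string {
  return confusionMessages[Math.floor(Math.random() * confusionMessages.length)];
}

// The routing implied by the deduction above: an utterance that matches a known
// intent (e.g., "make me a sandwich") is answered by scripted responses in code
// we cannot see; only unmatched input falls through to the confusion pool.
function handleUtterance(utterance: string, scriptedResponses: Map<string, string>): string {
  return scriptedResponses.get(utterance.toLowerCase()) ?? getRandomConfusionMessage();
}
```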
Second, I believe I came across the code for the "Make me a sandwich" demo Skill by Googling "Alexa + [selection from a list of problematic statements that are not fun to type]". I found a lot of popular websites, Reddit discussions, and YouTube videos making fun of the ability to tell Alexa to "make me a sandwich," but more interestingly, I found code by a third-party user through which one could actually program Alexa to order a sandwich from the Jimmy John's American sandwich chain. Upon further investigation, I noted that this was not an Alexa Skill that exists in the Amazon store, so you can't actually download it for Alexa; there are multiple reasons that the Skill would not be accepted by Amazon, which I outline in the article. But also, analyzing this third-party code for its differences from approved Alexa Skills set me on a course of digging into what terms and conditions Amazon has for third-party Skills proposed for Alexa; this is where I noted item #8 in their policy, which states that Skills shouldn't have any hateful representation. So apparently gendered stereotypes don't count? Hrm.
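For readers curious about what wiring up such a third-party Skill involves, here is a minimal sketch using the Alexa Skills Kit SDK for Node.js. The intent name (OrderSandwichIntent) and the speech text are hypothetical placeholders of mine, not the code from that repository.

```ts
import * as Alexa from 'ask-sdk-core';

// Hypothetical handler: the intent name and speech text are placeholders,
// not the actual third-party Jimmy John's code.
const OrderSandwichIntentHandler: Alexa.RequestHandler = {
  canHandle(handlerInput) {
    return (
      Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest' &&
      Alexa.getIntentName(handlerInput.requestEnvelope) === 'OrderSandwichIntent'
    );
  },
  handle(handlerInput) {
    // A real ordering Skill would call out to an external ordering API here;
    // this sketch only returns a spoken confirmation.
    return handlerInput.responseBuilder
      .speak('Okay, placing your sandwich order now.')
      .getResponse();
  },
};

export const handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(OrderSandwichIntentHandler)
  .lambda();
```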
I also wanted to add, in response to your generous comments, Mark: in truth, I had no lofty aspirations to augment or improve upon the incredible work of Noble, Buolamwini, Gebru, Benjamin, and others who have explored algorithmic bias. Instead, I was curious about learning more about Alexa from a playful and exploratory stance, wondering how someone from a humanities and social sciences background could research code with the tools that Big Tech companies want us to have. So I would say that curiosity and fun were my main research methods, which let the research findings and ensuing methods emerge a little more organically, without much stress over what I'd offer up to CCS and the special issue of DHQ. I hope students in particular can hear what I'm saying there: trust the process, and if it feels like you don't have the same background as "experts," that doesn't mean you can't take the time to dig around, learn things, and see things from different perspectives that "experts" don't necessarily have. (:
Whew, that was a mouthful. If you all prefer, I can keep my responses shorter in the future. ;P
Thanks @Lai-Tze Fan for such an insightful article. In response to the code snippet of getRandomConfusionMessage() for Alexa's responses when the input value is not decipherable, I found it notable that of the five responses, four assign fault to Alexa (by Alexa) for not understanding the phrase. If the functional intention is for the user to state their desired command more clearly, I wonder why empty phrases like "I'll add that to my list of things to learn soon!" would be relevant beyond upholding a social guise (the system isn't actually recording and learning from the phrase). Perhaps coding Alexa to respond with a question, statement, or advice that helps the user clarify what they are saying in a way that registers with Alexa would be more instructive and appropriate.
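Something along those lines might look like the following, sketched with the Alexa Skills Kit SDK for Node.js purely as a hypothetical alternative; the wording and behaviour are my own invention, not Amazon's code.

```ts
import * as Alexa from 'ask-sdk-core';

// A sketch of the alternative suggested above: instead of a self-blaming filler
// phrase, ask the user to rephrase and keep the session open for their answer.
const ClarifyingFallbackHandler: Alexa.RequestHandler = {
  canHandle(handlerInput) {
    return (
      Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest' &&
      Alexa.getIntentName(handlerInput.requestEnvelope) === 'AMAZON.FallbackIntent'
    );
  },
  handle(handlerInput) {
    const prompt =
      "I heard you, but I couldn't match that to anything I can do. " +
      'Could you rephrase it, perhaps by naming the Skill you want to use?';
    return handlerInput.responseBuilder
      .speak(prompt)
      .reprompt('What would you like me to do?')
      .getResponse();
  },
};
```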
I'm curious, though, about the methodology of reverse engineering as both a technical and a speculative practice, in the sense of following data trails to hypothesize an omission. A sense of openness to empty space, of not knowing, a suspension of disbelief, if you will, seems productive in crafting new methodologies and ways of knowing. Really inspiring work! I was reminded of recent initiatives such as RedPajama, which involves various data science research groups and start-ups working to make LLMs open source by recreating the datasets that LLMs like Llama 2 were trained on. Of course that data is not readily available, but what RedPajama interestingly did was recreate the dataset from scratch by drawing from data sources (such as Wikipedia and open books) and pre-processing, filtering, and tuning the quality filters of the data to match the number of tokens reported by Meta in their published papers. Because information about how Meta selected data from the public sources was not available, the RedPajama team made estimates to create its equivalents, and engaged a community of people to do so. This case draws me to open data movements as a kind of cornerstone from which creators and developers can gain more autonomy in customizing foundation models (and what we consider to be such) and in imagining the possibility of alternative neural architectures.
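A toy sketch of that "tune the filters until the token counts match" move might look like this; the tokenizer, quality score, and token target here are all stand-ins, not RedPajama's actual pipeline.

```ts
// Toy sketch only: the quality score is assumed to come from some upstream
// classifier, and the tokenizer is a crude whitespace stand-in.
interface Doc {
  text: string;
  qualityScore: number;
}

const approxTokenCount = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;

// Lower the quality threshold step by step until the retained corpus reaches
// roughly the token budget reported for the original dataset.
function tuneQualityThreshold(docs: Doc[], targetTokens: number): number {
  for (let threshold = 1.0; threshold >= 0; threshold -= 0.01) {
    const kept = docs.filter((d) => d.qualityScore >= threshold);
    const tokens = kept.reduce((sum, d) => sum + approxTokenCount(d.text), 0);
    if (tokens >= targetTokens) return threshold;
  }
  return 0; // even keeping everything falls short of the reported count
}
```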
Hi @andreask, that's a great observation. I also thought about the fact that Alexa was put in the position of blame, but it reminds me a lot of the kinds of responses documented by female programmers (e.g., of the ENIAC computer) in the 1940s and 50s, who were supposedly not encouraged to ask many questions, question their commanders, or even have much of a retort. Wendy Hui Kyong Chun's article "On Software, or the Persistence of Visual Knowledge" (2005) addresses this in the section "Yes, Sir," in which computer scientist Dr. Grace Hopper describes that it felt like her only response to men's programming commands was to be "yes, sir." I also think about how this docility is expected in customer service (which is often why women are asked to serve in front-facing service positions, including cashier and HR roles). Just settle your customer down and deal with the problem on your own.
As for your second comment, the STEM-based field of reverse engineering is fairly speculative insofar as a researcher has to imagine scenarios of the past (rather than the present or future) that would have brought products into being. Balsamo's humanities-based approach to this in hermeneutic reverse engineering asks us to pose questions of social and cultural context as well ... but she does not address code. I guess the question I have for others in return is how the reverse engineering of data can happen in increasingly proprietary software futures ... and as you say, open data movements seem to provide the answer.
I want to add my voice to the praise for this article, in general, and more specifically by drawing attention to the very useful section on Hermeneutic Reverse Engineering as an approach to examining processes in black/grey box systems, such as Alexa and OpenAI's GPT LLMs.
One of the things I focused on in my initial analysis of Lillian-Yvonne Bertram's A Black Story May Contain Sensitive Content is the black box testing technique you begin that section with. In your article, you astutely note the difference in responses to the gendered minimal pair of prompts to Alexa "You're pretty" vs "You're handsome." And throughout your article, you not only expand to other useful approaches, but also note that the system is constantly changing and evolving, and corrects problematic responses.
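To make that testing move explicit, here is a small sketch of the minimal-pair probe; queryAssistant is a placeholder for whatever access one has to the system under test (the Alexa developer console's test tab, a chat endpoint, etc.).

```ts
// queryAssistant is a placeholder for however one can send a prompt to the
// black box system and get its spoken/text response back.
type QueryFn = (prompt: string) => Promise<string>;

// Send a gendered minimal pair of prompts repeatedly and collect the response
// pools for comparison, since replies are often drawn at random from a set.
async function minimalPairTest(
  queryAssistant: QueryFn,
  pair: [string, string],
  repetitions = 5
): Promise<Record<string, string[]>> {
  const results: Record<string, string[]> = { [pair[0]]: [], [pair[1]]: [] };
  for (const prompt of pair) {
    for (let i = 0; i < repetitions; i++) {
      results[prompt].push(await queryAssistant(prompt));
    }
  }
  return results;
}

// Usage: minimalPairTest(queryAssistant, ["You're pretty", "You're handsome"]);
```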
Part of what I find fascinating is how Bertram's prompt "tell me a Black story" yields consistent, though increasingly censored, responses over time as OpenAI's GPT LLM continues to be developed and trained with user interaction and new data. In my comment, I use the metaphor of language prompts as a kind of sonar mapping of unseen (black box) spaces.
And so these humanist, poetic even, examinations of black box systems keep revealing and mapping the unseen algorithms of oppression (pun-nod to Safiya Umoja Noble, who helped unveil this method) built into these systems. Thank you for your useful article that expands our toolbox for understanding and critiquing processes in closed systems.
Thanks for this, @Lai-Tze Fan.
One thing I've recently discovered which makes me feel quite old: the "make me a sandwich" thing is sometimes perceived differently by younger folks.
A couple of weeks ago, I was lamenting that Sudo themselves have leaned into the joke, and their logo is now this: https://commons.m.wikimedia.org/wiki/File:Sudo_logo.png
They explicitly changed the logo to this based on the xkcd: https://xkcd.com/149/
Several folks in the chat asked me why I thought the new logo was upsetting, and I brought up the sexist connotations of the original comic and how it had harmful repercussions for years afterwards. Some of the people involved were simply too young to recognize the reference, as the comic is 21 years old at this point. Some were old enough, but had simply forgotten!
This is not to excuse the reproduction of the joke in Alexa's case, but more to bring up how nefarious these kinds of referents can be; sometimes they are uncritically reproduced by people who would find the original referent distasteful, but simply did not have access to it in the first place and therefore didn't realize what they were doing. It can also make it difficult to discuss these issues. Luckily my friends are all understanding of things like this, but I am reminded of discussions about the "okay" hand sign, co-opted by white supremacists in recent years. It was chosen precisely for its illegibility, and I remember trying to discuss with people at the time how making the sign makes them look, yet many would say, "you're being ridiculous, it's just an okay sign." In this way, the Jimmy John's application is doing a similar thing: "how can you imply that 'make me a sandwich' is misogynist? That's just simply what you do when you order!"
@leoflores Dearest Leo! My apologies for the delay--I was travelling and didn't have my log-in info, but I'm back now ... and just in time before this portal of posts closes in on us ...
Your response is very kind in its alignments with works and scholars who I already admire so much, particularly the creative/critical work of Lillian-Yvonne Bertram. I'm really intrigued by your contribution to the conversation through using the metaphor of language as a sonar mapping, which is exactly what I picture too ... feeling out the walls of the black box by throwing sounds and objects to see what bounces, what sticks, and where. What are the parameters? In that sense, language becomes a tool for the method of triangulation: throw out language and see what it throws back, then deduce who/what is doing the throwing.
@Steve.Klabnik Apologies for the delay, Steve! It's lovely to make your virtual acquaintance.
Wow, I was not aware of Sudo's logo changing ... I am disappointed if they are leaning into the comic reference directly and will be sure to note that you made this observation when talking about implications of that "make me a sandwich" prompt. As you say, however, 21 years is long enough that the reference has lost some context.
I guess in this sense, I found, for once, web resources like Wikipedia and even Urban Dictionary and Know Your Meme to be useful research resources for the historical archive and genealogical overview that they offer of certain concepts and terms that have grown up with the Internet. Since sites like Wikipedia give insight into editing history, these traces, edits, and updates also reveal a lot of needed context. And since sites like Urban Dictionary have community-based voting systems, they have an internal content management system that LLM designers have also drawn on as a system within a system (e.g., it's likely that the GPT models use many Reddit forums in their training). I wonder what other kinds of crowd-sourced knowledge systems exist out there that can help add context to conversations where it might otherwise be lost or seen as outdated ...
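As a small illustration of that editing-history point, here is a sketch that pulls recent revision timestamps and edit summaries from the public MediaWiki API; the page title in the usage note is just an example.

```ts
// Fetch recent revision metadata for a Wikipedia page via the MediaWiki API.
interface Revision {
  timestamp: string;
  comment: string;
}

async function fetchRevisionHistory(title: string, limit = 10): Promise<Revision[]> {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'revisions',
    titles: title,
    rvprop: 'timestamp|comment',
    rvlimit: String(limit),
    format: 'json',
    formatversion: '2',
  });
  const res = await fetch(`https://en.wikipedia.org/w/api.php?${params}`);
  const data = await res.json();
  return data.query.pages[0].revisions as Revision[];
}

// Usage: fetchRevisionHistory('Amazon Alexa').then(console.log);
```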
Thanks for your thoughts!