
2024 Participants: Hannah Ackermans * Sara Alsherif * Leonardo Aranda * Brian Arechiga * Jonathan Armoza * Stephanie E. August * Martin Bartelmus * Patsy Baudoin * Liat Berdugo * David Berry * Jason Boyd * Kevin Brock * Evan Buswell * Claire Carroll * John Cayley * Slavica Ceperkovic * Edmond Chang * Sarah Ciston * Lyr Colin * Daniel Cox * Christina Cuneo * Orla Delaney * Pierre Depaz * Ranjodh Singh Dhaliwal * Koundinya Dhulipalla * Samuel DiBella * Craig Dietrich * Quinn Dombrowski * Kevin Driscoll * Lai-Tze Fan * Max Feinstein * Meredith Finkelstein * Leonardo Flores * Cyril Focht * Gwen Foo * Federica Frabetti * Jordan Freitas * Erika FülöP * Sam Goree * Gulsen Guler * Anthony Hay * SHAWNÉ MICHAELAIN HOLLOWAY * Brendan Howell * Minh Hua * Amira Jarmakani * Dennis Jerz * Joey Jones * Ted Kafala * Titaÿna Kauffmann-Will * Darius Kazemi * andrea kim * Joey King * Ryan Leach * cynthia li * Judy Malloy * Zachary Mann * Marian Mazzone * Chris McGuinness * Yasemin Melek * Pablo Miranda Carranza * Jarah Moesch * Matt Nish-Lapidus * Yoehan Oh * Steven Oscherwitz * Stefano Penge * Marta Pérez-Campos * Jan-Christian Petersen * gripp prime * Rita Raley * Nicholas Raphael * Arpita Rathod * Amit Ray * Thorsten Ries * Abby Rinaldi * Mark Sample * Valérie Schafer * Carly Schnitzler * Arthur Schwarz * Lyle Skains * Rory Solomon * Winnie Soon * Harlin/Hayley Steele * Marylyn Tan * Daniel Temkin * Murielle Sandra Tiako Djomatchoua * Anna Tito * Introna Tommie * Fereshteh Toosi * Paige Treebridge * Lee Tusman * Joris J.van Zundert * Annette Vee * Dan Verständig * Yohanna Waliya * Shu Wan * Peggy WEIL * Jacque Wernimont * Katherine Yang * Zach Whalen * Elea Zhong * TengChao Zhou
CCSWG 2024 is coordinated by Lyr Colin (USC), Andrea Kim (USC), Elea Zhong (USC), Zachary Mann (USC), Jeremy Douglass (UCSB), and Mark C. Marino (USC). Sponsored by the Humanities and Critical Code Studies Lab (USC), and the Digital Arts and Humanities Commons (UCSB).

From the introductory chapter: "a job interview"

I've been prodded to disseminate this a little more widely during Week 1. For reference, the two approaches were as follows:

// Approach 1
function anagram(text) {
  var a = text.split("");
  for (var i = 0; i < a.length; i += 1) {
    var letter = a[i];
    var j = Math.floor(Math.random() * a.length);
    a[i] = a[j];
    a[j] = letter;
  }
  return a.join("");
}

and

// Approach 2
function anagram(text) {
  return text.split("").sort(function () {return 0.5 - Math.random()}).join("");
}

A thing to note about these is that they don't achieve the same thing. I initially posted a demonstration of that here.

Functionality

Firstly, note that both approaches are biased in terms of the distribution of permutations they create.

There's a simple counting argument here as to the bias in the two approaches. (Either argument might make a suitable follow-up question during this hypothetical interview.)

For the first approach, we make n swaps, each picking a target from n elements. There are thus n^n potential paths through the algorithm. However, there are n! distinct permutations. Because the former isn't evenly divisible by the latter [example: n = 8, n! = 40320, n^n = 16777216, and 16777216 isn't evenly divisible by 40320], there must be some bias in the outcome - some "anagrams" crop up more frequently than others. (This is evident even when we're producing anagrams of a simple string like "abc", which you can try by hand.)
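The counting argument can also be checked exhaustively for a three-character string. A sketch (the function name is mine; assumes Node.js or any modern JavaScript engine) that drives Approach 1 down every one of its 3^3 = 27 paths and tallies the results:

```javascript
// Enumerate every path through Approach 1 for a short string: at each
// of the n loop steps the swap target j can be any of n indices, giving
// n^n equally likely paths spread over only n! permutations.
function tallyNaivePaths(text) {
  var n = text.length;
  var counts = {};
  var paths = Math.pow(n, n);
  for (var p = 0; p < paths; p += 1) {
    var a = text.split("");
    var code = p;
    for (var i = 0; i < n; i += 1) {
      var j = code % n;             // this path's choice of swap target
      code = Math.floor(code / n);
      var letter = a[i];            // same swap as Approach 1
      a[i] = a[j];
      a[j] = letter;
    }
    var result = a.join("");
    counts[result] = (counts[result] || 0) + 1;
  }
  return counts;
}

// 27 paths over 6 permutations: three outcomes occur 5 times each,
// the other three only 4 times each - the bias is unavoidable.
console.log(tallyNaivePaths("abc"));

// And for n = 8: the path count leaves a remainder when divided by 8!.
console.log(Math.pow(8, 8) % 40320); // non-zero
```

The non-zero remainder on the last line is exactly the "doesn't divide evenly" step of the argument.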

For the second approach, any analysis of behaviour needs to rely on the fact that the sorting algorithm is likely to have been picked with a runtime of n log_2 n. [There are multiple ways to implement a sort, but that's a common behaviour. For n = 8, a sort may make 8 × log_2 8 = 24 comparisons.] Each time the sort algorithm makes a comparison, it'll perform one of two operations: swap, or don't swap. Thus, there are 2^(n log_2 n) potential paths through the algorithm - or again, n^n. [For the case n = 8, this comes out as 16777216 again.]

However, sorting algorithms try hard to not make comparisons when they "know" what the result would be; thus, this approach produces a much stronger bias in the sampled outcomes. Indeed, some potential anagrams of eight characters don't show up at all during the sample run.

[Assuming the sort function isn't broken, all permutations can turn up - we'd just require far more runs before we expected to see the extreme cases.]

On the question of which of those approaches is "stronger": the first approach produces outputs which are closer to uniform. Additionally, it requires a far smaller change to bring its output to a strictly uniform distribution. The second approach may be "clever", but it's by no means clear that it could be adjusted, should uniformity of output be desired.

If I had to interpret those two samples, I'd say it was much more likely that the second was a reproduction of a trick seen elsewhere, whereas the first seems more likely to be an on-the-spot invention.

The desirability of uniformity of distribution and the problem statement

Why desire a uniform distribution? There's certainly an aesthetic, symmetrical draw to it. At the other end of the spectrum, sorting all input characters may suffice - or simply returning the input string. (Where Unicode strings are concerned, "returning the value you've been given" is about the safest thing you can do - see below.)

If the requirement in the problem statement ["An anagram contains all the same characters of the initial string or word in a different order"] is a strong one, then the last option doesn't work - and both of the suggested approaches also have a flaw. However, under those circumstances, we should note that there are inputs where no algorithm can meet a strict interpretation of that requirement. [Examples: any single-character string, or "AAA", etc.]
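Detecting those degenerate inputs is cheap. A hypothetical guard (the function name is mine) might look like:

```javascript
// A strictly different ordering exists only if the string contains at
// least two distinct characters; otherwise every "anagram" is just the
// input itself ("A", "AAA", "", ...).
function hasProperAnagram(text) {
  return new Set(text.split("")).size > 1;
}

console.log(hasProperAnagram("AAA")); // false
console.log(hasProperAnagram("abc")); // true
```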

There may be other, less specific versions of "better," when picking a distribution - for instance, finding letter-orderings which produce more challenging anagrams for a human solver, or ones that embed wry phrases.

Pedantry to one side, however...

Broader observation

In typical i18n-blindness, none of the approaches here deals correctly with combining characters and other tricky Unicode features.
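A quick illustration of what goes wrong (assuming any modern JavaScript engine): split("") operates on UTF-16 code units, so both combining sequences and astral characters come apart under it.

```javascript
// "é" written as "e" + U+0301 COMBINING ACUTE ACCENT is two code
// points, so split("") hands the shuffle a bare letter plus a stray
// accent that can land on any neighbouring character.
console.log("e\u0301".split("").length); // 2

// An astral character such as U+1F600 is one code point but two UTF-16
// code units; split("") cuts it into unpaired surrogate halves.
console.log("\u{1F600}".split("").length); // 2

// Iteration (spread) is code-point aware, unlike split("").
console.log([..."\u{1F600}"].length); // 1
```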

I'd also observe that there's a western bias - perhaps even an Anglophone one - to the problem statement. I'm not sure that the notion of an anagram even applies across all the languages that Unicode covers. But the problem there can be found closer to home: if you asked Herr Gauß how many anagrams his surname generated, the answer probably wouldn't be 24. The Unicode consortium offers some partial approaches to dealing with this, but they are by no means comprehensive (and a desire for a comprehensive approach is misplaced, usually arising from an underestimation of how complex the problem space is). Software that deals with human text is notoriously hard to get even close to correct in a cross-cultural fashion. (cf. the question of "how should software handle names?" which has been the subject of a number of writeups.)
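The Gauß point shows up directly in JavaScript's Unicode case mapping, where even string length isn't stable:

```javascript
// Full Unicode uppercasing maps "ß" to "SS", so the four-character
// surname becomes five characters - and a naive n! anagram count
// (4! = 24) no longer has a well-defined n to work with.
console.log("Gauß".length);               // 4
console.log("Gauß".toUpperCase());        // "GAUSS"
console.log("Gauß".toUpperCase().length); // 5
```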

Comments

  • Excellent commentary. Thanks for bringing up a few great facts.

    1. Choosing a random permutation using the Math.random() - 0.5 comparator does not give a uniform distribution and should never be used in "real" code. It is BAD
      https://news.ycombinator.com/item?id=2728914

    2. The proper algorithm to use is the Fisher-Yates Shuffle https://medium.com/@oldwestaction/randomness-is-hard-e085decbcbb2 https://en.wikipedia.org/wiki/Fisher–Yates_shuffle

    3. Anything dealing with text is WAY HARDER than people understand. They have to know about normalization (the two characters LATIN CAPITAL LETTER E followed by COMBINING ACUTE ACCENT should be the same as the single character LATIN CAPITAL LETTER E WITH ACUTE) and then there are emojis and languages with all kinds of markers and things etc. etc. Most programming languages, including JavaScript and Java, do a terrible job with them.
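    That normalization equivalence can be checked directly (assuming an engine with String.prototype.normalize, which all current JavaScript engines provide):

```javascript
var precomposed = "\u00C9"; // É: LATIN CAPITAL LETTER E WITH ACUTE
var combining = "E\u0301";  // E followed by COMBINING ACUTE ACCENT

// The two spellings differ as raw code-unit sequences...
console.log(precomposed === combining);            // false
console.log(precomposed.length, combining.length); // 1 2

// ...but compare equal after Unicode normalization (NFC here).
console.log(precomposed.normalize("NFC") === combining.normalize("NFC")); // true
```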

    I've been fighting the battle against "i18n blindness" so it's nice to see this. Characters are very, very hard. I've written about this here: https://cs.lmu.edu/~ray/notes/charenc/. I tell students:

    String indexing and length-taking are highly advanced topics

    Until you have learned about Unicode, UTF encoding schemes, and Unicode normalization, you shouldn’t really be mucking around with the index positions of individual characters or taking the length of strings.

    If you do find yourself interviewing for an internship position and someone asks you a question about indexing and lengths, simply ask your interviewer: “may I ignore astral characters and normalization concerns?” They will probably be impressed. Then you can go ahead and assume one code unit per character.

    Anagramming is indeed an example of a problem that makes sense only in certain cultural contexts. All text processing is culturally dependent.

  • edited January 2020

    Thanks for taking the time to post your analysis. When I read these two contrasting examples from the Introduction, I also thought there was an inconsistency about the assertion that “both code snippets do the same thing”, since — of course — the whole point of the discussion is that they don't. But on re-reading, I note @markcmarino was careful to write that the second example “performs the same operation” as the first (Marino 2020, 6). For my money, that's just the right amount of wiggle room to keep the (rhetorically effective) example intact, notwithstanding your analysis.

    What makes your analysis nevertheless so useful is that it illustrates how running the same code over and over again (trivially cheap for such an example) can sometimes debunk the apparent “sameness” — let's call it functional equivalence — of two implementations, which may only become obvious in the limit case of repeated execution.
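    That repeated-execution idea is easy to mechanize. A sketch (function names are mine) that tallies many runs of each implementation over a four-character input; the exact skew of the comparator version depends on the engine's sort, so only the first tally is predictable:

```javascript
// The two implementations from the chapter, reproduced for sampling.
function anagram1(text) {
  var a = text.split("");
  for (var i = 0; i < a.length; i += 1) {
    var letter = a[i];
    var j = Math.floor(Math.random() * a.length);
    a[i] = a[j];
    a[j] = letter;
  }
  return a.join("");
}

function anagram2(text) {
  return text.split("").sort(function () { return 0.5 - Math.random(); }).join("");
}

// Run an implementation repeatedly and count each distinct output.
function sampleOutputs(shuffle, text, runs) {
  var counts = {};
  for (var r = 0; r < runs; r += 1) {
    var out = shuffle(text);
    counts[out] = (counts[out] || 0) + 1;
  }
  return counts;
}

// "abcd" has 4! = 24 permutations; comparing the tallies side by side
// makes the two distributions - and their difference - visible.
console.log(sampleOutputs(anagram1, "abcd", 20000));
console.log(sampleOutputs(anagram2, "abcd", 20000));
```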

    Also, I really liked your observation that certain inputs (e.g. 'aaaa') can reveal deficiencies in the problem statement of the interview question. Adversarial inputs — e.g. “naughty strings” — are a useful tool for software testers. In the spirit of joint enterprise, CCS should avail itself of as many of these ideas as possible. Pedantry... with purpose!

  • @jang -- Re: your comment

    random.shuffle(s)   # A uniform shuffle; the implementation is *very* nearly the same as approach #1
    

    ...where is that implementation of random.shuffle()?

    Also, I wasn't too familiar with ß and haven't thought about it in terms of anagrams. Is the idea here that valid anagrams of Gauß might include Gaßu or Gussa OR Gasus?

    Interesting point on i18n, unicode, and the cultural specificity of anagrams. It also points to the cultural context of the questioning -- interviewees might perhaps expect that they were being asked a question about performing unit operations on character sequences, rather than creating outputs that are semantically or alphabetically valid in a particular language. It also points to the metagame of the interview, as @rtoal suggests: when can (and should) an interviewee or a working programmer ask clarifying questions about the domain of application, assumptions, and goals for the code they produce, even (or especially) if it is viewed as a mere utility.

    Why desire a uniform distribution?

    There are lots of reasons, but for certain kinds of problems -- in areas like cryptography, compression, search, or the financial or gambling industries -- it might be extremely important to understand distribution bias and efficiently mitigate it where needed (it might also be helpful to understand PRNGs in general when working on such problems). For example, knowing outlier values that come up more or less often could provide the crucial element that a gambler needs to "beat the house" or a codebreaker to "crack the code."

  • edited January 2020

    It's worth noting as well that while usually one might prefer more concise code, giving a sort algorithm an inconsistent comparator violates the contract of most sort APIs and for me would be a red flag in an interview.

    As a bit of a rabbit-hole, inconsistent comparison functions can increase the likelihood of worst-case runtime performance. In this particular case it seems that Mozilla prefers mergesort (which has an optimal time bound, as @jang assumed) but Chromium uses quicksort, which has a non-optimal worst-case complexity -- although this worst case is triggered by unequal-sized partitions, which this particular function is not biased to trigger.

    In any case, the first result has linear runtime, and at the heart of most interviews I've been in is establishing a basic understanding of the performance of code, and a preference for more efficient code absent a good reason to do otherwise (such as distributional bias).

    While the performance of the anagram function itself may not be critical to the success of the hypothetical system using it, job interviews, and educational assignments, generally use the meaningless as a proxy for the meaningful and expect one to treat tedium with reverence (which is perhaps a cheeky summary of computing as a whole, echoing @eamonnbell's 'pedantry with purpose').

  • edited January 2020

    @eamonnbell -- thanks for sharing the Big List of Naughty Strings. If you look at the .json file of the strings, you indeed find a big list -- presented without comment. But if you look at the bins.txt file, it is broken into labeled categories of input types, sometimes with short descriptions.

    The categories of "naughty strings" are interesting -- some are specific to certain languages or platforms or even applications (like IOS iMessage), some are cultural practices (like emoticons), some are forms of hacking / known vulnerabilities (like SQL injection), and some are second order effects, like innocuous strings which may be blocked by profanity filters (the Scunthorpe problem).

    The document is also quite playful. One category, "Human injection: Strings which may cause human to reinterpret worldview" simply contains:

    If you're reading this, you have been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you.

    Even beyond the complexity of it all, it is impossible for me to browse through these categories without being constantly made aware how deeply culturally embedded and situated text processing is -- and in how many different ways.

    The full list of categories:

    • Reserved Strings : Strings which may be used elsewhere in code
    • Numeric Strings : Strings which can be interpreted as numeric
    • Special Characters : ASCII punctuation. All of these characters may need to be escaped in some contexts. Divided into three groups based on (US-layout) keyboard position.
    • Non-whitespace C0 controls
    • Non-whitespace C1 controls
    • Whitespace
    • Unicode additional control characters
    • Byte order marks
    • Unicode Symbols : Strings which contain common unicode symbols (e.g. smart quotes)
    • Unicode Subscript/Superscript/Accents : Strings which contain unicode subscripts/superscripts; can cause rendering issues
    • Quotation Marks : Strings which contain misplaced quotation marks; can cause encoding errors
    • Two-Byte Characters : Strings which contain two-byte characters: can cause rendering issues or character-length issues
    • Two-Byte Letters : Strings which contain two-byte letters: can cause issues with naïve UTF-16 capitalizers which think that 16 bits == 1 character
    • Special Unicode Characters Union : A super string recommended by VMware Inc. Globalization Team: can effectively cause rendering issues or character-length issues to validate product globalization readiness.
    • Changing length when lowercased : Characters which increase in length (2 to 3 bytes) when lowercased
    • Japanese Emoticons : Strings which consists of Japanese-style emoticons which are popular on the web
    • Emoji : Strings which contain Emoji; should be the same behavior as two-byte characters, but not always
    • Regional Indicator Symbols : Regional Indicator Symbols can be displayed differently across fonts, and have a number of special behaviors
    • Unicode Numbers : Strings which contain unicode numbers; if the code is localized, it should see the input as numeric
    • Right-To-Left Strings : Strings which contain text that should be rendered RTL if possible (e.g. Arabic, Hebrew)
    • Ogham Text : The only unicode alphabet to use a space which isn't empty but should still act like a space.
    • Trick Unicode : Strings which contain unicode with unusual properties (e.g. Right-to-left override)
    • Zalgo Text : Strings which contain "corrupted" text. The corruption will not appear in non-HTML text, however.
    • Unicode Upsidedown : Strings which contain unicode with an "upsidedown" effect
    • Unicode font : Strings which contain bold/italic/etc. versions of normal characters
    • Script Injection : Strings which attempt to invoke a benign script injection; shows vulnerability to XSS
    • SQL Injection : Strings which can cause a SQL injection if inputs are not sanitized
    • Server Code Injection : Strings which can cause user to run code on server as a privileged user
    • Command Injection (Ruby) : Strings which can call system commands within Ruby/Rails applications
    • XXE Injection (XML) : String which can reveal system files when parsed by a badly configured XML parser
    • Unwanted Interpolation : Strings which can be accidentally expanded into different strings if evaluated in the wrong context, e.g. used as a printf format string or via Perl or shell eval. Might expose sensitive data from the program doing the interpolation, or might just represent the wrong string.
    • File Inclusion : Strings which can cause user to pull in files that should not be a part of a web server
    • Known CVEs and Vulnerabilities : Strings that test for known vulnerabilities
    • MSDOS/Windows Special Filenames : Strings which are reserved characters in MSDOS/Windows
    • IRC specific strings : Strings that may occur on IRC clients that make security products freak out
    • Scunthorpe Problem : Innocuous strings which may be blocked by profanity filters
    • Human injection : Strings which may cause human to reinterpret worldview
    • Terminal escape codes : Strings which punish the fools who use cat/type on this file
    • iOS Vulnerabilities : Strings which crashed iMessage in various versions of iOS
    • Persian special characters : This is a four characters string which includes Persian special characters
  • edited January 2020

    (On the question of Python's random.shuffle:)

    You'll find it here

    There's a quick mention of its rationale here.

    Something like this ought to suffice also:

        // Approach 1, modified slightly
        function anagram(text) {
          var a = text.split("");
          for (var i = 0; i < a.length; i += 1) {
            var letter = a[i];
            var j = Math.floor(Math.random() * (i + 1)); /* here */
            a[i] = a[j];
            a[j] = letter;
          }
          return a.join("");
        }
    

    or in Python:

    import random

    def explicit_uniform(s):
        s = list(s)
        for i in range(len(s)):
            j = random.randrange(i + 1)  # That's selected from [0..i], inclusive
            s[i], s[j] = s[j], s[i]
        return "".join(s)
    
  • @jang s[i], s[j] = s[j], s[i] is a fun way to do a swap.

    @rtoal an equivalent result to using the Fisher-Yates shuffle to get a uniform distribution would be the use of Factoradics (Factorial Number System), which were also described by Knuth, though originating elsewhere.

    The main difference is that we randomly select a permutation up front in the range of [0, n! - 1]. The mixed-radix representation of the Factoradic then allows us to map (or 'unrank' as it's sometimes termed, at least with the similar idea of Combinadic numbers) this permutation ID to a specific permutation of the input. Runtime is also linear.

    This approach could be preferable to minimize the amount of entropy consumed from the random source, or if we want a compact representation of anagrams to then store in bitmaps or dictionaries.
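    A sketch of that unranking idea (all names are mine; assumes n is small enough that n! fits exactly in a double, i.e. n ≤ 18):

```javascript
function factorial(n) {
  var f = 1;
  for (var k = 2; k <= n; k += 1) f *= k;
  return f;
}

// Map a permutation ID in [0, n!-1] to a concrete reordering by reading
// the ID as factoradic (mixed-radix) digits: the i-th digit selects
// which of the remaining characters comes next. Each ID yields a
// distinct permutation, so a uniform ID gives a uniform anagram.
function unrankAnagram(text, rank) {
  var pool = text.split("");
  var out = [];
  for (var i = pool.length - 1; i >= 0; i -= 1) {
    var f = factorial(i);
    var digit = Math.floor(rank / f);    // in [0, i]
    rank = rank % f;
    out.push(pool.splice(digit, 1)[0]);  // take the digit-th remaining char
  }
  return out.join("");
}

// Draw the permutation ID up front, then unrank - one random draw in
// total, versus one per element for Fisher-Yates.
function anagramByUnranking(text) {
  var rank = Math.floor(Math.random() * factorial(text.length));
  return unrankAnagram(text, rank);
}
```

    For example, the six ranks 0..5 enumerate all six orderings of "abc" exactly once, which is what makes the rank usable as a compact stored representation of an anagram.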
