blog.robur.coop/articles/2024-02-03-python-str-repr.html

<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="x-ua-compatible" content="ie=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>
Robur's blog - Python&apos;s `str.__repr__()`
</title>
<meta name="description" content="Reimplementing Python string escaping in OCaml">
<link type="text/css" rel="stylesheet" href="../css/hl.css">
<link type="text/css" rel="stylesheet" href="../css/style.css">
<script src="../js/hl.js"></script>
<link rel="alternate" type="application/rss+xml" href="../feed.xml" title="blog.robur.coop">
</head>
<body>
<header>
<h1>blog.robur.coop</h1>
<blockquote>
The <strong>Robur</strong> cooperative blog.
</blockquote>
</header>
<main><a href="/index.html">Back to index</a>
<article>
<h1>Python&apos;s `str.__repr__()`</h1>
<ul class="tags-list"><li><a href="/tags/ocaml.html">ocaml</a></li><li><a href="/tags/python.html">python</a></li><li><a href="/tags/unicode.html">unicode</a></li></ul><p>Sometimes software is written using whatever built-ins you find in your programming language of choice.
This is usually great!
However, it can happen that you depend on the precise semantics of those built-ins.
This can be a problem if those semantics become important to your software and you need to port it to another programming language.
This story is about Python and its <code>str.__repr__()</code> function.</p>
<p>The piece of software I was helping port to <a href="https://ocaml.org/">OCaml</a> was constructing a hash from the string representation of a tuple.
The gist of it was basically this:</p>
<pre><code class="language-python">def get_id(x):
    id = (x.get_unique_string(), x.path, x.name)
    return myhash(str(id))
</code></pre>
<p>In other words it's a Python tuple consisting of mostly strings but also a <code>PosixPath</code> object.
The way <code>str()</code> works is it calls the <code>__str__()</code> method on the argument objects (or otherwise <code>repr(x)</code>).
For Python tuples the <code>__str__()</code> method seems to print the result of <code>repr()</code> on each element, separated by a comma and a space and surrounded by parentheses.
So far so good.
If we can precisely emulate <code>repr()</code> on strings and <code>PosixPath</code> it's easy.
In the case of <code>PosixPath</code> it's really just <code>'PosixPath('+repr(str(path))+')'</code>;
so in that case it's down to <code>repr()</code> on strings - which is <code>str.__repr__()</code>.</p>
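<p>As a rough check of the two observations above, here is a minimal Python sketch. The helper names <code>tuple_str</code> and <code>posix_path_repr</code> are made up for illustration, and the example assumes a POSIX system where <code>PosixPath</code> can be instantiated.</p>
<pre><code class="language-python">from pathlib import PosixPath


def tuple_str(items):
    # repr() of each element, joined by ', ' and wrapped in parentheses.
    # A one-element tuple has a trailing comma, which this sketch ignores.
    return '(' + ', '.join(repr(item) for item in items) + ')'


def posix_path_repr(path):
    # The formula from the text: PosixPath(...) around repr() of the string.
    return 'PosixPath(' + repr(str(path)) + ')'


example = ('some-id', PosixPath('/tmp/some file'), 'name')
assert tuple_str(example) == str(example)
assert posix_path_repr(example[1]) == repr(example[1])
</code></pre>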
<p>There had been a previous attempt at this that would use OCaml's string escape functions and surround the string with single quotes (<code>'</code>).
This works for some cases, but not if the string has a double quote (<code>&quot;</code>).
In that case OCaml would escape the double quote with a backslash (<code>\&quot;</code>) while Python would not escape it.
So a regular expression substitution was added to replace the escape sequence with just a double quote.
This pattern of finding small differences between Python and OCaml escaping had been repeated,
and eventually I decided to take a more rigorous approach to it.</p>
<h2>What is a string?</h2>
<p>First of all, what is a string? In Python? And in OCaml?
In OCaml a string is just a sequence of bytes.
Any bytes, even <code>NUL</code> bytes.
There is no concept of Unicode in OCaml strings.<br>
In Python there is the <code>str</code> type which is a sequence of Unicode code points<sup><a href="#fn-python-bytes" id="ref-1-fn-python-bytes" role="doc-noteref" class="fn-label">[1]</a></sup>.
I can recommend reading Daniel Bünzli's <a href="https://ocaml.org/p/uucp/13.0.0/doc/unicode.html#minimal">minimal introduction to Unicode</a>.
Already here there is a significant gap in semantics between Python and OCaml.
For many practical purposes we can get away with using the OCaml <code>string</code> type and treating it as a UTF-8 encoded Unicode string.
This is what I will do, since in both the Python code and the OCaml code the data being read is a UTF-8 encoded string (often only the US ASCII subset).</p>
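<p>To make the gap concrete, here is a small Python sketch using U+3BB as the example. The encoded <code>bytes</code> value stands in for what an OCaml <code>string</code> would hold; the variable names are only for illustration.</p>
<pre><code class="language-python">lam = '\u03bb'                  # one Unicode code point, U+3BB
encoded = lam.encode('utf-8')   # the bytes an OCaml string would hold

assert len(lam) == 1            # Python's str counts code points
assert len(encoded) == 2        # a byte sequence counts bytes
assert encoded == b'\xce\xbb'
assert encoded.decode('utf-8') == lam
</code></pre>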
<h2>What does a string literal look like?</h2>
<h3>OCaml</h3>
<p>I will not dive too deep into the details of OCaml string literals, and focus mostly on how they are escaped by the language built-ins (<code>String.escaped</code>, <code>Printf.printf &quot;%S&quot;</code>).
Normal printable ASCII is printed as-is.
That is, letters, numbers and other symbols except for backslash and double quote.
There are the usual escape sequences <code>\n</code>, <code>\t</code>, <code>\r</code>, <code>\&quot;</code> and <code>\\</code>.
Any byte value can be represented with decimal notation <code>\032</code> or octal notation <code>\o040</code> or hexadecimal notation <code>\x20</code>.
The escape functions in OCaml have a preference for the decimal notation over the hexadecimal notation.
Finally I also want to mention the Unicode code point escape sequence <code>\u{3bb}</code> which represents the UTF-8 encoding of U+3BB.
While the escape functions do not use it, it will become handy later on.
Illegal escape sequences (escape sequences that are not recognized) will emit a warning but otherwise result in the escape sequence as-is.
It is common to compile OCaml programs with warnings-as-errors, however.</p>
<h3>Python</h3>
<p>Python has a number of different string literals and string-like literals.
They all use single quote or double quote to delimit the string (or string-like) literals.
There is a preference towards single quotes in <code>str.__repr__()</code>.
You can also triple the quotes if you like to write a string that uses a lot of both quote characters.
This format is not used by <code>str.__repr__()</code> so I will not cover it further, but you can read about it in the <a href="https://docs.python.org/3/reference/lexical_analysis.html#strings">Python reference manual</a>.
The string literal can optionally have a prefix character that modifies what type the string literal is and how its content is interpreted.</p>
<p>The <code>r</code>-prefixed strings are called <em>raw strings</em>.
That means backslash escape sequences are not interpreted.
In my experiments they seem to be quasi-interpreted, however!
The string <code>r&quot;\&quot;</code> is considered unterminated!
But <code>r&quot;\&quot;&quot;</code> is fine and is interpreted as <code>'\\&quot;'</code><sup><a href="#fn-raw-escape-example" id="ref-1-fn-raw-escape-example" role="doc-noteref" class="fn-label">[2]</a></sup>.
Why this is the case I have not found a good explanation for.</p>
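<p>The same quasi-interpretation can be observed with single quotes. A few assertions as a sketch of the behaviour described above (<code>chr(39)</code> is the single quote character):</p>
<pre><code class="language-python">raw = r'\''                     # the backslash keeps the quote from ending the literal
assert len(raw) == 2            # yet the backslash stays in the string
assert raw == '\\' + chr(39)

assert r'\n' == '\\n'           # no newline: a backslash and the letter n
assert len(r'\n') == 2
</code></pre>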
<p>The <code>b</code>-prefixed strings are <code>bytes</code> literals.
This is close to OCaml strings.</p>
<p>Finally there are the unprefixed strings which are <code>str</code> literals.
These are the ones we are most interested in.
They use the usual escapes <code>\[ntr&quot;]</code> we know from OCaml as well as <code>\'</code>.
<code>\032</code> is <strong>octal</strong> notation and <code>\x20</code> is hexadecimal notation.
There is as far as I know <strong>no</strong> decimal notation.
The output of <code>str.__repr__()</code> uses the hexadecimal notation over the octal notation.
As Python strings are Unicode code point sequences we need more than two hexadecimal digits to be able to represent all valid &quot;characters&quot;.
Thus there are the longer <code>\u0032</code> and the longest <code>\U00000032</code>.</p>
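<p>A short sketch of how these notations relate, written as assertions that should hold in any Python 3 session:</p>
<pre><code class="language-python"># Octal and hexadecimal escapes both denote code points.
assert '\040' == '\x20' == ' '
assert '\062' == '\x32' == '\u0032' == '\U00000032' == '2'

# '\032' is octal for code point 26, a control character ...
assert ord('\032') == 26
# ... and repr() writes it back in hexadecimal notation.
assert repr('\032') == chr(39) + '\\x1a' + chr(39)
</code></pre>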
<h2>Intermezzo</h2>
<p>While studying Python string literals I discovered several odd corners of the syntax and semantics besides the raw string quasi-escape sequence mentioned earlier.
One fact is that Python doesn't have a separate character or Unicode code point type.
Instead, a character is a one element string.
This leads to some interesting indexing shenanigans: <code>&quot;a&quot;[0][0][0] == &quot;a&quot;</code>.
Furthermore, strings separated by spaces only are treated as one single concatenated string: <code>&quot;a&quot; &quot;b&quot; &quot;c&quot; == &quot;abc&quot;</code>.
These two combined make it possible to write this unusual snippet: <code>&quot;a&quot; &quot;b&quot; &quot;c&quot;[0] == &quot;a&quot;</code>!
For byte sequences, or <code>b</code>-prefixed strings, things are different.
Indexing a bytes object returns the integer value of that byte (or character):</p>
<pre><code class="language-python">&gt;&gt;&gt; b&quot;a&quot;[0]
97
&gt;&gt;&gt; b&quot;a&quot;[0][0]
&lt;stdin&gt;:1: SyntaxWarning: 'int' object is not subscriptable; perhaps you missed a comma?
Traceback (most recent call last):
  File &quot;&lt;stdin&gt;&quot;, line 1, in &lt;module&gt;
TypeError: 'int' object is not subscriptable
</code></pre>
<p>For strings <code>&quot;\x32&quot;</code> can be said to be shorthand for <code>&quot;\u0032&quot;</code> (or <code>&quot;\U00000032&quot;</code>).
But for bytes <code>b&quot;\x32&quot; != b&quot;\u0032&quot;</code>!
Why is this?!
Well, bytes is a byte sequence and <code>b&quot;\u0032&quot;</code> is not interpreted as an escape sequence and is instead <strong>silently</strong> treated as <code>b&quot;\\u0032&quot;</code>!
Writing <code>&quot;\xff&quot;.encode()</code> which encodes the string <code>&quot;\xff&quot;</code> to UTF-8 is <strong>not</strong> the same as <code>b&quot;\xff&quot;</code>.
The bytes <code>b&quot;\xff&quot;</code> consist of a single byte with decimal value 255,
and the Unicode wizards reading will know that the Unicode code point 255 (or U+FF) is encoded in two bytes in UTF-8.</p>
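<p>The difference between the two can be spelled out with a few assertions. This is only a sketch of the behaviour described above:</p>
<pre><code class="language-python">text = '\xff'                                # one code point, U+FF
assert text.encode('utf-8') == b'\xc3\xbf'   # two bytes in UTF-8
assert text.encode('latin-1') == b'\xff'     # one byte in Latin-1
assert text.encode() != b'\xff'              # encode() defaults to UTF-8

assert 'a'[0][0][0] == 'a'                   # indexing a str gives a str
assert b'a'[0] == 97                         # indexing bytes gives an int
</code></pre>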
<h2>Where is the Python code?</h2>
<p>Finding the implementation of <code>str.__repr__()</code> turned out to not be so easy.
In the end I asked on the Internet and got a link to <a href="https://github.com/python/cpython/blob/963904335e579bfe39101adf3fd6a0cf705975ff/Objects/unicodeobject.c#L12245-L12405">cpython's <code>Objects/unicodeobject.c</code></a>.
And holy cow!
That's some 160 lines of C code with two loops, a switch statement and I don't know how many chained and nested if statements!
Meanwhile the OCaml implementation is a much less daunting 52 lines of which about a fifth is a long comment.
It also has two loops which each contain one much more tame match expression (roughly a C switch statement).
In both cases they first loop over the string to compute the size of the output string.
The Python implementation also counts the number of double quotes and single quotes as well as the highest code point value.
The latter I'm not sure why they do, but my guess is that it lets them choose an efficient internal representation.
Then the Python code decides what quote character to use with the following algorithm:<br>
Does the string contain single quotes but no double quotes? Then use double quotes. Otherwise use single quotes.
Then the output size estimate is adjusted with the number of backslashes to escape the quote character chosen and the two quotes surrounding the string.</p>
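<p>The quote selection can be sketched in a few lines of Python and checked against <code>repr()</code>. The function name <code>choose_quote</code> is made up, and the quote characters are built with <code>chr()</code> only to keep the example easy to read.</p>
<pre><code class="language-python">SINGLE, DOUBLE = chr(39), chr(34)   # the two quote characters


def choose_quote(s):
    # Double quotes only when that avoids escaping single quotes.
    if SINGLE in s and DOUBLE not in s:
        return DOUBLE
    return SINGLE


samples = [
    'plain',
    'it' + SINGLE + 's',                # single quote only
    'say ' + DOUBLE + 'hi' + DOUBLE,    # double quotes only
    SINGLE + DOUBLE,                    # both kinds
]
for sample in samples:
    assert repr(sample)[0] == choose_quote(sample)
</code></pre>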
<p>Already here it's clear that a regular expression substitution is not enough by itself to fix OCaml escaping to be Python escaping.
My first step then was to implement the algorithm only for US ASCII.
This is simpler as we don't have to worry much about Unicode, and I could implement it relatively quickly.
The first 32 characters and the last US ASCII character (DEL or <code>\x7f</code>) are considered non-printable and must be escaped.
I then wrote some simple tests by hand.
Then I discovered the OCaml <a href="https://github.com/zshipko/ocaml-py">py</a> library which provides bindings to Python from OCaml.
Great! This I can use to test my implementation against Python!</p>
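<p>The actual implementation is in OCaml. As an illustration of the US ASCII rules only, here is a hedged Python sketch that escapes the body of a string for a given quote character and compares the result with Python's own output. The name <code>escape_ascii_body</code> is hypothetical.</p>
<pre><code class="language-python">def escape_ascii_body(s, quote):
    # Escape an ASCII-only string; the caller adds the surrounding quotes.
    named = {'\n': '\\n', '\t': '\\t', '\r': '\\r'}
    out = []
    for ch in s:
        if ch == quote or ch == '\\':
            out.append('\\' + ch)
        elif ch in named:
            out.append(named[ch])
        elif ord(ch) in range(32) or ord(ch) == 127:
            # Remaining control characters and DEL use two hex digits.
            out.append('\\x%02x' % ord(ch))
        else:
            out.append(ch)
    return ''.join(out)


# Check every US ASCII character, using the quote Python itself chose.
for code in range(128):
    expected = repr(chr(code))
    assert escape_ascii_body(chr(code), expected[0]) == expected[1:-1]
</code></pre>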
<h2>How about Unicode?</h2>
<p>Non-ASCII characters (or rather code points) are either considered <em>printable</em> or <em>non-printable</em>.
For now let's look at what that means for the output.
A printable character is copied as-is.
That is, there is no escaping done.
Non-printable characters must be escaped, and Python will use <code>\xHH</code>, <code>\uHHHH</code> or <code>\UHHHHHHHH</code> depending on how many hexadecimal digits are necessary to represent the code point.
That is, the non-ASCII part of Latin-1 (<code>0x80</code>-<code>0xff</code>) can be represented using <code>\xHH</code>, so neither <code>\u00HH</code> nor <code>\U000000HH</code> will be used, and so on.</p>
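<p>A minimal Python sketch of that choice, checked against <code>repr()</code> on three non-printable code points of different sizes. The name <code>hex_escape</code> is made up for this example.</p>
<pre><code class="language-python">def hex_escape(ch):
    # Pick the shortest of the three hexadecimal forms that fits.
    code = ord(ch)
    if code in range(0x100):
        return '\\x%02x' % code
    if code in range(0x10000):
        return '\\u%04x' % code
    return '\\U%08x' % code


# U+0085 NEXT LINE, U+200B ZERO WIDTH SPACE, U+E0001 LANGUAGE TAG
for ch in ['\x85', '\u200b', '\U000e0001']:
    assert repr(ch)[1:-1] == hex_escape(ch)

# A printable non-ASCII character is copied as-is.
assert repr('\xe9')[1:-1] == '\xe9'
</code></pre>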
<h3>What is a printable Unicode character?</h3>
<p>In the cpython <a href="https://github.com/python/cpython/blob/963904335e579bfe39101adf3fd6a0cf705975ff/Objects/unicodeobject.c#L12245-L12405">function</a> mentioned earlier they use the function <code>Py_UNICODE_ISPRINTABLE</code>.
I had a local clone of the cpython git repository where I ran <code>git grep Py_UNICODE_ISPRINTABLE</code> to find information about it.
In <a href="https://github.com/python/cpython/blob/963904335e579bfe39101adf3fd6a0cf705975ff/Doc/c-api/unicode.rst?plain=1#L257-L265">unicode.rst</a> I found a documentation string for the function that describes it to return false if the character is nonprintable with the definition of nonprintable as the code point being in the categories &quot;Other&quot; or &quot;Separator&quot; in the Unicode character database <strong>with the exception of ASCII space</strong> (U+20 or <code> </code>).</p>
<p>What are those &quot;Other&quot; and &quot;Separator&quot; categories?
Further searching for the function definition we find in <a href="https://github.com/python/cpython/blob/963904335e579bfe39101adf3fd6a0cf705975ff/Include/cpython/unicodeobject.h#L683"><code>Include/cpython/unicodeobject.h</code></a> the definition.
Well, we find <code>#define Py_UNICODE_ISPRINTABLE(ch) _PyUnicode_IsPrintable(ch)</code>.
On to <code>git grep _PyUnicode_IsPrintable</code> then.
That function is defined in <a href="https://github.com/python/cpython/blob/963904335e579bfe39101adf3fd6a0cf705975ff/Objects/unicodectype.c#L158-L163"><code>Objects/unicodectype.c</code></a>.</p>
<pre><code class="language-C">/* Returns 1 for Unicode characters to be hex-escaped when repr()ed,
   0 otherwise.
   All characters except those characters defined in the Unicode character
   database as following categories are considered printable.
      * Cc (Other, Control)
      * Cf (Other, Format)
      * Cs (Other, Surrogate)
      * Co (Other, Private Use)
      * Cn (Other, Not Assigned)
      * Zl Separator, Line ('\u2028', LINE SEPARATOR)
      * Zp Separator, Paragraph ('\u2029', PARAGRAPH SEPARATOR)
      * Zs (Separator, Space) other than ASCII space('\x20').
*/
int _PyUnicode_IsPrintable(Py_UCS4 ch)
{
    const _PyUnicode_TypeRecord *ctype = gettyperecord(ch);
    return (ctype-&gt;flags &amp; PRINTABLE_MASK) != 0;
}
</code></pre>
<p>Ok, now we're getting close to something.
Searching for <code>PRINTABLE_MASK</code> we find in <a href="https://github.com/python/cpython/blob/963904335e579bfe39101adf3fd6a0cf705975ff/Tools/unicode/makeunicodedata.py#L450-L451"><code>Tools/unicode/makeunicodedata.py</code></a> the following line of code:</p>
<pre><code class="language-Python">if char == ord(&quot; &quot;) or category[0] not in (&quot;C&quot;, &quot;Z&quot;):
    flags |= PRINTABLE_MASK
</code></pre>
<p>So a character is printable if it is the ASCII space character or if its Unicode general category doesn't start with <code>C</code> or <code>Z</code>.
This can be implemented in OCaml using the uucp library as follows:</p>
<pre><code class="language-OCaml">let py_unicode_isprintable uchar =
  (* {[if char == ord(&quot; &quot;) or category[0] not in (&quot;C&quot;, &quot;Z&quot;):
         flags |= PRINTABLE_MASK]} *)
  Uchar.equal uchar (Uchar.of_char ' ')
  ||
  let gc = Uucp.Gc.general_category uchar in
  (* Not those categories starting with 'C' or 'Z' *)
  match gc with
  | `Cc | `Cf | `Cn | `Co | `Cs | `Zl | `Zp | `Zs -&gt; false
  | `Ll | `Lm | `Lo | `Lt | `Lu | `Mc | `Me | `Mn | `Nd | `Nl | `No | `Pc | `Pd
  | `Pe | `Pf | `Pi | `Po | `Ps | `Sc | `Sk | `Sm | `So -&gt;
      true
</code></pre>
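<p>For comparison, the same rule can be written in Python with the standard <code>unicodedata</code> module and checked against the built-in <code>str.isprintable()</code>. This is a sketch that assumes both are generated from the same Unicode database, which as far as I can tell is the case in CPython; the name <code>is_printable</code> is made up.</p>
<pre><code class="language-python">import unicodedata


def is_printable(ch):
    # The rule from makeunicodedata.py, applied at run time.
    return ch == ' ' or unicodedata.category(ch)[0] not in ('C', 'Z')


# Compare with the built-in for every code point.
for code in range(0x110000):
    ch = chr(code)
    assert is_printable(ch) == ch.isprintable()
</code></pre>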
<p>After implementing Unicode support I expanded the tests to generate arbitrary OCaml strings and compare the results of calling my function and Python's <code>str.__repr__()</code> on the string.
Well, that didn't go so well.
OCaml strings are just any byte sequence, and ocaml-py expects it to be a UTF-8 encoded string and fails on invalid UTF-8.
Then in qcheck you can &quot;assume&quot; a predicate which means if a predicate doesn't hold on the generated value then the test is skipped for that input.
So I implement a simple verification of UTF-8.
This is far from optimal because qcheck will generate a lot of invalid UTF-8 strings.</p>
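<p>The check itself was written in OCaml. The idea of such a predicate can be sketched in Python by leaning on the strict built-in decoder; the name <code>is_valid_utf8</code> is made up for this example.</p>
<pre><code class="language-python">def is_valid_utf8(data):
    # Let the strict UTF-8 decoder decide.
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


assert is_valid_utf8(b'\xce\xbb')          # U+3BB
assert not is_valid_utf8(b'\xff')          # never valid in UTF-8
assert not is_valid_utf8(b'\xed\xa0\x80')  # an encoded surrogate
assert not is_valid_utf8(b'\xce')          # truncated sequence
</code></pre>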
<p>The next test failure is some unassigned code point.
So I add to <code>py_unicode_isprintable</code> a check that the code point is assigned using <code>Uucp.Age.age uchar &lt;&gt; `Unassigned</code>.</p>
<p>Still, qcheck found a case I hadn't considered: U+61D.
My Python version (Python 3.9.2 (default, Feb 28 2021, 17:03:44)) renders this as <code>'\u061d'</code> while my OCaml function prints it as-is.
In other words my implementation considers it printable while Python does not.
I try to enter this Unicode character in my terminal, but nothing shows up.
Then I look it up and its name is <code>ARABIC END OF TEXT MARKER</code>.
The general category according to uucp is <code>`Po</code>.
So this <strong>should</strong> be a printable character‽</p>
<p>After being stumped by this for a while I get the suspicion it may be dependent on the Python version.
I am still on Debian 11 and my Python version is far from being the latest and greatest.
I ask someone with a newer Python version to write <code>'\u061d'</code> in a Python session.
And lo! It prints something that looks like <code>''</code>!
Online I figure out how to get the Unicode version compiled into Python:</p>
<pre><code class="language-Python">&gt;&gt;&gt; import unicodedata
&gt;&gt;&gt; unicodedata.unidata_version
'13.0.0'
</code></pre>
<p>Aha! And with uucp we find that U+61D was introduced in Unicode 14.0:</p>
<pre><code class="language-OCaml"># Uucp.Age.age (Uchar.of_int 0x61D);;
- : Uucp.Age.t = `Version (14, 0)
</code></pre>
<p>My reaction is this is seriously some ungodly mess we are in.
Not only is the code that instigated this journey highly dependent on Python-specifics - it's also dependent on the specific version of Unicode and thus the version of Python!</p>
<p>I modify our <code>py_unicode_isprintable</code> function to take an optional <code>?unicode_version</code> argument and replace the &quot;is this unassigned?&quot; check with the following snippet:</p>
<pre><code class="language-OCaml">let age = Uucp.Age.age uchar in
(match (age, unicode_version) with
 | `Unassigned, _ -&gt; false
 | `Version _, None -&gt; true
 | `Version (major, minor), Some (major', minor') -&gt;
     major &lt; major' || (major = major' &amp;&amp; minor &lt;= minor'))
</code></pre>
<p>Great! I modify the test suite to first detect the Unicode version Python uses and then pass that version to the OCaml function.
Now I can't find any more failing test cases!</p>
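<p>The version comparison and the detection step can be sketched in Python as follows. Both function names are hypothetical, a code point's age is modelled as a <code>(major, minor)</code> tuple or <code>None</code> for unassigned, and tuples compare lexicographically just like the OCaml condition.</p>
<pre><code class="language-python">import unicodedata


def python_unicode_version():
    # unidata_version looks like '13.0.0'; keep major and minor only.
    major, minor = unicodedata.unidata_version.split('.')[:2]
    return (int(major), int(minor))


def known_in(age, unicode_version=None):
    if age is None:                # unassigned code point
        return False
    if unicode_version is None:    # no version constraint given
        return True
    # True when age is not newer than unicode_version.
    return max(age, unicode_version) == unicode_version


# U+61D was introduced in Unicode 14.0.
assert not known_in((14, 0), (13, 0))
assert known_in((14, 0), (14, 0))
assert known_in((14, 0), None)
assert not known_in(None, (15, 0))
# Whether this Python considers U+61D printable depends on its Unicode version.
assert '\u061d'.isprintable() == known_in((14, 0), python_unicode_version())
</code></pre>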
<h2>Epilogue</h2>
<p>What can we learn from this?
It is easy to say in hindsight that a different representation should have been chosen.
However, arriving at this insight takes time.
The exact behavior of <code>str.__repr__()</code> is poorly documented.
Reaching my understanding of <code>str.__repr__()</code> took hours of research and reading the C implementation.
It often doesn't seem to be worth it to spend so much time on research for a small function.
Technical debt is a real thing and often hard to predict.
Below is the output of <code>help(str.__repr__)</code>:</p>
<pre><code class="language-Python">__repr__(self, /)
    Return repr(self)
</code></pre>
<p>Language and (standard) library designers could consider whether the slightly nicer looking strings are worth the added complexity users eventually are going to rely on - inadvertently or not.
I do think strings and bytes in Python are a bit too complex.
It is not easy to get a language lawyer<sup><a href="#fn-language-lawyer" id="ref-1-fn-language-lawyer" role="doc-noteref" class="fn-label">[3]</a></sup> level understanding.
In my opinion it is a mistake to not at least print a warning if there are illegal escape sequences - especially considering there are escape sequences that are valid in one string literal but not another.</p>
<p>Unfortunately it is often the case that to get a precise specification it is necessary to look at the implementation.
For testing your implementation hand-written tests are good.
Testing against the original implementation is great, and if combined with property-based testing or fuzzing you may find failing test cases you couldn't dream up!
I certainly didn't see it coming that the output depends on the Unicode version.
As the saying goes, testing can only show the presence of bugs, but for a function with a fairly limited domain like this one you can get pretty close to showing their absence.</p>
<p>I enjoyed working on this.
Sure, it was frustrating and at times I discovered some ungodly properties, but it's a great feeling to study and understand something at a deeper level.
It may be the last time I need to understand Python's <code>str.__repr__()</code> this well, but if I do I now have the OCaml code and this blog post to reread.</p>
<p>If you are curious to read the resulting code you may find it on GitHub at <a href="https://github.com/reynir/python-str-repr">github.com/reynir/python-str-repr</a>.
I have documented the code to make it more approachable and maintainable by others.
Hopefully it is not something that you need, but in case it is useful to you it is licensed under a permissive license.</p>
<p>If you have a project in OCaml or want to port something to OCaml and would like help from me and my colleagues at <a href="https://robur.coop/">Robur</a> please <a href="https://robur.coop/Contact">get in touch</a> with us and we will figure something out.</p>
<section role="doc-endnotes"><ol>
<li id="fn-python-bytes">
<p>There is as well the <code>bytes</code> type which is a byte sequence like OCaml's <code>string</code>.
The Python code in question is using <code>str</code> however.</p>
<span><a href="#ref-1-fn-python-bytes" role="doc-backlink" class="fn-label">↩︎︎</a></span></li><li id="fn-raw-escape-example">
<p>Note I use single quotes for the output. This is what Python would do. It would be equivalent to <code>&quot;\\\&quot;&quot;</code>.</p>
<span><a href="#ref-1-fn-raw-escape-example" role="doc-backlink" class="fn-label">↩︎︎</a></span></li><li id="fn-language-lawyer">
<p><a href="http://catb.org/jargon/html/L/language-lawyer.html">A person, usually an experienced or senior software engineer, who is intimately familiar with many or most of the numerous restrictions and features (both useful and esoteric) applicable to one or more computer programming languages. A language lawyer is distinguished by the ability to show you the five sentences scattered through a 200-plus-page manual that together imply the answer to your question “if only you had thought to look there”.</a></p>
<span><a href="#ref-1-fn-language-lawyer" role="doc-backlink" class="fn-label">↩︎︎</a></span></li></ol></section>
</article>
</main>
<footer>
<a href="https://github.com/xhtmlboi/yocaml">Powered by <strong>YOCaml</strong></a>
<br />
</footer>
<script>hljs.highlightAll();</script>
</body>
</html>