Adventures in Unicode: Parsing superscript exponents

I’ve always wanted to add support for parsing exponentiation using superscript characters, instead of clunky infix operators like ‘^’ or ‘**’. If I want 2 raised to the tenth power, 2**10 just doesn’t look as good as 2¹⁰. Parsing and evaluating these exponents is simple enough with pyparsing, but first the digits need to be convertible to regular int values.

The regular digits 0 through 9 work nicely enough. Since they are contiguous and ordered in the ASCII sequence, and Python has support for them, we can write int(‘9’) and get the value 9. Here we look at all 10 numeric characters:

digits = "0123456789"
for i in sorted(digits):
    print(repr(i), ord(i), end=' ')
    try:
        print(int(i))
    except ValueError:
        print('-error-')

And we get a friendly predictable listing of characters, ordinal values, and converted int values:

'0' 48 0
'1' 49 1
'2' 50 2
'3' 51 3
'4' 52 4
'5' 53 5
'6' 54 6
'7' 55 7
'8' 56 8
'9' 57 9

But the superscript digits are more difficult. Using the same for loop, but with this changed line:

digits = "⁰¹²³⁴⁵⁶⁷⁸⁹"

Now we get these results:

'²' 178 -error-
'³' 179 -error-
'¹' 185 -error-
'⁰' 8304 -error-
'⁴' 8308 -error-
'⁵' 8309 -error-
'⁶' 8310 -error-
'⁷' 8311 -error-
'⁸' 8312 -error-
'⁹' 8313 -error-

What the heck?! They aren’t even in sorted order! Only the 1, 2, and 3 superscripts are even in the 8-bit 0-255 range, and even they aren’t in order.

So looking ahead to when we eventually parse and gather these exponent strings, we need to roll our own int() method to convert to usable Python ints.

I suggest we build a dict for mapping each superscript to its int value:

superscript_digits = "⁰¹²³⁴⁵⁶⁷⁸⁹"
super_int_map = {digit: value for value, digit in enumerate(superscript_digits)}

Enumerating over the superscript digits in order gives us (int, digit) tuples, which make it easy to convert to a dict using a dict comprehension.

Now we just need a function that will convert a string made up of one or more superscript characters, and give back an int value, for example converting ‘⁷⁸⁹‘ to 789. This code is a typical exercise in beginning programming classes, so should not be much of a surprise.

def super_to_int(superscript_str):
    ret = 0

    # iterate over each digit in the input, get its value (from 0-9)
    # and add it to the current running total, after multiplying
    # the running total by 10
    for chr in superscript_str:
        ret = ret * 10 + super_int_map[chr]

    return ret

With that in our back pocket, we can start writing a parser, knowing that when it comes time to convert superscript exponents to ints, our conversion function is all ready to go.

Pyparsing comes ready-made with a number of helpful classes and objects for building up parsers a piece at a time. So we can begin by writing the expression that will parse an integer subscript using a pyparsing Word class, which indicates that we want to parse a word group composed of one or more characters, in this case the characters are the superscript digits.

exponent = pp.Word(superscript_digits).setName("integer exponent")

The reason for naming this expression becomes clear when we run some simple tests on it, using the runTests() method that is provided on all pyparsing parse expressions:

exponent.runTests("""\
    ¹²³
    ¹⁰
    10
""")

Giving

¹²³
['¹²³']

¹⁰
['¹⁰']

10
^
FAIL: Expected integer exponent (at char 0), (line:1, col:1)

Without the name, we would have gotten a more cryptic-looking message like “Expected W:(⁰¹²³…) (at char 0), (line:1, col:1)”.

We actually will want to parse and evaluate strings like “2¹⁰“, so we also need an expression for a regular integer. Integer expressions are so common, pyparsing provides a standard expression in its pyparsing_common namespace class, so we can just use that.

integer = pp.pyparsing_common.integer
integer.runTests("""\
   2
   42
   4294967296
""")

Giving

2
[2]

42
[42]

4294967296
[4294967296]

There is a subtle difference from the previous tests – these values have already been converted to ints! There are no quotation marks around the parsed values, as you see above when we tested out our exponent expression. So when we want to perform the final exponentiation operation, the base number will already be in int form.

pyparsing does this using a parse-time callback, called a parse action. Each expression in the parser can have one or more parse actions attached to it, for the purposes of conversion, validation, data structuring, or whatever. We will use this same feature to convert our superscript exponents to ints.

Pyparsing is pretty flexible in letting you define the arguments passed to parse actions. The most common is to just pass the list of parsed tokens. For this particular expression the list is always just 1 element long, since the expression we are acting on parses just a single word of superscript digits.

def convert_parsed_exponent(tokens):
    return super_to_decimal(tokens[0])

exponent.addParseAction(convert_parsed_exponent)

Rerunning our previous tests now gives:

¹²³
[123]

¹⁰
[10]

10
^
FAIL: Expected integer exponent (at char 0), (line:1, col:1)

Now the first two parse correctly, but we are still stumbling over plain old “10”.

To handle our expressions like “2¹⁰“, we define another expression, this time combining the expressions we already have. We’ll allow for our parser to handle integers with or without exponents:

raised_number = integer + pp.Optional(exponent, default=1)

We simply use the ‘+’ operator to show that these two expressions should occur one after the next. We wrap our exponent expression using the pyparsing Optional class, so that pyparsing won’t complain when parsing values that have no exponent. In the event that there is no exponent, we would still like a default value to be given. In this case, a logical default exponent if none is explicitly given is 1, since any number raised to 1 is just that same number.

Testing out our combined expression, shows that we are getting pretty close:

raised_number.runTests("""\
    2¹⁰
    10³
    2³²
    10⁰
    10
    """)

Gives

2¹⁰
[2, 10]

10³
[10, 3]

2³²
[2, 32]

10⁰
[10, 0]

10
[10, 1]

The last step will be to do the computation of the actual exponentiation operation. But now that we are used to using parse actions, a second one added to the raised_number expression does the job.

def raise_to_power(t):
    return t[0]**t[1]
raised_number.addParseAction(raise_to_power)

raised_number.runTests("""\
    2¹⁰
    10³
    2³²
    10⁰
    10
    """)

Gives the desired results:

2¹⁰
[1024]

10³
[1000]

2³²
[4294967296]

10⁰
[1]

10
[10]

Note that raise_to_power() has no code in it to convert the tokens to ints. This is because the two expressions that make up a raised_number have each already converted their strings to ints, so the resulting parsed tokens are no longer just strings, but ints. Similarly, the code that performs conversion from superscript string to int has no exception handling in case of bad input. Why? Because the input has already been screened in the parser so that only valid superscript strings are sent to the parse action.

Here are some ideas for other expressions that this parser does not currently handle:

    -1³
    -1²
    (4-5)³
    (1/2)³
    1/2¹⁰
    2⁻¹⁰
    6.02×10²³

Here is the full listing of code from this article:

# -*- coding: utf-8 -*-

import pyparsing as pp

def show_numeric_characters(digits):
    for i in sorted(digits):
        print(repr(i), ord(i), end=' ')
        try:
            print(int(i))
        except ValueError:
            print('-error-')

digits = "0123456789"
superscript_digits = "⁰¹²³⁴⁵⁶⁷⁸⁹"
show_numeric_characters(digits)
show_numeric_characters(superscript_digits)

# a pyparsing expression to parse an exponent of superscript digits
exponent = pp.Word(superscript_digits).setName("integer exponent")

exponent.runTests("""\
    ¹²³
    ¹⁰
    10
""")

# pyparsing-provided expression to parse plain old integers
integer = pp.pyparsing_common.integer
# alternate form, which could handle a leading '-' sign
# integer = pp.Regex(r'-?\d+').addParseAction(lambda t: int(t[0]))
integer.runTests("""\
    2
    42
    4294967296
""")

# function to convert superscript string to int
superscript_digits = "⁰¹²³⁴⁵⁶⁷⁸⁹"
super_int = dict((digit, value) for value, digit in enumerate(superscript_digits))
def super_to_decimal(superscript_str):
    ret = 0
    for chr in superscript_str:
        ret = ret * 10 + super_int[chr]
    return ret

# parse action to convert a string of superscript digits to an int
def convert_parsed_exponent(tokens):
    return super_to_decimal(tokens[0])
exponent.addParseAction(convert_parsed_exponent)

exponent.runTests("""\
    ¹²³
    ¹⁰
    10
""")

# define an expression to parse an integer optionally raised
# to an exponent power
raised_number = integer + pp.Optional(exponent, default=1)

# add parse action to perform exponentiation
def raise_to_power(t):
    return t[0]**t[1]
raised_number.addParseAction(raise_to_power)

# take it for a spin!
raised_number.runTests("""\
    2¹⁰
    10³
    2³²
    10⁰
    10
    """)

Pyparsing is available through PyPI using pip install. The home for pyparsing has recently moved to GitHub: https://github.com/pyparsing/pyparsing

UPDATE: The unicodedata module in Python’s standard library includes the digit() method, which will do the same job as our super_int_map, converting superscripts (and subscripts too!) to their corresponding int values. We still need to do the conversion in sup_to_int that handles multiple digit superscripts, though.

,

Leave a comment

Gibberish Haiku Generator

On a lark, I wrote this haiku generator one afternoon at work, with the Python becoming more and more obfuscated as I went:

from random import choice as CH
from itertools import product as P
from functools import partial as PA
def haiku(syls=dict((i,s.split()) for i,s in enumerate(
    "fair blue green sun earth sky air lake bird fish wheat tree sand life death grim "
    "man boy girl runs plays hunts far old young deep high I my his her sings sad old "
    "red black white night day sleeps broods joy eats drinks swims leaps dreams worm "
    "eye hand arm leg foot mouth blood tears years moon stars days wild dog cat time "
    "friend son takes lives dies loves hates laughs cries shouts flies flames burnt "
    "tall short grass low mourns moans chants wind blows grows rain snow light dark "
    "prays prayer soul tone hue shade fades rich leaf leaves arch oak rose song just "
    "child void|"
    "golden yellow meadow open mountain nature garden ocean tiger vision only "
    "flower zebra lion woman baby wonder joyful dances dancing laughing single "
    "morning evening sleepy drowsy awake asleep away hunger pleasure simple deadly "
    "today tonight progress prefers dances mother father daughter droplet ocean "
    "laughter sorrow running jumping rushing turning whispers cascade arrow simple "
    "longing elder willow honey water heaven above below after before river "
    "quiet embers wishes under over respite relief journey afar glistens abyss "
    "shimmers ripples shudders shivers trembles slumbers hiding gently slowly "
    "slender brooding somber ochre velvet ruby entwined spirit blossom ribbon iris "
    "willow meaning resolve omen|"
    "waterfall aurora indigo butterfly lioness whimsical yesterday enemy "
    "contemplates considers grandfather grandmother family glistening emerges "
    "fortunate absolute horizon precipice oasis shivering shimmering tenderly "
    "sleepily carefully wistfully beautiful entwining sepia unknowing "
    "feverish|"
    "caterpillar captivating fascinating philosophy iridescent "
    "intermingled understanding proximity unknowable unreachable|"
    "inescapable undeniable inevitable unfathomable".split('|'),1)),
    combs=map(PA(filter,None),P(*[range(6)]*7))):
    return '\n'.join(' '.join(CH(syls[x])
                    for x in CH(filter(lambda c:sum(c)==n, combs)))
                        for n in (5,7,5))

for i in range(10):
    print (haiku()+'\n')

Some samples:

feverish moans runs
moans unfathomable time
garden baby earth

absolute moon cries
wheat moon evening blossom song
earth takes shudders plays

life grim precipice
blue understanding abyss
fades grim aurora

broods willow grim dies
fair respite iridescent
foot shade I hiding

my stars family
iridescent heaven sand
hunts grass cascade runs

dreams turning red snow
waterfall shade shimmering
captivating stars

sand grim tonight night
moon nature shimmering prays
my flower shade blows

old white glistening
meadow only shade slender
wild his omen cat

willow hue whispers
dies grandfather son spirit
garden dark above

tears garden relief
laughs shimmering tenderly
song time mountain dies

Are these really haiku? Not in the conventional sense, and certainly not ones that would hold up to critical review. But if you select a seed dictionary of words that collectively suggest a particular mood, the generated verse can evoke that same mood.

Leave a comment

Snippet: Class that never raises AttributeError

Sometimes we’d like to have a class that is tolerant of attribute accesses that weren’t anticipated at class design time, returning a default ‘None’ value for an uninitialized attribute. Using defaultdict to override the `__dict__` attribute makes this very easy:

from collections import defaultdict

class X(object):
    def __init__(self):
        self.__dict__ = defaultdict(lambda : None)
    def __getattr__(self, attr):
        return self.__dict__[attr]

Now we can initialize an object with some attributes, but access others and get back the default None:

x = X()
x.color = "RED"
x.size = 6

print x.color
print x.size
print x.material

prints:

RED
6
None

Of course, this defeats a key validation feature in Python, so you’ll need to take extra care that you specify attributes correctly.

Leave a comment

Snippet: sgn function

From time to time, I find that I need a function to return +1, 0, or -1 representing the sign of a value: +1 or -1 for positive or negative values, or 0 for a zero value. I remember learning this as the sgn function in high school. Python’s standard lib leaves this out, but does supply cmp as a built-in, so the standard approach would probably be to define sgn using:

sgn = lambda x : cmp(x,0)

I found a nice website of Bit-Twiddling Hacks the other day, and it had a nice alternative for sgn, which in Python would be:

sgn = lambda x : (x>0) - (x<0)

The elegance of this appeals to me and I did a quick timing pass:

C:\Users\Paul>python -m timeit "[cmp(x,0) for x in (-100,0,14)]"
1000000 loops, best of 3: 0.645 usec per loop

C:\Users\Paul>python -m timeit "[lambda x:x>0 - x<0 for x in (-100,0,14)]"
1000000 loops, best of 3: 0.496 usec per loop

So by cutting out the repeated function calls to cmp, this also has the benefit of being just a tad faster.

1 Comment

Snippet: Uniquify a sequence, preserving order

I came across some code today that used a set to keep track of previously seen values while iterating over a sequence, keeping just the not-seen-before items. A brute force kind of thing:

seen = set()
unique = []
for item in sequence:
    if not item in seen:
        unique.append(item)
        seen.add(item)

I remembered that I had come up with a simple form of this a while ago, using a list comprehension to do this in a single expression. I dug up the code, and wrapped it up in a nice little method. This version accepts any sequence or generator, and makes an effort to return a value of the same type as the input sequence:

def unique(seq):
    """Function to keep only the unique values supplied in a given 
       sequence, preserving original order."""
       
    # determine what type of return sequence to construct
    if isinstance(seq, (list,tuple)):
        returnType = type(seq)
    elif isinstance(seq, basestring):
        returnType = type(seq)('').join
    else:
        # - generators and their ilk should just return a list
        returnType = list

    try:
        seen = set()
        return returnType(item for item in seq if not (item in seen or seen.add(item)))
    except TypeError:
        # sequence items are not of a hashable type, can't use a set for uniqueness
        seen = []
        return returnType(item for item in seq if not (item in seen or seen.append(item)))

My first pass at this tried to compare the benefit of using a list vs. a set for seen – it turns out, both versions are useful, in case the items in the incoming sequence aren’t hashable, in which case the only option for seen is a list.

Leave a comment

RK integrator

While moving files from my old laptop drive to my new one, I found a nice Runge-Kutta integrator class that I had written ages ago. So long ago, in fact, that I was a little embarrassed at the newbiness of some of my code. So I decided to update my code to get a nice RK class out of it, using list comprehensions instead of “for i in range” loops, and including an integrate method that acts as a generator so that the calling code can cycle through each integration step. As is typical in R-K, the system state is maintained in a vector X, and the calling method must provide a callback function that will return dX/dt.

Here is the class:

class RKIntegrator :
    "Class used to perform Runge-Kutta integration of set of ODE's"

    def __init__( self, dt, derivFunc, degree=0, initConds=None ):
        self.dt = float(dt)
        self.dt_2 = dt / 2.0
        self.t = float(0)
        if not (degree or initConds):
            raise ValueError("must specify degree or initial conditions")
        if initConds is not None:
            self.x = initConds[:]
        else:
            self.x = [0.0] * degree
        self.derivFunc = derivFunc


    def doIntegrationStep( self ):
        dt = self.dt
        dxFunc = self.derivFunc
        t2 = self.t + self.dt_2

        dx = dxFunc( self.t, self.x )
        delx0 = [ dx_i*dt for dx_i in dx ]
        xv = [x_i + delx0_i/2.0 for x_i, delx0_i in zip(self.x, delx0)]
        dx = dxFunc( t2, xv )
        delx1 = [ dx_i*dt for dx_i in dx ]
        xv = [x_i + delx1_i/2.0 for x_i,delx1_i in zip(self.x,delx1)]
        dx = dxFunc( t2, xv )
        delx2 = [ dx_i*dt for dx_i in dx ]
        xv = [x_i + delx1_2 for x_i,delx1_2 in zip(self.x,delx2)]
          
        self.t += dt
        dx = dxFunc(self.t, xv)
        
        self.x = [
            x_i + ( delx0_i + dx_i*dt + 2.0*(delx1_i + delx2_i) ) / 6.0
                for x_i, dx_i, delx0_i, delx1_i, delx2_i in 
                    zip(self.x, dx, delx0, delx1, delx2)
            ]
    
    def integrate(self):
        while True:
            self.doIntegrationStep()
            yield self.t, self.x

Here is an example of finding X with constant acceleration of 4:

def getDX( t, x ):
  return [ 
    x[1], 
    4.0 
    ]

isWhole = lambda x : abs(x-round(x)) < 1e6 

rk = RKIntegrator( dt=0.1, derivFunc=getDX, initConds = [0.0, 0.0] )
for t,x in rk.integrate():
    if t > 10: 
        break
    if isWhole(t):
        print(t, ', '.join('%.2f' % x_i for x_i in x))

Googling for ‘Python runge kutta’, I came across this blog posting:
http://doswa.com/blog/2009/01/02/fourth-order-runge-kutta-numerical-integration/
This does a good job, but hardcodes the vector size to just x, velocity, and acceleration. Here is how my R-K integrator would implement Doswa’s code:

def accel(t,x):
    stiffness = 1
    damping = -0.005
    x,v = x
    return -stiffness*x - damping*v

def getDX(t,x):
    return [
        x[1],
        accel(t,x)
        ]

rk = RKIntegrator( dt=1.0/40.0, derivFunc=getDX, initConds = [50.0, 5.0] )
for t,x in rk.integrate():
    if t > 100.1: 
        break
    if isWhole(t):
        print t,', '.join('%.2f' % x_i for x_i in x)

My results match the posted results to 2 places.

3 Comments

“dulce” becomes “littletable”

It’s funny how a little experiment can start to take on momentum all by itself. After looking at other Python databases, it wasn’t long before Google’s BigTable cropped in my searches.  This suggested to me a more descriptive and maybe more appopriate name for my experiment – littletable.  It’s expectations are modest, and so it has a fairly modest-sounding name.

Tables of objects are created simply by creating an empty table and loading like objects into it.  No schema, no SQL.  The attributes of the objects themselves, and the attributes used in the queries and joins, describe an organic, emergent schema.  I loaded a data table of zipcodes by state (from xxx), and a table of states.  There are a total of over 42,000 defined zipcodes (data as of 1999).  Here is a query of zipcodes:

fullzips = (zips.join_on("statecode") + states)()

A table can keep an index on a particular attribute, with the option to require uniqueness or not.  Indexes are used at join time to optimize the join performance, by minimizing the number of records that have to be sifted through.

The latest version of littletable (0.3) now includes table pivoting.  This makes it very easy to look at data in a large table to see how it is distributed across particular keys.  For instance, here is a table of the top 20 states with the most zip codes:

    TX Texas             2676
    CA California        2675
    NY New York          2238
    PA Pennsylvania      2224
    IL Illinois          1595
    OH Ohio              1470
    FL Florida           1449
    VA Virginia          1253
    MO Missouri          1192
    MI Michigan          1169
    NC North Carolina    1083
    IA Iowa              1073
    MN Minnesota         1036
    KY Kentucky          1016
    IN Indiana            992
    GA Georgia            975
    WV West Virginia      930
    WI Wisconsin          914
    AL Alabama            847
    TN Tennessee          806

created by “pivoting” the zip code table on the single attribute stateabbr.

The states with the fewest zip codes are:

    GU Guam                21
    VI Virgin Islands      16
    FM Federated State      4
    MP Northern Marian      3
    MH Marshall Island      2
    AS American Samoa       1
    PW Palau                1

And this query:

    nozips = states.where(lambda o:o.statecode not in zips.statecode)

returns a single record:

    ['UM', None]

(“UM” is the postal state abbreviation for the U.S. Minor Outlying Islands, a group of uninhabited islands SW of Hawaii – see  http://en.wikipedia.org/wiki/U.S._Minor_Outlying_Islands).

A nice characteristic of littletable queries and joins is that they each return a new fully-functional table, containing the joined and/or filtered records described in the query.  Tables can then be exported to CSV files, making it easy to save and restore the results of a particular query.  Tables are just wrappers around Python lists, so it is still possible to access parts of them using slice notation.

Here is a query from a database of US place names, retrieving all of the tunnels in the US, sorted by descending elevation.

    tunnels = us_names.query(feature="TNL", _orderby="elev desc")

Using basic python slicing, we can then find the 15 highest and 15 lowest tunnels in the country:

    for t in tunnels[:15]+tunnels[-15:]:
        print "%-30.30s %s %5d" % (t.name, t.state, t.elev)

Giving:

    Twin Lakes Reservoir and Canal CO  4003
    Harold D Roberts Tunnel        CO  3763
    Eisenhower Memorial Tunnel     CO  3728
    Twin Lakes Reservoir Tunnel Nu CO  3709
    Ivanhoe Tunnel                 CO  3680
    Vasquez Tunnel                 CO  3653
    Old Alpine Tunnel (historical) CO  3639
    Hagerman Tunnel                CO  3637
    McCullough Tunnel              CO  3635
    Strickler Tunnel               CO  3608
    August P Gumlick Tunnel        CO  3605
    Charles H Boustead Tunnel      CO  3603
    Quandary Tunnel                CO  3574
    Chapman Tunnel                 CO  3561
    Hoosier Pass Tunnel            CO  3552
    Harvey Tunnel                  LA     2
    Posey Tube                     CA     2
    Harbor Tunnel                  MD     0
    Baytown Tunnel                 TX     0
    Chesapeake Channel Tunnel      VA     0
    Downtown Tunnel                VA     0
    Midtown Tunnel                 VA     0
    Thimble Shoal Channel Tunnel   VA     0
    Holland Tunnel                 NJ     0
    Lincoln Tunnel                 NJ     0
    Brooklyn-Battery Tunnel        NY     0
    Pennsylvania Tunnnels          NY     0
    Queens Midtown Tunnel          NY     0
    Webster Street Tube            CA     0
    Rapid Transit Trans-Bay Tube   CA    -2

So it isn’t necessary to support every SQL feature in the litletable API, since the objects *are* in what is essentially a list structure.

So far littletable has been a handy little tool for quick data manipulation, and maybe some simple database concept experimentation. Not sure if there is much more I really need to add – we’ll see if there are many takers out there for this little recipe.

1 Comment