Things Python

Happy Repdigit Day!

Posted by ptmcg in Snippets, Uncategorized on February 2, 2022

Today being February 2, 2022, I got to thinking about dates. Since most people will write today as “2/2/22”, today is one of the few days that can be written all using the same digit. I thought about how to find all such dates in a century, and the first thing I needed was a name for such dates. I came up with “uninumeric”, but a Google search turned up this Wikipedia article, which calls them repdigits, as in “repeated digits”.

SPOILER ALERT: the solution is shown at the end of this post. Try to come up with your own guess as to how many such dates there are before continuing.

We have to set some ground rules on this exercise first. Should January 1, 2001 count as a repdigit date? I’m going to say that leading zeroes can be dropped when finding repdigit dates, so this date, when written as “1/1/1”, would be a repdigit. Trailing zeroes cannot be dropped, so January 10, 2001 would not be a repdigit date.

I have some other date filters in mind, so I decided to create a CSV table of all the dates in a century, truncating the year to modulo 100 so that the list would apply to any modern century. (Leap days are always the fly in the ointment in such tables, and if I use the current century to make this table, it would omit February 29 from the 0 year. Fortunately, leap days could not qualify as repdigit dates, so I’m not going to worry about it today.)

I pulled up my trusty Python IDE and imported my simple table package littletable, plus some names from the Python datetime package:

from datetime import datetime, timedelta
import littletable as lt

I need something to give me all the days in a century, but unfortunately, Python’s range() builtin won’t allow this:

dates = range(datetime(2000, 1,1), datetime(2100, 1,1), timedelta(1))

range() only accepts int arguments.

So I needed to write a daterange function (I have said this so many times, that it might qualify as McGuire’s Law – Before you can ever do anything, you always have to do something else first. Unfortunately, McGuire’s Law is recursive.)

def daterange(from_: datetime, to_: datetime, step_: timedelta = timedelta(days=1)):
    cur = from_
    while cur < to_:
        yield cur
        cur += step_

This generator function follows the same convention as Python’s range(): it is inclusive of the starting value, but exclusive of the ending value.

With daterange, I can now create my littletable Table of dates in the current century:

dates = lt.Table()
dates.insert_many({"year": date.year % 100, "month": date.month, "day": date.day}
                  for date in daterange(datetime(2000, 1, 1), datetime(2100, 1, 1)))

For nice presentation, we can also add a str field (I’m showing year/month/day for the format, to avoid any discussions about why Americans write month/day/year while the rest of the world writes day/month/year – fortunately, when looking for repdigit dates, either style works the same):

dates.add_field("str",
                lambda rec: f"{rec.year:02d}/{rec.month:02d}/{rec.day:02d}")

Now to detect which dates are repdigit dates, we need a predicate function that takes a record from our table, and returns whether the date is a repdigit. A function to detect a repdigit string is just:

def is_repdigit(intstr: str):
    return len(set(intstr)) == 1

And for easy filtering, add a “YMD” field to all the records in our table, which contains the digits used to write each date, dropping leading zeroes:

dates.add_field("YMD", lambda rec: f"{rec.year}{rec.month}{rec.day}")

Getting and presenting the matching dates is now just:

repdigit_dates = dates.where(lambda rec: is_repdigit(rec.YMD))
repdigit_dates.present(caption=f"{len(repdigit_dates} dates")

Which shows these dates:

  Year   Month   Day   Str        Ymd     
 ──────────────────────────────────────── 
     1       1     1   01/01/01   111     
     1       1    11   01/01/11   1111    
     1      11     1   01/11/01   1111    
     1      11    11   01/11/11   11111   
     2       2     2   02/02/02   222     
     2       2    22   02/02/22   2222    
     3       3     3   03/03/03   333     
     4       4     4   04/04/04   444     
     5       5     5   05/05/05   555     
     6       6     6   06/06/06   666     
     7       7     7   07/07/07   777     
     8       8     8   08/08/08   888     
     9       9     9   09/09/09   999     
    11       1     1   11/01/01   1111    
    11       1    11   11/01/11   11111
    11      11     1   11/11/01   11111
    11      11    11   11/11/11   111111
    22       2     2   22/02/02   2222
    22       2    22   22/02/22   22222
    33       3     3   33/03/03   3333
    44       4     4   44/04/04   4444
    55       5     5   55/05/05   5555
    66       6     6   66/06/06   6666
    77       7     7   77/07/07   7777
    88       8     8   88/08/08   8888
    99       9     9   99/09/09   9999

26 dates

With this table, we can now try some other what-ifs (left as exercises for the reader):

dates that are repdigits if you don’t drop the leading zeroes
dates that are binary dates (written with just 0’s and 1’s)
dates that are palindromes (now you have to decide on month/day/year vs. day/month/year)

Rather than regenerate the table every time, I wrapped my table create code in the else clause of a try-except, trying to first load the table from a CSV:

try:
    dates = lt.Table().csv_import("dates_in_century.csv",
                                  transforms={}.fromkeys("year month day".split(), int))
except FileNotFoundError:
    dates = lt.Table()
    dates.insert_many({"year": date.year % 100, "month": date.month, "day": date.day}
                      for date in daterange(datetime(2000, 1, 1), datetime(2100, 1, 1)))
    dates.add_field("str",
                    lambda rec: f"{rec.year:02d}/{rec.month:02d}/{rec.day:02d}")
    dates.csv_export("dates_in_century.csv")

python

Leave a comment

How to try something ‘N’ times in Python

Posted by ptmcg in Snippets on December 5, 2021

Photo by Andrea Piacquadio on Pexels.com

“If at first you don’t succeed, try, try again.”

There are many times when we will write Python code that might encounter temporary exceptions that will clear up if we would just have our code pause, take a breath, and then try again. This is actually pretty common with scripts that access some network resource, whether it is a web page or a REST API, or some other service that is either resource-constrained or has built in rate-limiting.

Here is an example of a function that might be part of a larger program to get the contents from a list of web URLs. The function is pretty simple: if successful, it returns the HTML contents of the page; if not, it raises an exception:

from contextlib import closing
import urllib.request

def get_data_for_url(url):
    with closing(urllib.request.urlopen(url, timeout=3)) as response:
        if response.status == 200:
            return response.read()
        else:
            raise Exception(f"failed with status {response.status}")

We may find that for some of these URLs, our request just times out or has some other temporary limit that causes it to fail, when a retry might succeed. So we could wrap the code in this function inside a while loop:

def get_data_for_url(url):
    while True:
        with closing(urllib.request.urlopen(url, timeout=3)) as response:
            if response.status == 200:
                return response.read()

        time.sleep(2)

(We add a short sleep in the loop, to give the web server a chance to “settle down” between retries.)

Unfortunately, this kind of “loop-forever” while loop is a pretty well-known anti-pattern. The request may fail for lots of reasons, some of which might not be cured by just trying again – for instance, if we had invalid request login credentials, or if the server was down. In that case, our script would just loop without stopping, since the request would continue to return the same “Not Authorized” or “Not Found” response over and over again.

So we should limit the number of times we try to get the web page. We could add a “number of tries” counter:

def get_data_for_url(url):
    num_tries = 0
    while True:
        with closing(urllib.request.urlopen(url, timeout=3)) as response:
            if response.status == 200:
                return response.read()

        time.sleep(2)

        num_tries += 1
        if num_tries > 5:
            raise Exception(f"failed with status {response.status}")

There are a couple of issues with this code. First off, there is an “off-by-one” error – this code will actually try 6 times, not 5. Easily fixed by changing “>” to “>=”. Also, our loop limit of 5 is buried down in the body of the code, and will take some searching when we decide to raise or lower the retry count. But more importantly, the code is very sensitive to the incrementing of num_tries. In fact, in my first test of writing this code, I left out the "num_tries += 1” statement, and I had another loop-forever loop. If at some point, you or another developer decide to refactor the code and add in a “continue” statement before the increment, then num_tries will stay 0 and you will be back to looping forever again.

In this code, I made a slight change here to change “while True” to “while num_tries < 5“. This makes the retry limit a little more prominent, but we still have the problem around incrementing num_tries.

def get_data_for_url(url):
    num_tries = 0
    while num_tries < 5:
        with closing(urllib.request.urlopen(url, timeout=3)) as response:
            if response.status == 200:
                return response.read()

        time.sleep(2)

        num_tries += 1

    raise Exception(f"failed with status {response.status}")

A good way to do auto-incrementing is, instead of using a while loop with a counter, to use a for loop over a range.

def get_data_for_url(url):
    for tries in range(5):
        with closing(urllib.request.urlopen(url, timeout=3)) as response:
            if response.status == 200:
                return response.read()

        time.sleep(2)

    raise Exception(f"failed with status {response.status}")

This is a big improvement, because now our counter will always increment, even if a “continue” gets inserted into the loop.

There is still a minor problem – on the last try, if we are about to give up, we still sleep for 2 seconds before exiting the loop and raising the exception. After the last try, there is really no need to sleep again before exiting the loop and raising the “we give up!” exception. It sounds like a minor issue, but it’s possible that other code might be needed between retries besides a simple sleep, and we really just want to quit looping at this point.

We could conditionalize the sleep with this kind of code:

            if tries < 4:
                time.sleep(2)

But if we decide to raise or lower the number of tries, we have to chase down and change this other value as well.

I’ve settled on this pattern for retrying code ‘n’ times.

    def get_data_for_url(url):
        for tries_remaining in reversed(range(5)):
            with closing(urllib.request.urlopen(url, timeout=3)) as response:
                if response.status == 200:
                    return response.read()

            if tries_remaining:
                time.sleep(2)

        raise Exception(f"failed with status {response.status}")

By reversing the range, the tries_remaining variable will count down from 4 to 0. Now the final try will always have a value of 0, so it can be easily checked as a boolean value – changing the variable name from “tries” to “tries_remaining” also makes this condition read a little more naturally.

Lastly, these values of 5 for number of tries and 2 for the retry interval are best defined at some global level, or in configuration settings for your script. So we can define these constants to simplify the maintenance of these values whenever they need to be updated.

NUM_PAGE_FETCH_TRIES = 5
PAGE_FETCH_TIMEOUT = 3
PAGE_FETCH_RETRY_WAIT = 2

def get_data_for_url(url):
    for tries_remaining in reversed(range(NUM_PAGE_FETCH_TRIES)):
        with closing(urllib.request.urlopen(url, timeout=PAGE_FETCH_TIMEOUT)) as response:
            if response.status == 200:
                return response.read()

        if tries_remaining:
            time.sleep(PAGE_FETCH_RETRY_WAIT)

    raise Exception(f"failed with status {response.status}")

Now the purpose of these various constants is much clearer, and future maintainers will know where to make updates to these parameters without scanning through all the program code.

UPDATE: After posting on Twitter, I got a great reply from @GordonMcGregor, that he had written something similar as a decorator, and that this made for cleaner code. I gave it more thought, and this would make for a good decorator example – see below:

import functools
import time

# retry decorator
def retry(n: int, interval: int = 2):
    def _deco(fn):
        @functools.wraps(fn)
        def _inner(*args, **kwargs):
            for tries_remaining in reversed(range(n)):
                try:
                    return fn(*args, **kwargs):
                except Exception:
                    if not tries_remaining:
                        raise
                time.sleep(interval)
        return _inner
    return _deco


# example using the retry decorator
from contextlib import closing
import urllib.request
                    
@retry(3)
def get_data_for_url(url: str):
    with closing(urllib.request.urlopen(url, timeout=3)) as response:
        if response.status == 200:
            return response.read()
    raise Exception(f"failed with status {response.status}")

python

Leave a comment

Pyparsing Walkthrough – Letter Ranges

Posted by ptmcg in Uncategorized on January 1, 2021

This post is hopefully the first of several to walk through the development of some simple pyparsing parsers, to demonstrate how to make use of pyparsing to compose complex parsers from simple building-blocks.

This walkthrough will address a hypothetical problem: given a string of alphabetic characters, expand any ranges to return a list of all the individual characters.

For instance given:

A-C,X,M-P,Z

return the list of individual characters:

A,B,C,X,M,N,O,P,Z

For simplicity, we will limit ourselves to the ASCII alphabetic characters A-Z. At the end, I’ll suggest some enhancements that you’ll be able to tackle on your own.

Parsing Options in Python

Python offers several built-in mechanisms for slicing and dicing text strings:

str methods – the Python str class itself includes a number of methods such as split, partition, startswith, and endswith that are suitable for many basic string parsing tasks
re and regex modules – the builtin re module and the externally available enhancement regex bring all the power, speed, complexity, and arcana of regular expressions to crack many difficult parsing problems
shlex module – the builtin shlex module takes its name from the Unix sh shell, and is good for parsing strings as if they were command arguments, including recognizing and properly handling string elements enclosed in quotes

These features are good for many quick text-splitting and scanning tasks, but can suffer when it comes to ease of maintenance or extensibility.

Approaching a problem with Pyparsing

Whether using pyparsing or another parsing library, it is always good to start out with a plan for the parser. In parsing terms, this is called a BNF, for Backus-Naur Form. This plan helps structure your thinking and the resulting implementation of your parser. Here is a simplified BNF for the letter range problem:

letter ::= 'A'..'Z'
letter_range ::= letter '-' letter
letter_range_item ::= letter_range | letter
letter_range_list ::= letter_range_item [',' letter_range_item]...

This translates almost directly to pyparsing Python code:

import pyparsing as pp

letter = pp.Char("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
letter_range = letter + '-' + letter
letter_range_item = letter_range | letter
letter_range_list = (letter_range_item + (',' + letter_range_item)[...])

Using parseString to run this parser:

print(letter_range_list.parseString(test))

gives

['A', '-', 'C', ',', 'X', ',', 'M', '-', 'P', ',', 'Z']

While this shows that we did in fact parse everything in the input, we still have to do some cleaning up, and also add the expansion of the A-C type ranges.

Suppress delimiters

First thing is to strip out the commas. They are important during the parsing process, but not really necessary for the parsed results. Pyparsing provides a class Suppress for this type of expression, that needs to be parsed, but should be suppressed from the results:

letter_range_list = letter_range_item + (pp.Suppress(',') + letter_range_item)[...]

As it happens, this expr + (Suppress(delimiter) + expr)[...] pattern occurs so frequently, that pyparsing includes a helper method delimitedList(expr, delim) (with a default delimiter of ',') that supports this pattern:

letter_range_list = pp.delimitedList(letter_range_item)

With this change, our output is now cleaner:

['A', '-', 'C', 'X', 'M', '-', 'P', 'Z']

Expand the ranges

The last step is to expand the sub-ranges 'A-C' and 'M-P'. We could certainly walk this list of parsed items, looking for the '-' symbols:

def expand_range(from_, to_):
    # given two letters, return a list of all the letters between them, inclusive
    # ('C', 'F') will return ['C', 'D', 'E', 'F']
    alphas = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    from_loc = alphas.index(from_)
    to_loc = alphas.index(to_)
    return list(alphas[from_loc:to_loc+1])

out = []
letter_iter = iter(parsed_output)
letter = next(letter_iter, '')
last = ''
while letter:
    if letter != '-':
        out.append(letter)
        last = letter
    else:
        to_ = next(letter_iter)
        out.extend(expand_range(last, to_)[1:])
    letter = next(letter_iter, '')

But doesn’t it feel like we are retracing steps that the parser must have already taken? When pyparsing ran our parser to scan the original input string, it had to follow a very similar process to recognize the letter_range expression. Why not have pyparsing run this expansion process at the same time that it has parsed out a letter_range?

As it happens, we can attach a parse-time callback to an expression; in pyparsing applications, this callback is called a parse action. Parse actions can serve various purposes, but the feature we will use in this example is to replace the parsed tokens with a new set of tokens. We’ll still utilize the expand_range function defined above, but now calling it will be more straightforward:

def expand_parsed_range(t):
    # expand parsed tokans ['C', '-', 'F'] to ['C', 'D', 'E', 'F']
    return expand_range(t[0], t[2])
letter_range.addParseAction(expand_parsed_range)

And now our returned list reads:

'A', 'B', 'C', 'X', 'M', 'N', 'O', 'P', 'Z']

One more thing…

While this accomplishes our goal, there is one pyparsing “best practice” that I’d like to incorporate into this example. Whenever we access parsed data using numeric indices like in:

return expand_range(t[0], t[2])

we make it difficult to make changes in our grammar at some time in the future. For instance, if the grammar were to change and add an optional field in this expression, our index of 2 might have to be conditionalized depending on whether that field were present or not. It would be much better if we could refer to these fields by name.

We can attach names to parsed fields by modifying the definition of the letter_range expression from:

letter_range = letter + '-' + letter

to:

letter_range = letter('start') + '-' + letter('end')

With this change, we can now write our parse action using these results names:

def expand_parsed_range(t):
    return expand_range(t.start, t.end)

And now if the definition of letter_range changes by adding other fields, our starting and ending fields will still be processed correctly, without having to change any existing code. Parsers that use results names are much more robust and maintainable over time.

The Complete Parser

Here is the full working parser example (with tests using run_tests):

import pyparsing as pp

def expand_range(from_, to_):
    """
    Given two letters, return a list of all the letters between them, inclusive.
    >>> expand_range('C', 'F')
    ['C', 'D', 'E', 'F']
    """
    alphas = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    from_loc = alphas.index(from_)
    to_loc = alphas.index(to_)
    return list(alphas[from_loc:to_loc+1])

def expand_parsed_range(t):
    """
    Parse action to expand parsed tokans ['C', '-', 'F'] to ['C', 'D', 'E', 'F']
    """
    return expand_range(t.start, t.end)

# define parser elements - use setName to define names for these elements
# for better error messages
letter = pp.Char("ABCDEFGHIJKLMNOPQRSTUVWXYZ").setName("letter")
letter_range = (letter('start') + '-' + letter('end')).setName("letter_range")
letter_range.addParseAction(expand_parsed_range)
letter_range_list = pp.delimitedList(letter_range | letter)

# test the original input string
test = "A-C,X,M-P,Z"
print(letter_range_list.parseString(test))

# use runTests to easily test against multiple input strings, with comments
letter_range_list.runTests("""\
  # a simple list with no ranges
  A,B

  # a range
  A-E

  # a range and a single letter  
  A-E,X

  # the original complete test string
  A-C,X,M-P,Z
  
  # an erroneous input
  -C,X,M-P,Z
""")

Gives this output:

['A', 'B', 'C', 'X', 'M', 'N', 'O', 'P', 'Z']

# a simple list with no ranges
A,B
['A', 'B']

# a range
A-E
['A', 'B', 'C', 'D', 'E']

# a range and a single letter
A-E,X
['A', 'B', 'C', 'D', 'E', 'X']

# the original complete test string
A-C,X,M-P,Z
['A', 'B', 'C', 'X', 'M', 'N', 'O', 'P', 'Z']

# an erroneous input
-C,X,M-P,Z
^
FAIL: Expected {letter_range | letter}, found '-'  (at char 0), (line:1, col:1)

Potential extensions

Here are some extensions to this parser that you can try on your own:

Change the syntax of a range from “A-C” to “A:C”; how much of your program did you have to change?
Add support for lowercase letters and numeric digits
Add a parse action to letter_item_list to return a sorted list, with no duplicates
Add validation that letter_range is proper (do not allow degenerate ranges like “J-J”, out of order ranges “X-M”, or mixed ranges like “T-4”)
Write another parser to handle ranges of integers instead of letters

For More Information

You can find more information on pyparsing by visiting the pyparsing Github repo, at https://github.com/pyparsing/pyparsing.

Pyparsing

Leave a comment

Fuzzy Groupby – Unusual Restaurant Part II

Posted by ptmcg in Uncategorized on November 11, 2020

In the last blog post (https://thingspython.wordpress.com/2020/10/30/python-groupby-and-an-unusual-restaurant/ ), we looked at Python’s stdlib’s itertools.groupby method, with examples illustrated using a restaurant with an unusual seating plan:

they have just one table
that table can seat an unlimited number of customers
BUT, they must all be in the same party

The restaurant is very popular, so there is always a line of customers outside waiting to get in. The groupby greeter is in charge of identifying each group:

on Family Night, groups are determined by those all having the same last name
on High School Night, groups are determined by those who are in high school or not

These represent the most common uses of itertools.groupby in Python – from a sequence of items, identify groups of items based on those having some common value (such as a last name), or whose properties all evaluate to a common value (such as a grade level indicating whether or not one is in high school).

In this post, I’ll describe what I call fuzzy groupby – items in a group are defined not by their own attributes, but by their relationship with those around them.

For the first example, let’s assume our restaurant has Kids Eat Free Night, where an adult can bring as many children as they like and the children eat for free. Adults may still eat alone as well. For this example, we will say that a child is one under the age of 12.

Note that this is different from High School Night. On that night, the high schoolers and non-high schoolers each constituted their own separate groups. Now we want to define our groups as an adult + 0 or more children.

For a particular Kids Eat Free Night, our groupby greeter might announce parties as:

Party of one adult and 2 children
Party of one adult and 3 children
Party of one adult alone
Party of one adult and 1 child
Party of one adult alone
...

One way to write this in Python is to write the code just like we did for High School Night: have our grouping function return a bool of whether each person is a child or not. But since we need to combine the adult from an all-adults group with the children in the next all-children group, we’ll need to do that regrouping manually:

if adults, each but the last is its own group
if children, all are in the same party as the most recent adult

To write this in Python, we first create a line of waiting customers:

from collections import namedtuple

# create a line of waiting customers
Customer = namedtuple("Customer", "name age")
customers_in_line = [
    Customer("Alice", 22),    # an adult with group of 2 children
    Customer("Ken", 7),
    Customer("Olivia", 9),
    Customer("Jim", 37),      # an adult with group of 3 children
    Customer("Bob", 5),
    Customer("Chip", 8),
    Customer("Nancy", 2),
    Customer("Dan", 26),      # an adult dining alone
    Customer("Candace", 24),  # an adult with group of 1 child
    Customer("Courtney", 11),
    Customer("Arthur", 26),   # an adult dining alone
    ]

For groupby, we’ll need a function to return whether a customer is an adult or not:

def is_adult(customer):
    return customer.age >= 12

Using groupby, we’ll now get groups of adults and groups of children. If the group is a group of adults, then every adult but the last is a single adult dining alone.

The last adult in the group may be followed by a group of children, so we need to save that adult in a “holding” variable, to be included with the children in the next group given by groupby.

from itertools import groupby

last_adult = None
for group_are_adults, customers in groupby(customers_in_line,
                                 key=is_adult):
    # convert generator to list so we can easily detect 
    # the last customer
    customers = list(customers)
    if group_are_adults:
        # every customer but the last is an adult dining alone
        for customer in customers[:-1]:
            # announce the group, and its members
            print("Party of 1 adult alone")
            print(" -", customer.name, "(adult)")
        last_adult = customers[-1]

    else:
        # group are all children, so associate them with
        # the last adult
        if last_adult is not None:
            # announce the group, and its members
            print("Party of one adult and {} {}".format(
                    len(customers),
                    "child" if len(customers) == 1 else "children"))

            print(" -", last_adult.name, "(adult)")
            for customer in customers:
                print(" -", customer.name, "(child)")
            last_adult = None

        else:
            # this should never happen, but...
            print("what are these children doing here without an adult?!")

Lastly, we must deal with a subtle bug in this code: what happens if the list ends with one or more adults followed by no children? If that happens, then the last adult would get stored in last_adult, but never actually processed. So after the code for our groupby loop, we need additional code to handle any last single-dining adult:

# must see if any adult was at the end of the line
if last_adult is not None:
    print("Party of 1 adult alone")
    print(" -", last_adult.name, "(adult)")

This is some pretty messy code, with lots of duplication. It is only a little better than doing our own iteration over the list instead of using groupby.

It would be better if we could define a key grouping function so that groupby gives us each group as our rules define them.

Such a grouping function will need to keep some state of the current group, and will need to detect when a new group has started. We can do that by moving the “last adult” holding variable into the grouping function, and have the grouping function return that variable’s id(). Now the grouping function doesn’t necessarily return a value that comes from each entry in the list, but it does return a value that represents the group.

We can attach attributes to Python functions just like any other object. Doing so avoids creating a separate global variable, and better indicates the association of that value with the function.

Here is a grouping function that will keep track of the adult for each group of one adult plus zero or more children:

def in_group(customer):
    # if the current customer is an adult, then
    # they are the start of a new group
    if is_adult(customer):
        in_group.first_in_group = customer

    return id(in_group.first_in_group)

# define attribute for tracking the adult for each group
in_group.first_in_group = None

Remember, groupby calls this grouping function for every element in the list, or in this example, every customer waiting in line. If that customer is an adult, then we know they are the start of a new group. In that case, that customer becomes the new first customer of the next group. If the customer is a child, then we simply return the value from the last-seen adult, so that the adult and all the children return the same grouping value.

This value returned by the grouping function is no longer all that interesting. The important parts are that it describes the current group, that all the members of this group return the same value, and that the value will change on the start of the next group. If we could be sure that all people in the line had unique names, we could return in_group.first_in_group.name. But since we can’t be sure of that, it is simple to just return the id() of the first_in_group customer object.

Since we now only care about the contents of the groups, and not the grouping value itself, we can just use the throwaway variable _ for that value:

for _, customers in groupby(customers_in_line, key=in_group):

    # extract customers in the group into the leading adult
    # and the following list of children, if any
    adult, *children = customers

    # announce the group
    if children:
        print("Party of one adult and {} {}".format(
                len(children),
                "child" if len(children) == 1 else "children"))
    else:
        print("Party of 1 adult alone")

    # print out members of the group
    print(" -", adult.name, "(adult)")
    for child in children:
        print(" -", child.name, "(child)")

At the expense of a slightly more complicated grouping function, our groupby loop has cleaned up considerably. We get the same output as before, but now the code that processes each group is straightforward, with no code duplication. And no need for special handling after the loop for straggling adults.

A slight variation on the last example could be made if we envision the restaurant having an Eat With People Close To Your Own Age Night. On this night, the waiting customers are grouped if their ages are all within 5 years of each other. In these groups, there is no distinction of adult or child, so this code is extremely simple. The grouping function is very similar, with a minor change to handle the initial case where the first_in_group instance is None.

def in_group(customer):
    # detect if we are the first group (first_in_group is None) 
    # or if we are starting a new group - in either case, update
    # the first_in_group with the current customer
    if (in_group.first_in_group is None 
            or abs(customer.age - in_group.first_in_group.age) > 5):
        in_group.first_in_group = customer
    return id(in_group.first_in_group)

in_group.first_in_group = None

[EDIT: it turns out that this does not group customers within 5 years of age of each other, but customers within 5 years of age of the first customer in the group, either older or younger. So this would permit groups of customers up to 10 years of age apart. Correct code would also add in_group.min_age and in_group.max_age, update them as new in-group customers are added, and determine if a new group is found if customer.age < in_group.max_age - 5 or customer.age > in_group.min_age + 5. After adding 2 more attributes to in_group and this more complex group detection logic, one might reasonably argue that this should really be moved to a separate class. The code has been updated in the summarized code at the end of this article.]

This is the simplest groupby loop of all, since there is no need to distinguish types of customers – we just list them all and their ages.

for _, customers in groupby(customers_in_line, key=in_group):

    customers = list(customers)

    # announce the party
    print("Party of {} customer{}".format(
                len(customers),
                "" if len(customers) == 1 else "s"))

    # list the customers in the party
    for customer in customers:
        print(" -", customer.name, customer.age)

I have used this same kind of fuzzy groupby in several cases. In my last job, I often had to process log files where some log entries were multiple lines long, or had an associated traceback inserted in the log. In those cases, I would detect a log line that started with a valid timestamp, and that would be the first_in_group line. Log entries on a single line would be a group of just one line; entries on multiple lines would result in a group with the first entry being the timestamped log line, and the remaining entries being the continuation lines. This was very similar to the Kids Eat Free Night example.

In another case, I needed to group log entries by proximity in time. These were actually network packet captures, and I needed to group packets that occurred within 5 milliseconds, so that bursts of packets could be identified. For that case, I used a method similar to the Eat With People Close To Your Own Age Night grouping function.

Here is the full code described in this article:

from collections import namedtuple
from itertools import groupby

# create a line of customers
Customer = namedtuple("Customer", "name age")
customers_in_line = [
    Customer("Alice", 22),    # an adult with group of 2 children
    Customer("Ken", 7),
    Customer("Olivia", 9),
    Customer("Jim", 37),      # an adult with group of 3 children
    Customer("Bob", 5),
    Customer("Chip", 8),
    Customer("Nancy", 2),
    Customer("Dan", 26),      # an adult dining alone
    Customer("Candace", 24),  # an adult with group of 1 child
    Customer("Courtney", 11),
    Customer("Arthur", 26),   # an adult dining alone
    ]

#
# Kids Eat Free Night example
#
print("\nKids Eat Free Night example")
def is_adult(customer):
    return customer.age >= 12

last_adult = None
for group_are_adults, customers in groupby(customers_in_line, key=is_adult):
    # convert generator to list so we can easily detect the last
    customers = list(customers)

    if group_are_adults:
        # every customer but the last is an adult dining alone
        for customer in customers[:-1]:
            print("Party of 1 adult alone")
            print(" -", customer.name, "(adult)")

        # save off the last adult to be grouped with the next group
        # (of children)
        last_adult = customers[-1]

    else:
        # group are all children, so associate them with the last adult
        if last_adult is not None:
            print("Party of one adult and {} {}".format(
                    len(customers),
                    "child" if len(customers) == 1 else "children"))
            print(" -", last_adult.name, "(adult)")
            for customer in customers:
                print(" -", customer.name, "(child)")
            last_adult = None
        else:
            # this should never happen, but...
            print("what are these children doing here without an adult?!")

# must see if any adult was at the end of the line
if last_adult is not None:
    print("Party of 1 adult alone")
    print(" -", last_adult.name, "(adult)")



print("\nKids Eat Free Night example #2")

def in_group(customer):
    # if the current customer is an adult, then
    # they are the start of a new group
    if is_adult(customer):
        in_group.first_in_group = customer
    return id(in_group.first_in_group)

# define attribute for tracking the adult for each group
in_group.first_in_group = None

for _, customers in groupby(customers_in_line, key=in_group):

    # extract customers in the group into the leading adult
    # and the following list of children, if any
    adult, *children = customers

    # announce the group
    if children:
        print("Party of one adult and {} {}".format(
                len(children),
                "child" if len(children) == 1 else "children"))
    else:
        print("Party of 1 adult alone")

    # print out members of the group
    print(" -", adult.name, "(adult)")
    for child in children:
        print(" -", child.name, "(child)")


print("\nEat With People Close to Your Own Age Night example")

def in_group(customer):
    # detect if we are the first group (first_in_group is None) 
    # or if we are starting a new group - if so, update the
    # first_in_group with the current customer
    if (in_group.first_in_group is None 
            or customer.age < in_group.max_age - 5
            or customer.age > in_group.min_age + 5
        ):
        in_group.first_in_group = customer
        in_group.min_age = in_group.max_age = customer.age
    else:
        in_group.min_age = min(in_group.min_age, customer.age)
        in_group.max_age = max(in_group.max_age, customer.age)

    return id(in_group.first_in_group)

in_group.first_in_group = None
in_group.min_age = 0
in_group.max_age = 0

for _, customers in groupby(customers_in_line, key=in_group):
    customers = list(customers)

    # announce the party
    print("Party of {} customer{}".format(
                len(customers),
                "" if len(customers) == 1 else "s"))

    # list the customers in the party
    for customer in customers:
        print(" -", customer.name, customer.age)

data analysis, groupby, itertools, python

Leave a comment

Python groupby and an Unusual Restaurant

Posted by ptmcg in Uncategorized on October 30, 2020

The itertools module in the Python stdlib has a number of interesting and useful functions. One that I find especially powerful when reporting or analyzing data is groupby. Given a sequence of objects and a grouping function, groupby will turn them into a sequence of grouping keys and associated objects. It is great for summarizing or tallying data that is stored as a sequence of objects, which could be dicts, tuples,strings, or really most any type of data item in the Python world.

To picture how groupby works, imagine a restaurant. This restaurant is extremely popular, and people wait outside to get in. The restaurant is unusual, in that it has only a single table, but that table will seat an unlimited number of people. The only constraint is that all the people have to be in the same party.

groupby is the greeter outside the restaurant, who goes down the line to identify who will be the next group to go in. The greeter asks each person what party they are in. When the greeter finds someone not in the current party, this indicates the end of the current group, and the start of the next.

For instance, on Family Night, imagine that everyone who is in a group will all have the same last name. So the groupby greeter asks each person their name, and if the lead person says “Bob Jones”, then the greeter continues down the line until finding someone whose last name is not Jones, say “Rogers”. Those Joneses all form the “Jones” group. If one of the Jones family is late, say “Bill Jones”, and is mixed in with the Rogerses, that is too bad, because the strict rules of the restaurant are that they will only be seated with their group if they are in line together with their group.

Also sad for the Rogerses who are in line after Bill Jones, because since they are no longer in line together with their other Rogers relatives, they too will not be seated with them. Bill Jones will be seated alone as a group of one, followed by another group of the remaining Rogerses.

So the greeter might report on the line like:
“Jones party, the next 4 customers.”
After they are admitted, the greeter continues down the line.
“Rogers party, the next 3 customers.”
Followed by,
“Jones party, 1 person.” (that tardy Bill Jones!)
And then,
“Rogers party, the next 2 customers”.

Note that the greeter does not announce the next party until the previous one has been seated.

Another night might be High School Night, when customers who are in high school are all seated together. Again, the greeter goes down the line asking each customer what grade they are in, and if they are in high school, they are seated in a group together. If not in high school, either younger or older, they are seated in a separate group, which goes until another high schooler is found. The actual grade they are in is not important, just whether it is one that goes to high school (typically grades 9 through 12 in the US).

And groupby in your Python program works much the same way as the groupby greeter. To use groupby, you need some sequence of items (typically a list, or perhaps a database cursor; may also be a Python generator), and a function that will distinguish the key value that determines the current group.

Here is how Family Night might look in Python. First we need a line of customers (we are using namedtuple from the stdlib collections module to define a simple Customer class type):

Customer = namedtuple("Customer", "last_name first_name grade_level")

line_of_customers = [
    Customer("Jones", "Alice", None),
    Customer("Jones", "Bob", None),
    Customer("Jones", "Charles", 10),
    Customer("Jones", "Dave", 9),
    Customer("Rogers", "Sandy", 11),
    Customer("Rogers", "George", 12),
    Customer("Rogers", "Amelia", None),
    Customer("Jones", "Bill", None),
    Customer("Rogers", "Sam", 3),
    Customer("Rogers", "Sylvia", 5),
    Customer("Burton", "Fred", 11),
    Customer("Burton", "Claire", 9),
    Customer("Adams", "Floyd", 10),
    Customer("Adams", "Elaine", None),
    Customer("Adams", "Susan", None),
]

Now to use groupby, we create a grouping function, and then call groupby using the list of customers and that function.

def customers_by_last_name(customer) -> str:
    return customer.last_name

for family_name, family_customers in groupby(line_of_customers,
                                             key=customers_by_last_name):
    family_customers = list(family_customers)
    print(f"{family_name} party, the next {len(family_customers)} people.")

For each group, groupby gives us the key value that all the elements in the group have in common, followed by a generator that yields the individual list elements in turn.

(Note that we had to instantiate the grouped family_customers as a list in order to use the len builtin, since family_customers is given by groupby as a Python generator. Python generators do not support len, so for small sequences you can just instantiate them into a list, and then call len() on that list. In place of len(list(family_customers)) we could have written sum(1 for _ in family_customers), which walks the generator to count up the sum of people without creating the intermediate list.)

(A second note: that function customers_by_last_name is rather wordy. The Python stdlib includes another handy module, the operator module, which includes functions that wrap many common simple operations like getting attributes, getting dict items, performing mathematical functions, etc. The one that would help here is operator.attrgetter, which takes one or more string arguments and returns a function that gets those attributes from an object passed to it, returning those attribute values in a tuple. In this case, we could have replaced the customers_by_last_name with just operator.attrgetter('last_name').)

For High School Night, the code looks very similar, differing only in the grouping method, and code to change the boolean high school status is_in_high_school into a printable string:

def in_high_school(customer) -> bool:
    return customer.grade is not None and 9 <= customer.grade <= 12

for is_in_high_school, group_customers in groupby(line_of_customers,
                                                  key=in_high_school):
    group_customers = list(group_customers)
    party_name = "High School" if is_in_high_school else "Not High School" 
    print(f"{party_name} party, the next {len(group_customers)} people.")

When you are summarizing data and want to avoid the problem of fragmented groups (such as those caused by the tardy Bill Jones breaking up both the Jones and the Rogers parties), it is customary if possible, to sort the incoming sequence of records. If the sequence is a generator, like the records that might be returned by a database query cursor, the sorting would most likely be done as part of an SQL "ORDER BY" clause that would be used to create the database cursor. But for records in a Python list, one would simply sort the list.

If sorting the list, then you would most likely use the same function you would pass to groupby. For Family Night, this would be:

line_of_customers.sort(key=customers_by_last_name)

for family_name, family_customers in groupby(line_of_customers,
                                             key=customers_by_last_name):
    ... etc. ...

You can imagine our restaurant having another greeter going through the line ahead of the groupby greeter, calling out, “Jones party, anyone else in the Jones party please move forward.” This would then pull Bill Jones up with his relatives, and the Joneses and the Rogerses would each all dine together. (In fact, this sorter-greeter would find the Adams and Burton families further down the line, and end up seating them ahead of the Jones, which the Joneses probably wouldn’t like – but at least they would get to eat together!)

Here is all the code from this post:

from collections import namedtuple
import itertools
import operator


Customer = namedtuple("Customer", "last_name first_name grade_level")

line_of_customers = [
    Customer("Jones", "Alice", None),
    Customer("Jones", "Bob", None),
    Customer("Jones", "Charles", 10),
    Customer("Jones", "Dave", 9),
    Customer("Rogers", "Sandy", 11),
    Customer("Rogers", "George", 12),
    Customer("Rogers", "Amelia", None),
    Customer("Jones", "Bill", None),
    Customer("Rogers", "Sam", 3),
    Customer("Rogers", "Sylvia", 5),
    Customer("Burton", "Fred", 11),
    Customer("Burton", "Claire", 9),
    Customer("Adams", "Floyd", 10),
    Customer("Adams", "Elaine", None),
    Customer("Adams", "Susan", None),
]


print("\nFamily Night seating")

def customer_by_last_name(customer) -> str:
    return customer.last_name

# or just
# customer_by_last_name = operator.attrgetter("last_name")

# uncomment out this line to see the grouping with all the groups 
# collected together
#line_of_customers.sort(key=customer_by_last_name)

for group_name, group_customers in itertools.groupby(line_of_customers, 
                                                      key=customer_by_last_name):
    group_customers = list(group_customers)
    group_size = len(group_customers)
    if group_size > 1:
        print(f"{group_name} party, the next {group_size} customers.")
    else:
        print(f"{group_name} party, 1 person.")
    for customer in group_customers:
        print(f" - {customer.first_name} {customer.last_name}")


print("\nHigh School Night seating")

def high_school_status(customer) -> bool:
    return customer.grade_level is not None and 9 <= customer.grade_level <= 12

for is_in_high_school, group_customers in itertools.groupby(line_of_customers,
                                                            key=high_school_status):
    group_name = "High School" if is_in_high_school else "Not High School"
    group_customers = list(group_customers)
    group_size = len(group_customers)
    print(f"{group_name} party, the next {group_size} customers.")
    for customer in group_customers:
        print(f" - {customer.first_name} {customer.last_name} {customer.grade_level}")

groupby, itertools, python

1 Comment

Adventures in Unicode: Parsing superscript exponents

Posted by ptmcg in Uncategorized on December 22, 2018

I’ve always wanted to add support for parsing exponentiation using superscript characters, instead of clunky infix operators like ‘^’ or ‘**’. If I want 2 raised to the tenth power, 2**10 just doesn’t look as good as 2¹⁰. Parsing and evaluating these exponents is simple enough with pyparsing, but first the digits need to be convertible to regular int values.

The regular digits 0 through 9 work nicely enough. Since they are contiguous and ordered in the ASCII sequence, and Python has support for them, we can write int(‘9’) and get the value 9. Here we look at all 10 numeric characters:

digits = "0123456789"
for i in sorted(digits):
    print(repr(i), ord(i), end=' ')
    try:
        print(int(i))
    except ValueError:
        print('-error-')

And we get a friendly predictable listing of characters, ordinal values, and converted int values:

'0' 48 0
'1' 49 1
'2' 50 2
'3' 51 3
'4' 52 4
'5' 53 5
'6' 54 6
'7' 55 7
'8' 56 8
'9' 57 9

But the superscript digits are more difficult. Using the same for loop, but with this changed line:

digits = "⁰¹²³⁴⁵⁶⁷⁸⁹"

Now we get these results:

'²' 178 -error-
'³' 179 -error-
'¹' 185 -error-
'⁰' 8304 -error-
'⁴' 8308 -error-
'⁵' 8309 -error-
'⁶' 8310 -error-
'⁷' 8311 -error-
'⁸' 8312 -error-
'⁹' 8313 -error-

What the heck?! They aren’t even in sorted order! Only the 1, 2, and 3 superscripts are even in the 8-bit 0-255 range, and even they aren’t in order.

So looking ahead to when we eventually parse and gather these exponent strings, we need to roll our own int() method to convert to usable Python ints.

I suggest we build a dict for mapping each superscript to its int value:

superscript_digits = "⁰¹²³⁴⁵⁶⁷⁸⁹"
super_int_map = {digit: value for value, digit in enumerate(superscript_digits)}

Enumerating over the superscript digits in order gives us (int, digit) tuples, which make it easy to convert to a dict using a dict comprehension.

Now we just need a function that will convert a string made up of one or more superscript characters, and give back an int value, for example converting ‘⁷⁸⁹‘ to 789. This code is a typical exercise in beginning programming classes, so should not be much of a surprise.

def super_to_int(superscript_str):
    ret = 0

    # iterate over each digit in the input, get its value (from 0-9)
    # and add it to the current running total, after multiplying
    # the running total by 10
    for chr in superscript_str:
        ret = ret * 10 + super_int_map[chr]

    return ret

With that in our back pocket, we can start writing a parser, knowing that when it comes time to convert superscript exponents to ints, our conversion function is all ready to go.

Pyparsing comes ready-made with a number of helpful classes and objects for building up parsers a piece at a time. So we can begin by writing the expression that will parse an integer subscript using a pyparsing Word class, which indicates that we want to parse a word group composed of one or more characters, in this case the characters are the superscript digits.

exponent = pp.Word(superscript_digits).setName("integer exponent")

The reason for naming this expression becomes clear when we run some simple tests on it, using the runTests() method that is provided on all pyparsing parse expressions:

exponent.runTests("""\
    ¹²³
    ¹⁰
    10
""")

Giving

¹²³
['¹²³']

¹⁰
['¹⁰']

10
^
FAIL: Expected integer exponent (at char 0), (line:1, col:1)

Without the name, we would have gotten a more cryptic-looking message like “Expected W:(⁰¹²³…) (at char 0), (line:1, col:1)”.

We actually will want to parse and evaluate strings like “2¹⁰“, so we also need an expression for a regular integer. Integer expressions are so common, pyparsing provides a standard expression in its pyparsing_common namespace class, so we can just use that.

integer = pp.pyparsing_common.integer
integer.runTests("""\
   2
   42
   4294967296
""")

Giving

2
[2]

42
[42]

4294967296
[4294967296]

There is a subtle difference from the previous tests – these values have already been converted to ints! There are no quotation marks around the parsed values, as you see above when we tested out our exponent expression. So when we want to perform the final exponentiation operation, the base number will already be in int form.

pyparsing does this using a parse-time callback, called a parse action. Each expression in the parser can have one or more parse actions attached to it, for the purposes of conversion, validation, data structuring, or whatever. We will use this same feature to convert our superscript exponents to ints.

Pyparsing is pretty flexible in letting you define the arguments passed to parse actions. The most common is to just pass the list of parsed tokens. For this particular expression the list is always just 1 element long, since the expression we are acting on parses just a single word of superscript digits.

def convert_parsed_exponent(tokens):
    return super_to_decimal(tokens[0])

exponent.addParseAction(convert_parsed_exponent)

Rerunning our previous tests now gives:

¹²³
[123]

¹⁰
[10]

10
^
FAIL: Expected integer exponent (at char 0), (line:1, col:1)

Now the first two parse correctly, but we are still stumbling over plain old “10”.

To handle our expressions like “2¹⁰“, we define another expression, this time combining the expressions we already have. We’ll allow for our parser to handle integers with or without exponents:

raised_number = integer + pp.Optional(exponent, default=1)

We simply use the ‘+’ operator to show that these two expressions should occur one after the next. We wrap our exponent expression using the pyparsing Optional class, so that pyparsing won’t complain when parsing values that have no exponent. In the event that there is no exponent, we would still like a default value to be given. In this case, a logical default exponent if none is explicitly given is 1, since any number raised to 1 is just that same number.

Testing out our combined expression, shows that we are getting pretty close:

raised_number.runTests("""\
    2¹⁰
    10³
    2³²
    10⁰
    10
    """)

Gives

2¹⁰
[2, 10]

10³
[10, 3]

2³²
[2, 32]

10⁰
[10, 0]

10
[10, 1]

The last step will be to do the computation of the actual exponentiation operation. But now that we are used to using parse actions, a second one added to the raised_number expression does the job.

def raise_to_power(t):
    return t[0]**t[1]
raised_number.addParseAction(raise_to_power)

raised_number.runTests("""\
    2¹⁰
    10³
    2³²
    10⁰
    10
    """)

Gives the desired results:

2¹⁰
[1024]

10³
[1000]

2³²
[4294967296]

10⁰
[1]

10
[10]

Note that raise_to_power() has no code in it to convert the tokens to ints. This is because the two expressions that make up a raised_number have each already converted their strings to ints, so the resulting parsed tokens are no longer just strings, but ints. Similarly, the code that performs conversion from superscript string to int has no exception handling in case of bad input. Why? Because the input has already been screened in the parser so that only valid superscript strings are sent to the parse action.

Here are some ideas for other expressions that this parser does not currently handle:

    -1³
    -1²
    (4-5)³
    (1/2)³
    1/2¹⁰
    2⁻¹⁰
    6.02×10²³

Here is the full listing of code from this article:

# -*- coding: utf-8 -*-

import pyparsing as pp

def show_numeric_characters(digits):
    for i in sorted(digits):
        print(repr(i), ord(i), end=' ')
        try:
            print(int(i))
        except ValueError:
            print('-error-')

digits = "0123456789"
superscript_digits = "⁰¹²³⁴⁵⁶⁷⁸⁹"
show_numeric_characters(digits)
show_numeric_characters(superscript_digits)

# a pyparsing expression to parse an exponent of superscript digits
exponent = pp.Word(superscript_digits).setName("integer exponent")

exponent.runTests("""\
    ¹²³
    ¹⁰
    10
""")

# pyparsing-provided expression to parse plain old integers
integer = pp.pyparsing_common.integer
# alternate form, which could handle a leading '-' sign
# integer = pp.Regex(r'-?\d+').addParseAction(lambda t: int(t[0]))
integer.runTests("""\
    2
    42
    4294967296
""")

# function to convert superscript string to int
superscript_digits = "⁰¹²³⁴⁵⁶⁷⁸⁹"
super_int = dict((digit, value) for value, digit in enumerate(superscript_digits))
def super_to_decimal(superscript_str):
    ret = 0
    for chr in superscript_str:
        ret = ret * 10 + super_int[chr]
    return ret

# parse action to convert a string of superscript digits to an int
def convert_parsed_exponent(tokens):
    return super_to_decimal(tokens[0])
exponent.addParseAction(convert_parsed_exponent)

exponent.runTests("""\
    ¹²³
    ¹⁰
    10
""")

# define an expression to parse an integer optionally raised
# to an exponent power
raised_number = integer + pp.Optional(exponent, default=1)

# add parse action to perform exponentiation
def raise_to_power(t):
    return t[0]**t[1]
raised_number.addParseAction(raise_to_power)

# take it for a spin!
raised_number.runTests("""\
    2¹⁰
    10³
    2³²
    10⁰
    10
    """)

Pyparsing is available through PyPI using pip install. The home for pyparsing has recently moved to GitHub: https://github.com/pyparsing/pyparsing

UPDATE: The unicodedata module in Python’s standard library includes the digit() method, which will do the same job as our super_int_map, converting superscripts (and subscripts too!) to their corresponding int values. We still need to do the conversion in sup_to_int that handles multiple digit superscripts, though.

Pyparsing, python

1 Comment

Gibberish Haiku Generator

Posted by ptmcg in Uncategorized on December 27, 2012

On a lark, I wrote this haiku generator one afternoon at work, with the Python becoming more and more obfuscated as I went:

from random import choice as CH
from itertools import product as P
from functools import partial as PA
def haiku(syls=dict((i,s.split()) for i,s in enumerate(
    "fair blue green sun earth sky air lake bird fish wheat tree sand life death grim "
    "man boy girl runs plays hunts far old young deep high I my his her sings sad old "
    "red black white night day sleeps broods joy eats drinks swims leaps dreams worm "
    "eye hand arm leg foot mouth blood tears years moon stars days wild dog cat time "
    "friend son takes lives dies loves hates laughs cries shouts flies flames burnt "
    "tall short grass low mourns moans chants wind blows grows rain snow light dark "
    "prays prayer soul tone hue shade fades rich leaf leaves arch oak rose song just "
    "child void|"
    "golden yellow meadow open mountain nature garden ocean tiger vision only "
    "flower zebra lion woman baby wonder joyful dances dancing laughing single "
    "morning evening sleepy drowsy awake asleep away hunger pleasure simple deadly "
    "today tonight progress prefers dances mother father daughter droplet ocean "
    "laughter sorrow running jumping rushing turning whispers cascade arrow simple "
    "longing elder willow honey water heaven above below after before river "
    "quiet embers wishes under over respite relief journey afar glistens abyss "
    "shimmers ripples shudders shivers trembles slumbers hiding gently slowly "
    "slender brooding somber ochre velvet ruby entwined spirit blossom ribbon iris "
    "willow meaning resolve omen|"
    "waterfall aurora indigo butterfly lioness whimsical yesterday enemy "
    "contemplates considers grandfather grandmother family glistening emerges "
    "fortunate absolute horizon precipice oasis shivering shimmering tenderly "
    "sleepily carefully wistfully beautiful entwining sepia unknowing "
    "feverish|"
    "caterpillar captivating fascinating philosophy iridescent "
    "intermingled understanding proximity unknowable unreachable|"
    "inescapable undeniable inevitable unfathomable".split('|'),1)),
    combs=map(PA(filter,None),P(*[range(6)]*7))):
    return '\n'.join(' '.join(CH(syls[x])
                    for x in CH(filter(lambda c:sum(c)==n, combs)))
                        for n in (5,7,5))

for i in range(10):
    print (haiku()+'\n')

Some samples:

feverish moans runs
moans unfathomable time
garden baby earth

absolute moon cries
wheat moon evening blossom song
earth takes shudders plays

life grim precipice
blue understanding abyss
fades grim aurora

broods willow grim dies
fair respite iridescent
foot shade I hiding

my stars family
iridescent heaven sand
hunts grass cascade runs

dreams turning red snow
waterfall shade shimmering
captivating stars

sand grim tonight night
moon nature shimmering prays
my flower shade blows

old white glistening
meadow only shade slender
wild his omen cat

willow hue whispers
dies grandfather son spirit
garden dark above

tears garden relief
laughs shimmering tenderly
song time mountain dies

Are these really haiku? Not in the conventional sense, and certainly not ones that would hold up to critical review. But if you select a seed dictionary of words that collectively suggest a particular mood, the generated verse can evoke that same mood.

Leave a comment

Snippet: Class that never raises AttributeError

Posted by ptmcg in Snippets on March 19, 2011

Sometimes we’d like to have a class that is tolerant of attribute accesses that weren’t anticipated at class design time, returning a default ‘None’ value for an uninitialized attribute. Using defaultdict to override the `__dict__` attribute makes this very easy:

from collections import defaultdict

class X(object):
    def __init__(self):
        self.__dict__ = defaultdict(lambda : None)
    def __getattr__(self, attr):
        return self.__dict__[attr]

Now we can initialize an object with some attributes, but access others and get back the default None:

x = X()
x.color = "RED"
x.size = 6

print x.color
print x.size
print x.material

prints:

RED
6
None

Of course, this defeats a key validation feature in Python, so you’ll need to take extra care that you specify attributes correctly.

Leave a comment

Snippet: sgn function

Posted by ptmcg in Uncategorized on March 12, 2011

From time to time, I find that I need a function to return +1, 0, or -1 representing the sign of a value: +1 or -1 for positive or negative values, or 0 for a zero value. I remember learning this as the sgn function in high school. Python’s standard lib leaves this out, but does supply cmp as a built-in, so the standard approach would probably be to define sgn using:

sgn = lambda x : cmp(x,0)

I found a nice website of Bit-Twiddling Hacks the other day, and it had a nice alternative for sgn, which in Python would be:

sgn = lambda x : (x>0) - (x<0)

The elegance of this appeals to me and I did a quick timing pass:

C:\Users\Paul>python -m timeit "[cmp(x,0) for x in (-100,0,14)]" 1000000 loops, best of 3: 0.645 usec per loop

C:\Users\Paul>python -m timeit "[lambda x:x>0 - x<0 for x in (-100,0,14)]" 1000000 loops, best of 3: 0.496 usec per loop
So by cutting out the repeated function calls to cmp, this also has the benefit of being just a tad faster.

1 Comment

Snippet: Uniquify a sequence, preserving order

Posted by ptmcg in Snippets on March 9, 2011

I came across some code today that used a set to keep track of previously seen values while iterating over a sequence, keeping just the not-seen-before items. A brute force kind of thing:

seen = set()
unique = []
for item in sequence:
    if not item in seen:
        unique.append(item)
        seen.add(item)

I remembered that I had come up with a simple form of this a while ago, using a list comprehension to do this in a single expression. I dug up the code, and wrapped it up in a nice little method. This version accepts any sequence or generator, and makes an effort to return a value of the same type as the input sequence:

def unique(seq):
    """Function to keep only the unique values supplied in a given 
       sequence, preserving original order."""
       
    # determine what type of return sequence to construct
    if isinstance(seq, (list,tuple)):
        returnType = type(seq)
    elif isinstance(seq, basestring):
        returnType = type(seq)('').join
    else:
        # - generators and their ilk should just return a list
        returnType = list

    try:
        seen = set()
        return returnType(item for item in seq if not (item in seen or seen.add(item)))
    except TypeError:
        # sequence items are not of a hashable type, can't use a set for uniqueness
        seen = []
        return returnType(item for item in seq if not (item in seen or seen.append(item)))

My first pass at this tried to compare the benefit of using a list vs. a set for seen – it turns out, both versions are useful, in case the items in the incoming sequence aren’t hashable, in which case the only option for seen is a list.

1 Comment

Things Python

Happy Repdigit Day!

How to try something ‘N’ times in Python

Pyparsing Walkthrough – Letter Ranges

Parsing Options in Python

Approaching a problem with Pyparsing

Suppress delimiters

Expand the ranges

One more thing…

The Complete Parser

Potential extensions

For More Information

Fuzzy Groupby – Unusual Restaurant Part II

Python groupby and an Unusual Restaurant

Adventures in Unicode: Parsing superscript exponents

Gibberish Haiku Generator

Snippet: Class that never raises AttributeError

Snippet: sgn function

Snippet: Uniquify a sequence, preserving order

Recent Posts

Archives

Categories

Meta