telemachus.me

Python, Meet Descartes. Descartes, Meet Python.

2016-06-08

This summer I’m working on a commentary on Descartes’s Meditationes de prima philosophia, usually known in English as his Meditations on First Philosophy. (Though actually a better translation is the less literal Metaphysical Meditations, which is how it’s usually translated into French.) In addition to providing a text and commentary, I plan to produce a vocabulary. This is a time-consuming and error-prone job, so naturally I want to offload at least some of the grunt work to a computer. (See laziness as virtue for programmers.)

As a start, I took a look at The Classical Language Toolkit. CLTK provides a set of natural language processing utilities for ancient texts. Although the Latin tools in CLTK are aimed primarily at classical material, the neo-Latin that Descartes wrote is largely classical in style and vocabulary. The following short Python script tokenizes and lemmatizes Descartes’s first meditation. This gets us well on our way, since the script reduces its input text to a list of unique dictionary entries. That is, if Descartes wrote scīre and sciēns (an infinitive meaning to know and a participle meaning knowing), the script would output only sciō, the shared dictionary entry of those two forms.

from cltk.stem.lemma import LemmaReplacer
from cltk.stem.latin.j_v import JVReplacer
from cltk.tokenize.word import WordTokenizer

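# Read the text of the first meditation; the with block closes the file for us
with open("./meditatio-prima.txt", "r", encoding="utf-8") as f:
    meditatio_prima = f.read()

# Normalize spelling: j becomes i, v becomes u
jv = JVReplacer()
meditatio_prima = jv.replace(meditatio_prima)

tokenizer = WordTokenizer("latin")
lemmatizer = LemmaReplacer("latin")

# Tokenize, lemmatize, then keep each lemma once, in alphabetical order
words = lemmatizer.lemmatize(tokenizer.tokenize(meditatio_prima))
words = sorted(set(words))
print("\n".join(words))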
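To make the reduction step concrete without installing CLTK, here is a minimal sketch that imitates it on one made-up sentence. Everything in it is a stand-in of my own: `TOY_LEMMAS` is a four-entry table, not CLTK's lemmatizer; `normalize_jv` only approximates what a j/v replacer does; and the tokenizer is a plain regular expression that ignores macrons and enclitics.

```python
import re

# Toy stand-in for a real lemmatizer: maps inflected forms to a headword.
# These four entries are illustrative only.
TOY_LEMMAS = {
    "scire": "scio",
    "sciens": "scio",
    "dubitare": "dubito",
    "dubitans": "dubito",
}

def normalize_jv(text):
    """Replace j with i and v with u (a rough imitation of j/v normalization)."""
    table = str.maketrans({"j": "i", "v": "u", "J": "I", "V": "U"})
    return text.translate(table)

def toy_vocabulary(text):
    """Tokenize, look up each token in the toy table, return sorted unique lemmas."""
    tokens = re.findall(r"[a-z]+", normalize_jv(text).lower())
    # Unknown tokens fall through unchanged, so nothing is silently dropped.
    lemmas = {TOY_LEMMAS.get(token, token) for token in tokens}
    return sorted(lemmas)

sample = "Sciens dubitare, scire volo."
print("\n".join(toy_vocabulary(sample)))
```

Both sciens and scire collapse into the single entry scio, which is exactly the behavior we want from the real lemmatizer, just on a much larger scale.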

It would be even better if we could take the dictionary entries from this first script and feed them to a program that would give us definitions. So that’s what I worked on next. The Perseus Digital Library has open-sourced the XML version of A Latin Dictionary edited by Lewis and Short. I’d prefer the modern Oxford Latin Dictionary, but Lewis and Short is still an outstanding Latin dictionary. Once again, a relatively short Python script will parse the XML and match the words from Descartes’s text with their Lewis and Short dictionary entries.

from sys import stderr
from lxml import etree as ET

# Helper functions
def warn(*args, **kwargs):
    '''Print message to stderr'''
    print(*args, file=stderr, **kwargs)

def inner_text(node):
    '''Return all inner text of an XML node'''
    return ''.join(node.itertext())

xml = ET.parse('no-entities-ls.xml')
entries = xml.xpath('//entryFree')
lewis_short = {}
for item in entries:
    lewis_short[item.attrib['key']] = inner_text(item)

# Load vocabulary words that we're searching for
with open('meditatio-words.txt', 'r', encoding='utf-8') as words_file:
    wanted_words = words_file.read().splitlines()

# Work through the words we want,
# trying to match their meanings
for wanted in wanted_words:
    if wanted in lewis_short:
        print(lewis_short[wanted])
    else:
        warn('%s has no entry.' % (wanted))

This is worth walking through. After defining two small helper functions, the next block of code parses the Lewis and Short XML file and stores the entries in a Python dictionary. (Since dictionary is now very ambiguous, I’m going to call the data structure a hash from here on in.) The keys of the hash are lookup words, and the values are full entries from Lewis and Short. Since lookup words are just what our first script gave us, we load those (saved one per line in meditatio-words.txt) into a list and iterate over them. For each word in Descartes, we test whether it’s in the hash. If it is, we print out the entry from Lewis and Short. If it’s not in the hash, we tell the user that no matching entry was found. (The “no entry” messages are printed to stderr instead of stdout, so the user can send the two output streams to different places if they want. E.g., python3 vocabulary-builder.py 1>descartes-vocabulary.txt 2>missing-words.txt)
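The walkthrough above can be tried out on a tiny scale. The following is a sketch, not the script itself: it swaps lxml for the standard library's `xml.etree.ElementTree` so that it runs with no extra installs, and `SAMPLE_XML` is a two-entry fragment I made up. The real Lewis and Short entries are far longer and more deeply nested. The function names `build_hash` and `look_up` are also my own.

```python
import sys
import xml.etree.ElementTree as ET

# A made-up, two-entry stand-in for the Lewis and Short XML file.
SAMPLE_XML = """
<body>
  <entryFree key="dubito"><orth>dubito</orth>, <sense>to doubt</sense></entryFree>
  <entryFree key="scio"><orth>scio</orth>, <sense>to know</sense></entryFree>
</body>
"""

def build_hash(xml_text):
    """Map each entryFree element's key attribute to its full inner text."""
    root = ET.fromstring(xml_text.strip())
    return {
        entry.attrib["key"]: "".join(entry.itertext())
        for entry in root.iter("entryFree")
    }

def look_up(wanted_words, entries):
    """Split the wanted words into matched entries and unmatched words."""
    found, missing = [], []
    for word in wanted_words:
        if word in entries:
            found.append(entries[word])
        else:
            missing.append(word)
    return found, missing

entries = build_hash(SAMPLE_XML)
found, missing = look_up(["dubito", "scio", "uolo"], entries)

# Definitions go to stdout, complaints go to stderr, as in the full script.
for entry in found:
    print(entry)
for word in missing:
    print(f"{word} has no entry.", file=sys.stderr)
```

Returning the two lists instead of printing inside the loop makes the lookup easy to check by hand, and the caller still decides which stream each list goes to.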

This is a terrific amount of progress for two short Python scripts. In addition, I extracted the textual data from the XML file and saved that in plain text format. This allows for lots of other potential uses of Lewis and Short without having to deal with the XML. (The plain text is available on GitHub, under a CC license that allows for further changes.)
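For anyone curious what such an extraction might look like, here is a minimal sketch. It is an assumption-laden illustration, not the code I used: the tab-separated, one-entry-per-line layout is chosen here for simplicity and is not necessarily the layout of the file on GitHub, `export_plain_text` is a hypothetical name, and the sample XML is again made up. It uses the standard library's `xml.etree.ElementTree` rather than lxml.

```python
import io
import xml.etree.ElementTree as ET

def export_plain_text(xml_text, out):
    """Write one 'key<TAB>entry' line per entryFree element to a file-like object.

    Returns the number of entries written.
    """
    root = ET.fromstring(xml_text.strip())
    count = 0
    for entry in root.iter("entryFree"):
        # Collapse newlines and runs of spaces so each entry fits on one line.
        text = " ".join("".join(entry.itertext()).split())
        out.write(f"{entry.attrib['key']}\t{text}\n")
        count += 1
    return count

# A made-up entry with a line break inside it, to show the whitespace cleanup.
sample = """
<body>
  <entryFree key="cogito"><orth>cogito</orth>,
    <sense>to think</sense></entryFree>
</body>
"""

buffer = io.StringIO()
written = export_plain_text(sample, buffer)
print(buffer.getvalue(), end="")
```

Because the function writes to any file-like object, the same code works with an in-memory buffer for testing or with a real file opened for writing.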

What’s next?

There’s a lot of room for improvement. Here’s a quick list of things I’m working on or would like to see.

As usual, trying to make a computer do your work for you is fun, helpful, and exhausting. I got to learn some Python, and the tokenizer and lemmatizer results alone will save me hours compared to doing it by hand. At the same time, I’ve only scratched the surface and there’s far more work to do. But I’m excited to continue learning Python and to improve the glossary-maker scripts that I’ve started here.