Review of Requests 1.0

Author's note: This piece was originally published in the excellent literary journal DIAGRAM, Issue 12.6. I'm republishing it here for formatting reasons.

Identification with another is addictive: some of my life's most profound, memorable experiences have come when something bridged the gap between me and another human. Because I'm a reader, this can occur across the distance of space and time. It's happened with minor Chekhov characters, and at the end of Katherine Mansfield stories. It happens again and again with Norman Rush and George Saunders. The author has pushed a character through the page and connected with me on a deep level: identification.

Identification happens with computer programming, too.

I say this as a reader, writer, and programmer: I experience identification when reading and programming, and I strive to create it when writing and programming.

Though they deal with the messiness of reality differently, several techniques common to both disciplines enable them to achieve this mental intimacy: navigating complexity; avoiding pitfalls that inhibit communication; choosing structure wisely; harnessing expressive power; and inhabiting other minds. The Requests library, a work of computer programming by Kenneth Reitz, illustrates this.


A Case Study of Node.js in Production

I'm giving a talk about my experience developing and deploying a Node.js web service in production at the next Nova-Node meetup, October 30 at 6:30 p.m. Below is the writeup. If it sounds interesting to you, come by!

SpanishDict recently deployed a new text-to-speech service powered by Node. This service can generate audio files on the fly for arbitrary Spanish and English texts with rapid response times. The presentation will walk through the design, development, testing, monitoring, and deployment process for the new application. We will cover topics like how to structure an Express app, testing and debugging, learning to think in streams and pipes, writing a Chef cookbook to deploy to AWS, and monitoring the application for high performance. The lead engineer on the project, William Bert, will also talk about his experiences transitioning from a Python background to Node and some of the key insights he had about writing in Node while developing the application.

Update: here are the slides from the talk.

(Relatively) quick and easy Gensim example code

Here's some sample code that shows the basic steps necessary to use gensim to create a corpus, train models (log entropy and latent semantic analysis), and perform semantic similarity comparisons and queries.

gensim has an excellent tutorial, and this does not replace reading and understanding it. Nonetheless, this may be helpful for those interested in doing some quick experimentation and getting their hands dirty fast. It takes you from training corpus to index and queries in about 100 lines of code, much of which is documentation.

Note that this code will not work out of the box. To train the models, you need to provide your own background corpus (a collection of documents, where a document can range from one sentence up to multiple pages of text). Choosing a good corpus is an art; generally, you want tens of thousands of documents that are representative of your problem domain. Like the gensim tutorial, this code also shows how to build a corpus from Wikipedia for experimentation, though note that doing so requires a lot of computing time. You could potentially save hours by installing accelerated BLAS on your system.
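For readers who want a feel for the pipeline before installing anything, here is a minimal pure-Python sketch of two of the steps described above: log-entropy weighting and a cosine-similarity query. It is not the sample code referred to in this post and it does not use gensim's API; the function names (`build_bow`, `global_weights`, `most_similar`, and so on) and the three toy documents are assumptions made up for illustration, and the latent semantic analysis step is left out entirely. The weighting follows the usual log-entropy definition (local weight `log(tf + 1)`, global weight `1 + sum(p * log(p)) / log(N + 1)`).

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do much more."""
    return text.lower().split()


def build_bow(documents):
    """Return one term-frequency Counter (bag of words) per document."""
    return [Counter(tokenize(doc)) for doc in documents]


def global_weights(bows):
    """Entropy-based global weight per term: 1 + sum_j p*log(p) / log(N + 1)."""
    n_docs = len(bows)
    totals = Counter()
    for bow in bows:
        totals.update(bow)
    entropy = defaultdict(float)
    for bow in bows:
        for term, tf in bow.items():
            p = tf / totals[term]  # share of this term's occurrences in this doc
            entropy[term] += p * math.log(p)
    return {t: 1.0 + entropy[t] / math.log(n_docs + 1) for t in totals}


def weigh(bow, weights):
    """Apply local weight log(tf + 1) times global weight; drop unknown terms."""
    return {t: math.log(tf + 1) * weights[t] for t, tf in bow.items() if t in weights}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, documents):
    """Rank documents against the query, best match first."""
    bows = build_bow(documents)
    weights = global_weights(bows)
    vectors = [weigh(bow, weights) for bow in bows]
    query_vec = weigh(Counter(tokenize(query)), weights)
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy corpus (hypothetical); a useful corpus would be far larger.
documents = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
]
for index, score in most_similar("cat on a mat", documents):
    print(index, round(score, 3), documents[index])
```

With a corpus this small the ranking only reflects literal word overlap. Matching documents that share meaning but not vocabulary is what the LSA step in gensim is for.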


An Introduction to gensim: "Topic Modelling for Humans"

On Tuesday, I presented at the monthly DC Python meetup. My talk was an introduction to gensim, a free Python framework for topic modelling and semantic similarity using LSA/LSI and other statistical techniques. I've been using gensim on and off for several months at work, and I really appreciate its performance, clean API design, documentation, and community. (All of this is due to its creator, Radim Rehurek, whom I interviewed recently.)

The presentation slides are available here. I also wrote some quick gensim example code that walks through creating a corpus, generating and transforming models, and using models to do semantic similarity. The code and slides are both also available on my github account.

Finally, I also developed a demo app to visualize semantic similarity queries. It's a Flask web app, with gensim generating data on the backend that is clustered by scipy and scikit-learn and visualized by d3.js as agglomerative and hierarchical clusters as well as a simple table and dendrogram. To make it all work in realtime, I used threading and hookbox. I call it Visularity, and it's available on github. You need to provide your own model and dictionary data to use it; check out my presentation and visit radimrehurek.com/gensim/ to learn how. Comments and feedback welcome!

Interview with Radim Rehurek, creator of gensim

Tomorrow at the May 2012 DC Python meetup, I'm giving a talk on gensim, a Python framework for topic modeling that I use at work and on my own for semantic similarity comparisons. (I'll post the slides and example code for the talk soon.) I've found gensim to be a useful and well-designed tool, and pretty much all credit for it goes to its creator, Radim Rehurek. Radim was kind enough to answer a few questions I sent him about gensim's history and goals, and about his background and interests.

WB: Why did you create gensim?

RR: Consulting gig for a digital library project (Czech Digital Mathematics Library, dml.cz), some 3 years ago. It started off as a few loosely connected Python scripts to support the "show similar articles" functionality. We wanted to use some of the statistical methods, like latent semantic analysis. Originally, gensim only contained wrappers around existing Fortran libraries for SVD, like Propack and Svdpack.

But there were issues with that, and it scaled badly (all documents in RAM), so I started looking for more scalable, online algorithms. Running these popular methods shouldn't be so hard, I thought!

In the end, I developed new algorithms for these methods for gensim. The theoretical part of this research later turned into a part of my PhD thesis.


ExtJS TreeStore trouble with nested nodes

At work, we're building an app to edit objects in a database--a classic CRUD application. For now, we're trying out ExtJS as the client-side UI framework. One of the use cases is selecting and editing nested objects, represented in our relational database with foreign keys. Let's call the root object a Task, which consists of nested Goals, which have Steps. Each of those is defined by a model on the backend that is more or less mimicked by an Ext.data.Model on the client-side, and each model has a proxy to a RESTful endpoint on the backend for create/retrieve/update/delete operations. We want to use an Ext.tree.TreePanel for the UI, so we hold the data in an Ext.data.TreeStore. So far so good.

We coded up our prototype, but when a user selects a Task, Ext JS throws this error: Uncaught TypeError: Cannot read property 'internalId' of undefined. Hmm. Everything seems to be working. Our models are loading the correct data. No obvious bugs. A lot of inspecting and googling and reading documentation later, I discover this thread. The key quote:

Read More

Fake bio for Steve

My good friend Steve has hosted the lowercase, the monthly reading series associated with 826DC, for three years. Steve has a charming habit of introducing his readers with made-up bios, so in his honor, I asked some lowercase regulars to write fake bios of him and share them at the third anniversary reading on April 4. The results were highly entertaining; thanks to everyone who wrote one!

Here's mine:

Steve Souryal is a group of 15 small islets and rocks in the central equatorial Atlantic Ocean. He lies in the Intertropical Convergence Zone, a region of severe storms. Steve exposes serpentinized abyssal mantle peridotite and kaersutite-bearing ultramafic mylonite on the top of the second-largest megamullion in the world (after the Parece Vela megamullion under Okinotorishima in the Pacific). He is the only location in the Atlantic Ocean where the abyssal mantle is exposed above sea level! In 1986, Steve was designated an environmentally protected area, and since 1998, the Brazilian Navy has maintained a permanently manned research facility in him. His main economic activity is tuna fishing, and we are incredibly lucky to have him with us tonight.

Apologies to Wikipedia. But somehow, it just feels right.

How to install accelerated BLAS into a Python virtualenv

Background

Some mathematically intense operations that use NumPy/SciPy (e.g., gensim's corpus processing) can run faster with accelerated Basic Linear Algebra Subprograms (BLAS) libraries installed on your system.

To see what BLAS libraries you are using, do:

    python -c 'import numpy; numpy.show_config()'

If none of them are installed, you probably want to install one or more. [ATLAS](http://math-atlas.sourceforge.net/) is always a good bet, since it's portable and self-optimizing. There are others out there targeted at particular CPU architectures.
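If you would rather run the configuration check from inside a script (for example, to confirm which libraries the NumPy in your active virtualenv was built against), something like the following sketch may help. The helper name `blas_report` and the keyword filter are assumptions for illustration; the only NumPy call used is `numpy.show_config()`, whose output format varies between NumPy versions, so treat the filtering as a rough convenience and not a reliable parser.

```python
import contextlib
import io


def blas_report():
    """Return numpy's build configuration as text, or a notice if numpy is absent."""
    try:
        import numpy
    except ImportError:
        return "numpy is not installed in this environment"
    buffer = io.StringIO()
    # show_config() prints to stdout, so capture it instead of parsing a return value.
    with contextlib.redirect_stdout(buffer):
        numpy.show_config()
    return buffer.getvalue()


# Print only the lines that mention a linear algebra library (hypothetical filter).
keywords = ("blas", "lapack", "atlas", "mkl")
for line in blas_report().splitlines():
    if any(word in line.lower() for word in keywords):
        print(line)
```

Run it with the virtualenv activated; otherwise it reports on whichever NumPy the system Python finds first.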
