Python

The Winning Font in Court Opinions

At CourtListener, we're developing a new system to convert scanned court documents to text. As part of our development we've analyzed more than 1,000 court opinions to determine what fonts courts are using.

Now that we have this information, our next step is to create training data for our OCR system so that it specializes in these fonts. For now, we've attached a spreadsheet with our findings and a script that others can use to extract font metadata from PDFs.

Unsurprisingly, the top font — drumroll please — is Times New Roman.

Font                Regular  Bold  Italic  Bold Italic  Total
Times                  1454   953     867           47   3321
Courier                 369   333     209          131   1042
Arial                   364    39      11           41    455
Symbol                  212     0       0            0    212
Helvetica                24   161       2            2    189
Century Schoolbook       58    54      52            9    173
Garamond                 44    42      41            0    127
Palatino Linotype        36    24      24            1     85
Old English              42     0       0            0     42
Lincoln                  27     0       0            0     27
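
For anyone wanting to build a similar tally, here is a minimal sketch of how raw PDF font names might be bucketed into the four style columns above. It is not the attached script: the function names and the keyword matching are assumptions made for illustration, and real font names are messier than this. The only PDF-specific detail relied on is that subsetted fonts carry a six-letter prefix followed by a plus sign.

```python
import re
from collections import Counter

# Subsetted fonts in a PDF are named like "ABCDEF+TimesNewRoman".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def classify_style(font_name):
    """Guess a style bucket from a font name (hypothetical helper)."""
    name = SUBSET_PREFIX.sub("", font_name).lower()
    bold = "bold" in name
    italic = "italic" in name or "oblique" in name
    if bold and italic:
        return "Bold Italic"
    if bold:
        return "Bold"
    if italic:
        return "Italic"
    return "Regular"


def tally_styles(font_names):
    """Count how many font names fall into each style bucket."""
    return Counter(classify_style(name) for name in font_names)


# Made-up font names, only to show the shape of the result.
example_tally = tally_styles([
    "ABCDEF+TimesNewRomanPS-BoldItalicMT",
    "TimesNewRoman,Bold",
    "Courier-Oblique",
    "Arial",
])
```

Keyword matching like this will misfile fonts whose names do not spell out their style, so the counts from such a sketch are only a rough guide.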

Using Pylint in Geany

Pylint is a tool that tells you when your Python code is broken or when it has coding problems. As a newish Python coder, using it has taught me a lot about conventions, and has helped to make my code significantly cleaner. Enabling it in my IDE, Geany, makes it so that using it is just another part of my development workflow.

Enabling Pylint in Geany is easy. Simply open Geany, and create a new build command that uses pylint -r no "%f" as the command, and (W|E|F):([0-9]+):(.*) as the error regular expression. After you've done this, using this build command instead of saving your work will run Pylint on your current file, showing you warnings, errors and fatal errors in red.
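
To see what that error regular expression picks out, here is a small sketch that applies the same pattern with Python's re module. The sample output lines are made up for illustration, and the exact format Pylint prints depends on its version and options, so treat this as a way to check the pattern rather than a description of how Geany itself matches lines.

```python
import re

# The same pattern given to Geany as the error regular expression.
ERROR_REGEX = re.compile(r"(W|E|F):([0-9]+):(.*)")


def parse_line(line):
    """Return (kind, line_number, message), or None if the line is not matched."""
    match = ERROR_REGEX.search(line)
    if match is None:
        return None
    kind, line_number, message = match.groups()
    return kind, int(line_number), message.strip()


# Hypothetical output lines: a warning is captured, a convention message is not.
sample = parse_line("W:12: Unused variable 'x'")
ignored = parse_line("C:3: Missing docstring")
```

Because the first group only allows W, E and F, convention and refactoring messages are skipped by this pattern.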

Using Revision Control on a Django Project Without Revealing Your Passwords

Just a quick post today, since this took me way too long to figure out. If you have a Django project that you want to share without sharing the private bits of settings.py, there is an easy way to do this.

I tried for a while to set up Mercurial hooks that would strip out my passwords before each commit and then put them back afterwards, thus avoiding uploading them publicly. This does not work, however, because all of the Mercurial hooks happen after snapshots of the modified files have been made. So you can edit the files using a hook, but your edits will only go into effect upon the next check-in. Clearly, this will not do.

Another solution that I tried was the Mercurial keyword extension. This could work, but ultimately it does not, because you have to remember to run it before and after each commit, something I know I'd forget sooner or later.

The solution that does work is to split up your settings.py file into multiple pieces such that there is a private file and a public file. I followed the instructions here, with the resulting code being checked in here and here. There is also a file called "20-private.conf" which is not uploaded publicly, and which contains all the private bits of code that would normally be found in settings.py. Thus, all of my settings can be found by Django, but I do not have to share my private ones.
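
A minimal sketch of the idea, assuming the settings live in a directory of numbered .conf files that are executed in sorted order. The function name, the directory layout, and the "10-public.conf" file name are assumptions for illustration (only "20-private.conf" comes from my setup), and the linked instructions may do this differently.

```python
import glob
import os
import tempfile


def load_conf_files(conf_dir, namespace):
    """Execute every *.conf file in conf_dir, in sorted order, inside namespace."""
    for path in sorted(glob.glob(os.path.join(conf_dir, "*.conf"))):
        with open(path) as conf_file:
            exec(compile(conf_file.read(), path, "exec"), namespace)
    return namespace


# Demonstration with throwaway files; a real project would keep them in the repo.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "10-public.conf"), "w") as public_file:
    public_file.write("DEBUG = False\nSECRET_KEY = 'placeholder'\n")
with open(os.path.join(demo_dir, "20-private.conf"), "w") as private_file:
    private_file.write("SECRET_KEY = 'the-real-secret'\n")

demo_settings = load_conf_files(demo_dir, {})
```

In settings.py itself, the call would pass globals() as the namespace so the loaded names become ordinary Django settings. Since later files override earlier ones, the private file only needs to be listed in the ignore file of your revision control system to stay out of the public repository.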

A Python Function to Verify Twitter Credentials

Thought I'd post this for the future generations, since I had a hard time finding a template anywhere on the web when I needed one. It's nothing revolutionary, but a useful snippet nonetheless. This is for one of my projects this semester.

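import pycurl


def verifyTwitterCredentials(username, password):
    """Return True if Twitter accepts the username/password pair."""
    c = pycurl.Curl()
    c.setopt(c.URL, 'http://twitter.com/account/verify_credentials.xml')
    c.setopt(c.USERPWD, username + ":" + password)
    # Discard the response body; only the HTTP status code matters here.
    c.setopt(c.WRITEFUNCTION, lambda data: None)
    c.perform()

    # getinfo() returns the status as an integer; 200 means the login worked.
    verified = c.getinfo(c.HTTP_CODE) == 200

    c.close()
    return verified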

Working with matplotlib and pycairo

I spent a good part of my winter break working on learning Python and using it for projects. One project was the Yelp scraper that I posted about previously, and another was a report for my old work.

The report is a statistical analysis of the development of about 2,000 children aged three and four. For those interested, I'll try to post it here once the final version is ready to go. In the past when making the report, I had been frustrated because there was no easy way to script the creation of the 30 or so charts that need to be made. Excel had been our data analysis tool, and as such, we were stuck with either using VBA to create the charts or making them by hand. Since nobody knew VBA, we always just buckled down and did the work by hand.

This time around, I discovered the matplotlib Python library, and used that to create the charts. It was a pretty rough experience all in all. While simple graphs can be created in about five lines of code, creating complicated ones took a good amount of work. For example, changing the tick markers on a graph requires that you create tick objects, and then manipulate each of them individually in a for loop. Granted, I couldn't customize them at all in Excel, but figuring out that kind of change was a pain indeed.
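
To give a flavor of the tick handling, here is a minimal sketch using calls that exist in current matplotlib releases. The data and the particular styling are made up, and this is not the code from the report.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so no display is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [10, 14, 9, 17])  # made-up data
ax.set_xticks([0, 1, 2, 3])

# Each tick label is its own Text object, so styling means looping over them.
for label in ax.get_xticklabels():
    label.set_rotation(45)
    label.set_fontsize(8)

# The tick marks themselves are Line2D objects and get the same treatment.
for line in ax.get_xticklines():
    line.set_markersize(10)
```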

The report itself required about 1,000 lines of code, and each chart required about 100-200 lines. For custom charts, I didn't find the library that useful. However, towards the end of the report there are 30 charts, all of which are identical except for the data. For these, I was able to write a for loop in about 20 minutes that created them all, whereas previously they took me a few hours to make by hand.
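
The loop for the identical charts looks roughly like the sketch below. The data set names, the values, the chart type and the output location are placeholders invented for the example; the real report's charts were more involved.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw off-screen so no display is needed
import matplotlib.pyplot as plt


def save_charts(datasets, out_dir):
    """Draw one bar chart per data set and return the saved file paths."""
    paths = []
    for name, values in datasets.items():
        fig, ax = plt.subplots()
        ax.bar(range(len(values)), values)
        ax.set_title(name)
        path = os.path.join(out_dir, name + ".png")
        fig.savefig(path)
        plt.close(fig)  # free the figure before drawing the next one
        paths.append(path)
    return paths


# Placeholder data standing in for the report's real numbers.
example_data = {"group_a": [3, 5, 2], "group_b": [4, 1, 6]}
chart_paths = save_charts(example_data, tempfile.mkdtemp())
```

All of the formatting lives in one place, so a change to the chart layout only has to be made once.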

Another library I spent some time learning was the pycairo library, which allows pixel by pixel editing of pictures. I had planned to use it to do any editing to the charts that I was unable to accomplish with the matplotlib library, but in the end, it was unnecessary. I have another project coming up though that will use the pycairo library, so look for that soon.

Yelp Scraper to Get Business Info in a Geographic Area

I spent the past couple days on one of my first Python projects - using the Yelp API to compile a list of restaurants in a defined geographic area.

It's been a good project. Because of some limitations of the API, I had to do some interesting tricks to make it work. One problem with the API is that it only allows 20 hits per query, so if you want to do a big query, you have to divide it up into tiny queries that have fewer than 20 hits each.

To accomplish that, the program starts with two points that mark opposite corners of a rectangle. If a query gets 20 hits within those two points, it divides the longer dimension of the rectangle in half and performs a query on each of the two new rectangles. For each of those, if there are 20 hits, it again divides the rectangle in two and performs two new queries, and so forth until fewer than 20 hits are found for the rectangle. Once fewer than 20 hits are found, the data is entered into a database. Once all the points have been added to the database, a comma separated file is created, and the program ends.
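
The subdivision logic can be sketched as follows. This is not the scraper's actual code: the query function is a stand-in that searches a made-up list of points instead of calling the Yelp API, results go into a list rather than a database, and the depth limit is an extra safeguard added here in case 20 or more businesses share one location.

```python
MAX_HITS = 20  # the per-query cap described above


def collect(rect, query, results, depth=0, max_depth=30):
    """Query rect, splitting it in half until a query comes back under the cap."""
    south, west, north, east = rect
    hits = query(rect)
    if len(hits) < MAX_HITS or depth >= max_depth:
        results.extend(hits)
        return results
    # Split the longer dimension of the rectangle in half.
    if (north - south) >= (east - west):
        mid = (south + north) / 2
        halves = [(south, west, mid, east), (mid, west, north, east)]
    else:
        mid = (west + east) / 2
        halves = [(south, west, north, mid), (south, mid, north, east)]
    for half in halves:
        collect(half, query, results, depth + 1, max_depth)
    return results


# Stand-in for the API: 100 made-up points on a grid, capped at 20 per query.
POINTS = [(i + 0.5, j + 0.5) for i in range(10) for j in range(10)]


def fake_query(rect):
    south, west, north, east = rect
    inside = [p for p in POINTS
              if south <= p[0] < north and west <= p[1] < east]
    return inside[:MAX_HITS]


found = collect((0, 0, 10, 10), fake_query, [])
```

The stand-in treats rectangles as half-open so that a point sitting exactly on a dividing line is counted once. A real API may return such points for both halves, in which case duplicates would need to be filtered before writing to the database.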

It was pretty incredible switching to Python for this project from my usual Java, and also using an official API for the first time. This project ended up being about 200 lines (half of which are comments). I can't imagine how long it would be with Java, since I used some rather powerful Python modules to accomplish this (namely, csv, urllib & json).

If anybody is interested in seeing/using the code, let me know. It should be useful if you need a list of restaurants or other businesses in a certain area. Worthy causes only please!

Dear God, This is a Terrible Interface

The UI for a KDE Python IDE is about the worst I have ever seen:

That's about 90 buttons.
