mlissner's blog

Exploratory Analysis of Service Recipients of the Contra Costa County Community Services Bureau

Introduction and Background

As part of my information visualization class at the UC Berkeley School of Information, I have completed an exploratory analysis of the characteristics of the children and families that have received services from the Community Services Bureau (CSB) of the Contra Costa County Employment and Human Services Department (EHSD). As indicated below, at any given time the Bureau provides subsidized childcare services to approximately 2,700 children aged zero to five who live in and around Contra Costa County. When granting services, priority is given to the neediest families and children, and thus the data should not be considered a representative sample of the county at large. With that in mind, however, extrapolations could likely be made by comparing the number of children in a geographic area against the number of children analyzed herein.

The data that is used was exported from the Bureau's COPA database, pseudo-anonymized by the Bureau, and then provided to me for this research. The COPA database is an online tool into which county workers have input vast quantities of personal information about the children and families who apply for and receive services. It has been in use since approximately 2003, and at present contains records for approximately 22,500 children. Of these children, approximately 6,600 have been given a series of developmental assessments by county employees. These assessments consist of 41 measures, which attempt to plot each child's development as he/she progresses in the program. For the explorations I have completed here, I have averaged the scores for these measures, thus approximating the child's developmental standing shortly after enrolling in the program.

As a former employee of this Bureau, I am familiar with this data; however, in the past I have not completed exploratory research as broad as what I have done here.

Using Revision Control on a Django Project Without Revealing Your Passwords

Just a quick post today, since this took me way too long to figure out. If you have a Django project that you want to share without sharing the private bits of settings.py, there is an easy way to do this.

I tried for a while to set up mercurial hooks that would strip out my passwords before each commit, and then place them back after each commit, thus avoiding uploading them publicly. This does not work, however, because all of the mercurial hooks happen after snapshots of the modified files have been made. So you can edit the files using a hook, but your edits will only go into effect upon the next check-in. Clearly, this will not do.

Another solution that I tried was the mercurial keyword extension. This could work, but ultimately it does not because you have to remember to run it before and after each commit — something I know I'd forget sooner or later.

The solution that does work is to split up your settings.py file into multiple pieces such that there is a private file and a public file. I followed the instructions here, with the resulting code being checked in here and here. There is also a file called "20-private.conf" which is not uploaded publicly, and which contains all the private bits of code that would normally be found in settings.py. Thus, all of my settings can be found by Django, but I do not have to share my private ones.
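Here is a rough sketch of how such a split can work, assuming the settings live in a directory of numbered .conf files that get executed in order. Only the name "20-private.conf" comes from my setup as described above; the function name load_settings, the directory name and the other file name are made up for illustration, and this is not necessarily how the linked instructions do it.

```python
import glob
import os


def load_settings(conf_dir, namespace):
    """Execute every *.conf file in conf_dir, in sorted order, into namespace.

    Later files override earlier ones, so a private file such as
    20-private.conf (kept out of version control) can supply the values
    that a public file such as 10-public.conf (hypothetical name) leaves blank.
    """
    for path in sorted(glob.glob(os.path.join(conf_dir, "*.conf"))):
        with open(path) as conf_file:
            # Compile with the file's path so tracebacks point at the right file.
            code = compile(conf_file.read(), path, "exec")
        exec(code, namespace)
    return namespace
```

With something like this, settings.py shrinks to a single call along the lines of load_settings(os.path.join(os.path.dirname(__file__), "settings.d"), globals()), and the private file is listed in .hgignore so it never gets committed.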

Converting PDF Files to HTML

For my final project, we are considering posting court cases on our site, and so I did some work today analyzing how best to convert the PDF files the courts give us to HTML that people can actually use. I looked briefly at Google Docs, since it has an amazing tool that converts PDF files to something resembling text, but short of spending a few days hacking the site, I couldn't figure out any easy way to leverage their technology in an automated way.

The other two tools I have looked at today are pdftotext and pdftohtml, which, not surprisingly, do what their names claim they do. Since we're going to be pulling cases from the 13 federal circuit courts, I wanted to figure out which method works best for which court, and which method will provide us with the most generalizable solution across whatever PDF a court may crank out.

The short version is that the best option seems to be:

pdftotext -htmlmeta -layout -enc 'UTF-8' yourfile.pdf

This creates an HTML file with the text of the case laid out as well as possible, some basic HTML metadata applied, and the UTF-8 encoding applied.
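Since we'll be converting a lot of files, the command above will need to run in a loop. Here is a minimal sketch of that in Python, assuming pdftotext is installed and on the PATH. The function names are mine, and the flags are exactly the ones shown above.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the pdftotext invocation from this post for a single PDF."""
    return ["pdftotext", "-htmlmeta", "-layout", "-enc", "UTF-8", pdf_path]


def convert_all(pdf_paths):
    """Run pdftotext on each PDF and return the paths that failed to convert."""
    failed = []
    for pdf_path in pdf_paths:
        # Passing a list (not a string) avoids shell quoting problems
        # with odd file names.
        result = subprocess.run(pdftotext_command(pdf_path))
        if result.returncode != 0:
            failed.append(pdf_path)
    return failed
```

Collecting the failures rather than stopping at the first one seems like the right trade-off when a court cranks out the occasional broken PDF.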

Before coming to this conclusion though, I looked at two settings that pdftohtml has. With the -c argument, it can generate a 'complex' HTML document that closely resembles the original. Without the -c argument, it will create a simpler document. Although the complex documents are rather impressive in appearance, they're abysmal when it comes to the quality of the HTML code that is generated. For an example, look at the source code for this file. If, on the other hand, the -c argument is not used, and the simple documents are generated, the appearance of the final product is worse than the simple text documents that are created by pdftotext. Check out this one for example.

For thoroughness, here is a table containing the results from this test.

Court           | pdftotext | pdftohtml complex | pdftohtml simple | Original PDF
1st             | The first circuit publishes in HTML format by default
2nd             | link      | link              | link             | link
3rd             | link      | link              | link             | link
4th             | link      | link              | link             | link
5th             | link      | link              | link             | link
6th             | link      | link              | link             | link
7th             | link      | link              | link             | link
8th             | link      | link              | link             | link
9th             | link      | link              | link             | link
10th            | link      | link              | link             | link
11th            | link      | link              | link             | link
DC Circuit      | link      | link              | link             | link
Federal Circuit | link      | link              | link             | link

A caveat regarding pdftotext: This library is developed by a company called Glyph & Cog. Although the code is open source, I couldn't for the life of me figure out how to file a bug against it. This doesn't particularly bode well for using something as a dependency. On the flip side, Glyph & Cog is happy to provide support for the product.

With Howard Zinn's Death, We All Suffer a Little

Howard Zinn was one of the greats. He may not have freed the slaves or created the nation, but it is safe to say that his every action and his every belief furthered the dream and the ideals of the American state. After eight years of Bush policies dragging down the nation, and a year of Obama sounding increasingly like an echo of Bush, it is truly tragic that we are losing this thinker.

I posted this quote from Zinn's Passionate Declarations some time ago, but now more than ever it seems relevant:

What sorts of values and ideals are encouraged in the young people of the coming generation by the enormous emphasis on the Founding Fathers and the presidents? It seems to me that the result is the creation of dependency on powerful political figures to solve our problems.

...

Consider how much attention is given in historical writing to military affairs—to wars and battles—and how many of our heroes are military heroes. And consider also how little attention is given to antiwar movements and to those who struggled against the idiocy of war.

...

As a result of omitting, or downplaying, the importance of social movements of the people in our history...a fundamental principle of democracy is undermined: the principle that it is the citizenry, rather than the government, that is the ultimate source of power and the locomotive that pulls the train of government in the direction of equality and justice. Such histories create a passive and subordinate citizenry.

If you haven't read Passionate Declarations, you should.

New Server!

When I started this blog more than two years ago, it was an experiment in blogging and an experiment in running my own server. At this point, it's safe to say that the results of the experiment are in. While running my own server out of my attic has been an enlightening experience, teaching me about everything from DNS to PHP, ultimately, I have come to the conclusion that if I want a reliable, powerful and secure server, running it myself is not the way to do it.

In that light, today I've transferred this website over to the people at Slicehost. This will give me (almost) all the freedom of running the server myself, and should relieve some of the headaches I've had for the past couple of years when it came to power outages, Comcast and AT&T outages, and the like.

Moving the site over has been a bit complicated, so if you see anything weird, let me know via the contact link, which should work as of now.

How to Protect Your Open Source Code from Theft and a Mercurial Hook to Help

Updated, 2010-01-24: Some edits regarding the Affero license (thanks to Brian at http://cyberlawcases.com for the corrections).

I've finally begun doing some of the actual coding for my final project so the time has come to set up a mercurial repository to hold the code.

Once we complete our project, we will have built a free product that competes with some of the core functionality of both LexisNexis and Westlaw, so something we wanted to do was make sure they couldn't steal our code, enhance their product and thus moot ours.

To achieve this, we're using the GNU Affero General Public License v3, which allows people to take our code for free, but requires that they publicly share any modifications that they make to the code. The normal GNU General Public License allows the code to be used at no cost, but only requires that changes to the code be shared with the public if one distributes the changed version to the public. With a server-based project, like ours, one could operate modified versions of the code without ever having a need to distribute any of the software to the public. This loophole is closed by the Affero license.

In order to license our work, we must be its copyright holder. This is easy enough, since we get copyright instantly in the U.S., but, as has been demonstrated in Jacobsen v. Katzer, in order to seek remedies for copyright violations, we would have to register everything we made with the copyright office. This costs $35 per registration, and with open source software, it's not clear whether each and every version needs to be registered or just major releases, or what.

Since this is too onerous to be practical, an additional approach to protecting our works is useful, and in the DMCA (17 U.S.C. § 506(d)), remedies are provided for the "fraudulent removal of copyright notice." Although these do not (in any way) match the protections provided by normal copyright registration, they are a useful place to begin. Thus, if we place a copyright notice into each file of our code, those using our code must either risk violating the DMCA by removing these notices, or leave our copyright information intact. (Placing such notices in each file is also the recommendation of the Free Software Foundation.)

To place our information into each and every file of code that we upload publicly, I wrote a short mercurial hook that adds copyright and licensing information to the top of every file that is modified or added to the repository. To use the script, simply make it executable, place it in the .hg directory of your project, and add the following lines to .hg/hgrc:

[hooks]
pretxncommit = .hg/checklicense.py

A couple of things I should note about this script are that it currently only checks Java and Python files, and that it requires files called java_license.txt and python_license.txt to be in the root of your repository. It should be fairly easy to modify to fit your own needs, though.
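If you'd rather roll your own, the core of a hook like this is small. What follows is a sketch of the prepend-if-missing logic, not the actual contents of checklicense.py. The two license file names are the ones mentioned above, while the function name ensure_license and the extension mapping are assumptions for illustration.

```python
import os

# File extension -> license file expected in the repository root.
# The two file names come from the post; the mapping itself is illustrative.
LICENSE_FILES = {".py": "python_license.txt", ".java": "java_license.txt"}


def ensure_license(source_path, repo_root):
    """Prepend the matching license header to source_path if it is missing.

    Returns True if the file was changed, False otherwise.
    """
    extension = os.path.splitext(source_path)[1]
    license_name = LICENSE_FILES.get(extension)
    if license_name is None:
        return False  # not a file type we handle
    with open(os.path.join(repo_root, license_name)) as license_file:
        header = license_file.read()
    with open(source_path) as source_file:
        body = source_file.read()
    if body.startswith(header):
        return False  # header already present, leave the file alone
    with open(source_path, "w") as source_file:
        source_file.write(header + body)
    return True
```

A real hook would still need to ask mercurial which files were modified or added and call something like this on each one. It would probably also want to keep a Python file's shebang line first, which this sketch does not do.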

2010 Donations

Every year, a handful of organizations reach out to ask me for money. I'm never sure how much I've given them, when the last time was, etc., so I'm always convinced some of them are tricking me into donating more than I would otherwise.

Well, this shall happen no more. I'm starting a new scheme this year in which on January 1st of each year, I donate to all the organizations that make the cut.

Here are this year's results, including the amount donated and a link to each organization's donation page. Please let me know if I'm making any grave mistakes:

Organization                          | Amount | Link to donate
Pacific Crest Trail Association       | $25    | Donate
Secular Coalition                     | $15    | Donate
Pitzer College (my alma mater)        | $30    | Donate
Zotero                                | $25    | Donate
Electronic Frontier Foundation        | $40    | Donate
Catalog Choice                        | $10    | Donate
Food and Water Watch                  | $10    | Donate
Creative Commons                      | $20    | Donate
International AIDS Vaccine Initiative | $10    | Donate
gnome-do                              | $15    | Summon > Donate > Enter

Cheap Metal Bike Stand

It's my winter break right now, so I'm taking advantage of it by doing some of the things that have been on my list for far too long. One of those things was to build a repair stand for the work I do on the bikes in my life. For about $50, you can build this stand, which works remarkably well.

Click the pic to see a (very) short gallery and high res versions.

Essentially, what you need to make this beauty is five pipes, a 90 degree elbow, a 45 degree elbow, and a welder. Since I don't weld, I went to a local muffler shop and they were happy to do it (for free!). The one big lesson I'll share is that you can't (and shouldn't) make this from galvanized metal, since welding that is dangerous. The stuff you want is called "black pipe."

Script to Rid Thyself of Autocomplete = Off in Firefox

I took some time today and wrote up a script that can be run to eliminate autocomplete=off in Firefox. It basically does the same thing as is described here, but it automates it.

The script can be run with one of five arguments:

  • You can choose to use find (--find) or locate (--locate) to find the files that need to be changed on your system;
  • You can dictate the location of the file if you want to modify a specific one or know exactly where it's located (--dictate);
  • You can choose to use the Ubuntu default location (--default); or
  • You can print the help information (--help)

Once the program is run, it will make a backup and then modify the original version of each file. Once that's complete, all you have to do is restart Firefox.

It has been pointed out to me by some security folks that removing the autocomplete=off functionality from the browser might not be the best thing, since it will allow you to save your passwords in the browser even on sites that ask the browser not to. There's some truth to that: Anything that's on your computer can be hacked. So, if you're going to use this script, use it wisely.

Here's the code. I've attached it to this message as well. Any bugs or comments are greatly appreciated.

#!/bin/bash
# a simple script to destroy autocomplete in linux installations. 
 
 
##############
# We begin with our functions, it's not efficient, but it works
##############
 
# a function to print the help message.
function printHelp {
cat <<EOF
NAME
    autocompleteDestroyer.sh
 
SYNOPSIS
    autocompleteDestroyer.sh [ --find | --default | --help | --locate | --dictate ] 
 
OPTIONS
    This program will find the nsLoginManager.js file on your computer, and will patch it so that autocomplete=off is ignored by your installation of Firefox. Since this program will be altering your installation of Firefox, it must be run as root. 
 
    --help     Print this help file
 
    --default  Attempt to use the default location of the files (/usr/lib/xulrunner*/components/nsLoginManager.js)
 
    --locate   Use the locate database, if installed, to find the files. This will only find the files that were added before the last time the locate database was updated (which is typically once a day). It is faster than the --find option, but might not find all versions.
 
    --find     Use the find command to locate the nsLoginManager.js files. This will search in /usr/lib by default. Edit the script if you would like to change this. This is the slowest, but most thorough option.
 
    --dictate  Allows input of a known location.
 
EXIT STATUS
    autocompleteDestroyer.sh exits with a status of 0 if it encounters no problems. An exit status of 1 means incorrect usage. An exit status of 2 indicates it was unable to find your files. An exit status of 3 indicates the user terminated the program. An exit status of 4 means it encountered problems editing your file.
 
BUGS
    If any bugs are encountered, please see michaeljaylissner.com/contact
 
AUTHOR AND COPYRIGHT
    This script was authored by Michael Lissner and is released under GNU GPLv3.
 
EOF
}
 
# takes an argument, and creates an array containing the files to be modified.
function identifyEvilFiles {
    if [ "$1" == "find" ]
    then
        files=$(find /usr/lib -name nsLoginManager.js 2> /dev/null)
        if [[ ! $files ]]
        then
            # Test if files has been set.
            echo "autocompleteDestroyer.sh: No files found. Try loosening the find parameter in the script, per the help file."
            exit 2
        fi
    elif [ "$1" == "default" ]
    then
        # We assume the default location of nsLoginManager.js
        files=$(ls /usr/lib/xulrunner*/components/nsLoginManager.js 2> /dev/null)
        if [[ ! $files ]]
        then
            # We didn't have any hits. Exit.
            echo "autocompleteDestroyer.sh: We didn't find anything at the default locations. Perhaps try the --locate or --find arguments."
            exit 2
        fi
    elif [ "$1" == "locate" ]
    then
        # We run the locate command, see if we have any hits.
        files=$(locate -b '\nsLoginManager.js' 2> /dev/null)
        if [[ ! $files ]]
        then
            # No hits. Exit.
            echo "autocompleteDestroyer.sh: We didn't find anything using the locate command. Perhaps try the --find argument."
            exit 2
        fi
    elif [ "$1" == "dictate" ]
    then
        # "Why don't you just tell me what movie you'd like to see?" --Kramer.
        read -p "Where is the file nsLoginManager.js located on your machine: " files
        # Quoted so that empty input or a path with spaces fails the test instead of slipping through.
        if [ -f "$files" ]
        then
            # Good. The file exists. We press on.
            echo "Thank you. That file exists, and we will modify it."
        else
            echo "autocompleteDestroyer.sh: That file doesn't seem to exist. Please try again."
            exit 2
        fi
     fi
}
 
 
 
function modifyFiles {
    echo  "The following files will be modified: 
$files "
    echo 
    read -p "Shall we proceed (y/n): " proceed
 
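    # Quoted so that pressing Enter with no input is treated as "no" rather than a syntax error.
    if [ "$proceed" == "y" ] || [ "$proceed" == "Y" ]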
    then
        # Here we go!
        while read -r line
        do
            echo "Now processing $line"
            # Find the function in the file, label each of its lines with FILLERWORD, then replace the first line and delete the rest. A messy approach, but functional.
            sed -i.bak '/[[:space:]]*_isAutocompleteDisabled[[:space:]]*:[[:space:]]*function.*{[[:space:]]*$/,/^[[:space:]]*},[[:space:]]*$/s/^/FILLERWORD/' "$line"
            sed -r -i 's/FILLERWORD.*_isAutocomplete.*/    _isAutocompleteDisabled :  function (element) { return false; },/' "$line"
            sed -i '/FILLERWORD/d' "$line"
 
            # test if it worked
            grep -i -q 'isautocompletedisabled.*return false' "$line"
            if [ $? != 0 ]
            then
                # something failed...probably. Tell the user
                echo "Unable to successfully edit the file. Exiting"
                exit 4
            fi
        done <<< "$files"
        echo "All the files have been processed properly. Please restart Firefox, and thanks for using this script."
        exit 0
    else
        # It appears they'd like to abort. Let's exit.
        echo "OK. You know what to do if you change your mind."
        exit 3
    fi
 
}
 
 
#initiation sequence
if [ $# -eq 0 -o $# -gt 1 ]
then 
    # We need to give them help using the program. 
    echo "autocompleteDestroyer.sh:  Invalid number of arguments."
    echo "Usage: autocompleteDestroyer.sh [ --help | --default | --locate | --find | --dictate ] "
    exit 1
elif [[ $EUID -ne 0 ]]; 
then
    echo "autocompleteDestroyer.sh: This script must be run as root" 1>&2
    exit 1
else
    case $1 in
        --help) printHelp;;
        --find) identifyEvilFiles find; modifyFiles;;
        --default) identifyEvilFiles default; modifyFiles;;
        --locate) identifyEvilFiles locate; modifyFiles;;
        --dictate)identifyEvilFiles dictate; modifyFiles;;
        *) echo "autocompleteDestroyer.sh: Invalid argument. Try the --help argument."
           exit 1;
    esac
fi
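
For anyone who finds the three sed calls hard to follow, here is a rough Python sketch of the same idea: find the line that opens the _isAutocompleteDisabled function, swap it for a one-liner that always returns false, and drop every line up to and including the closing "},". This is not part of the script. The function name patch_source is made up, and the regular expressions are my translation of the sed patterns above.

```python
import re

# The one-line replacement, identical to the one the sed command writes.
REPLACEMENT = "    _isAutocompleteDisabled :  function (element) { return false; },"

# Line that opens the function, and line that closes it.
START = re.compile(r"\s*_isAutocompleteDisabled\s*:\s*function.*\{\s*$")
END = re.compile(r"^\s*\},\s*$")


def patch_source(source):
    """Return source with the body of _isAutocompleteDisabled replaced."""
    patched = []
    inside = False
    for line in source.splitlines():
        if not inside and START.search(line):
            patched.append(REPLACEMENT)  # replace the opening line
            inside = True
        elif inside:
            # Drop the old body; the closing "}," line ends the range.
            if END.match(line):
                inside = False
        else:
            patched.append(line)
    return "\n".join(patched) + "\n"
```

Doing it in one pass avoids the FILLERWORD marker entirely, at the cost of no longer being a quick shell one-off.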

Some Legal Gems

As anybody who has been following my Twitter stream today knows, I have been spending the entire day hunting, pecking, avoiding, and kind of working on my final exam for my Intellectual Property Law class.

While working on the exam, I have found some real legal gems today. These kinds of things don't often turn up unless you know to look for them, and since I'm betting nobody ever looks, I figured I'd share them here.

America, sometimes you have the strangest laws.
