How to help end Boy Scouts of America's ban on gays

On May 3rd, the Boy Scouts are considering lifting their ban on gays, and are putting a vote to the local and national councils. This means that it's easy to influence the vote by calling in and expressing your opinion. It's simple to do so, and more voices could change the direction of the Boy Scouts of America, allowing all boys to be included and accepted.

Here's what to do:

  1. Find your local council
  2. Give them a call or send them an email telling them your opinion.
  3. Contact the national council via email: feedback@scouting.org.

Most important is to contact your local council. We need their phones to be ringing off the hook with people expressing their opinions. It truly takes no more than two minutes. Here's the San Diego council: (619) 298-6121, and the East Bay council: (925) 674-6100.

If you prefer a form letter, you can just do this one through the Human Rights Campaign (34,000 people already have).

Here's a simple transcript for you to follow:

I'm [calling/writing] to express my desire that the Boy Scouts immediately lift their ban on gays. This ban is discriminatory, outdated and pointless. Scouting teaches many great lessons to thousands of adolescents across the U.S. One lesson that Scouting should not teach is that homosexuality is somehow wrong or grounds for discrimination.

On May 3rd, I hope that you will help the BSA finally make the right decision so it can continue to lead boys into being mature and accepting men.

Send this to the email above, and call your local council. This could make a difference.

So you wanna buy a bike

Another of my friends is asking me questions about buying a bike. I love that they do this, but since it's become a trend, I figure I should throw my thoughts together here as a reference.

Buying a bike is actually pretty simple:

  1. You want a used road bike.
  2. You want gears; more is better.
  3. Lighter is better.
  4. Size is important.
  5. Exotic isn't for you (yet).

That's pretty much all there is to it, at least on the macro level. Let's dive into each of these a tad, shall we?

First, you want a used road bike. The reason for this is pretty logical: As a commuter, you'll be riding mostly on...roads! You could get a cruiser (but that defies rule #2), or a mountain bike (defies rule #3) or a recumbent (see rule #5). But ultimately, road bikes are the best for what you want to do: travel quickly and safely to and from work and around town with minimal fuss. New ones cost a fortune, so get one used.

I wish I didn't have to mention rule number two, but for all the hipsters out there, you're wrong about fixies. If you live in San Francisco, a city with specific maps for the hills, you should use the things that were invented to get you up hills: Gears! Even if you live in a flat-ish city, you should get gears because they give you flexibility. Sure, your bike is a commuter today, but tomorrow your friend might want to go on a bike ride with you. Or maybe you move to a new city. Who knows? Get gears. Don't be trendy.

Enough said. Rule number three is that lighter is better. If you look at new road bikes, you'll quickly learn that you can spend an incredible amount on bikes. And naturally, the lighter they are, the more expensive they are. So how light is right? Well, this one is tough and somewhat subjective, so I say, find one that's as light as you can buy for your dollar. Off the cuff, I'd wager that around $400 is the point of diminishing returns for most people.

Size is important. If it doesn't fit, it's useless. Go to a bike shop and get sized before you do much shopping. It'll help you winnow the stuff you're looking at anyway. This may as well be the first step of your search.

Rule five explains itself. When you know more about bikes, branch out if you care. You probably won't, so save yourself the effort of looking at exotic stuff. It's exotic for a reason.

Bonus questions for the avid reader

What brand should I buy? Doesn't much matter, surprisingly. There are better and worse brands, but if you're buying a used road bike, and follow rule number 3, your goals will be accomplished.

What material should my bike be? Probably steel. Aluminum is good too, but probably out of your price range. Steel's a very reputable material though. If you can find Reynolds steel, all the better.

How do I know the bike I'm buying isn't stolen? Good question! This one's hard. You can look for the serial number or try to only deal with people that seem legit. There is a national bike registry (which you should use!), but otherwise there's not a whole lot you can do...yet.

Comments? Thoughts? Email me. You're probably my friend already if you're reading this...

Enabling Two-Factor Authentication

This post is as much Public Service Announcement as anything else. I didn't realize that two-factor authentication had finally taken off. It's practically vital for your email account (you're asking for trouble without it), but in the past year or so, a bunch of other services have begun offering it.

Today I went on a little security binge, and found that I could turn on two-factor authentication at:

  • Google/Gmail
  • Yahoo
  • Dropbox
  • Charles Schwab (they send you a fob for free)
  • Facebook
  • Paypal
  • Amazon Web Services

One note about Charles Schwab is that getting their fob is great, but it's hardly all you should do to secure your account. You should also set up what they call a "verbal password" that you have to provide whenever you call in. Without it, it's pretty easy to get into an account via their surprisingly weak phone security.

Anyway, this is a pretty good list so far. The companies are using a handful of different techniques for doing this, but they all seem pretty solid in the end. Google's, naturally, seems to be one of the most robust, but I'm impressed there's so much offered.

Go set these up!

2013 Donations

Long-time friends will probably realize that with the coming of the new year comes a revisit to my annual donations.

This year's donations are larger than any previous year, but largely fall along similar trends as in the past. The larger donations this year (about $1,000-worth) go towards non-profit organizations. The choices this year were hard. After consulting with a few friends, I decided to donate to two new categories: Environmental and Anti-Gun.

Finding a good environmental organization to give your money is HARD. After a few hours of research, I had looked at many organizations that were doing good work. But a lot of those organizations were still trying to prove the point that climate change is an issue, or were focused on small-scale issues. These are both noble goals, but I think what we need now are big solutions on an international level. I'm no expert in this topic, by far, but I'm fairly convinced that individual decision making isn't going to solve the problem fast enough. It's great if we all learn to recycle and to consider environmental impact in our daily lives. That, I don't disagree with. But I don't think it's enough. I think we need to start forcing governments and organizations to be cleaner. I'm convinced that so long as the economic incentives are in place that have led to the current behaviors, the market will follow those incentives. I'm hopeful that my donation to the Center for Climate and Energy Solutions will help bring changes to these incentives.

Finding an anti-gun organization is easier, especially given the current state of affairs after Sandy Hook Elementary School. While I'm not so sure that anti-gun legislation is going to solve any truly big problems, I hope that donating my money here will help strike while the iron is hot. I simply can't believe that the 2nd Amendment pro-gun lobby is as successful as it is, and I am hopeful that we'll be able to change the dialog around guns over the next few years. Gun ownership is trending down in the U.S., and I hope that we can accelerate that trend, bringing an end to the needless gun deaths and violence we currently live with.

The other big donations in this year's list go mostly towards organizations that I've donated to in the past. Fair Vote and Rootstrikers are organizations that work to fix the current political system. Most Americans (about 70%, I believe) agree that the current Federal legislation system is corrupt, and these organizations are working to fix that. I'm pessimistic that we'll be able to deal with the small or large issues facing the country until these organizations find success, so these organizations continue to get the plurality of my donation ($400 between them). I think the ridiculous fiscal cliff "negotiations" are testament to how bad things have gotten. Our political system is paralyzed.

Other organizations that did well this year include a handful of open-source foundations that I rely on, but which otherwise give away their work for free. My livelihood and these very donations rely on these bits of infrastructure we take for granted, so I figure I should give them some money to keep 'em going.

Here's the nitty gritty breakdown of my donations this year (as well as last):

As always, I welcome input on these decisions, and suggestions for the years ahead. Those that made suggestions for this year, I truly appreciate your help.

Year in Review: Travel Edition

I did a lot of traveling this year; more than anybody should ever really do. Since I'm already forgetting all the places I went to, I figured I'd write it all down.

Here's the tally:

Trip                                        Flights  Distance (miles)
London, Germany, Turkey                     9        15500
Germany for work                            2        11362
Montreal Bike Trip and Visit to Montreal    2        5221
L.A., San Diego, Colorado backpacking       5        2586
Olympic Peninsula Backpacking               2        1356
Paris, Brussels                             2        11092
Law via the Internet Conference at Ithaca   4        5454
TOTAL                                       26       52571

I'm pretty sure some things are left out from the beginning of the year, and I'm still trying to figure out how I ended up doing so much goddamn traveling. For comparison's sake, I must note that the Earth is 24,901 miles around its belly, and this is a total of more than twice that.

Next year will be another banner year, as I already have seven weddings on the books. I don't think there will be so many trips to Europe though. That'll make the biggest difference.

I just hope I have enough suits. It's gonna be crazy.

Setting up etherpad with postgres on Windows

There don't seem to be any successful installation instructions for etherpad with postgres on Windows. It's not that hard, but there are a couple things you need to do.

I haven't gone through these instructions to make sure they work, but this is roughly what I've done to get my Postgres/Windows/Node/Etherpad working together:

  • Install git
  • Install node.js
  • Install python
    • add PYTHON as an environment variable
  • Install postgres
    • add C:\Program Files\PostgreSQL\9.2\bin to your path
  • Download etherpad-lite with git
  • Run the etherpad-lite windows installer per the instructions
  • start etherpad-lite
    • make sure it works with the dirty DB before getting exotic
  • Set up postgres
    • npm install pg (will throw an error about msbuild version, but ignore that, the native JS drivers are installed)
    • add a user using pgadmin
    • add a DB using pgadmin and the user created a second ago
    • reconfigure to use postgres in the settings.json file
  • Run start.bat to make sure it works
  • Turn down the log messages to only ERROR in the settings.json file.
  • Use NSSM to daemonize it, per the instructions here.

Note that NSSM doesn't yet have stdout and stderr redirection built in. Thus, to start the daemon with these working, you have to create a little script like this:

@ECHO OFF
@REM This runs etherpad with stdout and stderr getting redirected to special logs
call D:\etherpad\etherpad-lite\start.bat >> D:\etherpad\etherpad-lite\logs\stdout.log 2>> D:\etherpad\etherpad-lite\logs\stderr.log

Presentation on Juriscraper and CourtListener for LVI2012

Yesterday and today I've been in Ithaca, New York, participating in the Law via the Internet Conference (LVI), where I've been learning tons!

I had the good fortune to have my proposal topic selected for Track 4: Application Development for Open Access and Engagement.

In the interest of sharing, I've attached the latest version of my slides to this Blog post, and the audio for the talk may eventually get posted on the LVI site.

Calculating the average elevation of a trip using a TCX file

If you use a site like ridewithgps, something you may want to know is how to calculate the average elevation for a trip. Unfortunately, most sites don't seem to provide this, so we have to do a little hacking.

Here's what worked for me:

  • download the GPS TCX file
  • grep out the altitude lines (grep -i 'altitude' your_file.tcx)
  • find & replace out the remaining XML tags and whitespace using a basic text editor
  • average the remaining values in a spreadsheet

Takes about five minutes. For my trip the number came out to be 10,753 feet!
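If you'd rather skip the text editor and spreadsheet, the same steps can be sketched in a few lines of Python. This is only a rough sketch: it assumes the export stores elevation in AltitudeMeters elements (the usual TCX layout), the sample data is made up, and the function name is my own. Like the spreadsheet approach, it takes a plain average of the recorded samples and does not weight them by time or distance.

```python
import io
import xml.etree.ElementTree as ET

METERS_TO_FEET = 3.28084


def average_altitude_meters(source):
    """Return the mean of every AltitudeMeters value in a TCX file.

    `source` may be a file path or an open file object.
    """
    altitudes = []
    for element in ET.parse(source).iter():
        # Tags arrive as "{namespace}AltitudeMeters"; compare the local name only.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "AltitudeMeters" and element.text:
            altitudes.append(float(element.text))
    if not altitudes:
        raise ValueError("no AltitudeMeters elements found")
    return sum(altitudes) / len(altitudes)


# Tiny made-up sample standing in for a real export.
SAMPLE_TCX = """<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
  <Activities><Activity><Lap><Track>
    <Trackpoint><AltitudeMeters>100.0</AltitudeMeters></Trackpoint>
    <Trackpoint><AltitudeMeters>200.0</AltitudeMeters></Trackpoint>
    <Trackpoint><AltitudeMeters>300.0</AltitudeMeters></Trackpoint>
  </Track></Lap></Activity></Activities>
</TrainingCenterDatabase>"""

if __name__ == "__main__":
    meters = average_altitude_meters(io.StringIO(SAMPLE_TCX))
    print(f"{meters:.1f} m = {meters * METERS_TO_FEET:.0f} ft")
```

For a real ride, pass the path to the downloaded file instead, e.g. average_altitude_meters("your_file.tcx").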

URL Hacking at REI.com

I'm about two hours away from heading on vacation to Montreal, but I wanted to post a quick update about a vulnerability I found on REI.com last night.

The vulnerability was a simple one. A few days ago, to get a 15% off coupon, I signed up for their Gear Mail newsletter. It eventually came, and at the bottom it had a link to unsubscribe, which I clicked (I was only after the 15% sign-up coupon).

The link led to:

http://email.rei.com/cgi-bin12/DM/t/nCT4n0N3xbv0ESo05DPf0Et&EmailAddr=ml...

Which redirects to:

https://preferences.rei.com/rei/rei_PrefCtr.asp?EmailAddr=mlissner@micha...

I immediately noticed the badness in these URLs, and on a whim, I tried modifying the URL to use a friend's email address. Sure enough it worked, and I could look up the full name and zip code of anybody who had an email address that was in REI's system.

Around midnight last night, I sent REI an email informing them of the problem, giving them a month to fix it, and I posted on Twitter that I had found a vulnerability on REI.com. Naively, I thought that if I didn't post the link on Twitter, nobody would be able to figure it out, but of course, by morning a friend of mine (a security/privacy researcher, sigh) had found the link and posted it. Not only that, but for fun, he had tried his address book against the link, and turned up 30 of his friends' names and zip codes out of a sample of about 200.

I sent another note to REI to make sure that they knew about the link now being in the open, and that the month I promised them had been curtailed by my own mistake.

It's now 7:15pm, about 19 hours after I first informed them of the problem, and it's fixed. It still seems to be possible for me to update your email subscriptions, but at least I can't look up information about you.

New tool for testing lxml XPath queries

I got a bit frustrated today, and decided that I should build a tool to fix my frustration. The problem was that we're using a lot of XPath queries to scrape various court websites, but there was no tool that could be used to test xpath expressions efficiently.

There are a couple tools that are quite similar to what I just built: There's one called Xacobeo, Eclipse has one built in, and even Firebug has a tool that does something similar. Unfortunately though, these each operate on a different DOM interpretation than the one that lxml builds.

So the problem I was running into was that while these tools helped, I consistently had the problem that when the HTML got nasty, they'd start falling over.

No more! Today I built a quick Django app that can be run locally or on a server. It's quite simple. You input some HTML and an XPath expression, and it will tell you the matches for that expression. It has syntax highlighting, and a few other tricks up its sleeve, but it's pretty basic on the whole.

I'd love to get any feedback I can about this. It's probably still got some bugs, but it's small enough that they should be quite easy to stamp out.

Update: I got in touch with the developer of Xacobeo. There's an --html flag that you can pass to it at startup, if that's your intention. If you use that, it indeed uses the same DOM parser that my tool does. Sigh. Affordances are important, especially in a GUI-based tool.

Further privacy protections at CourtListener

I've written previously about the lengths we go to at CourtListener to protect people's privacy, and today we completed one more privacy enhancement.

After my last post on this topic, we discovered that although we had already blocked cases from appearing in the search results of all major search engines, we had a privacy leak in the form of our computer-readable sitemaps. These sitemaps contain links to every page within a website, and since those links contain the names of the parties in a case, it's possible that a Google search for the party name could turn up results that should be hidden.

This was problematic, and as of now we have changed the way we serve sitemaps so that they use the noindex X-Robots-Tag HTTP header. This tells search crawlers that they are welcome to read our sitemaps, but that they should avoid serving them or indexing them.
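To make the idea concrete, here is a minimal sketch of adding that header to sitemap responses. It is written as generic WSGI middleware using only the standard library, not as CourtListener's actual code; the function name and the "/sitemap" path prefix are assumptions made for illustration.

```python
def noindex_sitemaps(app, prefix="/sitemap"):
    """Wrap a WSGI app so responses under `prefix` carry X-Robots-Tag: noindex.

    The prefix is a hypothetical default; adjust it to wherever the
    sitemaps actually live.
    """
    def wrapped(environ, start_response):
        if not environ.get("PATH_INFO", "").startswith(prefix):
            # Not a sitemap: pass the request through untouched.
            return app(environ, start_response)

        def start_with_header(status, headers, exc_info=None):
            # Crawlers may still read the sitemap, but are asked not to index it.
            headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, start_with_header)

    return wrapped
```

Any WSGI application can be wrapped this way, e.g. application = noindex_sitemaps(application). Most web frameworks also let you set the header directly on the response object in the sitemap view, which is probably the simpler route if you control that view.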

My Presentation Proposal for LVI 2012

The Law Via the Internet conference is celebrating its 20th anniversary at Cornell University on October 7-9th. I will be attending, and with any luck, I'll be presenting on the topic proposed below.

Wrangling Court Data on a National Level

Access to case law has recently become easier than ever: By simply visiting a court's website it is now possible to find and read thousands of cases without ever leaving your home. At the same time, there are nearly a hundred court websites, many of which suffer from poor funding or prioritization, and gaining a higher-level view of the law can be challenging. “Juriscraper” is a new project designed to ease these problems for all those that wish to collect these court opinions daily. The project is under active development, and we are looking for others to get involved.

Juriscraper is a liberally-licensed open source library that can be picked up and used by any organization to scrape the case data from court websites. In addition to simply scraping the websites and extracting metadata from them, Juriscraper has a number of other design goals:

  • Extensibility to support video, oral argument audio, and other media types
  • Support for all metadata provided by court websites
  • Extensibility to support varied geographies and jurisdictions
  • Generalized object-oriented architecture with little or no code repetition
  • Standardized coding techniques using the latest libraries and standards (Python, xpath, lxml, requests, chardet)
  • Simple installation, configuration, and API
  • Friendly and transparent to court websites

As well as a number of features:

  • Harmonization of metadata (US, USA, United States of America, etc → United States; et al, et. al., etc. get eliminated; vs., v, vs → v.; all dates are Python objects; etc.)
  • Smart title-casing of case names (several courts provide case names in uppercase only)
  • Sanity checking and sorting of metadata values returned by court websites
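As a rough illustration of the harmonization feature listed above, the sketch below applies three of the named rules to a case name. It is not Juriscraper's actual implementation: the function name and the regular expressions are assumptions, and a real version would need many more variants.

```python
import re

# Variants of the country name collapse to "United States".
_US = re.compile(r"\b(?:United States of America|USA|US)\b")
# "et al", "et. al.", etc. (with an optional leading comma) are dropped.
_ET_AL = re.compile(r",?\s*\bet\.? al\.?", re.IGNORECASE)
# "vs.", "vs", "v." and "v" between parties all become "v.".
_VERSUS = re.compile(r"\s+(?:vs\.|vs|v\.|v)(?=\s)", re.IGNORECASE)


def harmonize(case_name):
    """Normalize a few common inconsistencies in a case name."""
    name = _US.sub("United States", case_name)
    name = _ET_AL.sub("", name)
    name = _VERSUS.sub(" v.", name)
    # Collapse any whitespace left behind by the removals.
    return " ".join(name.split())


if __name__ == "__main__":
    print(harmonize("USA vs. Smith, et al."))
    print(harmonize("US v Jones et. al."))
```

Keeping each rule as its own compiled pattern makes it easy to add a court-specific quirk without touching the others.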

Once implemented, Juriscraper is part of a two-part system. The second part is the caller, which uses the API, and which itself solves some interesting questions:

  • How are duplicates detected and avoided?
  • How can the impact on court websites be minimized?
  • How can mime type detection be completed successfully so that textual contents can be extracted?
  • What should we do if it is an image-based PDF?
    • How should HTML be tidied?
    • How often should we check a court website for new content?
  • What should we do in case of failure?

Juriscraper is currently deployed by CourtListener.com to scrape all of the Federal Appeals courts, and we are slowly adding additional state courts over the coming weeks.

We have been scraping these sites in various ways for several years, and Juriscraper is the culmination of what we've learned. We hope that by presenting our work at LVI 2012, we will be able to share what we have learned and gain additional collaborators in our work.

Adding New Fonts to Tesseract 3 OCR Engine

Update: I've turned off commenting on this article because it was just a bunch of people asking for help and never getting any. If you need help with these instructions, go to Stack Overflow and ask there. If you have corrections to the article, please send them directly to me using the Contact form.

Tesseract is a great and powerful OCR engine, but their instructions for adding a new font are incredibly long and complicated. At CourtListener we have to handle several unusual blackletter fonts, so we had to go through this process a few times. Below I've explained the process so others may more easily add fonts to their system.

The process has a few major steps:

Create training documents

To create training documents, open up MS Word or LibreOffice, and paste in the contents of the attached file named 'standard-training-text.txt'. This file contains the training text that is used by Tesseract for the included fonts.

Set your line spacing to at least 1.5, and space out the letters by about 1pt. using character spacing. I've attached a sample doc too, if that helps. Set the text to the font you want to use, and save it as font-name.doc.

Save the document as a PDF (call it [lang].font-name.exp0.pdf, with lang being an ISO-639 three letter abbreviation for your language), and then use the following command to convert it to a 300dpi tiff (requires imagemagick):

convert -density 300 -depth 4 lang.font-name.exp0.pdf lang.font-name.exp0.tif

You'll now have a good training image called lang.font-name.exp0.tif. If you're adding multiple fonts, or bold, italic or underline, repeat this process multiple times, creating one doc → pdf → tiff per font variation.

Train Tesseract

The next step is to run tesseract over the image(s) we just created, and to see how well it can do with the new font. After it's taken its best shot, we then give it corrections. It'll provide us with a box file, which is just a file containing x,y coordinates of each letter it found along with what letter it thinks it is. So let's see what it can do:

tesseract lang.font-name.exp0.tif lang.font-name.exp0 batch.nochop makebox

You'll now have a file called lang.font-name.exp0.box, and you'll need to open it in a box-file editor. There are a bunch of these on the Tesseract wiki. The one that works for me (on Ubuntu) is moshpytt, though it doesn't support multi-page tiffs. If you need to use a multi-page tiff, see the issue on the topic for tips. Once you've opened it, go through every letter, and make sure it was detected correctly. If a letter was skipped, add it as a row to the box file. Similarly, if two letters were detected as one, break them up into two lines.

When that's done, you feed the box file back into tesseract:

tesseract lang.font-name.exp0.tif lang.font-name.exp0 nobatch box.train.stderr

Next, you need to detect the character set used in all your box files:

unicharset_extractor *.box

When that's complete, you need to create a font_properties file. It should list every font you're training, one per line, and identify whether it has the following characteristics: <fontname> <italic> <bold> <fixed> <serif> <fraktur>

So, for example, if you use the standard training data, you might end up with a file like this:

eng.arial.box 0 0 0 0 0
eng.arialbd.box 0 1 0 0 0
eng.arialbi.box 1 1 0 0 0
eng.ariali.box 1 0 0 0 0
eng.b018012l.box 0 0 0 1 0
eng.b018015l.box 0 1 0 1 0
eng.b018032l.box 1 0 0 1 0
eng.b018035l.box 1 1 0 1 0
eng.c059013l.box 0 0 0 1 0
eng.c059016l.box 0 1 0 1 0
eng.c059033l.box 1 0 0 1 0
eng.c059036l.box 1 1 0 1 0
eng.cour.box 0 0 1 1 0
eng.courbd.box 0 1 1 1 0
eng.courbi.box 1 1 1 1 0
eng.couri.box 1 0 1 1 0
eng.georgia.box 0 0 0 1 0
eng.georgiab.box 0 1 0 1 0
eng.georgiai.box 1 0 0 1 0
eng.georgiaz.box 1 1 0 1 0
eng.lincoln.box 0 0 0 0 1
eng.old-english.box 0 0 0 0 1
eng.times.box 0 0 0 1 0
eng.timesbd.box 0 1 0 1 0
eng.timesbi.box 1 1 0 1 0
eng.timesi.box 1 0 0 1 0
eng.trebuc.box 0 0 0 1 0
eng.trebucbd.box 0 1 0 1 0
eng.trebucbi.box 1 1 0 1 0
eng.trebucit.box 1 0 0 1 0
eng.verdana.box 0 0 0 0 0
eng.verdanab.box 0 1 0 0 0
eng.verdanai.box 1 0 0 0 0
eng.verdanaz.box 1 1 0 0 0

Note that this is the standard font_properties file that should be supplied with Tesseract, to which I've added two rows (eng.lincoln.box and eng.old-english.box) for the blackletter fonts I'm training. You can also see which fonts are included out of the box.

We're getting near the end. Next, create the clustering data:

mftraining -F font_properties -U unicharset -O lang.unicharset *.tr
cntraining *.tr

If you want, you can create a wordlist or a unicharambigs file. If you don't plan on doing that, the last step is to combine the various files we've created.

To do that, rename each of the language files (normproto, Microfeat, inttemp, pffmtable) to have your lang prefix, and run (mind the dot at the end):

combine_tessdata lang.

This will create all the data files you need, and you just need to move them to the correct place on your OS. On Ubuntu, I was able to move them to:

sudo mv eng.traineddata /usr/local/share/tessdata/

And that, good friend, is it. Worst process for a human, ever.
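If you end up repeating this for several fonts, it can help to have the command sequence in one place. The sketch below only builds the commands from this post as argument lists; it does not run anything. The function name is my own, it assumes one image per font, and it assumes the output base in the training step matches the image's base name. The manual steps (correcting the box file, writing font_properties, renaming the output files) still have to happen by hand.

```python
def training_commands(lang, font, exp=0):
    """Return the shell commands from this walkthrough as argument lists.

    Nothing is executed here. The box file must be corrected by hand
    between the second and third commands.
    """
    base = f"{lang}.{font}.exp{exp}"
    return [
        # PDF -> 300dpi tiff (requires imagemagick).
        ["convert", "-density", "300", "-depth", "4", f"{base}.pdf", f"{base}.tif"],
        # First pass: let tesseract guess and write the box file.
        ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"],
        # Second pass: feed the corrected box file back in.
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train.stderr"],
        ["unicharset_extractor", f"{base}.box"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
        # Run only after renaming the output files with the lang prefix.
        ["combine_tessdata", f"{lang}."],
    ]


if __name__ == "__main__":
    for command in training_commands("eng", "lincoln"):
        print(" ".join(command))
```

Printing the commands first and running them one at a time is deliberate: each step can fail in its own way, and the box-file correction in the middle cannot be automated.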

The Winning Font in Court Opinions

At CourtListener, we're developing a new system to convert scanned court documents to text. As part of our development we've analyzed more than 1,000 court opinions to determine what fonts courts are using.

Now that we have this information, our next step is to create training data for our OCR system so that it specializes in these fonts, but for now we've attached a spreadsheet with our findings, and a script that can be used by others to extract font metadata from PDFs.

Unsurprisingly, the top font — drumroll please — is Times New Roman.

Font                Regular  Bold  Italic  Bold Italic  Total
Times               1454     953   867     47           3321
Courier             369      333   209     131          1042
Arial               364      39    11      41           455
Symbol              212      0     0       0            212
Helvetica           24       161   2       2            189
Century Schoolbook  58       54    52      9            173
Garamond            44       42    41      0            127
Palatino Linotype   36       24    24      1            85
Old English         42       0     0       0            42
Lincoln             27       0     0       0            27

Support for x-robots-tag and robots HTML meta tag

As part of our research for our post on how we block search engines, we looked into which search engines support which privacy standards. This information doesn't seem to exist anywhere else on the Internet, so below are our findings, starting with the big guys, and moving towards more obscure or foreign search engines.

Google, Bing

Google (known as Googlebot) and Bing (known as Bingbot) support the x-robots-tag header and the robots HTML tag. Here's Google's page on the topic. And here's Bing's. The msnbot is retired.

Yahoo, AOL

Yahoo!'s search engine is provided by Bing. AOL's is provided by Google. These are easy ones.

Ask, Yandex, Nutch

Ask (known as teoma), and Yandex (Russia's search engine, known as yandex), support the robots meta tag, but do not appear to support the x-robots-tag. Ask's page on the topic is here, and Yandex's is here. The popular open source crawler, Nutch, also supports the robots HTML tag, but not the x-robots-tag header. Update: Newer versions of Nutch now support x-robots-tag!

The Internet Archive, Alexa

The Internet Archive uses Alexa's crawler, which is known as ia_archiver. This crawler does not seem to support either the HTML robots meta tag or the x-robots-tag HTTP header. Their page on the subject is here. I have requested more information from them, and will update this page if I hear back.

Duckduckgo, Blekko, Baidu

Duckduckgo and Blekko do not support either the robots meta tag or the x-robots-tag header, per emails I've had with each of them. I also requested information from Baidu, but their response totally ignored my question and was in Chinese. They do have some information here, but it does not seem to provide any information on the noindex value for the robots tag. In any case, the only way to block these crawlers seems to be via a robots.txt file.

Respecting privacy while providing hundreds of thousands of public documents

At CourtListener, we have always taken privacy very seriously. We have over 600,000 cases currently, most of which are available on Google and other search engines. But in the interest of privacy, we make two broad exceptions to what's available on search engines:

  1. As is stated in our removal policy, if someone gets in touch with us in writing and requests that we block search engines from indexing a document, we generally attempt to do so within a few hours.
  2. If we discover a privacy problem within a case, we proactively block search engines from indexing it.

Each of these exceptions presents interesting problems. In the case of requests to prevent indexing by search engines, we're often faced with an ethical dilemma, since in many instances, the party making the request is merely displeased that their involvement in the case is easy to discover and/or they are simply embarrassed by their past. In this case, the question we have to ask ourselves is: Where is the balance between the person's right to privacy and the public's need to access court records, and to what extent do changes in practical obscurity compel action on our behalf? For example, if someone convicted of murder or child molestation is trying to make information about their past harder to discover, how should we weigh the public's interest in easily locating this information via a search engine? In the case of convicted child molesters, we can look to Megan's law for a public policy stance on the issue, but even that forces us to ask to what extent we should chart our own path, and to what extent we should follow public policy decisions.

On the opposite end of the spectrum, many of the cases that we block search engines from indexing are asylum cases where a person has lost an attempt to stay in the United States, and been sent back to a country where they feel unsafe. In such cases, it seems clear that it's important to keep the person's name out of search engine results, but still we must ask: to what extent do we have an obligation to identify and block such cases from appearing proactively rather than post hoc?

In both of these scenarios, we have taken a middle ground that we hope strikes a balance between the public's need for court documents and an individual's desire or need for privacy. Instead of either proactively blocking search engines from indexing cases or keeping cases in search results against a party's request, our current policy is to block search engines from indexing a web page as each request comes in. We currently have 190 cases that are blocked from search results, and the number increases regularly.

Where we do take proactive measures to block cases from search results is where we have discovered unredacted or improperly redacted social security numbers in a case. Taking a cue from the now-defunct Altlaw, whenever a case is added, we look for character strings that appear to be social security numbers, tax ID numbers or alien ID numbers. If we find any such strings, we replace them with x's, and we try to make sure the unredacted document does not appear in search results outside of CourtListener.
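A minimal sketch of that kind of check is below. The patterns are assumptions for illustration (the common 3-2-4 social security format, the 2-7 tax ID format, and an "A" followed by eight or nine digits for alien IDs); they are not the exact patterns we run, and real documents contain plenty of variations that these would miss.

```python
import re

# Hypothetical patterns; real-world formats vary more than this.
ID_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # social security number
    re.compile(r"\b\d{2}-\d{7}\b"),        # tax ID number
    re.compile(r"\bA-?\d{8,9}\b"),         # alien ID number
]


def redact_ids(text):
    """Replace the digits of anything that looks like an ID with x's.

    Returns the redacted text and a flag saying whether anything was
    found, so the caller can also block the page from search engines.
    """
    found = False

    def mask(match):
        nonlocal found
        found = True
        return re.sub(r"\d", "x", match.group(0))

    for pattern in ID_PATTERNS:
        text = pattern.sub(mask, text)
    return text, found


if __name__ == "__main__":
    print(redact_ids("Petitioner's SSN is 123-45-6789."))
```

Returning the flag alongside the text matters as much as the redaction itself, since it is what triggers keeping the document out of search results.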

The methods we have used to block cases from appearing in search results have evolved over time, and I'd like to share what we've learned so others can give us feedback and learn from our experiences. There are five technical measures we use to keep a case out of search results:

  1. robots.txt file
  2. HTML meta noindex tags
  3. HTTP X-Robots-Tag headers
  4. sitemaps.xml files
  5. The webmaster tools provided by the search engines themselves

Each of these deserves a moment of explanation. robots.txt is a protocol that is respected by all major search engines internationally, and which allows site authors (such as myself) to identify web pages that shouldn't be crawled. Note that I said crawled not indexed. This is a very important distinction, as I'll explain momentarily.

The HTML meta noindex tag is a tag that you can place in the HTML of a page to instruct search engines not to index that page. Since it is part of the HTML, this method only works on HTML pages.

HTTP X-Robots-Tag headers are similar to HTML meta tags, but because they are sent with the HTTP response rather than inside the page, they allow site authors to request that any item not be indexed. That item may be an HTML page, but equally, it may be a PDF or even an image that should not be searchable.

Further, we provide an XML sitemap that search engines can understand, and which tells them about every page on the site that they should crawl and index.

All of these elements fit together into a complicated mélange that has absorbed many development hours over the past two years, as different search engines interpret these standards in different ways.

For example, Google and Bing interpret the robots.txt files as blocks to their crawlers. This means that web pages listed in robots.txt will not be crawled by Google or Bing, but that does not mean those pages will not be indexed. Indeed, if Google or Bing learn of the existence of a web page (for example, because another page linked to it), then they will include it in their indexes. This is true even if robots.txt explicitly blocks robots from crawling the page, because to include it in their indexes, they don't have to crawl it — they just need to know about it! Even your own link to a page is sufficient for Google or Bing to know about the page. And what's worse, if you have a good URL with descriptive words within it, Google or Bing will know the terms in the URLs even when they haven't crawled the page. So if your URL is example.com/private-page-about-michael-jackson, a query for [ Michael Jackson ] could certainly bring it up, even if it were never crawled.

The solution to this is to allow Google and Bing to crawl the pages, but to use noindex meta tags or HTTP headers. If these are in place, the pages will not appear in the index at all. This sounds paradoxical: to exclude pages from appearing in Google and Bing, you have to allow them to be crawled? Yes, that's correct. Furthermore, it's theoretically possible that Google or Bing could learn about a page on your site from a link, and then not crawl it immediately or at all. In this case, they will know the URL, but won't know about any X-Robots-Tag headers or meta tags. Thus, they might include the document against your wishes. For this reason, it's important to include private pages in your sitemap.xml file, inviting and encouraging Google and Bing to crawl the page specifically so the page can be excluded from their indexes.
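To make the pieces concrete, here is a minimal sketch of the two noindex signals and of a sitemap that deliberately lists private pages. The function names and example URLs are hypothetical, and a real site would attach the header in its web framework or server config. The meta tag, the `X-Robots-Tag: noindex` header and the sitemap namespace are the standard ones.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def noindex_meta_tag():
    """Tag to place in the <head> of a blocked HTML page."""
    return '<meta name="robots" content="noindex">'


def response_headers(blocked):
    """Extra HTTP headers for a document (hypothetical helper).

    Unlike the meta tag, the header also works for PDFs and images.
    """
    return {"X-Robots-Tag": "noindex"} if blocked else {}


def build_sitemap(urls):
    """Build a sitemap that lists every URL, blocked pages included.

    Listing blocked pages invites crawlers to fetch them, which is the
    only way they will ever see the noindex signals above.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


sitemap = build_sitemap([
    "https://example.com/public/1/",
    "https://example.com/private/2/",  # blocked, but still listed
])
```

The important design choice is that the sitemap does not distinguish between public and private pages. The privacy decision lives entirely in the response for each page.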

Yahoo! uses Bing to power their search engine, and AOL uses Google, so the above strategy applies to them as well.

Other search engines take a different approach to robots.txt. Ask.com, The Internet Archive and the Russian search engine Yandex.ru all respect the robots meta tag, but not the x-robots-tag HTTP header. Thus, for these search engines, the strategy above works for HTML files, but not for any other files. These crawlers therefore need to be blocked from accessing those other files. On the upside, unlike Google and Bing, it appears that these search engines will not show a document in their results if they have not crawled it. Thus, using robots.txt alone should be sufficient.

A third class of search engines support neither the robots HTML meta tag, nor the x-robots-tag HTTP header. These are typically less popular or less mature crawlers, and so they must be blocked using robots.txt. There are two approaches to this. The first is to list blocked pages individually in the robots.txt file, and the second is to simply block these search engines from all access. While it's possible to list each private document in robots.txt, doing so creates a privacy loophole, since it lists all private documents in one place. At CourtListener, therefore, we take a conservative approach, and completely block all search engines that do not support the HTML meta tag or the x-robots-tag HTTP header.
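The conservative approach can be sketched as a robots.txt that shuts out the unsupported crawlers completely and leaves everyone else free to crawl. The user agent names below are placeholders rather than a list of real crawlers, and the check at the end uses Python's standard `urllib.robotparser` to confirm the file behaves as intended.

```python
from urllib.robotparser import RobotFileParser

# Placeholder names standing in for crawlers that ignore noindex signals.
UNSUPPORTED_AGENTS = ["ExampleBot", "OtherBot"]


def build_robots_txt(blocked_agents):
    """Block the given crawlers from the whole site; allow all others."""
    lines = []
    for agent in blocked_agents:
        lines += ["User-agent: %s" % agent, "Disallow: /", ""]
    # Everyone else may crawl, so they can see the noindex signals.
    lines += ["User-agent: *", "Disallow:"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(UNSUPPORTED_AGENTS)

# Sanity check the result with the standard library's parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
blocked = parser.can_fetch("ExampleBot", "https://example.com/case/1/")
allowed = parser.can_fetch("Googlebot", "https://example.com/case/1/")
```

Note that no private URL ever appears in the file, which avoids the loophole of publishing a list of every private document in one place.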

The final action we take when we receive a request that a document on our site stop appearing in search results is to use the webmaster tools provided by the major search engines[1] to explicitly ask those search engines to exclude the document(s) from their results.

Between these measures, private documents on CourtListener should be removed from all major and minor search engines. Where possible, this strategy takes a very granular approach, and where minor search engines do not support certain standards, we take a conservative approach, blocking them entirely.

Update, 2012-04-29: You may also want to look at our discussion of the impact of putting people's names into your URLs, and the way that affects your sitemap files.

  1. We use Google's Webmaster Tools and Bing's Webmaster Tools. Before it was merged into Bing's tools, we also used Yahoo's Site Explorer.

Android Icons from Ice Cream Sandwich


For those that are looking for the stock Android icons that come with the platform, I just pulled them together, and put them in a zip for you.

I've gone through and deleted a couple hundred images that weren't icons - things like blue backgrounds and the like, but the rest should be here.

Enjoy.

Integrating Solr Search with Django at CourtListener

Over the past few weeks, I've been hard at work on the new version of CourtListener. Unfortunately, progress has been slower than I'd like due to the limitations of the Solr frameworks I've been using. There are a number of competing frameworks available, each with its own strengths and pitfalls.

So far, I've tried two of the popular ones, Haystack and Sunburnt. I'm pretty impressed by both, but today's blog post is to outline the problems I'm having with these frameworks so that others that are faced with choosing one might be better informed. The difference between these frameworks is vast. Haystack aims to solve all of your integration needs, while Sunburnt is a fairly lightweight wrapper around Solr.

CourtListener's needs

At CourtListener, we have some big goals for the new search version. At its core, it's essentially a search-powered site, so we have some big needs:

  • Parallel Faceted Search
  • Highlighting
  • Complex boolean searches supported by Solr's eDisMax syntax
  • Snippets below search results and in emails
  • Standard search stuff: field-level boosting, result and facet counts, field-level searching, result pagination, performance, etc.

We're currently using Sphinx Search with django-sphinx, which does a fine job, but it has some problems:

  • django-sphinx hasn't been maintained in years, and requires patching
  • django-sphinx doesn't support snippets
  • Sphinx doesn't (yet) support real time indexing (though it's in beta, I believe)
  • Sphinx doesn't have the community and features that Solr does
  • Unfamiliar syntax for users

In general, these problems aren't too difficult, but in combination, they make for a poor user experience. The last point is a real deal breaker, since most users are accustomed to making queries like [ site:google.com ], which works for Solr and Google, but not for Sphinx. In Sphinx, your query is [ @site(google.com) ]. While we could do post processing of the user's query to convert it from Google/Solr-style syntax to Sphinx's, it's unreliable and prone to failing in corner cases. Parsing queries is hard. More on this in a moment.

Let's try Haystack

In switching from Sphinx, I first tried Haystack as a solution, since it has excellent documentation and seems to be the most popular solution. I spent about two weeks learning about it and getting it in place, but ultimately, I gave up on it because I found that I was subclassing it everywhere. Haystack is a good solution, to be sure, but I found that I was:

  • Subclassing the FacetView so it could support parallel facet counts
  • Subclassing the FacetForm for another feature I needed
  • Subclassing the Solr backend so it could support Solr's highlighting syntax
  • Further subclassing the Solr backend so it can support additional Solr parameters that aren't built in
  • ...etc...

I worked on that third point for the better part of a day before deciding that Haystack wasn't for me. Rather than spending my time working on the search needs of CourtListener, I was spending most of it hacking on Haystack, and trying to understand the way it fits together. It's not unreasonably complex, but there is a LOT of documentation, and a lot of complexity that I don't need (such as the ability to switch search backends). Instead of a big solution that allows me to subclass whatever I need (which is good), I needed a lighter-weight solution that was more nimble, and which allowed me to interact with Solr in a more direct way.

Enter Sunburnt

Sunburnt is a lightweight solution that is everything that Haystack isn't. From the moment it's installed, you can start making queries without configuring Django to use it, and without really knowing much else. Its documentation is a single page, which is actually a big relief after coming from Haystack. But Sunburnt has a major problem in its design: It doesn't support just sending queries to Solr. The expectation in Sunburnt is that each system using it does post-processing on the user's query, and then submits the query to Sunburnt in stages.

So, if a user searches for "foo bar", rather than just passing that to Sunburnt, you have to split on the white space, then pass:
si.query('foo').query('bar')

At first you think, "OK, I can do that - just split on white space, no big deal." Then you start thinking about the other syntax that Solr supports, and you realize that you have a real problem if you have to split up queries appropriately. Trust me when I say that you don't want to be thinking about how to send a query like this one to Sunburnt: [ foo bar "jakarta apache"~10 ].
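A few lines of plain Python show why. This sketch does not use Sunburnt at all. It only looks at what happens to the query string before it could be handed over, and the regular expression is an illustrative assumption, not a real Solr query parser.

```python
import re

query = 'foo bar "jakarta apache"~10'

# Naive approach: split on white space. The phrase is torn in half.
naive_terms = query.split()
# ['foo', 'bar', '"jakarta', 'apache"~10']

# Slightly smarter: keep quoted phrases (and a trailing ~N) together.
PHRASE_OR_WORD = re.compile(r'"[^"]*"(?:~\d+)?|\S+')
better_terms = PHRASE_OR_WORD.findall(query)
# ['foo', 'bar', '"jakarta apache"~10']
```

The second version handles this one query, but it still knows nothing about boolean operators, field-level searches, grouping with parentheses, or escaped quotes. Each of those needs another special case, which is how you end up writing a query parser.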

The author of Sunburnt will point out that there's a workaround for this problem. You can use
si.search(q='"jakarta apache"~10')

That works, to a point, but that syntax isn't supported on facets, so your facet counts won't match your results. And so, Sunburnt, though powerful and lightweight, fails.

What now?

Good question.

Tripling Down


I'm tripling down on my donations that I placed at the beginning of the year. Feels right.

The abolishment of the Emergency Court of Appeals (April 18, 1962)

One of the coming features at CourtListener is an API for the law. Part of that feature is going to be some basic information about the courts themselves, so I spent some time over the weekend researching courts that served a special purpose but were since abolished.

One such court was the Emergency Court of Appeals. It was created during World War II to set prices, and, naturally, was the court of appeals for many cases. The creation date of the court is prominently published in various places on the Internet, but the abolishment history of the court was very difficult to find. After researching online for some time, and learning that my library card had expired (sigh), I put in a query with the Library of Congress, which provides free research of these types of things.

Within a couple of days, they provided me with this amazing response, which I'm sharing here, and on the Wikipedia article for the court:

As stated in the Legislative Notes to 50 U.S. Code Appendix §§ 921 to 926, as posted at
http://www.law.cornell.edu/uscode/html/uscode50a/usc_sec_50a_00000921---..., the following explanation is given regarding the amendment and repeal of Act of Jan. 30, 1942, ch. 26, title II, § 204, 56 Stat. 23, 31-33:

"Section 924, acts Jan. 30, 1942, ch. 26, title II, § 204, 56 Stat. 31; June 30, 1944, ch. 325, title I, § 107, 58 Stat. 639; June 30, 1945, ch. 214, § 6, 59 Stat. 308; July 30, 1947, ch. 361, title I, § 101, 61 Stat. 619; June 25, 1948, ch. 646, § 32(a), 62 Stat. 991; May 24, 1949, ch. 139, § 127, 63 Stat. 107, authorized review of orders of the Office of Price Administrator under the Emergency Price Control Act of 1942, and created the Emergency Court of Appeals for this purpose. The Emergency Price Control Act of 1942 terminated on June 30, 1947, under the provisions of act July 25, 1946, ch. 671, § 1, 60 Stat. 664. The Housing and Rent Act of 1948, act Mar. 30, 1948, ch. 161, 62 Stat. 93, classified to section 1881 of this Appendix, continued the Court for the purpose of reviewing recommendations of local advisory boards for the decontrol or adjustment of maximum rents. Later, the Defense Production Act of 1950, act Sept. 8, 1950, ch. 932, 64 Stat. 798, classified to sections 2061 to 2166 of this Appendix, continued the Court to review regulations and orders relating to price control. The Housing and Rent Act of 1948 and the Defense Production Act of 1950 both terminated, however, the Court remained in existence “to complete the adjudication of rights and liabilities incurred prior to their termination dates.” (Transcript of Proceedings of the Final Session of the Court, 299 F.2d 1.) The final decision of the Court, Rosenzweig v. General Services Administration, 1961, 299 F.2d 22, was decided on Dec. 6, 1961. A petition for rehearing was denied on Jan. 2, 1962, and a petition for writ of certiorari to the Supreme Court of the United States was denied on Mar. 19, 1962, 82 S. Ct. 830.

The order of Chief Judge Albert B. Maris, set forth in 299 F.2d 20, provided:

“The business of this Court having been completed, it is ordered that at the expiration of 30 days from this date, if a petition for certiorari has not been filed in the Supreme Court in Case No. 676 [Rosenzweig v. General Services Administration], just decided, the acting clerk shall deliver the records and papers of the Court in his office to the General Services Administration for permanent custody as records of the Government, and shall thereupon inform the Chief Justice of the United States that the work of the Court has been completed and that the designations of the judges of the Court may therefore appropriately be terminated.

“If a petition for certiorari is filed in Case No. 676 this order shall take effect and be carried out at the expiration of 30 days after the final disposition of Case No. 676.”

In accordance with the terms of this order, the petition for certiorari having been filed, and denied Mar. 19, 1962, the Court terminated on Apr. 18, 1962."

Pretty fantastic research. And for free! Thanks LOC.
