mlissner's blog

The Winning Font in Court Opinions

At CourtListener, we're developing a new system to convert scanned court documents to text. As part of our development we've analyzed more than 1,000 court opinions to determine what fonts courts are using.

Now that we have this information, our next step is to create training data for our OCR system so that it specializes in these fonts. For now, we've attached a spreadsheet with our findings and a script that others can use to extract font metadata from PDFs.

Unsurprisingly, the top font — drumroll please — is Times New Roman.

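| Font               | Regular | Bold | Italic | Bold Italic | Total |
|--------------------|---------|------|--------|-------------|-------|
| Times              | 1454    | 953  | 867    | 47          | 3321  |
| Courier            | 369     | 333  | 209    | 131         | 1042  |
| Arial              | 364     | 39   | 11     | 41          | 455   |
| Symbol             | 212     | 0    | 0      | 0           | 212   |
| Helvetica          | 24      | 161  | 2      | 2           | 189   |
| Century Schoolbook | 58      | 54   | 52     | 9           | 173   |
| Garamond           | 44      | 42   | 41     | 0           | 127   |
| Palatino Linotype  | 36      | 24   | 24     | 1           | 85    |
| Old English        | 42      | 0    | 0      | 0           | 42    |
| Lincoln            | 27      | 0    | 0      | 0           | 27    |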

Support for x-robots-tag and robots HTML meta tag

As part of our research for our post on how we block search engines, we looked into which search engines support which privacy standards. This information doesn't seem to exist anywhere else on the Internet, so below are our findings, starting with the big guys, and moving towards more obscure or foreign search engines.

Google, Bing

Google (known as Googlebot) and Bing (known as Bingbot) support the x-robots-tag header and the robots HTML tag. Here's Google's page on the topic. And here's Bing's. The msnbot is retired.

Yahoo, AOL

Yahoo!'s search engine is provided by Bing. AOL's is provided by Google. These are easy ones.

Ask, Yandex, Nutch

Ask (known as teoma) and Yandex (Russia's search engine, known as yandex) support the robots meta tag, but do not appear to support the x-robots-tag header. Ask's page on the topic is here, and Yandex's is here. The popular open source crawler, Nutch, also supports the robots HTML tag, but not the x-robots-tag header.

The Internet Archive, Alexa

The Internet Archive uses Alexa's crawler, which is known as ia_archiver. This crawler does not seem to support either the HTML robots meta tag or the x-robots-tag HTTP header. Their page on the subject is here. I have requested more information from them, and will update this page if I hear back.

Duckduckgo, Blekko, Baidu

Duckduckgo and Blekko do not support either the robots meta tag or the x-robots-tag header, per emails I've had with each of them. I also requested information from Baidu, but their response totally ignored my question and was in Chinese. They do have some information here, but it does not seem to say anything about the noindex value for the robots tag. In any case, the only way to block these crawlers seems to be via a robots.txt file.

Respecting privacy while providing hundreds of thousands of public documents

At CourtListener, we have always taken privacy very seriously. We have over 600,000 cases currently, most of which are available on Google and other search engines. But in the interest of privacy, we make two broad exceptions to what's available on search engines:

  1. As is stated in our removal policy, if someone gets in touch with us in writing and requests that we block search engines from indexing a document, we generally attempt to do so within a few hours.
  2. If we discover a privacy problem within a case, we proactively block search engines from indexing it.

Each of these exceptions presents interesting problems. In the case of requests to prevent indexing by search engines, we're often faced with an ethical dilemma, since in many instances, the party making the request is merely displeased that their involvement in the case is easy to discover and/or they are simply embarrassed by their past. In this case, the question we have to ask ourselves is: Where is the balance between the person's right to privacy and the public's need to access court records, and to what extent do changes in practical obscurity compel action on our part? For example, if someone convicted of murder or child molestation is trying to make information about their past harder to discover, how should we weigh the public's interest in easily locating this information via a search engine? In the case of convicted child molesters, we can look to Megan's Law for a public policy stance on the issue, but even that forces us to ask to what extent we should chart our own path, and to what extent we should follow public policy decisions.

On the opposite end of the spectrum, many of the cases that we block search engines from indexing are asylum cases where a person has lost an attempt to stay in the United States, and been sent back to a country where they feel unsafe. In such cases, it seems clear that it's important to keep the person's name out of search engine results, but still we must ask to what extent do we have an obligation to identify and block such cases from appearing proactively rather than post hoc?

In both of these scenarios, we have taken a middle ground that we hope strikes a balance between the public's need for court documents and an individual's desire or need for privacy. Instead of either proactively blocking search engines from indexing cases or keeping cases in search results against a party's request, our current policy is to block search engines from indexing a web page as each request comes in. We currently have 190 cases that are blocked from search results, and the number increases regularly.

Where we do take proactive measures to block cases from search results is where we have discovered unredacted or improperly redacted social security numbers in a case. Taking a cue from the now-defunct Altlaw, whenever a case is added, we look for character strings that appear to be social security numbers, tax ID numbers or alien ID numbers. If we find any such strings, we replace them with x's, and we try to make sure the unredacted document does not appear in search results outside of CourtListener.
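As a rough illustration of that check, here is a minimal sketch in Python. The patterns are assumptions, not the ones CourtListener actually uses: they only match the common dashed layouts for social security numbers (123-45-6789) and tax ID numbers (12-3456789), and the function name is made up for this example.

```python
import re

# Assumed formats only: dashed SSNs (ddd-dd-dddd) and dashed tax IDs (dd-ddddddd).
# Real documents will contain other layouts that this sketch misses.
ID_PATTERN = re.compile(r"\b(?:\d{3}-\d{2}-\d{4}|\d{2}-\d{7})\b")


def redact_ids(text):
    """Replace the digits of ID-like strings with x's.

    Returns the redacted text plus a flag saying whether anything was found,
    so the caller can also block the document from search engines.
    """
    redacted, count = ID_PATTERN.subn(
        lambda match: re.sub(r"\d", "x", match.group()), text
    )
    return redacted, count > 0
```

A caller would store the redacted text and, when the flag is true, mark the document as blocked from search results.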

The methods we have used to block cases from appearing in search results have evolved over time, and I'd like to share what we've learned so others can give us feedback and learn from our experiences. There are five technical measures we use to keep a case out of search results:

  1. robots.txt file
  2. HTML meta noindex tags
  3. HTTP X-Robots-Tag headers
  4. sitemap.xml files
  5. The webmaster tools provided by the search engines themselves

Each of these deserves a moment of explanation. robots.txt is a protocol that is respected by all major search engines internationally, and which allows site authors (such as myself) to identify web pages that shouldn't be crawled. Note that I said crawled not indexed. This is a very important distinction, as I'll explain momentarily.

An HTML meta tag is a tag that you can place in the HTML of a page to instruct search engines not to index that page. Since this is an HTML mechanism, this method only works on HTML pages.

HTTP X-Robots-Tag headers are similar to HTML meta tags, but because they are sent with the HTTP response rather than inside the page, they allow site authors to request that any kind of item not be indexed. That item may be an HTML page, but equally, it may be a PDF or even an image that should not be searchable.
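To make the difference concrete, here is a small sketch of how a site might choose between the two signals. The function name and return shape are hypothetical. The `X-Robots-Tag: noindex` header and the `<meta name="robots" content="noindex">` tag are the usual forms of each.

```python
NOINDEX_META = '<meta name="robots" content="noindex">'


def noindex_signals(content_type, blocked):
    """Return (extra_headers, extra_head_html) for one response.

    Hypothetical helper: the header can be sent with any file type, while
    the meta tag can only be added when the response is an HTML page.
    """
    if not blocked:
        return {}, ""
    headers = {"X-Robots-Tag": "noindex"}
    head_html = NOINDEX_META if content_type.startswith("text/html") else ""
    return headers, head_html
```

For a blocked PDF this returns only the header, which is exactly the case the meta tag cannot cover.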

Further, we provide an XML sitemap that search engines can understand, and which tells them about every page on the site that they should crawl and index.

All of these elements fit together into a complicated mélange that has absorbed many development hours over the past two years, as different search engines interpret these standards in different ways.

For example, Google and Bing interpret the robots.txt files as blocks to their crawlers. This means that web pages listed in robots.txt will not be crawled by Google or Bing, but that does not mean those pages will not be indexed. Indeed, if Google or Bing learn of the existence of a web page (for example, because another page linked to it), then they will include it in their indexes. This is true even if robots.txt explicitly blocks robots from crawling the page, because to include it in their indexes, they don't have to crawl it — they just need to know about it! Even your own link to a page is sufficient for Google or Bing to know about the page. And what's worse, if you have a good URL with descriptive words within it, Google or Bing will know the terms in the URLs even when they haven't crawled the page. So if your URL is example.com/private-page-about-michael-jackson, a query for [ Michael Jackson ] could certainly bring it up, even if it were never crawled.
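The crawl-versus-index distinction is easy to see with Python's standard `urllib.robotparser` module, which answers only one question: may this user agent fetch this URL? The robots.txt content and the helper name below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that blocks crawling of a single page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private-page-about-michael-jackson
"""


def may_crawl(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots.txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

`may_crawl("Googlebot", "http://example.com/private-page-about-michael-jackson")` is false, but nothing in the protocol says anything about indexing, so the URL and the words in it can still end up in search results.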

The solution to this is to allow Google and Bing to crawl the pages, but to use noindex meta or HTTP tags. If these are in place, the pages will not appear in the index at all. This sounds paradoxical: to exclude pages from appearing in Google and Bing, you have to allow them to be crawled? Yes, that's correct. Furthermore, it's theoretically possible that Google or Bing could learn about a page on your site from a link, and then not crawl it immediately or at all. In this case, they will know the URL, but won't know about any X-Robots-Tag headers or meta tags. Thus, they might include the document against your wishes. For this reason, it's important to include private pages in your sitemap.xml file, inviting and encouraging Google and Bing to crawl the page specifically so the page can be excluded from their indexes.
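Here is a minimal sketch of a sitemap builder that follows this advice. The function and the `(url, blocked)` input shape are assumptions for the example. The namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(documents):
    """Build sitemap XML from (url, blocked) pairs.

    The blocked flag is deliberately ignored: blocked pages are listed too,
    so crawlers fetch them and see the noindex tag or header.
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for url, _blocked in documents:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

The point of the design is what the function does not do: it never filters on the blocked flag.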

Yahoo! uses Bing to power their search engine, and AOL uses Google, so the above strategy applies to them as well.

Other search engines take a different approach to robots.txt. Ask.com, The Internet Archive and the Russian search engine Yandex.ru all respect the robots meta tag, but not the x-robots-tag HTTP header. Thus, for these search engines, the strategy above works for HTML files, but not for any other files. These crawlers therefore need to be blocked from accessing those other files. On the upside, unlike Google and Bing, it appears that these search engines will not show a document in their results if they have not crawled it. Thus, using robots.txt alone should be sufficient.

A third class of search engines support neither the robots HTML meta tag, nor the x-robots-tag HTTP header. These are typically less popular or less mature crawlers, and so they must be blocked using robots.txt. There are two approaches to this. The first is to list blocked pages individually in the robots.txt file, and the second is to simply block these search engines from all access. While it's possible to list each private document in robots.txt, doing so creates a privacy loophole, since it lists all private documents in one place. At CourtListener, therefore, we take a conservative approach, and completely block all search engines that do not support the HTML meta tag or the x-robots-tag HTTP header.
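The conservative approach can be sketched as a tiny robots.txt generator. The crawler names in the usage note are only illustrative: ia_archiver is mentioned above, and Baiduspider is Baidu's crawler name, but the list of crawlers CourtListener actually blocks is not given here.

```python
def build_robots_txt(blocked_agents):
    """Return robots.txt text that shuts out the listed crawlers entirely.

    Every other crawler is allowed everywhere (an empty Disallow line),
    so no private URL ever needs to be listed in the file.
    """
    lines = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines += ["User-agent: *", "Disallow:", ""]
    return "\n".join(lines)
```

Calling `build_robots_txt(["ia_archiver", "Baiduspider"])` produces one `Disallow: /` block per crawler, followed by a catch-all block that allows everything for everyone else.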

The final action we take when we receive a request that a document on our site stop appearing in search results is to use the webmaster tools provided by the major search engines [1] to explicitly ask those search engines to exclude the document(s) from their results.

Between these measures, private documents on CourtListener should be removed from all major and minor search engines. Where possible, this strategy takes a very granular approach, and where minor search engines do not support certain standards, we take a conservative approach, blocking them entirely.

  1. We use Google's Webmaster Tools and Bing's Webmaster Tools. Before it was merged into Bing's tools, we also used Yahoo's Site Explorer.

Android Icons from Ice Cream Sandwich

For those that are looking for the stock Android icons that come with the platform, I just pulled them together, and put them in a zip for you.

I've gone through and deleted a couple hundred images that weren't icons - things like blue backgrounds and the like, but the rest should be here.

Enjoy.

Integrating Solr Search with Django at CourtListener

Over the past few weeks, I've been hard at work on the new version of CourtListener. Unfortunately, progress has been slower than I'd like due to the limitations of the Solr frameworks I've been using. There are a number of competing frameworks available, each with its own strengths and pitfalls.

So far, I've tried two of the popular ones, Haystack and Sunburnt. I'm pretty impressed by both, but today's blog post is to outline the problems I'm having with these frameworks so that others that are faced with choosing one might be better informed. The difference between these frameworks is vast. Haystack aims to solve all of your integration needs, while Sunburnt is a fairly lightweight wrapper around Solr.

CourtListener's needs

At CourtListener, we have some big goals for the new search version. At its core, it's essentially a search-powered site, so we have some big needs:

  • Parallel Faceted Search
  • Highlighting
  • Complex boolean searches supported by Solr's eDisMax syntax
  • Snippets below search results and in emails
  • Standard search stuff: field-level boosting, result and facet counts, field-level searching, result pagination, performance, etc.

We're currently using Sphinx Search with django-sphinx, which does a fine job, but it has some problems:

  • django-sphinx hasn't been maintained in years, and requires patching
  • django-sphinx doesn't support snippets
  • Sphinx doesn't (yet) support real time indexing (though it's in beta, I believe)
  • Sphinx doesn't have the community and features that Solr does
  • Unfamiliar syntax for users

In general, these problems aren't too difficult, but in combination, they make for a poor user experience. The last point is a real deal breaker, since most users are accustomed to making queries like [ site:google.com ], which works for Solr and Google, but not for Sphinx. In Sphinx, your query is [ @site(google.com) ]. While we could post-process the user's query to convert it from Google/Solr-style syntax to Sphinx's, doing so is unreliable and prone to failing in corner cases. Parsing queries is hard. More on this in a moment.
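To see why this kind of rewriting is fragile, consider a naive converter. This is a sketch written for this post, not code we run, and it uses the two query forms shown above as its only notion of the syntax.

```python
import re

# Naive rule: anything shaped like field:value becomes @field(value).
FIELD_QUERY = re.compile(r"(\w+):(\S+)")


def to_sphinx(query):
    """Rewrite Google/Solr-style field queries into the Sphinx form above."""
    return FIELD_QUERY.sub(r"@\1(\2)", query)
```

`to_sphinx("site:google.com")` gives `@site(google.com)`, as hoped. But a quoted phrase that merely contains a colon, such as `"ratio 3:1"`, is also rewritten, even though the user never asked for a field search. Fixing that means tracking quotes, then parentheses, then escapes, and so on.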

Let's try Haystack

In switching from Sphinx, I first tried Haystack as a solution, since it has excellent documentation and seems to be the most popular solution. I spent about two weeks learning about it and getting it in place, but ultimately, I gave up on it because I found that I was subclassing it everywhere. Haystack is a good solution, to be sure, but I found that I was:

  • Subclassing the FacetView so it could support parallel facet counts
  • Subclassing the FacetForm for another feature I needed
  • Subclassing the Solr backend so it could support Solr's highlighting syntax
  • Further subclassing the Solr backend so it could support additional Solr parameters that aren't built in
  • ...etc...

I worked on that third point for the better part of a day before deciding that Haystack wasn't for me. Rather than spending my time working on the search needs of CourtListener, I was spending most of it hacking on Haystack, and trying to understand the way it fits together. It's not unreasonably complex, but there is a LOT of documentation, and a lot of complexity that I don't need (such as the ability to switch search backends). Instead of a big solution that allows me to subclass whatever I need (which is good), I needed a lighter-weight solution that was more nimble, and which allowed me to interact with Solr in a more direct way.

Enter Sunburnt

Sunburnt is a lightweight solution that is everything that Haystack isn't. From the moment it's installed, you can start making queries without configuring Django to use it, and without really knowing much else. Its documentation is a single page, which is actually a big relief after coming from Haystack. But Sunburnt has a major problem in its design: It doesn't support just sending queries to Solr. The expectation in Sunburnt is that each system using it does post-processing on the user's query, and then submits the query to Sunburnt in stages.

So, if a user searches for "foo bar", rather than just passing that to Sunburnt, you have to split on the white space, then pass:
si.query('foo').query('bar')

At first you think, "OK, I can do that - just split on white space, no big deal." Then you start thinking about the other syntax that Solr supports, and you realize that you have a real problem if you have to split up queries appropriately. Trust me when I say that you don't want to be thinking about how to send a query like this one to Sunburnt: [ foo bar "jakarta apache"~10 ].
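A quick experiment in plain Python (no Sunburnt involved, and the helper name is made up) shows what splitting on white space does to that query:

```python
def naive_terms(query):
    """Split a user query on white space, as a first attempt might."""
    return query.split()


terms = naive_terms('foo bar "jakarta apache"~10')
# terms == ['foo', 'bar', '"jakarta', 'apache"~10']
```

The quoted phrase and its proximity operator are torn into two meaningless pieces, so a real solution needs a parser that understands Solr's query syntax.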

The author of Sunburnt will point out that there's a workaround for this problem. You can use
si.search(q='"jakarta apache"~10')

That works, to a point, but that syntax isn't supported on facets, so your facet counts won't have the same counts as your results. And so, Sunburnt, though powerful and lightweight, fails.

What now?

Good question.

Tripling Down

I'm tripling down on my donations that I placed at the beginning of the year. Feels right.

The abolishment of the Emergency Court of Appeals (April 18, 1962)

One of the coming features at CourtListener is an API for the law. Part of that feature is going to be some basic information about the courts themselves, so I spent some time over the weekend researching courts that served a special purpose but were since abolished.

One such court was the Emergency Court of Appeals. It was created during World War II to set prices, and, naturally, was the court of appeals for many cases. The creation date of the court is prominently published in various places on the Internet, but the abolishment history of the court was very difficult to find. After researching online for some time, and learning that my library card had expired (sigh), I put in a query with the Library of Congress, which provides free research of these types of things.

Within a couple of days, they provided me with this amazing response, which I'm sharing here, and on the above Wikipedia article:

As stated in the Legislative Notes to 50 U.S. Code Appendix §§ 921 to 926, as posted at
http://www.law.cornell.edu/uscode/html/uscode50a/usc_sec_50a_00000921---..., the following explanation is given regarding the amendment and repeal of Act of Jan. 30, 1942, ch. 26, title II, § 204, 56 Stat. 23, 31-33:

"Section 924, acts Jan. 30, 1942, ch. 26, title II, § 204, 56 Stat. 31; June 30, 1944, ch. 325, title I, § 107, 58 Stat. 639; June 30, 1945, ch. 214, § 6, 59 Stat. 308; July 30, 1947, ch. 361, title I, § 101, 61 Stat. 619; June 25, 1948, ch. 646, § 32(a), 62 Stat. 991; May 24, 1949, ch. 139, § 127, 63 Stat. 107, authorized review of orders of the Office of Price Administrator under the Emergency Price Control Act of 1942, and created the Emergency Court of Appeals for this purpose. The Emergency Price Control Act of 1942 terminated on June 30, 1947, under the provisions of act July 25, 1946, ch. 671, § 1, 60 Stat. 664. The Housing and Rent Act of 1948, act Mar. 30, 1948, ch. 161, 62 Stat. 93, classified to section 1881 of this Appendix, continued the Court for the purpose of reviewing recommendations of local advisory boards for the decontrol or adjustment of maximum rents. Later, the Defense Production Act of 1950, act Sept. 8, 1950, ch. 932, 64 Stat. 798, classified to sections 2061 to 2166 of this Appendix, continued the Court to review regulations and orders relating to price control. The Housing and Rent Act of 1948 and the Defense Production Act of 1950 both terminated, however, the Court remained in existence “to complete the adjudication of rights and liabilities incurred prior to their termination dates.” (Transcript of Proceedings of the Final Session of the Court, 299 F.2d 1.) The final decision of the Court, Rosenzweig v. General Services Administration, 1961, 299 F.2d 22, was decided on Dec. 6, 1961. A petition for rehearing was denied on Jan. 2, 1962, and a petition for writ of certiorari to the Supreme Court of the United States was denied on Mar. 19, 1962, 82 S. Ct. 830.

The order of Chief Judge Albert B. Maris, set forth in 299 F.2d 20, provided:

“The business of this Court having been completed, it is ordered that at the expiration of 30 days from this date, if a petition for certiorari has not been filed in the Supreme Court in Case No. 676 [Rosenzweig v. General Services Administration], just decided, the acting clerk shall deliver the records and papers of the Court in his office to the General Services Administration for permanent custody as records of the Government, and shall thereupon inform the Chief Justice of the United States that the work of the Court has been completed and that the designations of the judges of the Court may therefore appropriately be terminated.

“If a petition for certiorari is filed in Case No. 676 this order shall take effect and be carried out at the expiration of 30 days after the final disposition of Case No. 676.”

In accordance with the terms of this order, the petition for certiorari having been filed, and denied Mar. 19, 1962, the Court terminated on Apr. 18, 1962."

Pretty fantastic research. And for free! Thanks LOC.

Project Idea: "Programming library for curse words"

When programming, there are occasionally times when you need to detect or block curse words. At CourtListener, for example, we make URLs with ID numbers in them that are formed by converting an ID number to letters (so 1 → a, 2 → b, 27 → A, etc.). Higher numbers create longer strings of letters, so over time, this creates curse words in the URL. Currently, the site only has a few four-letter strings, but I will rue the day when any of the seven dirty words is being shown to users on my site.
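For the curious, here is one way such a conversion could work. This is a guess at the scheme based only on the three examples above (lowercase first, then uppercase, with longer strings once the alphabet runs out), not the actual CourtListener code.

```python
import string

# Assumed alphabet: a-z for 1-26, then A-Z for 27-52.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase


def id_to_letters(number):
    """Convert a positive ID number to a string of letters."""
    if number < 1:
        raise ValueError("ID numbers start at 1")
    letters = []
    while number > 0:
        # Subtract 1 so that 1 maps to the first letter, not the second.
        number, remainder = divmod(number - 1, len(ALPHABET))
        letters.append(ALPHABET[remainder])
    return "".join(reversed(letters))
```

Under this assumption, 1 becomes `a`, 27 becomes `A`, and 53 is the first two-letter string, `aa`. Every possible letter combination eventually shows up, which is exactly the problem.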

There are many lists of curse words on the web, but none that is maintained or curated. Having that alone would be a useful project. What would make it better would be libraries in popular programming languages that efficiently told you if a string contained a curse word.

The next feature would be to add additional languages, and then to add words like pen1s, which aren't normally curse words, but are certainly words you'd want to eliminate.
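A first cut at such a library could be very small. In this sketch the word list is a harmless placeholder and the character substitutions are just a few common ones, so treat both as assumptions rather than a curated list.

```python
# A few common look-alike substitutions; a real list would be much longer.
LOOKALIKES = str.maketrans(
    {"1": "i", "3": "e", "4": "a", "0": "o", "5": "s", "$": "s", "@": "a"}
)


def contains_blocked_word(text, blocked_words):
    """Return True if text contains any blocked word, ignoring case
    and undoing simple look-alike substitutions first."""
    normalized = text.lower().translate(LOOKALIKES)
    return any(word in normalized for word in blocked_words)
```

With the placeholder list `{"darn"}`, the string `xxD4RNxx` is caught while `hello` passes. A substring check like this will also flag innocent words that happen to contain a blocked one, which is one of the things a maintained library would need to handle.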

It'd be a pretty simple project, so I may just go for it.

Only question is, what do I name it?

Firefox images and icons

A few weeks ago, I was in need of a free star icon, and for the life of me, I couldn't find one that was quite right. I scoured the Internet, all the while noticing the perfect little star in my browser.

A couple of times, I Googled for Firefox icons, but I couldn't find them posted anywhere. Finally, I realized that if I just downloaded the source, I could easily find the icons, zip them up neatly, and post them for all to share.

So that's what I've done here. Behold the Firefox icons. These are organized more usefully in the actual Firefox source, but if you don't mind a little icon browsing, the attached zip should have all the icons you see in your browser.

Enjoy.

PS - these are licensed under the following licenses: MPL 1.1/GPL 2.0/LGPL 2.1, and are copyright Mozilla.
