privacy

Support for x-robots-tag and robots HTML meta tag

As part of our research for our post on how we block search engines, we looked into which search engines support which privacy standards. This information doesn't seem to exist anywhere else on the Internet, so below are our findings, starting with the big guys, and moving towards more obscure or foreign search engines.

Google, Bing

Google (known as Googlebot) and Bing (known as Bingbot) support the x-robots-tag header and the robots HTML tag. Here's Google's page on the topic. And here's Bing's. The msnbot is retired.

Yahoo, AOL

Yahoo!'s search engine is provided by Bing. AOL's is provided by Google. These are easy ones.

Ask, Yandex, Nutch

Ask (known as teoma) and Yandex (Russia's search engine, known as yandex) support the robots meta tag, but do not appear to support the x-robots-tag header. Ask's page on the topic is here, and Yandex's is here. The popular open source crawler, Nutch, also supports the robots HTML tag, but not the x-robots-tag header.

The Internet Archive, Alexa

The Internet Archive uses Alexa's crawler, which is known as ia_archiver. This crawler does not seem to support either the HTML robots meta tag or the x-robots-tag HTTP header. Their page on the subject is here. I have requested more information from them, and will update this page if I hear back.

Duckduckgo, Blekko, Baidu

Duckduckgo and Blekko support neither the robots meta tag nor the x-robots-tag header, per emails I've exchanged with each of them. I also requested information from Baidu, but their response totally ignored my question and was in Chinese. They do have some information here, but it does not seem to cover the noindex value for the robots tag. In any case, the only way to block these crawlers seems to be via a robots.txt file.

Respecting privacy while providing hundreds of thousands of public documents

At CourtListener, we have always taken privacy very seriously. We have over 600,000 cases currently, most of which are available on Google and other search engines. But in the interest of privacy, we make two broad exceptions to what's available on search engines:

  1. As is stated in our removal policy, if someone gets in touch with us in writing and requests that we block search engines from indexing a document, we generally attempt to do so within a few hours.
  2. If we discover a privacy problem within a case, we proactively block search engines from indexing it.

Each of these exceptions presents interesting problems. In the case of requests to prevent indexing by search engines, we're often faced with an ethical dilemma, since in many instances, the party making the request is merely displeased that their involvement in the case is easy to discover and/or they are simply embarrassed by their past. In this case, the question we have to ask ourselves is: Where is the balance between the person's right to privacy and the public's need to access court records, and to what extent do changes in practical obscurity compel action on our part? For example, if someone convicted of murder or child molestation is trying to make information about their past harder to discover, how should we weigh the public's interest in easily locating this information via a search engine? In the case of convicted child molesters, we can look to Megan's law for a public policy stance on the issue, but even that forces us to ask to what extent we should chart our own path, and to what extent we should follow public policy decisions.

On the opposite end of the spectrum, many of the cases that we block search engines from indexing are asylum cases where a person has lost an attempt to stay in the United States, and been sent back to a country where they feel unsafe. In such cases, it seems clear that it's important to keep the person's name out of search engine results, but still we must ask: to what extent do we have an obligation to proactively identify such cases and block them from appearing, rather than doing so post hoc?

In both of these scenarios, we have taken a middle ground that we hope strikes a balance between the public's need for court documents and an individual's desire or need for privacy. Instead of either proactively blocking search engines from indexing cases or keeping cases in search results against a party's request, our current policy is to block search engines from indexing a web page as each request comes in. We currently have 190 cases that are blocked from search results, and the number increases regularly.

Where we do take proactive measures to block cases from search results is where we have discovered unredacted or improperly redacted social security numbers in a case. Taking a cue from the now-defunct Altlaw, whenever a case is added, we look for character strings that appear to be social security numbers, tax ID numbers or alien ID numbers. If we find any such strings, we replace them with x's, and we try to make sure the unredacted document does not appear in search results outside of CourtListener.
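The scan-and-replace step described above can be sketched roughly as follows. This is only an illustration, not our production code: the three patterns (and the function name `redact`) are assumptions about what such strings commonly look like, and real documents will contain formats these patterns miss.

```python
import re

# Assumed formats, for illustration only:
#   social security number: 123-45-6789
#   tax ID number:          12-3456789
#   alien ID number:        "A" followed by 8 or 9 digits
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
TAX_ID_RE = re.compile(r"\b\d{2}-\d{7}\b")
ALIEN_ID_RE = re.compile(r"\bA-?\d{8,9}\b")


def redact(text):
    """Replace the digits of suspicious strings with x's.

    Returns (redacted_text, found). `found` is True when at least one
    string was replaced, which is the signal to also block the page
    from search engines.
    """
    found = False
    for pattern in (SSN_RE, TAX_ID_RE, ALIEN_ID_RE):
        # Keep hyphens and the leading "A"; only the digits become x's.
        text, count = pattern.subn(
            lambda match: re.sub(r"\d", "x", match.group()), text
        )
        found = found or count > 0
    return text, found
```

For example, `redact("SSN 123-45-6789 on file")` returns `("SSN xxx-xx-xxxx on file", True)`.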

The methods we have used to block cases from appearing in search results have evolved over time, and I'd like to share what we've learned so others can give us feedback and learn from our experiences. There are five technical measures we use to keep a case out of search results:

  1. robots.txt file
  2. HTML meta noindex tags
  3. HTTP X-Robots-Tag headers
  4. sitemaps.xml files
  5. The webmaster tools provided by the search engines themselves

Each of these deserves a moment of explanation. robots.txt is a protocol that is respected by all major search engines internationally, and which allows site authors (such as myself) to identify web pages that shouldn't be crawled. Note that I said "crawled," not "indexed." This is a very important distinction, as I'll explain momentarily.

The HTML robots meta tag is a tag that you place in the HTML of a page to instruct search engines not to index that page. Since this is an HTML format, this method only works on HTML pages.

HTTP X-Robots-Tag headers are similar to HTML meta tags, but they allow site authors to request that any item not be indexed. That item may be an HTML page, but equally, it may be a PDF or even an image that should not be searchable.
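To make the two mechanisms concrete, here is a small sketch of each. The helper names are hypothetical, and `add_noindex_meta` assumes the page contains a literal `<head>` tag with no attributes; a real site would do this in its templates or web framework.

```python
# The tag that goes into the <head> of an HTML page.
NOINDEX_META = '<meta name="robots" content="noindex">'


def add_noindex_meta(html):
    """Insert the robots meta tag right after the opening <head> tag."""
    return html.replace("<head>", "<head>" + NOINDEX_META, 1)


def noindex_headers(headers, blocked):
    """Return a copy of the response headers for any kind of file.

    When `blocked` is True, the X-Robots-Tag header is added, which
    works for PDFs and images as well as for HTML.
    """
    result = dict(headers)
    if blocked:
        result["X-Robots-Tag"] = "noindex"
    return result
```

The header version is the more general of the two, since it travels with the HTTP response rather than inside the document.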

Further, we provide an XML sitemap that search engines can understand, and which tells them about every page on the site that they should crawl and index.
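A minimal sketch of generating such a sitemap with Python's standard library is below. The function name is hypothetical, and a real sitemap for a large site would also need to be split into several files and could carry optional fields that are left out here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the URL for us.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```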

All of these elements fit together into a complicated mélange that has absorbed many development hours over the past two years, as different search engines interpret these standards in different ways.

For example, Google and Bing interpret the robots.txt files as blocks to their crawlers. This means that web pages listed in robots.txt will not be crawled by Google or Bing, but that does not mean those pages will not be indexed. Indeed, if Google or Bing learn of the existence of a web page (for example, because another page linked to it), then they will include it in their indexes. This is true even if robots.txt explicitly blocks robots from crawling the page, because to include it in their indexes, they don't have to crawl it — they just need to know about it! Even your own link to a page is sufficient for Google or Bing to know about the page. And what's worse, if you have a good URL with descriptive words within it, Google or Bing will know the terms in the URLs even when they haven't crawled the page. So if your URL is example.com/private-page-about-michael-jackson, a query for [ Michael Jackson ] could certainly bring it up, even if it were never crawled.
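The crawl-versus-index distinction shows up even in Python's own robots.txt parser, which can only answer one question: may this agent fetch this URL? The sketch below uses the example URL from above with a made-up robots.txt. Nothing in the file, and nothing in the parser, says anything about whether a URL may be listed in an index.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that blocks crawling of the private page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private-page-about-michael-jackson
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, agent="Googlebot"):
    """True if robots.txt lets `agent` fetch `url`. Says nothing about indexing."""
    return parser.can_fetch(agent, url)
```

Here `may_crawl("https://example.com/private-page-about-michael-jackson")` is false, yet the URL itself, descriptive words and all, is still known to anyone who sees a link to it.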

The solution to this is to allow Google and Bing to crawl the pages, but to use noindex meta tags or X-Robots-Tag HTTP headers. If these are in place, the pages will not appear in the index at all. This sounds paradoxical: to exclude pages from appearing in Google and Bing, you have to allow them to be crawled? Yes, that's correct. Furthermore, it's theoretically possible that Google or Bing could learn about a page on your site from a link, and then not crawl it immediately or at all. In this case, they will know the URL, but won't know about any X-Robots-Tag headers or meta tags. Thus, they might include the document against your wishes. For this reason, it's important to include private pages in your sitemap.xml file, inviting and encouraging Google and Bing to crawl the page specifically so the page can be excluded from their indexes.

Yahoo! uses Bing to power their search engine, and AOL uses Google, so the above strategy applies to them as well.

Other search engines take a different approach to robots.txt. Ask.com, The Internet Archive and the Russian search engine Yandex.ru all respect the robots meta tag, but not the x-robots-tag HTTP header. Thus, for these search engines, the strategy above works for HTML files, but not for any other files. These crawlers therefore need to be blocked from accessing those other files. On the upside, unlike Google and Bing, it appears that these search engines will not show a document in their results if they have not crawled it. Thus, using robots.txt alone should be sufficient.

A third class of search engines supports neither the robots HTML meta tag nor the x-robots-tag HTTP header. These are typically less popular or less mature crawlers, and so they must be blocked using robots.txt. There are two approaches to this. The first is to list blocked pages individually in the robots.txt file, and the second is to simply block these search engines from all access. While it's possible to list each private document in robots.txt, doing so creates a privacy loophole, since it lists all private documents in one place. At CourtListener, therefore, we take a conservative approach, and completely block all search engines that do not support the HTML meta tag or the x-robots-tag HTTP header.
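The second approach, blocking a crawler from the whole site, produces a short robots.txt that reveals nothing about which documents are private. A rough sketch of generating one follows. The user-agent tokens in `BLOCKED_AGENTS` are assumptions for illustration and should be checked against each crawler's own documentation before use.

```python
# Assumed user-agent tokens; verify each against the crawler's documentation.
BLOCKED_AGENTS = ["DuckDuckBot", "Blekkobot", "Baiduspider"]


def build_robots_txt(blocked_agents):
    """Block the listed crawlers from the entire site; allow everyone else."""
    lines = []
    for agent in blocked_agents:
        # "Disallow: /" shuts this crawler out of every path.
        lines += ["User-agent: " + agent, "Disallow: /", ""]
    # An empty Disallow means "nothing is off limits" for all other agents,
    # so crawlers that honor noindex can still fetch pages and see it.
    lines += ["User-agent: *", "Disallow:"]
    return "\n".join(lines) + "\n"
```

Because no individual document paths appear in the output, the file cannot be used as a directory of private pages.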

The final action we take when we receive a request that a document on our site stop appearing in search results is to use the webmaster tools provided by the major search engines [1] to explicitly ask those search engines to exclude the document(s) from their results.

Between these measures, private documents on CourtListener should be removed from all major and minor search engines. Where possible this strategy takes a very granular approach, and where minor search engines do not support certain standards, we take a conservative approach, blocking them entirely.

  1. We use Google's Webmaster Tools and Bing's Webmaster Tools. Before it was merged into Bing's tools, we also used Yahoo's Site Explorer.

Analyzing Facebook's Security Mechanisms

For my Privacy, Security and Cryptography class, we studied a set of 13 principles for secure systems:

  1. Security is Economics
  2. Least Privilege
  3. Use Fail-Safe Defaults
  4. Separation of Responsibility
  5. Defense in Depth
  6. Psychological Acceptability
  7. Usability
  8. Ensure Complete Mediation
  9. Least Common Mechanism
  10. Detect if You Cannot Prevent
  11. Orthogonal Security
  12. Don’t Rely on Security Through Obscurity
  13. Design Security in, From the Start

For our midterm, we were asked to analyze how Facebook exemplifies or does not follow these principles. It was an interesting assignment, which finally forced me to think more thoroughly about Facebook's security policies, and I'm happy to attach my findings here.

For some people these may be rather run-of-the-mill notes. Others may be surprised at the poor security of the world's biggest photo and social networking site.

Enjoy.

Testing Deletion Speed of Online Photo Sites

Update, 2010-03-08: Added an image at drop.io
Update, 2010-01-28: Added an image at Orkut.com
Update 2, 2010-01-28: At the FTC round table today, Facebook's director of public policy, Tim Sparapani, claimed that information deleted from Facebook cannot be retrieved even by Facebook staff, because it is almost instantly deleted. I informed him this was not true in the case of pictures, and he said he would look into it. Will update this post when/if I hear more.

Imagine an embarrassing photo of you is placed online by one of your friends. You ask them to take it down, and they do. Now, imagine that your enemy had gotten a link to that photo, and had posted it to their blog. You'd hope that your friend taking the photo down would in fact delete the photo, but I'm sorry to say that isn't always the case.

Inspired by Jacqui Cheng's article, I decided to test some of the more popular online services for photo hosting to see what happens when you "delete" a photo from their site. On November 14th, 2009, I uploaded and then deleted the following image of a black box with white text to Facebook, Flickr, Picasa, MySpace, Photobucket, Shutterfly, Twitpic and WalMart:

When you look below, if you can see the black box for a site, that means that it was not truly deleted and is still live. You can verify this by clicking on the image. This is checked each time this page is loaded, so the information is constantly verified. If the image has been deleted, you will see the date that it was deleted.
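A check like the one this page performs could be sketched as follows. This is an assumption about one reasonable way to do it, not a description of how this page is actually built: the function names are hypothetical, and treating "HTTP 200" as "still live" is a simplification.

```python
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code of a HEAD request to `url`."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as exceptions; keep the code.
        return error.code


def is_still_live(url, get_status=http_status):
    """Treat a 200 response as "the photo was not really deleted".

    `get_status` can be swapped out, which makes the check easy to
    try without touching the network.
    """
    return get_status(url) == 200
```

A service that answers 200 with a placeholder image would be misreported as live by this check; comparing the returned bytes against the original file would be stricter.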

There are a number of reasons why photo services might be lazy about properly removing images from their site, but until they have proper deletion mechanisms, we should all think twice about what we upload.

If there's a service that is not shown here that you'd like to see, please let me know. And now, without further ado, I present, the ongoing results of the test:

Facebook:

This file was properly deleted from their server as of at least May 27, 2010.

Flickr:

ED: Flickr began showing the following message approximately an hour after the image was "deleted."

Picasa:

This file was properly deleted from their server as of at least 15 November 2009.

MySpace:

Photobucket:

This file was properly deleted from their server as of at least 14 November 2009.

Shutterfly:

Twitpic:

This file was properly deleted from their server as of at least 14 November 2009.

Walmart:

Google Orkut (added 2010-01-28 - disregard the date in the image itself)

Drop.io (added 08 March 2010)

This file was properly deleted from their server as of at least 8 March 2010.

Rethinking Facebook Privacy Settings

Ars Technica has an article today outlining some excellent techniques for safeguarding your privacy while using Facebook. One of the best methods explained in the article is to cordon off your friends into different groups of people, and to then set different permissions for those groups. Thus, the common technique is to put your ex-partners into one group, your friends into another, family into another, and so on down the line.

But in practice this technique is nigh on impossible. I have family members (such as cousins) that are close friends, and so-called friends that, really, I haven't talked to since high school. Beyond this, managing the groups is a problem too since over time, some of your friends become closer and others more distant.

Thinking through this problem, I have come up with a better, and perhaps more obvious, solution: Simply organize your Facebook friends into groups based on how much you want those people to know about you. In practice I found this to be fairly simple with only three groups: Loose Privacy, Standard Privacy, and Strict Privacy. Bosses, ex-partners and distant friends go into the Strict category, close friends and current partners go into the Loose category, and everybody else goes into the Standard category.

Admittedly, this dumbs down the power that Facebook gives you to categorize your friends into groups, but in practice, it's much easier to maintain, since there are only three lists, and it's clear who belongs in which.

A second group of settings that people are likely unaware of are those that "limit what types of information your friends can see about you through applications." These are important and creepy because by default, when your friends install an application, that application can see and aggregate an incredible quantity of information about you, even without your or your friend's permission or knowledge. As part of its dotrights campaign, the ACLU is currently working on an application that demonstrates this loophole, but for the moment, it's probably wise to adjust these settings.

To adjust these settings so third-party applications can see as little information as possible (without your friends simply not using them), go to Settings > Privacy > Applications, and then click on the "Other" tab (this link should also work, if you're logged in). Once on that page, uncheck all of the boxes in the first section, and save your settings.

Impossibility of Complete Privacy

A short quote for you today from Property, Privacy and Personal Data by Paul M. Schwartz:

Already in the offline world, and in no small irony, direct marketers generate and sell lists of people who have expressed an interest in protecting their privacy.

In other words, even if you don't participate, you've participated. Gotcha.
