Building a censorship resistant publishing system

I've had an idea related to censorship resistant publishing kicking around in my head for some time now, and it seems like it's about time I got it written down somewhere, for consideration and criticism. Part of my motivation is that I'm intending to snatch some spare time while I'm on the plane to the US this weekend (to attend I/O) to have a go at implementing a basic version of it.

In a nutshell, I have a design for what I believe would be a fairly robust censorship resistant publishing system, based on a DHT, and integrating fully with the web. Content published using this system would be available in exactly the same fashion as a regular website, which strikes me as a major advantage over many other proposals for similar systems.

The system consists of several layers, which I'll tackle in order:

  1. Document storage and retrieval
  2. Name resolution for documents within a 'site'
  3. External access
  4. Name resolution for sites

Document storage and retrieval

The lowest level of the system is also the simplest: That of storing and retrieving documents. This layer acts much like a regular DHT: Documents are hashed, and stored in the DHT keyed by that hash. Given the hash of a document, the user can retrieve it from the DHT using the standard mechanisms.
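
To make that concrete, here's a minimal sketch of the storage interface in Python. The DHT itself is hidden behind a placeholder client object with store/retrieve methods, and the choice of SHA-256 is an assumption on my part rather than a settled design decision.

    import hashlib

    def put_document(dht, content: bytes) -> str:
        """Store a document in the DHT, keyed by the hash of its content."""
        key = hashlib.sha256(content).hexdigest()
        dht.store(key, content)  # 'dht' stands in for whatever DHT client the node uses
        return key

    def get_document(dht, key: str) -> bytes:
        """Retrieve a document by its content hash, verifying it on arrival."""
        content = dht.retrieve(key)
        if hashlib.sha256(content).hexdigest() != key:
            raise ValueError("retrieved content does not match its hash")
        return content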

All the standard properties of a DHT are available to us here: The network is resistant to individual peers dropping out and rejoining; it's self-organising, so documents can be found in O(log n) hops; and it's resistant to attempts to remove documents from the system, since peers automatically replicate content to ensure it remains available if copies are removed.
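
For concreteness, 'closeness' in a Kademlia-style DHT (an assumption; nothing above commits to a particular DHT) is just the XOR of two IDs compared as integers, and a document is replicated on the few peers whose IDs are nearest to its hash:

    def xor_distance(id_a: int, id_b: int) -> int:
        """Kademlia-style distance: XOR the two IDs and compare the result as an integer."""
        return id_a ^ id_b

    def closest_peers(document_key: int, peer_ids: list[int], k: int = 3) -> list[int]:
        """The k peers 'closest' to a key are the ones responsible for storing it."""
        return sorted(peer_ids, key=lambda pid: xor_distance(document_key, pid))[:k]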

I would be remiss if I didn't point out that there are known issues with using DHTs in this fashion: Depending on the DHT, malicious peers can generate peer IDs that are close to a target document in order to disrupt serving of that document, for example. These attacks can generally be made sufficiently expensive, though, that they're impractical on a large scale, and I believe they don't pose an insurmountable problem.

Name resolution within a site

So far, so ordinary. The major problem with a system such as the one described above is one of addressing: In order to get a document, you have to be provided with its hash by some out-of-band mechanism. If the document gets modified, its hash changes, and in order to get the new version, you have to obtain the new hash. We can upload hypertext documents such as HTML in order to link between documents, but this doesn't solve the update issue, and since each link must embed the hash of a document that already exists, the links between documents have to form a DAG, making mutual linking impossible.

As is often the case, the solution here is to add an additional layer of indirection. Instead of linking directly to documents by their hash, we link using relative URLs, with any arrangement of paths we desire. After uploading all the documents and obtaining their hashes, we then construct a manifest document, which maps each path to the hash of the document that currently resides at that path.
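
As a sketch, building a manifest is just a matter of uploading each document and recording its hash against its path. This reuses the hypothetical put_document helper from earlier, and the JSON encoding is my own assumption rather than a settled format:

    import json

    def build_manifest(dht, files: dict[str, bytes]) -> str:
        """Upload each document, then upload a manifest mapping paths to hashes.

        Returns the hash of the manifest itself: the 'site hash'.
        """
        manifest = {path: put_document(dht, content) for path, content in files.items()}
        return put_document(dht, json.dumps(manifest, sort_keys=True).encode())

    # eg, site_hash = build_manifest(dht, {"/index.html": b"<html>...</html>"})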

With this change, documents are generally retrieved using a two-part key: The hash of the manifest, which we will call the 'site hash', and the path to the specific document we want to retrieve. Provided with this, the system first retrieves the manifest, then uses that to resolve the path to a document hash, which it likewise retrieves. Thus, mutual links become possible, since links in individual documents use human-designated paths, resolved using the manifest. When the site is updated, we need only distribute the new site hash in some fashion.
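
Retrieval is the mirror image: fetch the manifest by the site hash, look the path up in it, then fetch the document. Another sketch, using the same hypothetical helpers as above:

    import json

    def fetch_from_site(dht, site_hash: str, path: str) -> bytes:
        """Resolve a (site hash, path) pair to the document currently at that path."""
        manifest = json.loads(get_document(dht, site_hash))
        document_hash = manifest[path]  # a KeyError here is effectively a 404
        return get_document(dht, document_hash)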

External access

So far I have only described how documents are accessed within the system, using some as-yet-unspecified protocol. The next logical step is to provide the promised integration with the web. This is accomplished with two new components: HTTP servers and DNS servers.

Every node in the DHT runs an HTTP server. This server is configured to understand URLs of the form http://hash.somedomain.com/path. When presented with such a URL, the server first retrieves the resource referred to by the hash. If this is a manifest, it uses it to resolve the path component of the URL to a document hash, retrieves that document, and returns it to the client.
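
Here's a minimal sketch of that HTTP front-end, built on Python's standard http.server purely for illustration. The way the hash is carved out of the Host header, and the fetch_from_site helper from earlier, are assumptions about an eventual implementation rather than settled details:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    DHT = None  # placeholder for whatever DHT client this node uses

    class GatewayHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            host = self.headers.get("Host", "")
            if not host.endswith(".somedomain.com"):
                self.send_error(404)  # unknown domains are handled in the final section
                return
            site_hash = host[: -len(".somedomain.com")]
            try:
                body = fetch_from_site(DHT, site_hash, self.path or "/")
            except KeyError:
                self.send_error(404)
                return
            self.send_response(200)
            self.end_headers()
            self.wfile.write(body)

    # eg, HTTPServer(("", 80), GatewayHandler).serve_forever()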

In addition, a limited number of nodes will run DNS servers for somedomain.com. These servers are responsible for choosing which nodes in the DHT to direct users to when they resolve a particular domain (eg, hash.somedomain.com). In general, they will handle this by returning the node closest in the DHT to the hash being requested, but there is room for flexibility here, since any node can answer any query - for example, they may return a node 'further away' than the ideal one if the ideal node is already under substantial load.
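
The node-selection logic those DNS servers need is essentially the xor_distance calculation from earlier, plus whatever load signal happens to be available. A sketch; the overloaded predicate here is entirely hypothetical:

    def pick_node_for_hash(site_hash_hex: str, peers: dict[int, str], overloaded) -> str:
        """Choose which participating node's IP to return for hash.somedomain.com.

        peers maps peer IDs to IP addresses; overloaded is whatever load/health
        predicate the DNS servers actually have available (a placeholder here).
        """
        key = int(site_hash_hex, 16)
        by_distance = sorted(peers, key=lambda pid: xor_distance(key, pid))
        for peer_id in by_distance:
            if not overloaded(peer_id):
                return peers[peer_id]
        return peers[by_distance[0]]  # everything is busy; fall back to the closest node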

Hopefully some of the censorship resistant properties of this layer are already obvious: Any of the individual nodes running HTTP servers can be removed from the network without undue disruption: The DNS servers will simply stop returning them as results for queries. DNS servers can likewise be removed with minimal disruption - the other DNS servers will stop listing them as authorities for the domain. In the event that one of the root DNS servers listed in the domain's glue data is removed, some manual intervention is required, but it is once again fairly straightforward to recover from.

Name resolution for sites

The one remaining issue is that of providing a way for users to find the site, given that the site's hash will change from time to time, and is anything but memorable. For this, we once again turn to the DNS system for help, with a fairly simple strategy.

Anyone wishing to host a site on this system with a 'friendly' name does the following:

  1. Purchase a suitable domain name
  2. Add a CNAME record for a subdomain (eg, www.), pointing to the site's current root (eg, somehash.somedomain.com)
  3. Update that CNAME whenever they modify the site (a sketch of automating this step follows the list)
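
Step 3 is the only recurring chore, and it can be automated. Here's a sketch using dnspython, assuming the domain's DNS host supports RFC 2136 dynamic updates (many registrars expose their own APIs instead); the domain, key, and nameserver address are all placeholders:

    import dns.query
    import dns.tsigkeyring
    import dns.update

    def point_site_at(new_site_hash: str) -> None:
        """Repoint www.myrobustsite.com at the freshly published site hash."""
        keyring = dns.tsigkeyring.from_text({"update-key.": "c2VjcmV0"})  # placeholder TSIG secret
        update = dns.update.Update("myrobustsite.com", keyring=keyring)
        update.replace("www", 300, "CNAME", new_site_hash + ".somedomain.com.")
        dns.query.tcp(update, "203.0.113.53")  # the zone's primary nameserver (placeholder IP)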

In this fashion, we rely on the DNS system to resolve a friendly domain name into the current site hash for us. When a user enters a domain into their browser, the resolution process works as follows:

  1. Their browser resolves www.myrobustsite.com, which returns a CNAME to somehash.somedomain.com
  2. Their browser resolves somehash.somedomain.com, which is handled by the DHT's DNS servers, which return the IP of a participating node close to that hash.
  3. The browser makes an HTTP request to the returned IP address, with a Host header of www.myrobustsite.com
  4. The DHT node notes that it does not recognize the domain in the Host header, and performs its own DNS lookup on the domain, which returns the CNAME somehash.somedomain.com (sketched in code after this list)
  5. The DHT node treats the request as if it came in on somehash.somedomain.com, and serves the request as described in the previous section.
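
Steps 4 and 5 are the only new behaviour on the node's side: when the Host header isn't one of its own hash domains, it chases the CNAME itself. A sketch using dnspython; the helper name and the suffix check mirror the earlier assumptions, and error handling is omitted:

    import dns.resolver

    def site_hash_for_host(host: str) -> str:
        """Map an incoming Host header to a site hash, chasing a CNAME if necessary."""
        if host.endswith(".somedomain.com"):
            return host[: -len(".somedomain.com")]
        answer = dns.resolver.resolve(host, "CNAME")
        target = answer[0].target.to_text(omit_final_dot=True)
        if not target.endswith(".somedomain.com"):
            raise ValueError(host + " does not point into the system")
        return target[: -len(".somedomain.com")]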

Vulnerability: The DNS system and registrars

Observant readers will have noticed that this system is heavily dependent on the DNS system, and on registrars. To some degree this is true, but I believe there are mitigating factors:

First, challenges to domain names are rarely made, and where they are unjust, they are frequently reversed. It is difficult to justify taking down or censoring a domain when the domain itself is not responsible for any of the content the attacker wishes to censor.

Second, it's relatively easy to serve from multiple domains, and to shift from one to another. While the removal of a domain causes significant disruption, it is far from a death knell for such a system, and users will adapt as long as the system proves robust enough for day-to-day usage. A profusion of names can make it impossible for a would-be censor to keep track of all the ways of addressing such a system.

Third, many other robust systems depend on the domain name system. Sites like Wikileaks are in principle vulnerable to the same challenges, but survive for many of the reasons detailed above.

Disclaimer

I hope it is not necessary - but fear it is - for me to explicitly point out that I don't condone using systems such as the one I'm proposing for illegal purposes. However, many unjust censorship regimes exist, many or all of them operating without due process, review, or scrutiny of any kind, and I believe it's important to develop systems that combat such unjust misapplications of technology. I also find systems like this one of great academic interest!

It should also go without saying that this is my own personal project, and has no bearing on my day job. Neither Google nor anyone else endorses my personal enthusiasm for side projects such as this one.

Feedback

What do you think? I'm the first to acknowledge that what I'm proposing is not an 'ideal' system, but in the absence of any such ideal system, I believe this one makes a good enough set of tradeoffs to be worth developing. Your feedback is thus greatly appreciated, especially if you have ideas on how to improve it, since such ideas are far easier to integrate in the design phase than after implementation!

As is frequently the case, I'm also desperate for naming ideas, since working on a project is tough if you don't know what to call it.
