Generating PDFs on App Engine Python, and introducing Mapvelopes

Posted by Nick Johnson | Filed under python, tech, app-engine, cookbook, coding

This is the first of two posts covering the technologies used to implement the Mapvelopes app, an App Engine app that generates customized printable envelopes with the map to your recipient on them.

While HTML is the lingua franca of the web, it's not the be-all and end-all. Sometimes you need your webapp to generate something slightly different, and often that something is a PDF. PDFs have the major advantage that they're designed for printing: pagination is built in, and the PDF defines the page size, so nothing about the layout is left to chance. When you need to provide something for the user to print, especially when it's complex, using a PDF can make the difference between okay output and really excellent output. Hit 'Print' in a Google Docs spreadsheet, and you'll see this in action.

PDF generation on App Engine is something that's been left largely up to individual users to figure out. Depending on your runtime - Java or Python - and your specific needs, it may be quite straightforward, or rather complicated. In particular, if you want to include images in your PDF, you're going to have to jump through some extra hoops, since most PDF libraries expect they'll be able to use an imaging library, which isn't available on App Engine.

This came up for me recently when I wanted to implement this fantastic 'map envelope' concept using the Google Maps APIs and App Engine. The generated envelopes need to be easily printable, sized correctly, and contain embedded images, text, and shapes - which pretty much made PDF a shoo-in.

Today we're going to take a look at the ReportLab Toolkit, the best known and probably best supported PDF generation library for Python. We'll look at what's required to use it for basic PDF generation on App Engine, and then what's necessary to be able to include images in a PDF from within the App Engine environment.

Getting started with ReportLab on App Engine

Due to a combination of coincidence and good coding on the part of ReportLab, the ReportLab toolkit works out of the box for basic functionality on App Engine! All you have to do is download it and copy the 'reportlab' directory into the root directory of your app. You can now use the library exactly as described in the User Guide. Let's go over a few simple examples with the webapp framework; adapting these to your choice of framework should be straightforward:

from google.appengine.ext import webapp
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

class PDFHandler(webapp.RequestHandler):
  def get(self):
    self.response.headers['Content-Type'] = 'application/pdf'
    self.response.headers['Content-Disposition'] = 'attachment; filename=my.pdf'
    c = canvas.Canvas(self.response.out, pagesize=A4)

    c.drawString(100, 100, "Hello world")
    c.showPage()
    c.save()

These are the absolute basics of PDF generation with ReportLab. First, we set two response headers - the content type, to indicate we're returning a PDF, and the content disposition, which tells the user's browser to download it, and provides a default filename. Next, we create a new Canvas object. The Canvas object is the main interface to PDF generation, and has methods that take care of common operations like outputting text and drawing shapes. In the constructor, we pass a pagesize argument. Here, we're using a predefined size, 'A4', but in general you can specify any tuple of two numbers, giving the width and height of a page. We also supply self.response.out as the file-like object to write the generated PDF document to.

A quick aside about measurements in PDFs is worthwhile here. All measurements used by the ReportLab toolkit are in 'points', which are equal to 1/72 of an inch. ReportLab provides conversion factors in the reportlab.lib.units module, so if you'd prefer, you can express sizes in any other unit - for example, after 'from reportlab.lib import units', the expression 10*units.mm is the number of points in 10 millimeters.
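To make the arithmetic concrete, here's a minimal sketch that derives the conversion factors by hand rather than importing them. The names INCH, MM, CM and page_size_mm are my own, chosen for illustration, and the 220mm x 110mm size is just an arbitrary example; in a real app you'd most likely use the constants from reportlab.lib.units instead.

```python
# Conversion factors from physical units to PDF points (1 point = 1/72 inch).
# Defined by hand for illustration; reportlab.lib.units offers equivalents.
INCH = 72.0
MM = INCH / 25.4   # there are 25.4 millimeters in an inch
CM = MM * 10.0

def page_size_mm(width_mm, height_mm):
    """Return a (width, height) tuple in points (hypothetical helper)."""
    return (width_mm * MM, height_mm * MM)

# An arbitrary 220mm x 110mm page, expressed in points.
envelope = page_size_mm(220, 110)
```

A tuple like this is the kind of value the pagesize argument described above expects.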

The other thing to remember is that the coordinate system in PDFs places 0,0 at the bottom left. If you're used to coordinate systems where 0,0 is in the top left, you'd best mentally invert your Y axis, or you'll be drawing everything upside-down!
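If you'd rather keep thinking in top-down coordinates, a tiny helper can do the inversion for you. This is only a sketch: from_top and A4_HEIGHT are names I've made up for illustration, and the A4 height is computed by hand from its 297mm dimension.

```python
def from_top(y_from_top, page_height):
    """Convert a distance measured down from the top edge of the page
    into a PDF y coordinate measured up from the bottom edge."""
    return page_height - y_from_top

# A4 is 297mm tall, which works out to roughly 841.89 points.
A4_HEIGHT = 297 * 72.0 / 25.4

# One inch (72 points) below the top edge of an A4 page.
y = from_top(72, A4_HEIGHT)
```

Bear in mind that for text the y coordinate is the baseline of the string, so the glyphs extend upward from the position you pass.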

Back to the example: the next thing we do is call the canvas's drawString method. This method takes two coordinates, again in points, and a text string, and draws the string on the current page of the PDF. PDFs use a state-machine approach, so attributes such as the font face, size, and color all depend on the current state, which can be modified with methods such as setFont, setStrokeColorRGB, and setFillColorRGB (text is drawn in the fill color).

As we've already seen, PDFs are page-oriented. The call to c.showPage() reflects this, informing the PDF library that we're done with the current page and ready (potentially) to start on a new one. Finally, c.save() tells the library we're done with the whole PDF, and it should write the rest of it to the output.
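Here's a rough sketch of how that page-by-page flow might look for a multi-page document. The write_pages helper and its default coordinates are hypothetical; it only relies on the drawString, showPage and save methods already used above, and expects to be handed a canvas.

```python
def write_pages(c, pages, x=72, y=720):
    """Draw each string in `pages` on its own page, then finish the PDF.

    `c` is expected to be a ReportLab canvas (or anything with the same
    drawString/showPage/save methods); x and y are in points.
    """
    for text in pages:
        c.drawString(x, y, text)
        c.showPage()  # done with this page; later drawing lands on a new one
    c.save()          # done with the whole document; write it out
```

In the handler above you'd call something like write_pages(c, ["Page one", "Page two"]) in place of the individual drawString, showPage and save calls.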

Drawing operations are handled similarly, using methods such as line() and rect(). All of this is covered in the guide linked to above, which is well worth a read.
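As a quick illustration of those two methods, here's a sketch that outlines the area inside a margin and draws a diagonal across it. The draw_frame helper is my own invention for this example; rect() takes the lower-left corner followed by a width and height, and line() takes two pairs of coordinates.

```python
def draw_frame(c, page_width, page_height, margin):
    """Outline the area inside `margin` and draw a diagonal across it.

    `c` is expected to be a ReportLab canvas; all values are in points.
    """
    inner_w = page_width - 2 * margin
    inner_h = page_height - 2 * margin
    # rect(x, y, width, height): x, y is the lower-left corner.
    c.rect(margin, margin, inner_w, inner_h)
    # line(x1, y1, x2, y2): lower-left to upper-right corner of the frame.
    c.line(margin, margin, margin + inner_w, margin + inner_h)
```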

Since a large part of PDF generation is likely to be concerned with outputting text, ReportLab's support for text generation is worth a closer look. While drawString is fine for short pieces of text, drawing multiple lines of text requires better support. ReportLab provides this in the form of text objects. To create one, call beginText() on a canvas object. You can then manipulate the text object to draw text as desired, and finally, call drawText() on the canvas object, passing in the text object, to draw it all to the canvas. Here's a simple example:

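from reportlab.lib.units import cm

text = c.beginText()
text.setTextOrigin(1*cm, 5*cm)  # Start 1cm from the left, 5cm from the bottom
text.setFont("Times-Roman", 14)  # The standard PDF font names are hyphenated
text.textLine("Hello world!")
text.textLine("Look ma, multiple lines!")
c.drawText(text)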

There's a lot more to text objects than this, of course - again, take a look at the user's guide for more details.
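One thing textLine() won't do for you is wrap a long paragraph. As a rough sketch, you could break the text up with the standard library's textwrap module and feed it to the text object a line at a time. The add_wrapped helper is hypothetical, and it wraps by character count rather than by rendered width, so treat it as an approximation that works best with fixed-width fonts.

```python
import textwrap

def add_wrapped(text_obj, paragraph, max_chars=60):
    """Split `paragraph` into lines of at most `max_chars` characters and
    add each one to `text_obj` with textLine(). Returns the line count."""
    lines = textwrap.wrap(paragraph, width=max_chars)
    for line in lines:
        text_obj.textLine(line)
    return len(lines)
```

You'd call it between beginText() and drawText(), in place of the individual textLine() calls.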

Images in PDFs

It's also quite possible to insert images in a PDF, of course, and ReportLab supports doing so in a limited fashion without requiring access to the PIL imaging library. Unfortunately, a couple of issues - one incompatibility and one outright bug - require us to make a couple of small patches to the library before we can use it with images.

First, open up reportlab/lib/utils.py, and go to the rl_isdir() function, starting on line 463. This function depends on some internals of the Python classloader that aren't available on App Engine, so we need to change it. Comment out the last line, line 469, and in its place, insert "return False".

Next, look at the _isPILImage() function, starting on line 520. Line 523 reads "except ImportError:". Change this to read "except AttributeError:".

Now that we've made these modifications, we can insert images in our PDFs! The PDF format effectively supports two types of image - raw image data (optionally compressed with zlib), and JPEGs. As it stands, the ReportLab library only exposes the latter to users, so we'll only be able to include JPEGs for now. If your images are in another format, you can use the Images service to convert them to JPEG.

In order to insert an image into a PDF, we first have to create an ImageReader object from the JPEG data. Then, we call the canvas's drawImage() method to actually draw it to the PDF. Here's an example:

import StringIO  # Python 2 standard library module
image = canvas.ImageReader(StringIO.StringIO(image_data))  # image_data is a raw string containing a JPEG
c.drawImage(image, 0, 0, 144, 144)  # Draw it in the bottom left, 2 inches wide and 2 inches high

Straightforward, right? The width and height arguments can be anything you like, so you can stretch and scale the image arbitrarily. What's more, if you include the same image repeatedly, the ReportLab toolkit is smart enough to embed it in the PDF only once, and reference it from multiple locations.
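If you'd rather not distort the image, you can work out a width and height that preserve its aspect ratio before calling drawImage(). Here's a minimal sketch; fit_within is a hypothetical helper, and it assumes you already know the image's pixel dimensions (I believe ImageReader's getSize() method reports these, but check that against your version of the library).

```python
def fit_within(img_w, img_h, box_w, box_h):
    """Scale (img_w, img_h) to the largest size that fits inside
    (box_w, box_h) without changing the aspect ratio."""
    scale = min(float(box_w) / img_w, float(box_h) / img_h)
    return (img_w * scale, img_h * scale)

# A 400x300 pixel image fitted into a 2 inch (144 point) square:
# the width fills the box, and the height shrinks in proportion.
width, height = fit_within(400, 300, 144, 144)
```

The resulting width and height can be passed straight to drawImage() in place of the fixed 144, 144 above.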

You may be wondering what would be required to support other image formats, such as PNG, so you can get rid of those pesky JPEG compression artefacts. In principle, this should be quite doable, but it'd require some serious hacking. In a nutshell, you'd use the pure-Python pypng library to decode the PNG, and subclass ReportLab's ImageReader class so that it gets its raw data from the decoded PNG, without involving the PIL imaging library. Perhaps a future article will cover this challenge. ;)

07 April, 2010
