Information Leakage - Scrubbing Document Formats

gAtO tHiNk that our documents have too much information about us. It’s called metadata, and it’s embedded in the picture you just took with your iPhone/Android phone. It has your geo-location and other information that you should clean up before you post it on Facebook or Pinterest, so here are a few tips to keep you paranoid.

Many document formats conveniently embed personally identifying attributes, and some even attempt to limit redistribution. This can be a problem for whistleblowers who need to deliver incriminating memos and photos to journalists, and for academic researchers who wish to publish their work electronically and anonymously.

 Microsoft Office

Microsoft Office embeds your name, machine name, initials, company name, and revision information in documents that you create.

According to Microsoft’s knowledge base article on metadata, the best way to remove all personal metadata from a document is to go to Tools | Options | Security Tab | “Remove personal information from this file on save”. Be warned that this does NOT remove hidden text and comment text that may have been added, but those tasks are also covered in that article.

Microsoft also provides the Remove Hidden Data Tool that apparently accomplishes those same functions but from outside of Microsoft Office.

This NSA Guide to sanitizing documents might also be of some interest, but I think the Microsoft KB articles cover the info better and in more depth.


 StarOffice/OpenOffice

By default, users of StarOffice/OpenOffice are not safe either. Both of these programs save personal information in XML markup at the top of documents. It can be removed by going to File | Properties, unchecking “Apply User Data”, and clicking “Delete”. Unfortunately, this does not remove creation and modification times. It’s not clear how to do that without editing the raw file in a plain text editor such as Notepad.
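
One possible way to get at those leftover timestamps, assuming the document is saved in an OpenDocument format (.odt and friends), is to treat it as what it is: a ZIP archive with the document properties in a file called meta.xml. The Python sketch below rewrites the archive with a handful of elements stripped out of meta.xml. The function names and the list of tags are made up for this illustration and the list is not a complete audit. Tracked changes and comments inside content.xml can still carry names, so this is a starting point and not a guarantee.

```python
import io
import re
import zipfile

# Elements in meta.xml that commonly carry identity or timing details.
# Assumption for illustration: this list is NOT exhaustive.
SENSITIVE_TAGS = (
    "meta:initial-creator",
    "meta:creation-date",
    "dc:creator",
    "dc:date",
    "meta:printed-by",
    "meta:print-date",
    "meta:editing-duration",
    "meta:editing-cycles",
)


def scrub_meta_xml(xml_text):
    """Return the text of meta.xml with the sensitive elements removed."""
    for tag in SENSITIVE_TAGS:
        # Match either a self-closing element or an open/close pair.
        pattern = r"<{0}(?:\s[^>]*)?/>|<{0}(?:\s[^>]*)?>.*?</{0}>".format(
            re.escape(tag)
        )
        xml_text = re.sub(pattern, "", xml_text, flags=re.DOTALL)
    return xml_text


def scrub_odf(data):
    """Return a copy of an ODF archive (as bytes) with meta.xml scrubbed."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            payload = src.read(info.filename)
            if info.filename == "meta.xml":
                text = payload.decode("utf-8")
                payload = scrub_meta_xml(text).encode("utf-8")
            # A fresh ZipInfo keeps the entry order and compression type
            # but replaces the per-file timestamp with a fixed date.
            clean = zipfile.ZipInfo(info.filename,
                                    date_time=(1980, 1, 1, 0, 0, 0))
            clean.compress_type = info.compress_type
            dst.writestr(clean, payload)
    return out.getvalue()
```

To use it, read the document as bytes, pass it through scrub_odf, and write the result to a new file (for example a hypothetical memo-clean.odt). Work on a copy and open the result in the office suite to make sure it still loads.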

 Document DRM - Digital Rights Management

Document DRM comes in all shapes and sizes, mostly with the intent to restrict who can view a document and how many times they can view or print it (in some cases even keeping track of everyone who has handled a document). For whistleblowers who need to circumvent DRM to distribute a document, the most universal approach is to use the “Print Screen” key to take a screenshot of your desktop for each page of the document, paste each screenshot into Windows Paint, and save it. Some DRM software will attempt to prevent this. That can be circumvented by installing the 30 day trial of VMware Workstation and installing a copy of Windows and the DRM reader inside it. You can then happily take screenshots using VMware’s “Capture Screen” or even the “Capture Movie” feature, and the DRM software will be none the wiser. With a little image cropping, you can produce a series of images that can be distributed or printed freely.

The VMware approach may be problematic for DRM that relies on a TPM chip. The current versions of VMware neither emulate nor provide pass-through access to the TPM. However, TPM-based DRM systems are still in the prototype stage, and since it is possible to emulate and virtualize a TPM, it should only be a matter of time before some form of support is available in VMware.

Depending on the DRM software itself, cracks may also be available to make this process much faster. Casual searching doesn’t turn up much, most likely due to the relative novelty (and public scarcity) of document-oriented DRM. When doing your own Google searching for this type of material, be sure to check the bottom of the results page for notices of DMCA 512 takedowns censoring search results. It is usually possible to recover the URLs from chillingeffects’ C&D postings. Failing that, use a Google interface from another country, such as Germany.

 Image Metadata

Metadata automatically recorded by digital cameras and photo editing utilities may also be problematic for anonymity. There are three main formats for image metadata: EXIF, IPTC, and XMP. Each format has several fields that should be removed from any image produced by a photographer, or depicting a subject, who requires anonymity. Fields such as camera model and serial number, owner name, location, date, time, and timezone are all directly detrimental to anonymity. There is even a metadata spec for encoding GPS data in images. Camera-equipped cell phones with GPS units installed for E911 purposes could conceivably add GPS tags to pictures automatically.

The WikiMedia Commons contains a page with information on programs capable of editing this data for each OS. My preferred method is to use the Perl program ExifTool, which can strip all metadata from an image with a single command: exiftool -All= image.jpg. MacOS and Linux users should be able to download and run the exiftool program without any fuss (on Ubuntu, install the package libimage-exiftool-perl). Windows users will have to install ActivePerl and run perl exiftool -All= image.jpg instead. Running exiftool without the -All= switch displays the existing metadata. The -U switch shows raw tags that the tool does not yet fully understand. As far as I can tell, the -All= switch is in fact able to remove tags that the tool does not fully understand.
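
For the curious, here is a rough Python sketch of what stripping metadata from a JPEG involves under the hood. It is not how ExifTool does it and it is no substitute for ExifTool. It walks the JPEG segments and drops the ones that usually hold metadata: APP1 (EXIF and XMP), APP13 (Photoshop/IPTC) and COM (comments). The function name and the choice of markers are assumptions for this illustration. It assumes a plain, well-formed JPEG and leaves other application segments (such as color profiles) alone, so something could still slip through.

```python
import struct

# Segment markers that usually carry metadata (assumed list, not exhaustive):
# 0xE1 = APP1 (EXIF, XMP), 0xED = APP13 (Photoshop/IPTC), 0xFE = COM.
METADATA_MARKERS = {0xE1, 0xED, 0xFE}


def strip_jpeg_metadata(data):
    """Return the JPEG bytes with the metadata segments removed."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file (missing SOI marker)")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("unexpected byte at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: compressed pixels follow, copy the rest as-is.
            out += data[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        if marker not in METADATA_MARKERS:
            out += data[pos:end]
        pos = end
    raise ValueError("no SOS marker found; file may be truncated")
```

Read the image in binary mode, pass the bytes through strip_jpeg_metadata, and write the result to a new file. Then run exiftool on the output to see what, if anything, is left.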

Another easy way to remove all metadata from an image is to open it in MS Paint, copy it, and paste it into another copy of Paint. The Windows clipboard only copies the raw pixels and leaves the metadata behind. -gAtO oUt

