Using OpenOffice to batch convert file formats on OS X
In my increasingly frustrating attempts at finding some use for OpenOffice, I tackled file formats, including batch file conversion. As you can expect for an application that leans on Java, it wasn’t easy.
(thanks to Dominika Stempniewicz for pointing out the path to OpenOffice has changed)
OpenDocument file format: what is it good for?
I started using OpenOffice mainly because I needed to convert Office Open XML files for my old version of Office. Not knowing much about the matter, I assumed Microsoft must have embraced a public XML standard for documents to help in its various anti-trust court cases, and that this public standard was based on, or at least compatible with, OpenOffice’s own format (well, they both have the word ‘Open’ in it, right?).
Not so.
Turns out that while OpenOffice can read Office Open XML files, it can only save them in older Office formats such as xls and csv, plus its own OpenDocument format, which is similar but not the same. Both kinds of file look like binary blobs when opened in a text editor, so they cannot be manipulated directly as text. Weren’t they meant to be XML formats, I asked myself?
Getting the XML out of OpenDocument and Office Open XML files
Turns out that both formats store content, formatting data and assets in one single package file: Office Open XML uses Microsoft’s Open Packaging Conventions, and OpenDocument uses its own, very similar, package layout. What that means in practice is that both OpenDocument (.odt, .ods, etc) and Office Open XML (.docx, .xlsx etc) are basically zip archives. That’s right: just unzip them, and they turn out to be directories. In an OpenDocument file the text content can easily be obtained from the content.xml file, at the root of the folder (Office Open XML keeps it a level down, for example in word/document.xml for a .docx). Which means, finally there’s an easy way to handle Office documents in languages like PHP.
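Since the file is just a zip archive, the XML can also be pulled out from the Terminal without unpacking anything to disk. What follows is only a rough sketch: it assumes an OpenDocument file and the unzip, sed and tr commands that ship with OS X, odf_text is a name made up for this example, and the tag stripping is deliberately crude.

```shell
# odf_text FILE: print the text of an OpenDocument file (.odt, .ods, ...).
# Hypothetical helper for illustration; XML entities such as &amp; are left as they are.
odf_text() {
  # unzip -p writes content.xml to standard output instead of to disk,
  # sed swaps every XML tag for a space, tr squeezes repeated spaces into one.
  unzip -p "$1" content.xml | sed -e 's/<[^>]*>/ /g' | tr -s ' '
}
```

After pasting the function into Terminal, odf_text test.odt should print the text of the document. For a .docx file the same trick ought to work with word/document.xml in place of content.xml.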
Batch converting OpenDocument and Office Open XML files with JODConverter
Well, I said an ‘easy way’, but it is still quite awkward to have to unzip a file and then navigate through the XML nodes. Not to mention that converting back is not a simple matter of re-zipping the package and changing the file suffix. So it is still useful to find ways to batch convert files into easier-to-handle formats, such as csv or txt.
Art of Solving provides JODConverter, a Java Library that can be run as a web service or from the command line. As always with Java libraries, it is a little bit awkward.
Installing JODConverter on OS X
Macs come all set up to run Java, so all you need to do is unzip the JODConverter package and put it somewhere out of the way (I put all my Java libraries into /Library/Java/Extensions/ ). JODConverter itself is ready to run, but we are not ready to start converting documents yet.
JODConverter does not do the conversion itself; rather, it asks OpenOffice to do it. This will only work if OpenOffice is set up to run as a service, i.e., the same as if it were a web server or a database running on your machine. To do that on OS X, type the following in a Terminal window (Applications / Utilities / Terminal.app ):
/Applications/OpenOffice.org.app/Contents/MacOS/soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard
That will start OpenOffice as a service (you can still run OpenOffice normally as an application at the same time). Now you can start converting documents from the command line, for example:
cd /Users/gotofritz/Documents
cd … (you should type your username rather than gotofritz) puts you inside your Documents folder. Another way to navigate to a folder is to type “cd ” with the trailing space, drag the folder you want to go to onto the Terminal window, and press return.
java -jar /Library/Java/Extensions/jodconverter-2.2.1/lib/jodconverter-cli-2.2.1.jar test.ods test.csv
The java line above is where the conversion takes place: in this case from an OpenOffice spreadsheet called ‘test.ods’ to a csv file.
java -jar /Library/Java/Extensions/jodconverter-2.2.1/lib/jodconverter-cli-2.2.1.jar -f swf *.ppt
This command will batch process all PowerPoint files (*.ppt) in the ‘current’ directory into swf. Blimey! Now, that could be useful.
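The same batch job can also be spelled out as a shell loop over the single-file form, which makes it easy to check first that OpenOffice is actually listening. This is a sketch rather than anything JODConverter provides: convert_all, oo_listening and JOD_JAR are names made up for this example, the jar path is assumed to be the one used in the commands above, and the port check assumes the nc command that ships with OS X.

```shell
# Path to the JODConverter command-line jar (assumed: same location as in the commands above).
JOD_JAR="${JOD_JAR:-/Library/Java/Extensions/jodconverter-2.2.1/lib/jodconverter-cli-2.2.1.jar}"

# Succeeds if something is accepting connections on the port OpenOffice was started with (8100).
oo_listening() {
  nc -z 127.0.0.1 8100 >/dev/null 2>&1
}

# convert_all FROM TO: convert every *.FROM file in the current folder to *.TO.
# Hypothetical helper for illustration, not part of JODConverter.
convert_all() {
  from="$1"
  to="$2"
  if ! oo_listening; then
    echo "OpenOffice is not listening on port 8100; start soffice first" >&2
    return 1
  fi
  for f in *."$from"; do
    # When nothing matches, the pattern is left unexpanded, so skip it.
    [ -e "$f" ] || continue
    # Same single-file call as before: input name, then output name.
    java -jar "$JOD_JAR" "$f" "${f%.$from}.$to"
  done
}
```

With the functions pasted into Terminal, convert_all ods csv should turn every spreadsheet in the current folder into a csv file with the same base name.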
For reference, here’s the Art of Solving list of supported formats.