Convert PDF documents to SVG for Inkscape (pdf2svg)

There is now a project that may work for you. It was made with Inkscape in mind. You can try it from http://www.cityinthesky.co.uk/opensource/pdf2svg/. If that doesn’t work, try this awful method below.
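
For reference, pdf2svg takes an input PDF, an output SVG and an optional page number. The wrapper below is only a minimal sketch of calling it for one page. The function name pdf_page_to_svg and the output file naming are made up for illustration and are not part of pdf2svg.

```shell
#!/bin/sh
# Convert one page of a PDF to SVG with pdf2svg.
# Usage: pdf_page_to_svg document.pdf pagenum
pdf_page_to_svg() {
    pdf=$1
    page=$2
    # Hypothetical output naming: document.pdf, page 3 -> document-page3.svg
    out="${pdf%.pdf}-page${page}.svg"
    if ! command -v pdf2svg >/dev/null 2>&1; then
        echo "pdf2svg is not installed" >&2
        return 1
    fi
    # pdf2svg syntax: pdf2svg <input.pdf> <output.svg> [<page number>]
    pdf2svg "${pdf}" "${out}" "${page}"
}
```

Calling pdf_page_to_svg document.pdf 3 would write document-page3.svg, which can then be opened in Inkscape.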

I searched and searched the web for a project that could convert PDF documents into a format I could edit with free software on Linux. I use Ubuntu daily at my workplace and only use a single Windows application, for accounting, via RDP / VMPlayer. I needed a way to modify PDF documents, and Inkscape has all the features I need except a functioning PDF import filter. Inkscape uses pstoedit, which doesn’t extract embedded raster images and convert them for the SVG format. There is a plugin for pstoedit to convert to SVG for $50, but it caused a segfault on my machine. So no joy with a single tool to automagically convert my documents.

I spent a loooong time trying different formats from pstoedit and converting through various other software to end up with an SVG file that I can use in Inkscape with included raster images. The following is the process I used to create the best results. The steps I have used are for creating one page at a time. Some more automation could be added to make most of this work without all these steps, as well as to do each page in the PDF document automatically.

Step 1
Create a working directory for this project. You will likely be creating hundreds of files during this process and it can get messy if it’s mixed in with other documents.

Step 2
Make a Level 1 PostScript file from a page in the PDF document. When extracting pages from a PDF document, pstoedit’s intermediate file formats have no support for Level 2 PostScript raster images. We’ll create a Level 1 PostScript file from a PDF page that pstoedit will handle correctly. The pdf2ps script passes its options through to Ghostscript, so the page is selected with -dFirstPage and -dLastPage.

pdf2ps -dFirstPage=pagenum -dLastPage=pagenum -dLanguageLevel=1 document.pdf page.ps
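
To run this step for every page instead of one at a time, a loop along these lines might work. It is a sketch, not something I have run against every kind of PDF. It assumes pdfinfo is installed (it ships in the same package as pdfimages), it selects each page with Ghostscript's -dFirstPage and -dLastPage switches, which pdf2ps forwards to gs, and the function names and the page-N.ps naming are made up for illustration.

```shell
#!/bin/sh
# Read `pdfinfo` output on stdin and print the page count.
parse_page_count() {
    awk '/^Pages:/ { print $2 }'
}

# Write one Level 1 PostScript file per page: page-1.ps, page-2.ps, ...
# (hypothetical naming; assumes pdfinfo and pdf2ps are installed)
extract_all_pages() {
    pdf=$1
    pages=$(pdfinfo "${pdf}" | parse_page_count) || return 1
    [ -n "${pages}" ] || return 1
    n=1
    while [ "${n}" -le "${pages}" ]
    do
        pdf2ps -dFirstPage="${n}" -dLastPage="${n}" -dLanguageLevel=1 \
            "${pdf}" "page-${n}.ps" || return 1
        n=$((n + 1))
    done
}
```

Run extract_all_pages document.pdf inside the working directory from Step 1, then carry on with the remaining steps for each page-N.ps file.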

 

Step 3
Convert the page to the fig format. The pstoedit tool does this job decently, and creates tons of files in the process. This process creates a .fig document with all of the images in separate EPS files. The .fig document simply references the image files to be included. At this point you could use xfig to make modifications, but it would be horribly slow and difficult to work with. This process will also rasterize all text, but I’ll show in a later step how to get vectors back in your document.

pstoedit -f fig page.ps page.fig

 

Step 4
Convert from the fig format to the SVG format. When this is done, the SVG document contains only some formatting and placement information, while referencing all of the external EPS files.

fig2dev -L svg page.fig page.svg
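
Steps 3 and 4 always run back to back, so they can be chained in a small helper that stops if either tool fails. This is just a sketch: the function name ps_to_svg is made up, and it assumes the input file name ends in .ps.

```shell
#!/bin/sh
# Convert page.ps -> page.fig -> page.svg, stopping at the first failure.
# Usage: ps_to_svg page.ps
ps_to_svg() {
    base=${1%.ps}
    # Step 3: PostScript to fig (images land in separate EPS files)
    pstoedit -f fig "${base}.ps" "${base}.fig" || return 1
    # Step 4: fig to SVG (still referencing the EPS files)
    fig2dev -L svg "${base}.fig" "${base}.svg" || return 1
    # Print the name of the SVG that was written
    echo "${base}.svg"
}
```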

 

Step 5
Convert EPS images to PNG images for use with Inkscape. Inkscape can’t import those raster EPS files, only vector EPS files (it uses pstoedit to do the conversions, so it is limited by that tool). The EPS images must be converted to an image format that Inkscape can use. I chose PNG because the format is free, standardized, and lossless. Unfortunately, there is a problem that prevents a direct conversion. When pstoedit creates the EPS files with embedded raster images, an EPS file may specify an incorrect image size/formatting. This shows up as white lines that surround the raster images, and the white lines are inconsistent between the images. What must be done is to extract the raster image data from the EPS file without using the EPS-specified sizing and formatting. The way to do this is rather ugly, but it works. First, the EPS files are converted to PDF files. Second, the tool pdfimages extracts the raster images from the PDF files. Third, the ImageMagick tool convert converts the images to PNG files.

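#!/bin/sh
# Re-extract the raster data from each EPS file and save it as PNG.
mkdir -p tmpimages
for epsfile in *.eps
do
    # Skip the loop body if no .eps files matched the glob
    [ -e "${epsfile}" ] || continue
    echo "${epsfile}"
    # EPS -> PDF, so that pdfimages can read it
    convert "${epsfile}" "${epsfile}.pdf"
    # Pull the raw raster images out of the PDF
    pdfimages "${epsfile}.pdf" tmpimages/
    # Raster images -> PNG, named name.eps.png to match Step 6
    convert tmpimages/* "${epsfile}.png"
    rm -f "${epsfile}.pdf"
    rm -f tmpimages/*
done
rm -rf tmpimages/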

 

Step 6
Change all the references in the SVG file from .eps to .eps.png. Open your favorite text editor (or get creative with sed) and change all .eps to .eps.png.
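
If you go the sed route, something like the following sketch could do it. The function name eps_refs_to_png is made up, and it assumes that every ".eps" in the SVG file is an image reference, which is worth checking by eye before trusting the result.

```shell
#!/bin/sh
# Rewrite .eps references to .eps.png; reads SVG on stdin, writes to stdout.
# Any existing .eps.png is first folded back to .eps, so running the
# filter twice does not produce .eps.png.png.
eps_refs_to_png() {
    sed -e 's/\.eps\.png/.eps/g' -e 's/\.eps/.eps.png/g'
}
```

For example, eps_refs_to_png < page.svg > page_png.svg writes a new file and leaves the original untouched.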

Step 7
Add the proper attributes in the <svg> tag to the SVG document to allow Inkscape to open it. For whatever reason, Inkscape won’t recognize the images in the SVG document unless the following attributes are added to the <svg> tag:

xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://web.resource.org/cc/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
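
Editing the tag by hand gets old fast, so here is a sketch of how the attributes could be added with grep and sed. The function name add_inkscape_namespaces is made up. It assumes the file contains a single <svg> tag, and it skips any declaration that is already there, since some fig2dev versions may already write a few of them and a duplicate attribute would break the XML.

```shell
#!/bin/sh
# Copy an SVG file, adding any of the namespace declarations from Step 7
# that the file does not already declare.
# Usage: add_inkscape_namespaces page.svg page_inkscape.svg
add_inkscape_namespaces() {
    src=$1
    dst=$2
    cp "${src}" "${dst}" || return 1
    while read -r prefix uri
    do
        # Skip declarations that are already present
        if grep -q "xmlns:${prefix}=" "${dst}"; then
            continue
        fi
        # Insert the attribute right after "<svg", whether the tag name is
        # followed by whitespace, by ">" or by the end of the line
        sed -e "s|<svg\([[:space:]>]\)|<svg xmlns:${prefix}=\"${uri}\"\1|" \
            -e "s|<svg\$|<svg xmlns:${prefix}=\"${uri}\"|" \
            "${dst}" > "${dst}.tmp" && mv "${dst}.tmp" "${dst}" || return 1
    done <<'EOF'
dc http://purl.org/dc/elements/1.1/
cc http://web.resource.org/cc/
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
svg http://www.w3.org/2000/svg
xlink http://www.w3.org/1999/xlink
sodipodi http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd
inkscape http://www.inkscape.org/namespaces/inkscape
EOF
}
```

The input file is never modified, so if Inkscape still refuses the result you can compare the two files and fix the tag by hand.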

 

Step 8
Extract vector information from the PDF document. If the PDF document contains text or vector information, the earlier steps will have converted it to raster images. The pstoedit tool can also extract that content directly from the PDF as vectors, written to a separate SVG file.

pstoedit -f plot-svg -page pagenum -ssp -mergetext document.pdf page_text.svg

 

Step 9
Place the SVG text in the SVG document with the images.

a) Open both SVG documents with Inkscape.
b) Switch to the document that has the text only.
c) Select all the items then group them together as a single object.
d) Copy that object and paste it into the SVG document with the graphics.
e) Send the SVG text below the rasterized text in the document.
f) Delete all of the rasterized text to show only the SVG text below it.

 

Notes
It is my understanding that converting to fig documents may distort your units, because the fig format's resolution is not high enough. I have not confirmed this. You may want to scale the document up by a large factor (10x), do the conversion to the fig format, then use xfig to scale back down before exporting.

Known Problems
Some raster images may not be clipped properly. This must be fixed manually. This is only a formatting issue and does not affect the quality of the images. Text may be converted to paths and may not be editable.

Warning
This could possibly take a LOT of disk space. Make sure to have at least several hundred megabytes free. It could also take a lot of memory and time for Inkscape to open the document for the first time until all the rasterized text is removed. Please be patient.