For my work on Kemtro, I recently explored how I could extract timetable information available in PDF files. There are still many transport agencies that do not want (or can’t for various reasons) make their data available in a friendly format (GTFS).
I’m not aware of any easy and centralized way of doing that, so I did what I typically do when it comes to ‘liberating’ data: I first scraped the PDF files.
But then it appears that extracting the spreadsheet content from a PDF is not that easy. Here are some of the things I looked at (oh and I only have Macbooks, so some of the options are limited)
- Use a PDF library (pyPdf, pdfminer) and do it myself in Python: :( bleh, PDF is a layout format, and after looking a little at the actual PDF markers in the various PDF files I had, it looked like a very bad idea to just even start.
Use PDF converters applications to convert to Excel format: the only one that kinda worked was PDF to Spreadsheet Pro. You open the PDF file, and you see the content page by page. You can realign red line markers for columns to tell the application how to segment the table. I found that it almost worked but not well enough. Sometimes columns would not be extracted according to the red markers I had set. Plus, it’s impossible to include that into a more automated process.
I did not try Acrobat Pro because it’s expensive, and there is no Mac trial application available for download.
- I first tried the Adobe ExportPDF web app that lets you export PDF files to either Office Work or Excel. I even signed up for it ($20/year). It works decently for single page PDFs, but it often completely fails for multiple pages PDFs – the first page would be decent, but then all the numbers would be included in a single column thus without any formatting at all. For some table format, it does not extract the data properly (one cell will contain the whole line of times, instead of having one cell for a time). Also, the interface is tedious (you have no choice but to upload the file by hand, and the whole conversion takes a long time.
Zamzar: it accepts a whole bunch of file formats, and convert them to different ones (not all conversions are possible obviously).
It works great!
I’d have a few suggestions to make it easier to integrate it in an automated processing flow, but it’s fast and most of all it has converted almost all the timetables I’ve sent. It also creates one Sheet per page in the resulting Excel document. Almost like all converters I tried, it converts UTF-8 characters to Ascii most of the time (you lose accented characters in the conversion). I signed up for a paid account, and you can send an email with the attached PDF files to a zamzar email that will convert it to the format you want. The only annoying thing is that you have no easy way to download the result files. The web interface is not super awesome, and you have to click on every file to get them back. The only way to automate the process would be to read the emails Zamzar sends you after a conversion finishes as there is a direct download link in the email. If they integrated with dropbox it’d be awesome.
After sending a hundred+ files, the conversions started to all fail. I’m hoping I was not the one to take their pdf-xls system down, and it took 12 hours for it to be back. Some of the output files could not be open, even with the Excel ‘Open and Repair’.
- Mechanical Turk: That’s actually what I tried first before the online converters. I had to convert each PDF page of the PDF file to an image first because you can’t upload a PDF file when creating a HIT. The HIT description was: “The image contains timetable information that need to be converted into a spreadsheet (google doc spreadsheet, Excel sheet, or CSV file)”, and the reward was $0.30. I created about 20 HITs, and I’ve had mixed results. Most of the results were ok. I suspect most conversions were made using a software/OCR but I’m not sure. Another disadvantage is that the resulting spreadsheet content will probably be formatted differently in the different files. Also, just making the HIT tasks took time and effort. I first had to find a way to create images from PDF files, which I did using the MAC Automator. I only used the MT web interface which limits a lot what you can do and how fast. I think overall I would maybe consider using MT for files that completely fails to be converted by other means, but it did not seem very practical.
Even after getting a whole bunch of .xls files, the journey is still not over. First of, you have to install Office Excel on your MAC if you want to open most of those files. Numbers won’t work. Also, as I wanted to convert the Excel files to CSV, I could automate the process using Excel VBA. I tried Excel to CSV Converter but it does require Excel anyway (they probably run a VBA macro to convert to CSV) and they do not do multi-sheets .xls files.
I assembled an Excel VBA module that would run through every Excel file in a directory, open and save each sheet as a CSV file. I found most of the code in different places, but all put together it’s there below.
You need to 1- Create an Excel file, 2- Open Menu Tools->Macro->VB Editor, 3- Do Menu Insert -> Module, 4- Paste the code, 5- Change the directory for the input, 6- Run. Some of the VBA functions do not work on a MAC, so it’s limited, but it mostly worked except for corrupted files. It’s also not fully automated (Excel might prompt you with notifications). You should save the file as .xlsm (macro enabled).
Now obviously, I still have to make the scripts to read the CSV files, parse the data correctly to create the actual route and timetable data…