#python

Hacker Noon CC BY-SA 26/11/2018

An open-(source, science) tool to extract tables from PDFs into Excels
▻https://hackernoon.com/an-open-source-science-tool-to-extract-tables-from-pdfs-into-excels-3ed3

https://cdn-images-1.medium.com/max/1024/0*8YsOjqB-FQPkCAlY.png

I originally wrote this post for my website.

Photo by Patrick Tomasso on Unsplash

I am borrowing the first three paragraphs from my previous blog post, since they perfectly explain why extracting tables from PDFs is hard.

The PDF (Portable Document Format) was born out of The Camelot Project to create “a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks”. Basically, the goal was to make documents viewable on any display and printable on any modern printer. PDF was built on top of PostScript (a page description language), which had already solved this “view and print anywhere” problem. PDF encapsulates the components required to create a “view and print anywhere” document. These include characters, fonts, graphics and (...)

#python_3