Alright, let’s dive into my wozniacki experiment. It’s a bit of a weird name, I know, but I needed something to call this little project where I was trying to automate a really tedious data entry task.

It all started when I was stuck manually copying data from these clunky old PDFs into a spreadsheet. Seriously, hundreds of pages of tables. My eyes were crossing! I thought, “There HAS to be a better way.” So, I decided to give Python a shot. I mean, I’d dabbled before, but never really committed.
First thing I did was install the basics: Python itself, of course, and then pip, the package installer. I remember googling “how to install pip on windows” like a total noob. After that, I started looking for libraries that could handle PDFs. I stumbled upon PyPDF2. It seemed straightforward enough, so I installed it with a simple pip install PyPDF2.
Then came the fun part: trying to actually extract the text. I wrote a quick script to open the PDF, loop through the pages, and print out the text. It was a mess! The formatting was all over the place, line breaks in the middle of words, you name it. PyPDF2 wasn’t really cutting it for these complex tables.
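If you’ve never done it, that first attempt really is just a few lines. Here’s a minimal sketch of the idea, using the newer PyPDF2 API (report.pdf is just a stand-in for one of my actual files):

```python
from PyPDF2 import PdfReader

# "report.pdf" stands in for one of the real files
reader = PdfReader("report.pdf")

for page in reader.pages:
    # extract_text() gives you the raw page text; layout is not guaranteed,
    # which is exactly why my tables came out scrambled
    print(page.extract_text())
```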
Back to Google I went. This time, I found tabula-py, which is a wrapper around Tabula, a tool specifically designed to extract tables from PDFs. Sounded perfect! I installed it (I think it needed Java too, which was another hurdle). This time, the results were much better. The tables were actually being recognized as tables!
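In case it helps anyone, the basic tabula-py call looks roughly like this. The options here are just the obvious ones to reach for, not necessarily the exact ones I ended up with, and report.pdf is a placeholder again:

```python
import tabula

# pages="all" scans every page; each detected table comes back as its
# own pandas DataFrame
tables = tabula.read_pdf("report.pdf", pages="all", multiple_tables=True)

print(f"Found {len(tables)} tables")
print(tables[0].head())
```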
But there were still issues. Some tables were split across pages, some had weird headers, and the data types were all wrong. So, I started writing code to clean up the data. I used regular expressions to get rid of unwanted characters, split strings, and convert numbers to the right format. It was a lot of trial and error.
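To give a flavor of the cleanup, here’s the kind of helper I ended up leaning on for numeric columns. The name clean_number and the exact pattern are simplified for this post, but the idea is the same:

```python
import re

def clean_number(raw):
    # strip everything that isn't a digit, decimal point, or minus sign
    cleaned = re.sub(r"[^\d.\-]", "", str(raw))
    try:
        return float(cleaned)
    except ValueError:
        # footnote marks, "n/a", empty cells, etc. just become None
        return None

print(clean_number("1,234.50 *"))  # 1234.5
print(clean_number("n/a"))         # None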
The most annoying part was handling the inconsistencies. Some PDFs had one format, others had another. I ended up writing a bunch of conditional statements to handle each specific case. It wasn’t pretty, but it worked. I felt like a detective, trying to decipher these ancient documents.
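Roughly, that part of the script was a pile of branches shaped like this. The format labels and the specific fixes below are invented just to show the shape of it, not the real cases:

```python
def normalize_row(row, source_format):
    # every batch of PDFs had its own quirks, so every batch got a branch
    if source_format == "legacy":
        # e.g. date and record ID fused into the first column
        date, record_id = row[0].split("/", 1)
        return [date.strip(), record_id.strip(), *row[1:]]
    elif source_format == "2019":
        # e.g. a spurious empty column that needed dropping
        return [cell for cell in row if cell != ""]
    return row
```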
Finally, after a few days of coding and debugging, I had a script that could reliably extract the data and put it into a CSV file. It wasn’t perfect, but it saved me HOURS of manual work. I even added a simple progress bar so I could see how far along it was. It felt SO good to watch it chugging away, automatically filling my spreadsheet.
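Condensed way down (and leaving out all the cleanup steps), the final pipeline was in the spirit of this sketch. tqdm is just one easy way to get that progress bar, and the paths are placeholders:

```python
import pandas as pd
import tabula
from tqdm import tqdm  # quick, drop-in progress bar

def pdfs_to_csv(pdf_paths, out_path="extracted.csv"):
    # pdf_paths is a placeholder for wherever the source PDFs live
    frames = []
    for path in tqdm(pdf_paths, desc="Extracting tables"):
        frames.extend(tabula.read_pdf(path, pages="all"))
    pd.concat(frames, ignore_index=True).to_csv(out_path, index=False)
```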
Lessons learned? Python is awesome, but you need the right libraries. And data cleaning is always more time-consuming than you think. But hey, I automated a super boring task and learned a ton in the process. Definitely worth it!
