Joining Multiple Excel Files with Python

“Programming is not about how to code, but learning how to transform real world problems into problems that computer can break and understand. It teaches you to change your mindset of how to approach problems, turn them into algorithms and then write a code so that computer can solve it. Python is one of the tools that we use,” says Marco Tulio Pires, director of School of Data, at International Journalism Festival 2016.

Today journalists have to deal with more than reporting and writing. Huge amounts of data is part of daily work. Data journalism is trending worldwide, teaching professionals how to deal with documents from government, various financial reports from companies and others. Basically any situation where there are a number of files that preserve the same structure between themselves, but are needed to consolidate together in a single CSV file for further analysis.

Pires administered workshop on how to deal with problem described above. He used programming tool Python 2.7 and worked with CSV and XLS file formats. An example was based on Medicaid Managed Care Enrollment Reports (New York State, USA). It contains multiple documents with various data about different companies. All strategies used in this workshop were based on The New York Times journalist Sarah Cohen‘s Advanced Data Cleaning strategy with OpenRefine (NICAR 2016, Denver, USA) and on a script that Pires made for J++ São Paulo.

“First you need an image of success in head. How these file would look like? Think about the columns, that you want to have,” he says. After deciding what is needed in the end it’s time to start designing patterns and look for repetitions in all documents. „We need a cleaning strategy. It’s more important not to understand the Python part, but the strategies you can use for cleaning data,” he adds.

Workshop went step by step with journalists working with their computers and following Pires’ lead. After finding common patterns and repetitions in documents he moved on to using Python. “First you need to extract relevant information from cells, columns and rows. After getting values you need you move on to storing them,” Pires explains. He showed how to assign column values with necessary row values, how to save file to CVS format and so on. In the end he made a script that cleaned all files. Guidelines for doing that can be found here: https://github.com/mtrpires/pyPerugia16

Pires ends his workshops stressing necessity of understanding the way computers work: “You will learn Python along the way, but actually what you’re learning is how to approach problems like the one we had for an example today. People shouldn’t aim for learning programming, but to learn how to approach problems in an algorithm way.”

Anna Udre