Howto: Splitting Files With Standard Python Scripts

Blog

Howto: Splitting Files With Standard Python Scripts

Stefan Seltmann

Published on

19.8.2015

8.5.2025

Updated on

8.5.2025

Data Science & AI

Howto: Splitting Files With Standard Python Scripts

Ready-Made Data Sets Which Explode the Limits

I am frequently confronted with raw data that is provided to me for analysis and which, when uncompressed, can easily encompass files of half a gigabyte or more. Starting from one gigabyte and over, the desktop-supported statistics tools slowly become strained. There are, of course, tool options for only selecting part of the columns, or only loading the first 10,000 lines, etc.

But what should you do when you only want to take a random sample from the data provided? You should never rely on the file being randomly sorted. It may already have gained systematic sequence effects due to processes in the database export. It also may be the case, that you only want to analyse a tenth of a grouping, such as the purchases made by every tenth customer. To this end, the complete file has to be read as otherwise it is impossible to ensure that all of the purchases of the filtered customers are taken into account.

Splitting Files With Python Scripts

Problems With Conventional Tools

It's a fact: the opening of these files on Windows with a range of text editors or even Excel is rarely achieved to complete success because it is initially necessary to load the complete file in the tool. The program will then eventually decline the service, truncate the file, or at the least, requires a very long time.

Naive Solution With Python Script

I recently resolved this again with a short Python script. Let us attempt to simply split the file into three large partitions.

We have to start by opening the file. This takes place via a so-called file handle with which we are able to address the file. The simple approach, which often works, is to read the contents, to separate it into lines, to split these up, and to save it again. Something like this:

fileHandle = open('datensatz.csv', 'r')
content = fileHandle.read()
content = content.split('\n')
newFile01 = []
newFile02 = []
newFile03 = []

We now have a list with all of the lines and three empty lists for the target files. I will take the easy approach and distribute the lines equally among the new files:

counter=0
for line in content:
    if counter % 3 == 0:
        newFile01.append(line)
    elif counter % 3 == 1:
        newFile02.append(line)
    elif counter % 3 == 2:
        newFile03.append(line)
    counter+=1
 file = open('newFile01.csv', 'w')
 file.write('\n'.join(newFile01))
 file.close()
 file = open('newFile02.csv', 'w')
 file.write('\n'.join(newFile02))
 file.close()
 file = open('newFile03.csv', 'w')
 file.write('\n'.join(newFile03))
 file.close()

You See - It Has Worked!

The code is extremely stringy, however. We should initially use the "With" statement, then we don't need to have to explicitly close the files any more. This literal has the effect that once the work is done, a clearing up process starts, especially if an error occurs in the With area. In the case of the file handles, the clearing up includes the closing of the files. This prevents processes from failing, but the files still being blocked because they are "in progress". What makes matters worse is that with read() we load the whole of the file to the hard drive! The test file for this script had 650MB and the hard drive was burdened with approximately 1.6 Gigabytes!! Here, it is also possible to read and process the file on a line by line basis. The next attempt:

with open('datensatz.csv', 'r') as fileHandle: 
    counter=0
    newFile01, newFile02, newFile03 = [], [], []
    for line in fileHandle:
        if counter % 3 == 0:
            newFile01.append(line)
        elif counter % 3 == 1:
            newFile02.append(line)
        elif counter % 3 == 2:
            newFile03.append(line)
        counter += 1

with open('newFile01.csv', 'w') as file01:
    file01.write('\n'.join(newFile01))
with open('newFile02.csv', 'w') as file02:
    file02.write('\n'.join(newFile02))
with open('newFile03.csv', 'w') as file03:
    file03.write('\n'.join(newFile03))

This looks better, it is basically shorter and takes up less space in the hard drive but it still requires almost one gigabyte. This isn't surprising because the whole of the file is in the newFile01, newFile02 and newFile03 lists for a certain period. Things go even better if we write directly in the files. By the way, it is a frequently overlooked feature of the print function in Python3 that you are able to select the output destination. It is always the console by default but it can also be a file.

with open('datensatz.csv', 'r') as fileHandle, open('newFile01.csv', 'w') as file01, open('newFile02.csv', 'w') as file02, open('newFile03.csv', 'w') as file03:
    counter = 0 
    for line in fileHandle: 
        if counter % 3 == 0: 
            print(line, file = file01) 
        if counter % 3 == 1: 
            print(line, file = file02) 
        if counter % 3 == 2: 
            print(line, file = file03) 
        counter+=1

This makes the whole thing even shorter and during the process, it never exceeded 16 MB in the hard disk! This means it is possible to edit files here which would otherwise go far beyond the capacity of the hard disk without hesitation! I would now be confident editing files with hundreds of gigabytes.

We have still not yet resolved the scenario of wanting to separate according to customer numbers, for example. I may turn to that in a separate post. Just as the problem with the data header now only being found in the first file.

PS: It can be even shorter but that is the end:

with open('datensatz.csv', 'r') as fileHandle, open('newFile01.csv', 'w') as file01, open('newFile02.csv', 'w') as file02, open('newFile03.csv', 'w') as file03: 
    files, counter = (file01, file02, file03), 0 
    for line in fileHandle: 
        print(line, file = files[counter % 3]) 
        counter+=1

Let’s Unlock the Full Potential of Your Data – Together!

Looking to become more data-driven, optimize processes, or leverage cutting-edge technologies? Our blog provides valuable insights – but the best way to tackle your specific challenges is through a direct conversation.

Let’s talk – our experts are just one click away!

Want To Learn More? Contact Us!

Dr. Sebastian Petry

Domain Lead Data Science & AI

Who is b.telligent?

Do you want to replace the IoT core with a multi-cloud solution and utilise the benefits of other IoT services from Azure or Amazon Web Services? Then get in touch with us and we will support you in the implementation with our expertise and the b.telligent partner network.

Get to know us

The top of an office building on a bright day

All posts

No previous post

No next post

Howto: Splitting Files With Standard Python Scripts

Ready-Made Data Sets Which Explode the Limits

Table of Contents

Splitting Files With Python Scripts

Problems With Conventional Tools

Naive Solution With Python Script

You See - It Has Worked!

Let’s Unlock the Full Potential of Your Data – Together!

Want To Learn More? Contact Us!

Your contact person

Dr. Sebastian Petry

Who is b.telligent?

Munich

Basel

Berlin

Cluj

Dusseldorf

Frankfurt

Hamburg

Nuremberg

Vienna

Zurich

Cluj

Vienna – Postal address

Vienna – Visitor address

Basel

Zurich

Nürnberg

Frankfurt

Düsseldorf

Hamburg

Berlin

Munich

Howto: Splitting Files With Standard Python Scripts

Ready-Made Data Sets Which Explode the Limits

Table of Contents

Splitting Files With Python Scripts

Problems With Conventional Tools

Naive Solution With Python Script

You See - It Has Worked!

Let’s Unlock the Full Potential of Your Data – Together!

Want To Learn More? Contact Us!

Your contact person

Dr. Sebastian Petry

Who is b.telligent?

Related Posts

Snowflake Document AI – Easily Extract Data From Unstructured Documents

Neural Averaging Ensembles for Tabular Data With TensorFlow 2.0

Neural Networks for Tabular Data: Ensemble Learning Without Trees

Sizing and Scaling Azure AI Search

Munich

Basel

Berlin

Cluj

Dusseldorf

Frankfurt

Hamburg

Nuremberg

Vienna

Zurich

Cluj

Vienna – Postal address

Vienna – Visitor address

Basel

Zurich

Nürnberg

Frankfurt

Düsseldorf

Hamburg

Berlin

Munich