
VERY Best Practice: Working with Paths in Python - Part 2

The same problem: listing folders and drives

In the last blog post, we used a recursive function, in a solution of fewer than 10 lines, to scan folders and evaluate files by modification date and size.

Now I’m going to raise the bar somewhat for this example by showing even better alternatives.

Concatenating paths with pathlib

Old wine in new bottles?

The solution in the earlier example concatenated the paths like this:

path_file = os.sep.join([path_dir, filename])

The advantage of this is that the solution is independent of the operating system, and one does not have to combine strings with a “+” sign or string formatting.

Yet this is error-prone: the directory path might inadvertently be defined with a trailing path separator.

path_dir: str = r"C:/Users/sselt/Documents/blog_demo/"  # trailing separator

filename: str = "some_file"

path_file = os.sep.join([path_dir, filename])

# C:/Users/sselt/Documents/blog_demo/\some_file

Although the code in this example runs, the unintended extra separator can lead to an error when the path is actually accessed. Such errors tend to occur whenever users maintain paths in config files, far from the code, without paying attention to the convention.
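The effect is easy to reproduce on any operating system. The following is a minimal sketch, assuming a made-up relative folder name (`blog_demo`, not a real directory). It compares `os.sep.join` with the standard library's `os.path.join`, which only inserts a separator when one is missing:

```python
import os

# Hypothetical folder name for illustration; the trailing separator is added on purpose.
path_dir = "blog_demo" + os.sep
filename = "some_file"

# Naive join: always inserts os.sep, so the separator ends up doubled.
naive = os.sep.join([path_dir, filename])

# os.path.join: adds a separator only if path_dir does not already end with one.
joined = os.path.join(path_dir, filename)

print(naive)
print(joined)
```

`os.path.join` avoids this particular pitfall, but it still works on plain strings and does not clean up separators that are already inside its arguments.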

A better solution has been available since Python 3.4: the pathlib module. It wraps the file and folder functions of Python’s os module in an object-oriented interface.

As a reminder, here is the old variant:

import os

path = "C:/Users/sselt/Documents/blog_demo/"

os.path.isdir(path)

os.path.isfile(path)

os.path.getsize(path)

 

And here is the new alternative:

from pathlib import Path

path: Path = Path("C:/Users/sselt/Documents/blog_demo/")

path.is_dir()

path.is_file()

path.stat().st_size

Both deliver exactly the same result. So, why is the second one much better?
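To check that claim without depending on my folder structure, here is a small sketch that runs both variants against a temporary directory. The file name `some_file.txt` and its content are made up for the demonstration:

```python
import os
import tempfile
from pathlib import Path

# Create a throwaway folder with one small file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "some_file.txt"
    demo_file.write_text("hello")  # 5 ASCII characters, so 5 bytes on disk

    # Old variant: functions from os.path.
    old = (os.path.isdir(tmp), os.path.isfile(tmp), os.path.getsize(demo_file))

    # New variant: methods on a Path object.
    folder = Path(tmp)
    new = (folder.is_dir(), folder.is_file(), demo_file.stat().st_size)

    print(old == new)
```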

Object oriented and more error tolerant

The calls are object-oriented. That may or may not be your preference, but I like it a lot more. We have an object here, the path, which has attributes and methods.

More exciting, however, is the use of operator overloading:

filename: Path = Path("some_file.txt")

path: Path = Path("C:/Users/sselt/Documents/blog_demo")

print( path / filename )

# C:\Users\sselt\Documents\blog_demo\some_file.txt

At first glance, dividing one path by another looks like invalid code. However, the Path class simply overloads the / operator so that it joins the two paths.
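What happens behind the operator can be sketched roughly as follows. I use `PurePosixPath` here only so the output is identical on every operating system, and the folder is a hypothetical one:

```python
from pathlib import PurePosixPath

# PurePosixPath never touches the file system, so the folder need not exist.
folder = PurePosixPath("/home/sselt/blog_demo")  # hypothetical folder

# The / operator is implemented by the special method __truediv__ ...
via_operator = folder / "some_file.txt"
via_dunder = folder.__truediv__("some_file.txt")

# ... and gives the same result as the explicit joinpath() method.
via_method = folder.joinpath("some_file.txt")

# A plain string on the left-hand side works too, thanks to __rtruediv__.
reversed_operands = "/home/sselt" / PurePosixPath("blog_demo/some_file.txt")

print(via_operator)
```

Note that the right-hand operand can be a plain string, so wrapping the file name in `Path(...)` is optional.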

In addition to this syntactic sugar, the path objects will intercept other typical errors:

filename: Path = Path("some_file.txt")

# path with a superfluous trailing separator

path: Path = Path("C:/Users/sselt/Documents/blog_demo/")

# path with a doubled separator

path: Path = Path("C:/Users/sselt/Documents/blog_demo//")

# path completely mixed up

path: Path = Path("C:\\Users/sselt\\Documents/blog_demo")  # a wild mix of separators

# all variants lead to the same result

print(path/filename)

# C:\Users\sselt\Documents\blog_demo\some_file.txt

This variant is not only nicer, but also more robust against malformed input. In addition, the code is independent of the operating system: one defines only a generic Path object, which becomes a WindowsPath on a Windows system and a PosixPath on a Linux system.
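One caveat is worth a small sketch: the backslash is only a separator under Windows rules. The "pure" classes `PureWindowsPath` and `PurePosixPath` apply one set of rules regardless of the machine the code runs on, which makes the difference visible anywhere:

```python
from pathlib import PurePosixPath, PureWindowsPath

filename = "some_file.txt"
variants = [
    "C:/Users/sselt/Documents/blog_demo/",   # trailing separator
    "C:/Users/sselt/Documents/blog_demo//",  # doubled separator
    "C:\\Users/sselt\\Documents/blog_demo",  # mixed separators
]

# With Windows rules, every spelling collapses to the same path.
windows_results = {str(PureWindowsPath(v) / filename) for v in variants}
print(windows_results)

# With POSIX rules, a backslash is an ordinary character, not a separator.
posix_parts = PurePosixPath(variants[2]).parts
print(posix_parts)
```

So the mixed-separator input is forgiven on Windows, while paths that should work on both systems are best written with forward slashes.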

Most functions that typically expect a path as a string can work directly with a Path object. In rare cases, you may need to convert the object to a string with str(path).
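As a brief illustration, assuming a hypothetical path that does not need to exist: `os.path.basename` accepts the path object directly, while `str()` and `os.fspath()` produce the plain string for APIs that insist on one.

```python
import os
from pathlib import PurePosixPath

path = PurePosixPath("/home/sselt/blog_demo/some_file.txt")  # hypothetical path

# Functions that accept path-like objects take the path object directly.
base = os.path.basename(path)

# For APIs that insist on a string, convert explicitly.
as_str = str(path)
as_fspath = os.fspath(path)  # same string; unlike str(), raises TypeError for non-path objects

print(base, as_str)
```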

Processing the path with os.walk

In my last blog post’s solution, I used os.listdir, os.path.isdir and a recursive function to iterate through the directory tree and distinguish between folders and files.

But os.walk offers a better solution. This function does not build a list; it returns a generator that you can consume step by step. Each step yields a tuple containing the current folder path, a list of the subfolder names and a list of the file names in that folder. The recursion into subfolders happens automatically, so you get all the files with a single call.
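To see what os.walk actually yields, here is a minimal sketch on a tiny temporary tree. The names `a.txt`, `sub` and `b.txt` are made up for the demonstration:

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny tree: tmp/a.txt and tmp/sub/b.txt.
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("a")
    (root / "sub" / "b.txt").write_text("bb")

    # Each step yields (folder path, subfolder names, file names).
    visited = []
    for dirpath, dirnames, filenames in os.walk(root):
        relative = Path(dirpath).relative_to(root).as_posix()
        visited.append((relative, sorted(dirnames), sorted(filenames)))

    print(visited)
```

By default the walk is top-down, so the root folder (shown as `.`) comes before its subfolders.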

The better solution with os.walk and Pathlib

If you combine the two aforementioned techniques, you get a solution that is simpler, independent of the OS, more robust against inconsistent path formats, and free of explicit recursion:

import os

from pathlib import Path

filesurvey = []

for dirpath, dirnames, filenames in os.walk(path):   # one tuple per folder

    for filename in filenames:  # filenames is a list of the file names in that folder

        full_path: Path = Path(dirpath) / filename   # dirpath is the folder path

        stat = full_path.stat()   # query the file system only once per file

        filesurvey.append([dirpath, filename, stat.st_mtime, stat.st_size])
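To tie this back to the original goal of evaluating files by modification date and size, the loop can be wrapped in a function and its result sorted. This is only a sketch: the function name `survey_files` and the demo files are my own assumptions for illustration, and the function stores the containing folder of each file in the first column.

```python
import os
import tempfile
from pathlib import Path


def survey_files(root):
    """Return [folder, filename, mtime, size] for every file below root."""
    survey = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            stat = (Path(dirpath) / filename).stat()  # one stat call per file
            survey.append([dirpath, filename, stat.st_mtime, stat.st_size])
    return survey


with tempfile.TemporaryDirectory() as tmp:
    # Made-up demo files of different sizes.
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "small.txt").write_text("x")
    (Path(tmp) / "sub" / "large.txt").write_text("x" * 100)

    # Largest files first; sort on row[2] instead to order by modification time.
    by_size = sorted(survey_files(tmp), key=lambda row: row[3], reverse=True)
    names_by_size = [row[1] for row in by_size]
    print(names_by_size)
```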

   

If you can top this with a best practice, don’t hesitate to get in touch. I’d love your feedback!

Read the first part of this blog post here.

Stefan Seltmann
Lead Expert
Stefan loves programming, particularly when data engineering and data science are involved. He's turned his hobby into a career and has earned a reputation as a "phone a friend" whenever there's a tricky Python or Spark software development problem.
#CodeFirst, #TestMore, #CodeDoctor