Python Pointer: Find Files with os.walk()

Have a mess of files to read into Python? Maybe you downloaded Kaiko trade data, with unpredictable sub-directories and file names, from Penn+Box. Or maybe you’ve dropped TXT, PDF, and PY files into a single working directory that you’d rather not reorganize. A simple script will find the files you need, listing their names and paths for easy processing.

Because this process involves exploring our operating system’s file structure, we begin by importing the os module into our Python environment:

import os

The os module ships with Python’s standard library, so our upcoming file-listing code needs nothing beyond a standard Python installation.

Let’s define our file-listing function. We can name it, unimaginatively, list_files and give it two arguments, filepath and filetype:

def list_files(filepath, filetype):

filepath will tell the function where to start looking for files. This argument will take a file path string in your operating system’s format. (Be sure to encode or escape characters as appropriate.) When the function runs, it will assume this base directory contains all of the files and/or subfolders we need it to check.

filetype will tell the function what kind of file to find. This argument will take a file extension in string format (e.g.: '.csv' or '.TXT').
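
As a small aside on escaping, here is a sketch of two equivalent ways to write a Windows-style path string. The path itself is only an illustration. Both variables hold the same characters, so the comparison prints True.

```python
# Two equivalent ways to write the same Windows path (illustrative path only).
escaped = 'C:\\Users\\Public\\Downloads'   # each backslash escaped by doubling
raw = r'C:\Users\Public\Downloads'         # raw string: backslashes taken literally

print(escaped == raw)
print(raw)
```

Writing the same path with single backslashes in an ordinary (non-raw) string would fail, because Python would try to read \U as the start of a Unicode escape.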

Within our function, we’ll need to store any relevant file paths our script finds. Let’s create an empty list for this purpose:

paths = []

Practically speaking, our function will find each file within filepath, check whether its file extension matches a given filetype, and add relevant results to paths. We begin this iterative process with a for loop to find and examine each file:

for root, dirs, files in os.walk(filepath):

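In this configuration, os.walk() visits filepath and every directory beneath it. For each directory it visits, it yields a 3-tuple (an immutable sequence of three items), which we unpack as root (the path of the current directory), dirs (the names of its subdirectories), and files (the names of its files).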
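
To see what os.walk() hands back, the sketch below builds a small throwaway directory tree and records each tuple. The folder and file names (sub_a, sub_b, top.csv, and so on) are made up for the demonstration.

```python
import os
import tempfile

# Build a small temporary tree (hypothetical names) so the walk is reproducible.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "sub_a", "sub_b"))
    for rel in ("top.csv",
                os.path.join("sub_a", "a.txt"),
                os.path.join("sub_a", "sub_b", "b.csv")):
        open(os.path.join(base, rel), "w").close()  # create an empty file

    # os.walk yields one (root, dirs, files) tuple per directory it visits.
    walked = []
    for root, dirs, files in os.walk(base):
        walked.append((os.path.relpath(root, base), sorted(dirs), sorted(files)))

for entry in walked:
    print(entry)
```

Three directories produce three tuples: one for the base folder, one for sub_a, and one for the nested sub_b.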

Because files lists every file name in the directory currently being visited, our function needs to iterate through each individual file name. That calls for another for loop:

for file in files:

Within the file-level loop, our function can examine various aspects of each file. You may want to customize this section if your application has other requirements. For now, we’ll focus on checking files for a matching file extension.

Because comparing strings is case-sensitive while file extensions are commonly treated as case-insensitive (a file saved as DATA.CSV is still a CSV file), we use the lower() method to convert both file and filetype to lower-case strings (file.lower() and filetype.lower(), respectively). This avoids missed matches due to mismatched capitalization.

In turn, the endswith() method will compare the end of our lower case file (where the file extension lives) to the lower case filetype, returning True for a match or False otherwise.

We include our Boolean (True/False) result in an if statement so that only a matching file type (True outcome) triggers the next stage of our function.

if file.lower().endswith(filetype.lower()):
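
The check can be tried on its own, outside the loops. This sketch uses a made-up list of file names. The second half shows one possible customization that the function above does not use: endswith() also accepts a tuple of suffixes, which would allow matching several extensions at once.

```python
filetype = ".CSV"
names = ["trades.csv", "NOTES.TXT", "Prices.Csv", "csv_readme.md"]  # hypothetical names

# Lower-casing both sides makes the comparison ignore capitalization.
matches = [name for name in names if name.lower().endswith(filetype.lower())]
print(matches)

# Optional variation (an assumption, not part of the original function):
# pass a tuple of lower-case suffixes to match more than one extension.
wanted = (".csv", ".txt")
multi_matches = [name for name in names if name.lower().endswith(wanted)]
print(multi_matches)
```

Note that csv_readme.md is skipped in both cases, because endswith() only looks at the end of the name.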

If the file’s extension matches, we want to add file and its location to paths, our list of relevant file paths. os.path.join() will combine the root file path and file name to construct a complete address our operating system can reference. The append() method will add this complete file address to our list of paths:

paths.append(os.path.join(root, file))
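
As a quick illustration of what os.path.join() produces, the sketch below joins a stand-in root and file name. Both values are hypothetical. Inside the real function, os.walk() supplies them.

```python
import os

root = os.path.join("data", "2021")   # hypothetical directory path
file = "trades.csv"                   # hypothetical file name

# os.path.join inserts the separator your operating system expects.
full_path = os.path.join(root, file)
print(full_path)
```

The separator in the printed path depends on the operating system, which is why the function lets os.path.join() build the path instead of gluing strings together by hand.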

Our nested loops will iterate through our folders and files, dutifully developing our paths list. In order to make this list available outside of our function, we need one final line:

return paths

Altogether, our code should read as follows:

import os

def list_files(filepath, filetype):
    paths = []
    for root, dirs, files in os.walk(filepath):
        for file in files:
            if file.lower().endswith(filetype.lower()):
                paths.append(os.path.join(root, file))
    return paths

Calling the list_files function—after you’ve run the above—and saving the resulting file locations list as an object might look something like this:

my_files_list = list_files('C:\\Users\\Public\\Downloads', '.csv')
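
If you would like to try the function without pointing it at your own folders, the sketch below runs it against a temporary directory tree. The folder and file names are invented for the test, and the function is repeated so the example runs on its own.

```python
import os
import tempfile

def list_files(filepath, filetype):
    paths = []
    for root, dirs, files in os.walk(filepath):
        for file in files:
            if file.lower().endswith(filetype.lower()):
                paths.append(os.path.join(root, file))
    return paths

# Build a throwaway tree with a file two levels down (hypothetical names).
with tempfile.TemporaryDirectory() as base:
    nested = os.path.join(base, "Sub_Folder2", "Sub_Folder3")
    os.makedirs(nested)
    for path in (os.path.join(base, "File_1.csv"),
                 os.path.join(nested, "File_4.CSV"),
                 os.path.join(nested, "File_5.txt")):
        open(path, "w").close()  # create an empty file

    found = list_files(base, ".csv")
    # Store paths relative to the base folder so the result is easy to read.
    found_names = sorted(os.path.relpath(p, base) for p in found)

print(found_names)
```

The result includes File_1.csv from the base folder and File_4.CSV from the nested Sub_Folder3, while File_5.txt is left out because its extension does not match.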

Now that your code can find the files it needs, you can focus on merging data, analyzing text, or conducting whatever research you imagine.

7 thoughts on “Python Pointer: Find Files with os.walk()”

Gregory Tomy on March 14, 2021 at 12:34 pm said:

Thank you. This helped!

Noor on May 8, 2021 at 5:01 am said:

Thank you so much Kevin!

ken on September 11, 2021 at 8:20 pm said:

So , walk() generates a tuple with 3 items(root,dir,file). I don’t understand why we are not looping through the tuple first to get the 3 items then proceed to loop through each item to get what data is available !

- Kevin A Thomas on September 12, 2021 at 10:33 am said:
  
  If you want to extract and use or append the data from each file, go for it! In my original use case, I needed a list of the files to then feed, as a batch, into another program.
  
Kumar on June 20, 2022 at 1:27 am said:

When i use “for root, dirs, files in os.walk(filepath):” i have observed that files which are below sub directories are not processed. How to overcome this? Here is my example folder structure.
Root_Folder
|_ > Sub_Folder1
|_> File_1
|_ > Sub_Folder2
|_ > Sub_Folder3
|_ > File_4
|_ > File_5
|_ > File_2
|_ > File_3

In above structure, File_4 and File_5 are not accessible by for root, dirs, files in os.walk(filepath):

- Kumar on June 20, 2022 at 1:29 am said:
  
  Sorry spaces are not considered. File_4 & _5 path is Root_Folder/SubFolder2/SubFolder3/File_4 & 5
  
- Kevin A Thomas on June 21, 2022 at 10:33 am said:
  
  Strange, that behavior sounds different from my experience and the method documentation. If you are currently a student, staff member, or faculty member at the University of Pennsylvania, you’re welcome to schedule an appointment for help at https://libcal.library.upenn.edu/appointments/techconsults. If not, you might find an answer by searching and posting on a support message board such as https://stackoverflow.com/.
  

Datapoints: A blog from the Lippincott Library of the Wharton School of Business
