Python: Moving Files and Automating Tasks

adam lee
3 min read · Jun 5, 2024


I did a lot of traveling in the past, and as a result I have a lot of duplicate images on my hard drive. The issue came down to laziness: I would transfer photos to my external hard drive but never delete the originals from my device or memory card.

To solve this, I’ve written a Python script to automate the cleanup. Here’s what the file structure looks like:

project/
├── photos/
│   └── delete/
└── main.py

We’ll have a project folder that contains the main.py script and a photos folder that contains the duplicate images. When the script is run, it should move every duplicate image to a folder called delete; from there, you can audit the duplicates before deleting them for good.

First off, let’s take a look at the main.py file. We’ll import the following packages:

  • os: Provides functions for interacting with the operating system, such as file and directory manipulation.
  • hashlib: Offers algorithms for secure hash functions, useful for data integrity checks by generating hashes.
  • shutil: Contains functions for high-level file operations like copying, moving, and deleting files and directories.
  • Pillow: A library for opening, manipulating, and saving many different image file formats in Python.

We can install Pillow using the following pip command:

pip install pillow

and import the necessary libraries (hashlib, shutil, and os are part of Python’s standard library, so we don’t need to install them with pip):

import os
import hashlib
import shutil
from PIL import Image

Next, we’ll write a function to detect duplicate images using the hashlib library:

def calculate_hash(image_path):
    """Calculate an MD5 hash for an image file."""
    with Image.open(image_path) as img:
        hash_object = hashlib.md5(img.tobytes())
    return hash_object.hexdigest()

The calculate_hash function generates an MD5 hash for an image file so that the file can be identified by its contents. It opens the image using the Pillow library, converts the decoded pixel data into bytes, and computes the MD5 hash of those bytes using the hashlib library, returning the hash in hexadecimal form. Because the hash is computed over the pixel data rather than the raw file, two copies of the same image produce the same hash, which is exactly what we need for detecting duplicates.
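
As a quick sanity check, here’s a minimal sketch, using hypothetical filenames, that compares the hashes of two image files directly:

# Hypothetical filenames for illustration
hash_a = calculate_hash('photos/IMG_001.jpg')
hash_b = calculate_hash('photos/IMG_001_copy.jpg')
print(hash_a == hash_b)  # True when the two files contain identical pixel data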

Now that we have a function to detect duplicates, let’s write one that moves them to a folder we’ll call delete:

def find_and_move_duplicates(folder_path, delete_folder):
    """Find duplicate image files in the specified folder and move them to the delete folder."""
    hashes = {}
    if not os.path.exists(delete_folder):
        os.makedirs(delete_folder)

    for root, _, files in os.walk(folder_path):
        # Skip the delete folder itself so files we've already moved aren't re-scanned
        if os.path.abspath(root) == os.path.abspath(delete_folder):
            continue
        for file in files:
            if file.lower().endswith(('.jpg', '.jpeg', '.png', '.gif', '.bmp')):
                file_path = os.path.join(root, file)
                try:
                    file_hash = calculate_hash(file_path)
                    if file_hash in hashes:
                        duplicate_path = os.path.join(delete_folder, file)
                        # Ensure the target path is unique in case of naming conflicts
                        base, extension = os.path.splitext(duplicate_path)
                        counter = 1
                        while os.path.exists(duplicate_path):
                            duplicate_path = f"{base}_{counter}{extension}"
                            counter += 1
                        shutil.move(file_path, duplicate_path)
                        print(f"Moved {file_path} to {duplicate_path}")
                    else:
                        hashes[file_hash] = file_path
                except Exception as e:
                    print(f"Error processing {file_path}: {e}")

The find_and_move_duplicates function identifies duplicate image files in a specified folder and moves them to a designated delete folder. It first creates the delete folder if it doesn't exist, then walks the given folder and its subdirectories looking for image files, skipping the delete folder itself so already-moved files aren't re-scanned. For each image, it computes an MD5 hash using the calculate_hash function. If the hash matches one we've already seen, the file is a duplicate, so it's moved to the delete folder, with a counter appended to the filename if needed to avoid naming conflicts. First occurrences are recorded in the hashes dictionary, and any errors during processing are caught and reported.
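
With hypothetical filenames, the console output might look something like this (the _1 suffix appears when a file with the same name already exists in the delete folder):

Moved photos/IMG_001_copy.jpg to photos/delete/IMG_001_copy.jpg
Moved photos/vacation/beach.png to photos/delete/beach_1.png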

Finally, we can test out this script by creating our folder paths and passing them through the function:

# Define the paths
folder_path = 'photos' # Path to the folder containing the images
delete_folder = os.path.join(folder_path, 'delete') # Path to the delete folder

# Find and move duplicate files to the delete folder
find_and_move_duplicates(folder_path, delete_folder)

We’ve specified a folder_path to scan for duplicate images and a delete_folder path to move them into. From there, we call the find_and_move_duplicates function, passing folder_path and delete_folder as arguments.
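
After the script runs, you can do a quick audit of what was moved. Here’s a minimal sketch, assuming the same paths as above, that lists the contents of the delete folder for review:

# List everything the script moved into the delete folder
for name in sorted(os.listdir(delete_folder)):
    print(name)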

I hope this script comes in handy someday if you also need to sort through bulk batches of duplicate images. You can find the full source code in the repository here:


adam lee

I'm a martial artist and software engineer. I enjoy writing about Martial Arts, Personal Development, Technology, and Travel.