
Trying to multi-thread SHA checksum verification on a directory.

Based on this blog, this is the code which uses SHA-256 to get the hash of an entire directory:

Bash:
dir=<mydir>; find "$dir" -type f -exec sha256sum {} \; | sed "s~$dir~~g" | LC_ALL=C sort -d | sha256sum

I want to multi-thread this. AI is too dumb, so don't even try. Actually, ChatGPT did give me this command, which does perform quite a bit better than the single-threaded version:
Bash:
dir=<mydir>
num_threads=4  # Specify the number of threads you want to use
time find "$dir" -type f | xargs -P$num_threads -n 1 sha256sum | sed "s~$dir~~g" | LC_ALL=C sort -d | sha256sum
But I have minimal knowledge of bash scripting, so I can't verify whether this code is the most efficient. Benchmarking shows that the single-threaded version took 115 seconds while this one took 47 seconds. That is still not the 3-4x improvement I was expecting if I am using 8 threads. It also doesn't really max out all my CPU threads (all of them were hovering around 50-80%), although replacing sha256sum with sha3sum -a 512 does max them out, but of course that is a much more compute-heavy hashing algorithm.
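
One thing I did notice from the xargs man page is that -n 1 starts a brand new sha256sum process for every single file, which I guess adds overhead when there are lots of small files. Something like this batched variant might help, but I have not tested it and the batch size of 64 is just a guess:
Bash:
dir=<mydir>
num_threads=8
# Hand each sha256sum invocation up to 64 files instead of one, to cut process-startup overhead
find "$dir" -type f | xargs -P$num_threads -n 64 sha256sum | sed "s~$dir~~g" | LC_ALL=C sort -d | sha256sum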

I want to efficiently multi-thread this. My idea is that first we find and make a list all the objects we need to hash on a single thread, then we bring the threads and apply a grouping system where each thread gets a group of the objects to process on so that we avoid multiple threads working on the same object. After that we collect all the hashes from all the threads, free the threads, and then process the master checksum command to hash all these hashes on a single thread and get the resulting hash.

I just explained how to structure it. I have no idea how to code. I am still just learning.
 


If you want fast hash computation over a large number of files and you don't require a specific algorithm, then I suggest you benchmark which algorithm will be the fastest.

For instance, to compare the speed of md5 vs sha1 vs sha256:
Bash:
# Check which algorithm is faster
openssl speed md5 sha1 sha256

This will start the benchmark.

On my system sha1 is the fastest, so I used the following command to compute hashes of my game library:
Bash:
# Calculate hashes
find ./games -type f -exec sha1sum {} \; > sha1sum.txt

It took an hour to complete.

If you're working with *.exe files then it's better to verify certificates instead; not only is it faster, but you can also be sure the file is original. Example command:
Bash:
sudo update-ca-certificates
# Check certificates
find ./games -type f -name "*.exe" -exec bash -c 'echo "{}"; osslsigncode verify "{}" | grep "^Number of verified signatures:"' \; > signature.txt
 
It already is multithreaded.
What? How?

Oh, so you mean the SHA command itself is multi-threaded? Yeah, that is why even when running sequentially I can see all of my CPU threads doing some activity, but at less than 20%, and it is actually mostly just thread switching rather than sustained work on all the threads.

But as I mentioned, the multi-threading through xargs that ChatGPT gave does improve performance by a lot (and yes, the resulting hash is correct); it's just that I doubt it is running at peak efficiency.

By multi-threading, I just meant that instead of one single software thread (not one CPU thread) calculating the hash, we use multiple threads that calculate the hash.

Please examine the code ChatGPT gave. I have a suspicion that multiple threads might be working on the same object, though that is unlikely given the improvement I saw; still, that improvement is not enough considering 1 vs 8 threads.
 
Oh, so you mean the SHA command itself is multi-threaded? Yeah, that is why even when running sequentially I can see all of my CPU threads doing some activity, but at less than 20%, and it is actually mostly just thread switching rather than sustained work on all the threads.
The program won't use 100% of all your CPU cores.

But as I mentioned, the multi-threading through xargs that ChatGPT gave does improve performance by a lot (and yes, the resulting hash is correct); it's just that I doubt it is running at peak efficiency.

By multi-threading, I just meant that instead of one single software thread (not one CPU thread) calculating the hash, we use multiple threads that calculate the hash.
Yeah, xargs can be used to start multiple processes, but this is not multi-threading, it's multi-processing.

Please examine the code ChatGPT gave. I have a suspicion that multiple threads might be working on the same object, though that is unlikely given the improvement I saw; still, that improvement is not enough considering 1 vs 8 threads.
I suggest you start multiple processes manually, but pay attention that each process deals with its own set of objects, so that multiple processes don't handle the same file.
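
For example, something along these lines (a rough, untested sketch; it assumes GNU split and xargs, and the worker count and temp directory are placeholders):
Bash:
dir=<mydir>
num_workers=8
tmp=$(mktemp -d)

# 1. Build the full file list on a single thread
find "$dir" -type f > "$tmp/files.txt"

# 2. Split the list into one chunk per worker without breaking lines
split -n l/$num_workers "$tmp/files.txt" "$tmp/chunk."

# 3. Each worker hashes only its own chunk and writes to its own output file
for chunk in "$tmp"/chunk.*; do
    xargs -r -d '\n' sha256sum < "$chunk" > "$chunk.sha" &
done
wait

# 4. Combine the per-worker results into one master hash
cat "$tmp"/chunk.*.sha | sed "s~$dir~~g" | LC_ALL=C sort -d | sha256sum
rm -r "$tmp"

Each file appears in exactly one chunk, so no two processes ever hash the same file.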

How many files do you have and how are they organized? Are they all in the same directory or spread across sub-directories?
 
I suggest you start multiple processes manually, but pay attention that each process deals with its own set of objects, so that multiple processes don't handle the same file.
I don't know how to code, that is why I am seeking help -_-.

How many files do you have and how are they organized? Are they all in the same directory or spread across sub-directories?
There are multiple files and folders inside its root folder. All this doesn't even matter.

If you are good at bash scripting (you can use other programming languages as well), then please help me multi-thread or multi-process this.
 
Here is bash code for your case; I have tested it on my system and it runs much faster than a single sequential run:

Bash:
for dir in */ ; do
    # Skip symlinked directories
    [ -L "${dir%/}" ] && continue
    # Hash every file in this sub-directory in a background job,
    # writing the results into that sub-directory's own sha1sum.txt
    find "${dir}" -type f ! -name "sha1sum.txt" -exec sha1sum {} \; > "${dir}/sha1sum.txt" &
done
# Hash the files sitting directly in the top-level directory
find . -maxdepth 1 -type f ! -name "sha1sum.txt" -exec sha1sum {} \; > "sha1sum.txt"
wait

You first need to cd into the root directory where your data is and then run this code.

What it does is start a separate sha1sum job for each sub-directory, all running in parallel.

Next, it creates one sha1sum.txt file for each sub-directory.
This is needed so that each instance writes to its own file, to prevent a data race.

You will thus end up with as many sha1sum.txt files as there are child sub-directories (depth of 1), plus one for the top-level directory itself.

And here is how you check all the SHA files afterwards:

Bash:
for dir in */ ; do
    # Verify each sub-directory's sha1sum.txt in a background job
    find "${dir}" -type f -name "sha1sum.txt" -exec sha1sum -c -w {} \; > "${dir}/sha1sum-status.txt" &
done
# Verify the top-level sha1sum.txt
find . -maxdepth 1 -type f -name "sha1sum.txt" -exec sha1sum -c -w {} \; > "sha1sum-status.txt"
wait

This creates the "status" files saying whether the hashes match.
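
And if you still want a single master hash at the end, like in your original command, you can combine the per-directory files afterwards (untested sketch; run it from the same root directory):
Bash:
# Combine all per-directory hash lists, sort them for a stable order, then hash the result
cat sha1sum.txt */sha1sum.txt | LC_ALL=C sort -d | sha1sum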
 
@CaffeineAddict

Actually, before I even try your script, I first wanted to verify whether the original command works or not. It does give me the same hash if I run it twice, so there is no problem there, but there is a problem with, I think, git or something.

So the directory I want to hash is a git-cloned repo. I did a test where I cloned the repo, ran the hash command, and deleted the cloned repo. I did this 3 times, and all 3 times I got 3 different hashes. But if I run the hash command again without deleting the repo, I get the same hash. So every time I clone that same repo again, something is different and I get a different hash? If so, then the hash command is literally useless.

Do you have any idea what is going on?
 

