verbose with sort merge and does parallel help any?

None-yet

Member
Credits
898
Hey Guys and Gals. When using
Code:
sort *.txt -m -u --parallel=4> 12.3-addon.txt
(1) Is there any way to know how far along the merge process is? I tried using -v and --verbose but neither works. And (2) has anyone ever checked to see if the
Code:
--parallel=4
actually speeds things up?

I found that merging 3 large files together into a new document doesn't seem to be any faster, but if I break those 3 into 20 smaller documents it seems to shave maybe 5 minutes off, which I am not sure is much help. I'd like to get input on this. Thanks!
 


dos2unix

Well-Known Member
Credits
3,030
How many CPU cores? Are they multi-thread?

Not exactly sure what you mean by merge process, but if you mean how do they all get merged into one file... what do you think the ">" does?
 

JasKinasis

Well-Known Member
Credits
5,050
1. Can sort display progress?
No, sadly it cannot.
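
That said, you can get a rough idea from outside the process. Here is a minimal sketch, assuming a merge (-m), where the output is written as the merge goes along and can never be bigger than the inputs combined. It compares the size of the output file so far with the total size of the inputs. The function name merge_progress is made up for illustration. With -u the duplicates are dropped, so the figure will stop short of 100%, and for a full sort (no -m) the output only appears at the very end, so this trick tells you nothing there.

```shell
# merge_progress OUTPUT INPUT...
# Prints roughly how far a running "sort -m" has got, as a percentage:
# bytes written to OUTPUT so far versus total bytes in the INPUT files.
merge_progress() {
    out=$1
    shift
    total=0
    for f in "$@"; do
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    written=$(wc -c < "$out")
    if [ "$total" -eq 0 ]; then
        # Avoid dividing by zero when the inputs are empty.
        echo "0%"
    else
        echo "$((written * 100 / total))%"
    fi
}
```

Run it from a second terminal while the merge is going, and list the input files explicitly rather than using *.txt, otherwise the output file gets counted as an input too.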

2. Does --parallel increase the speed of a sort?
Not really. In practice it seems to be mostly useful for limiting the number of processor threads/cores used.

A quick peek at sort's info page (via info coreutils 'sort invocation') shows that the default behaviour of sort is to use however many processors the machine has available, up to a maximum of 8. Apparently there are diminishing performance gains beyond 8 processors, so the default is capped at 8.

So, if you run sort without using --parallel, it will just use however many processors you have available (or a maximum of 8).

One other thing to note is that the more threads you use, the more RAM sort needs: according to the info page, using n threads increases the memory usage by a factor of log n.

So really, you only need to use --parallel if you want to limit the number of processors to use.

So for example, if you're ssh'd into a critical server, which is under a heavy load and you need to run sort on a large file - you might want to limit the number of threads to 2 - to allow other processors to still be used for other more critical tasks that the server is performing.

In that case - it will mean that the sort takes longer, but it won't be hogging all of the cpu threads. You can also cap the size of its main-memory buffer with -S (--buffer-size).
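
Something along these lines, for example. This is only a rough sketch assuming GNU sort; the wrapper name polite_sort, the 2-thread cap and the 64M buffer are arbitrary values picked for illustration, not recommendations.

```shell
# polite_sort FILE...
# Sort the given files using at most 2 threads and a 64 MiB main-memory
# buffer, so the job leaves the rest of the machine alone.
# Requires GNU sort (--parallel and -S are GNU options).
polite_sort() {
    sort --parallel=2 -S 64M "$@"
}
```

Running nproc first will tell you how many processors are available, which is the number sort would otherwise start from.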

So in your case - you probably don't need to use --parallel.
Not unless you need to limit the number of parallel threads to use.

However - considering that memory usage grows logarithmically with the number of threads - if you are sorting an immense file, you might want to try the --parallel option and set it to 1.
That should make the sort a little less memory intensive, so there will be fewer of the expensive memory-related allocations going on.
Which MIGHT possibly speed things up a little.
Doing the sort in one thread will take longer, but - at the same time - the lighter memory load might counterbalance things a little?! IDK - might be worth a shot.
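
If you want to test that rather than guess, time it both ways on your own data. A minimal sketch, assuming GNU sort and a date command that supports +%s; the function name compare_parallel is made up, and whole-second resolution is crude, so it only tells you anything on a file big enough to take a while.

```shell
# compare_parallel FILE
# Sorts FILE once with a single thread and once with sort's default
# thread count, printing the elapsed whole seconds for each run.
# The sorted output itself is thrown away.
compare_parallel() {
    file=$1

    start=$(date +%s)
    sort --parallel=1 "$file" > /dev/null
    end=$(date +%s)
    echo "parallel=1: $((end - start))s"

    start=$(date +%s)
    sort "$file" > /dev/null
    end=$(date +%s)
    echo "default: $((end - start))s"
}
```

Run it a couple of times, since the first pass may be slower simply because the file isn't in the disk cache yet.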

3. Regarding splitting the files VS monolithic files:
Breaking the huge files up into smaller chunks will naturally speed things up, because it requires less memory to be able to hold all of the data and sort it. So each sort is much less resource intensive.

With a huge, monolithic file, the sort will require a lot more memory, and once the data no longer fits you're probably going to get huge amounts of data being shuttled back and forth between RAM and disk (the swap-file, or sort's own temporary files), which is quite a slow and resource intensive operation.

Also, internally - inside the program, there may be a lot of memory allocations and/or re-allocations going on during a large sort as things are being shifted around. And these kinds of memory operations are also computationally quite expensive.

So sorting a large number of smaller files is probably going to be quicker than sorting one extremely large file.
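
The chunked approach can be sketched like this: sort each piece on its own, then merge the sorted pieces. It's a minimal sketch only, and sort_then_merge and the chunk file names are made up for illustration. Bear in mind that -m expects its inputs to be sorted already, which is why each piece gets its own sort first.

```shell
# sort_then_merge OUTPUT INPUT...
# Sorts each INPUT into its own temporary file, then merges the sorted
# pieces with "sort -m -u" into OUTPUT.
sort_then_merge() {
    out=$1
    shift
    [ "$#" -gt 0 ] || return 1          # need at least one input file
    tmpdir=$(mktemp -d) || return 1
    i=0
    for f in "$@"; do
        i=$((i + 1))
        # Each chunk is small, so each of these sorts is cheap.
        sort -u "$f" > "$tmpdir/chunk.$i" || { rm -rf "$tmpdir"; return 1; }
    done
    # Merge the already-sorted chunks, dropping duplicate lines.
    sort -m -u -o "$out" "$tmpdir"/chunk.*
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```

Usage would be something like sort_then_merge merged.txt part1.txt part2.txt part3.txt, where those file names are just placeholders.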
 

wizardfromoz

Super Moderator
Staff member
Gold Supporter
Credits
9,809
G'day @None-yet :)

I am moving this to Command Line, unless there is a Kali-specific reason for having it there.

Avagudweegend.
 

None-yet

Member
Credits
898
Thanks for all of the input. I have a much better understanding of the process now. I should have posted in Command Line to start with. Thanks.
 
