Compare files & deduplicate file without sorting?


postcd

Guest
Hello,

I have two files, each with several thousand lines,

and I want to deduplicate file1 (remove from it the lines that already exist in file2).

These lines contain various symbols like quotation marks, $, etc.

The command must not change the order of the lines, just remove the duplicate lines.

I found several on-topic tutorials, but they use sorting, which I can't use because the line order must stay exactly as it is; I only want the duplicate lines removed.

Thank you
 


Yes, if you want to keep the order of the original file, using sort and uniq would not help you here.
Off the top of my head, something like this should work. But it might be a little slow if you are dealing with thousands of lines....
sortscript.sh:
Code:
#!/usr/bin/bash
file1="$1"
file2="$2"
outfile="output"
while IFS='' read -r line || [[ -n $line ]];
do
    # -F treats the line as a fixed string (no regex), -x matches the whole line exactly
    if ! grep -Fx -- "$line" "$file2" &> /dev/null ; then
        # printf avoids echo mangling lines that start with - or contain backslashes
        printf '%s\n' "$line" >> "$outfile"
    fi
done < "$file1"
Make sure you use chmod to make the script executable.
The script takes two files as parameters, so you run it like this:
Code:
./sortscript.sh /path/to/file1 /path/to/file2

The script will read each line from file1 and then use grep to check whether that EXACT line exists in file2. The -F option tells grep to treat the line as a fixed string rather than a regular expression (so symbols like quotation marks and $ are matched literally), and -x makes it match whole lines only. If the line is not found in file2, it will write that line to a file called output.
At the end of the script, all of the lines from file1 that do not appear in file2 are contained in the output file, in their original order. Any output from grep is redirected to /dev/null to prevent unnecessary output from grep being echoed to the screen.
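If you want to test the matching by hand on a single troublesome line, you can run the same kind of grep check yourself; the sample line below is just made up for illustration:
Code:
grep -Fx -- 'some line with "quotes" and a $SYMBOL' /path/to/file2 && echo "found" || echo "not found"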

If you are happy with the results this script yields, you could add the following line to the end of the script to replace file1 with the output file:
Code:
mv "$outfile" "$file1"
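If you want to be careful, you could also keep a backup copy of the original file1 before overwriting it, by adding something like this just above the mv line (the .bak suffix is just a suggestion):
Code:
cp -- "$file1" "$file1.bak"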

Again, not sure how long it will take to work on files containing thousands of entries. It might be slow, but it should at least work.
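If it does turn out to be too slow (the script runs one grep over the whole of file2 for every single line of file1), a common alternative for this kind of job is an awk one-liner. This is only a sketch, but it loads file2 into memory once and then prints the lines of file1 that were not seen in file2, keeping their original order:
Code:
# First pass (NR==FNR) stores each line of file2 as a key in the "seen" array;
# second pass prints only the lines of file1 that are not stored. Order is preserved.
awk 'NR==FNR { seen[$0]; next } !($0 in seen)' /path/to/file2 /path/to/file1 > output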
Let me know how you get on!
 
