Merge two files based on similar line values.

FisherPRic3

New Member
Joined
Mar 12, 2020
Messages
5
Reaction score
0
Credits
0
Hey everyone,

I have a very specific goal in mind here and its possible that i may just end up needing a complicated script to accomplish this but please see below for my example.

I have a script that produces some text similar to what you see below.

| Line 1 | Data 1 |
| Line 2 | Data 2 |
| Line 3 | Data 3 |
| Line 4 | Data 4 |

This data comes from an sql query and the goal here is that the Data column will be changed following this query and i want to run the query again after the change with the updated data from column 2 being appended to its respective line as a third column. So the new query would produce the below text.

| Line 1 | Data 1 |
| Line 2 | Data 2 |
| Line 3 | Data 3 |
| Line 4 | Data 4 |

And the end goal would be to merge the two results so that the changes can be easily viewed right next to echo other. Ex. Below.

| Line 1 | Data 1 | Data 1(New) |
| Line 2 | Data 2 | Data 2(New) |
| Line 3 | Data 3 | Data 3(New) |
| Line 4 | Data 4 | Data 4(New) |

Im not sure if there is an easy way to accomplish this or not. If anyone can provide any insight it would by much appreciated.
 


If it's just two files and both files contain identical fields (e.g. Line1, Line2, Line3, Line4) - then you could simply use the cut and paste commands:
Bash:
paste -d'|' file1 <(cut -d'|' -f3 file2) > mergedfile

That uses '|' as a delimiter in the cut and paste commands.

Using this for file1:
Code:
| Line 1 | Data 1 |
| Line 2 | Data 2 |
| Line 3 | Data 3 |
| Line 4 | Data 4 |

And this for file2:
Code:
| Line 1 | Data A |
| Line 2 | Data B |
| Line 3 | Data C |
| Line 4 | Data D |

The command above yields the following mergedfile:
Code:
| Line 1 | Data 1 || Data A
| Line 2 | Data 2 || Data B
| Line 3 | Data 3 || Data C
| Line 4 | Data 4 || Data D

To fix the double pipe characters you could pipe to sed before the redirect.
e.g.
Bash:
paste -d'|' file1 <(cut -d'|' -f3 file2) | sed 's/||/|/g' > mergedfile

Which will yield this:
Code:
| Line 1 | Data 1 | Data A
| Line 2 | Data 2 | Data B
| Line 3 | Data 3 | Data C
| Line 4 | Data 4 | Data D

So that should work for joining two files which have the same field headers in them.

You could put that into a script which takes 3 paths as parameters.
1. Path/filename of base-file
2. Path/filename of file to merge in
3. Path/filename of output file

Which would give us something like this:
Bash:
#!/usr/bin/env bash

usage()
{
    echo "ERROR: $1"
    echo "TODO: add some information about this script:"
    echo "e.g."
    echo "Usage:"
    echo "mergedatafiles.sh file1 file2 outputfile"
    exit 1
}

# Check the parameters
if [[ $# -ne 3 ]] ; then
    usage "Incorrect number of parameters given..."
elif [[ ! -f $1 ]] ; then
    usage "$1 is not a file"
elif [[ ! -f $2 ]] ; then
    usage "$2 is not a file"
fi

# Now substitute the parameters into our original command
paste -d'|' "$1" <(cut -d'|' -f3 "$2") | sed 's/||/|/g' > "$3"

Save that as a shellscript - e.g. mergedatafiles.sh and make it executable via chmod:
Bash:
chmod +x mergedatafiles.sh

Then you can merge your files using the script.
./mergedatafiles.sh file1 file2 mergedfile

Then the following week/month/whenever you need to do it again:
./mergedatafiles.sh mergedfile file3 mergedfile2

But on subsequent merges - that still means merging your already merged base-file with another file and saving the result to a new merged file. So you'd get a new merged file every time you ran the script.

So we could modify the script a little, to remove the need for a new merge file each time.
Perhaps we only make it take two parameters - the base-file and the data-file to merge - then we write the output of our command to a temporary file and then afterwards we use mv to rename it to the base-file - the file in the first parameter.
So effectively we overwrite the base-file each time we merge a new file.....
e.g.
Bash:
#!/usr/bin/env bash

usage()
{
    echo "ERROR: $1"
    echo "TODO: add some information about this script:"
    echo "e.g."
    echo "Usage:"
    echo "nameofthisscript.sh file1 file2"
    echo "data from file2 will be merged into file1"
    exit 1
}

# Check the parameters
if [[ $# -ne 2 ]] ; then
    usage "Incorrect number of parameters given..."
elif [[ ! -f $1 ]] ; then
    usage "$1 is not a file"
elif [[ ! -f $2 ]] ; then
    usage "$2 is not a file"
fi

# Now substitute the parameters into our original command
paste -d'|' "$1" <(cut -d'|' -f3 "$2") | sed 's/||/|/g' > tempfile
mv tempfile "$1"

#optionally uncomment the line below to remove the 2nd file.
# rm "$2"

Now when you run the script:
./mergedatafiles.sh file1 file2
The data from file2 should be appended to file1.

And then the next time you merge:
./mergedatafiles.sh file1 file3

So now file1 will just grow as you merge in more data files. And if you uncomment the final line of the script - the file that was merged into the master/base-file will be deleted. Or you could substitute the rm with a mv - to move the file to a different location after it has been merged. The options here are virtually endless.

I recommend testing these ideas on a copy of your data, to make sure it doesn't completely bork it up. But I'm pretty confident it will do what you want.
 
If it's just two files and both files contain identical fields (e.g. Line1, Line2, Line3, Line4) - then you could simply use the cut and paste commands:
Bash:
paste -d'|' file1 <(cut -d'|' -f3 file2) > mergedfile

That uses '|' as a delimiter in the cut and paste commands.

Using this for file1:
Code:
| Line 1 | Data 1 |
| Line 2 | Data 2 |
| Line 3 | Data 3 |
| Line 4 | Data 4 |

And this for file2:
Code:
| Line 1 | Data A |
| Line 2 | Data B |
| Line 3 | Data C |
| Line 4 | Data D |

The command above yields the following mergedfile:
Code:
| Line 1 | Data 1 || Data A
| Line 2 | Data 2 || Data B
| Line 3 | Data 3 || Data C
| Line 4 | Data 4 || Data D

To fix the double pipe characters you could pipe to sed before the redirect.
e.g.
Bash:
paste -d'|' file1 <(cut -d'|' -f3 file2) | sed 's/||/|/g' > mergedfile

Which will yield this:
Code:
| Line 1 | Data 1 | Data A
| Line 2 | Data 2 | Data B
| Line 3 | Data 3 | Data C
| Line 4 | Data 4 | Data D

So that should work for joining two files which have the same field headers in them.

You could put that into a script which takes 3 paths as parameters.
1. Path/filename of base-file
2. Path/filename of file to merge in
3. Path/filename of output file

Which would give us something like this:
Bash:
#!/usr/bin/env bash

usage()
{
    echo "ERROR: $1"
    echo "TODO: add some information about this script:"
    echo "e.g."
    echo "Usage:"
    echo "mergedatafiles.sh file1 file2 outputfile"
    exit 1
}

# Check the parameters
if [[ $# -ne 3 ]] ; then
    usage "Incorrect number of parameters given..."
elif [[ ! -f $1 ]] ; then
    usage "$1 is not a file"
elif [[ ! -f $2 ]] ; then
    usage "$2 is not a file"
fi

# Now substitute the parameters into our original command
paste -d'|' "$1" <(cut -d'|' -f3 "$2") | sed 's/||/|/g' > "$3"

Save that as a shellscript - e.g. mergedatafiles.sh and make it executable via chmod:
Bash:
chmod +x mergedatafiles.sh

Then you can merge your files using the script.
./mergedatafiles.sh file1 file2 mergedfile

Then the following week/month/whenever you need to do it again:
./mergedatafiles.sh mergedfile file3 mergedfile2

But on subsequent merges - that still means merging your already merged base-file with another file and saving the result to a new merged file. So you'd get a new merged file every time you ran the script.

So we could modify the script a little, to remove the need for a new merge file each time.
Perhaps we only make it take two parameters - the base-file and the data-file to merge - then we write the output of our command to a temporary file and then afterwards we use mv to rename it to the base-file - the file in the first parameter.
So effectively we overwrite the base-file each time we merge a new file.....
e.g.
Bash:
#!/usr/bin/env bash

usage()
{
    echo "ERROR: $1"
    echo "TODO: add some information about this script:"
    echo "e.g."
    echo "Usage:"
    echo "nameofthisscript.sh file1 file2"
    echo "data from file2 will be merged into file1"
    exit 1
}

# Check the parameters
if [[ $# -ne 2 ]] ; then
    usage "Incorrect number of parameters given..."
elif [[ ! -f $1 ]] ; then
    usage "$1 is not a file"
elif [[ ! -f $2 ]] ; then
    usage "$2 is not a file"
fi

# Now substitute the parameters into our original command
paste -d'|' "$1" <(cut -d'|' -f3 "$2") | sed 's/||/|/g' > tempfile
mv tempfile "$1"

#optionally uncomment the line below to remove the 2nd file.
# rm "$2"

Now when you run the script:
./mergedatafiles.sh file1 file2
The data from file2 should be appended to file1.

And then the next time you merge:
./mergedatafiles.sh file1 file3

So now file1 will just grow as you merge in more data files. And if you uncomment the final line of the script - the file that was merged into the master/base-file will be deleted. Or you could substitute the rm with a mv - to move the file to a different location after it has been merged. The options here are virtually endless.

I recommend testing these ideas on a copy of your data, to make sure it doesn't completely bork it up. But I'm pretty confident it will do what you want.

This worked flawlessly, but something i hadn't considered was that there are rows in between and at the top and bottom that have "+" instead of "|" So Currently my output looks like this.

Code:
+---------------+--------------+|+---------------+--------------+
| Line          | Data         || Data
+---------------+--------------+|+---------------+--------------+
| Line 1        | Data 1       || Data A
+---------------+--------------+|+---------------+--------------+

So i am still pulling the first column in the rows that have "+"'s. is there a good way to to get those out of there assuming the "-" will be a variable length every time.
 
NVM. I figured that part out. it may be a little sloppy but i just run the merged file through the below sed commands.

Code:
sed -E 's/\+\|\+-+//g' final | sed 's/||/|/g'

this alters the merged file from what i posted above to this

Code:
+---------------+--------------+--------------+
| Line          | Data         | Data
+---------------+--------------+--------------+
| Line 1        | Data 1       | Data A
+---------------+--------------+--------------+

And now i am getting a little greedy but im wondering if there is a way for me to insert a "|" at the end of each line with a record in it.
 
Code:
awk '
{ gsub("[+][|][+]-+", ""); gsub("[|][|]", "|")}
/[+]$/ {l=length($0)}
! /[+]$/ && l {$0=sprintf("%-s%*s|", $0, l-length-1, " ")}
1
' final

I get an "Unexpected token" on line 4 with this. Its well beyond my depth so im not sure what is wrong.
 

Members online


Latest posts

Top