Merge two files based on similar line values.

FisherPRic3

New Member
Credits
0
Hey everyone,

I have a very specific goal in mind here and its possible that i may just end up needing a complicated script to accomplish this but please see below for my example.

I have a script that produces some text similar to what you see below.

| Line 1 | Data 1 |
| Line 2 | Data 2 |
| Line 3 | Data 3 |
| Line 4 | Data 4 |

This data comes from an sql query and the goal here is that the Data column will be changed following this query and i want to run the query again after the change with the updated data from column 2 being appended to its respective line as a third column. So the new query would produce the below text.

| Line 1 | Data 1 |
| Line 2 | Data 2 |
| Line 3 | Data 3 |
| Line 4 | Data 4 |

And the end goal would be to merge the two results so that the changes can be easily viewed right next to echo other. Ex. Below.

| Line 1 | Data 1 | Data 1(New) |
| Line 2 | Data 2 | Data 2(New) |
| Line 3 | Data 3 | Data 3(New) |
| Line 4 | Data 4 | Data 4(New) |

Im not sure if there is an easy way to accomplish this or not. If anyone can provide any insight it would by much appreciated.
 


JasKinasis

Well-Known Member
Credits
1,193
If it's just two files and both files contain identical fields (e.g. Line1, Line2, Line3, Line4) - then you could simply use the cut and paste commands:
Bash:
paste -d'|' file1 <(cut -d'|' -f3 file2) > mergedfile
That uses '|' as a delimiter in the cut and paste commands.

Using this for file1:
Code:
| Line 1 | Data 1 |
| Line 2 | Data 2 |
| Line 3 | Data 3 |
| Line 4 | Data 4 |
And this for file2:
Code:
| Line 1 | Data A |
| Line 2 | Data B |
| Line 3 | Data C |
| Line 4 | Data D |
The command above yields the following mergedfile:
Code:
| Line 1 | Data 1 || Data A
| Line 2 | Data 2 || Data B
| Line 3 | Data 3 || Data C
| Line 4 | Data 4 || Data D
To fix the double pipe characters you could pipe to sed before the redirect.
e.g.
Bash:
paste -d'|' file1 <(cut -d'|' -f3 file2) | sed 's/||/|/g' > mergedfile
Which will yield this:
Code:
| Line 1 | Data 1 | Data A
| Line 2 | Data 2 | Data B
| Line 3 | Data 3 | Data C
| Line 4 | Data 4 | Data D
So that should work for joining two files which have the same field headers in them.

You could put that into a script which takes 3 paths as parameters.
1. Path/filename of base-file
2. Path/filename of file to merge in
3. Path/filename of output file

Which would give us something like this:
Bash:
#!/usr/bin/env bash

usage()
{
    echo "ERROR: $1"
    echo "TODO: add some information about this script:"
    echo "e.g."
    echo "Usage:"
    echo "mergedatafiles.sh file1 file2 outputfile"
    exit 1
}

# Check the parameters
if [[ $# -ne 3 ]] ; then
    usage "Incorrect number of parameters given..."
elif [[ ! -f $1 ]] ; then
    usage "$1 is not a file"
elif [[ ! -f $2 ]] ; then
    usage "$2 is not a file"
fi

# Now substitute the parameters into our original command
paste -d'|' "$1" <(cut -d'|' -f3 "$2") | sed 's/||/|/g' > "$3"
Save that as a shellscript - e.g. mergedatafiles.sh and make it executable via chmod:
Bash:
chmod +x mergedatafiles.sh
Then you can merge your files using the script.
./mergedatafiles.sh file1 file2 mergedfile

Then the following week/month/whenever you need to do it again:
./mergedatafiles.sh mergedfile file3 mergedfile2

But on subsequent merges - that still means merging your already merged base-file with another file and saving the result to a new merged file. So you'd get a new merged file every time you ran the script.

So we could modify the script a little, to remove the need for a new merge file each time.
Perhaps we only make it take two parameters - the base-file and the data-file to merge - then we write the output of our command to a temporary file and then afterwards we use mv to rename it to the base-file - the file in the first parameter.
So effectively we overwrite the base-file each time we merge a new file.....
e.g.
Bash:
#!/usr/bin/env bash

usage()
{
    echo "ERROR: $1"
    echo "TODO: add some information about this script:"
    echo "e.g."
    echo "Usage:"
    echo "nameofthisscript.sh file1 file2"
    echo "data from file2 will be merged into file1"
    exit 1
}

# Check the parameters
if [[ $# -ne 2 ]] ; then
    usage "Incorrect number of parameters given..."
elif [[ ! -f $1 ]] ; then
    usage "$1 is not a file"
elif [[ ! -f $2 ]] ; then
    usage "$2 is not a file"
fi

# Now substitute the parameters into our original command
paste -d'|' "$1" <(cut -d'|' -f3 "$2") | sed 's/||/|/g' > tempfile
mv tempfile "$1"

#optionally uncomment the line below to remove the 2nd file.
# rm "$2"
Now when you run the script:
./mergedatafiles.sh file1 file2
The data from file2 should be appended to file1.

And then the next time you merge:
./mergedatafiles.sh file1 file3

So now file1 will just grow as you merge in more data files. And if you uncomment the final line of the script - the file that was merged into the master/base-file will be deleted. Or you could substitute the rm with a mv - to move the file to a different location after it has been merged. The options here are virtually endless.

I recommend testing these ideas on a copy of your data, to make sure it doesn't completely bork it up. But I'm pretty confident it will do what you want.
 

FisherPRic3

New Member
Credits
0
If it's just two files and both files contain identical fields (e.g. Line1, Line2, Line3, Line4) - then you could simply use the cut and paste commands:
Bash:
paste -d'|' file1 <(cut -d'|' -f3 file2) > mergedfile
That uses '|' as a delimiter in the cut and paste commands.

Using this for file1:
Code:
| Line 1 | Data 1 |
| Line 2 | Data 2 |
| Line 3 | Data 3 |
| Line 4 | Data 4 |
And this for file2:
Code:
| Line 1 | Data A |
| Line 2 | Data B |
| Line 3 | Data C |
| Line 4 | Data D |
The command above yields the following mergedfile:
Code:
| Line 1 | Data 1 || Data A
| Line 2 | Data 2 || Data B
| Line 3 | Data 3 || Data C
| Line 4 | Data 4 || Data D
To fix the double pipe characters you could pipe to sed before the redirect.
e.g.
Bash:
paste -d'|' file1 <(cut -d'|' -f3 file2) | sed 's/||/|/g' > mergedfile
Which will yield this:
Code:
| Line 1 | Data 1 | Data A
| Line 2 | Data 2 | Data B
| Line 3 | Data 3 | Data C
| Line 4 | Data 4 | Data D
So that should work for joining two files which have the same field headers in them.

You could put that into a script which takes 3 paths as parameters.
1. Path/filename of base-file
2. Path/filename of file to merge in
3. Path/filename of output file

Which would give us something like this:
Bash:
#!/usr/bin/env bash

usage()
{
    echo "ERROR: $1"
    echo "TODO: add some information about this script:"
    echo "e.g."
    echo "Usage:"
    echo "mergedatafiles.sh file1 file2 outputfile"
    exit 1
}

# Check the parameters
if [[ $# -ne 3 ]] ; then
    usage "Incorrect number of parameters given..."
elif [[ ! -f $1 ]] ; then
    usage "$1 is not a file"
elif [[ ! -f $2 ]] ; then
    usage "$2 is not a file"
fi

# Now substitute the parameters into our original command
paste -d'|' "$1" <(cut -d'|' -f3 "$2") | sed 's/||/|/g' > "$3"
Save that as a shellscript - e.g. mergedatafiles.sh and make it executable via chmod:
Bash:
chmod +x mergedatafiles.sh
Then you can merge your files using the script.
./mergedatafiles.sh file1 file2 mergedfile

Then the following week/month/whenever you need to do it again:
./mergedatafiles.sh mergedfile file3 mergedfile2

But on subsequent merges - that still means merging your already merged base-file with another file and saving the result to a new merged file. So you'd get a new merged file every time you ran the script.

So we could modify the script a little, to remove the need for a new merge file each time.
Perhaps we only make it take two parameters - the base-file and the data-file to merge - then we write the output of our command to a temporary file and then afterwards we use mv to rename it to the base-file - the file in the first parameter.
So effectively we overwrite the base-file each time we merge a new file.....
e.g.
Bash:
#!/usr/bin/env bash

usage()
{
    echo "ERROR: $1"
    echo "TODO: add some information about this script:"
    echo "e.g."
    echo "Usage:"
    echo "nameofthisscript.sh file1 file2"
    echo "data from file2 will be merged into file1"
    exit 1
}

# Check the parameters
if [[ $# -ne 2 ]] ; then
    usage "Incorrect number of parameters given..."
elif [[ ! -f $1 ]] ; then
    usage "$1 is not a file"
elif [[ ! -f $2 ]] ; then
    usage "$2 is not a file"
fi

# Now substitute the parameters into our original command
paste -d'|' "$1" <(cut -d'|' -f3 "$2") | sed 's/||/|/g' > tempfile
mv tempfile "$1"

#optionally uncomment the line below to remove the 2nd file.
# rm "$2"
Now when you run the script:
./mergedatafiles.sh file1 file2
The data from file2 should be appended to file1.

And then the next time you merge:
./mergedatafiles.sh file1 file3

So now file1 will just grow as you merge in more data files. And if you uncomment the final line of the script - the file that was merged into the master/base-file will be deleted. Or you could substitute the rm with a mv - to move the file to a different location after it has been merged. The options here are virtually endless.

I recommend testing these ideas on a copy of your data, to make sure it doesn't completely bork it up. But I'm pretty confident it will do what you want.
This worked flawlessly, but something i hadn't considered was that there are rows in between and at the top and bottom that have "+" instead of "|" So Currently my output looks like this.

Code:
+---------------+--------------+|+---------------+--------------+
| Line          | Data         || Data
+---------------+--------------+|+---------------+--------------+
| Line 1        | Data 1       || Data A
+---------------+--------------+|+---------------+--------------+
So i am still pulling the first column in the rows that have "+"'s. is there a good way to to get those out of there assuming the "-" will be a variable length every time.
 

FisherPRic3

New Member
Credits
0
NVM. I figured that part out. it may be a little sloppy but i just run the merged file through the below sed commands.

Code:
sed -E 's/\+\|\+-+//g' final | sed 's/||/|/g'
this alters the merged file from what i posted above to this

Code:
+---------------+--------------+--------------+
| Line          | Data         | Data
+---------------+--------------+--------------+
| Line 1        | Data 1       | Data A
+---------------+--------------+--------------+
And now i am getting a little greedy but im wondering if there is a way for me to insert a "|" at the end of each line with a record in it.
 

FisherPRic3

New Member
Credits
0
Code:
awk '
{ gsub("[+][|][+]-+", ""); gsub("[|][|]", "|")}
/[+]$/ {l=length($0)}
! /[+]$/ && l {$0=sprintf("%-s%*s|", $0, l-length-1, " ")}
1
' final
I get an "Unexpected token" on line 4 with this. Its well beyond my depth so im not sure what is wrong.
 


Members online


Latest posts

Top