Merge two files based on similar line values.

FisherPRic3 · Mar 12, 2020

JasKinasis · Mar 12, 2020

If it's just two files and both files contain identical fields (e.g. Line1, Line2, Line3, Line4) - then you could simply use the cut and paste commands:

Bash:

paste -d'|' file1 <(cut -d'|' -f3 file2) > mergedfile

That uses '|' as a delimiter in the cut and paste commands.

Using this for file1:

Code:

| Line 1 | Data 1 |
| Line 2 | Data 2 |
| Line 3 | Data 3 |
| Line 4 | Data 4 |

And this for file2:

Code:

| Line 1 | Data A |
| Line 2 | Data B |
| Line 3 | Data C |
| Line 4 | Data D |

The command above yields the following mergedfile:

Code:

| Line 1 | Data 1 || Data A
| Line 2 | Data 2 || Data B
| Line 3 | Data 3 || Data C
| Line 4 | Data 4 || Data D

To fix the double pipe characters you could pipe to sed before the redirect.
e.g.

Bash:

paste -d'|' file1 <(cut -d'|' -f3 file2) | sed 's/||/|/g' > mergedfile

Which will yield this:

Code:

| Line 1 | Data 1 | Data A
| Line 2 | Data 2 | Data B
| Line 3 | Data 3 | Data C
| Line 4 | Data 4 | Data D

So that should work for joining two files which have the same field headers in them.

You could put that into a script which takes 3 paths as parameters.
1. Path/filename of base-file
2. Path/filename of file to merge in
3. Path/filename of output file

Which would give us something like this:

Bash:

#!/usr/bin/env bash

usage()
{
    echo "ERROR: $1"
    echo "TODO: add some information about this script:"
    echo "e.g."
    echo "Usage:"
    echo "mergedatafiles.sh file1 file2 outputfile"
    exit 1
}

# Check the parameters
if [[ $# -ne 3 ]] ; then
    usage "Incorrect number of parameters given..."
elif [[ ! -f $1 ]] ; then
    usage "$1 is not a file"
elif [[ ! -f $2 ]] ; then
    usage "$2 is not a file"
fi

# Now substitute the parameters into our original command
paste -d'|' "$1" <(cut -d'|' -f3 "$2") | sed 's/||/|/g' > "$3"

Save that as a shellscript - e.g. mergedatafiles.sh and make it executable via chmod:

Bash:

chmod +x mergedatafiles.sh

Then you can merge your files using the script.
./mergedatafiles.sh file1 file2 mergedfile

Then the following week/month/whenever you need to do it again:
./mergedatafiles.sh mergedfile file3 mergedfile2

But on subsequent merges - that still means merging your already merged base-file with another file and saving the result to a new merged file. So you'd get a new merged file every time you ran the script.

So we could modify the script a little, to remove the need for a new merge file each time.
Perhaps we only make it take two parameters - the base-file and the data-file to merge - then we write the output of our command to a temporary file and then afterwards we use mv to rename it to the base-file - the file in the first parameter.
So effectively we overwrite the base-file each time we merge a new file.....
e.g.

Bash:

#!/usr/bin/env bash

usage()
{
    echo "ERROR: $1"
    echo "TODO: add some information about this script:"
    echo "e.g."
    echo "Usage:"
    echo "nameofthisscript.sh file1 file2"
    echo "data from file2 will be merged into file1"
    exit 1
}

# Check the parameters
if [[ $# -ne 2 ]] ; then
    usage "Incorrect number of parameters given..."
elif [[ ! -f $1 ]] ; then
    usage "$1 is not a file"
elif [[ ! -f $2 ]] ; then
    usage "$2 is not a file"
fi

# Now substitute the parameters into our original command
paste -d'|' "$1" <(cut -d'|' -f3 "$2") | sed 's/||/|/g' > tempfile
mv tempfile "$1"

#optionally uncomment the line below to remove the 2nd file.
# rm "$2"

Now when you run the script:
./mergedatafiles.sh file1 file2
The data from file2 should be appended to file1.

And then the next time you merge:
./mergedatafiles.sh file1 file3

So now file1 will just grow as you merge in more data files. And if you uncomment the final line of the script - the file that was merged into the master/base-file will be deleted. Or you could substitute the rm with a mv - to move the file to a different location after it has been merged. The options here are virtually endless.

I recommend testing these ideas on a copy of your data, to make sure it doesn't completely bork it up. But I'm pretty confident it will do what you want.

rdrtx1 · Mar 12, 2020

Code:

awk -F"|" 'NR==FNR {a[$2]=$3; next} a[$2] {$0=$0 a[$2] FS} 1' file2 file1

FisherPRic3 · Mar 12, 2020

JasKinasis said:
If it's just two files and both files contain identical fields (e.g. Line1, Line2, Line3, Line4) - then you could simply use the cut and paste commands:

Bash:

paste -d'|' file1 <(cut -d'|' -f3 file2) > mergedfile

That uses '|' as a delimiter in the cut and paste commands.

Using this for file1:

Code:

| Line 1 | Data 1 | | Line 2 | Data 2 | | Line 3 | Data 3 | | Line 4 | Data 4 |

And this for file2:

Code:

| Line 1 | Data A | | Line 2 | Data B | | Line 3 | Data C | | Line 4 | Data D |

The command above yields the following mergedfile:

Code:

| Line 1 | Data 1 || Data A | Line 2 | Data 2 || Data B | Line 3 | Data 3 || Data C | Line 4 | Data 4 || Data D

To fix the double pipe characters you could pipe to sed before the redirect.
e.g.

Bash:

paste -d'|' file1 <(cut -d'|' -f3 file2) | sed 's/||/|/g' > mergedfile

Which will yield this:

Code:

| Line 1 | Data 1 | Data A | Line 2 | Data 2 | Data B | Line 3 | Data 3 | Data C | Line 4 | Data 4 | Data D

So that should work for joining two files which have the same field headers in them.

You could put that into a script which takes 3 paths as parameters.
1. Path/filename of base-file
2. Path/filename of file to merge in
3. Path/filename of output file

Which would give us something like this:

Bash:

#!/usr/bin/env bash usage() { echo "ERROR: $1" echo "TODO: add some information about this script:" echo "e.g." echo "Usage:" echo "mergedatafiles.sh file1 file2 outputfile" exit 1 } # Check the parameters if [[ $# -ne 3 ]] ; then usage "Incorrect number of parameters given..." elif [[ ! -f $1 ]] ; then usage "$1 is not a file" elif [[ ! -f $2 ]] ; then usage "$2 is not a file" fi # Now substitute the parameters into our original command paste -d'|' "$1" <(cut -d'|' -f3 "$2") | sed 's/||/|/g' > "$3"

Save that as a shellscript - e.g. mergedatafiles.sh and make it executable via chmod:

Bash:

chmod +x mergedatafiles.sh

Then you can merge your files using the script.
./mergedatafiles.sh file1 file2 mergedfile

Then the following week/month/whenever you need to do it again:
./mergedatafiles.sh mergedfile file3 mergedfile2

But on subsequent merges - that still means merging your already merged base-file with another file and saving the result to a new merged file. So you'd get a new merged file every time you ran the script.

So we could modify the script a little, to remove the need for a new merge file each time.
Perhaps we only make it take two parameters - the base-file and the data-file to merge - then we write the output of our command to a temporary file and then afterwards we use mv to rename it to the base-file - the file in the first parameter.
So effectively we overwrite the base-file each time we merge a new file.....
e.g.

Bash:

#!/usr/bin/env bash usage() { echo "ERROR: $1" echo "TODO: add some information about this script:" echo "e.g." echo "Usage:" echo "nameofthisscript.sh file1 file2" echo "data from file2 will be merged into file1" exit 1 } # Check the parameters if [[ $# -ne 2 ]] ; then usage "Incorrect number of parameters given..." elif [[ ! -f $1 ]] ; then usage "$1 is not a file" elif [[ ! -f $2 ]] ; then usage "$2 is not a file" fi # Now substitute the parameters into our original command paste -d'|' "$1" <(cut -d'|' -f3 "$2") | sed 's/||/|/g' > tempfile mv tempfile "$1" #optionally uncomment the line below to remove the 2nd file. # rm "$2"

Now when you run the script:
./mergedatafiles.sh file1 file2
The data from file2 should be appended to file1.

And then the next time you merge:
./mergedatafiles.sh file1 file3

So now file1 will just grow as you merge in more data files. And if you uncomment the final line of the script - the file that was merged into the master/base-file will be deleted. Or you could substitute the rm with a mv - to move the file to a different location after it has been merged. The options here are virtually endless.

I recommend testing these ideas on a copy of your data, to make sure it doesn't completely bork it up. But I'm pretty confident it will do what you want.

This worked flawlessly, but something i hadn't considered was that there are rows in between and at the top and bottom that have "+" instead of "|" So Currently my output looks like this.

Code:

+---------------+--------------+|+---------------+--------------+
| Line          | Data         || Data
+---------------+--------------+|+---------------+--------------+
| Line 1        | Data 1       || Data A
+---------------+--------------+|+---------------+--------------+

So i am still pulling the first column in the rows that have "+"'s. is there a good way to to get those out of there assuming the "-" will be a variable length every time.

FisherPRic3 · Mar 12, 2020

NVM. I figured that part out. it may be a little sloppy but i just run the merged file through the below sed commands.

Code:

sed -E 's/\+\|\+-+//g' final | sed 's/||/|/g'

this alters the merged file from what i posted above to this

Code:

+---------------+--------------+--------------+
| Line          | Data         | Data
+---------------+--------------+--------------+
| Line 1        | Data 1       | Data A
+---------------+--------------+--------------+

And now i am getting a little greedy but im wondering if there is a way for me to insert a "|" at the end of each line with a record in it.

rdrtx1 · Mar 12, 2020

Code:

awk '
{ gsub("[+][|][+]-+", ""); gsub("[|][|]", "|")}
/[+]$/ {l=length($0)}
! /[+]$/ && l {$0=sprintf("%-s%*s|", $0, l-length-1, " ")}
1
' final

FisherPRic3 · Mar 16, 2020

rdrtx1 said:

I get an "Unexpected token" on line 4 with this. Its well beyond my depth so im not sure what is wrong.

rdrtx1 · Mar 18, 2020

try using gawk.

FisherPRic3 · Mar 19, 2020

Same deal "awk: cmd. line:4: Unexpected token"

Merge two files based on similar line values.

FisherPRic3

New Member

JasKinasis

Super Moderator

rdrtx1

New Member

FisherPRic3

New Member

FisherPRic3

New Member

rdrtx1

New Member

FisherPRic3

New Member

rdrtx1

New Member

FisherPRic3

New Member

Members online

Latest posts