Merging two csv files into one

So I have two csv files that have a common key. GNU join by itself can produce lines of output for records which share that key.

What I would love to have happen, however, is that when I use -a 1 -a 2, the program produces the -e value for every field that is missing from the merged output because a line is absent from one of the files.
In other words

File 1
a 1
b 3
c 4

File 2
a 2
c 6
d 5

joined together with -a 1 -a 2 -e'""' currently produces
a 1 2
b 3
c 4 6
d 5

but lines b and d should have 3 fields, not 2 fields. The expected output was
a 1 2
b 3 ""
c 4 6
d "" 5

Is there a flag that I am missing that will produce the expected output?
I tried
-o '1.1 1.2 2.2'
but that ended up giving me
a 1 2
b 3 ""
c 4 6
"" "" 5

Because the d record wasn't in the first file, its key (field 1.1) got the empty value.
And I don't want to see two occurrences of the matching key - just one.
 


To my eyes, for basic merging purposes (merging two sets of data into one) - this is the syntax I would use:
Code:
join -a1 -a2 file1 file2

Which would yield this:
Code:
a 1 2
b 3
c 4 6
d 5

That would typically be the output that I'd expect from join when attempting to merge two data-sets together.

That way, all of the values for a, b, c and d are there in the merged file and I don't care which file each value originally came from - all of the data has been collated into a single file.

But, from the looks of your requirements - you want to perform two things:
1. Merge the data
2. Be able to see which file each piece of data originally came from.

Sadly, from looking at the man and info pages for join - AND from having a play with join - I've been unable to find anything that could yield the output you want.... At least not using a single join command on its own!

But you could use two join commands and sort to get the output you require:
Code:
join -a 1 file1 file2 > outputfile
join -a2 -v2 file1 file2 | awk '/ / {gsub(" ", "   ");print}' >> outputfile
sort outputfile -o outputfile

In the above:
The first join command joins file1 and file2, including lines that are only in file1, and writes the output to a file called outputfile via output redirection.

So after the first join command, outputfile would contain:
Code:
a 1 2
b 3
c 4 6

The second join command outputs the lines that are only in file2.
Each of those lines is piped through awk, which expands each space to three spaces and prints the result. The modified lines are then appended to outputfile via output redirection.

The reason we're piping through awk and adding spaces in the second join command is because otherwise, the lines would come out looking like this:
Code:
d 5

And you want the values that exist only in file2 to be displayed in the third column of the output (the file2 value column). So we are using awk to expand the space in the middle of each line to three spaces, which pushes the value (5) across so that it lines up under the third column.
e.g.
Code:
d   5
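You can see the expansion in isolation by feeding a stand-in line to the same awk command (the echo here just simulates the second join's output):
Code:
$ echo "d 5" | awk '/ / {gsub(" ", "   ");print}'
d   5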

Finally, we sort outputfile to ensure that all non-matching lines from file2 are in their correct place, overwriting outputfile with the sorted output.

Yielding an outputfile, which looks like this:
Code:
a 1 2
b 3
c 4 6
d   5

NOTE: We don't NEED to sort the outputfile in the above example, because the only line that is in file2 and NOT file1 is the final line.

But if the two datasets were like this:
file1:
Code:
a 1
c 4
e 7

file2:
Code:
a 2
b 3
c 6
d 5

After the two join commands, the output would look like this:
Code:
a 1 2
c 4 6
e 7
b   3
d   5

So it would need to be sorted to yield:
Code:
a 1 2
b   3
c 4 6
d   5
e 7

So although the final sort isn't required with the sample data you have provided, it could be necessary if you run a different set of data through those commands. It would be prudent to keep the final sort there.


And now that we know what we are doing - we can turn those few commands into a re-usable script that can be run on any two data-files that match the format of your example files.

The following script takes three parameters:
1. /path/to/file1
2. /path/to/file2
3. /path/to/outputfile

mergedata.sh (or you could call it whatever you want!)
Code:
#!/usr/bin/env bash

# ensure we have 3 parameters
if [[ $# -ne 3 ]]; then
    echo "Error - Requires exactly 3 parameters: "
    echo "usage: $0 /path/tofile1 /path/to/file2 /path/to/outputfile"
    exit 1
fi

# ensure file1 exists
if [[ ! -f $1 ]]; then
   echo "Error: \"$1\" does not exist!"
   exit 1
fi

# ensure file2 exists
if [[ ! -f $2 ]]; then
   echo "Error: \"$2\" does not exist!"
   exit 1
fi

# ensure the directory for the output file exists
if [[ ! -d $(dirname "$3") ]]; then
   echo "Error: Unable to create $3"
   echo "    The directory \"$(dirname "$3")\" does not exist!"
   exit 1
fi

# join file1 and file2, including unmatched lines from file1,
# and write to the output file
join -a1 "$1" "$2" > "$3"

# append unmatched lines from file2 to the output file,
# adding some extra spaces with awk
join -a2 -v2 "$1" "$2" | awk '/ / {gsub(" ", "   ");print}' >> "$3"

# sort the output file
sort "$3" -o "$3"

# display the content of the file on-screen
echo
cat "$3"

There's quite a bit more code in the above.
First up there are a few checks to ensure that:
1. We have the correct number of parameters
2. Each input file exists
3. The path to the output file is valid.

If any of the checks fail - we print an error message and quit with an error code.

Otherwise - if everything is OK, we perform our join operations on the two files, writing to the output file as we go. Then we sort the output file and display it on-screen.
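For example, assuming the script is saved as mergedata.sh in the current directory alongside the example files, a run would look something like this:
Code:
$ chmod +x mergedata.sh
$ ./mergedata.sh file1 file2 outputfile

a 1 2
b 3
c 4 6
d   5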
 
Thank you for your kind answer. However, one thing was misunderstood, and clearing it up might make things easier to understand.
The intent of merging these two files is to create one import file for a program that is going to assign each csv "column" to a different table column. Thus, keys are not maintained for each line; the data is just a list (no key - no record). And given that each csv "column" is assigned to a different table column, it is imperative that data that comes from just one of the two files has placeholders to show that the remaining columns need to be treated as empty. Otherwise, the import would either take columns from the next row of data, generate an error for the rows with insufficient data, or assign the wrong csv columns to the wrong database columns - all of which would be disastrous.
 
Ah - my bad!

Sorry - I was messing around with join for quite a while when I first started composing the previous post.

For some reason, by the time I got back to composing the message, I thought you had spaces in your blank fields. I should have referred back to your post.... Doh!

Well, my previous solution was at least close...

So what about 3 joins and a sort?
Code:
join file1 file2 > outputfile
join -a1 -v1 file1 file2 | awk '{print $0" \"\""}' >> outputfile
join -a2 -v2 file1 file2 | awk '/ / {gsub(" ", " \"\" ");print}' >> outputfile
sort outputfile -o outputfile

The 3 join commands in the above do the following:
1. Writes lines whose key is in both files to the outputfile
2. Appends lines that are only in file1 (adding "" to the end of each line) to the output file
3. Appends lines that are only in file2 (inserting "" into the middle of each line) to the output file

Then the sort just sorts everything based on the first field.
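For your example files, each stage emits the following before the final sort (shown here with a $ prompt for illustration):
Code:
$ join file1 file2
a 1 2
c 4 6
$ join -a1 -v1 file1 file2 | awk '{print $0" \"\""}'
b 3 ""
$ join -a2 -v2 file1 file2 | awk '/ / {gsub(" ", " \"\" ");print}'
d "" 5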

That would make the full standalone script look something like this:
Code:
#!/usr/bin/env bash

# ensure we have 3 parameters
if [[ $# -ne 3 ]]; then
   echo "Error - Requires exactly 3 parameters: "
   echo "usage: $0 /path/tofile1 /path/to/file2 /path/to/outputfile"
   exit 1
fi

# ensure file1 exists
if [[ ! -f $1 ]]; then
   echo "Error: \"$1\" does not exist!"
   exit 1
fi

# ensure file2 exists
if [[ ! -f $2 ]]; then
   echo "Error: \"$2\" does not exist!"
   exit 1
fi

# ensure the directory for the output file exists
if [[ ! -d $(dirname "$3") ]]; then
   echo "Error: Unable to create $3"
   echo "    The directory \"$(dirname "$3")\" does not exist!"
   exit 1
fi

# join and output lines common to both files
# and write to the output file
join "$1" "$2" > "$3"

# append unmatched lines from file1 to the output file,
# adding a placeholder ("") at the end of each line
join -a1 -v1 "$1" "$2" | awk '{print $0" \"\""}' >> "$3"

# append unmatched lines from file2 to the output file,
# adding a placeholder ("") to the middle of each line
join -a2 -v2 "$1" "$2" | awk '/ / {gsub(" ", " \"\" ");print}' >> "$3"

# sort the output file
sort "$3" -o "$3"

# display the content of the file on-screen
echo
cat "$3"

Yielding this as the output:
Code:
a 1 2
b 3 ""
c 4 6
d "" 5

Is that a bit more like it?
Sorry about the previous attempt!

[edit]
IMPORTANT:
Something just occurred to me....
I've just been assuming that you are only working with numeric data in your files - as per the examples.
But if any of the unique fields from file 2 are not numeric and contain something like quoted strings with spaces
e.g.
If the unique line read in from data-file2 is something like:
Fieldname "Dave Smith"

- Then the awk after the 3rd join will insert "" at each space, yielding:
Fieldname "" "Dave "" Smith"

Whereas the expected output would be:
Fieldname "" "Dave Smith"

So if that is a problem, we would have to rethink the awk after the 3rd join command to ensure that only the first space is replaced.
But if we're only dealing with numerical data, it won't be an issue!
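You can demonstrate the problem with a stand-in line (the echo here just simulates the third join's output):
Code:
$ echo 'Fieldname "Dave Smith"' | awk '/ / {gsub(" ", " \"\" ");print}'
Fieldname "" "Dave "" Smith"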
[/edit]
 
OK, in case you have strings in your data files, I have improved what happens after the 3rd join:

Code:
join -a2 -v2 file1 file2 | sed -e 's/ / \"\" /' >> outputfile
That will replace only the first space in each of the unique lines from the 2nd file, because sed's s command without the g flag replaces only the first match on a line.
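As a quick sanity check with the "Dave Smith" line from my previous post (again, echo stands in for the join output):
Code:
$ echo 'Fieldname "Dave Smith"' | sed -e 's/ / \"\" /'
Fieldname "" "Dave Smith"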

So the 4 commands would now be:
Code:
join file1 file2 > outputfile
join -a1 -v1 file1 file2 | awk '{print $0" \"\""}' >> outputfile
join -a2 -v2 file1 file2 | sed -e 's/ / \"\" /' >> outputfile
sort outputfile -o outputfile

Making the full standalone script:
Code:
#!/usr/bin/env bash

if [[ $# -ne 3 ]]; then
    echo "Error - Requires exactly 3 parameters: "
    echo "usage: $0 /path/tofile1 /path/to/file2 /path/to/outputfile"
    exit 1
fi

if [[ ! -f $1 ]]; then
   echo "Error: The file \"$1\" does not exist!"
   exit 1
fi

if [[ ! -f $2 ]]; then
   echo "Error: The file \"$2\" does not exist!"
   exit 1
fi

if [[ ! -d $(dirname "$3") ]]; then
   echo "Error: Unable to create $3"
   echo "    The directory \"$(dirname "$3")\" does not exist!"
   exit 1
fi

join "$1" "$2" > "$3"
join -a1 -v1 "$1" "$2" | awk '{print $0" \"\""}' >> "$3"
join -a2 -v2 "$1" "$2" | sed -e 's/ / \"\" /' >> "$3"
sort "$3" -o "$3"

# display the content of the file on-screen
echo
cat "$3"
 
Have you looked at using Perl? It is included in most distributions and is both powerful and simple to use for applications such as yours. The user community is large, and the language is well documented and under current, continued development... though very mature. I would recommend you take a look at it.
Hope that is helpful.
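To give a flavour of that approach, here is a minimal, untested Perl sketch of the same merge. It assumes space-separated key/value lines like the example files, takes the two input files as command-line arguments, and prints the merged lines (with "" placeholders) to stdout:
Code:
#!/usr/bin/env perl
# merge two key/value files, padding missing values with ""
use strict;
use warnings;

my (%f1, %f2);

# read each file into a hash keyed on the first field
open my $fh1, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while (<$fh1>) { chomp; $f1{$1} = $2 if /^(\S+)\s+(.+)$/; }
close $fh1;

open my $fh2, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
while (<$fh2>) { chomp; $f2{$1} = $2 if /^(\S+)\s+(.+)$/; }
close $fh2;

# walk the union of keys in sorted order, substituting
# the "" placeholder wherever a value is missing
my %keys = map { $_ => 1 } (keys %f1, keys %f2);
for my $k (sort keys %keys) {
    my $v1 = exists $f1{$k} ? $f1{$k} : '""';
    my $v2 = exists $f2{$k} ? $f2{$k} : '""';
    print "$k $v1 $v2\n";
}

Run it as, say, ./merge.pl file1 file2 > outputfile (merge.pl is just an illustrative name).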
 
