

Merging two csv files into one

Discussion in 'Getting Started' started by lvirden, Mar 12, 2018.

  1. lvirden

    lvirden New Member

    Joined:
    Oct 6, 2017
    Messages:
    7
    Likes Received:
    0
    So I have two csv files that share a common key. GNU join by itself can produce lines of output for records that have a common key.

    What I would love to have happen, however, is that when I use -a 1 -a 2, join substitutes the -e value for each field of the merged output that is missing because the line appears in only one file.
    In other words

    File 1
    a 1
    b 3
    c 4

    File 2
    a 2
    c 6
    d 5

    joined together with -a 1 -a 2 -e '""' currently produces
    a 1 2
    b 3
    c 4 6
    d 5

    but lines b and d should have 3 fields, not 2 fields. The expected output was
    a 1 2
    b 3 ""
    c 4 6
    d "" 5



    Is there a flag that I am missing that will produce the expected output?
    I tried
    -o 1.1,1.2,2.2
    but that ended up giving me
    a 1 2
    b 3 ""
    c 4 6
    "" "" 5

    Because the d record wasn't in the first file, it got an empty value.
    I don't want to see 2 occurrences of the matching key - just 1.
     
  2. Rob

    Rob Administrator
    Staff Member

    Joined:
    Oct 27, 2011
    Messages:
    277
    Likes Received:
    754
    that's a good one.. thinking @JasKinasis will have an answer for ya.. ;) Replying to view later..
     
  3. JasKinasis

    JasKinasis Well-Known Member

    Joined:
    Apr 25, 2017
    Messages:
    302
    Likes Received:
    573
    To my eyes, for basic merging purposes (merging two sets of data into one) - this is the syntax I would use:
    Code:
    join -a1 -a2 file1 file2
    Which would yield this:
    Code:
    a 1 2
    b 3
    c 4 6
    d 5
    
    That would typically be the output that I'd expect from join when attempting to merge two data-sets together.

    That way, all of the values for a, b, c and d are there in the merged file, and it doesn't matter which file each value originally came from - all of the data has been collated into a single file.

    But, from the looks of your requirements - you want to perform two things:
    1. Merge the data
    2. Be able to see which file each piece of data originally came from.

    Sadly, from looking at the man and info pages for join - AND from having a play with join - I've been unable to find anything that could yield the output you want.... At least not using a single join command on its own!

    But you could use two join commands and sort to get the output you require:
    Code:
    join -a 1 file1 file2 > outputfile
    join -a2 -v2 file1 file2 | awk '/ / {gsub(" ", "   ");print}' >> outputfile
    sort outputfile -o outputfile
    
    In the above:
    The first join command joins file1 and file2 - including lines that are only in file1 - and writes the output to a file called outputfile via output redirection.

    So after the first join command, outputfile would contain:
    Code:
    a 1 2
    b 3
    c 4 6
    
    The second join command outputs lines that are only in file2.
    Each of those lines is piped through awk, which expands the single space between the fields to three spaces and prints the result. The modified lines are then appended to outputfile via output redirection.

    The reason we're piping through awk and adding spaces in the second join command is that, otherwise, the lines would come out looking like this:
    Code:
    d 5
    
    And you want the values that exist only in file2 to be displayed in the third column. So we are using awk to expand the space in the middle of each line to three spaces, so that the value (5) ends up under the 3rd column of the output.
    e.g.
    Code:
    d   5
    
    Finally, we sort outputfile to ensure that all non-matching lines from file2 are in their correct place, overwriting outputfile with the sorted output.

    Yielding an outputfile, which looks like this:
    Code:
    a 1 2
    b 3
    c 4 6
    d   5
    
    NOTE: We don't NEED to sort the outputfile in the above example, because the only line that is in file2 and NOT file1 is the final line.

    But if the two datasets were like this:-
    file1:
    Code:
    a 1
    c 4
    e 7
    
    file2:
    Code:
    a 2
    b 3
    c 6
    d 5
    
    After the two join commands, the output would look like this:
    Code:
    a 1 2
    c 4 6
    e 7
    b   3
    d   5
    
    So it would need to be sorted to yield:
    Code:
    a 1 2
    b   3
    c 4 6
    d   5
    e 7
    
    So although the final sort isn't required with the sample data you have provided, it could be necessary with a different set of data run through those commands, so it would be prudent to keep it there.
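    To sanity-check the steps above, the whole pipeline can be run end-to-end against that second pair of sample files (file and directory names here are only illustrative):

```shell
# Work in a throwaway directory so nothing gets clobbered.
tmpdir=$(mktemp -d)
cd "$tmpdir"

# Recreate the second pair of sample data files.
printf 'a 1\nc 4\ne 7\n' > file1
printf 'a 2\nb 3\nc 6\nd 5\n' > file2

# Join, keeping unmatched lines from file1.
join -a1 file1 file2 > outputfile

# Append unmatched lines from file2, widening the separator
# so the value lands in the third column.
join -a2 -v2 file1 file2 | awk '/ / {gsub(" ", "   "); print}' >> outputfile

# Sort so the file2-only lines fall into their correct place.
sort outputfile -o outputfile

cat outputfile
# a 1 2
# b   3
# c 4 6
# d   5
# e 7
```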


    And now that we know what we are doing - we can turn those few commands into a re-usable script that can be run on any two data-files that match the format of your example files.

    The following script takes three parameters:
    1. /path/to/file1
    2. /path/to/file2
    3. /path/to/outputfile

    mergedata.sh (or you could call it whatever you want!)
    Code:
    #!/usr/bin/env bash
    
    # ensure we have 3 parameters
    if [[ $# -ne 3 ]]; then
        echo "Error - Requires exactly 3 parameters: "
        echo "usage: $0 /path/to/file1 /path/to/file2 /path/to/outputfile"
        exit 1
    fi
    
    # ensure file1 exists
    if [[ ! -f $1 ]]; then
        echo "Error: \"$1\" does not exist!"
        exit 1
    fi
    
    # ensure file2 exists
    if [[ ! -f $2 ]]; then
        echo "Error: \"$2\" does not exist!"
        exit 1
    fi
    
    # ensure the directory for the output file exists
    if [[ ! -d $(dirname "$3") ]]; then
        echo "Error: Unable to create $3"
        echo "    The directory \"$(dirname "$3")\" does not exist!"
        exit 1
    fi
    
    # join file1 and file2, including unmatched lines from file1,
    # and write to the output file
    join -a1 "$1" "$2" > "$3"
    
    # append unmatched lines from file2 to the output file,
    # adding some extra spaces with awk
    join -a2 -v2 "$1" "$2" | awk '/ / {gsub(" ", "   ");print}' >> "$3"
    
    # sort the output file in place
    sort "$3" -o "$3"
    
    # display the content of the file on-screen
    echo
    cat "$3"
    
    
    There's quite a bit more code in the above.
    First up there are a few checks to ensure that:
    1. We have the correct number of parameters
    2. Each input file exists
    3. The path to the output file is valid.

    If any of the checks fail - we print an error message and quit with an error code.

    Otherwise - if everything is OK, we perform our join operations on the two files, writing to the output file as we go. Then we sort the output file and display it on-screen.
     
    #3 JasKinasis, Mar 13, 2018
    Last edited: Mar 13, 2018
    nuna likes this.
  4. lvirden

    lvirden New Member

    Joined:
    Oct 6, 2017
    Messages:
    7
    Likes Received:
    0
    Thank you for your kind answer. However, one thing was misunderstood, and clearing it up might make the goal easier to understand.
    The intent of merging these two files is to create a single import file for a program that will assign each csv "column" to a different table column. The data in the import is purely positional - there are no per-line keys (no key - no record). And because each csv "column" is assigned to a particular table column, it is imperative that rows whose data comes from just one of the two files contain placeholders showing that the remaining columns should be treated as empty. Otherwise, the import would either take columns from the next row of data, generate an error for the rows with insufficient data, or assign csv columns to the wrong database columns - all of which would be disastrous.
     
  5. JasKinasis

    JasKinasis Well-Known Member

    Joined:
    Apr 25, 2017
    Messages:
    302
    Likes Received:
    573
    Ah - my bad!

    Sorry - I was messing around with join for quite a while when I first started composing the previous post.

    For some reason, by the time I got back to composing the message, I thought you had spaces in your blank fields. I should have referred back to your post.... Doh!

    Well, my previous solution was at least close...

    So what about 3 joins and a sort?
    Code:
    join file1 file2 > outputfile
    join -a1 -v1 file1 file2 | awk '{print $0" \"\""}' >> outputfile
    join -a2 -v2 file1 file2 | awk '/ / {gsub(" ", " \"\" ");print}' >> outputfile
    sort outputfile -o outputfile
    
    The 3 join commands in the above do the following:
    1. Writes lines whose keys are common to both files to the outputfile
    2. Appends lines that are only in file 1 (adding "" to the end of each line) to the output file
    3. Appends lines that are only in file 2 (inserting "" as the middle field) to the output file

    Then the sort just sorts everything based on the first field.
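    As a quick check, the three joins plus the sort can be run against the original sample files (names here are only illustrative):

```shell
# Throwaway working directory.
tmpdir=$(mktemp -d)
cd "$tmpdir"

# The original sample data.
printf 'a 1\nb 3\nc 4\n' > file1
printf 'a 2\nc 6\nd 5\n' > file2

# 1. Lines whose keys appear in both files.
join file1 file2 > outputfile
# 2. Lines only in file1, with "" appended as a placeholder.
join -a1 -v1 file1 file2 | awk '{print $0" \"\""}' >> outputfile
# 3. Lines only in file2, with "" inserted as the middle field.
join -a2 -v2 file1 file2 | awk '/ / {gsub(" ", " \"\" "); print}' >> outputfile
# Order everything by key.
sort outputfile -o outputfile

cat outputfile
# a 1 2
# b 3 ""
# c 4 6
# d "" 5
```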

    That would make the full standalone script look something like this:
    Code:
    #!/usr/bin/env bash
    
    # ensure we have 3 parameters
    if [[ $# -ne 3 ]]; then
        echo "Error - Requires exactly 3 parameters: "
        echo "usage: $0 /path/to/file1 /path/to/file2 /path/to/outputfile"
        exit 1
    fi
    
    # ensure file1 exists
    if [[ ! -f $1 ]]; then
        echo "Error: \"$1\" does not exist!"
        exit 1
    fi
    
    # ensure file2 exists
    if [[ ! -f $2 ]]; then
        echo "Error: \"$2\" does not exist!"
        exit 1
    fi
    
    # ensure the directory for the output file exists
    if [[ ! -d $(dirname "$3") ]]; then
        echo "Error: Unable to create $3"
        echo "    The directory \"$(dirname "$3")\" does not exist!"
        exit 1
    fi
    
    # join and output lines common to both files
    # and write to the output file
    join "$1" "$2" > "$3"
    
    # append unmatched lines from file1 to the output file,
    # adding a placeholder ("") at the end of each line
    join -a1 -v1 "$1" "$2" | awk '{print $0" \"\""}' >> "$3"
    
    # append unmatched lines from file2 to the output file,
    # adding a placeholder ("") as the middle field
    join -a2 -v2 "$1" "$2" | awk '/ / {gsub(" ", " \"\" ");print}' >> "$3"
    
    # sort the output file in place
    sort "$3" -o "$3"
    
    # display the content of the file on-screen
    echo
    cat "$3"
    
    
    Yielding this as the output:
    Code:
    a 1 2
    b 3 ""
    c 4 6
    d "" 5
    
    Is that a bit more like it?
    Sorry about the previous attempt!

    [edit]
    IMPORTANT:
    Something just occurred to me....
    I've just been assuming that you are only working with numeric data in your files - as per the examples.
    But if any of the unique fields from file 2 are not numeric and contain something like quoted strings with spaces
    e.g.
    If the unique line read in from data-file2 is something like:
    Fieldname "Dave Smith"

    - Then the awk after the 3rd join will insert "" at every space, yielding:
    Fieldname "" "Dave "" Smith"

    Whereas the expected output would be:
    Fieldname "" "Dave Smith"

    So if that is a problem, we would have to rethink the awk after the 3rd join command to ensure that only the first space is replaced.
    But if we're only dealing with numerical data, it won't be an issue!
    [/edit]
     
    #5 JasKinasis, Mar 13, 2018
    Last edited: Mar 13, 2018
  6. JasKinasis

    JasKinasis Well-Known Member

    Joined:
    Apr 25, 2017
    Messages:
    302
    Likes Received:
    573
    OK, in case you have strings in your data files, I have improved what happens after the 3rd join:

    Code:
    join -a2 -v2 file1 file2 | sed 's/ / "" /' >> outputfile
    That will only replace the first space in the unique lines from the 2nd file.
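    A quick way to convince yourself that quoted strings survive (using the example line from the previous post):

```shell
# A file2-only line whose value is a quoted string containing a space.
line='Fieldname "Dave Smith"'

# Replace only the FIRST space with the "" placeholder.
# (Inside single quotes no backslashes are needed: " is not special to sed.)
result=$(printf '%s\n' "$line" | sed 's/ / "" /')
printf '%s\n' "$result"
# Fieldname "" "Dave Smith"
```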

    So the 4 commands would now be:
    Code:
    join file1 file2 > outputfile
    join -a1 -v1 file1 file2 | awk '{print $0" \"\""}' >> outputfile
    join -a2 -v2 file1 file2 | sed 's/ / "" /' >> outputfile
    sort outputfile -o outputfile
    
    Making the full standalone script:
    Code:
    #!/usr/bin/env bash
    
    if [[ $# -ne 3 ]]; then
        echo "Error - Requires exactly 3 parameters: "
        echo "usage: $0 /path/to/file1 /path/to/file2 /path/to/outputfile"
        exit 1
    fi
    
    if [[ ! -f $1 ]]; then
        echo "Error: The file \"$1\" does not exist!"
        exit 1
    fi
    
    if [[ ! -f $2 ]]; then
        echo "Error: The file \"$2\" does not exist!"
        exit 1
    fi
    
    if [[ ! -d $(dirname "$3") ]]; then
        echo "Error: Unable to create $3"
        echo "    The directory \"$(dirname "$3")\" does not exist!"
        exit 1
    fi
    
    join "$1" "$2" > "$3"
    join -a1 -v1 "$1" "$2" | awk '{print $0" \"\""}' >> "$3"
    join -a2 -v2 "$1" "$2" | sed 's/ / "" /' >> "$3"
    sort "$3" -o "$3"
    
    # display the content of the file on-screen
    echo
    cat "$3"
    
    
    
     
    Rob and nuna like this.
  7. Contemplative Farmer

    Joined:
    Mar 21, 2018
    Messages:
    5
    Likes Received:
    7
    Have you looked at using Perl? It is included in most distributions and is both powerful and simple to use for applications such as yours. The user community is large, and the language is well documented and under continued development, though very mature. I would recommend you take a look at it.
    Hope that is helpful.
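For later readers: GNU coreutils join can also produce the output the original post asks for in a single command, because -e substitutes its value for any field named in -o that is missing, and the field designator 0 means the join key from whichever file supplies it. A sketch with the sample data from the first post:

```shell
# Throwaway working directory with the original sample files.
tmpdir=$(mktemp -d)
cd "$tmpdir"
printf 'a 1\nb 3\nc 4\n' > file1
printf 'a 2\nc 6\nd 5\n' > file2

# -o 0,1.2,2.2 : the key, file1's value, file2's value;
# -e '""'      : placeholder for any of those fields that is absent.
out=$(join -a1 -a2 -e '""' -o 0,1.2,2.2 file1 file2)
printf '%s\n' "$out"
# a 1 2
# b 3 ""
# c 4 6
# d "" 5
```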
     
