Optimizing BASH Scripts

Discussion in 'Shell / Command Line' started by DevynCJohnson, Jul 10, 2013.

  1. DevynCJohnson

    Shell script writers sometimes end up with a script that runs slowly without being able to tell which code is causing the problem. Even when they find the slow code, they may struggle to find a better way to write it. This article shows examples of inefficient code, explains how to rewrite them efficiently, and covers techniques that help scripts execute quickly.

    One tip that helps in shell script writing is to leave out code that serves no real purpose or merely acts as a "middleman". For instance, instead of writing "ls | cat > ./my_file", type "ls > ./my_file". The cat command (short for "concatenate") only copies its input to its output, so here it passes along the listing from ls (which lists a folder's contents) unchanged. The redirection operator (>) already sends the output of ls to the file, so the cat command is not needed and only costs an extra process and a pipe.
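
    As a minimal sketch of this point (the scratch directory and file names are made up for illustration), both forms below produce the same file, but the second one starts one process fewer:

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory so the example does not touch real files.
workdir=$(mktemp -d)
mkdir "$workdir/data"
touch "$workdir/data/a.txt" "$workdir/data/b.txt"

# With the middleman: ls writes to a pipe, cat copies the pipe to the file.
ls "$workdir/data" | cat > "$workdir/with_cat"

# Without it: the shell connects the output of ls directly to the file.
ls "$workdir/data" > "$workdir/direct"

# cmp -s exits with status 0 when the two files are byte-for-byte identical.
if cmp -s "$workdir/with_cat" "$workdir/direct"; then
    same_output=yes
else
    same_output=no
fi
echo "identical output: $same_output"
```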

    One of the best features to use in BASH scripts to speed up a script is the "&" operator. Putting an ampersand at the end of a command tells the BASH interpreter to run that command in the background as a separate process (this is not true multithreading, although it has a similar effect). The command runs while the rest of the script continues. Without the ampersand, the next line is not executed until the previous line has completed. For illustration:


    Code:
    hostname > ./hostname_file &
    cal 2017
    This code puts the computer's hostname in a file and displays a calendar at the same time. Without the ampersand, the user would not see the calendar until after the hostname file had been created.
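
    The effect is easier to see with a command that is deliberately slow. The sketch below is only an illustration: slow_task and the scratch directory are hypothetical stand-ins, and it borrows the "wait" command that is explained later in this article.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)   # hypothetical scratch directory

# slow_task is a stand-in for any long-running command.
slow_task() {
    sleep 1
    echo "slow task done" > "$workdir/result"
}

slow_task &            # start it in the background
bg_pid=$!              # $! holds the process ID of the last background job

# This check runs immediately, while slow_task is still sleeping.
if [ -e "$workdir/result" ]; then early=present; else early=missing; fi
echo "result file right after launch: $early"

wait "$bg_pid"         # block until the background job has finished
final=$(< "$workdir/result")
echo "result file after wait: $final"
```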

    The ampersand does not work with all lines of code. A variable assigned in a background command is not visible to the rest of the script. For example, this code will not work:


    Code:
    NUMBERS=$(cat ./large_file | sed -r -e "s|([0-9]*)|\1|gI") &
    ping linux.org

    This code processes the numbers in a large file that may be several gigabytes in size, which takes a long time. If the script later references the NUMBERS variable, the variable will be empty. This is because a background command is run in a subshell, a separate copy of the shell. The assignment happens in that copy and is discarded when the subshell exits, so the script itself never sees the value. The programmer does not want the script to wait before pinging Linux.org, but also knows that the variable will not be saved with this type of coding. Here is the solution:
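
    A small experiment makes the lost variable visible. In this sketch a trivial echo stands in for the slow command, and the "wait" command (explained later in this article) rules out timing as the cause:

```shell
#!/usr/bin/env bash
unset NUMBERS

# The assignment runs in a background subshell; the parent never sees it.
NUMBERS=$(echo "12345") &
wait                   # even after the job finishes, nothing is copied back

after_background=${NUMBERS:-unset}
echo "after background assignment: $after_background"

# The same assignment in the foreground works as expected.
NUMBERS=$(echo "12345")
echo "after foreground assignment: $NUMBERS"
```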


    Code:
    cat ./large_file | sed -r -e "s|([0-9]*)|\1|gI" > ./output &
    ping linux.org
    # ... many lines of other code ...
    wait    # make sure the background job has finished writing ./output
    NUMBERS=$(< ./output)

    This code sends the output to a file. Just before the data is used, the script waits for the background job to finish (the "wait" command is explained further below), then reads the file and saves the contents to the variable. Now, the time-consuming process is performed while other code is executed. When the information is needed as a variable, the script can load it from the file.
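
    Here is a self-contained variation of the same pattern that can be run as-is. It is a sketch under a few assumptions: the scratch directory and the tiny sample file are made up, and grep -o is used in place of the sed expression simply because it prints only the digits.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)                                  # hypothetical scratch directory
printf 'abc 12\nxyz 345\n' > "$workdir/large_file"    # stand-in for the big file

# Start the extraction in the background, writing to a file.
# grep -o prints only the matching parts, one per line.
grep -o -E '[0-9]+' "$workdir/large_file" > "$workdir/output" &
extract_pid=$!

# ... other work would run here while the extraction is in progress ...

# Make sure that one job is finished before reading its output file.
wait "$extract_pid"
NUMBERS=$(< "$workdir/output")
echo "$NUMBERS"
```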


    Programmers can also start several background commands in a row:


    Code:
    hostname > ./hostname &
    unity --version > ./program_version &
    cat ./large_file | sed -r -e "s|([0-9]*)|\1|gI" > ./output &
    less ./hostname

    The first three lines are started in the background, and the script carries on without waiting for them. However, there is a problem with this code. The fourth line opens the file "hostname" in the terminal for the user to view, but that file may still be empty or incomplete because the command writing it may not have finished. Instead, add one essential line:


    Code:
    hostname > ./hostname &
    unity --version > ./program_version &
    cat ./large_file | sed -r -e "s|([0-9]*)|\1|gI" > ./output &
    wait
    less ./hostname 

    The "wait" command tells the interpreter to pause until all background commands started by the script have finished executing. Now, the above code will work.
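
    "wait" can also be given the process ID of one particular background job, in which case it returns that job's exit status. The sketch below uses two made-up jobs, one that succeeds and one that fails with status 3, to show how a script might check each result:

```shell
#!/usr/bin/env bash
# Two stand-in background jobs with known exit statuses.
( sleep 1; exit 0 ) &
ok_pid=$!
( sleep 1; exit 3 ) &
bad_pid=$!

# "wait PID" blocks until that job is done and returns its exit status.
wait "$ok_pid"
ok_status=$?
wait "$bad_pid"
bad_status=$?

echo "first job exited with $ok_status, second with $bad_status"
```

    Waiting on specific jobs lets the script continue as soon as the data it needs is ready, instead of pausing for every background command.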

    In summary, do not assign a variable in a background command, and do not let two dependent commands run at the same time. Only run code in the background if nothing that runs before the next "wait" depends on its output.

    Programmers often have two or more alternative ways of writing a piece of code. BASH has a keyword, "time", that lets programmers test which version is faster. For illustration, programmers can copy the contents of a file in many ways:


    Code:
    more ./Linux_Resume.TXT > ./test
    less ./Linux_Resume.TXT > ./test
    cat ./Linux_Resume.TXT > ./test

    A programmer can use "time" and get these results:


    Code:
    User@Laptop:~$ time(more ./Linux_Resume.TXT > ./test)
     
    real    0m0.004s
    user    0m0.000s
    sys    0m0.000s
     
    User@Laptop:~$ time(less ./Linux_Resume.TXT > ./test)
     
    real    0m0.022s
    user    0m0.004s
    sys    0m0.004s
     
    User@Laptop:~$ time(cat ./Linux_Resume.TXT > ./test)
     
    real    0m0.004s
    user    0m0.000s
    sys    0m0.000s 

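    The results show "cat" and "more" tied for fastest at 0.004 seconds, with "less" last at 0.022 seconds. The "real" line shows the elapsed (wall-clock) time. The "user" line shows how much CPU time the code spent in user mode. The "sys" line displays the CPU time spent executing the code in kernel mode. Programmers should focus mainly on the real time, since that is what the user waits for. Keep in mind that a single run lasting a few milliseconds is easily distorted by other activity on the system, so repeat the measurement several times before drawing conclusions.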
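
    One way to get steadier numbers is to time many repetitions at once. The following is only a sketch: the sample file and the repetition count are arbitrary choices, and TIMEFORMAT is the BASH variable that controls what the "time" keyword prints.

```shell
#!/usr/bin/env bash
sample=$(mktemp)                          # hypothetical sample file
printf 'line %s\n' {1..1000} > "$sample"  # 1000 short lines of test data

runs=200                                  # arbitrary repetition count

# %R in TIMEFORMAT stands for the elapsed (real) time in seconds.
TIMEFORMAT='cat:  %R seconds in total'
time for ((i = 0; i < runs; i++)); do cat "$sample" > /dev/null; done

TIMEFORMAT='more: %R seconds in total'
time for ((i = 0; i < runs; i++)); do more "$sample" > /dev/null; done
```

    Dividing each total by the number of runs gives an average per run that is less sensitive to a single unlucky measurement.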

    Another helpful tip is to avoid writing to the hard drive unless absolutely necessary, as it is when a background command must pass a value back for a variable. Instead, when possible, save data to variables, which are stored in RAM. Reading and writing RAM is faster than reading and writing a file on the hard drive. The time test below illustrates this:


    Code:
    User@Laptop:~$ time(ls -al > ./test)
     
    real    0m0.020s
    user    0m0.004s
    sys    0m0.004s
     
    User@Laptop:~$ time(VARIABLE=$(ls -al))
     
    real    0m0.010s
    user    0m0.004s
    sys    0m0.004s 

    The ls command with the parameter -al prints only a small amount of output. Now imagine running these two sets of code with a command that generates a large output. Also keep in mind where the data is being written if the output must be sent to a file. For example, writing to a USB memory card will usually take more time than writing to a SATA hard drive, because hard drives (especially SATA drives) typically have faster read/write speeds than USB memory cards.
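
    Once data is held in a variable, it can be reused as often as needed without going back to a file. As a rough sketch (the file names are invented, printf stands in for a real listing command, and mapfile requires BASH 4 or newer), the output can be loaded into an array once and then examined entirely inside the shell:

```shell
#!/usr/bin/env bash
# Read the command's output into an array, one line per element.
# printf is a stand-in for a real listing command such as ls.
mapfile -t names < <(printf '%s\n' alpha.txt beta.log gamma.txt)

# Count the .txt entries using only shell built-ins: no temporary
# file is written and no extra program is started inside the loop.
txt_count=0
for name in "${names[@]}"; do
    if [[ $name == *.txt ]]; then
        txt_count=$((txt_count + 1))
    fi
done

echo "${#names[@]} entries, $txt_count of them .txt"
```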

    Any code that depends on data from a network will be slow if the network is slow. Also, if the requested data is large, the code will take some time to download it. If possible, avoid depending on data that must be fetched over the network.

    When manipulating text, try using AWK instead of SED. If a user must use SED, then consider the command "ssed" (SuperSED) where it is installed. SuperSED is used the same way as the sed command, but it was designed for swift execution. Also, if SED or SuperSED is used, write the SED commands as precisely as possible and only use options that are needed. For example, the -r parameter switches sed from basic to extended regular expression syntax. If the pattern does not need the extended syntax, leave -r out. In the timings below, the commands with an unnecessary -r came out slightly slower. For illustration:

    Code:
    User@Laptop:~$ time(ls -al | sed -e "s|a|Z|gI" > ./test)
     
    real    0m0.007s
    user    0m0.008s
    sys    0m0.000s
    User@Laptop:~$ time(ls -al | sed -e -r "s|a|Z|gI" > ./test)
    sed: -e expression #1, char 1: unknown command: `-'
     
    real    0m0.009s
    user    0m0.000s
    sys    0m0.004s
    User@Laptop:~$ time(ls -al | sed -r -e "s|a|Z|gI" > ./test)
     
    real    0m0.010s
    user    0m0.000s
    sys    0m0.008s
    User@Laptop:~$ time(ls -al | ssed -r -e "s|a|Z|gI" > ./test)
     
    real    0m0.008s
    user    0m0.000s
    sys    0m0.004s
    User@Laptop:~$ time(ls -al | ssed -e "s|a|Z|gI" > ./test)
     
    real    0m0.005s
    user    0m0.000s
    sys    0m0.000s 

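    Notice the real times for each command. In these runs, the commands with the unnecessary -r parameter were slightly slower, although differences of a few milliseconds in a single run are within normal measurement noise. Also notice that the second command fails with a syntax error and does no useful work. (This error occurs when "-e -r" is typed instead of "-r -e". The "-e" option means that the next argument is the sed script, that is, the find and replace command, so sed tries to execute "-r" as a script.)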
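
    For readers who want to try the AWK suggestion, here is one possible AWK counterpart to the substitution used in the timings. It is a sketch rather than a benchmark: the input string is made up, and the bracket expression [aA] replaces the "I" flag so that both commands behave the same on any sed.

```shell
#!/usr/bin/env bash
input='banana Apple'    # made-up sample text

# sed without -r: this simple pattern needs no extended syntax.
sed_result=$(printf '%s\n' "$input" | sed -e 's|[aA]|Z|g')

# A comparable awk program: gsub replaces every match on each line.
awk_result=$(printf '%s\n' "$input" | awk '{ gsub(/[aA]/, "Z"); print }')

echo "sed: $sed_result"
echo "awk: $awk_result"
```

    Both commands produce the same text, so the "time" test described earlier can be used to check which one is faster on a particular system and input.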

    One important tip is to keep the code as simple as possible. When writing any program, remember KISS - 'Keep It Simple, Stupid'. If code can be simplified or deleted, do so. Overly complicated code can use up more time and processing resources. Complex code can also be difficult for a programmer to understand later, which may lead the programmer to add more unneeded code that makes the script run even slower.

    With any programming language, it takes time to learn all of the tricks, and there are many more than this article covers, even in BASH alone. Take time to experiment with and tweak scripts.
