Command-line Text Manipulation

DevynCJohnson · Oct 4, 2015

The Linux command line has many uses, and one of the most common is manipulating text, including text stored in files. This article introduces several command-line tools for text manipulation, which are worth knowing for anyone who wants to work more effectively on Linux.

awk
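Awk is a small, C-like, Turing-complete language that is interpreted by the "awk" command. Awk is often faster than sed for field-oriented processing, although speed depends on the implementation and the task, and some users find awk harder to learn. Unlike grep, which only selects lines, awk splits each line into fields and can compare, compute on, and reformat them. With escape sequences such as "\x41" (supported by gawk), awk can also search for specific hexadecimal byte values. Awk also supports conditional statements and loops.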

Awk scripts are written like shell scripts, but they contain awk commands and begin with the awk hashpling (shebang) "#!/usr/bin/awk -f". Awk scripts may use the "*.awk" file extension.
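As a rough sketch of the conditional statements and loops mentioned above, the following snippet writes a small awk program to a temporary file and runs it with "awk -f". The input lines and the variable names (script, result, kind, out) are made up for illustration.

```shell
#!/bin/sh
# Hypothetical example: the script file and the input lines are illustrative only.
script=$(mktemp)

# The awk program: label each line by its field count, then reverse its fields.
cat > "$script" <<'EOF'
{
  if (NF > 2) {
    kind = "long"
  } else {
    kind = "short"
  }
  # Loop over the fields from last to first.
  out = ""
  for (i = NF; i >= 1; i--)
    out = out $i (i > 1 ? " " : "")
  print kind ": " out
}
EOF

result=$(printf 'a b c\nx y\n' | awk -f "$script")
echo "$result"
rm -f "$script"
```

Adding the hashpling as the first line of the awk file and marking it executable would let it run directly instead of through "awk -f".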

One-liners

Domain Expiration - whois dcjtech.info | awk '/Registry Expiry Date:/ {print $4}'
List Httpd 404 Errors - awk '$9 == 404 ' /var/log/httpd/access.log
List Usernames and UIDs - awk -F":" '{print $1 " " $3}' /etc/passwd
List Users - awk -F':' '{print $1}' /etc/passwd
List Users (Alphabetically) - awk -F':' '{print $1}' /etc/passwd | sort
Remove Duplicate Lines - awk '!x[$0]++' FILE.txt > NEW_FILE.txt
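To see what two of these one-liners do without touching real system files, here is a small sketch that runs them on an invented passwd-style sample. The file contents and variable names are assumptions for the example.

```shell
#!/bin/sh
# Sample data only: these passwd-style lines are invented for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000:Alice:/home/alice:/bin/bash
alice:x:1000:1000:Alice:/home/alice:/bin/bash
bob:x:1001:1001:Bob:/home/bob:/bin/sh
EOF

# -F':' sets the field separator, so $1 is the user name and $3 is the UID.
users=$(awk -F':' '{print $1 " " $3}' "$sample")

# x[$0]++ is 0 (false) the first time a line is seen, so !x[$0]++ is true
# only for first occurrences; awk prints those lines by default.
deduped=$(awk '!x[$0]++' "$sample" | awk -F':' '{print $1}')

echo "$users"
echo "$deduped"
rm -f "$sample"
```

Unlike "sort | uniq", the "!x[$0]++" idiom keeps the lines in their original order.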

Generate Random Numbers

Code:

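#!/usr/bin/awk -f
# Print ten random numbers between 0 and 1.
# Everything runs in BEGIN, so the script does not wait for input.
BEGIN {
  srand()  # seed the generator from the current time
  for (i = 1; i <= 10; i++)
    print rand()
}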

Various implementations of awk are available:

gawk - GNU awk implements the POSIX awk standard and adds its own extensions.
jawk - Java awk (http://sourceforge.net/projects/jawk/) is an awk implementation written in Java.
mawk - Mike Brennan's awk (http://invisible-island.net/mawk/mawk.html) is a small implementation that is often faster than gawk.
nawk - New awk is AT&T's updated version of awk, on which the POSIX awk standard is based.
oawk - Old awk is the original awk. The name "oawk" is used for compatibility.

sed
The Stream Editor (sed) (https://www.gnu.org/software/sed/) is a Unix utility that manipulates text according to commands written in the "sed" language. Both the language and the command that interprets it are called "sed". The language is simple yet Turing-complete, and many users say it is easier to learn than awk, although awk is often faster for complex processing. Sed programs can be stored in sed scripts, which use the "#!/bin/sed -f" hashpling and may use the "*.sed" file extension. Sed is most commonly used for finding and replacing text.

One-liners

Count Lines in File - sed -n '$=' FILE.txt
Double-space a File - sed G FILE.txt > NEW_FILE.txt
Find and Replace - sed 's/FIND/REPLACE/g' FILE.txt > NEW_FILE.txt
Find and Replace (Case-insensitive, GNU sed) - sed 's/FIND/REPLACE/gI' FILE.txt > NEW_FILE.txt
Remove Trailing Whitespace (Each Line of File) - sed 's/[[:space:]]*$//' FILE.txt > NEW_FILE.txt
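The one-liners above can be tried safely on a throwaway file. This is a minimal sketch with invented sample text; the variable names are only for the example.

```shell
#!/bin/sh
# Sample text only; the file is a throwaway temporary file.
sample=$(mktemp)
printf 'one fish   \nTwo Fish\nred fish\t\n' > "$sample"

# s/FIND/REPLACE/g replaces every match on each line (case-sensitive,
# so "Fish" on the second line is left alone).
replaced=$(sed 's/fish/cat/g' "$sample")

# [[:space:]]*$ matches any run of whitespace at the end of a line.
trimmed=$(sed 's/[[:space:]]*$//' "$sample")

# With -n, $= prints only the number of the last line, i.e. the line count.
count=$(sed -n '$=' "$sample")

echo "$replaced"
echo "$trimmed"
echo "$count"
rm -f "$sample"
```

Because each command writes to standard output, the sample file itself is never modified.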

ssed
Super-sed (http://sed.sourceforge.net/grabbag/ssed/) is an enhanced version of sed that is generally faster than the original sed.

Perl
Perl (https://www.perl.org/) is a scripting language that is commonly used for advanced text manipulations (among other uses). Perl can also be used as an alternative to PHP on dynamic servers. Perl can be used in the command-line or in Perl scripts, which contain the "#!/usr/bin/perl" hashpling and may use the "*.pl" file-extension. Perl is a Turing-complete computer language.

Perl can be used as a substitute for the "sed" command. For example, "sed 's/FIND/REPLACE/g'" is equivalent to "perl -pe 's/FIND/REPLACE/g'". Perl does not run sed scripts, but its "s///" substitution operator uses the same basic syntax, with Perl's own regular-expression dialect. Perl can also stand in for other text manipulation tools such as awk, cut, and uniq.

grep
Grep (http://www.gnu.org/software/grep/) is a Unix utility used to search plain text. Grep supports regular expressions (regex), which are patterns that describe sets of strings, much like more powerful "wildcards".

Example Commands

Case-insensitive Search - grep -i -e "FIND" FILE.txt
Count Matching Lines - grep -c -e "FIND" FILE.txt
Display Line Number with Output - grep -n -e "FIND" FILE.txt
Invert Match - grep -v -e "FIND" FILE.txt
Search Files in Directory Recursively - grep -r -e "FIND" /DIRECTORY/
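Here is a small sketch of the options above applied to an invented four-line log file, with the expected matches noted in the comments. The file contents and variable names are assumptions for the example.

```shell
#!/bin/sh
# Sample log lines invented for illustration.
sample=$(mktemp)
printf 'Error: disk full\nok\nerror: retry\nok\n' > "$sample"

matches=$(grep -i -e "error" "$sample")      # both the "Error" and "error" lines
count=$(grep -c -e "ok" "$sample")           # number of matching lines
numbered=$(grep -n -e "retry" "$sample")     # match prefixed with its line number
inverted=$(grep -v -i -e "error" "$sample")  # lines that do NOT match

echo "$matches"
echo "$count"
echo "$numbered"
echo "$inverted"
rm -f "$sample"
```

Note that "-c" counts lines containing a match, so a line with two matches still counts once.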

Grep Variants

agrep - Approximate grep is a proprietary utility that supports many search algorithms, especially "fuzzy string searching".
egrep - Extended grep uses extended regular expressions and is equivalent to "grep -E".
fgrep - Fixed-string grep searches for literal strings instead of regex (equivalent to "grep -F") and uses the Aho–Corasick string matching algorithm.
pgrep - Process grep searches process names for a given pattern and then returns the matching process IDs (PIDs).
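To illustrate the difference between extended and fixed-string matching, this sketch uses the "-E" and "-F" options of grep, which select the same behavior as egrep and fgrep. The sample lines are invented.

```shell
#!/bin/sh
# Sample lines invented for illustration.
sample=$(mktemp)
printf 'cat\ndog\na.c\nabc\n' > "$sample"

either=$(grep -E -e 'cat|dog' "$sample")  # extended regex: "|" means "or"
literal=$(grep -F -e 'a.c' "$sample")     # fixed string: the dot is a literal dot
regex=$(grep -e 'a.c' "$sample")          # basic regex: the dot matches any character

echo "$either"
echo "$literal"
echo "$regex"
rm -f "$sample"
```

Fixed-string matching is handy when the search text contains characters such as "." or "*" that would otherwise need escaping.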

cut
The "cut" command (http://linux.die.net/man/1/cut) extracts selected bytes, characters, or fields from each line of a file. Various parameters specify which part or parts of each line are displayed. The "cut" command writes its results to standard output, thus leaving the original file unchanged.

Display First Five Characters of Each Line - cut -c1-5 FILE.txt
Display Third Character of Each Line - cut -c3 FILE.txt
List User Homes (Alphabetically) - cut -d':' -f1,6 /etc/passwd | sort
List Users - cut -d':' -f1 /etc/passwd
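The character and field selections above can be sketched on an invented passwd-style sample, so nothing depends on the real /etc/passwd. The file contents and variable names are assumptions for the example.

```shell
#!/bin/sh
# Sample data only: these passwd-style lines are invented for illustration.
sample=$(mktemp)
printf 'alice:x:1000:1000:Alice:/home/alice:/bin/bash\nbob:x:1001:1001:Bob:/home/bob:/bin/sh\n' > "$sample"

# -c1-3 keeps characters 1 through 3 of every line.
first_three=$(cut -c1-3 "$sample")

# -d':' sets the delimiter and -f1,6 keeps fields 1 and 6 (name and home).
homes=$(cut -d':' -f1,6 "$sample")

echo "$first_three"
echo "$homes"
rm -f "$sample"
```

When several fields are selected, cut joins them with the same delimiter it used to split the line.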

sort
The "sort" command (http://linux.die.net/man/1/sort) is used to sort the lines of a text file. By default, "sort" sorts alphabetically. However, the "-n" parameter can be used to sort numerically. The "sort" command outputs the sorted results to standard output, thus leaving the original file unchanged. Using the "-t" parameter, the field delimiter can be specified such as the "pipe" character (-t'|').

FUN FACT: "sort -u FILE.txt" achieves the same results as "sort FILE.txt | uniq".

Sort by the Third Column - sort -k3,3 FILE.txt
Sort by the Third Column (Reversed) - sort -k3,3 -r FILE.txt
Sort by the Third Column (Save Results) - sort -k3,3 FILE.txt > NEW_FILE.txt
Sort Files by Size - ls -al | sort -r -n -k5
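The following sketch shows the "-n", "-t", and "-u" behavior described above on an invented pipe-delimited file. The data and variable names are assumptions for the example.

```shell
#!/bin/sh
# Pipe-delimited sample data invented for illustration.
sample=$(mktemp)
printf 'pear|10|c\napple|9|a\npear|10|c\nfig|100|b\n' > "$sample"

# Sort on the second field only (-k2,2), using "|" as the delimiter.
lexical=$(sort -t'|' -k2,2 "$sample")     # text order: 10, 10, 100, 9
numeric=$(sort -t'|' -k2,2 -n "$sample")  # numeric order: 9, 10, 10, 100

# sort -u and sort | uniq both drop the repeated "pear" line.
unique_a=$(sort -u "$sample")
unique_b=$(sort "$sample" | uniq)

echo "$lexical"
echo "$numeric"
echo "$unique_a"
rm -f "$sample"
```

Writing the key as "-k2,2" limits the comparison to the second field; "-k2" alone would compare from the second field to the end of the line.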

uniq
The "uniq" command (http://www.computerhope.com/unix/uuniq.htm) removes adjacent duplicate lines. The duplicate lines must therefore be next to each other for "uniq" to find and remove them, which is why the input is typically sorted first with the "sort" command.

Count Occurrences of Each Line - uniq -c FILE.txt
Display Unique Lines - uniq -u FILE.txt
List Duplicate Lines - uniq -d FILE.txt
Remove Duplicate Lines - uniq FILE.txt
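To show why the input usually needs sorting first, this sketch runs uniq on an invented unsorted file and then on the sorted version. The data and variable names are assumptions for the example.

```shell
#!/bin/sh
# Unsorted sample data invented for illustration.
sample=$(mktemp)
printf 'b\na\nb\nc\na\nb\n' > "$sample"

# No duplicates are adjacent here, so uniq alone removes nothing.
unsorted=$(uniq "$sample")

sorted=$(sort "$sample" | uniq)      # a, b, c
dupes=$(sort "$sample" | uniq -d)    # lines that occur more than once: a, b
singles=$(sort "$sample" | uniq -u)  # lines that occur exactly once: c

# uniq -c pads its counts with spaces, so awk is used to tidy the output.
counts=$(sort "$sample" | uniq -c | awk '{print $2 "=" $1}')

echo "$sorted"
echo "$counts"
rm -f "$sample"
```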

replace
"replace" (http://www.computerhope.com/unix/replace.htm) is a Unix utility that finds and replaces text. The general syntax is "replace FIND REPLACE -- LIST_OF_FILE_PATHS". For illustration, to replace "NIX" with "Unix" in a text file, type 'replace "NIX" "Unix" -- FILE.txt'.

Further Reading

Gawk Man Page - http://linux.about.com/library/cmd/blcmdl1_awk.htm
Grep Command - http://www.linux.org/threads/linux-linux-shell-19-–-grep-command.7032/
Mawk Man Page - http://linux.die.net/man/1/mawk
Sed Manual - https://www.gnu.org/software/sed/manual/sed.html
Sed Scripts - http://sed.sourceforge.net/grabbag/scripts/
An Introduction and Tutorial for Sed - http://www.grymoire.com/Unix/Sed.html
Scripting and Command-line Reading Guide - http://www.linux.org/threads/scripting-command-line-reading-guide.6029/
Using Perl - http://www.linux.org/threads/using-perl.4111/
Text Processing and Manipulation - http://www.linux.org/threads/text-processing-and-manipulation.4107/
