Command-line Text Manipulation

DevynCJohnson

The Linux command-line has numerous uses and abilities. Both the command-line and shell scripts can manipulate text, including text stored in files. An introduction to the common command-line tools for text manipulation is useful to anyone who wants a better experience with the Linux operating system.

awk
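Awk is a small, Turing-complete language with C-like syntax that is interpreted by the "awk" command. It reads input one record at a time (by default, one line), splits each record into fields, and supports conditional statements and loops. Awk is often faster than sed for field-oriented work, although the difference depends on the implementation and the task, and some users find awk harder to learn. Many awk implementations also accept hexadecimal escape sequences (such as "\x41") in patterns, which basic grep patterns do not.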

Awk scripts are written like shell scripts, but they contain awk code and begin with the awk hashpling (also called a shebang) "#!/usr/bin/awk -f". Awk scripts may use the "*.awk" file extension.

One-liners
  • Domain Expiration - whois dcjtech.info | awk '/Registry Expiry Date:/ {print $4}'
  • List Httpd 404 Errors - awk '$9 == 404 ' /var/log/httpd/access.log
  • List Usernames and UIDs - awk -F":" '{print $1 " " $3}' /etc/passwd
  • List Users - awk -F':' '{print $1}' /etc/passwd
  • List Users (Alphabetically) - awk -F':' '{print $1}' /etc/passwd | sort
  • Remove Duplicate Lines - awk '!x[$0]++' FILE.txt > NEW_FILE.txt
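
The field-splitting and duplicate-removal one-liners above can be tried safely on a throwaway file. The following is a minimal sketch; the sample data and the variable names are made up for illustration and are not taken from a real /etc/passwd.

```shell
#!/bin/sh
# Build a small passwd-style sample file (hypothetical data).
sample=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'alice:x:1000:1000:Alice:/home/alice:/bin/bash' \
  'bob:x:1001:1001:Bob:/home/bob:/bin/sh' \
  'alice:x:1000:1000:Alice:/home/alice:/bin/bash' > "$sample"

# -F':' sets the field separator; $1 is the username and $3 is the UID.
users_and_uids=$(awk -F':' '{print $1 " " $3}' "$sample")

# A condition in front of the action limits it to matching lines;
# here, only lines whose UID is 1000 or more.
regular_users=$(awk -F':' '$3 >= 1000 {print $1}' "$sample")

# x[$0]++ is 0 (false) the first time a line is seen, so !x[$0]++ keeps
# only the first occurrence of each line and preserves the original order.
deduplicated=$(awk '!x[$0]++' "$sample")

printf '%s\n' "$users_and_uids" '---' "$regular_users" '---' "$deduplicated"
rm -f "$sample"
```

Unlike "sort -u", the awk approach does not need sorted input, at the cost of holding every distinct line in memory.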

Generate Random Numbers
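Code:
#!/usr/bin/awk -f
# Print ten pseudo-random numbers between 0 and 1.
# Everything runs in BEGIN, so the script does not wait for input.
BEGIN {
  srand()  # seed the generator from the current time
  for (i = 1; i <= 10; i++)
    print rand()
}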

Various implementations of awk are available:
  • gawk - GNU awk is based on the POSIX awk standard and has additional features.
  • jawk - Java awk (http://sourceforge.net/projects/jawk/) is an awk implementation written in Java.
  • mawk - Mike Brennan's awk (http://invisible-island.net/mawk/mawk.html) is a small implementation that is often faster than gawk.
  • nawk - New awk is the updated AT&T version of awk, on which the POSIX awk standard is based.
  • oawk - Old awk is the original awk. The name "oawk" is used for compatibility.

sed
The Stream Editor (sed) (https://www.gnu.org/software/sed/) is a Unix utility that manipulates text based on special commands that are written in the "sed" language. Both the language and the command itself are called "sed". The language is simple and Turing-complete, and many users say it is easier to learn than awk. Which of the two is faster depends on the implementation and the task, although awk is often quicker for field-oriented work. The sed language can be used in sed scripts, which use the "#!/bin/sed -f" hashpling and may use the "*.sed" file extension. "sed" is most commonly used for finding and replacing text.

One-liners
  • Count Lines in File - sed -n '$=' FILE.txt
  • Double-space a File - sed G FILE.txt > NEW_FILE.txt
  • Find and Replace - sed 's/FIND/REPLACE/g' FILE.txt > NEW_FILE.txt
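  • Find and Replace (Case-insensitive) - sed 's/FIND/REPLACE/gI' FILE.txt > NEW_FILE.txt (the "I" flag is a GNU sed extension; the "-i" option does not ignore case, it edits the file in place)
  • Removing Trailing Whitespace (Each Line of File) - sed 's/[ \t]*$//' FILE.txt > NEW_FILE.txt (GNU sed understands "\t"; other versions may need a literal tab)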
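
To see what a few of these one-liners do, they can be run against a small temporary file. This is only a sketch: the file contents and variable names are invented for the example, and the tab is passed in literally so that the whitespace example does not rely on GNU-specific "\t" handling.

```shell
#!/bin/sh
# Hypothetical sample file; the first two lines carry trailing whitespace.
sample=$(mktemp)
printf 'Find me  \nfind me too\t\nnothing here\n' > "$sample"

# -n suppresses normal output; '$=' prints the line number of the last line.
line_count=$(sed -n '$=' "$sample")

# s/FIND/REPLACE/g replaces every match on each line (case-sensitive),
# so only the capitalised "Find" on the first line is changed.
replaced=$(sed 's/Find/FOUND/g' "$sample")

# Strip trailing spaces and tabs from every line.
tab=$(printf '\t')
trimmed=$(sed "s/[ $tab]*\$//" "$sample")

printf 'lines: %s\n' "$line_count"
printf '%s\n' "$trimmed"
rm -f "$sample"
```

Because each command writes to standard output, the sample file itself is never modified; only "sed -i" changes a file in place.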

ssed
Super-sed (http://sed.sourceforge.net/grabbag/ssed/) is an enhanced version of sed that is generally faster than the original sed.

Perl
Perl (https://www.perl.org/) is a scripting language that is commonly used for advanced text manipulations (among other uses). Perl can also be used as an alternative to PHP on dynamic servers. Perl can be used in the command-line or in Perl scripts, which contain the "#!/usr/bin/perl" hashpling and may use the "*.pl" file-extension. Perl is a Turing-complete computer language.

Perl can be used as a substitute for the "sed" command. For example, "sed 's/FIND/REPLACE/g'" = "perl -pe 's/FIND/REPLACE/g'". Perl does not run sed scripts, but its "s///" substitution operator closely resembles the sed command of the same name, although Perl uses its own regular expression syntax. Perl can also stand in for other text manipulation tools such as awk, cut, and uniq.
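
The equivalence mentioned above can be checked by feeding the same text to both commands. This is a small sketch with made-up input; it assumes "perl" may be missing on a minimal system and skips the comparison in that case.

```shell
#!/bin/sh
# Compare sed and Perl on the same input (the sample text is hypothetical).
input='NIX is fun. I like NIX.'

sed_result=$(printf '%s\n' "$input" | sed 's/NIX/Unix/g')

# -p loops over the input lines and prints each one after running the -e code.
if command -v perl >/dev/null 2>&1; then
  perl_result=$(printf '%s\n' "$input" | perl -pe 's/NIX/Unix/g')
else
  perl_result=$sed_result   # perl is not installed, so there is nothing to compare
fi

printf 'sed:  %s\nperl: %s\n' "$sed_result" "$perl_result"
```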

grep
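Grep (http://www.gnu.org/software/grep/) is a Unix utility used to search plain text for lines that match a pattern. Grep supports regular expressions (regex), which are patterns that describe sets of strings and are more expressive than shell "wildcards".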

Example Commands
  • Case-insensitive Search - grep -i -e "FIND" FILE.txt
  • Count Matching Lines - grep -c -e "FIND" FILE.txt
  • Display Line Number with Output - grep -n -e "FIND" FILE.txt
  • Invert Match - grep -v -e "FIND" FILE.txt
  • Search Files in Directory Recursively - grep -r -e "FIND" /DIRECTORY/
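
The options above can be combined. The sketch below runs them on an invented log-like file (the contents and variable names are assumptions for illustration only) and adds "-E", which enables extended regular expressions.

```shell
#!/bin/sh
# Hypothetical sample file with mixed-case messages.
sample=$(mktemp)
printf 'Error: disk full\nwarning: low memory\nerror: timeout\nall good\n' > "$sample"

# -c counts matching lines (not individual matches); -i ignores case.
error_lines=$(grep -c -i -e 'error' "$sample")

# -n prefixes each matching line with its line number.
numbered=$(grep -n -e 'warning' "$sample")

# -v inverts the match and prints the lines that do NOT contain the pattern.
without_errors=$(grep -v -i -e 'error' "$sample")

# -E enables extended regular expressions, such as alternation with "|".
errors_or_warnings=$(grep -E -i -c -e 'error|warning' "$sample")

printf 'error lines: %s\nnumbered: %s\n' "$error_lines" "$numbered"
rm -f "$sample"
```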

Grep Variants
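  • agrep - Approximate grep supports several search algorithms and specializes in "fuzzy string searching", which matches text that is close to the pattern but not identical.
  • egrep - Extended grep has additional regular expression features. It is equivalent to "grep -E", which current GNU grep releases recommend instead.
  • fgrep - Fixed grep matches fixed strings rather than regex and uses the Aho–Corasick string matching algorithm. It is equivalent to "grep -F", which current GNU grep releases recommend instead.
  • pgrep - Process grep searches process names for a given pattern and then returns the matching process IDs (PIDs).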

cut
The "cut" command (http://linux.die.net/man/1/cut) can extract bytes, characters, and fields from each line of a file. Various parameters specify which part or parts of each line are displayed. The "cut" command writes its results to standard output, thus leaving the original file unchanged.
  • Display First Five Characters of Each Line - cut -c1-5 FILE.txt
  • Display Third Character of Each Line - cut -c3 FILE.txt
  • List Users and Home Directories (Alphabetically) - cut -d':' -f1,6 /etc/passwd | sort
  • List Users - cut -d':' -f1 /etc/passwd
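
As a quick illustration of character and field selection, the following sketch runs "cut" on two invented passwd-style lines. The data and variable names are hypothetical.

```shell
#!/bin/sh
# Hypothetical passwd-style sample data.
sample=$(mktemp)
printf '%s\n' \
  'alice:x:1000:1000:Alice:/home/alice:/bin/bash' \
  'bob:x:1001:1001:Bob:/home/bob:/bin/sh' > "$sample"

# -c1-5 keeps characters 1 through 5 of every line, whatever they contain.
first_five=$(cut -c1-5 "$sample")

# -d':' sets the delimiter and -f1,6 keeps fields 1 and 6; the selected
# fields are joined with the same delimiter in the output.
names_and_homes=$(cut -d':' -f1,6 "$sample")

printf '%s\n' "$first_five" '---' "$names_and_homes"
rm -f "$sample"
```

Character positions ignore field boundaries, which is why the second line yields "bob:x" rather than just the username.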

sort
The "sort" command (http://linux.die.net/man/1/sort) is used to sort the lines of a text file. By default, "sort" sorts alphabetically. However, the "-n" parameter can be used to sort numerically. The "sort" command outputs the sorted results to standard output, thus leaving the original file unchanged. Using the "-t" parameter, the field delimiter can be specified such as the "pipe" character (-t'|').

FUN FACT: "sort -u FILE.txt" achieves the same results as "sort FILE.txt | uniq".
  • Sort by the Third Column - sort -k3,3 FILE.txt (plain "-k3" uses everything from the third column to the end of the line as the sort key)
  • Sort by the Third Column (Reversed) - sort -k3,3 -r FILE.txt
  • Sort by the Third Column (Save Results) - sort -k3,3 FILE.txt > NEW_FILE.txt
  • Sort Files by Size - ls -al | sort -r -n -k5
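
The difference between alphabetical and numeric sorting, and the "sort -u" fun fact, can be seen on a small pipe-delimited file. This is a sketch only; the sample rows and variable names are assumptions made up for the example.

```shell
#!/bin/sh
# Hypothetical pipe-delimited sample: name|id|quantity
sample=$(mktemp)
printf 'pear|3|10\napple|1|9\nfig|2|100\napple|1|9\n' > "$sample"

# -t'|' sets the delimiter; -k3,3 limits the sort key to the third field.
# Without -n the key is compared as text, so "100" sorts before "9".
text_order=$(sort -t'|' -k3,3 "$sample" | cut -d'|' -f3 | paste -s -d' ' -)

# Adding "n" to the key compares the third field numerically.
numeric_order=$(sort -t'|' -k3,3n "$sample" | cut -d'|' -f3 | paste -s -d' ' -)

# "sort -u" and "sort | uniq" both drop the repeated "apple" line.
unique_a=$(sort -u "$sample")
unique_b=$(sort "$sample" | uniq)

printf 'text:    %s\nnumeric: %s\n' "$text_order" "$numeric_order"
rm -f "$sample"
```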

uniq
The "uniq" command (http://www.computerhope.com/unix/uuniq.htm) removes duplicate lines that are adjacent to each other. Identical lines that are separated by other lines are not detected, which is why the input is typically sorted first with the "sort" command.

  • Count Occurrences of Each Line - uniq -c FILE.txt
  • Display Lines That Are Not Repeated - uniq -u FILE.txt
  • List Duplicate Lines - uniq -d FILE.txt
  • Remove Duplicate Lines - uniq FILE.txt
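
The need for sorted input is easiest to see by running "uniq" on the same data with and without "sort". The sketch below uses an invented file of single letters; the data and variable names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample: "a" and "b" repeat, but not always on adjacent lines.
sample=$(mktemp)
printf 'b\na\nb\na\na\nc\n' > "$sample"

# Unsorted input: only the two adjacent "a" lines are collapsed.
unsorted_result=$(uniq "$sample" | paste -s -d' ' -)

# Sorted input: every duplicate is adjacent, so each line appears once.
sorted_result=$(sort "$sample" | uniq | paste -s -d' ' -)

# -c prefixes each line with its count; awk reorders it as line=count.
counts=$(sort "$sample" | uniq -c | awk '{print $2 "=" $1}' | paste -s -d' ' -)

# -d lists lines that occur more than once; -u lists lines that occur once.
repeated=$(sort "$sample" | uniq -d | paste -s -d' ' -)
not_repeated=$(sort "$sample" | uniq -u)

printf 'unsorted: %s\nsorted:   %s\ncounts:   %s\n' \
  "$unsorted_result" "$sorted_result" "$counts"
rm -f "$sample"
```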

replace
"replace" (http://www.computerhope.com/unix/replace.htm) is a utility that finds and replaces text. It has historically been distributed with MySQL rather than as a standard Unix command, so it may not be installed on every system. The general syntax is "replace FIND REPLACE -- LIST_OF_FILE_PATHS". For illustration, to replace "NIX" with "Unix" in a text file, type 'replace "NIX" "Unix" -- FILE.txt'.
