LFCS Regular Expression (Part 2)

Jarret B

When working with files and folders on a system, or searching through the contents of a file, you will need to understand ‘Regular Expressions’. A ‘Regular Expression’ is a sequence of characters that defines a pattern. The pattern can be used to search through a text file or through a listing of files and folders.

Sometimes, instead of ‘Regular Expression’, you may see ‘regex’ or ‘regexp’.

The parts of a ‘Regular Expression’ can be divided up in many ways, but to keep things simple I will separate them into four groups.

The parts of the ‘Regular Expression’ are:
  1. Characters and Groups
  2. Anchors
  3. Class/Range
  4. Quantifier
NOTE: Part 1 of this article covered Characters and Groups, and Anchors. This article will cover Class/Range and Quantifiers. If you have not read Part 1 of this article then please do so.

Class/Range

A Class is a set of characters, any one of which can match at that position. Classes are placed in brackets ([]).

A Range is a shorthand for a run of consecutive characters placed within a Class.

For example, the search for any uppercase letter from ‘A’ to ‘L’ followed by an ‘e’ would be:

‘[A-L]e’

NOTE: If you copy and paste the commands into a Terminal, the single quote marks need to be replaced with standard quote marks.

The Range can consist of letters ([a-z] or [A-Z]) and numbers ([0-9]).

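To negate a Class/Range, use the ‘^’ symbol as the first character inside the brackets. For example, to find a ‘$’ followed by a character that is not a number, the pattern would be as follows (the backslash keeps the ‘$’ from being read as the end-of-line Anchor):

‘\$[^0-9]’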
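
As a quick check of the two patterns above, here is a minimal sketch that feeds a few lines to ‘grep’. The sample lines and variable names are made up for illustration, the ‘$’ is escaped as a precaution so it is read literally, and LC_ALL=C is set on the assumption that Ranges should follow plain ASCII order.

```shell
# Made-up sample lines; they are not taken from any real file.
sample=$(printf '%s\n' 'He said hello' 'Me too' 'the end' 'cost: $5' 'cost: $x')

# Uppercase A through L followed by 'e': "He" qualifies, "Me" and "the" do not.
range_hits=$(printf '%s\n' "$sample" | LC_ALL=C grep '[A-L]e')

# A literal '$' followed by one character that is not a digit.
negated_hits=$(printf '%s\n' "$sample" | LC_ALL=C grep '\$[^0-9]')

printf '%s\n' "$range_hits" "$negated_hits"
# prints:
# He said hello
# cost: $x
```

Note that the negated Class still has to match a character, so a ‘$’ at the very end of a line would not be found by this pattern.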

Class/Range Examples

Two small examples were given, but we can go into more detail.

If we wanted to search for the letter ‘e’ preceded by any one of the letters ‘r’, ‘s’, ‘t’ or ‘R’, ‘S’, ‘T’, then the command would be:

grep ‘[r-tR-T]e’ grephelp.txt

The results of the search can be seen in Figure 1. Most of the matches are a lower-case ‘r’, ‘s’ or ‘t’ followed by ‘e’. In the third line from the end you can see that the search also found the word ‘Report’, which has a capital ‘R’. The Class we specified included both the lower-case Range ‘r-t’ and the upper-case Range ‘R-T’.


Figure 01.jpg

FIGURE 1

Numbers can also be found by using the Class and Range of ‘[0-9]’. We can search for all numbers in the ‘grephelp.txt’ by using the command:

grep ‘[0-9]’ grephelp.txt

The results of the command are shown in Figure 2. You can see that four lines were found which contain numbers. Of the ten possible digits, only ‘0’, ‘1’ and ‘2’ actually appear.

Figure 02.jpg

FIGURE 2

If the search were limited to the numbers ‘0’ and ‘1’ (for example, with the Class ‘[01]’), then only three lines would be found, as shown in Figure 3.

Figure 03.jpg

FIGURE 3
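
The difference between a full digit Range and a narrower Class can be tried out without the help file. This is only a sketch: the sample lines are invented, and ‘[01]’ is my assumption for how the narrower search might be written, since the article does not show that command.

```shell
# Made-up sample lines; not the real grephelp.txt.
sample=$(printf '%s\n' 'exit status 0' 'returns 1 on error' 'status 2 means trouble' 'no digits here')

# -c prints the number of matching lines instead of the lines themselves.
any_digit=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[0-9]')
zero_or_one=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[01]')

echo "any digit: $any_digit, only 0 or 1: $zero_or_one"
# prints: any digit: 3, only 0 or 1: 2
```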

If you wish to negate the Range, place the ‘^’ symbol before the Range but inside the brackets. For example, to find lines where the word ‘byte’ comes after something other than a number, the command would be:

grep ‘[^0-9] byte’ grephelp.txt

The result of the search is shown in Figure 4 where only one line was found.

Figure 04.jpg

FIGURE 4

To see what would be found if the Range were not negated, then try the command:

grep ‘[0-9] byte’ grephelp.txt

Without negating the Range there are two lines found as shown in Figure 5.

Figure 05.jpg

FIGURE 5
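
To see the negated and plain Range side by side, the following sketch runs both patterns over a handful of invented lines (the sample text and variable names are assumptions for illustration only).

```shell
# Made-up sample lines; not the real grephelp.txt.
sample=$(printf '%s\n' 'offset 0 byte' 'print the byte offset' 'count 8 byte blocks' 'byte at line start')

# Non-digit, then a space, then "byte".
after_non_digit=$(printf '%s\n' "$sample" | LC_ALL=C grep '[^0-9] byte')

# Digit, then a space, then "byte" (count of matching lines).
after_digit=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[0-9] byte')

echo "$after_non_digit"   # prints: print the byte offset
echo "$after_digit"       # prints: 2
```

The last sample line is matched by neither pattern. A negated Class still needs a character to match, and there is nothing in front of ‘byte’ at the start of a line.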

Using Classes and Ranges you can easily make a search case-insensitive for a single character. To search for the parameters ‘-r’ and ‘-R’ within the ‘grephelp.txt’ file, the command is:

grep ‘\-[rR],’ grephelp.txt

Notice the use of the backslash (\). The backslash ‘escapes’ the character that follows it. A pattern that starts with a dash (-) would be read by ‘grep’ as a parameter, but we need the dash to be treated as a literal character in the search. The backslash stops ‘grep’ from treating the dash as the start of a parameter. The results of the search are shown in Figure 6.

Figure 06.jpg

FIGURE 6
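
Besides the backslash, there are two other common ways to hand ‘grep’ a pattern that begins with a dash: the ‘-e’ parameter, which says "the next argument is the pattern", and ‘--’, which marks the end of the parameters. The sketch below uses invented sample lines to show that both give the same count. As far as I know, some newer GNU ‘grep’ releases print a warning about a stray backslash for ‘\-’, so these forms may be worth knowing.

```shell
# Made-up lines in the style of a help listing.
sample=$(printf '%s\n' '  -r, --recursive' '  -R, --dereference-recursive' '  -i, --ignore-case')

# -e introduces the pattern explicitly, so the leading dash is not a problem.
with_e=$(printf '%s\n' "$sample" | grep -c -e '-[rR],')

# -- ends option parsing; everything after it is pattern or file name.
with_ddash=$(printf '%s\n' "$sample" | grep -c -- '-[rR],')

echo "$with_e $with_ddash"   # prints: 2 2
```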

There are shorthand names that can replace certain Classes and Ranges. Here are the special character sets:
  • [[:alnum:]] - same as [a-zA-Z0-9]
  • [[:cntrl:]] - control characters
  • [[:lower:]] - same as [a-z]
  • [[:upper:]] - same as [A-Z]
  • [[:space:]] - whitespace (spaces, tabs, newlines and similar)
  • [[:blank:]] - spaces and tabs only
  • [[:alpha:]] - same as [a-zA-Z]
  • [[:digit:]] - same as [0-9]
  • [[:print:]] - any printable character, including the space
  • [[:punct:]] - punctuation characters
As an example, we can list all of the parameters for ‘grep’. Looking at the text file, you will notice that the parameters are given in the form ‘-[a-zA-Z],’, each followed by a comma. There are cases where a parameter can include a number, so a letters-only Class would miss them. The command to do this search using a special character set would be:

grep ‘\-[[:alnum:]],’ grephelp.txt


The results are shown in Figure 7.

Figure 07.jpg

FIGURE 7
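
The special character sets can be tested the same way as any other Class. This sketch uses invented lines (including a made-up ‘-9’ parameter) and passes the pattern with ‘-e’ rather than a backslash, which is simply my choice here.

```shell
# Made-up lines in the style of a help listing; '-9' is a hypothetical option.
sample=$(printf '%s\n' '  -A, --after-context=NUM' '  -9, --made-up-option' '      --color' 'plain text')

# A dash, one letter or digit, then a comma.
short_opts=$(printf '%s\n' "$sample" | LC_ALL=C grep -c -e '-[[:alnum:]],')

# Lines that contain at least one digit anywhere.
digit_lines=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[[:digit:]]')

echo "$short_opts $digit_lines"   # prints: 2 1
```

The outer brackets form the Class and the inner ‘[:name:]’ is the set, so both pairs of brackets are needed.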

Quantifier

A Quantifier is a special character that specifies how many times the character, Class or group in front of it must occur.

There are four basic Quantifiers with the last one having three variations in its usage.

The Quantifiers are as follows:

  1. * - 0 or more occurrences
  2. + - 1 or more occurrences
  3. ? - 0 or 1 occurrence
  4. {#} - specific number of occurrences
  5. {#,} - a specific number or more of occurrences
  6. {#,#} - a range of occurrences from minimum to maximum

Quantifier Examples

If we want to search for the strings ‘nl’, ‘nul’, ‘nuul’, etc. the command would be:

grep ‘nu*l’ grephelp.txt

The asterisk (*) specifies that the preceding character, here the ‘u’, can be found 0 or more times. If you add more characters to the search string, such as ‘nu*lel’, they are matched literally; the asterisk still applies only to the ‘u’ directly in front of it.

Remember that the asterisk matches even zero occurrences, so ‘nl’ is found as well.
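
To see exactly which strings an asterisk pattern accepts, the sketch below puts one invented word on each line and uses the ‘-x’ parameter, which only reports lines that match from start to end. The word list, the variable names and the use of ‘xargs’ to join the results onto one line are my own choices for illustration.

```shell
# One made-up test word per line.
words=$(printf '%s\n' nl nul nuul null nulel)

# 'u' may appear zero or more times between 'n' and 'l'.
star_hits=$(printf '%s\n' "$words" | grep -x 'nu*l' | xargs)

# Characters after the asterisk are matched literally.
literal_tail=$(printf '%s\n' "$words" | grep -x 'nu*lel' | xargs)

echo "nu*l   -> $star_hits"      # prints: nu*l   -> nl nul nuul
echo "nu*lel -> $literal_tail"   # prints: nu*lel -> nulel
```

Without ‘-x’, the line ‘null’ would also be reported, because it contains ‘nul’.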

If you need to search for a character that must appear at least once, then use the plus (+) sign. If you want to search for ‘nul’ and make sure the letter ‘l’ is definitely there then the command is:

grep ‘nul\+’ grephelp.txt

NOTE: When ‘grep’ is run without the ‘-E’ parameter, the ‘+’ must be escaped with the backslash.

The command will find all instances of ‘nul’, ‘null’, etc.

If you needed to find all instances of a ‘u’ and ‘l’ after the ‘n’ then the command would be:

grep ‘nu\+l\+’ grephelp.txt

The command searches for an ‘n’ followed by one or more ‘u’ and then one or more ‘l’. This can find strings like ‘nul’, ‘nuul’, ‘nuull’, etc. It will not match a whole string like ‘nulul’, since the letters ‘u’ and ‘l’ alternate, but it will match the ‘nul’ part of that string.
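
Here is a small sketch of the plus sign at work, again with one invented word per line and ‘-x’ to demand a whole-line match. It also shows the same pattern written for ‘-E’, where the plus needs no backslash. The word list and variable names are assumptions for illustration.

```shell
# One made-up test word per line.
words=$(printf '%s\n' nu nul null nuul nuull nl nulul)

# At least one 'l' after "nu".
one_or_more_l=$(printf '%s\n' "$words" | grep -x 'nul\+' | xargs)

# At least one 'u', then at least one 'l'.
both_repeated=$(printf '%s\n' "$words" | grep -x 'nu\+l\+' | xargs)

# The same pattern in extended syntax; no backslashes needed.
same_with_ere=$(printf '%s\n' "$words" | grep -E -x 'nu+l+' | xargs)

# Without -x the line "nulul" counts, because it contains "nul".
partial=$(printf 'nulul\n' | grep -c 'nu\+l\+')

echo "$one_or_more_l"   # prints: nul null
echo "$both_repeated"   # prints: nul null nuul nuull
echo "$partial"         # prints: 1
```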

The question mark (?) searches for 0 or 1 occurrence of the preceding character. As with the ‘+’, it must be escaped with the backslash when ‘-E’ is not used.

When the command “grep 'nu\?l' grephelp.txt” is run, it will search for the strings ‘nl’ and ‘nul’. The letter ‘u’ will appear either 0 times or once.

If the search string were changed to 'nu\?l\?' then the ‘u’ and the ‘l’ would each appear either 0 times or once. The search would match ‘n’, ‘nu’, ‘nl’ and ‘nul’.
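
The two question mark patterns can be checked with the same one-word-per-line approach. This is a sketch with invented test words; ‘-x’ is added so that only whole-line matches are listed.

```shell
# One made-up test word per line.
words=$(printf '%s\n' n nu nl nul nuul)

# 'u' is optional, 'l' is required.
optional_u=$(printf '%s\n' "$words" | grep -x 'nu\?l' | xargs)

# Both 'u' and 'l' are optional.
both_optional=$(printf '%s\n' "$words" | grep -x 'nu\?l\?' | xargs)

echo "$optional_u"      # prints: nl nul
echo "$both_optional"   # prints: n nu nl nul
```

Keep in mind that without ‘-x’ the second pattern reports every line containing an ‘n’, since everything after the ‘n’ is optional.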

The curly brackets are used to specify how many times the preceding character must be repeated. For example, to find all instances of the string ‘null’ the command is:

grep -E ‘nul{2}’ grephelp.txt

Here the ‘l’ must appear twice, so the match must be ‘null’. The same search could just as easily be written as:

grep ‘null’ grephelp.txt

NOTE: To use the curly brackets as shown you must add the ‘-E’ parameter to ‘grep’. Without ‘-E’, each curly bracket has to be escaped with a backslash instead.
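
The sketch below compares the ‘-E’ form of the curly brackets with the backslash-escaped form that basic Regular Expressions use. The test words and variable names are invented for illustration.

```shell
# One made-up test word per line.
words=$(printf '%s\n' nul null nulll)

# Extended syntax: plain curly brackets.
ere_exact=$(printf '%s\n' "$words" | grep -E -x 'nul{2}' | xargs)

# Basic syntax: the curly brackets are escaped.
bre_exact=$(printf '%s\n' "$words" | grep -x 'nul\{2\}' | xargs)

# Without -x, "nulll" is counted too, because it contains "null".
unanchored=$(printf '%s\n' "$words" | grep -E -c 'nul{2}')

echo "$ere_exact $bre_exact $unanchored"   # prints: null null 2
```

So ‘{2}’ requires two occurrences, but it does not stop a third ‘l’ from following unless the pattern is anchored.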

If we were to search for a string of three vowels in a row, the command is:

grep -E ‘[aeiou]{3}’ grephelp.txt

The resulting matches would be ‘Miscellaneous’ and ‘quiet’. We could search for three consonants in a row by the command:

grep -E ‘[^aeiou ]{3}’ grephelp.txt

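The space is added within the brackets because otherwise a space would also count as a match. Keep in mind that the negated Class matches anything that is not a lower-case vowel or a space, so numbers, punctuation and upper-case letters count as well.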
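
Because a negated Class accepts more than just consonants, it can help to compare it with a Class that lists the consonants directly. The following is only a sketch: the sample lines are invented, and the explicit consonant Class is my own construction, not something from the help file.

```shell
# Made-up sample lines.
sample=$(printf '%s\n' 'strengths' 'a 1-2 b' 'audio')

# Negated Class: "1-2" counts, since digits and dashes are not vowels or spaces.
loose=$(printf '%s\n' "$sample" | LC_ALL=C grep -E -c '[^aeiou ]{3}')

# Explicit lower-case consonants only (every letter except a, e, i, o, u).
strict=$(printf '%s\n' "$sample" | LC_ALL=C grep -E -c '[b-df-hj-np-tv-z]{3}')

echo "$loose $strict"   # prints: 2 1
```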

To specify three or more consonants, the command would be:

grep -E ‘[^aeiou ]{3,}’ grephelp.txt

By not specifying an upper limit, the maximum number of occurrences is unlimited.

When both a minimum and a maximum are given, both ends are included. For example, the command to search for 6 to 7 characters in a row that are not upper or lower-case vowels (or spaces) would be:

grep -E ‘[^aeiouAEIOU ]{6,7}’ grephelp.txt

The results are shown in Figure 8.


Figure 08.jpg

FIGURE 8
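
To check that both ends of a range are included, this sketch tests invented words of 5, 6, 7 and 8 characters with ‘-x’, so the whole line has to fit the pattern. The words and variable names are assumptions for illustration.

```shell
# One made-up test word per line: 5, 6, 7, 6 and 8 characters long.
words=$(printf '%s\n' myths rhythm rhythms 123456 12345678)

# Between 6 and 7 characters, none of them a vowel or a space.
bounded=$(printf '%s\n' "$words" | LC_ALL=C grep -E -x '[^aeiouAEIOU ]{6,7}' | xargs)

# 6 or more, with no upper limit.
open_ended=$(printf '%s\n' "$words" | LC_ALL=C grep -E -x '[^aeiouAEIOU ]{6,}' | xargs)

echo "$bounded"      # prints: rhythm rhythms 123456
echo "$open_ended"   # prints: rhythm rhythms 123456 12345678
```

The digits match here for the same reason as before: the negated Class accepts anything that is not a vowel or a space.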

Practice with the Regular Expressions and see what search strings you can design to find certain strings in a file. Good Luck!
 

