regex

dos2unix

I apologize for posting so many articles in such a short amount of time. Christmas vacation is over and I have to go back to work Monday
:)

Regular Expressions Part 1: Introduction & Basics


Regular expressions (regex) are powerful pattern-matching tools used throughout Linux and programming. They allow you to search, match, and manipulate text based on patterns rather than exact strings.




What is a Regular Expression?


A regular expression is a sequence of characters that defines a search pattern. Instead of searching for the exact text "error", you could search for any line containing "error" or "Error" or "ERROR", or even more complex patterns like "any word starting with 'err'".




Where Regex is Used


Regular expressions appear in many Linux tools and programming languages:


Command-line tools:
Code:
grep
sed
awk
less
vim


Programming languages:
Code:
Perl
Python
JavaScript
PHP
Ruby
Java




Understanding POSIX Flavors


POSIX (Portable Operating System Interface) defines two main regex flavors for Unix/Linux systems:


Basic Regular Expressions (BRE):


  • Default in grep and sed
  • Requires backslashes for special characters
  • More verbose syntax
  • Limited feature set

Extended Regular Expressions (ERE):


  • Used with grep -E (or egrep) and awk
  • Cleaner syntax (fewer backslashes)
  • More intuitive
  • Additional operators

We'll cover both flavors in detail in upcoming posts. For now, understand that the same pattern might require different syntax depending on which tool you're using.




Basic Pattern Elements


Literal characters match themselves:
Code:
error
Matches the exact text "error"


The dot (.) matches any single character:
Code:
e.ror
Matches "error", "e5ror", "e ror", etc.


Character classes [ ] match any one character inside:
Code:
[Ee]rror
Matches "error" or "Error"


Code:
[0-9]
Matches any single digit


Code:
[a-z]
Matches any lowercase letter


Code:
[A-Za-z0-9]
Matches any letter or digit


Negated character classes [^ ] match anything NOT listed:
Code:
[^0-9]
Matches any character that's not a digit
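
To see these elements side by side, here is a minimal Python sketch. The sample strings and the matching() helper are made up for illustration; for simple patterns like these, Python's re module behaves the same way grep does.

```python
import re

# Made-up sample strings for this sketch
samples = ["error", "Error", "e5ror", "terror7"]

def matching(pattern):
    """Return the samples in which the pattern is found anywhere."""
    return [s for s in samples if re.search(pattern, s)]

print(matching(r"error"))     # literal text
print(matching(r"e.ror"))     # dot = any single character
print(matching(r"[Ee]rror"))  # character class
print(matching(r"[0-9]"))     # any digit
print(matching(r"[^a-z]"))    # anything that is not a lowercase letter
```

Note that "terror7" shows up for the literal pattern too: without anchors, a pattern may match anywhere inside the string.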




Anchors


Anchors don't match characters—they match positions:


^ matches start of line:
Code:
^Error
Matches "Error" only at the beginning of a line


$ matches end of line:
Code:
Error$
Matches "Error" only at the end of a line


Combined:
Code:
^Error$
Matches lines containing only "Error" (nothing before or after)
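
A quick way to watch the anchors work is to run them line by line, the way grep does. This is only a sketch: the log text and the lines_matching() helper are invented for the example.

```python
import re

# Made-up three-line log for this sketch
log = "Error\nDisk Error\nError: disk full"

def lines_matching(pattern, text):
    """Return the lines in which the pattern is found (grep-like)."""
    return [line for line in text.splitlines() if re.search(pattern, line)]

print(lines_matching(r"^Error", log))   # lines that start with "Error"
print(lines_matching(r"Error$", log))   # lines that end with "Error"
print(lines_matching(r"^Error$", log))  # lines that are exactly "Error"
```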




Quantifiers (How Many Times)


* matches zero or more of the preceding character:
Code:
erro*r
Matches "errr", "error", "erroor", "errooor", etc. (the * applies only to the "o")


Note: In BRE (basic grep/sed), you use * directly. Other quantifiers require special handling, which we'll cover in later parts.
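
If you want to convince yourself which strings erro*r really accepts, a small Python check helps. The candidate list is made up; fullmatch() demands that the whole string fit the pattern, while search() is happy with a match anywhere inside it (which is how grep decides whether to print a line).

```python
import re

# Made-up candidates for this sketch
candidates = ["errr", "error", "erroor", "eroor", "errrrr"]

# The whole string must fit the pattern
whole = [s for s in candidates if re.fullmatch(r"erro*r", s)]

# A match anywhere in the string is enough (grep-style)
anywhere = [s for s in candidates if re.search(r"erro*r", s)]

print(whole)
print(anywhere)
```

"errrrr" only appears in the second list because it contains "errr" as a substring.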




Escape Character


The backslash \ makes special characters literal:
Code:
\.
Matches an actual period (not "any character")


Code:
\$
Matches a dollar sign (not "end of line")
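
Here is a small Python sketch of the difference an escape makes. The sample text is invented; re.escape() is the standard-library helper that adds the backslashes for you when the text to search for comes from somewhere else.

```python
import re

text = "1.5 and 125"  # made-up sample

wildcard = re.findall(r"1.5", text)   # dot matches any character
literal = re.findall(r"1\.5", text)   # escaped dot matches only "."

no_hit = re.findall(r"$5", "cost: $5")   # $ means end of string here
dollar = re.findall(r"\$5", "cost: $5")  # escaped $ is a real dollar sign

print(wildcard, literal, no_hit, dollar)
print(re.escape("1.5"))  # builds the escaped pattern automatically
```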




Simple Examples


Find lines containing "error" (case-insensitive):
Code:
grep -i error logfile.txt


Find lines starting with a number:
Code:
grep '^[0-9]' file.txt


Find email-like patterns:
Code:
grep '[a-zA-Z0-9]@[a-zA-Z0-9][a-zA-Z0-9]*\.' file.txt


Find empty lines:
Code:
grep '^$' file.txt


Find lines containing "error" or "warning" (alternation needs grep -E):
Code:
grep -E 'error|warning' file.txt




Common Predefined Character Classes


Many regex flavors support shorthand for common patterns:


Code:
[:alnum:]    Alphanumeric characters
[:alpha:]    Alphabetic characters
[:digit:]    Digits 0-9
[:lower:]    Lowercase letters
[:upper:]    Uppercase letters
[:space:]    Whitespace (space, tab, newline)
[:punct:]    Punctuation characters


Used inside character classes:
Code:
grep '[[:digit:]]' file.txt
Matches any line containing a digit




Testing Your Regex


Before using regex in scripts, test it:


Interactive testing with grep:
Code:
echo "test string" | grep 'pattern'


Show matching part with color:
Code:
grep --color 'pattern' file.txt


Count matches:
Code:
grep -c 'pattern' file.txt




Next in This Series


In Part 2, we'll dive deeper into POSIX Basic Regular Expressions (BRE), including:

  • Why certain characters need escaping
  • Grouping and backreferences
  • Repetition patterns
  • Real-world examples with grep and sed
 


Regular Expressions Part 2: POSIX Basic (BRE)
POSIX Basic Regular Expressions (BRE) are the default regex flavor used by grep and sed. While they're more verbose than other flavors, understanding BRE is essential because these tools are everywhere in Linux.

Where BRE is Used
Default in these commands:
Code:
grep
sed
To use BRE explicitly:
Code:
grep pattern file.txt
sed 's/pattern/replacement/' file.txt

The Backslash Problem
BRE's most confusing aspect: many special characters require backslashes, which seems backwards.
In BRE, these are literal unless escaped:

Parentheses ( )
Curly braces { }
Plus sign +
Question mark ?
Pipe |

To make them special, you MUST escape them:
Code:
\(    Start a group
\)    End a group
\{    Start repetition count
\}    End repetition count
\+    One or more (GNU extension)
\?    Zero or one (GNU extension)
\|    Alternation (GNU extension)
This is opposite of most programming languages!

Grouping and Backreferences
Groups capture matched text for reuse:
Create a group:
Code:
\(pattern\)
Reference captured groups:
Code:
\1    First group
\2    Second group
\3    Third group
Example - Find repeated words:
Code:
grep '\([a-z][a-z]*\) \1' file.txt
Matches "the the" or "is is"
Example - Swap two words with sed:
Code:
echo "John Doe" | sed 's/\(.*\) \(.*\)/\2 \1/'
Output: "Doe John"
Example - Match HTML tags:
Code:
grep '<\([a-z][a-z]*\)>.*</\1>' file.html
Matches <div>content</div> but not <div>content</span>

Repetition Quantifiers
The asterisk * (zero or more):
Code:
a*
Matches "", "a", "aa", "aaa", etc.
Escaped braces \{n,m\} (specific counts):
Exactly n times:
Code:
a\{3\}
Matches exactly "aaa"
At least n times:
Code:
a\{3,\}
Matches "aaa", "aaaa", "aaaaa", etc.
Between n and m times:
Code:
a\{2,4\}
Matches "aa", "aaa", or "aaaa"
Example - Find phone numbers (###-####):
Code:
grep '[0-9]\{3\}-[0-9]\{4\}' file.txt
Example - Find 3-5 letter words:
Code:
grep '\b[a-z]\{3,5\}\b' file.txt
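
The counts are easy to check in Python. Keep in mind that Python's re module spells them without the backslashes (a{3}, not a\{3\}), so this is a sketch of the counting behaviour, not of BRE syntax. The test strings and the full() helper are made up.

```python
import re

# Made-up test strings for this sketch
tests = ["a", "aa", "aaa", "aaaa", "aaaaa"]

def full(pattern):
    """Return the test strings that the pattern matches in full."""
    return [s for s in tests if re.fullmatch(pattern, s)]

print(full(r"a{3}"))    # exactly three
print(full(r"a{3,}"))   # three or more
print(full(r"a{2,4}"))  # between two and four

# The phone number example, Python spelling
phones = re.findall(r"[0-9]{3}-[0-9]{4}", "call 555-1234 or 5551234")
print(phones)
```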

GNU Extensions to BRE
GNU grep and sed add these (not in strict POSIX BRE):
\+ (one or more):
Code:
a\+
Matches "a", "aa", "aaa", etc. (but not empty string)
\? (zero or one):
Code:
colou\?r
Matches "color" or "colour"
\| (alternation - OR):
Code:
cat\|dog
Matches "cat" or "dog"
Word boundaries:
Code:
\<     Start of word
\>     End of word
\b     Word boundary (GNU)
Example - Find whole word "test":
Code:
grep '\<test\>' file.txt
Matches "test" but not "testing" or "contest"

Practical BRE Examples
1. Find lines with IP addresses:
Code:
grep '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' /var/log/auth.log
2. Remove blank lines:
Code:
sed '/^$/d' file.txt
3. Find lines starting with # (comments):
Code:
grep '^#' config.conf
4. Replace multiple spaces with single space:
Code:
sed 's/  */ /g' file.txt
5. Extract lines between two patterns:
Code:
sed -n '/START/,/END/p' file.txt
6. Find duplicate consecutive lines (grep looks at one line at a time, so this one needs sed):
Code:
sed -n '$!N; /^\(.*\)\n\1$/P; D' file.txt

7. Match email addresses:
Code:
grep '[a-zA-Z0-9._%+-]\+@[a-zA-Z0-9.-]\+\.[a-zA-Z]\{2,\}' file.txt
8. Find C-style comments:
Code:
grep '/\*.*\*/' source.c
9. Find lines with exactly 5 characters:
Code:
grep '^.\{5\}$' file.txt
10. Replace date format (MM/DD/YYYY to YYYY-MM-DD):
Code:
sed 's/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)/\3-\1-\2/' dates.txt

Common BRE Pitfalls
Forgetting to escape special characters:
Code:
grep '(test)'     # Wrong - matches literal "(test)"
grep '\(test\)'   # Right - creates a group
Using + without escaping:
Code:
grep 'a+'         # Wrong in BRE - matches literal "a+"
grep 'a\+'        # Right (GNU extension)
grep 'aa*'        # Portable alternative (one 'a' plus zero or more)
Forgetting anchors:
Code:
grep '192\.168'    # Matches anywhere in line
grep '^192\.168'   # Only at start of line

BRE vs ERE Quick Comparison
Same pattern in both flavors:
BRE (grep/sed default):
Code:
grep '\(abc\)\{2,4\}\|xyz' file.txt
ERE (grep -E):
Code:
grep -E '(abc){2,4}|xyz' file.txt
ERE is cleaner! We'll cover it in Part 3.

Testing BRE Patterns
Echo and pipe to grep:
Code:
echo "test123" | grep '[0-9]\{3\}'
Use grep with color to see what matches:
Code:
grep --color '\<[a-z]\{5,\}\>' file.txt
Test sed substitution without changing file:
Code:
sed -n 's/\(pattern\)/MATCHED: \1/p' file.txt

Next in This Series
Part 3 will cover POSIX Extended Regular Expressions (ERE), which simplify much of BRE's awkward syntax and are used with grep -E and awk.
 
Regular Expressions Part 4: Perl-Compatible (PCRE)

Perl-Compatible Regular Expressions (PCRE) are the most powerful and feature-rich regex flavor. PCRE extends POSIX ERE with advanced features that make complex pattern matching much easier.

---

Where PCRE is Used

Command-line:
Code:
grep -P
pcregrep

Programming languages:
Code:
Perl
PHP
Python (mostly compatible)
JavaScript (similar)
Ruby
Java
C# / .NET

Note: Not all Linux systems have grep -P enabled (depends on how grep was compiled).

---

Basic PCRE (Same as ERE)

All ERE features work in PCRE:
Code:
(group)         Groups
+               One or more
?               Zero or one
{n,m}           Repetition counts
|               Alternation
^               Start of line
$               End of line

PCRE builds on these with powerful additions.

---

Non-Capturing Groups

Sometimes you need grouping for precedence but don't want to capture the match:

Syntax:
Code:
(?:pattern)

Example - Group without capturing:
Code:
grep -P '(?:http|https)://example\.com' urls.txt
Groups the protocol but doesn't create a backreference.

Why use it:
  • Faster (no memory allocation for capture)
  • Cleaner when you have many groups
  • Doesn't affect backreference numbering

Example - Multiple non-capturing groups:
Code:
grep -P '(?:Mr|Mrs|Ms|Dr)\.? (?:[A-Z][a-z]+) [A-Z][a-z]+' names.txt

---

Lookahead Assertions

Lookahead checks if a pattern follows without including it in the match.

Positive lookahead (?=...):
"Match if followed by..."

Code:
foo(?=bar)
Matches "foo" only if followed by "bar", but doesn't include "bar" in match.

Example - Find numbers followed by "px":
Code:
grep -oP '\d+(?=px)' styles.css
Matches "16" in "16px" (not "16px")

Example - Password must contain digit:
Code:
grep -P '^(?=.*\d).{8,}$' passwords.txt
Matches passwords 8+ chars that contain at least one digit.

Negative lookahead (?!...):
"Match if NOT followed by..."

Code:
foo(?!bar)
Matches "foo" only if NOT followed by "bar"

Example - Find "test" not followed by "ing":
Code:
grep -P 'test(?!ing)' file.txt
Matches "test" and "tester" but not "testing"

Example - Find files without .txt extension:
Code:
grep -P '^(?!.*\.txt$).+' filelist.txt
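
Lookaheads behave the same way in Python's re module, which makes them easy to try out. This is a sketch with made-up sample data; the negative lookahead for file names is anchored at the start of the string so that it can inspect the whole name.

```python
import re

# Made-up samples for this sketch
css = "font-size: 16px; line-height: 2em; margin: 8px"
words = "test tester testing"
files = ["notes.txt", "photo.jpg", "README"]

# Positive lookahead: digits only when "px" follows, "px" itself is not returned
px_values = re.findall(r"\d+(?=px)", css)

# Negative lookahead: words starting with "test" unless "ing" comes next
not_ing = re.findall(r"test(?!ing)\w*", words)

# Negative lookahead at the start: names that do not end in ".txt"
no_txt = [f for f in files if re.search(r"^(?!.*\.txt$).+", f)]

print(px_values, not_ing, no_txt)
```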

---

Lookbehind Assertions

Lookbehind checks if a pattern precedes without including it in the match.

Positive lookbehind (?<=...):
"Match if preceded by..."

Code:
(?<=foo)bar
Matches "bar" only if preceded by "foo"

Example - Extract price numbers:
Code:
grep -oP '(?<=\$)\d+\.\d{2}' prices.txt
Matches "19.99" in "$19.99" (not the $)

Example - Find words after "Error:":
Code:
grep -oP '(?<=Error: )\w+' log.txt

Negative lookbehind (?<!...):
"Match if NOT preceded by..."

Code:
(?<!foo)bar
Matches "bar" only if NOT preceded by "foo"

Example - Find standalone "test" (not "retest"):
Code:
grep -P '(?<!re)test' file.txt
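
The same lookbehind patterns can be tried in Python. One thing to keep in mind: Python's re module only accepts fixed-width lookbehinds, which all of these are. The sample strings are made up for the sketch.

```python
import re

# Made-up samples for this sketch
prices = "apple $19.99, pear 3.50, plum $0.75"
log = "Error: timeout, Error: refused, Warning: slow"
words = "test retest contest"

# Positive lookbehind: amounts that have a "$" right in front of them
amounts = re.findall(r"(?<=\$)\d+\.\d{2}", prices)

# Positive lookbehind: the word right after "Error: "
reasons = re.findall(r"(?<=Error: )\w+", log)

# Negative lookbehind: "test" unless "re" comes right before it
starts = [m.start() for m in re.finditer(r"(?<!re)test", words)]

print(amounts, reasons, starts)
```

The last result shows that "contest" still matches (at position 15), because only the two characters "re" are ruled out.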

---

Named Capture Groups

Instead of numbered backreferences (\1, \2), use names:

Syntax:
Code:
(?<name>pattern)

Example - Extract date components:
Code:
grep -P '(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})' dates.txt

Use in replacement (with Perl):
Code:
perl -pe 's/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/$+{month}\/$+{day}\/$+{year}/' dates.txt

---

Shorthand Character Classes

PCRE provides convenient shortcuts:

Code:
\d      Digit [0-9]
\D      Non-digit [^0-9]
\w      Word character [a-zA-Z0-9_]
\W      Non-word character [^a-zA-Z0-9_]
\s      Whitespace [ \t\n\r\f]
\S      Non-whitespace [^ \t\n\r\f]

Example - Find words followed by numbers:
Code:
grep -P '\w+\s+\d+' file.txt

Example - Remove all whitespace:
Code:
perl -pe 's/\s+//g' file.txt

Example - Find non-alphanumeric characters:
Code:
grep -oP '\W' file.txt

---

Advanced Quantifiers

Lazy (non-greedy) quantifiers:

Add ? after quantifier to match as little as possible:

Code:
*?      Zero or more (lazy)
+?      One or more (lazy)
??      Zero or one (lazy)
{n,m}?  Range (lazy)

Example - Extract first HTML tag:
Code:
echo "<div>text</div><span>more</span>" | grep -oP '<.*?>'
Matches <div> (not entire string)

Example - Extract content between quotes:
Code:
grep -oP '".*?"' file.txt
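
Greedy versus lazy is easiest to understand by running both on the same input. A small Python sketch, using the HTML string from the example above; the negated character class in the last line is a common alternative that avoids lazy matching altogether.

```python
import re

html = "<div>text</div><span>more</span>"

greedy = re.findall(r"<.*>", html)      # grabs as much as possible
lazy = re.findall(r"<.*?>", html)       # grabs as little as possible
negated = re.findall(r"<[^>]*>", html)  # same result without a lazy quantifier

print(greedy)
print(lazy)
print(negated)
```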

Possessive quantifiers:

Add + after quantifier (prevents backtracking):

Code:
*+      Zero or more (possessive)
++      One or more (possessive)

Used for performance optimization in complex patterns.

---

Unicode Support

PCRE handles Unicode properly:

Code:
\p{L}   Any Unicode letter
\p{N}   Any Unicode number
\p{P}   Any Unicode punctuation
\p{Sc}  Currency symbols

Example - Match emoji:
Code:
grep -P '\p{So}' file.txt

Example - Match any letter (any language):
Code:
grep -P '\p{L}+' multilingual.txt

---

Practical PCRE Examples

1. Validate strong password:
Code:
grep -P '^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$' passwords.txt
Requires: lowercase, uppercase, digit, special char, 8+ length

2. Extract IPv4 addresses (strict):
Code:
grep -oP '\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b' file.txt

3. Match balanced parentheses (recursive):
Code:
grep -P '\((?:[^()]++|(?R))*+\)' code.txt

4. Extract email addresses (strict):
Code:
grep -oP '(?i)\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b' file.txt

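5. Find words NOT in quotes:
Code:
grep -oP '(?<!["\w])\w+(?!["\w])' file.txt
Skips any word that touches a double quote; words in the middle of a multi-word quoted phrase still match.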

6. Extract hashtags:
Code:
grep -oP '(?<!\w)#\w+' social.txt

7. Match credit card numbers (with optional spaces):
Code:
grep -P '\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}' cards.txt

8. Find duplicate words (case-insensitive):
Code:
grep -P '(?i)\b(\w+)\s+\1\b' file.txt

9. Extract content between XML/HTML tags:
Code:
grep -oP '(?<=<title>).*?(?=</title>)' page.html

10. Match file paths (Unix/Windows):
Code:
grep -P '(?:[a-zA-Z]:|/)(?:[\w.-]+[/\\])*[\w.-]+' paths.txt

---

Comparing Flavors: Same Pattern

Match email addresses:

BRE (grep):
Code:
grep '[a-zA-Z0-9._%+-]\+@[a-zA-Z0-9.-]\+\.[a-zA-Z]\{2,\}' file.txt

ERE (grep -E):
Code:
grep -E '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' file.txt

PCRE (grep -P):
Code:
grep -P '\b[\w._%+-]+@[\w.-]+\.\w{2,}\b' file.txt

PCRE is shortest and most readable!

---

Testing PCRE Patterns

Test with echo:
Code:
echo "user@example.com" | grep -P '\w+@\w+\.\w+'

Extract only matches:
Code:
grep -oP 'pattern' file.txt

Use Perl for complex testing:
Code:
echo "test string" | perl -ne 'print if /pattern/'

Online testers:
  • regex101.com (supports PCRE)
  • regexr.com

---

PCRE Limitations

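Not available everywhere:
  • Some Linux distros compile grep without PCRE support
  • To check, run echo test | grep -P 'test' and see whether it errors out (some recent GNU grep releases also mention PCRE in grep --version)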

Performance:
  • More complex = slower
  • Lookahead/lookbehind can be expensive
  • Test on large files before production use

Portability:
  • PCRE patterns may not work with basic grep
  • Document which flavor you're using
  • Consider fallback to ERE for scripts

---

When to Use PCRE

Use PCRE when you need:
  • Lookahead or lookbehind
  • Named capture groups
  • Unicode support
  • Lazy quantifiers
  • Complex validation (passwords, emails, etc.)

Stick with ERE when:
  • Pattern is simple
  • Portability matters
  • grep -P not available
  • Performance critical

---

Next in This Series

Part 5 will cover Python's re module, which is similar to PCRE but has its own quirks and features. We'll explore raw strings, compiled patterns, and Python-specific methods.

---
 
Regular Expressions Part 5: Python Regex

Python's re module provides powerful regex capabilities similar to PCRE, with some Python-specific features and syntax. Understanding Python regex is essential for scripting and automation.

---

Importing the Module

Python regex requires importing the re module:

Code:
import re

---

Raw Strings - Critical for Regex

Python uses backslashes for escape sequences (\n, \t, etc.), which conflicts with regex backslashes.

Wrong way (double escaping required):
Code:
pattern = "\\d+\\s+\\w+"

Right way (raw strings):
Code:
pattern = r"\d+\s+\w+"

Always use raw strings (r"...") for regex patterns in Python!
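
To see why this matters, compare what Python actually stores for the two spellings. In a normal string, \b is the backspace character, so the regex engine never sees a word boundary. A minimal sketch with a made-up sentence:

```python
import re

plain = "\bcat\b"   # normal string: \b becomes a backspace character
raw = r"\bcat\b"    # raw string: backslash and "b" are kept as typed

print(len(plain), len(raw))

sentence = "a cat sat"
plain_hit = re.search(plain, sentence)  # looks for backspace + cat + backspace
raw_hit = re.search(raw, sentence)      # looks for the whole word "cat"

print(plain_hit)
print(raw_hit.group())
```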

---

Basic Matching Functions

re.search() - Find first match anywhere:
Code:
import re

text = "The price is $19.99"
match = re.search(r'\$\d+\.\d{2}', text)

if match:
    print(match.group())
Output: $19.99

re.match() - Match at start of string only:
Code:
text = "Error: file not found"
match = re.match(r'Error:', text)

if match:
    print("Found error at start")
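
The difference between these functions trips up a lot of people, so here is a short side-by-side sketch (the sample strings are made up). re.fullmatch() is included as well, since it is the one to reach for when the whole string has to fit the pattern.

```python
import re

text = "Warning: Error: file not found"  # made-up sample

at_start = re.match(r"Error:", text)    # only tries position 0
anywhere = re.search(r"Error:", text)   # scans the whole string

print(at_start)
print(anywhere.start())

# match() accepts a matching prefix, fullmatch() wants the entire string
prefix_ok = re.match(r"\d{4}", "2024-01") is not None
whole_ok = re.fullmatch(r"\d{4}", "2024-01") is not None
print(prefix_ok, whole_ok)
```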

re.findall() - Find all matches:
Code:
text = "Prices: $10.99, $25.50, $5.00"
prices = re.findall(r'\$\d+\.\d{2}', text)
print(prices)
Output: ['$10.99', '$25.50', '$5.00']

re.finditer() - Find all matches as iterator:
Code:
text = "Error at line 10, Error at line 25"
for match in re.finditer(r'line (\d+)', text):
    print(f"Line number: {match.group(1)}")

---

Substitution

re.sub() - Replace matches:
Code:
text = "Hello World"
result = re.sub(r'World', 'Python', text)
print(result)
Output: Hello Python

With backreferences:
Code:
text = "John Doe"
result = re.sub(r'(\w+) (\w+)', r'\2, \1', text)
print(result)
Output: Doe, John

Using function for replacement:
Code:
def double_number(match):
    num = int(match.group(1))
    return str(num * 2)

text = "I have 5 apples and 10 oranges"
result = re.sub(r'(\d+)', double_number, text)
print(result)
Output: I have 10 apples and 20 oranges

re.subn() - Replace and count:
Code:
text = "cat dog cat bird cat"
result, count = re.subn(r'cat', 'tiger', text)
print(f"Result: {result}, Replacements: {count}")
Output: Result: tiger dog tiger bird tiger, Replacements: 3

---

Splitting Strings

re.split() - Split by pattern:
Code:
text = "one,two;three:four"
parts = re.split(r'[,;:]', text)
print(parts)
Output: ['one', 'two', 'three', 'four']

With capture groups:
Code:
text = "one,two;three"
parts = re.split(r'([,;])', text)
print(parts)
Output: ['one', ',', 'two', ';', 'three']

---

Compiled Patterns

For patterns used multiple times, compile once for better performance:

Code:
import re

pattern = re.compile(r'\b\w+@\w+\.\w+\b')

text1 = "Contact: user@example.com"
text2 = "Email: admin@example.org"

print(pattern.search(text1).group())
print(pattern.search(text2).group())

With flags:
Code:
pattern = re.compile(r'error', re.IGNORECASE)

---

Match Objects

Match objects contain information about the match:

Code:
text = "Error at line 42"
match = re.search(r'line (\d+)', text)

if match:
    print(match.group())      # Entire match: 'line 42'
    print(match.group(1))     # First group: '42'
    print(match.start())      # Start position: 9
    print(match.end())        # End position: 16
    print(match.span())       # Tuple: (9, 16)

Multiple groups:
Code:
text = "2024-01-03"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)

if match:
    year, month, day = match.groups()
    print(f"Year: {year}, Month: {month}, Day: {day}")

---

Named Groups

Python supports named capture groups:

Code:
text = "John Doe, age 30"
pattern = r'(?P<first>\w+) (?P<last>\w+), age (?P<age>\d+)'
match = re.search(pattern, text)

if match:
    print(match.group('first'))   # John
    print(match.group('last'))    # Doe
    print(match.group('age'))     # 30
    print(match.groupdict())      # {'first': 'John', 'last': 'Doe', 'age': '30'}

Backreference to named group:
Code:
text = "the the quick brown fox"
duplicates = re.findall(r'\b(?P<word>\w+)\s+(?P=word)\b', text)
print(duplicates)
Output: ['the']

---

Regex Flags

Common flags:
Code:
re.IGNORECASE  or  re.I      Case-insensitive
re.MULTILINE   or  re.M      ^ and $ match line boundaries
re.DOTALL      or  re.S      . matches newlines too
re.VERBOSE     or  re.X      Allow comments and whitespace
re.ASCII       or  re.A      ASCII-only matching

Using flags:
Code:
pattern = re.compile(r'error', re.IGNORECASE)

Multiple flags:
Code:
pattern = re.compile(r'pattern', re.I | re.M)

Inline flags:
Code:
pattern = r'(?i)error'  # Case-insensitive
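
A short sketch of what the flags change in practice. The three-line text is made up; everything else is the standard re module.

```python
import re

text = "error: one\nERROR: two\nok"  # made-up sample

# Without flags, ^ only matches at the very start of the string
default = re.findall(r"^error", text)

# MULTILINE lets ^ match after every newline, IGNORECASE ignores case
both = re.findall(r"^error", text, re.I | re.M)

# The same two flags written inline
inline = re.findall(r"(?im)^error", text)

# By default the dot stops at newlines; DOTALL lets it cross them
no_dotall = re.findall(r"one.*two", text)
dotall = re.findall(r"one.*two", text, re.S)

print(default, both, inline, no_dotall, dotall)
```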

---

Verbose Regex (re.VERBOSE)

Make complex patterns readable:

Code:
pattern = re.compile(r'''
    \b                  # Word boundary
    (?P<user>[\w.]+)    # Username
    @                   # At symbol
    (?P<domain>[\w.]+)  # Domain
    \.                  # Dot
    (?P<tld>\w{2,})     # TLD
    \b                  # Word boundary
''', re.VERBOSE)

text = "Contact: user@example.com"
match = pattern.search(text)
if match:
    print(match.groupdict())

---

Practical Python Examples

1. Extract all email addresses:
Code:
import re

text = "Contact user@example.com or admin@example.org"
emails = re.findall(r'\b[\w.%+-]+@[\w.-]+\.\w{2,}\b', text)
print(emails)

2. Validate phone numbers:
Code:
def validate_phone(number):
    pattern = r'^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$'
    match = re.match(pattern, number)
    return match is not None

print(validate_phone("(555) 123-4567"))  # True
print(validate_phone("555-1234"))        # False

3. Extract hashtags from social media:
Code:
text = "Check out #Python and #Regex tutorials! #coding"
hashtags = re.findall(r'#\w+', text)
print(hashtags)
Output: ['#Python', '#Regex', '#coding']

4. Parse log files:
Code:
log = "2024-01-03 14:30:25 ERROR Failed to connect"
pattern = r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.+)'
match = re.match(pattern, log)

if match:
    date, time, level, message = match.groups()
    print(f"Level: {level}, Message: {message}")

5. Clean HTML tags:
Code:
html = "<p>Hello <b>World</b></p>"
clean = re.sub(r'<[^>]+>', '', html)
print(clean)
Output: Hello World

6. Validate password strength:
Code:
def is_strong_password(password):
    # At least 8 chars, one uppercase, one lowercase, one digit
    pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$'
    return re.match(pattern, password) is not None

print(is_strong_password("Pass1234"))    # True
print(is_strong_password("weakpass"))    # False

7. Extract URLs:
Code:
text = "Visit https://example.com or http://test.org"
urls = re.findall(r'https?://[\w.-]+\.[\w]{2,}', text)
print(urls)

8. Replace multiple whitespace:
Code:
text = "Too    many      spaces"
clean = re.sub(r'\s+', ' ', text)
print(clean)
Output: Too many spaces

9. Extract numbers from text:
Code:
text = "I have 5 apples, 10 oranges, and 3.5 pounds of grapes"
# Lookarounds keep the two halves of "3.5" out of the integer list
integers = re.findall(r'(?<![\d.])\d+(?![\d.])', text)
numbers = re.findall(r'\b\d+\.?\d*\b', text)
print(f"Integers: {integers}")
print(f"All numbers: {numbers}")

10. Parse CSV with regex:
Code:
line = 'John,Doe,30,"New York, NY",Engineer'
fields = re.findall(r'(?:[^,"]+|"[^"]*")+', line)
print(fields)

---

Common Python Regex Patterns

Email validation:
Code:
r'\b[\w.%+-]+@[\w.-]+\.\w{2,}\b'

URL matching:
Code:
r'https?://(?:www\.)?[\w.-]+\.[\w]{2,}(?:/[\w./?=&-]*)?'

IP address:
Code:
r'\b(?:\d{1,3}\.){3}\d{1,3}\b'

Date (YYYY-MM-DD):
Code:
r'\d{4}-\d{2}-\d{2}'

Time (HH:MM:SS):
Code:
r'\d{2}:\d{2}:\d{2}'

Hexadecimal color:
Code:
r'#[0-9A-Fa-f]{6}'

---

Error Handling

Catch invalid regex:
Code:
import re

try:
    pattern = re.compile(r'[invalid(')
except re.error as e:
    print(f"Regex error: {e}")

Check if match exists:
Code:
match = re.search(r'pattern', text)
if match:
    print(match.group())
else:
    print("No match found")

---

Performance Tips

1. Compile patterns used multiple times:
Code:
pattern = re.compile(r'\d+')
for line in file:
    match = pattern.search(line)

2. Use raw strings:
Code:
pattern = r'\d+'  # Good
pattern = "\\d+"  # Bad (harder to read, easy to mess up)

3. Avoid catastrophic backtracking:
Code:
# Bad (can hang on certain inputs)
pattern = r'(a+)+'

# Better
pattern = r'a+'

4. Use non-capturing groups when possible:
Code:
pattern = r'(?:http|https)://\w+'  # Faster than capturing

---

Differences from PCRE

Python regex is mostly PCRE-compatible but:

  • No \K (keep) assertion
  • No possessive quantifiers or atomic groups before Python 3.11 (on older versions, use the regex module for these)
  • Slightly different Unicode handling
  • Some advanced PCRE features missing

For more PCRE-style features (possessive quantifiers, \K, recursion), use the third-party regex module:
Code:
pip install regex    # run in the shell
import regex         # then in Python

---

Next in This Series

Part 6 will cover practical real-world regex examples and best practices, including log file parsing, data validation, text extraction, and common pitfalls to avoid.
 
Regular Expressions Part 6: Practical Examples & Best Practices

This final part covers real-world regex applications, common patterns, performance considerations, and pitfalls to avoid. These are the patterns you'll actually use in daily system administration and scripting.

---

Log File Parsing

1. Apache/Nginx access logs:

Parse common log format:
Code:
grep -E '^([0-9.]+) - - \[([^]]+)\] "([A-Z]+) ([^ ]+) HTTP/[0-9.]+" ([0-9]+) ([0-9]+)' access.log

Extract just IP addresses:
Code:
grep -oP '^\d+\.\d+\.\d+\.\d+' access.log | sort | uniq -c | sort -rn

Find 404 errors:
Code:
grep -E '" 404 ' access.log

Extract URLs requested:
Code:
grep -oP 'GET \K[^ ]+' access.log
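
When the one-liners get too long, the same common log format pattern can be moved into Python with named groups. This is a rough sketch: the three log lines are made up, and LINE, hits and not_found are just names chosen for the example.

```python
import re
from collections import Counter

# Common log format, same fields as the grep -E pattern above
LINE = re.compile(
    r'^(?P<ip>[0-9.]+) - - \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>[^ ]+) HTTP/[0-9.]+" '
    r'(?P<status>[0-9]+) (?P<size>[0-9]+)'
)

# Made-up log lines for this sketch
sample = [
    '203.0.113.5 - - [03/Jan/2024:14:30:25 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '203.0.113.5 - - [03/Jan/2024:14:30:26 +0000] "GET /missing HTTP/1.1" 404 512',
    '198.51.100.7 - - [03/Jan/2024:14:31:02 +0000] "POST /login HTTP/1.1" 302 0',
]

hits = Counter()   # requests per IP address
not_found = []     # paths that returned 404

for line in sample:
    m = LINE.match(line)
    if m is None:
        continue   # skip lines that do not fit the format
    hits[m.group("ip")] += 1
    if m.group("status") == "404":
        not_found.append(m.group("path"))

print(hits.most_common())
print(not_found)
```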

2. System log patterns:

Find failed SSH attempts:
Code:
grep -E 'Failed password for .* from [0-9.]+' /var/log/auth.log

Extract failed login IPs:
Code:
grep -oP 'Failed password for .* from \K[0-9.]+' /var/log/auth.log | sort | uniq -c | sort -rn

Find kernel errors:
Code:
grep -E 'kernel:.*error' /var/log/syslog

Parse systemd journal:
Code:
journalctl | grep -E '^[A-Z][a-z]{2} [0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}'

3. Application logs:

Extract timestamps:
Code:
grep -oP '\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}' app.log

Find errors with context:
Code:
grep -E -A 3 -B 3 'ERROR|CRITICAL' app.log

Extract JSON from logs:
Code:
grep -oP '\{[^}]+\}' app.log

---

Data Validation

1. Email addresses:

Simple validation:
Code:
^[\w.%+-]+@[\w.-]+\.\w{2,}$

More strict:
Code:
^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$

2. Phone numbers:

US format (flexible):
Code:
^\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})$

International format:
Code:
^\+?[1-9]\d{1,14}$

3. URLs:

Basic URL:
Code:
^https?://[\w.-]+\.[\w]{2,}(/[\w./?=&-]*)?$

With optional www and path:
Code:
^https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&/=]*)$

4. IP addresses:

IPv4 (simple):
Code:
^(\d{1,3}\.){3}\d{1,3}$

IPv4 (strict, validates ranges):
Code:
^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$

IPv6 (full, uncompressed form only):
Code:
^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$
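
The difference between the simple and the strict IPv4 pattern shows up as soon as an octet goes above 255. A minimal Python sketch using the two patterns from above; the function names and test addresses are made up.

```python
import re

SIMPLE = re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")
STRICT = re.compile(
    r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)

def is_ipv4_simple(addr):
    """Shape check only: four dot-separated groups of 1-3 digits."""
    return SIMPLE.match(addr) is not None

def is_ipv4_strict(addr):
    """Shape check plus range check (each octet 0-255)."""
    return STRICT.match(addr) is not None

for addr in ["192.168.1.1", "256.1.1.1", "10.0.0"]:
    print(addr, is_ipv4_simple(addr), is_ipv4_strict(addr))
```

In real Python code the standard-library ipaddress module is usually the safer choice for validation; the regex version is mainly useful where only a pattern can be supplied.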

5. Passwords:

Minimum requirements:
Code:
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$
Requires: lowercase, uppercase, digit, special char, 8+ length

6. Credit card numbers:

Basic format:
Code:
^\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$

Specific cards:
Code:
Visa:       ^4\d{3}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$
MasterCard: ^5[1-5]\d{2}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$
Amex:       ^3[47]\d{2}[\s-]?\d{6}[\s-]?\d{5}$

---

Text Extraction and Manipulation

1. Extract all numbers:
Code:
grep -oP '\d+\.?\d*' file.txt

2. Extract words in quotes:
Code:
grep -oP '(?<=")[^"]+(?=")' file.txt

3. Extract hashtags:
Code:
grep -oP '#\w+' social.txt

4. Extract mentions (@username):
Code:
grep -oP '@\w+' social.txt

5. Remove HTML tags:
Code:
sed 's/<[^>]*>//g' file.html

6. Extract MAC addresses:
Code:
grep -oP '([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}' network.txt

7. Extract version numbers:
Code:
grep -oP '\d+\.\d+\.\d+' changelog.txt

8. Convert date formats:
Code:
sed -E 's/([0-9]{2})\/([0-9]{2})\/([0-9]{4})/\3-\1-\2/g' dates.txt
Converts MM/DD/YYYY to YYYY-MM-DD
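
The same backreference trick works in Python with `re.sub`. A minimal sketch, where the `\b` word boundaries are my addition (not in the sed version) to avoid touching longer digit runs, and `to_iso_dates` is a hypothetical name:

```python
import re

# Groups: 1 = month, 2 = day, 3 = year
US_DATE = re.compile(r'\b(\d{2})/(\d{2})/(\d{4})\b')

def to_iso_dates(text):
    """Rewrite every MM/DD/YYYY in text as YYYY-MM-DD."""
    return US_DATE.sub(r'\3-\1-\2', text)
```

Like the sed one-liner, this reorders digits only. It does not check that the month or day is in range.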

---

Configuration File Parsing

1. Extract the keys from key=value pairs:
Code:
grep -oP '^\s*\K\w+(?==)' config.conf

2. Remove comments:
Code:
sed 's/#.*//' config.conf
sed 's/\/\/.*//' config.js

3. Extract variable assignments:
Code:
grep -P '^\s*[A-Z_]+=.+' .env

4. Find uncommented lines:
Code:
grep -vE '^\s*(#|$)' config.conf

5. Extract section headers:
Code:
grep -oP '(?<=\[)[^\]]+(?=\])' config.ini
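
Several of the one-liners above (skip comments and blank lines, pick out assignments) can be combined in a few lines of Python. This is a sketch under simple assumptions: one `key = value` per line, `#` comments on their own lines, no quoting or inline comments. The name `parse_config` is hypothetical.

```python
import re

ASSIGNMENT = re.compile(r'^\s*(\w+)\s*=\s*(.*?)\s*$')
BLANK_OR_COMMENT = re.compile(r'^\s*(#|$)')   # same test as grep -vE above

def parse_config(text):
    """Return a dict of key/value strings from simple key=value text."""
    settings = {}
    for line in text.splitlines():
        if BLANK_OR_COMMENT.match(line):
            continue
        m = ASSIGNMENT.match(line)
        if m:                       # section headers etc. are skipped
            settings[m.group(1)] = m.group(2)
    return settings
```

For real INI files, Python's `configparser` module is probably a better fit than hand-rolled patterns.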

---

Network and Security

1. Find private IP addresses:
Code:
grep -oP '\b(?:10\.\d{1,3}|172\.(?:1[6-9]|2[0-9]|3[01])|192\.168)\.\d{1,3}\.\d{1,3}\b' file.txt

2. Extract URLs from HTML:
Code:
grep -oP '(?:href|src)="\Khttps?://[^"]+' page.html

3. Find potential SQL injection attempts:
Code:
grep -iE "(union|select|insert|update|delete|drop|exec|script)" access.log

4. Extract SSL certificate info:
Code:
openssl x509 -in cert.pem -text | grep -oP 'Subject:.*'

5. Parse firewall rules:
Code:
iptables -L -n | grep -oP '\d+\.\d+\.\d+\.\d+/\d+'
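
Going back to item 1 (private IP addresses): the 10.x range needs three more octets after the prefix, while 172.16-31 and 192.168 need only two. A Python sketch that spells this out, with word boundaries added so that `110.1.2.3` is not picked up (the names `PRIVATE_IPV4` and `find_private_ips` are mine):

```python
import re

PRIVATE_IPV4 = re.compile(
    r'\b(?:10\.\d{1,3}'                    # 10.x       + two more octets
    r'|172\.(?:1[6-9]|2[0-9]|3[01])'       # 172.16-31  + two more octets
    r'|192\.168)'                          # 192.168    + two more octets
    r'\.\d{1,3}\.\d{1,3}\b'
)

def find_private_ips(text):
    """Return every private-range IPv4 address found in text."""
    return PRIVATE_IPV4.findall(text)
```

The octets are not range-checked here, so `10.999.1.1` would still be returned.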

---

Common Mistakes and How to Avoid Them

1. Forgetting to escape special characters:

Wrong:
Code:
grep '192.168.1.1' file.txt
Also matches "192x168x1x1" (an unescaped . matches any character)

Right:
Code:
grep '192\.168\.1\.1' file.txt

2. Greedy vs lazy matching:

Wrong (greedy):
Code:
echo "<tag>content</tag><tag>more</tag>" | grep -oP '<tag>.*</tag>'
Matches entire string

Right (lazy):
Code:
echo "<tag>content</tag><tag>more</tag>" | grep -oP '<tag>.*?</tag>'
Matches each tag separately
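
The same comparison in Python, plus a third option that often avoids the question entirely: a negated character class. This is just a sketch using the sample string from above.

```python
import re

text = "<tag>content</tag><tag>more</tag>"

greedy = re.findall(r'<tag>.*</tag>', text)      # one match: the whole string
lazy = re.findall(r'<tag>.*?</tag>', text)       # two matches, one per tag
negated = re.findall(r'<tag>[^<]*</tag>', text)  # two matches, no backtracking into tags
```

`[^<]*` works here because the content contains no `<`. With nested markup you would want a real parser.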

3. Not anchoring patterns:

Wrong:
Code:
grep -P '\d{3}' file.txt
Matches 123 in "abc123def" (-P is needed because plain grep does not understand \d)

Right:
Code:
grep -P '^\d{3}$' file.txt
Matches only lines with exactly 3 digits
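
In Python the anchoring choice shows up as the choice between `search` and `fullmatch`. A small sketch (function names are made up):

```python
import re

THREE_DIGITS = re.compile(r'\d{3}')

def contains_three_digits(line):
    """Unanchored: true if three digits appear anywhere in the line."""
    return THREE_DIGITS.search(line) is not None

def is_three_digits(line):
    """Anchored: true only if the whole line is exactly three digits."""
    return THREE_DIGITS.fullmatch(line) is not None
```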

4. Catastrophic backtracking:

Dangerous:
Code:
(a+)+b
Can take exponential time in a backtracking engine on a long run of a's with no b at the end, e.g. "aaaaaaaaaaaaaaaaaaaaaaaa"

Better:
Code:
a+b
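
If you want to see the effect for yourself, a rough timing harness in Python could look like this. It is a sketch only: `seconds_to_fail` is a hypothetical helper, and actual timings depend on the machine and Python version.

```python
import re
import time

NESTED = re.compile(r'(a+)+b')   # nested quantifiers
FLAT = re.compile(r'a+b')        # matches the same strings, no nesting

def seconds_to_fail(pattern, n):
    """Time how long pattern takes on n 'a' characters with no 'b'."""
    subject = 'a' * n
    start = time.perf_counter()
    pattern.match(subject)       # cannot succeed: there is no 'b'
    return time.perf_counter() - start
```

Start with small values such as `seconds_to_fail(NESTED, 18)` and raise n one step at a time. With Python's backtracking engine I would expect the nested version to slow down sharply while the flat one stays negligible.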

5. Case sensitivity:

Remember to use:
Code:
grep -i 'pattern'           # Case-insensitive grep
sed 's/pattern/replace/gi'  # Case-insensitive sed (GNU extension)
re.IGNORECASE              # Python flag

---

Performance Best Practices

1. Anchor patterns when possible:
Code:
^pattern    # Faster - only checks start
pattern$    # Faster - only checks end
pattern     # Slower - checks entire line

2. Use character classes efficiently:
Code:
[0-9]       # Good
\d          # PCRE only, same speed
[0123456789] # Slower, harder to read

3. Avoid unnecessary capture groups:
Code:
(?:pattern)  # Non-capturing (faster)
(pattern)    # Capturing (slower if not needed)
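
Beyond speed, the choice changes what you get back. In Python, for example, `re.findall` returns the groups when the pattern has any and the whole match when it has none. A quick sketch with made-up sample text:

```python
import re

text = "v1.2 v10.4"

# Capturing groups: findall returns a tuple of groups per match
with_groups = re.findall(r'v(\d+)\.(\d+)', text)

# Non-capturing groups: findall returns the full matched text
without_groups = re.findall(r'v(?:\d+)\.(?:\d+)', text)
```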

4. Compile patterns in scripts:

Python:
Code:
import re

pattern = re.compile(r'\d+')      # Compile once
with open('app.log') as f:        # app.log is a placeholder path
    for line in f:
        match = pattern.search(line)  # Reuse the compiled pattern

5. Use specific quantifiers:
Code:
\d{10}      # Better - specific count
\d{10,10}   # Worse - range with same min/max
\d+         # Worst for known length

---

Testing and Debugging

1. Test incrementally:
Code:
# Start simple
\d+

# Add complexity
\d+\.\d+

# Add anchors
^\d+\.\d+$

# Add validation
^(?:\d{1,3}\.){3}\d{1,3}$
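
One way to make the incremental approach repeatable is a tiny table of inputs and expected results that you rerun after each change. A sketch in Python, where `failing_cases` and `CASES` are hypothetical names:

```python
import re

def failing_cases(pattern, cases):
    """Return the (text, expected) pairs where fullmatch disagrees."""
    compiled = re.compile(pattern)
    return [(text, expected) for text, expected in cases
            if (compiled.fullmatch(text) is not None) != expected]

# What we want the finished pattern to accept (True) or reject (False)
CASES = [("192.168.1.1", True), ("3.14", False), ("1.2.3", False), ("", False)]
```

Running `failing_cases(r'\d+\.\d+', CASES)` shows which inputs the intermediate pattern still gets wrong. The final pattern should return an empty list.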

2. Use verbose mode for complex patterns:

Python:
Code:
import re

pattern = re.compile(r'''
    ^                 # Start of line
    (?P<ip>           # IP address group
        (?:\d{1,3}\.){3}  # First 3 octets
        \d{1,3}       # Last octet
    )
    \s+               # Whitespace
    (?P<port>\d+)     # Port number
    $                 # End of line
''', re.VERBOSE)

3. Online testers:
  • regex101.com (explains matches step-by-step)
  • regexr.com (visual interface)
  • debuggex.com (railroad diagrams)

4. Test edge cases:
Code:
# Test with:
- Empty strings
- Maximum length inputs
- Special characters
- Unicode characters
- Whitespace variations
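
A couple of these edge cases surprise people in Python specifically, so here is a short sketch. The pattern names are mine:

```python
import re

DIGITS = re.compile(r'^\d+$')
DIGITS_ASCII = re.compile(r'^\d+$', re.ASCII)

arabic_digits = "\u0661\u0662\u0663"   # Arabic-Indic digits one, two, three

# $ also matches just before a trailing newline, so match() accepts "123\n"
newline_accepted = DIGITS.match("123\n") is not None
# fullmatch() insists that the entire string is consumed
newline_rejected = DIGITS.fullmatch("123\n") is None
# \d matches Unicode digits unless re.ASCII is set
unicode_accepted = DIGITS.match(arabic_digits) is not None
unicode_rejected = DIGITS_ASCII.match(arabic_digits) is None
```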

---

Real-World System Administration Scripts

1. Find large files modified recently:
Code:
find /var/log -type f -mtime -7 -size +100M | grep -E '\.(log|txt)$'

2. Monitor failed logins:
Code:
#!/bin/bash
tail -f /var/log/auth.log | grep --line-buffered -E 'Failed password' | \
while IFS= read -r line; do
    ip=$(echo "$line" | grep -oP 'from \K[0-9.]+')
    echo "Failed login from: $ip"
done
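
If you would rather do the extraction in Python, a sketch along these lines pulls out both the user name and the IP. It assumes the usual sshd wording ("Failed password for [invalid user] NAME from IP"), and the name `parse_failed_login` is hypothetical.

```python
import re

FAILED_LOGIN = re.compile(
    r'Failed password for (?:invalid user )?(?P<user>\S+) '
    r'from (?P<ip>[0-9.]+)'
)

def parse_failed_login(line):
    """Return (user, ip) for a 'Failed password' line, else None."""
    m = FAILED_LOGIN.search(line)
    if m is None:
        return None
    return m.group('user'), m.group('ip')
```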

3. Extract errors from multiple logs:
Code:
find /var/log -name "*.log" -exec grep -H -E 'ERROR|CRITICAL' {} \; | \
grep -oP '^[^:]+:\K.*'

4. Validate config files:
Code:
#!/bin/bash
for file in /etc/*.conf; do
    if grep -qP '[^[:print:][:space:]]' "$file"; then
        echo "Non-printable characters in $file"
    fi
done

5. Parse and summarize access logs:
Code:
#!/bin/bash
echo "Top 10 IP addresses:"
grep -oP '^\d+\.\d+\.\d+\.\d+' /var/log/apache2/access.log | \
    sort | uniq -c | sort -rn | head -10

echo -e "\nTop 10 requested URLs:"
grep -oP 'GET \K[^ ]+' /var/log/apache2/access.log | \
    sort | uniq -c | sort -rn | head -10
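
The `sort | uniq -c | sort -rn | head` pipeline maps neatly onto `collections.Counter` in Python. A minimal sketch, assuming each log line starts with the client IP as in the grep example (`top_ips` is a made-up name):

```python
import re
from collections import Counter

LEADING_IP = re.compile(r'^\d+\.\d+\.\d+\.\d+')

def top_ips(lines, n=10):
    """Return the n most frequent leading IPs as (ip, count) pairs."""
    counts = Counter()
    for line in lines:
        m = LEADING_IP.match(line)
        if m:
            counts[m.group()] += 1
    return counts.most_common(n)
```

Pass it an open file object or any iterable of lines.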

---

Quick Reference Card

Anchors:
Code:
^       Start of line
$       End of line
\b      Word boundary
\<      Start of word
\>      End of word

Character Classes:
Code:
.       Any single character (except newline)
\d      Digit
\w      Word character
\s      Whitespace
[abc]   Any of a, b, c
[^abc]  Not a, b, c
[a-z]   Range a to z

Quantifiers:
Code:
*       Zero or more
+       One or more
?       Zero or one
{n}     Exactly n
{n,}    n or more
{n,m}   Between n and m
*?      Lazy zero or more
+?      Lazy one or more

Groups:
Code:
(...)       Capture group
(?:...)     Non-capturing group
(?<name>...)  Named group ((?P<name>...) in Python)
\1          Backreference
(?=...)     Positive lookahead
(?!...)     Negative lookahead
(?<=...)    Positive lookbehind
(?<!...)    Negative lookbehind
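
A few of these group constructs in action, as a Python sketch with made-up sample strings. Note that Python's `re` module writes named groups as `(?P<name>...)`.

```python
import re

# Named groups: read them back by name
pair = re.search(r'(?P<key>\w+)=(?P<value>\w+)', 'port=8080')
named = pair.groupdict()

# Backreference: \1 must repeat whatever group 1 captured
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test').group(1)

# Positive lookahead: digits only when followed by " USD"
usd_amounts = re.findall(r'\d+(?= USD)', 'cost 100 USD or 90 EUR')

# Positive lookbehind: digits only when preceded by a dollar sign
dollar_amounts = re.findall(r'(?<=\$)\d+', 'price $42, qty 7')
```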

Flags:
Code:
grep -i     Case-insensitive
grep -E     Extended regex
grep -P     Perl regex
grep -o     Only matching
sed -E      Extended regex
sed -i      In-place edit

---

Final Recommendations

1. Start simple, add complexity gradually

2. Always test patterns on sample data first

3. Document complex regex (use comments)

4. Consider readability over brevity

5. Use the right tool:
  • Simple patterns: BRE (grep/sed)
  • Medium complexity: ERE (grep -E)
  • Complex patterns: PCRE (grep -P) or Python

6. Keep a personal regex library for common patterns

7. When in doubt, search existing solutions (but understand them!)

---

Conclusion

Regular expressions are powerful but can be complex. Master the basics first, then gradually add advanced features as needed. The key is practice - the more you use regex, the more intuitive it becomes.

Remember: A simple, readable regex that works is better than a clever, compact one that nobody can maintain.

---

Series Complete!

You now have a comprehensive guide to regular expressions across multiple flavors. Refer back to these parts as needed for syntax reference and practical examples.

---

There... whew! How is that for the longest, most difficult to read article you ever read? (Over 3 weeks in the making.)
 

