regex

dos2unix · Jan 3, 2026

I apologize for doing too many articles in a short amount of time. Christmas vacation is over and I have to go back to work Monday.

Regular Expressions Part 1: Introduction & Basics

Regular expressions (regex) are powerful pattern-matching tools used throughout Linux and programming. They allow you to search, match, and manipulate text based on patterns rather than exact strings.

What is a Regular Expression?

A regular expression is a sequence of characters that defines a search pattern. Instead of searching for the exact text "error", you could search for any line containing "error" or "Error" or "ERROR", or even more complex patterns like "any word starting with 'err'".

Where Regex is Used

Regular expressions appear in many Linux tools and programming languages:

Command-line tools:

Code:

grep
sed
awk
less
vim

Programming languages:

Code:

Perl
Python
JavaScript
PHP
Ruby
Java

Understanding POSIX Flavors

POSIX (Portable Operating System Interface) defines two main regex flavors for Unix/Linux systems:

Basic Regular Expressions (BRE):

Default in grep and sed
Requires backslashes for special characters
More verbose syntax
Limited feature set

Extended Regular Expressions (ERE):

Used with grep -E (or egrep) and awk
Cleaner syntax (fewer backslashes)
More intuitive
Additional operators

We'll cover both flavors in detail in upcoming posts. For now, understand that the same pattern might require different syntax depending on which tool you're using.

Basic Pattern Elements

Literal characters match themselves:

Code:

error

Matches the exact text "error"

The dot (.) matches any single character:

Code:

e.ror

Matches "error", "e5ror", "e ror", etc.

Character classes [ ] match any one character inside:

Code:

[Ee]rror

Matches "error" or "Error"

Code:

[0-9]

Matches any single digit

Code:

[a-z]

Matches any lowercase letter

Code:

[A-Za-z0-9]

Matches any letter or digit

Negated character classes [^ ] match anything NOT listed:

Code:

[^0-9]

Matches any character that's not a digit
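If you would rather try these elements outside of grep, here is a small sketch using Python's re module. The sample words and the helper name matching are made up for illustration; re.search behaves like grep here in that the pattern may match anywhere in the string.

```python
import re

# Hypothetical sample words, not taken from any real log.
samples = ["error", "Error", "e5ror", "terror", "eror"]

def matching(pattern, words):
    """Return the words that contain a match, the way grep keeps matching lines."""
    return [w for w in words if re.search(pattern, w)]

literal_hits = matching(r"error", samples)            # literal characters
dot_hits = matching(r"e.ror", samples)                # . stands for any one character
class_hits = matching(r"[Ee]rror", samples)           # E or e, then "rror"
non_digit_hits = matching(r"[^0-9]", ["123", "12a"])  # anything that is not a digit

print(literal_hits)    # ['error', 'terror']
print(dot_hits)        # ['error', 'e5ror', 'terror']
print(class_hits)      # ['error', 'Error', 'terror']
print(non_digit_hits)  # ['12a']
```

Note that "terror" shows up because nothing anchors the pattern to the start of the word.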

Anchors

Anchors don't match characters—they match positions:

^ matches start of line:

Code:

^Error

Matches "Error" only at the beginning of a line

$ matches end of line:

Code:

Error$

Matches "Error" only at the end of a line

Combined:

Code:

^Error$

Matches lines containing only "Error" (nothing before or after)
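The three anchor patterns can be compared side by side with a short Python sketch. The log lines are invented for the example, and each string is treated as one line, which is how grep would see it.

```python
import re

# Hypothetical log lines for illustration.
lines = ["Error", "Fatal Error", "Error: disk full", "no error here"]

starts_with = [l for l in lines if re.search(r"^Error", l)]   # ^ = start of line
ends_with = [l for l in lines if re.search(r"Error$", l)]     # $ = end of line
only_error = [l for l in lines if re.search(r"^Error$", l)]   # nothing else on the line

print(starts_with)  # ['Error', 'Error: disk full']
print(ends_with)    # ['Error', 'Fatal Error']
print(only_error)   # ['Error']
```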

Quantifiers (How Many Times)

* matches zero or more of the preceding character:

Code:

erro*r

Matches "errr", "error", "erroor", "errooor", etc.

Note: In BRE (basic grep/sed), you use * directly. Other quantifiers require special handling, which we'll cover in later parts.

Escape Character

The backslash \ makes special characters literal:

Code:

\.

Matches an actual period (not "any character")

Code:

\$

Matches a dollar sign (not "end of line")
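To see the difference escaping makes, here is a minimal Python sketch. The sample text is made up; re.escape is a standard-library helper that adds the backslashes for you, which is handy when the literal text comes from a variable.

```python
import re

text = "Total: $5.00 (was 5x00)"  # made-up sample text

unescaped = re.findall(r"5.00", text)   # dot matches any character
escaped = re.findall(r"5\.00", text)    # \. matches only a real period
dollar = re.findall(r"\$5", text)       # \$ matches a literal dollar sign

# re.escape() backslash-escapes every special character in its argument.
auto = re.findall(re.escape("$5.00"), text)

print(unescaped)  # ['5.00', '5x00']
print(escaped)    # ['5.00']
print(dollar)     # ['$5']
print(auto)       # ['$5.00']
```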

Simple Examples

Find lines containing "error" (case-insensitive):

Code:

grep -i error logfile.txt

Find lines starting with a number:

Code:

grep '^[0-9]' file.txt

Find email-like patterns:

Code:

grep '[a-zA-Z0-9]@[a-zA-Z0-9.-]*\.[a-zA-Z]' file.txt

Find empty lines:

Code:

grep '^$' file.txt

Find lines containing "error" or "warning":

Code:

grep -E 'error|warning' file.txt

Common Predefined Character Classes

Many regex flavors support shorthand for common patterns:

Code:

[:alnum:]    Alphanumeric characters
[:alpha:]    Alphabetic characters
[:digit:]    Digits 0-9
[:lower:]    Lowercase letters
[:upper:]    Uppercase letters
[:space:]    Whitespace (space, tab, newline)
[:punct:]    Punctuation characters

Used inside character classes:

Code:

grep '[[:digit:]]' file.txt

Matches any line containing a digit

Testing Your Regex

Before using regex in scripts, test it:

Interactive testing with grep:

Code:

echo "test string" | grep 'pattern'

Show matching part with color:

Code:

grep --color 'pattern' file.txt

Count matches:

Code:

grep -c 'pattern' file.txt

Next in This Series

In Part 2, we'll dive deeper into POSIX Basic Regular Expressions (BRE), including:

Why certain characters need escaping
Grouping and backreferences
Repetition patterns
Real-world examples with grep and sed

dos2unix · Jan 3, 2026

Regular Expressions Part 2: POSIX Basic (BRE)
POSIX Basic Regular Expressions (BRE) are the default regex flavor used by grep and sed. While they're more verbose than other flavors, understanding BRE is essential because these tools are everywhere in Linux.

Where BRE is Used
Default in these commands:

Code:

grep
sed

BRE is what you get when no flavor flag is given (GNU grep also accepts -G to select it explicitly):

Code:

grep pattern file.txt
sed 's/pattern/replacement/' file.txt

The Backslash Problem
BRE's most confusing aspect: many special characters require backslashes, which seems backwards.
In BRE, these are literal unless escaped:

Parentheses ( )
Curly braces { }
Plus sign +
Question mark ?
Pipe |

To make them special, you MUST escape them:

Code:

\(    Start a group
\)    End a group
\{    Start repetition count
\}    End repetition count
\+    One or more (GNU extension)
\?    Zero or one (GNU extension)
\|    Alternation (GNU extension)

This is opposite of most programming languages!

Grouping and Backreferences
Groups capture matched text for reuse:
Create a group:

Code:

\(pattern\)

Reference captured groups:

Code:

\1    First group
\2    Second group
\3    Third group

Example - Find repeated words:

Code:

grep '\([a-z][a-z]*\) \1' file.txt

Matches "the the" or "is is"
Example - Swap two words with sed:

Code:

echo "John Doe" | sed 's/\(.*\) \(.*\)/\2 \1/'

Output: "Doe John"
Example - Match HTML tags:

Code:

grep '<\([a-z]*\)>.*</\1>' file.html

Matches <div>content</div> but not <div>content</span>
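For comparison, here is a sketch of the same backreference idea in Python's re module, where the parentheses are written without backslashes. The sample strings are invented, and the patterns are slightly tightened with \b and + so the demo output is predictable.

```python
import re

# \1 refers back to whatever the first group captured.
repeated_word = re.compile(r"\b([a-z]+) \1\b")
paired_tag = re.compile(r"<([a-z]+)>.*</\1>")

doubled = repeated_word.search("this is is a test")
good_tag = paired_tag.search("<div>content</div>")
bad_tag = paired_tag.search("<div>content</span>")

print(doubled.group())    # is is
print(good_tag.group(1))  # div
print(bad_tag)            # None
```

The closing tag only matches when it repeats the exact text the group captured in the opening tag.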

Repetition Quantifiers
The asterisk * (zero or more):

Code:

a*

Matches "", "a", "aa", "aaa", etc.
Escaped braces \{n,m\} (specific counts):
Exactly n times:

Code:

a\{3\}

Matches exactly "aaa"
At least n times:

Code:

a\{3,\}

Matches "aaa", "aaaa", "aaaaa", etc.
Between n and m times:

Code:

a\{2,4\}

Matches "aa", "aaa", or "aaaa"
Example - Find phone numbers (###-####):

Code:

grep '[0-9]\{3\}-[0-9]\{4\}' file.txt

Example - Find 3-5 letter words:

Code:

grep '\b[a-z]\{3,5\}\b' file.txt
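The counting forms are easy to check against a list of candidates. This is a rough Python sketch; note that Python writes the braces without backslashes, and re.fullmatch is used so that each candidate has to match from start to end. The helper name whole_matches and the sample data are assumptions for the demo.

```python
import re

candidates = ["a", "aa", "aaa", "aaaa", "aaaaa"]

def whole_matches(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

exactly_three = whole_matches(r"a{3}")     # exactly n
at_least_three = whole_matches(r"a{3,}")   # n or more
two_to_four = whole_matches(r"a{2,4}")     # between n and m

# Same idea as the phone number example: three digits, a dash, four digits.
phone = re.search(r"[0-9]{3}-[0-9]{4}", "call 555-1234 today")

print(exactly_three)   # ['aaa']
print(at_least_three)  # ['aaa', 'aaaa', 'aaaaa']
print(two_to_four)     # ['aa', 'aaa', 'aaaa']
print(phone.group())   # 555-1234
```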

GNU Extensions to BRE
GNU grep and sed add these (not in strict POSIX BRE):
\+ (one or more):

Code:

a\+

Matches "a", "aa", "aaa", etc. (but not empty string)
\? (zero or one):

Code:

colou\?r

Matches "color" or "colour"
\| (alternation - OR):

Code:

cat\|dog

Matches "cat" or "dog"
Word boundaries:

Code:

\<     Start of word
\>     End of word
\b     Word boundary (GNU)

Example - Find whole word "test":

Code:

grep '\<test\>' file.txt

Matches "test" but not "testing" or "contest"
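Python's re module has no separate start-of-word and end-of-word markers, so \b does both jobs. A minimal sketch of the whole-word example, with made-up sample strings:

```python
import re

samples = ["test", "testing", "contest", "a test case"]

# \b on both sides: "test" must stand alone as a word.
whole_word = [s for s in samples if re.search(r"\btest\b", s)]

# \b only in front: the word must start with "test".
word_start = [s for s in samples if re.search(r"\btest", s)]

print(whole_word)  # ['test', 'a test case']
print(word_start)  # ['test', 'testing', 'a test case']
```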

Practical BRE Examples
1. Find lines with IP addresses:

Code:

grep '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' /var/log/auth.log

2. Remove blank lines:

Code:

sed '/^$/d' file.txt

3. Find lines starting with # (comments):

Code:

grep '^#' config.conf

4. Replace multiple spaces with single space:

Code:

sed 's/  */ /g' file.txt

5. Extract lines between two patterns:

Code:

sed -n '/START/,/END/p' file.txt

6. Find duplicate consecutive lines:

Code:

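sed '$!N; s/^\(.*\)\n\1$/\1/; t; D' file.txt

grep looks at one line at a time, so it cannot compare a line with its neighbor. sed's N command joins two lines first, and the backreference then checks whether they are identical. For everyday use, uniq -d is the simpler tool.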

7. Match email addresses:

Code:

grep '[a-zA-Z0-9._%+-]\+@[a-zA-Z0-9.-]\+\.[a-zA-Z]\{2,\}' file.txt

8. Find C-style comments:

Code:

grep '/\*.*\*/' source.c

9. Find lines with exactly 5 characters:

Code:

grep '^.\{5\}$' file.txt

10. Replace date format (MM/DD/YYYY to YYYY-MM-DD):

Code:

sed 's/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)/\3-\1-\2/' dates.txt

Common BRE Pitfalls
Forgetting to escape special characters:

Code:

grep '(test)'     # Wrong - matches literal "(test)"
grep '\(test\)'   # Right - creates a group

Using + without escaping:

Code:

grep 'a+'         # Wrong in strict BRE - matches literal "a+"
grep 'a\+'        # Right (GNU extension)
grep 'aa*'        # Portable alternative (one 'a' plus zero or more)

Forgetting anchors:

Code:

grep '192\.168'    # Matches anywhere in line
grep '^192\.168'   # Only at start of line

BRE vs ERE Quick Comparison
Same pattern in both flavors:
BRE (grep/sed default):

Code:

grep '\(abc\)\{2,4\}\|xyz' file.txt

ERE (grep -E):

Code:

grep -E '(abc){2,4}|xyz' file.txt

ERE is cleaner! We'll cover it in Part 3.

Testing BRE Patterns
Echo and pipe to grep:

Code:

echo "test123" | grep '[0-9]\{3\}'

Use grep with color to see what matches:

Code:

grep --color '\<[a-z]\{5,\}\>' file.txt

Test sed substitution without changing file:

Code:

sed -n 's/\(pattern\)/MATCHED: \1/p' file.txt

Next in This Series
Part 3 will cover POSIX Extended Regular Expressions (ERE), which simplify much of BRE's awkward syntax and are used with grep -E and awk.

dos2unix · Jan 3, 2026

Regular Expressions Part 4: Perl-Compatible (PCRE)

Perl-Compatible Regular Expressions (PCRE) are the most powerful and feature-rich regex flavor. PCRE extends POSIX ERE with advanced features that make complex pattern matching much easier.

---

Where PCRE is Used

Command-line:

Code:

grep -P
pcregrep

Programming languages (PCRE itself or a closely related Perl-style flavor):

Code:

Perl
PHP
Python (mostly compatible)
JavaScript (similar)
Ruby
Java
C# / .NET

Note: Not all Linux systems have grep -P enabled (depends on how grep was compiled).

---

Basic PCRE (Same as ERE)

All ERE features work in PCRE:

Code:

(group)         Groups
+               One or more
?               Zero or one
{n,m}           Repetition counts
|               Alternation
^               Start of line
$               End of line

PCRE builds on these with powerful additions.

---

Non-Capturing Groups

Sometimes you need grouping for precedence but don't want to capture the match:

Syntax:

Code:

(?:pattern)

Example - Group without capturing:

Code:

grep -P '(?:http|https)://example\.com' urls.txt

Groups the protocol but doesn't create a backreference.

Why use it:

Faster (no memory allocation for capture)
Cleaner when you have many groups
Doesn't affect backreference numbering

Example - Multiple non-capturing groups:

Code:

grep -P '(?:Mr|Mrs|Ms|Dr)\.? (?:[A-Z][a-z]+) [A-Z][a-z]+' names.txt

---

Lookahead Assertions

Lookahead checks if a pattern follows without including it in the match.

Positive lookahead (?=...):
"Match if followed by..."

Code:

foo(?=bar)

Matches "foo" only if followed by "bar", but doesn't include "bar" in match.

Example - Find numbers followed by "px":

Code:

grep -oP '\d+(?=px)' styles.css

Matches "16" in "16px" (not "16px")

Example - Password must contain digit:

Code:

grep -P '^(?=.*\d).{8,}$' passwords.txt

Matches passwords 8+ chars that contain at least one digit.

Negative lookahead (?!...):
"Match if NOT followed by..."

Code:

foo(?!bar)

Matches "foo" only if NOT followed by "bar"

Example - Find "test" not followed by "ing":

Code:

grep -P 'test(?!ing)' file.txt

Matches "test" and "tester" but not "testing"

Example - Find files without .txt extension:

Code:

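grep -P '^(?!.*\.txt$).+' filelist.txt

The lookahead goes at the start of the line. Written as (?!\.txt)$ it would always succeed, because nothing can follow the end of the line.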
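Python's re module supports the same lookahead syntax, so the ideas in this section can be tried without grep -P. A small sketch with invented sample data; the function name has_digit_and_length is just for the demo, and the last list filter puts the negative lookahead at the start of the string so it can inspect the whole file name.

```python
import re

css = "width: 16px; margin: 4em; height: 20px"
pixel_values = re.findall(r"\d+(?=px)", css)          # digits followed by "px"

words = re.findall(r"\btest(?!ing)\w*", "test tester testing")  # skip "testing"

names = ["notes.txt", "image.png", "README"]
not_txt = [n for n in names if re.search(r"^(?!.*\.txt$).+", n)]

def has_digit_and_length(password):
    """True if the password has 8+ characters and at least one digit."""
    return re.search(r"^(?=.*\d).{8,}$", password) is not None

print(pixel_values)  # ['16', '20']
print(words)         # ['test', 'tester']
print(not_txt)       # ['image.png', 'README']
```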

---

Lookbehind Assertions

Lookbehind checks if a pattern precedes without including it in the match.

Positive lookbehind (?<=...):
"Match if preceded by..."

Code:

(?<=foo)bar

Matches "bar" only if preceded by "foo"

Example - Extract price numbers:

Code:

grep -oP '(?<=\$)\d+\.\d{2}' prices.txt

Matches "19.99" in "$19.99" (not the $)

Example - Find words after "Error:":

Code:

grep -oP '(?<=Error: )\w+' log.txt

Negative lookbehind (?<!...):
"Match if NOT preceded by..."

Code:

(?<!foo)bar

Matches "bar" only if NOT preceded by "foo"

Example - Find standalone "test" (not "retest"):

Code:

grep -P '(?<!re)test' file.txt
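The lookbehind examples translate directly to Python, with one caveat: Python's re only accepts fixed-width lookbehind, which all of these are. The sample strings are made up for the sketch.

```python
import re

prices = re.findall(r"(?<=\$)\d+\.\d{2}", "Items: $19.99 and $5.00, code 12.34")
after_error = re.findall(r"(?<=Error: )\w+", "Error: timeout while Error: refused")
not_re = re.findall(r"(?<!re)test", "test retest contest")

print(prices)       # ['19.99', '5.00']
print(after_error)  # ['timeout', 'refused']
print(not_re)       # ['test', 'test']
```

The last result has two hits: the standalone "test" and the one inside "contest". Only the "test" that directly follows "re" is rejected.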

---

Named Capture Groups

Instead of numbered backreferences (\1, \2), use names:

Syntax:

Code:

(?<name>pattern)

Example - Extract date components:

Code:

grep -P '(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})' dates.txt

Use in replacement (with Perl):

Code:

perl -pe 's/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/$+{month}\/$+{day}\/$+{year}/' dates.txt

---

Shorthand Character Classes

PCRE provides convenient shortcuts:

Code:

\d      Digit [0-9]
\D      Non-digit [^0-9]
\w      Word character [a-zA-Z0-9_]
\W      Non-word character [^a-zA-Z0-9_]
\s      Whitespace [ \t\n\r\f]
\S      Non-whitespace [^ \t\n\r\f]

Example - Find words followed by numbers:

Code:

grep -P '\w+\s+\d+' file.txt

Example - Remove all whitespace:

Code:

perl -pe 's/\s+//g' file.txt

Example - Find non-alphanumeric characters:

Code:

grep -oP '\W' file.txt

---

Advanced Quantifiers

Lazy (non-greedy) quantifiers:

Add ? after quantifier to match as little as possible:

Code:

*?      Zero or more (lazy)
+?      One or more (lazy)
??      Zero or one (lazy)
{n,m}?  Range (lazy)

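Example - Extract HTML tags one at a time:

Code:

echo "<div>text</div><span>more</span>" | grep -oP '<.*?>'

Prints <div>, </div>, <span>, and </span> on separate lines (a greedy <.*> would match the entire string). Append | head -n 1 to keep only the first tag.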

Example - Extract content between quotes:

Code:

grep -oP '".*?"' file.txt
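Greedy and lazy matching are easiest to compare on the same input. A minimal Python sketch, reusing the HTML string from the example above; the quoted sentence is made up.

```python
import re

html = "<div>text</div><span>more</span>"

greedy = re.findall(r"<.*>", html)        # as much as possible
lazy = re.findall(r"<.*?>", html)         # as little as possible
first_tag = re.search(r"<.*?>", html).group()

quoted = re.findall(r'".*?"', 'say "hi" and "bye"')

print(greedy)     # ['<div>text</div><span>more</span>']
print(lazy)       # ['<div>', '</div>', '<span>', '</span>']
print(first_tag)  # <div>
print(quoted)     # ['"hi"', '"bye"']
```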

Possessive quantifiers:

Add + after quantifier (prevents backtracking):

Code:

*+      Zero or more (possessive)
++      One or more (possessive)

Used for performance optimization in complex patterns.

---

Unicode Support

PCRE understands Unicode properties (grep -P needs a UTF-8 locale for these to work):

Code:

\p{L}   Any Unicode letter
\p{N}   Any Unicode number
\p{P}   Any Unicode punctuation
\p{Sc}  Currency symbols

Example - Match symbols, including many emoji (\p{So} means "Symbol, other", so not every emoji is covered):

Code:

grep -P '\p{So}' file.txt

Example - Match any letter (any language):

Code:

grep -P '\p{L}+' multilingual.txt

---

Practical PCRE Examples

1. Validate strong password:

Code:

grep -P '^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$' passwords.txt

Requires: lowercase, uppercase, digit, special char, 8+ length

2. Extract IPv4 addresses (strict):

Code:

grep -oP '(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)' file.txt

3. Match balanced parentheses (recursive):

Code:

grep -P '\((?:[^()]++|(?R))*+\)' code.txt

4. Extract email addresses (strict):

Code:

grep -oP '(?i)\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b' file.txt

5. Find words not touching a quote character:

Code:

grep -oP '(?<!["\w])\w+(?![\w"])' file.txt

6. Extract hashtags:

Code:

grep -oP '(?<!\w)#\w+' social.txt

7. Match credit card numbers (with optional spaces):

Code:

grep -P '\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}' cards.txt

8. Find duplicate words (case-insensitive):

Code:

grep -P '(?i)\b(\w+)\s+\1\b' file.txt

9. Extract content between XML/HTML tags:

Code:

grep -oP '(?<=<title>).*?(?=</title>)' page.html

10. Match file paths (Unix/Windows):

Code:

grep -P '(?:[a-zA-Z]:|/)(?:[\w.-]+[/\\])*[\w.-]+' paths.txt

---

Comparing Flavors: Same Pattern

Match email addresses:

BRE (grep):

Code:

grep '[a-zA-Z0-9._%+-]\+@[a-zA-Z0-9.-]\+\.[a-zA-Z]\{2,\}' file.txt

ERE (grep -E):

Code:

grep -E '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' file.txt

PCRE (grep -P):

Code:

grep -P '\b[\w._%+-]+@[\w.-]+\.\w{2,}\b' file.txt

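PCRE is the shortest and most readable! Keep in mind that \w also accepts digits and underscores, so this version is slightly looser than the other two.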

---

Testing PCRE Patterns

Test with echo:

Code:

echo "user@example.com" | grep -P '\w+@\w+\.\w+'

Extract only matches:

Code:

grep -oP 'pattern' file.txt

Use Perl for complex testing:

Code:

echo "test string" | perl -ne 'print if /pattern/'

Online testers:

regex101.com (supports PCRE)
regexr.com

---

PCRE Limitations

Not available everywhere:

Some Linux distros compile grep without PCRE support
Check with echo test | grep -P 'test' (an error message means -P support was not compiled in)

Performance:

More complex = slower
Lookahead/lookbehind can be expensive
Test on large files before production use

Portability:

PCRE patterns may not work with basic grep
Document which flavor you're using
Consider fallback to ERE for scripts

---

When to Use PCRE

Use PCRE when you need:

Lookahead or lookbehind
Named capture groups
Unicode support
Lazy quantifiers
Complex validation (passwords, emails, etc.)

Stick with ERE when:

Pattern is simple
Portability matters
grep -P not available
Performance critical

---

Next in This Series

Part 5 will cover Python's re module, which is similar to PCRE but has its own quirks and features. We'll explore raw strings, compiled patterns, and Python-specific methods.

---

dos2unix · Jan 3, 2026

Regular Expressions Part 5: Python Regex

Python's re module provides powerful regex capabilities similar to PCRE, with some Python-specific features and syntax. Understanding Python regex is essential for scripting and automation.

---

Importing the Module

Python regex requires importing the re module:

Code:

import re

---

Raw Strings - Critical for Regex

Python uses backslashes for escape sequences (\n, \t, etc.), which conflicts with regex backslashes.

Wrong way (double escaping required):

Code:

pattern = "\\d+\\s+\\w+"

Right way (raw strings):

Code:

pattern = r"\d+\s+\w+"

Always use raw strings (r"...") for regex patterns in Python!
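A quick way to convince yourself why raw strings matter is to look at \b, which Python reads as a backspace character in a normal string. A minimal sketch, with a made-up test sentence:

```python
import re

# Doubling the backslashes and using a raw string give the same pattern text.
same = "\\d+" == r"\d+"

# "\b" in a normal string is ONE character (backspace); in a raw string it is
# two characters, backslash and b, which the regex engine reads as a word boundary.
plain_len = len("\b")
raw_len = len(r"\b")

without_raw = re.findall("\bcat\b", "a cat sat")   # searches for backspace characters
with_raw = re.findall(r"\bcat\b", "a cat sat")     # searches for the word "cat"

print(same)                # True
print(plain_len, raw_len)  # 1 2
print(without_raw)         # []
print(with_raw)            # ['cat']
```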

---

Basic Matching Functions

re.search() - Find first match anywhere:

Code:

import re

text = "The price is $19.99"
match = re.search(r'\$\d+\.\d{2}', text)

if match:
    print(match.group())

Output: $19.99

re.match() - Match at start of string only:

Code:

text = "Error: file not found"
match = re.match(r'Error:', text)

if match:
    print("Found error at start")

re.findall() - Find all matches:

Code:

text = "Prices: $10.99, $25.50, $5.00"
prices = re.findall(r'\$\d+\.\d{2}', text)
print(prices)

Output: ['$10.99', '$25.50', '$5.00']

re.finditer() - Find all matches as iterator:

Code:

text = "Error at line 10, Error at line 25"
for match in re.finditer(r'line (\d+)', text):
    print(f"Line number: {match.group(1)}")

---

Substitution

re.sub() - Replace matches:

Code:

text = "Hello World"
result = re.sub(r'World', 'Python', text)
print(result)

Output: Hello Python

With backreferences:

Code:

text = "John Doe"
result = re.sub(r'(\w+) (\w+)', r'\2, \1', text)
print(result)

Output: Doe, John

Using function for replacement:

Code:

def double_number(match):
    num = int(match.group(1))
    return str(num * 2)

text = "I have 5 apples and 10 oranges"
result = re.sub(r'(\d+)', double_number, text)
print(result)

Output: I have 10 apples and 20 oranges

re.subn() - Replace and count:

Code:

text = "cat dog cat bird cat"
result, count = re.subn(r'cat', 'tiger', text)
print(f"Result: {result}, Replacements: {count}")

Output: Result: tiger dog tiger bird tiger, Replacements: 3

---

Splitting Strings

re.split() - Split by pattern:

Code:

text = "one,two;three:four"
parts = re.split(r'[,;:]', text)
print(parts)

Output: ['one', 'two', 'three', 'four']

With capture groups:

Code:

text = "one,two;three"
parts = re.split(r'([,;])', text)
print(parts)

Output: ['one', ',', 'two', ';', 'three']

---

Compiled Patterns

For patterns used multiple times, compile once for better performance:

Code:

import re

pattern = re.compile(r'\b\w+@\w+\.\w+\b')

text1 = "Contact: admin@example.com"    # placeholder address
text2 = "Email: support@example.org"    # placeholder address

print(pattern.search(text1).group())
print(pattern.search(text2).group())

With flags:

Code:

pattern = re.compile(r'error', re.IGNORECASE)

---

Match Objects

Match objects contain information about the match:

Code:

text = "Error at line 42"
match = re.search(r'line (\d+)', text)

if match:
    print(match.group())      # Entire match: 'line 42'
    print(match.group(1))     # First group: '42'
    print(match.start())      # Start position: 9
    print(match.end())        # End position: 16
    print(match.span())       # Tuple: (9, 16)

Multiple groups:

Code:

text = "2024-01-03"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)

if match:
    year, month, day = match.groups()
    print(f"Year: {year}, Month: {month}, Day: {day}")

---

Named Groups

Python supports named capture groups:

Code:

text = "John Doe, age 30"
pattern = r'(?P<first>\w+) (?P<last>\w+), age (?P<age>\d+)'
match = re.search(pattern, text)

if match:
    print(match.group('first'))   # John
    print(match.group('last'))    # Doe
    print(match.group('age'))     # 30
    print(match.groupdict())      # {'first': 'John', 'last': 'Doe', 'age': '30'}

Backreference to named group:

Code:

text = "the the quick brown fox"
duplicates = re.findall(r'\b(?P<word>\w+)\s+(?P=word)\b', text)
print(duplicates)

Output: ['the']

---

Regex Flags

Common flags:

Code:

re.IGNORECASE  or  re.I      Case-insensitive
re.MULTILINE   or  re.M      ^ and $ match line boundaries
re.DOTALL      or  re.S      . matches newlines too
re.VERBOSE     or  re.X      Allow comments and whitespace
re.ASCII       or  re.A      ASCII-only matching

Using flags:

Code:

pattern = re.compile(r'error', re.IGNORECASE)

Multiple flags:

Code:

pattern = re.compile(r'pattern', re.I | re.M)

Inline flags:

Code:

pattern = r'(?i)error'  # Case-insensitive
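Here is a small sketch of what the most common flags change in practice. The three-line text is invented for the demo.

```python
import re

text = "error: first\nERROR: second\nwarning: third"

default = re.findall(r"^error", text)               # ^ = start of string only
both = re.findall(r"^error", text, re.I | re.M)     # ignore case, ^ per line
inline = re.findall(r"(?im)^error", text)           # same flags, written inline

no_dotall = re.search(r"first.*second", text)       # . stops at the newline
dotall = re.search(r"first.*second", text, re.S)    # . crosses the newline

print(default)         # ['error']
print(both)            # ['error', 'ERROR']
print(inline)          # ['error', 'ERROR']
print(no_dotall)       # None
print(dotall.group())  # prints "first", a newline, then "ERROR: second"
```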

---

Verbose Regex (re.VERBOSE)

Make complex patterns readable:

Code:

pattern = re.compile(r'''
    \b                  # Word boundary
    (?P<user>[\w.]+)    # Username
    @                   # At symbol
    (?P<domain>[\w.]+)  # Domain
    \.                  # Dot
    (?P<tld>\w{2,})     # TLD
    \b                  # Word boundary
''', re.VERBOSE)

text = "Contact: user@example.com"  # placeholder address
match = pattern.search(text)
if match:
    print(match.groupdict())

---

Practical Python Examples

1. Extract all email addresses:

Code:

import re

text = "Contact admin@example.com or support@example.org"  # placeholder addresses
emails = re.findall(r'\b[\w.%+-]+@[\w.-]+\.\w{2,}\b', text)
print(emails)

2. Validate phone numbers:

Code:

def validate_phone(number):
    pattern = r'^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$'
    match = re.match(pattern, number)
    return match is not None

print(validate_phone("(555) 123-4567"))  # True
print(validate_phone("555-1234"))        # False

3. Extract hashtags from social media:

Code:

text = "Check out #Python and #Regex tutorials! #coding"
hashtags = re.findall(r'#\w+', text)
print(hashtags)

Output: ['#Python', '#Regex', '#coding']

4. Parse log files:

Code:

log = "2024-01-03 14:30:25 ERROR Failed to connect"
pattern = r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.+)'
match = re.match(pattern, log)

if match:
    date, time, level, message = match.groups()
    print(f"Level: {level}, Message: {message}")

5. Clean HTML tags:

Code:

html = "<p>Hello <b>World</b></p>"
clean = re.sub(r'<[^>]+>', '', html)
print(clean)

Output: Hello World

6. Validate password strength:

Code:

def is_strong_password(password):
    # At least 8 chars, one uppercase, one lowercase, one digit
    pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$'
    return re.match(pattern, password) is not None

print(is_strong_password("Pass1234"))    # True
print(is_strong_password("weakpass"))    # False

7. Extract URLs:

Code:

text = "Visit https://example.com or http://test.org"
urls = re.findall(r'https?://[\w.-]+\.[\w]{2,}', text)
print(urls)

8. Replace multiple whitespace:

Code:

text = "Too    many      spaces"
clean = re.sub(r'\s+', ' ', text)
print(clean)

Output: Too many spaces

9. Extract numbers from text:

Code:

text = "I have 5 apples, 10 oranges, and 3.5 pounds of grapes"
integers = re.findall(r'(?<![\d.])\d+(?!\.?\d)', text)  # skips the 3 and 5 inside 3.5
floats = re.findall(r'\b\d+\.?\d*\b', text)
print(f"Integers: {integers}")
print(f"All numbers: {floats}")

10. Parse CSV with regex:

Code:

line = 'John,Doe,30,"New York, NY",Engineer'
fields = re.findall(r'(?:[^,"]+|"[^"]*")+', line)
print(fields)

---

Common Python Regex Patterns

Email validation:

Code:

r'\b[\w.%+-]+@[\w.-]+\.\w{2,}\b'

URL matching:

Code:

r'https?://(?:www\.)?[\w.-]+\.[\w]{2,}(?:/[\w./?=&-]*)?'

IP address:

Code:

r'\b(?:\d{1,3}\.){3}\d{1,3}\b'

Date (YYYY-MM-DD):

Code:

r'\d{4}-\d{2}-\d{2}'

Time (HH:MM:SS):

Code:

r'\d{2}:\d{2}:\d{2}'

Hexadecimal color:

Code:

r'#[0-9A-Fa-f]{6}'

---

Error Handling

Catch invalid regex:

Code:

import re

try:
    pattern = re.compile(r'[invalid(')
except re.error as e:
    print(f"Regex error: {e}")

Check if match exists:

Code:

match = re.search(r'pattern', text)
if match:
    print(match.group())
else:
    print("No match found")

---

Performance Tips

1. Compile patterns used multiple times:

Code:

pattern = re.compile(r'\d+')
for line in file:
    match = pattern.search(line)

2. Use raw strings:

Code:

pattern = r'\d+'  # Good
pattern = "\\d+"  # Bad (harder to read, easy to mess up)

3. Avoid catastrophic backtracking:

Code:

# Bad (can hang on certain inputs)
pattern = r'(a+)+'

# Better
pattern = r'a+'

4. Use non-capturing groups when possible:

Code:

pattern = r'(?:http|https)://\w+'  # Faster than capturing

---

Differences from PCRE

Python regex is mostly PCRE-compatible but:

No \K (keep) assertion
No possessive quantifiers or atomic groups before Python 3.11
Slightly different Unicode handling
Some advanced PCRE features missing (recursion, for example)

For closer PCRE compatibility, use the third-party regex module:

Code:

pip install regex    # run in the shell
import regex         # then, in Python

---

Next in This Series

Part 6 will cover practical real-world regex examples and best practices, including log file parsing, data validation, text extraction, and common pitfalls to avoid.

dos2unix · Jan 4, 2026

Regular Expressions Part 6: Practical Examples & Best Practices

This final part covers real-world regex applications, common patterns, performance considerations, and pitfalls to avoid. These are the patterns you'll actually use in daily system administration and scripting.

---

Log File Parsing

1. Apache/Nginx access logs:

Parse common log format:

Code:

grep -E '^([0-9.]+) - - \[([^]]+)\] "([A-Z]+) ([^ ]+) HTTP/[0-9.]+" ([0-9]+) ([0-9]+)' access.log

Extract just IP addresses:

Code:

grep -oP '^\d+\.\d+\.\d+\.\d+' access.log | sort | uniq -c | sort -rn

Find 404 errors:

Code:

grep -E '" 404 ' access.log

Extract URLs requested:

Code:

grep -oP 'GET \K[^ ]+' access.log
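When you need the individual fields rather than just the matching lines, the same common log format pattern can be given named groups in Python. This is a sketch under assumptions: the sample line is invented, the group names are my own choice, and real logs may have extra fields (referrer, user agent) after the size.

```python
import re

# Same structure as the grep -E pattern, with a name on each group.
# In Python the ] inside the character class is escaped as \].
LOG_LINE = re.compile(
    r'^(?P<ip>[0-9.]+) - - \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>[^ ]+) HTTP/[0-9.]+" '
    r'(?P<status>[0-9]+) (?P<size>[0-9]+)'
)

# Hypothetical access log line for illustration.
sample = '203.0.113.7 - - [03/Jan/2026:14:30:25 +0000] "GET /index.html HTTP/1.1" 404 512'

match = LOG_LINE.match(sample)
fields = match.groupdict() if match else {}

print(fields["ip"])      # 203.0.113.7
print(fields["path"])    # /index.html
print(fields["status"])  # 404
```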

2. System log patterns:

Find failed SSH attempts:

Code:

grep -E 'Failed password for .* from [0-9.]+' /var/log/auth.log

Extract failed login IPs:

Code:

grep -oP 'Failed password for .* from \K[0-9.]+' /var/log/auth.log | sort | uniq -c | sort -rn

Find kernel errors:

Code:

grep -E 'kernel:.*error' /var/log/syslog

Parse systemd journal:

Code:

journalctl | grep -E '^[A-Z][a-z]{2} [0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}'
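The "extract failed login IPs" pipeline (grep, sort, uniq -c, sort -rn) can be sketched in Python with a capture group and collections.Counter. The log lines below are made up to resemble sshd output; a real script would read them from the log file instead.

```python
import re
from collections import Counter

FAILED = re.compile(r"Failed password for .* from ([0-9.]+)")

# Hypothetical auth.log lines for illustration.
sample_lines = [
    "Jan  3 10:00:01 host sshd[101]: Failed password for root from 198.51.100.4 port 2222 ssh2",
    "Jan  3 10:00:05 host sshd[102]: Failed password for invalid user bob from 203.0.113.9 port 4040 ssh2",
    "Jan  3 10:00:09 host sshd[103]: Failed password for root from 198.51.100.4 port 2223 ssh2",
    "Jan  3 10:00:12 host sshd[104]: Accepted password for alice from 192.0.2.10 port 5050 ssh2",
]

counts = Counter()
for line in sample_lines:
    m = FAILED.search(line)
    if m:
        counts[m.group(1)] += 1   # group 1 is the IP after "from "

# most_common() sorts by count, highest first, like sort -rn.
for ip, n in counts.most_common():
    print(n, ip)
```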

3. Application logs:

Extract timestamps:

Code:

grep -oP '\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}' app.log

Find errors with context:

Code:

grep -E -A 3 -B 3 'ERROR|CRITICAL' app.log

Extract flat JSON objects from logs (nested braces need a real parser):

Code:

grep -oP '\{[^}]+\}' app.log

---

Data Validation

1. Email addresses:

Simple validation:

Code:

^[\w.%+-]+@[\w.-]+\.\w{2,}$

More strict:

Code:

^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$

2. Phone numbers:

US format (flexible):

Code:

^\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})$

International format:

Code:

^\+?[1-9]\d{1,14}$

3. URLs:

Basic URL:

Code:

^https?://[\w.-]+\.[\w]{2,}(/[\w./?=&-]*)?$

With optional www and path:

Code:

^https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&/=]*)$

4. IP addresses:

IPv4 (simple):

Code:

^(\d{1,3}\.){3}\d{1,3}$

IPv4 (strict, validates ranges):

Code:

^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$

IPv6 (full eight-group form only, no :: compression):

Code:

^(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$
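The difference between the simple and strict IPv4 patterns shows up as soon as an octet goes above 255. A short Python sketch; the function name is_valid_ipv4 and the test addresses are assumptions for the demo.

```python
import re

# One octet, 0-255, exactly as in the strict pattern above.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

SIMPLE = re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")
STRICT = re.compile(r"^(?:" + OCTET + r"\.){3}" + OCTET + r"$")

def is_valid_ipv4(address):
    """True only if every octet is in the range 0-255."""
    return STRICT.match(address) is not None

print(SIMPLE.match("999.1.1.1") is not None)  # True  (shape is right, range is not)
print(is_valid_ipv4("999.1.1.1"))             # False
print(is_valid_ipv4("192.168.1.1"))           # True
print(is_valid_ipv4("256.0.0.1"))             # False
```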

5. Passwords:

Minimum requirements:

Code:

^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$

Requires: lowercase, uppercase, digit, special char, 8+ length

6. Credit card numbers:

Basic format:

Code:

^\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$

Specific cards:

Code:

Visa:       ^4\d{3}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$
MasterCard: ^5[1-5]\d{2}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$
Amex:       ^3[47]\d{2}[\s-]?\d{6}[\s-]?\d{5}$
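
A rough sketch of how the three card patterns could be combined to guess a card type. The card_type function is hypothetical and the numbers are common test values. This checks format only; it does not verify that the number is a real, valid card.

```python
import re

CARD_PATTERNS = {
    "Visa": re.compile(r'^4\d{3}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$'),
    "MasterCard": re.compile(r'^5[1-5]\d{2}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$'),
    "Amex": re.compile(r'^3[47]\d{2}[\s-]?\d{6}[\s-]?\d{5}$'),
}

def card_type(number):
    """Return the first card name whose pattern matches, else None."""
    for name, pattern in CARD_PATTERNS.items():
        if pattern.match(number):
            return name
    return None

print(card_type("4111 1111 1111 1111"))
print(card_type("378282246310005"))
```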

---

Text Extraction and Manipulation

1. Extract all numbers:

Code:

grep -oP '\d+\.?\d*' file.txt

2. Extract words in quotes:

Code:

grep -oP '(?<=")[^"]+(?=")' file.txt

3. Extract hashtags:

Code:

grep -oP '#\w+' social.txt

4. Extract mentions (@username):

Code:

grep -oP '@\w+' social.txt

5. Remove HTML tags:

Code:

sed 's/<[^>]*>//g' file.html

6. Extract MAC addresses:

Code:

grep -oP '([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}' network.txt

7. Extract version numbers:

Code:

grep -oP '\d+\.\d+\.\d+' changelog.txt

8. Convert date formats:

Code:

sed -E 's/([0-9]{2})\/([0-9]{2})\/([0-9]{4})/\3-\1-\2/g' dates.txt

Converts MM/DD/YYYY to YYYY-MM-DD
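
The same substitution in Python, as a minimal sketch (the function name to_iso is made up). The backreferences \3, \1 and \2 work the same way as in the sed command.

```python
import re

US_DATE = re.compile(r'([0-9]{2})/([0-9]{2})/([0-9]{4})')

def to_iso(text):
    """Rewrite every MM/DD/YYYY date in text as YYYY-MM-DD."""
    return US_DATE.sub(r'\3-\1-\2', text)

print(to_iso("Released 01/03/2026, patched 12/25/2025"))
```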

---

Configuration File Parsing

1. Extract keys from key=value pairs:

Code:

grep -oP '^\s*\K\w+(?==)' config.conf

2. Remove comments:

Code:

sed 's/#.*//' config.conf
sed 's|//.*||' config.js

3. Extract variable assignments:

Code:

grep -P '^\s*[A-Z_][A-Z0-9_]*=.+' .env

4. Find uncommented lines:

Code:

grep -vE '^\s*(#|$)' config.conf

5. Extract section headers:

Code:

grep -oP '(?<=\[)[^\]]+(?=\])' config.ini
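
Python's re module supports the same fixed-width lookbehind and lookahead, so the section-header pattern carries over unchanged. A small sketch with a made-up config string:

```python
import re

SECTION = re.compile(r'(?<=\[)[^\]]+(?=\])')

def section_names(text):
    """Return the text found between each pair of square brackets."""
    return SECTION.findall(text)

# Hypothetical INI content, for illustration only.
sample_ini = "[database]\nhost=localhost\n\n[cache]\nttl=60\n"
print(section_names(sample_ini))
```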

---

Network and Security

1. Find private IP addresses:

Code:

grep -oP '\b(?:10\.\d{1,3}|172\.(?:1[6-9]|2[0-9]|3[01])|192\.168)\.\d{1,3}\.\d{1,3}\b' file.txt
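
For use inside a script, here is a Python sketch that pulls full four-octet private addresses out of a string. The function name find_private_ips and the sample text are assumptions. The 10.x.x.x branch needs one more octet than the other two, which is why it carries its own \d{1,3}.

```python
import re

PRIVATE_IPV4 = re.compile(
    r'\b(?:10\.\d{1,3}'                  # 10.0.0.0/8
    r'|172\.(?:1[6-9]|2[0-9]|3[01])'     # 172.16.0.0/12
    r'|192\.168)'                        # 192.168.0.0/16
    r'\.\d{1,3}\.\d{1,3}\b'
)

def find_private_ips(text):
    """Return every private-range IPv4 address found in text."""
    return PRIVATE_IPV4.findall(text)

sample_text = "hosts: 10.1.2.3, 172.16.0.9, 172.32.0.1, 192.168.1.50, 8.8.8.8"
print(find_private_ips(sample_text))
```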

2. Extract URLs from HTML:

Code:

grep -oP '(?:href|src)="\Khttps?://[^"]+' page.html

3. Find potential SQL injection attempts:

Code:

grep -iE "(union|select|insert|update|delete|drop|exec|script)" access.log

4. Extract SSL certificate info:

Code:

openssl x509 -in cert.pem -text | grep -oP 'Subject:.*'

5. Parse firewall rules:

Code:

iptables -L -n | grep -oP '\d+\.\d+\.\d+\.\d+/\d+'

---

Common Mistakes and How to Avoid Them

1. Forgetting to escape special characters:

Wrong:

Code:

grep '192.168.1.1' file.txt

Also matches "192x168x1x1" (an unescaped . matches any character)

Right:

Code:

grep '192\.168\.1\.1' file.txt

2. Greedy vs lazy matching:

Wrong (greedy):

Code:

echo "<tag>content</tag><tag>more</tag>" | grep -oP '<tag>.*</tag>'

Matches entire string

Right (lazy):

Code:

echo "<tag>content</tag><tag>more</tag>" | grep -oP '<tag>.*?</tag>'

Matches each tag separately
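
The same greedy versus lazy behavior can be checked in Python with re.findall, as a quick sketch:

```python
import re

text = "<tag>content</tag><tag>more</tag>"

# Greedy: .* runs to the last </tag> it can find.
greedy = re.findall(r'<tag>.*</tag>', text)

# Lazy: .*? stops at the first </tag> after each opening tag.
lazy = re.findall(r'<tag>.*?</tag>', text)

print(greedy)
print(lazy)
```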

3. Not anchoring patterns:

Wrong:

Code:

grep -P '\d{3}' file.txt

Matches 123 in "abc123def"

Right:

Code:

grep -P '^\d{3}$' file.txt

Matches only lines with exactly 3 digits
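
In Python the same distinction shows up as search (match anywhere) versus fullmatch (the whole string must match). A minimal sketch, with helper names chosen only for illustration:

```python
import re

THREE_DIGITS = re.compile(r'\d{3}')

def contains_three_digits(line):
    """Unanchored: true if three digits appear anywhere in the line."""
    return THREE_DIGITS.search(line) is not None

def is_exactly_three_digits(line):
    """Anchored at both ends: the line must be three digits and nothing else."""
    return THREE_DIGITS.fullmatch(line) is not None

print(contains_three_digits("abc123def"), is_exactly_three_digits("abc123def"))
print(contains_three_digits("123"), is_exactly_three_digits("123"))
```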

4. Catastrophic backtracking:

Dangerous:

Code:

(a+)+b

Backtracking engines can take exponential time on a long run of "a" with no "b" at the end

Better:

Code:

a+b
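
Before swapping a nested quantifier for a flat one, it is worth confirming that both accept the same inputs. The sketch below does that on a few short strings (kept short on purpose, so the nested pattern stays fast). The helper same_verdict is hypothetical.

```python
import re

NESTED = re.compile(r'(a+)+b')  # nested quantifier, can backtrack heavily
FLAT = re.compile(r'a+b')       # same language, no nesting

def same_verdict(text):
    """True if both patterns agree on whether the whole string matches."""
    return (NESTED.fullmatch(text) is not None) == (FLAT.fullmatch(text) is not None)

SAMPLES = ["ab", "aaab", "b", "aaa", "aabb", ""]
print(all(same_verdict(s) for s in SAMPLES))
```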

5. Case sensitivity:

Remember to use:

Code:

grep -i 'pattern'           # Case-insensitive grep
sed 's/pattern/replace/gi'  # Case-insensitive sed (the i/I flag is a GNU extension)
re.IGNORECASE               # Python flag

---

Performance Best Practices

1. Anchor patterns when possible:

Code:

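^pattern    # Often faster - a match can only begin at the start of the line
pattern$    # Anchored to the end - some engines still scan the line to get there
pattern     # Unanchored - tried at every position in the line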

2. Use character classes efficiently:

Code:

[0-9]       # Good
\d          # Same speed, but not POSIX - needs grep -P, Perl, Python, etc.
[0123456789] # Slower, harder to read

3. Avoid unnecessary capture groups:

Code:

(?:pattern)  # Non-capturing (faster)
(pattern)    # Capturing (slower if not needed)

4. Compile patterns in scripts:

Python:

Code:

import re

pattern = re.compile(r'\d+')   # Compile once
with open('app.log') as file:  # Example file name
    for line in file:
        pattern.search(line)   # Reuse the compiled pattern

5. Use specific quantifiers:

Code:

\d{10}      # Better - specific count
\d{10,10}   # Worse - range with same min/max
\d+         # Worst for known length

---

Testing and Debugging

1. Test incrementally:

Code:

# Start simple
\d+

# Add complexity
\d+\.\d+

# Add anchors
^\d+\.\d+$

# Add validation
^(?:\d{1,3}\.){3}\d{1,3}$
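
One way to test incrementally is to run every stage of the pattern against the same handful of samples and watch the set of matches shrink. A sketch, with stage names and sample strings invented for illustration:

```python
import re

STAGES = [
    ("digits", r'\d+'),
    ("decimal", r'\d+\.\d+'),
    ("anchored decimal", r'^\d+\.\d+$'),
    ("dotted quad", r'^(?:\d{1,3}\.){3}\d{1,3}$'),
]
SAMPLES = ["42", "3.14", "v3.14", "10.0.0.1"]

def run_stages():
    """Map each stage name to the samples its pattern matches."""
    results = {}
    for name, pattern in STAGES:
        regex = re.compile(pattern)
        results[name] = [s for s in SAMPLES if regex.search(s)]
    return results

for name, matched in run_stages().items():
    print(f"{name}: {matched}")
```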

2. Use verbose mode for complex patterns:

Python:

Code:

import re

pattern = re.compile(r'''
    ^                 # Start of line
    (?P<ip>           # IP address group
        (?:\d{1,3}\.){3}  # First 3 octets
        \d{1,3}       # Last octet
    )
    \s+               # Whitespace
    (?P<port>\d+)     # Port number
    $                 # End of line
''', re.VERBOSE)
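
A short usage sketch for a verbose pattern like that one, showing how the named groups come back out. The names HOST_PORT and parse_host_port are assumptions for this example.

```python
import re

HOST_PORT = re.compile(r'''
    ^                                  # Start of line
    (?P<ip>(?:\d{1,3}\.){3}\d{1,3})    # Dotted-quad address
    \s+                                # Whitespace
    (?P<port>\d+)                      # Port number
    $                                  # End of line
''', re.VERBOSE)

def parse_host_port(line):
    """Return (ip, port) from an 'ip port' line, or None if it does not match."""
    match = HOST_PORT.match(line)
    if match is None:
        return None
    return match.group('ip'), int(match.group('port'))

print(parse_host_port("192.168.1.1   8080"))
print(parse_host_port("192.168.1.1:8080"))
```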

3. Online testers:

regex101.com (explains matches step-by-step)
regexr.com (visual interface)
debuggex.com (railroad diagrams)

4. Test edge cases:

Code:

# Test with:
- Empty strings
- Maximum length inputs
- Special characters
- Unicode characters
- Whitespace variations
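
Edge cases are where regex surprises tend to hide. The sketch below runs a simple digits-only pattern against a few of them in Python 3, where \d also matches non-ASCII digits unless re.ASCII is set, and $ also matches just before a trailing newline. The case names and the evaluate helper are made up for illustration.

```python
import re

DIGITS = re.compile(r'^\d+$')
DIGITS_ASCII = re.compile(r'^\d+$', re.ASCII)

EDGE_CASES = {
    "empty": "",
    "long": "7" * 10000,
    "special": "12-34",
    "unicode": "\u0663\u0664",      # Arabic-Indic digits three and four
    "trailing newline": "123\n",
    "padded": " 123 ",
}

def evaluate(regex):
    """Map each edge-case name to whether the pattern matches it."""
    return {name: regex.match(value) is not None
            for name, value in EDGE_CASES.items()}

print(evaluate(DIGITS))
print(evaluate(DIGITS_ASCII))
```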

---

Real-World System Administration Scripts

1. Find large files modified recently:

Code:

find /var/log -type f -mtime -7 -size +100M | grep -E '\.(log|txt)$'

2. Monitor failed logins:

Code:

#!/bin/bash
tail -f /var/log/auth.log | grep --line-buffered -E 'Failed password' | \
while IFS= read -r line; do
    ip=$(echo "$line" | grep -oP 'from \K[0-9.]+')
    echo "Failed login from: $ip"
done

3. Extract errors from multiple logs:

Code:

find /var/log -name "*.log" -exec grep -H -E 'ERROR|CRITICAL' {} \; | \
grep -oP '^[^:]+:\K.*'

4. Validate config files:

Code:

#!/bin/bash
for file in /etc/*.conf; do
    if grep -qP '[^[:print:][:space:]]' "$file"; then
        echo "Non-printable characters in $file"
    fi
done

5. Parse and summarize access logs:

Code:

#!/bin/bash
echo "Top 10 IP addresses:"
grep -oP '^\d+\.\d+\.\d+\.\d+' /var/log/apache2/access.log | \
    sort | uniq -c | sort -rn | head -10

echo -e "\nTop 10 requested URLs:"
grep -oP 'GET \K[^ ]+' /var/log/apache2/access.log | \
    sort | uniq -c | sort -rn | head -10
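
The same summary can be built in one pass over the log in Python. This is a sketch only: the summarize function and the sample lines are invented, and capture groups replace \K because Python's re module does not support it.

```python
import re
from collections import Counter

CLIENT_IP = re.compile(r'^\d+\.\d+\.\d+\.\d+')
GET_PATH = re.compile(r'GET ([^ ]+)')

def summarize(lines, top=10):
    """Return (top client IPs, top GET paths) as lists of (value, count)."""
    ips, paths = Counter(), Counter()
    for line in lines:
        ip = CLIENT_IP.match(line)
        if ip:
            ips[ip.group(0)] += 1
        path = GET_PATH.search(line)
        if path:
            paths[path.group(1)] += 1
    return ips.most_common(top), paths.most_common(top)

# Hypothetical access-log lines, for illustration only.
sample_log = [
    '203.0.113.5 - - [03/Jan/2026:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '203.0.113.5 - - [03/Jan/2026:10:00:01 +0000] "GET /about.html HTTP/1.1" 200 128',
    '198.51.100.7 - - [03/Jan/2026:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 512',
    '203.0.113.5 - - [03/Jan/2026:10:00:03 +0000] "POST /login HTTP/1.1" 302 0',
]
top_ips, top_paths = summarize(sample_log)
print(top_ips)
print(top_paths)
```

To run it against a real file, pass an open file object instead of the sample list.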

---

Quick Reference Card

Anchors:

Code:

^       Start of line
$       End of line
\b      Word boundary
\<      Start of word
\>      End of word

Character Classes:

Code:

.       Any character
\d      Digit
\w      Word character
\s      Whitespace
[abc]   Any of a, b, c
[^abc]  Not a, b, c
[a-z]   Range a to z

Quantifiers:

Code:

*       Zero or more
+       One or more
?       Zero or one
{n}     Exactly n
{n,}    n or more
{n,m}   Between n and m
*?      Lazy zero or more
+?      Lazy one or more

Groups:

Code:

(...)       Capture group
(?:...)     Non-capturing group
(?<name>...)  Named group (Python: (?P<name>...))
\1          Backreference
(?=...)     Positive lookahead
(?!...)     Negative lookahead
(?<=...)    Positive lookbehind
(?<!...)    Negative lookbehind

Flags:

Code:

grep -i     Case-insensitive
grep -E     Extended regex
grep -P     Perl regex
grep -o     Only matching
sed -E      Extended regex
sed -i      In-place edit

---

Final Recommendations

1. Start simple, add complexity gradually

2. Always test patterns on sample data first

3. Document complex regex (use comments)

4. Consider readability over brevity

5. Use the right tool:

Simple patterns: BRE (grep/sed)
Medium complexity: ERE (grep -E)
Complex patterns: PCRE (grep -P) or Python

6. Keep a personal regex library for common patterns

7. When in doubt, search existing solutions (but understand them!)

---

Conclusion

Regular expressions are powerful but can be complex. Master the basics first, then gradually add advanced features as needed. The key is practice - the more you use regex, the more intuitive it becomes.

Remember: A simple, readable regex that works is better than a clever, compact one that nobody can maintain.

---

Series Complete!

You now have a comprehensive guide to regular expressions across multiple flavors. Refer back to these parts as needed for syntax reference and practical examples.

---

There... whew! How is that for the longest, most difficult to read article you ever read? (Over 3 weeks in the making.)
