What Is a Regular Expression?
A regular expression (regex or regexp) is a sequence of characters that defines a search pattern. Regex is used for text validation, searching, and pattern matching across virtually every programming language. For example, you can use regex to validate email addresses, extract phone numbers, or find and replace text patterns.
Regex Syntax Basics
Basic regex building blocks:
| Symbol | Meaning | Example |
|---|---|---|
| . | Any single character (except newline) | c.t matches cat, cut, cot |
| * | 0 or more of preceding character | ca*t matches ct, cat, caat |
| + | 1 or more of preceding character | ca+t matches cat, caat (not ct) |
| ? | 0 or 1 of preceding character | ca?t matches ct, cat (not caat) |
| ^ | Start of string | ^hello matches hello at the beginning |
| $ | End of string | end$ matches end at the end |
| [abc] | Any one character in brackets | [aeiou] matches any vowel |
| [a-z] | Range of characters | [a-z] matches any lowercase letter |
| \\d | Digit (0-9) | \\d+ matches 123, 456 |
| \\w | Word character [a-zA-Z0-9_] | \\w+ matches words |
| \\s | Whitespace character | \\s matches spaces, tabs |
| \| | OR (alternation) | cat\|dog matches cat or dog |
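
The table rows can be checked mechanically. The sketch below assumes Python's built-in re module; the `examples` list and the `full_matches` helper are illustrative names, not part of any library. It uses re.fullmatch so that each candidate string must match the pattern in its entirety.

```python
import re

# Each entry: (pattern, candidate strings, expected whole-string match results).
examples = [
    (r"c.t",     ["cat", "cut", "ct"],   [True, True, False]),
    (r"ca*t",    ["ct", "cat", "caat"],  [True, True, True]),
    (r"ca+t",    ["ct", "cat", "caat"],  [False, True, True]),
    (r"ca?t",    ["ct", "cat", "caat"],  [True, True, False]),
    (r"[aeiou]", ["a", "b"],             [True, False]),
    (r"\d+",     ["123", "12a"],         [True, False]),
    (r"cat|dog", ["cat", "dog", "cow"],  [True, True, False]),
]

def full_matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

for pattern, texts, expected in examples:
    results = [full_matches(pattern, t) for t in texts]
    print(pattern, dict(zip(texts, results)))

# Anchors are easier to see with re.search, which scans the whole string.
starts_with_hello = re.search(r"^hello", "hello world") is not None
ends_with_end = re.search(r"end$", "the end") is not None
```

Other engines may differ in details such as what \d or \w cover, so treat this as a check of the basic symbols rather than a portable specification.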
Worked Example: Email Validation
Pattern: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
- ^: Start of string
- [a-zA-Z0-9._%+-]+: One or more letters, digits, dots, underscores, percent signs, plus signs, or hyphens (username part)
- @: Literal @
- [a-zA-Z0-9.-]+: One or more letters, digits, dots, or hyphens (domain)
- \.: Literal dot (escaped)
- [a-zA-Z]{2,}: 2 or more letters (TLD like .com, .org)
- $: End of string
Matches: john@example.com, user+tag@domain.co.uk
Does not match: invalid.email@, @nodomain.com
This pattern is a practical approximation: it does not cover every address that the email standards allow.
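
As a minimal sketch of how such a check might look in code, assuming Python's re module: the `EMAIL_RE` constant and `is_valid_email` function are illustrative names. Note that the username character class here includes +, which is needed for user+tag@domain.co.uk to match.

```python
import re

# Simplified email pattern; the local-part class includes "+" so that
# addresses such as user+tag@domain.co.uk are accepted.
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def is_valid_email(address: str) -> bool:
    """Return True if address fits the simplified pattern above."""
    # fullmatch requires the entire string to match, which also rejects
    # a trailing newline that "$" alone would tolerate in Python.
    return EMAIL_RE.fullmatch(address) is not None

for candidate in ["john@example.com", "user+tag@domain.co.uk",
                  "invalid.email@", "@nodomain.com"]:
    print(candidate, is_valid_email(candidate))
```

A pattern like this only checks the shape of an address; it cannot tell whether the mailbox exists.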
Flags and Modifiers
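Flags modify how a regex behaves. The letters below follow JavaScript's notation; other languages offer similar options under different names: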
- g (global): Find all matches, not just the first. Required to find multiple matches.
- i (case-insensitive): Match ignores uppercase/lowercase differences.
- m (multiline): ^ and $ match line beginnings/endings, not just string start/end.
- s (dotall): . (dot) also matches newline characters.
- u (unicode): Treat the pattern as a sequence of Unicode code points.
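
The same behaviors can be sketched in Python, where the options have different names. This example assumes Python's re module: re.findall plays the role of g, and re.IGNORECASE, re.MULTILINE, and re.DOTALL correspond to i, m, and s. The sample text is made up for illustration.

```python
import re

text = "Hello world\nhello again"

# g + i: findall returns every match; IGNORECASE ignores letter case.
all_hellos = re.findall(r"hello", text, flags=re.IGNORECASE)

# m: with MULTILINE, ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Without m, ^ matches only at the start of the whole string.
string_start = re.findall(r"^\w+", text)

# s: with DOTALL, the dot also matches the newline between the two lines.
with_dotall = re.search(r"world.hello", text, flags=re.DOTALL)
without_dotall = re.search(r"world.hello", text)

print(all_hellos)      # ['Hello', 'hello']
print(line_starts)     # ['Hello', 'hello']
print(string_start)    # ['Hello']
print(with_dotall is not None, without_dotall is not None)  # True False
```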
Capture Groups and Backreferences
Parentheses create capture groups:
Pattern: (\d{3})-(\d{2})-(\d{4})
Text: 123-45-6789
Group 1: 123
Group 2: 45
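Group 3: 6789
Use captured groups in replacements: replacing the match with "$2/$3" turns 123-45-6789 into 45/6789. The placeholder syntax varies by language ($2 in JavaScript, a backslash followed by the group number in Python).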
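
A minimal sketch of the same example, assuming Python's re module, where replacement groups are written with a backslash and the group number. The last lines also show a backreference inside the pattern itself; the repeated-word pattern and sample sentence are illustrative additions.

```python
import re

pattern = re.compile(r"(\d{3})-(\d{2})-(\d{4})")
text = "123-45-6789"

match = pattern.fullmatch(text)
groups = match.groups()                      # ('123', '45', '6789')

# In the replacement string, \2 and \3 refer to groups 2 and 3.
reformatted = pattern.sub(r"\2/\3", text)    # '45/6789'

# Inside a pattern, \1 matches the same text that group 1 captured.
# Here it finds a word that is immediately repeated.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")

print(groups)
print(reformatted)
print(repeated.group(0))                     # 'is is'
```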
Common Mistakes
- Forgetting ^ and $: Without them, the pattern can match anywhere in the string. /hello/ matches "say hello world".
- Not escaping special characters: Use a backslash (\\) to match these characters literally: . ? + * [ ] ( ) { } ^ $ | and the backslash itself.
- Greedy vs lazy matching: .* is greedy (matches as much as possible). .*? is lazy (matches as little as possible).
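- Not using raw strings: In an ordinary string literal, every backslash in the pattern must be doubled. Where available, use raw strings (r"..." in Python) or regex literals (/.../ in JavaScript) to avoid double-escaping.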
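
Each of these mistakes can be reproduced in a few lines. The sketch below assumes Python's re module; the sample strings are made up for illustration.

```python
import re

# 1. Anchors: without ^ and $ the pattern matches inside a longer string.
unanchored = re.search(r"hello", "say hello world") is not None      # True
anchored = re.search(r"^hello$", "say hello world") is not None      # False

# 2. Escaping: an unescaped dot matches any character, not just ".".
dot_unescaped = re.fullmatch(r"3.14", "3514") is not None            # True
dot_escaped = re.fullmatch(r"3\.14", "3514") is not None             # False
# re.escape builds a pattern that matches arbitrary text literally.
literal = re.fullmatch(re.escape("a.b*c"), "a.b*c") is not None      # True

# 3. Greedy vs lazy: .* grabs as much as it can, .*? as little as it can.
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)
lazy = re.findall(r"<b>.*?</b>", html)

# 4. Raw strings: r"\d" is the same two characters as "\\d".
same_pattern = r"\d" == "\\d"                                        # True

print(greedy)   # ['<b>one</b><b>two</b>']
print(lazy)     # ['<b>one</b>', '<b>two</b>']
```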
Performance Considerations
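- Catastrophic backtracking: Nested or overlapping quantifiers such as (a+)+ can force a backtracking engine to try exponentially many paths on input that almost matches. Test patterns against long, near-miss inputs.
- Anchors can improve performance: A leading ^ lets the engine give up after one failed attempt instead of retrying at every position in the string.
- Be specific: A narrow class such as [a-zA-Z] usually leaves less room for backtracking than . (dot), which matches almost anything.
- Use character classes: \\d is more concise than [0-9] and typically performs about the same. In Unicode-aware engines (for example Python 3 and .NET), \\d also matches non-ASCII digits, so the two are not always equivalent.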
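
Catastrophic backtracking is easier to appreciate with a measurement. The sketch below assumes Python's re module, which is a backtracking engine; `RISKY`, `SAFE`, and `time_match` are illustrative names. The input is deliberately short, because each extra character typically doubles the work for the nested pattern.

```python
import re
import time

# Nested quantifier: a run of "a" can be split between the inner and outer +
# in many different ways, and the engine may try them all before failing.
RISKY = re.compile(r"^(a+)+$")
# Accepts the same strings with a single quantifier, so there is one way to match.
SAFE = re.compile(r"^a+$")

def time_match(regex, text):
    """Return (matched, elapsed_seconds) for one match attempt."""
    start = time.perf_counter()
    matched = regex.match(text) is not None
    return matched, time.perf_counter() - start

# Near-miss input: a run of "a" followed by a character that forces failure.
near_miss = "a" * 18 + "b"

for name, regex in [("risky", RISKY), ("safe", SAFE)]:
    matched, seconds = time_match(regex, near_miss)
    print(f"{name}: matched={matched} time={seconds:.6f}s")
```

Timings depend on the machine and engine version, so the useful signal is the gap between the two patterns rather than the absolute numbers.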
Tools for Regex
- Regex101.com: Interactive regex tester with explanations and flags.
- Regex generators: Tools that build regex from examples.
- IDE support: Most editors highlight regex syntax and test matches.
- Language docs: Each language (JavaScript, Python, Java) has specific regex variations.
References
- MDN Regular Expressions: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
- Regex101: https://regex101.com/
- RegexOne Tutorial: https://regexone.com/