Master Regular Expressions with Python's re Module! A Complete Guide to String Searching and Replacing
To run Python from the command prompt or PowerShell on your PC, you need to download and install Python.
If you haven’t installed it yet, please refer to the article Setting Up Python and Development Environment to install Python.
When building websites or processing data, you often find yourself thinking, "I want to extract only the strings that follow a specific pattern!" or "I want to clean up data with inconsistent formats in one pass!" This is where regular expressions shine. Do they seem difficult? Not at all! With Python's `re` module, even beginners can get comfortable with string manipulation faster than they expect!
In this article, we'll explain everything from the basics to the applications of regular expressions using Python's `re` module in a thoroughly easy-to-understand manner, with plenty of copy-paste-ready code examples. First, try running the code to experience that "Aha, so that's how you use it!" moment. Let's dive into the world of regular expressions together!
What Are Regular Expressions and the `re` Module, Anyway?
A regular expression is, in short, "a special string for representing patterns in text." It's like a common language for searching and replacing strings with ambiguous conditions, such as "a sequence of three numbers" or "something that looks like an email address."
And the standard feature (library) for handling these regular expressions in Python is the `re` module. Just by importing the `re` module, you can accomplish complex string processing in just a few lines of code.
First, the Basics! Four Functions for "Finding" Strings
The `re` module has several functions for searching strings, but let's master the four most commonly used ones first. These functions take a "regular expression pattern" and a "target string" as arguments.
1. `re.search()` - Returns the First Match Found
`re.search()` scans through the entire string and returns a match object for the first part that matches the pattern. It returns `None` if no match is found. It's the most common and easy-to-use search function.
For example, let's try to find the word "Python" in a sentence.
```python
import re

text = "Hello, this is a pen. I love Python programming!"
pattern = "Python"

# Search for the pattern within the string
match = re.search(pattern, text)

if match:
    print(f"Found!: '{match.group()}'")
    print(f"Start index: {match.start()}, End index: {match.end()}")
else:
    print("Not found.")

# Execution Result:
# Found!: 'Python'
# Start index: 29, End index: 35
```
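Because `re.search()` returns `None` when nothing matches, calling `.group()` directly on the result would raise an `AttributeError`. The following is a minimal sketch of one way to guard against that; `find_first` is a hypothetical helper name used only for illustration, not part of the `re` module.

```python
import re


def find_first(pattern, text):
    """Return the first matched substring, or None if there is no match."""
    match = re.search(pattern, text)
    # Check for None before touching the match object
    return match.group() if match else None


text = "Hello, this is a pen. I love Python programming!"
print(find_first("Python", text))  # Python
print(find_first("Ruby", text))    # None
```

Wrapping the check in a small function like this keeps the `None` handling in one place instead of repeating the `if match:` test everywhere.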
2. `re.match()` - Checks if the "Beginning" of the String Matches
`re.match()` is similar to `re.search()`, but with a major difference: it only checks if the pattern matches from the beginning of the string. Patterns in the middle of the string are ignored.
Let's use `re.match()` on the previous sentence. It will match "Hello" but not "Python".
```python
import re

text = "Hello, this is a pen. I love Python programming!"

# Pattern 1: "Hello" (at the beginning of the string)
match1 = re.match("Hello", text)
if match1:
    print(f"Search result for 'Hello': Matched! -> {match1.group()}")
else:
    print("Search result for 'Hello': No match.")

# Pattern 2: "Python" (in the middle of the string)
match2 = re.match("Python", text)
if match2:
    print(f"Search result for 'Python': Matched! -> {match2.group()}")
else:
    print("Search result for 'Python': No match.")

# Execution Result:
# Search result for 'Hello': Matched! -> Hello
# Search result for 'Python': No match.
```
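One detail that is easy to miss: `re.match()` only anchors the start, so the rest of the string can contain anything. When the whole string has to fit the pattern (for example, when validating input), the standard `re.fullmatch()` function is probably what you want. A small sketch comparing the three, using a made-up sample string:

```python
import re

code = "12345-abc"  # sample value chosen for illustration

# re.match: the pattern only has to match at the start of the string
starts_with_digits = bool(re.match(r"\d+", code))
print(starts_with_digits)  # True

# re.fullmatch: the pattern has to cover the whole string
only_digits = bool(re.fullmatch(r"\d+", code))
print(only_digits)  # False

whole_code = bool(re.fullmatch(r"\d+-[a-z]+", code))
print(whole_code)  # True

# re.search: the pattern may match anywhere in the string
letters = re.search(r"[a-z]+", code).group()
print(letters)  # abc
```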
3. `re.findall()` - Returns "All" Matches in a List
`re.findall()` finds all parts that match the pattern and returns them as a list of strings. (If the pattern contains groups, the list holds the group contents instead of the whole match.) It's extremely useful when you want to extract all numbers from a sentence, for example.
Here, let's use the regex pattern `\d+`. This means "a sequence of one or more digits."
```python
import re

text = "Item A costs 100 for 1 piece, and Item B costs 250 for 3 pieces. The order number is 8801."

# \d+ is a regex pattern that means 'one or more consecutive digits'
pattern = r"\d+"

# Get all matching parts as a list (in the order they appear in the text)
numbers = re.findall(pattern, text)
print(f"List of numbers found: {numbers}")

# Execution Result:
# List of numbers found: ['100', '1', '250', '3', '8801']
```
Note: The `r` before the regex pattern marks it as a "raw string," which stops Python from interpreting the backslash `\` as the start of a string escape sequence, so the pattern reaches the `re` module exactly as written. It's a good practice to use raw strings whenever you write regular expressions to avoid unnecessary errors.
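To see why the raw string matters, here is a small sketch using `\b`, which means "backspace character" in an ordinary Python string but "word boundary" in a regular expression. The sample text is made up for illustration.

```python
import re

text = "cat concatenate cat"

# In a normal string literal, "\b" is the backspace character (\x08)
plain = "\bcat\b"
# In a raw string, r"\b" stays as backslash + b, which `re` reads as a word boundary
raw = r"\bcat\b"

print(len(plain), len(raw))  # 5 7

plain_result = re.findall(plain, text)
raw_result = re.findall(raw, text)
print(plain_result)  # []
print(raw_result)    # ['cat', 'cat']
```

The non-raw version silently searches for backspace characters and finds nothing, which is the kind of quiet bug the `r` prefix prevents.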
4. `re.finditer()` - Returns All Matches as an "Iterator"
`re.finditer()` is similar to `re.findall()`, but it returns an "iterator" of match objects instead of a list of strings. An iterator is suitable for processing elements one by one, such as in a for loop.
It's useful when you want not just the matched string, but also the match object (which contains information like the start position) that `search` or `match` would return.
```python
import re

text = "My birthday is on December 31, 1995, and his birthday is on May 5, 2003."

# Find 4-digit numbers
pattern = r"\d{4}"

# Get the matches as an iterator
matches = re.finditer(pattern, text)

print("Found years:")
for m in matches:
    print(f"- {m.group()} (position: {m.start()}-{m.end()})")

# Execution Result:
# Found years:
# - 1995 (position: 31-35)
# - 2003 (position: 67-71)
```
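The difference between the two functions becomes clearer once the pattern contains groups. The sketch below reuses the birthday sentence with a date pattern that is only an assumption for this example (it expects an English month name, a day, and a four-digit year).

```python
import re

text = "My birthday is on December 31, 1995, and his birthday is on May 5, 2003."
# Three groups: month name, day, year (illustrative pattern)
pattern = r"([A-Z][a-z]+) (\d{1,2}), (\d{4})"

# findall: with groups in the pattern, each result is a tuple of the group contents
dates = re.findall(pattern, text)
print(dates)
# [('December', '31', '1995'), ('May', '5', '2003')]

# finditer: each result is a match object, so positions are available too
spans = []
for m in re.finditer(pattern, text):
    month, day, year = m.groups()
    spans.append(m.span())
    print(f"{year}: {month} {day} at {m.span()}")
# 1995: December 31 at (18, 35)
# 2003: May 5 at (60, 71)
```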
Freely "Replace" Strings - `re.sub()`
Another powerful feature of regular expressions is "replacement." With `re.sub()`, you can bulk replace parts that match a pattern with another string.
For example, let's replace phone numbers in a text with the string "(redacted)" for privacy. We'll use a regex `\d{2,4}-\d{2,4}-\d{4}` that matches phone number formats like 080-1234-5678.
```python
import re

text = "For inquiries, contact support staff Sato (080-1111-2222). Or, contact sales representative Suzuki (03-3333-4444)."

# Regex to match phone numbers
pattern = r"\d{2,4}-\d{2,4}-\d{4}"
replacement = "(redacted)"

# Replace the parts that match the pattern
new_text = re.sub(pattern, replacement, text)
print(new_text)

# Execution Result:
# For inquiries, contact support staff Sato ((redacted)). Or, contact sales representative Suzuki ((redacted)).
```
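`re.sub()` also accepts a function in place of the replacement string, and an optional `count` argument that limits how many matches are replaced. The sketch below is one possible variation on the redaction example; `mask_number` is a hypothetical helper name, and the sample sentence is shortened for illustration.

```python
import re

text = "Call Sato at 080-1111-2222 or Suzuki at 03-3333-4444."
# The last four digits are captured in group 1 so they can be kept
pattern = r"\d{2,4}-\d{2,4}-(\d{4})"


def mask_number(match):
    """Keep the last four digits and mask the rest."""
    return "***-****-" + match.group(1)


# The function is called once per match and its return value is used as the replacement
masked = re.sub(pattern, mask_number, text)
print(masked)
# Call Sato at ***-****-2222 or Suzuki at ***-****-4444.

# count=1 limits the replacement to the first match only
first_only = re.sub(pattern, "(redacted)", text, count=1)
print(first_only)
# Call Sato at (redacted) or Suzuki at 03-3333-4444.
```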
Advanced: Replacing with Groups
The true power of `re.sub()` is unleashed when combined with the "group" feature of regular expressions. By enclosing parts of the pattern in `()`, you can reuse those parts (groups) in the replacement string.
For example, let's swap names that are in "Last-First" order to "First Last" order.
```python
import re

# Names written in "Last-First" order
text = "Characters: Kamado-Tanjiro, Agatsuma-Zenitsu, Hashibira-Inosuke"

# (\w+) is a group. The first one matches the last name, the second one matches the first name.
pattern = r"(\w+)-(\w+)"

# \2 refers to the second group (first name), \1 refers to the first group (last name)
replacement = r"\2 \1"

new_text = re.sub(pattern, replacement, text)
print(new_text)

# Execution Result:
# Characters: Tanjiro Kamado, Zenitsu Agatsuma, Inosuke Hashibira
```
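Numbered references like `\1` and `\2` get hard to read as patterns grow. Python's `re` module also supports named groups, written `(?P<name>...)` in the pattern and `\g<name>` in the replacement. Here is a small sketch that reorders dates; the sample sentence and the group names `year`, `month`, and `day` are chosen just for this example.

```python
import re

text = "The event is on 2025/07/26 and the deadline is 2025/06/30."

# Each part of the date gets a descriptive group name
pattern = r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})"

# \g<name> inserts the text captured by the named group
new_text = re.sub(pattern, r"\g<day>.\g<month>.\g<year>", text)
print(new_text)
# The event is on 26.07.2025 and the deadline is 30.06.2025.

# Named groups are also available on the match object
first = re.search(pattern, text)
print(first.group("year"))  # 2025
print(first.groupdict())    # {'year': '2025', 'month': '07', 'day': '26'}
```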
【Try It Live】Let's Build a Regex Checker!
Using the knowledge we've gained so far, let's build a "Regex Checker" that allows you to check the behavior of regular expressions in real-time in your browser!
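Copy the entire HTML code below, save it in a file named something like `checker.html`, and open it in your browser. Enter your text in the textarea, your desired pattern (e.g., `\d+` or `[A-Za-z]+`) in the regex input field, and press the "Run Highlight" button. The matched parts will be highlighted in light blue. Keep in mind that the checker runs on the browser's JavaScript regex engine and always applies the case-insensitive flag, so results can differ slightly from Python's `re` (for example, JavaScript writes named groups as `(?<name>...)` instead of `(?P<name>...)`). Even so, it's a perfect sample to experience regular expressions working live!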
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Regex Highlighter Checker</title>
<style>
body {
background-color: #202124;
color: #e8eaed;
font-family: sans-serif;
line-height: 1.6;
padding: 20px;
}
.container {
max-width: 800px;
margin: 0 auto;
}
h1 {
color: #669df6;
border-bottom: 1px solid #5f6368;
padding-bottom: 10px;
}
textarea, input[type="text"] {
width: 100%;
padding: 10px;
margin-bottom: 10px;
background-color: #3c4043;
color: #e8eaed;
border: 1px solid #5f6368;
border-radius: 4px;
box-sizing: border-box; /* Include padding in width calculation */
}
button {
padding: 10px 20px;
background-color: #8ab4f8;
color: #202124;
border: none;
border-radius: 4px;
cursor: pointer;
font-weight: bold;
}
button:hover {
opacity: 0.9;
}
#result {
margin-top: 20px;
padding: 15px;
border: 1px solid #5f6368;
border-radius: 4px;
white-space: pre-wrap; /* Display newlines as is */
word-wrap: break-word; /* Wrap long words */
}
.highlight {
background-color: #3367d6; /* Highlight with light blue */
color: #ffffff;
border-radius: 3px;
padding: 0 2px;
}
label {
display: block;
margin-bottom: 5px;
font-weight: bold;
}
</style>
</head>
<body>
<div class="container">
<h1>Regex Highlighter Checker</h1>
<label for="text-input">Text to test:</label>
<textarea id="text-input" rows="8">Python 3.10 is the latest version. My email is sample-user@example.com. Please call me at 090-1234-5678. The event is on 2025/07/26.</textarea>
<label for="regex-input">Regular Expression Pattern:</label>
<input type="text" id="regex-input" value="\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b">
<button onclick="highlightMatches()">Run Highlight</button>
<label for="result" style="margin-top: 20px;">Result:</label>
<div id="result"></div>
</div>
<script>
function highlightMatches() {
const text = document.getElementById('text-input').value;
const regexPattern = document.getElementById('regex-input').value;
const resultDiv = document.getElementById('result');
if (!text || !regexPattern) {
resultDiv.textContent = 'Please enter text and a regular expression pattern.';
return;
}
try {
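// Add g flag (global search) and i flag (case-insensitive)
const regex = new RegExp(regexPattern, 'gi');
// Escape HTML special characters so the input text is displayed literally
const escapeHtml = (s) => s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
let highlightedText = '';
let lastIndex = 0;
// Rebuild the text piece by piece, wrapping each match in a highlight span
for (const m of text.matchAll(regex)) {
highlightedText += escapeHtml(text.slice(lastIndex, m.index));
highlightedText += `<span class="highlight">${escapeHtml(m[0])}</span>`;
lastIndex = m.index + m[0].length;
}
highlightedText += escapeHtml(text.slice(lastIndex));
resultDiv.innerHTML = highlightedText;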
} catch (e) {
resultDiv.textContent = 'Invalid regular expression pattern: ' + e.message;
}
}
// Also run on initial load
document.addEventListener('DOMContentLoaded', highlightMatches);
</script>
</body>
</html>
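If you would rather try patterns against Python's own `re` engine, a similar checker can be sketched in a few lines for the terminal. This is only an illustrative equivalent, not part of the HTML sample: `highlight` is a hypothetical function name, and the `[[ ]]` markers stand in for the colored highlight.

```python
import re


def highlight(pattern, text, flags=re.IGNORECASE):
    """Wrap every match of `pattern` in [[ ]] markers."""
    # \g<0> in the replacement refers to the whole match
    return re.sub(pattern, r"[[\g<0>]]", text, flags=flags)


sample = "Please call me at 090-1234-5678. The event is on 2025/07/26."
marked = highlight(r"\d+", sample)
print(marked)
# Please call me at [[090]]-[[1234]]-[[5678]]. The event is on [[2025]]/[[07]]/[[26]].
```

The `flags=re.IGNORECASE` default mirrors the `i` flag that the browser version applies; pass `flags=0` for case-sensitive matching.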
Tips and Tricks to Stand Out!
Finally, let's cover some techniques for using regular expressions more effectively and the common pitfalls that beginners often fall into.
Greedy vs. Non-Greedy Matching
By default, regex quantifiers like `*` and `+` behave in a "Greedy" manner. This means they try to match the longest possible string.
For example, suppose you want to match just the first tag, `<p>`, in a piece of HTML. What happens if you use the pattern `<.*>` on a string like `<p>first paragraph</p><p>second paragraph</p>`?
```python
import re

html = "<p>first paragraph</p><p>second paragraph</p>"

# Greedy Match
greedy_match = re.search(r"<.*>", html)
print(f"Greedy match: {greedy_match.group()}")

# Non-Greedy Match: add ? after *
non_greedy_match = re.search(r"<.*?>", html)
print(f"Non-Greedy match: {non_greedy_match.group()}")

# Execution Result:
# Greedy match: <p>first paragraph</p><p>second paragraph</p>
# Non-Greedy match: <p>
```
With the greedy match, it captured everything from the first `<` to the very last `>`. On the other hand, `*?`, which has a `?` after the `*`, is "Non-Greedy," meaning it tries to match the shortest possible string. This allows the match to end as soon as the first `>` is encountered. This difference is crucial when parsing HTML or XML.
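The same idea applies when you want the text between the tags rather than the tags themselves. The following sketch combines a group with greedy and non-greedy quantifiers, and also shows a negated character class (`[^>]*`) as a common alternative to `.*?`. It is meant for simple, well-formed snippets like this one, not as a general HTML parser.

```python
import re

html = "<p>first paragraph</p><p>second paragraph</p>"

# Greedy: a single match that runs all the way to the last </p>
greedy = re.findall(r"<p>(.*)</p>", html)
print(greedy)
# ['first paragraph</p><p>second paragraph']

# Non-greedy: each match stops at the nearest </p>
lazy = re.findall(r"<p>(.*?)</p>", html)
print(lazy)
# ['first paragraph', 'second paragraph']

# Negated character class: "anything except >" cannot run past the end of a tag
tags = re.findall(r"<[^>]*>", html)
print(tags)
# ['<p>', '</p>', '<p>', '</p>']
```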
Improve Performance with `re.compile()`
If you use the same regular expression pattern repeatedly in your program, consider "compiling" the pattern beforehand using `re.compile()`.
The `re` module already caches recently used patterns internally, so the speedup is usually modest, but a compiled pattern skips the cache lookup on every call, which can help in tight loops. The compiled pattern object has methods like `search()` and `findall()`.
```python
import re

# Compile the email address pattern
email_pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Search using the compiled object
text1 = "The contact is a@b.com."
match1 = email_pattern.search(text1)
if match1:
    print(f"Found in Text1: {match1.group()}")

text2 = "The support desk is support@example.co.jp."
match2 = email_pattern.search(text2)
if match2:
    print(f"Found in Text2: {match2.group()}")

# Execution Result:
# Found in Text1: a@b.com
# Found in Text2: support@example.co.jp
```
It also has the benefit of organizing your code and making it more readable.
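`re.compile()` is also a convenient place to attach flags. As a sketch of the readability benefit, the example below uses `re.VERBOSE`, which lets you spread a pattern over several lines with comments, and `re.IGNORECASE` for case-insensitive matching. The variable names and sample strings are assumptions made for this illustration.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments
phone_pattern = re.compile(
    r"""
    (\d{2,4})   # area code or mobile prefix
    -
    (\d{2,4})   # middle block
    -
    (\d{4})     # last four digits
    """,
    re.VERBOSE,
)

match = phone_pattern.search("Please call me at 090-1234-5678.")
print(match.groups())  # ('090', '1234', '5678')

# re.IGNORECASE makes the pattern match regardless of letter case
python_pattern = re.compile(r"python", re.IGNORECASE)
found = python_pattern.findall("Python, PYTHON and python")
print(found)  # ['Python', 'PYTHON', 'python']
```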
Summary
In this article, we introduced the basics of regular expressions using Python's `re` module. It might seem a bit tricky at first, but once you discover its power, you won't be able to imagine text processing without regular expressions.
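- `re.search()`: Finds the first occurrence of a pattern in a string.
- `re.match()`: Checks if the beginning of a string matches a pattern.
- `re.findall()`: Gets all occurrences of a pattern as a list.
- `re.finditer()`: Gets all occurrences of a pattern as an iterator of match objects.
- `re.sub()`: Replaces parts of a string that match a pattern.
- `re.compile()`: Compiles a pattern in advance so it can be reused.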
By combining these basics and mastering metacharacters, your data processing skills will improve dramatically. Please feel free to copy and paste the code from this article and try out various patterns to get comfortable with regular expressions!
Next Steps
Now that you can freely handle text data with regular expressions, why not try your hand at file operations next? CSV files, in particular, are an important format used in many web applications and data analysis. Master how to read and write CSV files using Python's `csv` module in the following article.