Extracting Matched Patterns with Awk
awk is a powerful text-processing tool commonly used in Unix-like operating systems. Beyond simple text filtering and manipulation, awk excels at pattern matching and extracting specific portions of text that conform to a given pattern. This tutorial focuses on how to use awk to identify and print only the matched portion of a pattern within a file.
Understanding the Basics
At its core, awk operates by reading a file line by line and applying a set of rules to each line. Each rule consists of a pattern and an action. If the pattern matches the current line, the associated action is executed.
The general syntax is:
awk 'pattern { action }' file
Where:
pattern is a regular expression (or other condition) to match.
action is a set of commands to execute when the pattern is matched.
file is the input file to process.
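As a small illustration of the pattern { action } form, the sketch below uses a numeric comparison rather than a regular expression as the pattern. The sample input (fruit names with counts) and the variable name big are made up for this example.

```shell
# Hypothetical sample data: a name and a count on each line.
# Pattern: $2 > 5 (second field greater than 5); action: print the first field.
big=$(printf 'apple 3\nbanana 12\ncherry 7\n' | awk '$2 > 5 { print $1 }')
printf '%s\n' "$big"
```

Only the lines whose second field exceeds 5 trigger the action, so the names banana and cherry are printed.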
Simple Pattern Matching and Printing the Entire Line
The most basic form of pattern matching in awk involves searching for a literal string or a simple regular expression. When a match is found, you can print the entire line using the print $0 action.
awk '/pattern/ { print $0 }' file
This command searches for lines containing the string "pattern" and prints the entire line if a match is found. While functional, this often isn’t what you need when you only want to extract the matched portion of the text.
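To make the limitation concrete, here is a minimal sketch that pipes three made-up sample lines into the same kind of command instead of reading a file. The variable name lines is only for illustration.

```shell
# Every line containing "yyy" is printed in full, including the text around the match.
lines=$(printf 'xxx yyy zzz\nabc def ghi\nyyy pqr stu\n' | awk '/yyy/ { print $0 }')
printf '%s\n' "$lines"
```

The first and third lines are returned whole, which is more than the matched text itself.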
Extracting the Matched Portion: The match() Function
awk provides the match() function, which is designed for locating matched substrings. This function searches for a regular expression within a string and, if found, stores information about the match in built-in variables.
The syntax is:
match(string, regexp [, array])
string: The string to search within, usually $0 (the current line).
regexp: The regular expression to search for.
array (optional, GNU awk only): An array to store information about the match. If provided, the first element (array[0]) will contain the entire matched string. If the regular expression contains capturing groups (parts enclosed in parentheses), subsequent elements (array[1], array[2], etc.) will contain the matched substrings for each capturing group. The third argument is a gawk extension and is not part of POSIX awk.
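Because match() returns the position where the match starts (or 0 if there is no match), it can serve as the pattern itself. Here’s how to use match() to extract the matched portion: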
awk 'match($0, /regex/) { print substr($0, RSTART, RLENGTH) }' file
RSTART: This built-in variable stores the starting position of the match within the string. Positions are counted from 1, and RSTART is 0 when there is no match.
RLENGTH: This built-in variable stores the length of the matched substring. It is -1 when there is no match.
substr(string, start, length): This awk function extracts a substring of a specified length from a given string, starting at a given position.
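To see what these variables hold, the following sketch prints RSTART and RLENGTH alongside the extracted text. The input line and the variable name info are invented for illustration.

```shell
# Find the first run of digits and report where it starts and how long it is.
info=$(echo 'error code 404 found' |
  awk 'match($0, /[0-9]+/) { print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }')
echo "$info"   # 12 3 404
```

The digits begin at character 12 and span 3 characters, so substr($0, 12, 3) returns 404.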
Example:
Let’s say you have a file data.txt containing the following:
xxx yyy zzz
abc def ghi
yyy pqr stu
And you want to extract only the occurrences of "yyy". You can use the following awk command:
awk 'match($0, /yyy/) { print substr($0, RSTART, RLENGTH) }' data.txt
This will output:
yyy
yyy
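Note that match() reports only the leftmost match in the string it is given, so the command above prints at most one result per line. If a line may contain several matches, one possible approach (a sketch, not the only way) is to search again in the remainder of the line. The variable names rest and all and the sample input are assumptions for this example, and the loop relies on the regular expression never matching an empty string.

```shell
# Repeatedly search whatever is left of the line after the previous match.
all=$(echo 'a1 b22 c333' | awk '{
  rest = $0
  while (match(rest, /[0-9]+/)) {
    print substr(rest, RSTART, RLENGTH)      # the matched text
    rest = substr(rest, RSTART + RLENGTH)    # keep only what follows the match
  }
}')
printf '%s\n' "$all"
```

Each run of digits is printed on its own line: 1, then 22, then 333.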
Using Capturing Groups for More Complex Extraction
If you need to extract specific parts of the matched string, you can use capturing groups within your regular expression.
Example:
Suppose you have a file with lines like:
name=Alice
age=30
city=New York
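And you want to extract the value associated with each key. Note that $1, $2, and so on are the whitespace-separated fields of the line, not capturing groups, so print $2 would not print the value here. The captured text is available through the array that GNU awk's three-argument match() fills in:
gawk 'match($0, /([^=]+)=([^ ]+)/, m) { print m[2] }' data.txt
The regular expression ([^=]+)=([^ ]+) matches:
([^=]+): One or more characters that are not equal signs (=). This is the first capturing group (representing the key).
=: The equal sign.
([^ ]+): One or more characters that are not spaces. This is the second capturing group (representing the value).
The print m[2] action prints the contents of the second capturing group (the value), effectively extracting the value associated with each key. Because that group stops at the first space, the line city=New York yields only New; use (.*) as the second group if values may contain spaces.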
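Capturing groups are not available to match() in POSIX awk, so on systems without gawk a similar result can be approximated with RSTART and RLENGTH alone. The sketch below is one such workaround under that assumption: it matches the equal sign together with everything after it, then skips the equal sign when extracting. Unlike the ([^ ]+) group above, it keeps spaces in the value. The variable name values is only for illustration.

```shell
# Match "=" plus the rest of the line, then drop the leading "=" with RSTART + 1.
values=$(printf 'name=Alice\nage=30\ncity=New York\n' |
  awk 'match($0, /=.*/) { print substr($0, RSTART + 1, RLENGTH - 1) }')
printf '%s\n' "$values"
```

This prints Alice, 30, and New York, one per line.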
GNU Awk Specific Features
GNU awk (often referred to as gawk) offers more advanced features. In particular, you can directly access the matched substrings within an array when using the match() function with three arguments.
gawk 'match($0, /regex/, a) { print a[0] }' file
In this case, a[0] will contain the entire matched string, while a[1], a[2], etc. will contain the substrings corresponding to any capturing groups in the regular expression. This can simplify your code and make it more readable.
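As a rough illustration of the array form, the sketch below prints both capturing groups for some made-up key=value lines. It assumes gawk is installed under that name and skips the example otherwise; the array name m and the variable name pairs are arbitrary.

```shell
# Requires GNU awk: the three-argument match() is a gawk extension.
if command -v gawk >/dev/null 2>&1; then
  pairs=$(printf 'name=Alice\ncity=New York\n' |
    gawk 'match($0, /([^=]+)=(.*)/, m) { print m[1] " -> " m[2] }')
  printf '%s\n' "$pairs"
else
  echo "gawk not installed; skipping example" >&2
fi
```

Here m[1] holds the key and m[2] holds the value, so no substr() arithmetic is needed.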