Working with Text Nodes in XPath

XPath is a powerful language for navigating XML documents. A common task is searching for elements containing specific text. However, understanding how XPath handles text nodes – the actual textual content within XML elements – is crucial for writing effective queries. This tutorial will explain how XPath interacts with text nodes and demonstrate how to correctly search for elements based on their textual content.

Understanding Text Nodes

XML documents are structured as a tree of nodes. Elements are nodes, attributes are nodes, and, importantly, the text within elements is represented as text nodes. An element can have several text nodes as children. This happens whenever an element has mixed content, that is, text interleaved with child elements.

For example, consider the following XML:

<Home>
    <Addr>
        <Street>ABC</Street>
        <Number>5</Number>
        <Comment>BLAH BLAH BLAH <br/> <br/> ABC</Comment>
    </Addr>
</Home>

In this example, the <Comment> element doesn’t simply contain the string "BLAH BLAH BLAH ABC". Instead, its children form a sequence of nodes: a text node with "BLAH BLAH BLAH ", a <br/> element node, a whitespace-only text node (the single space between the two tags), another <br/> element node, and finally a text node with " ABC".
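The node sequence can be inspected directly. The following is a minimal sketch using Python’s standard xml.dom.minidom parser (an assumption made for illustration; any DOM parser that preserves whitespace should report the same children). The helper name describe is hypothetical. Note that the parser also reports the single space between the two <br/> elements as a text node of its own.

```python
from xml.dom import minidom

XML = """<Home>
    <Addr>
        <Street>ABC</Street>
        <Number>5</Number>
        <Comment>BLAH BLAH BLAH <br/> <br/> ABC</Comment>
    </Addr>
</Home>"""

doc = minidom.parseString(XML)
comment = doc.getElementsByTagName("Comment")[0]

def describe(node):
    """Return (kind, value) for a child node: text content or element name."""
    if node.nodeType == node.TEXT_NODE:
        return ("text", node.data)
    return ("element", node.nodeName)

# The children of <Comment>, in document order.
children = [describe(child) for child in comment.childNodes]
for kind, value in children:
    print(kind, repr(value))
```

The output alternates between text and element entries, which is the mixed-content structure that the XPath expressions in the rest of this tutorial have to deal with.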

Basic Text Searching with `contains()`

The contains() function in XPath checks if a string contains a specific substring. It’s tempting to use it directly with text() to find elements containing certain text:

//*[contains(text(), 'ABC')]

This query appears to select all elements containing the string "ABC". However, its behavior can be subtle and sometimes unexpected. The reason is how contains() interacts with node-sets returned by text().

Inside the predicate, text() returns a node-set containing all text nodes that are children of the current element. Because contains() expects string arguments, XPath converts this node-set to a string, and the conversion rule determines which text node actually takes part in the comparison.

In XPath 1.0, a node-set is converted to a string by taking the string value of the node that comes first in document order; all other nodes are ignored. This means that if the first text node doesn’t contain the substring you’re looking for, the element won’t be selected, even if a later text node of that element does contain it. In the sample XML, <Comment> is not selected for this reason: its first text node is "BLAH BLAH BLAH ".
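The first-node rule can be mimicked in a few lines of Python. This is only a hand-written emulation of the XPath 1.0 conversion using the standard xml.dom.minidom module, not an XPath engine; the helper names text_children and contains_first_text are hypothetical.

```python
from xml.dom import minidom

XML = """<Home>
    <Addr>
        <Street>ABC</Street>
        <Number>5</Number>
        <Comment>BLAH BLAH BLAH <br/> <br/> ABC</Comment>
    </Addr>
</Home>"""

def text_children(element):
    """Child text nodes of an element in document order (what text() selects)."""
    return [n for n in element.childNodes if n.nodeType == n.TEXT_NODE]

def contains_first_text(element, needle):
    """Emulate XPath 1.0 contains(text(), needle): only the first text node counts."""
    nodes = text_children(element)
    first = nodes[0].data if nodes else ""  # an empty node-set converts to ""
    return needle in first

doc = minidom.parseString(XML)
elements = doc.getElementsByTagName("*")  # every element, like //*

# Emulates //*[contains(text(), 'ABC')] under XPath 1.0 rules.
selected = [e.tagName for e in elements if contains_first_text(e, "ABC")]
print(selected)  # ['Street']
```

<Comment> is missing from the result because its first text node is "BLAH BLAH BLAH "; the later text node that does contain "ABC" is never examined.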

In XPath 2.0 and later, passing a sequence of more than one item to contains() raises a type error instead of silently using the first item (unless the processor runs in XPath 1.0 compatibility mode). This is a deliberate change to address the counter-intuitive behavior of XPath 1.0.

Correctly Searching Every Text Node Child

To reliably find elements containing specific text, regardless of how many text nodes they have, you need to test each text node child individually instead of passing the whole node-set to contains(). The most common and effective approach is the following XPath expression:

//*[text()[contains(., 'ABC')]]

Let’s break this down:

//*: Selects all elements in the document.
[text()[...]]: This is a predicate that filters the elements. text() selects all text node children of the current element, and the element is kept only if at least one of those text nodes passes the inner predicate.
[contains(., 'ABC')]: This is the inner predicate. It filters the text nodes selected by text(), not the elements.
- . represents the text node currently being tested.
- contains(., 'ABC') checks if that text node’s string value contains "ABC".

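This approach ensures that contains() is applied to each text node child individually, so an element is selected if any one of its text nodes contains the substring. A substring that is split across two text nodes (for example, by an intervening element) is not matched.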
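The per-node test can be sketched in Python as well. As before, this is an emulation built on the standard xml.dom.minidom module under the assumption that it mirrors the XPath semantics described above; it is not an XPath engine, and has_matching_text_child is a hypothetical helper name.

```python
from xml.dom import minidom

XML = """<Home>
    <Addr>
        <Street>ABC</Street>
        <Number>5</Number>
        <Comment>BLAH BLAH BLAH <br/> <br/> ABC</Comment>
    </Addr>
</Home>"""

def has_matching_text_child(element, needle):
    """Emulate text()[contains(., needle)]: true if any child text node matches."""
    return any(
        needle in child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )

doc = minidom.parseString(XML)
elements = doc.getElementsByTagName("*")  # every element, like //*

# Emulates //*[text()[contains(., 'ABC')]].
selected = [e.tagName for e in elements if has_matching_text_child(e, "ABC")]
print(selected)  # ['Street', 'Comment']
```

Because every text node child is tested on its own, the position of the matching text node within the element no longer matters.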

Searching Only Direct Text Node Children

The expression above already restricts the search to the direct text node children of an element, because text() never selects text inside nested elements. No change is needed:

//*[text()[contains(., 'ABC')]]

By contrast, //*[contains(., 'ABC')] tests each element’s full string value, which is the concatenation of all its descendant text nodes. With the sample XML it would also select the ancestors <Home> and <Addr>.
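To see why restricting the test to direct text node children matters, the sketch below compares it with a test on the element’s whole string value (all descendant text concatenated), which is what contains(., 'ABC') applied to the element itself would check. It is again an emulation on top of Python’s standard xml.dom.minidom module rather than a real XPath evaluation; string_value and has_matching_text_child are hypothetical helper names.

```python
from xml.dom import minidom

XML = """<Home>
    <Addr>
        <Street>ABC</Street>
        <Number>5</Number>
        <Comment>BLAH BLAH BLAH <br/> <br/> ABC</Comment>
    </Addr>
</Home>"""

def string_value(node):
    """Concatenate all descendant text, like the XPath string value of an element."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(string_value(child) for child in node.childNodes)

def has_matching_text_child(element, needle):
    """True if any direct text node child contains needle."""
    return any(
        needle in child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )

doc = minidom.parseString(XML)
elements = doc.getElementsByTagName("*")  # every element, like //*

# Direct text node children only: emulates //*[text()[contains(., 'ABC')]].
direct_only = [e.tagName for e in elements if has_matching_text_child(e, "ABC")]
# Whole string value: emulates //*[contains(., 'ABC')].
whole_subtree = [e.tagName for e in elements if "ABC" in string_value(e)]

print(direct_only)    # ['Street', 'Comment']
print(whole_subtree)  # ['Home', 'Addr', 'Street', 'Comment']
```

The second list includes <Home> and <Addr> only because their descendants contain "ABC"; their own text node children are pure whitespace.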

Example

Using the sample XML from earlier, the XPath query //*[text()[contains(., 'ABC')]] will correctly select both the <Street> and <Comment> elements. The <Street> element has a single text node, "ABC". The <Comment> element is selected because its last text node, " ABC", contains the substring, even though the element also contains <br/> elements. <Home> and <Addr> are not selected, because their own text node children consist only of whitespace.