Home/Blog/XPath for Developers: The Complete Guide to XML and HTML Selection
🔍
xpath tester onlinexpath tutorialxpath expressions explained

XPath for Developers: The Complete Guide to XML and HTML Selection

Master XPath expressions for web scraping, browser automation, and XML parsing. Covers axes, predicates, functions, and real-world patterns with examples for Selenium, Python, and XSLT.

May 18, 20268 min readby ToolNinja

XPath Fundamentals

XPath (XML Path Language) is the standard query language for selecting nodes from XML and HTML document trees. It's the engine behind XSLT transformations, Selenium locators, web scraping libraries like lxml, and XML parsing in virtually every programming language.

The Document Tree Model

XPath treats every document as a tree of nodes:

Node TypeExampleXPath to select
Element<h1>Title</h1>//h1
Attributeclass="nav"//@class or //div/@class
TextThe text inside a tag//h1/text()
Comment<!-- note -->//comment()
Processing instruction<?xml version="1.0"?>//processing-instruction()
RootThe document itself/

Basic Path Expressions

/root                    # root element
/root/child              # direct child
//element               # element anywhere in document
/root/child[1]           # first child (1-indexed, not 0)
/root/*                  # all child elements
/root/child/@attr        # attribute value
//element/text()         # text content of element

The Seven XPath Axes

Axes are the most powerful part of XPath — they define the direction you travel through the tree. Understanding axes is what separates basic XPath from expert-level selectors.

AxisSelectsExample
child::Direct childrenchild::div (same as div)
parent::Parent elementparent::div
ancestor::All ancestorsancestor::table
descendant::All descendantsdescendant::td
following-sibling::Siblings after currentfollowing-sibling::li
preceding-sibling::Siblings before currentpreceding-sibling::li
self::Current nodeself::div
ancestor-or-self::Ancestors including selfancestor-or-self::form

Practical Axis Examples

# Find the table that contains a specific cell
//td[text()='Total']/ancestor::table

# Get the label for a form input (the label before it)
//input[@id='email']/preceding-sibling::label

# All list items after the active one
//li[@class='active']/following-sibling::li

# The parent form of a submit button
//button[@type='submit']/parent::form

Predicates

Predicates narrow down which nodes to select. They appear in square brackets after a node test.

Position Predicates

//li[1]                  # first li
//li[last()]             # last li
//li[last()-1]           # second to last
//li[position() < 4]     # first three lis
//tr[position() mod 2=0] # even rows

Attribute Predicates

//div[@class='active']         # exact class match
//a[@href]                     # any a with an href
//input[@type='checkbox']      # checkboxes
//div[@data-id]                # has data-id attribute
//img[not(@alt)]               # images missing alt text

Text Predicates

//button[text()='Submit']                    # exact text
//p[contains(text(), 'error')]               # partial text
//h2[starts-with(text(), 'Chapter')]         # text prefix
//td[normalize-space(text())='Active']       # ignores whitespace

Multi-Condition Predicates

//input[@type='text' and @required]          # and
//div[@class='note' or @class='warning']     # or
//tr[td[1]='Admin' and td[2]='Active']       # row where cols match

XPath Functions Reference

XPath 1.0 includes a set of built-in functions for strings, numbers, and node sets.

String Functions

FunctionDescriptionExample
contains(str, sub)True if str contains subcontains(@class, 'active')
starts-with(str, pre)True if str starts with prestarts-with(@id, 'btn-')
normalize-space(str)Strips leading/trailing whitespace, collapses internalnormalize-space(text())
string-length(str)Length of stringstring-length(@id) > 5
substring(str, start, len)Extract substringsubstring(@class, 1, 4)
translate(str, chars, replace)Character substitutionCase-insensitive matching
concat(s1, s2, ...)String concatenationconcat('http://', @href)

Number Functions

FunctionDescriptionExample
count(nodeset)Count of nodescount(//li)
sum(nodeset)Sum of node valuessum(//td[@class='price'])
round(n)Round to nearest integerround(3.7) = 4
floor(n)Round downfloor(3.9) = 3
ceiling(n)Round upceiling(3.1) = 4

Node Functions

FunctionDescriptionExample
name()Tag name of current nodename() = 'input'
local-name()Tag name without namespaceUseful for namespaced XML
position()Position in node setposition() = 1
last()Last positionposition() = last()
not(expr)Boolean negationnot(@disabled)

Web Scraping Patterns

These are the XPath patterns that come up most often in real scraping projects.

Extracting Structured Data

# All product names on a page
//h2[@class='product-title']/text()

# All prices (normalize strips whitespace)
//span[@class='price']/normalize-space(text())

# All links in main navigation
//nav//a/@href

# Table rows (skip header)
//table[@id='results']//tr[position() > 1]

# Second column of every data row
//tbody/tr/td[2]/text()

Finding Elements by Partial Match

# Elements where class contains a word (like CSS class contains)
//div[contains(concat(' ', @class, ' '), ' active ')]

# IDs that follow a pattern
//div[starts-with(@id, 'product-')]

# Links to external sites
//a[starts-with(@href, 'http') and not(contains(@href, 'mysite.com'))]

# Inputs of any type except hidden
//input[not(@type='hidden')]

Navigating Relationships

# The price that follows a product heading
//h3[text()='Widget Pro']/following-sibling::span[@class='price']

# The form that contains a specific button
//button[contains(text(),'Checkout')]/ancestor::form

# The row where the first cell says "Admin"
//tr[td[1][text()='Admin']]

# All cells in the same row as the selected cell
//td[text()='Target']/parent::tr/td

XPath in Browser Automation

Selenium (Python)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")

# Find by XPath
element = driver.find_element(By.XPATH, "//button[@type='submit']")
element.click()

# Wait for element
wait = WebDriverWait(driver, 10)
el = wait.until(EC.presence_of_element_located(
    (By.XPATH, "//div[@id='results']")
))

# Find multiple elements
rows = driver.find_elements(By.XPATH, "//table[@id='data']//tr")
for row in rows:
    cells = row.find_elements(By.XPATH, ".//td")
    print([c.text for c in cells])

Note the .//td — starting with . means "relative to the current element", not the document root.

Playwright (Python)

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    
    # XPath locator
    page.locator("xpath=//button[@type='submit']").click()
    
    # Get all matching texts
    titles = page.locator("xpath=//h2[@class='title']").all_text_contents()
    
    browser.close()

Python lxml (XML/HTML parsing)

from lxml import etree, html

# XML
tree = etree.fromstring(xml_string)
nodes = tree.xpath("//product[@active='true']/name/text()")

# HTML
doc = html.fromstring(html_string)
prices = doc.xpath("//span[@class='price']/text()")

# With namespace
ns = {"ns": "http://example.com/schema"}
results = tree.xpath("//ns:item", namespaces=ns)

JavaScript (Browser / Node)

// Browser: document.evaluate()
const result = document.evaluate(
  "//div[@class='price']",
  document,
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null
);

for (let i = 0; i < result.snapshotLength; i++) {
  console.log(result.snapshotItem(i).textContent);
}

// Single node
const node = document.evaluate(
  "//h1",
  document,
  null,
  XPathResult.FIRST_ORDERED_NODE_TYPE,
  null
).singleNodeValue;

XPath for XML with Namespaces

Working with namespaced XML (like SOAP, RSS, Atom, SVG) requires namespace declaration in your XPath context.

The Namespace Problem

<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>My Post</title>
  </entry>
</feed>

This XPath fails because title is in the Atom namespace:

//title           # finds nothing
//entry/title     # finds nothing

The Solution: Register Namespaces

# lxml — register namespace prefixes
ns = {"atom": "http://www.w3.org/2005/Atom"}
titles = tree.xpath("//atom:title/text()", namespaces=ns)

# Multiple namespaces
ns = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "ns1": "http://example.com/service"
}
body = tree.xpath("//soap:Body/ns1:GetResponse", namespaces=ns)

local-name() Workaround

When you can't register namespaces, use local-name() to ignore the namespace entirely:

//*[local-name()='title']           # any title regardless of namespace
//*[local-name()='Body']/*          # children of any Body element

This is a practical workaround but select all namespaces — use registered namespaces when possible for correctness.


XPath vs CSS Selectors

Knowing when to use each tool saves debugging time.

FeatureXPathCSS
Navigate up to parentYes (parent::, ancestor::)No
Select by text contentYes (text(), contains())No (in standard CSS)
Sibling traversalBoth directionsForward only (~, +)
Attribute selectionYes (@attr)Yes ([attr])
Position-based selectionYes ([1], [last()])Yes (:nth-child())
Works with XML namespacesYesNo
Browser supportUniversal (via evaluate())Universal
ReadabilityVerboseConcise
Performance in browsersSlightly slowerFaster

Rule of thumb: use CSS selectors for simple forward-only HTML selection in browsers. Use XPath when you need upward navigation, text content matching, namespace handling, or work with XML (not HTML).


XPath Cheat Sheet

Quick Reference

# Document root
/

# Any element anywhere
//*

# Element by tag
//div

# Element by ID
//div[@id='main']

# Element by class (exact)
//div[@class='container']

# Element by class (contains)
//div[contains(@class, 'btn')]

# Element by text
//button[text()='Submit']

# Element containing text
//p[contains(text(), 'error')]

# First / last / nth
//li[1]   //li[last()]   //li[3]

# All attributes
//@*

# Specific attribute value
//a/@href

# Parent
//span/parent::div

# Ancestor
//td/ancestor::table

# Following sibling
//dt/following-sibling::dd[1]

# Count
count(//input[@required])

# Exists check
boolean(//div[@id='error'])

Common Mistakes

Off-by-one errors — XPath is 1-indexed, not 0-indexed. //li[1] is the first element, //li[0] returns nothing.

Greedy descendant axis//tr searches the entire document every time. Inside a loop, use relative paths (.//td) relative to each row element.

Text node vs element content//h1/text() returns the text node as a string. //h1 returns the element node — use .text_content() in lxml or .textContent in JavaScript to get the text.

Whitespace in text() — HTML frequently includes leading/trailing whitespace in text nodes. Use normalize-space(text())='Target' instead of text()='Target' for reliable matching.

Namespace forgetting — A namespace declaration on the root element affects all descendants. If your XPath finds nothing on namespaced XML, that's almost always the cause.

Dynamic attributes — Generated class names or IDs change on every build. Prefer stable semantic attributes like data-testid, name, role, or structural positions over dynamic ones.

Share:𝕏 Twitterin LinkedIn

Frequently Asked Questions

What is the difference between / and // in XPath?

A single / selects a direct child — /root/child means child must be a direct child of root. Double // selects descendants at any depth — //child finds all child elements anywhere in the document tree.

How do I select an element by its text content in XPath?

Use the text() node test with a predicate: //h1[text()='Welcome'] selects h1 elements whose text is exactly 'Welcome'. For partial matches use contains(): //h1[contains(text(), 'Welcome')].

Why does my XPath work in the browser console but not in Selenium?

Selenium uses XPath 1.0 evaluated against the live DOM. Common causes of mismatch: dynamic content loaded after page load, iframes that require switching context, shadow DOM elements, or namespace differences in XML vs HTML mode.

What is the difference between XPath and CSS selectors?

CSS selectors can only traverse downward (parent to child). XPath navigates in any direction — upward to parents and ancestors, sideways to siblings, and downward to children. XPath also supports text content matching and works natively with XML documents.

🥷 ToolNinja