Regular Expressions

Regular expressions are central to using Emacs and Elisp.

Major Topic Index

  1. Backslashing
  2. Anchoring
  3. Taming (rx and other tools)
  4. Query replacing
  5. Restoring (the match data)

Other Material on Elisp Regular Expressions

Elisp regular expressions are well covered in the Elisp info pages, so as usual "read the info pages" is sound advice. In the Elisp info pages, look for the topic regexp.

The emacswiki.org page Regular Expressions gives a nice summary of the ins and outs of Elisp regular expressions.

Nomenclature

In Elisp, regular expressions do not have a special type; they are simply placed in strings. Below I may use the term regex string informally to indicate a string value intended to be used as a regex.

Abbreviations for "Regular Expression"

Emacs/Elisp documentation and function names usually abbreviate "regular expression" as "regexp" (but occasionally "re", as in re-search-forward).
Others (including myself) prefer "regex" over "regexp", as being slightly shorter and more pronounceable.

Outside of Elisp, I would like "rx" even better as a short abbreviation. In Elisp however, rx indicates an alternative representation of regular expressions, using s-expressions instead of strings. So if your Elisp regular expressions are in string form, I would suggest "regex" or perhaps "regexp".

To abbreviate the plural form "regular expressions": emacs uses "regexps", I use "regexs", and a few use "regexen" (à la oxen, children; and of course emacsen, meaning either emacs or xemacs which was an active alternative fork of emacs).

'string' Notation to Denote Unprocessed Literal Strings

In the text and code comments of my explanations below, I adopt the single quote notation 'string' as in Bash or Perl.
So for example "\n" denotes the string holding the newline character, but '\n' denotes the length 2 string consisting of \ followed by n.

Note this is completely different than the use of the single quote character, as in 'y in Elisp code, which is shorthand for (quote y) and serves to protect literal symbols from evaluation.
As in:

(set 'x 17);  x ← 17
(set 'x 'y);  x ← 'y
(set  x 17);  y ← 17
Although somewhat string-like, symbol names are a different type of thing and not normally interpreted as regular expressions.
In summary, the single quotes in '\n' below are just my notation for explanation, not Elisp code!

Frustration when using Elisp regexs

Whatever you call them, emacs/Elisp regular expressions can be frustrating to use for the following reasons.
  1. Idiosyncratic syntax
  2. String literal regexs frequently require multiple backslashes

Fortunately, Elisp also provides tools to ameliorate these problems. Below I elaborate and try to give some helpful hints which make using regular expressions in Elisp okay.

Idiosyncratic Syntax

If you have done any computer programming (on *nix systems at least) you probably are familiar with Perl Compatible Regular Expressions (PCRE), or something similar. PCRE is of course not limited to Perl, but is now adopted (more or less fully) by many programming languages and software tools. The Elisp regex syntax, however, was designed long before Perl brought regexs to the masses in the 1990s, and it is unfortunately quite different from PCRE.

For example:

;;  Match "cat" or "dog" and save the match for future reference
PCRE      (cat|dog)
Elisp   \(cat\|dog\)

;;  Match  (𝚍𝚎𝚏𝚞𝚗)  or  (𝚍𝚎𝚏𝚖𝚊𝚌𝚛𝚘)
;;  One way: both alternatives include parens to be matched (grouping not involved).
PCRE     \(defun\)|\(defmacro\)     ;;   '(', ')' are special unless escaped.
Elisp     (defun)\|(defmacro)       ;;   '|' needs to be escaped to be special.

#  Another way: place alternative inside a group.
PCRE      \((defun|defmacro)\)      ;; unescaped parens   (  )  indicate a group
Elisp      (\(defun\|defmacro\))    ;;   escaped parens  \( \)  indicate a group

#  Third way.  When matching will save "un" or "macro" in group 1.
PCRE      \(def(un|macro)\)
Elisp      (def\(un\|macro\))

#  The same, except does not save the group.
PCRE      \(def(?:un|macro)\)
Elisp      (def\(?:un\|macro\))
Not surprisingly perhaps, the Elisp regex syntax is optimized for matching literal parens. This can be nice when looking for patterns in lisp code, but overall Elisp regular expressions tend to require more escaping with '\' than PCREs do.
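
As a minimal sketch, here is the "third way" above used from Elisp code. The helper name demo-def-kind is made up for illustration; and note the backslashes are doubled because the regex is written as a string literal, which is the topic of the next section.

```elisp
;; Hypothetical helper, not part of Emacs or any library.
(defun demo-def-kind (text)
  "Return \"un\" or \"macro\" if TEXT contains (defun) or (defmacro), else nil."
  (save-match-data
    ;; After string reading the regex is:  (def\(un\|macro\))
    ;; Plain parens match literally; \( \) form capture group 1.
    (when (string-match "(def\\(un\\|macro\\))" text)
      (match-string 1 text))))

(demo-def-kind "see (defmacro) here")   ; "macro"
(demo-def-kind "defun without parens")  ; nil
```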

String literal regexs frequently require multiple backslashes

Languages such as Perl provide special syntax for literal regular expressions, but the Elisp reader only knows about string literals.
(Here "reader" means the parser routines which translate Elisp source code into an internal representation).

And unfortunately the backslash character '\' also has a special meaning in string literals. For example, "\n" indicates the newline character (ascii code 10).
So to represent the length 2 string consisting of the characters '\' and 'n', one needs to use an additional backslash: "\\n".
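
A quick way to convince yourself of what the reader produced from a literal is to look at lengths and compare against strings built character by character. A small sketch using only builtin functions:

```elisp
;; Lengths show what the reader produced from each literal.
(length "\n")                     ; 1: a single newline character
(length "\\n")                    ; 2: a backslash followed by n

;; The same strings built explicitly from characters.
(string= "\n"  (string 10))       ; t   (10 is the newline character)
(string= "\\n" (string ?\\ ?n))   ; t
```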

Elisp literal regular expressions are string literals interpreted after string backslash substitution. So for example,

;; the following two are equivalent
(string-match   "(cat|dog)"  var);  string evaluates to:  (cat|dog)
(string-match "\(cat\|dog\)" var);    also evaluates to:  (cat|dog)
;; Both match only the length 9 string '(cat|dog)'

;; To match 𝚌𝚊𝚝 or 𝚍𝚘𝚐 and save the match use \\
(string-match "\\(cat\\|dog\\)" var);  string evaluates to:  \(cat\|dog\)
;;

Regexs from literal Elisp strings sometimes need \\\\

This happens when trying to match a literal backslash with a regular expression, for example to parse TeX documents.
;; Trying to use regex to search for '\' followed by a letter in [a-z]
;; e.g. strings such as: '\a', '\b', ... '\z'.

;;              string literal      regex          matches
(re-search-forward     "[a-z]");    [a-z]       any 1 char in a-z
(re-search-forward    "\[a-z]");    [a-z]       any 1 char in a-z
(re-search-forward   "\\[a-z]");   \[a-z]        string '[a-z]'
(re-search-forward  "\\\[a-z]");   \[a-z]        string '[a-z]'
(re-search-forward "\\\\[a-z]");  \\[a-z]        strings '\a', '\b', ..., '\z'     voilà
Note the perhaps surprising equivalences:

(string= "\\[a-z]" "\\\[a-z]");  t   both are read as: '\[a-z]'
(string= "\\["     "\\\[");      t   both are read as: '\['
So when using a string literal in Elisp code, to indicate matching a backslash character '\', one needs to use "\\\\".
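
The same point as a runnable sketch, using string-match-p on strings rather than searching a buffer. The function name demo-tex-command-index is hypothetical. The last line shows that regexp-quote can do the backslash counting for you.

```elisp
;; Hypothetical helper for illustration only.
(defun demo-tex-command-index (text)
  "Return the index in TEXT of the first backslash followed by a-z, or nil."
  (let ((case-fold-search nil))          ; keep [a-z] lower case only
    ;; Literal "\\\\[a-z]" is read as the regex  \\[a-z]
    (string-match-p "\\\\[a-z]" text)))

(demo-tex-command-index "see \\alpha")   ; 4
(demo-tex-command-index "see [a-z]")     ; nil

;; A regex matching one backslash consists of two backslashes.
(string= (regexp-quote "\\") "\\\\")     ; t
```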

Commands don't need so many backslashes

As seen above, when using the function re-search-forward in an Elisp code literal string, one needs to use "\\\\" to obtain a regular expression matching '\'.
This is also true when the code is input in the minibuffer via the eval-expression command.

However, re-search-forward can also be invoked as a command, either when bound to a key or via execute-extended-command (by default bound to M-x), in which case it prompts the user to enter a regular expression in the minibuffer. It turns out that the text typed into the minibuffer is passed to re-search-forward without performing string literal backslash escaping.

So even though "|" and "\|" are read the same in Elisp source code:

(string= "|" "\|");   --> t
When pressing M-x and typing re-search-forward to invoke it as a command we have
#USER ENTERS      Matches at
|                 next '|' in buffer
\|                anywhere
\\|               next '\|' in buffer
The regex '\|' means: empty regex OR empty regex, so it matches anywhere.

For an example involving searching for newlines, compare:

#  When invoked as a plain function using M-:
#USER ENTERS               string read as   Cursor moves to
(re-search-forward     "n")    n             n
(re-search-forward    "\n")    ␀             ␀
(re-search-forward   "\\n")    \n            n
(re-search-forward  "\\\n")    \␀            ␀
(re-search-forward "\\\\n")   \\n            \ n
#
#  When invoked as a command
#  after pressing M-x and entering re-search-forward
#USER ENTERS            Cursor moves to
n                       n
\n                      n
\\n                     \n
\\\n                    \n
\\\\n                   \\n
Here I use the Unicode character ␀ as a way to denote the newline character (ascii code 10).

The difference between the way the argument to re-search-forward looks as a literal string in code
versus the string read from the minibuffer (via the command re-search-forward, ultimately by the builtin function read-from-minibuffer),
is that the Elisp reader performs backslash substitution on string literals, but read-from-minibuffer does not.

So entering \n yields the length two string '\n' instead of a newline character. To search for a newline, just enter a newline.
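
The difference can be imitated in code by building the two-character string explicitly, which is roughly what the command receives when a user types \n in the minibuffer. A small sketch (the variable name demo-text is made up):

```elisp
(defvar demo-text "line one\nline two"
  "Hypothetical sample text: two lines separated by a newline.")

;; The literal "\n" is read as a newline character, so it finds the newline.
(string-match-p "\n" demo-text)              ; 8

;; The two characters \ and n form a regex which matches a plain n.
(string-match-p (string ?\\ ?n) demo-text)   ; 2, the n in "line"
```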

Great, but how can I enter a newline in the minibuffer?

Simply pressing the return key won't work because it enters what you have input so far.

Under default keybindings, the straightforward way to do this is to press C-q C-j (when entering a regex in the minibuffer), which should insert a newline in the minibuffer. This is all a casual emacs user would need to know. But you are on the road to becoming an intense emacs user!
The details are that C-q invokes the command quoted-insert, which uses the builtin function read-char, which in turn interprets C-j as the number 10; and 10 is the ascii (and utf8) code for the newline character. (Consulting an ascii table, one can see that "j" relates to 10 because j is the 10th letter of the Roman alphabet.)
Another way to insert a newline into the minibuffer is to press C-o to invoke the command open-line.

A general principle here is that the minibuffer can be edited more or less like any other buffer, the main limitation being that only commands bound to keys can be invoked conveniently (by default one cannot simply use execute-extended-command or eval-expression when already in the minibuffer; see the variable enable-recursive-minibuffers).

Regex Character Classes Need a Total of Four Brackets, as in [[:space:]]

Elisp offers some precanned character classes, such as [:space:] which holds white space characters.
Note the brackets here are part of the name of the character class, and do not serve as the brackets used to indicate character alternatives in regexs.
For example:
(re-search-forward "[:space:]");       Like PCRE  [:aceps]
(re-search-forward "[[:space:]]");     Looks for white space.  Probably what you want.
(re-search-forward "[e[:space:]]fun")  Looks for "efun" or "fun" preceded by white space.
(re-search-forward "[^[:space:]]");    Looks for anything but white space.
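
The same comparisons can be tried on strings with string-match-p, which makes the results easy to see. A small sketch; the return values are match positions:

```elisp
;; Four brackets: the character class of white space.
(string-match-p "[[:space:]]"  "cat dog")   ; 3, the space
;; Two brackets: just the characters : s p a c e
(string-match-p "[:space:]"    "cat dog")   ; 0, the letter c
(string-match-p "[:space:]"    "x y")       ; nil, the space is not in the set
;; Negated class: first character which is not white space.
(string-match-p "[^[:space:]]" "  x")       ; 2
```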

Regex Anchors

Regular expression anchors are constructs which either match the empty string or do not match at all, depending on some contextual information.
Elisp regexs support several common PCRE anchors, although sometimes with different names.
Other Elisp anchors involving character "syntax" or `point' are unique to Elisp.

Here I have organized the anchors described in the Elisp info pages in tabular form:

         Matches              Usage Restrictions
\=       at   point              Buffer only
^     after   BEG or ␀         Start of clause
\`    after   BEG
$    before   END or ␀           End of clause
\'   before   END
\b       at   wordBEG,wordEND,BEG,END
\B   not at   wordBEG,wordEND,BEG,END
\<   before   wordBEG
\>    after   wordEND
\_<  before   symbolBEG
\_>   after   symbolEND
Where BEG and END denote the beginning or end of the target (string or buffer) being matched against; and again I use ␀ to denote the newline character 10.

The "Start of clause" restriction on ^ indicates that the anchor can only be used at the beginning of a regular expression, an alternative, or a group.
In other words, as an anchor, ^ can only be preceded by: \| or \( or \(?:.
Note however that in regular expressions (Elisp or PCRE) the character ^ does double duty.
In addition to serving as a "start of clause" anchor, ^ also plays the role of negating a set of characters in a character alternative.

"^[a-z]"  Matches a single character in {a,b,...,z}, after a newline, or at the beginning of the target.
"[^a-z]"  Matches a single character *not* in {a,b,...,z}  (anywhere)

The "End of clause" restriction on $ indicates that the anchor can only be used at the end of a regular expression, an alternative, or a group.
In other words, the anchor $ can only be followed by: \) or \|.

\= only makes sense when matching against a buffer. For example:

(re-search-forward    "[[:space:]]+");  Skips to end of the next patch of space characters.
(re-search-forward "\\=[[:space:]]+");  Skips past any space characters immediately after point.
Note that \= has a completely different special meaning in documentation strings.
To read about that use of \=, search for "keys in documentation strings" in the Elisp info pages.
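
Returning to \= as a regex anchor, here is a minimal sketch which can be run without disturbing your own buffers. The function name demo-skip-space-at-point is hypothetical.

```elisp
;; Hypothetical helper for illustration only.
(defun demo-skip-space-at-point ()
  "Move past white space directly after point; return new point, or nil."
  (re-search-forward "\\=[[:space:]]+" nil t))

(with-temp-buffer
  (insert "foo   bar")
  (goto-char 4)                  ; just after "foo"
  (demo-skip-space-at-point))    ; 7, just before "bar"

(with-temp-buffer
  (insert "foo   bar")
  (goto-char 1)                  ; before "foo", no white space here
  (demo-skip-space-at-point))    ; nil, because \= only matches at point
```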

Mind the newlines

Mishandling of newlines is a frequent source of error when writing regexs.  I don't have a silver bullet for this; just some advice.

Make a habit of conscientiously selecting between \` vs ^ and \' vs $ instead of just lazily using ^ and $ all the time because they are more familiar.
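
To make the difference concrete, here is a small sketch matching against a two line string (the variable name demo-two-lines is made up):

```elisp
(defvar demo-two-lines "apple\nbanana"
  "Hypothetical sample text holding two lines.")

(string-match-p "^banana"   demo-two-lines)  ; 6, ^ matches after the newline
(string-match-p "\\`banana" demo-two-lines)  ; nil, not at the start of the string
(string-match-p "apple$"    demo-two-lines)  ; 0, $ matches before the newline
(string-match-p "apple\\'"  demo-two-lines)  ; nil, not at the end of the string
```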

Note that the \` anchor doesn't do anything useful when combined with the start-idx argument of string-match. One might think that passing a start-idx to string-match would act as if the text argument string starts at position start-idx, but that is not how it works.

;; Imagine you only want matches starting at position 4
(string-match-p  "cat"  "the cat in the hat"  )  -->  4
(string-match-p  "cat"  "the cat in the hat" 4)  -->  4        Yes we want "cat" at pos 4
(string-match-p  "cat"  "the hat in the cat" 4)  --> 15        But here one at 15 also matches

;; Unsuccessful attempts to match only at position 4
(string-match-p  "\\`cat"  "the cat in the hat" 4)  --> nil    Bad. Does not match, because 4 is not the start of the string.
(string-match-p  "^cat"    "the cat in the hat" 4)  --> nil    Bad. Does not match for the same reason.
(string-match-p  "^cat"   "the\ncat in the hat" 4)  --> 4      Different; matches due to the newline.
Instead use substring.
(defun regex/match-at? (regex text pos)
  "Does REGEX match TEXT at position POS?"
  (string-match-p
   (concat "\\`" regex)
   (substring text pos)
   ))
(regex/match-at?  "cat"  "the cat in the hat"  4)  --> 0     OK, a true value.
(regex/match-at?  "cat"  "the cat in the hat"  2)  --> nil   false, as desired.

Another important detail to be mindful of is that, in Elisp regexs, the special character . matches any single character except a newline.
(In PCRE regexs . usually excludes the newline character, but not always. The PCRE construct \N is more reliably like the Elisp . special character.)

This raises the question of how to match any single character in an Elisp regex. One might be tempted to try [.\n] or similar, but that won't work because '.' has no special meaning inside a character class.
Some options include:

 STRING LITERAL        REGEX         Matches one of
    "[^z-a]"           [^z-a]      anything.  The range z-a is empty, so its negation includes all characters.  This slick way is used by rx.
  "\\(.\\|\n\\)"      \(.\|␀\)     anything (and captures it as a group)
"\\(?:.\\|\n\\)"    \(?:.\|␀\)     anything (without capturing)
     "."                .          anything except a newline character
 "[[:print:]]"      [[:print:]]    Most chars, but not control chars with ascii code below 32, such as line feed ^J and form feed ^L

The s-expression based rx however does provide a symbol anything to match any single character, and separately the symbol not-newline or nonl to match anything but the newline.

(string-match-p (rx "cat" anything "dog") "catdog")   -->  nil.  no char between cat and dog
(string-match-p (rx "cat" anything "dog") "cat:dog")  -->  0.  (matches at position 0)
(string-match-p (rx "cat" anything "dog") "cat\ndog") -->  0.  (matches at position 0)
(string-match-p (rx "cat"   nonl   "dog") "cat:dog")  -->  0.  (matches at position 0)
(string-match-p (rx "cat"   nonl   "dog") "cat\ndog") -->  nil.  char between cat and dog is a newline.
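
The string literal options in the table above can be checked in the same way. A small sketch, assuming a reasonably recent emacs (27 or later) for the "[^z-a]" trick; the helper name demo-matches-newline-p is made up.

```elisp
;; Hypothetical helper for illustration only.
(defun demo-matches-newline-p (regex)
  "Return t if REGEX matches a lone newline character, else nil."
  (and (string-match-p regex "\n") t))

(demo-matches-newline-p "[^z-a]")           ; t
(demo-matches-newline-p "\\(?:.\\|\n\\)")   ; t
(demo-matches-newline-p ".")                ; nil
(demo-matches-newline-p "[[:print:]]")      ; nil
```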

Word boundary anchor frequently useful

\b is an anchor matching the empty string at word boundaries. It is frequently useful when searching through text.
For example, to find lines in a buffer holding the word "rust", but excluding words such as "trust" and "rustic", we can enter \brust\b at the prompt of the occur command.
In Elisp code string literals, such backslashes need to be doubled as usual:
(re-search-forward "\\brust\\b");  Search for word rust

Moreover, you should think about whether you want \b or the pair \<, \>. The pair obviously differs from \b in that \< matches at the beginning of words and \> at the end, while \b does both.
More subtly, they also differ in how they treat the beginning and end of their target string or buffer.

;; search for word CAT               returns
(string-match-p "\\bCAT\\b"  "CAT");       0
(string-match-p "\\<CAT\\>"  "CAT");       0
(string-match-p "\\bCAT\\b"  " CAT ");     1
(string-match-p "\\<CAT\\>"  " CAT ");     1

;; search for word CAT in between 2 words
(string-match-p "\\b CAT \\b"  "Dog CAT Pig");  3
(string-match-p "\\> CAT \\<"  "Dog CAT Pig");  3

(string-match-p "\\b CAT \\b"  " CAT ");   0
(string-match-p "\\< CAT \\>"  " CAT ");   nil
The last case differs because the pair \<, \> treat the target ends (start or finish) as word boundaries only if the ultimate (first or final) character is a word character;
but \b always matches at target ends.

Helpful Tools for Taming Regex in Elisp

As detailed above, for historical reasons using regexs in emacs lisp can be somewhat frustrating at times.
Thankfully there are some tools available to help.

Incrementally building regexs

Incrementally building things is a typical approach in computer programming.
Elisp provides the command re-builder to interactively construct regular expressions while seeing what text they match.
In a buffer with some text you want to match, execute the re-builder command and try typing regexs in the buffer.

The variable reb-re-syntax controls how the text typed into the *RE-Builder* buffer is interpreted.

;; To see the regex as in an Elisp string literal, e.g. "\\\\" to match \
(setq reb-re-syntax 'read);  enter \\\\ to match \

;; To see the same but after string backslash substitution, e.g. "\\" to match \
(setq reb-re-syntax 'string);  enter \\ to match \
You can also use reb-change-syntax {C-c C-i} to switch between those values. To learn more use describe-keymap to see what commands re-builder provides.

As an aside, many modes like this bind their commands to C-c C-somekey (because emacs recommends that).
However I find I often accidentally type C-c somekey instead. So I often add key bindings without the second control key press.
For example:

;; Key names are written in the format understood by `kbd'.
(bind-keys
 :map reb-mode-map
 ("C-c b" . reb-change-target-buffer)
 ("C-c c" . reb-toggle-case)
 ("C-c e" . reb-enter-subexp-mode)
 ("C-c i" . reb-change-syntax)
 ("C-c q" . reb-quit)
 ("C-c r" . reb-prev-match)
 ("C-c s" . reb-next-match)
 ("C-c u" . reb-force-update)
 ("C-c w" . reb-copy)
 )

Help when using regexs for plain string matching

By "plain string matching", I mean the task of finding occurrences of a fixed pattern string (possibly case insensitively) as a (sub)string of a query string.

This is a special case of regex matching and in fact the typical way to do plain string matching in Elisp is to use the Elisp regex matching machinery.
For example string-match-p does regex matching, not just string matching.
But often what you want is plain string matching. The two functions regexp-quote and regexp-opt are helpful in this case.

;; Obtain a regex string matching   '\back'
;; written as an Elisp literal string  "\\back"

(string-match-p "\\back" "red ack");                 -->  4.  Oops, matches 'ack' after a word boundary
(string-match-p (regexp-quote "\\back") "red ack");  --> nil. No '\back' in 'red ack'
(string-match-p (regexp-quote "\\back") "\\back");   -->  0. matches '\back'

;; Obtain a regex string matching either '\bad' or '\boy'
(regexp-opt '("\\bad" "\\boy")) --> "\\(?:\\\\b\\(?:ad\\|oy\\)\\)"

;; Same, except the regex also captures the match.
(regexp-opt '("\\bad" "\\boy") t) --> "\\(\\\\b\\(?:ad\\|oy\\)\\)"
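
regexp-opt can also add the word anchors discussed earlier: passing the symbol words as its second argument wraps the result in \< and \>. A small sketch (the variable name demo-word-regex is made up):

```elisp
(defvar demo-word-regex (regexp-opt '("rust" "iron") 'words)
  "Hypothetical regex matching rust or iron as whole words only.")

(string-match-p demo-word-regex "rustic trust rust")  ; 13, the final word
(string-match-p demo-word-regex "trusty irons")       ; nil, no whole word match
```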

Translating between Elisp and Perl regexs

The package pcre2el, available from elpa, provides the functions rxt-pcre-to-elisp and rxt-elisp-to-pcre to translate back and forth between Elisp and Perl regexs.
; Perl --> Elisp
(rxt-pcre-to-elisp "(cat|dog)");    -->  "\\(\\(?:cat\\|dog\\)\\)"
;; Not sure why it does not return "\\(cat\\|dog\\)"

; Elisp --> Perl
(rxt-elisp-to-pcre "\(cat\|dog\)"); -->  "\\(cat\\|dog\\)"
Quite useful for folks more familiar with PCRE.

Regexs in rx notation

Elisp provides a completely different, S-expression based, representation of regexs called rx. For example:
(rxt-pcre-to-rx "([cC]at|[dD]og)");
;; returns list
(submatch
 (or
  (seq
   (any 67 99)
   "at")
  (seq
   (any 68 100)
   "og")))

;; Note in Elisp characters are integers
;; 67,99; 68,100 are the ascii codes for C,c; D,d

The rx form is often much easier to read than the string form, even for people experienced using regular expressions in string form.
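
rx forms can of course also be written by hand rather than obtained by translation. A minimal sketch, with made up names demo-pet-regex and demo-parse-pet:

```elisp
(require 'rx)

(defvar demo-pet-regex
  (rx (group (or "cat" "dog"))      ; group 1: the animal
      (one-or-more space)
      (group (one-or-more digit)))  ; group 2: a number
  "Hypothetical regex: an animal name, white space, then digits.")

;; Hypothetical helper for illustration only.
(defun demo-parse-pet (text)
  "Return a list (ANIMAL NUMBER-STRING) found in TEXT, or nil."
  (save-match-data
    (when (string-match demo-pet-regex text)
      (list (match-string 1 text) (match-string 2 text)))))

(demo-parse-pet "my dog  42")   ; ("dog" "42")
```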

I find the command rxt-explain extremely useful for checking regular expressions in Elisp code.
For example, to check this (mistaken) code:

(looking-at "\\\\begin\\(\\s*{[^}\n]*}\\)"█))
Invoking rxt-explain at █ above, gives the message:  rxt-parse-atom/el: Invalid regexp: "Invalid syntax class `\\\\s '"
catching a mistake I made thinking that '\s' would match whitespace - correct for PCRE, but wrong for Elisp.

Editing the code to fix that yields:

(looking-at "\\\\begin\\([[:space:]]*{[^}\n]*}\\)"█ ))

Invoking rxt-explain at █ now pops up a buffer  "* Regexp Explain *"  with the following contents:
\\begin\([[:space:]]*{[^}
]*}\)

(seq "\\begin"
     (submatch
      (zero-or-more space)
      "{"
      (zero-or-more
       (not
        (any ?\n ?})))
      "}"))
Which is what I wanted. Note the string literal "\\begin" is read as the string '\begin' as intended (with just one backslash "\begin" would be read as the backspace character ?\b followed by 'egin').

To get help like this from rxt-explain, the buffer should be in the emacs-lisp-mode or lisp-interaction-mode major mode (which should be the case for filenames ending in .el) and `point' placed inside or near the string. Actually rxt-explain just dispatches to rxt-explain-elisp or rxt-explain-pcre. So, in any buffer, if you want an Elisp string-form regex explained but rxt-explain prompts for a PCRE regex, you can simply call rxt-explain-elisp directly.

A final note regarding rxt-explain, is that it displays the plain space characters (ascii code 32) as \s.

(looking-at-p "a cat"█)
Pops up:
a\scat

"a\scat"

The \s here is not a regular expression construct, but rather an alternative way to represent a literal space character in Elisp code.
For example:

;;          --> <-- plain space character
(=  32  ?\s  ?\ )             ;; true

(string= "a cat"  "a\scat")   ;; true

I do not know why rxt-explain displays space characters in this manner, but it may be useful to distinguish between plain space characters and other space characters defined in Unicode. For example

(looking-at-p "2 plain spaces(  ) one fill-width-space(　)"█)
Pops up:
2\splain\spaces(\s)\sone\sfill-width-space(　)

"2\splain\spaces(\s)\one\sfill-width-space(　)"
Allowing one to distinguish between two plain spaces and one "full-width" space character (Unicode code point #x3000), which otherwise might look the same. One puzzle remains. Why do the two plain spaces inside the parens produce only a single \s in the rxt-explain output? In other words why not:
2\splain\spaces(\s\s)\sone\sfill-width-space(　)

"2\splain\spaces(\s\s)\sone\sfill-width-space(　)"
I have no idea why. Perhaps it is a bug in this version (emacs 28.2) of pcre2el.el?
In any case rxt-explain is still really useful for checking regular expressions.

I have not yet explored other uses of the rx representation of regular expressions, but I expect it should be much more "lisp friendly" than the string representation; and therefore probably easier to work with when writing Elisp code which modifies or generates regexs.

Using regexs with query-replace-regexp

query-replace-regexp is a flexible command which not only allows one to do typical query replace operations provided by most editors,
but also has the flexibility to allow programmers to use their skills while editing.

The following example and discussion is similar to a nice blog article written by Protesilaos Stavrou.
In the past I used caps for HTML tags, as in <TT> and </B> instead of <tt> and </b>. I think the caps make tags easier to spot; but the convention seems to be to use lower case.

So let's say I decide to conform and convert my upper case tags to lower case. One way to do this is to use the code evaluating feature of query-replace-regexp.
Place the cursor at the top of an HTML file and invoke query-replace-regexp, entering:

Query replace regexp        </?[A-Z]+>
Query replace regexp with   \,(downcase \&)
The \, construct allows us to evaluate arbitrary code to obtain the replacement text.
The \& denotes the matching string; if our regex had groups, we could access them with \1, \2 etc.
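
For comparison, roughly the same replacement can be done non-interactively with re-search-forward and replace-match. This is only a sketch of the idea, not how query-replace-regexp is implemented; the function name demo-downcase-tags is made up.

```elisp
;; Hypothetical helper for illustration only.
(defun demo-downcase-tags (html)
  "Return HTML with upper case tags such as <TT> or </B> in lower case."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (let ((case-fold-search nil))         ; make [A-Z] mean upper case only
      (while (re-search-forward "</?[A-Z]+>" nil t)
        ;; FIXEDCASE t: insert the new text exactly as given.
        ;; LITERAL t:   do not interpret backslashes in the new text.
        (replace-match (downcase (match-string 0)) t t)))
    (buffer-string)))

(demo-downcase-tags "<TT>x</TT> and <B>y</B>")  ; "<tt>x</tt> and <b>y</b>"
```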

Be aware of the effects of case-fold-search on query-replace-regexp

A caveat to keep in mind is that the variable case-fold-search affects the behavior of query-replace-regexp.
When I use emacs I usually have case-fold-search on (set to t).
When writing this I initially confused myself by doing the following.
With case-fold-search true, I invoked query-replace-regexp like this:
Query replace regexp        </?[a-z]+>
Query replace regexp with   \,(downcase \&)
This produced the behavior that the regex matched where I intended it to, but the replaced text was still in upper case!
So for example, when running query-replace-regexp, the cursor would stop at say <TT>,
but when I pressed the y key to do the replacement, the replacement was still <TT>.

What happened?
Notice my regex mistakenly included [a-z] instead of [A-Z].
With case-fold-search set to nil, the regex would simply have failed to match "<TT>". When case-fold-search is set to a true value it will match "<TT>";
but, because my regex contained no upper case letters (and the variable case-replace was at its default value t), the command noticed that the matched text was in upper case and converted the replacement text to upper case as well.
This behavior is often useful, but not in this particular situation.
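
The underlying case conversion can be seen in isolation with replace-match, whose FIXEDCASE argument switches it on or off. This sketch only illustrates that conversion; the query-replace commands have their own logic for deciding when to apply it. The function name demo-replace-tt is made up.

```elisp
;; Hypothetical helper for illustration only.
(defun demo-replace-tt (fixedcase)
  "Replace <TT> by <tt> in a scratch buffer and return the resulting text.
FIXEDCASE is passed on to `replace-match'."
  (with-temp-buffer
    (insert "<TT>")
    (goto-char (point-min))
    (let ((case-fold-search t))           ; lets [a-z] match TT
      (re-search-forward "<[a-z]+>")
      (replace-match "<tt>" fixedcase t))
    (buffer-string)))

(demo-replace-tt nil)  ; "<TT>", new text converted to the case of the old text
(demo-replace-tt t)    ; "<tt>", new text inserted as given
```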

The remedy? The simplest advice is to be aware of this behavior and make sure your regexs are correct;
but you could also define a variant of query-replace-regexp which temporarily turns off case folding:

(defun query-replace-regexp/fold-not ()
  "Call `query-replace-regexp' with case folding suppressed."
  (interactive)
  (let ((case-fold-search nil))
    (call-interactively #'query-replace-regexp)))

Another query-replace-regexp example

Above, our code used the function downcase; but other code can be used, sometimes for the purposes of side effects.

For example, one might want to systematically adjust some numbers in some text.
Suppose you are editing some TikZ source code and want to shift some objects one centimeter to the right. You might do this by adding one to every occurrence of xshift=NUMBER. query-replace-regexp is suitable for this sort of task.

Buffer contents before

\begin{scope}[xshift=7cm, yshift=15cm]...
\begin{scope}[xshift=1cm, yshift=12cm]...
\begin{scope}[xshift=13cm,yshift=9cm]...
Then execute query-replace-regexp
Query replace regexp        \(xshift=\)\([0-9]+\)
Query replace regexp with   \1\,(1+ (string-to-number \2))
Buffer contents after
\begin{scope}[xshift=8cm, yshift=15cm]...
\begin{scope}[xshift=2cm, yshift=12cm]...
\begin{scope}[xshift=14cm,yshift=9cm]...
When matching the text "xshift=13";
in the replacement "xshift=" is substituted for \1
and (1+ (string-to-number \2)) evaluates to (1+ (string-to-number "13")).
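
The same arithmetic can be applied to a string from code with replace-regexp-in-string, which accepts a function to compute each replacement. A minimal sketch; the function name demo-shift-x is made up.

```elisp
;; Hypothetical helper for illustration only.
(defun demo-shift-x (line)
  "Return LINE with the number in every xshift=NUMBER increased by one."
  (replace-regexp-in-string
   "\\(xshift=\\)\\([0-9]+\\)"
   (lambda (match)
     ;; While this function runs, the match data refers to MATCH,
     ;; so the groups are read from MATCH itself.
     (format "%s%d"
             (match-string 1 match)
             (1+ (string-to-number (match-string 2 match)))))
   line
   t    ; FIXEDCASE: do not adjust the case of the replacement
   t))  ; LITERAL: do not interpret backslashes in the replacement

(demo-shift-x "\\begin{scope}[xshift=13cm,yshift=9cm]")
;; returns the string  \begin{scope}[xshift=14cm,yshift=9cm]
```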

Return type of \,(...) need not be a string

Note that although we need to convert "13" to a number before passing it to the function 1+, we do not need to convert the number returned by 1+ to a string.

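This type flexibility on the part of query-replace-regexp is convenient (albeit somewhat surprising). According to the Emacs manual, the value of the expression is converted to text without quoting, so any printable value seems to work.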
Try for example:

Query replace regexp        pig
Query replace regexp with   \,(make-vector (length \&) 'hund)

Piggybacking on the query-replace mechanism

The bulk of the work of query-replace is performed by the function perform-replace, which can be conveniently used to define your own commands behaving in a similar way. For example to swap occurrences of two strings, one can define a command like this.
(defun ph/query-swap-string-occurrences (s1 s2)
  "Query replace occurrences of string S1 with S2, and string S2 with S1."
  (interactive
   (list
    (read-string "Replace string: ")
    (read-string "with string: ")
    ))
  (perform-replace
   (regexp-opt (list s1 s2));  FROM-STRING
   (list (lambda (_ _) (if (looking-back-p (regexp-quote s1)) s2 s1)));  REPLACEMENT
   t;  QUERY?  Yes, ask for confirmation
   t;  REGEXP?  Yes, we search with a regex matching S1 or S2
   nil;  DELIMITED?
   nil;  REPEAT-COUNT
   nil;  MAP
   (if (region-active-p) (region-beginning)); START  (nil: start from point)
   (if (region-active-p) (region-end));       END    (nil: continue to end of buffer)
   ))
Where looking-back-p described below is a match data clean version of looking-back.

With its numerous parameters, the call to perform-replace may seem a bit intimidating to new (and not so new) Elisp programmers, in particular the second argument (named REPLACEMENTS), which here involves a lambda expression with two dummy arguments; the code above uses _ for both of them. Indeed, although the docstring of perform-replace is perfectly accurate, I must confess it took some trial and error for me to get this to work.

The Match Data

As explained in the emacs info pages and elsewhere, the matching positions of Elisp regular expression matching operations are stored in a single temporary place known as the match data, which is implemented in C, but accessible via several lisp functions.

For example, the function match-data returns the information stored in the match data.

(progn;                                     0123456789012345
  (string-match  "\\(pill\\).*\\(pill\\)"  "caterpillar pill")
  (match-data)
  ); returns:     (5 16 5 9 12 16)
  ;; [span) of     all   \1   \2
  ;;

Keep in mind the fact that the match data only stores positions. For a general explanation, I refer readers to the relevant Elisp info pages.
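
Since only positions are stored, extracting the matched text always requires the original string (or buffer). A small sketch doing by hand what match-string does for you; the function name demo-second-pill is made up.

```elisp
;; Hypothetical helper for illustration only.
(defun demo-second-pill (text)
  "Return (START END SUBSTRING) for group 2 of the pill regex in TEXT."
  (save-match-data
    (when (string-match "\\(pill\\).*\\(pill\\)" text)
      (list (match-beginning 2)
            (match-end 2)
            ;; The text itself has to come from TEXT.
            (substring text (match-beginning 2) (match-end 2))))))

(demo-second-pill "caterpillar pill")   ; (12 16 "pill")
```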

Here I just mention a caveat or two about using string-match followed by match-string (directly or indirectly).

No check is made that the strings are the same

;; Matched "cat", but returns "dog"
(progn
  (string-match "cat" "acat")
  (match-string 0 "adog")
  )

No check that you remembered the string argument

;; Matched "cat", but returns the first three characters of the current buffer (buffer positions 1 to 4).
(progn
  (string-match "cat" "acat")
  (match-string 0);  Logical error.  Elisp quietly returns part of the buffer.
  )

The Match Data is Fleeting!

The match data is constantly being overwritten. Even if you know that, it is easy to forget.

An example in which match data overwrite confused me

Here is an example which confused me.
In class, I often write impromptu Elisp code offhand to demonstrate something, using eval-last-sexp as I go to show the result of the code.
(A sort of poor man's REPL; maybe I should use ielm instead...)
Anyway, I usually get what I expect, but the following confused me.
(string-match "cat" "acat")█ Invoke eval-last-sexp  returns 1
(match-string 0 "acat")█ Invoke eval-last-sexp  KABOOM! (args-out-of-range "acat" 72499 72543)
Where █ indicates the positions at which I invoked eval-last-sexp.
What happened? Confused, I did the following experiment
(progn
  (string-match "cat" "acat")
  (match-string 0 "acat")
)█ Invoke eval-last-sexp  returns "cat"
As expected.
Weird; in both cases the call to match-string directly follows the call to string-match, ... or does it?
Evidently eval-last-sexp is the culprit, and indeed it was overwriting the match data before my call to match-string.
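
If the positions need to survive some intervening code, they can be copied with match-data and put back with set-match-data. A minimal sketch, in which a second string-match stands in for whatever code overwrites the match data; the function name is made up.

```elisp
;; Hypothetical helper for illustration only.
(defun demo-survive-clobber ()
  "Show that saved match data can be restored after being overwritten."
  (let (saved)
    (string-match "cat" "acat")
    (setq saved (match-data))      ; copy of the positions, here (1 4)
    (string-match "o" "dog")       ; some other code overwrites the match data
    (set-match-data saved)         ; put the earlier positions back
    (match-string 0 "acat")))

(demo-survive-clobber)   ; "cat"
```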

Strategies to minimize match data overwrite related bugs

Prefer string-match-p over string-match

string-match-p preserves the match data, so use it if you only need to know whether or not there is a match.
Likewise when possible prefer looking-at-p over looking-at and looking-back-p over looking-back.
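
A small sketch of the difference: the call to string-match-p in the middle leaves the positions recorded by the earlier string-match alone. The function name is made up.

```elisp
;; Hypothetical helper for illustration only.
(defun demo-check-without-clobber ()
  "Return a list of a `string-match-p' result and an earlier match."
  (string-match "cat" "acat")
  (let ((dog-index (string-match-p "dog" "hotdog")))  ; match data untouched
    (list dog-index (match-string 0 "acat"))))

(demo-check-without-clobber)   ; (3 "cat")
```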

Except that looking-back-p is not included in Elisp, because looking back can be very slow, depending on how it is used. I think the more useful approach would be to let programmers decide, so I trivially defined looking-back-p myself as:

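(defun looking-back-p (regexp &optional limit)
  "Same as `looking-back' except this function does not change the match data.
Optional LIMIT is passed on to `looking-back'."
  ;; save-match-data restores the caller's match data afterwards; it also
  ;; avoids `inhibit-changing-match-data', which newer emacs versions
  ;; mark as obsolete.
  (save-match-data
    (looking-back regexp limit)))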
(Also available on gitlab)

Match and capture in a single function call

When using only the most basic builtin Elisp functions, regular expression matching via string-match (or looking-at, etc.) is a separate operation from subsequent access to the matches via match-string.  This opens the door for the match data being overwritten between calls, or for the string argument used with match-string to differ from the one used for matching (as in the "acat" and "adog" example) or to be missing entirely (defaulting to extracting text from the buffer).

Generally speaking, a safer approach is to use functions which do matching and capturing in a single step and politely restore the match data,
such as s-match from the s.el library:

(progn
  (insert "\n")
  (string-match "cat" "acat")
  (insert "s-match: " (car (s-match "web" "webster")) "\n")
  (insert "outside: " (match-string 0 "acat"))
  )█ invoking eval-last-sexp inserts:
s-match: web
outside: cat
Note the car, since s-match returns a list.

Use save-match-data to clean up after yourself

If you write a function which modifies the match data, wrap it in a save-match-data scope:

For example a regex/match? function like this one provides a simple matching function which restores the match data to its original state.

(defun regex/match? (regex text &optional drop-props?)
  "If REGEX matches TEXT, return the first match.  Otherwise nil.
Drop text properties when DROP-PROPS? is true.

Restores match data.
See also `s-match'."
  (save-match-data
    (when (string-match regex text)
      (if drop-props?
          (match-string-no-properties 0 text)
        (match-string 0 text)))))

Regexs in Common Lisp

Regexs are not part of the Common Lisp standard, but the library CL-PPCRE, which implements Perl-compatible regular expressions, is available.