Thursday, 16 May 2013

Guide to using regular expressions in NotePad, with examples


How to use regular expressions in Notepad++ (tutorial)

If you have the plugins installed, try Ctrl+R or TextFX -> TextFX Quick -> Find/Replace to get a sophisticated dialogue including a drop-down for regular expressions and multi-line search/replace.
This tutorial was based on an earlier, far more limited regular expression syntax. The examples are still the same at the date of writing; they require additions or upgrading to the new ways.
Notepad++ regular expressions use the standard PCRE (Perl) syntax, only departing from it in very minor ways. Complete documentation on the precise implementation is to be found on the implementer's website.
Another great tutorial is provided online at http://www.regular-expressions.info .
A French Sourceforge user, guy038, made a tutorial available in the French language. This is hosted at ici in a variety of formats.

In a regular expression (shortened into regex throughout), special characters interpreted are:

Single-character matches

. , \C
Matches any character. If you check the box which says ". matches newline", the dot will indeed do that, enabling the "any" character to run over multiple lines. With the option unchecked, then . will only match characters within a line, and not the line ending characters (\r and \n)
\X
Matches a single non-combining character followed by any number of combining characters. This is useful if you have a Unicode encoded text with accents as separate, combining characters.
\Г
This allows you to use a character Г that would otherwise have a special meaning. For example, \[ would be interpreted as [ and not as the start of a character set. Adding the backslash (this is called escaping) also works the other way round, as it makes special a character that otherwise isn't. For instance, \d stands for "a digit", while "d" is just an ordinary letter.

Non ASCII characters
\xnn
Specifies a single character with code nn. What this stands for depends on the text encoding. For instance, \xE9 may match an é or a θ depending on the code page in an ANSI encoded document.
\x{nnnn}
Like above, but matches a full 16-bit Unicode character. If the document is ANSI encoded, this construct is invalid.
\0nnn
A single byte character whose code in octal is nnn.
[[.collating sequence.]]
The character the collating sequence stands for. For instance, in Spanish, "ch" is a single letter, though it is written using two characters. That letter would be represented as [[.ch.]]. This trick also works with symbolic names of control characters, like [[.BEL.]] for the character of code 0x07. See also the discussion on character ranges.

Control characters
\a
The BEL control character 0x07 (alarm).
\b
The BS control character 0x08 (backspace). This is only allowed inside a character class definition. Otherwise, this means "a word boundary".
\e
The ESC control character 0x1B.
\f
The FF control character 0x0C (form feed).
\n
The LF control character 0x0A (line feed). This is the regular end of line under Unix systems.
\r
The CR control character 0x0D (carriage return). This is part of the DOS/Windows end of line sequence CR-LF, and was the EOL character on Mac OS 9 and earlier. OS X and later versions use \n.
\R
Any newline character.
\t
The TAB control character 0x09 (tab, or hard tab, horizontal tab).
\ccharacter
The control character obtained from character by stripping all but its 6 lowest order bits. For instance, \c1, \cA and \ca all stand for the SOH control character 0x01.

Ranges or kinds of characters

[...]
This indicates a set of characters, for example, [abc] means any of the characters a, b or c. You can also use ranges, for example [a-z] for any lower case character. You can use a collating sequence in character ranges, like in [[.ch.]-[.ll.]] (these are collating sequences in Spanish).
[^...]
The complement of the characters in the set. For example, [^A-Za-z] means any character except an alphabetic character. Care should be taken with a complement list, as regular expressions are always multi-line, and hence [^ABC]* will match until the first A,B or C (or a, b or c if match case is off), including any newline characters. To confine the search to a single line, include the newline characters in the exception list, e.g. [^ABC\r\n].
[[:name:]]
The whole character class named name. Most of the time, there is a single letter escape sequence for them - see below.
Recognised classes are:
  • alnum : ASCII letters and digits
  • alpha : ASCII letters
  • blank : spacing which is not a line terminator
  • cntrl : control characters
  • d , digit : decimal digits
  • graph : graphical character
  • l , lower : lowercase letters
  • print : printable characters
  • punct : punctuation characters: , " ' ? ! ; : # $ % & ( ) * + - / < > = @ [ ] \ ^ _ { } | ~
  • s , space : whitespace
  • u , upper : uppercase letters
  • unicode : any character with code point above 255
  • w , word : word character
  • xdigit : hexadecimal digits
\pshort name,\p{name}
Same as [[:name:]]. For instance, \pd and \p{digit} both stand for a digit, \d.
\Pshort name,\P{name}
Same as [^[:name:]] (not belonging to the class name).
Note that Unicode categories, as in \p{Sc} or \p{Currency_Symbol}, are flagged as an invalid regex in v6.3.2. This is because supporting them would pull in a large library, which would have other uses.
\d
A digit in the 0-9 range, same as [[:digit:]].
\D
Not a digit. Same as [^[:digit:]].
\l
A lowercase letter. Same as [a-z] or [[:lower:]].
NOTE: this will fall back on "a word character" if the "Match case" search option is off.
\L
Not a lower case letter. See note above.
\u
An uppercase letter. Same as [[:upper:]]. See note about lower case letters.
\U
Not an uppercase letter. Same note applies.
\w
A word character, which is a letter, digit or underscore. This appears not to depend on what the Scintilla component considers as word characters. Same as [[:word:]].
\W
Not a word character. Same as [^[:word:]], that is, anything outside :alnum: with the addition of the underscore.
\s
A spacing character: space, EOLs and tabs count. Same as [[:space:]].
\S
Not a space.
\h
Horizontal spacing. This only matches space, tab and line feed.
\H
Not horizontal whitespace.
\v
Vertical whitespace. This encompasses the VT, FF and CR control characters: 0x0B (vertical tab), 0x0C (form feed) and 0x0D (carriage return).
\V
Not vertical whitespace.
[[=primary key=]]
All characters that differ from primary key by case, accent or similar alteration only. For example [[=a=]] matches any of the characters: a, À, Á, Â, Ã, Ä, Å, A, à, á, â, ã, ä and å.
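The caution about complemented sets running across line ends can be checked outside Notepad++. This is a minimal sketch using Python's built-in re module, which is an assumption of this example (Notepad++ uses a different engine, but a negated set accepts newline characters in both). The sample text is made up.

```python
import re

text = "xy\nzA tail"

# [^ABC]* also accepts \n, so the match runs over the line break
across_lines = re.match(r"[^ABC]*", text).group(0)

# adding \r and \n to the excluded set confines the match to the first line
single_line = re.match(r"[^ABC\r\n]*", text).group(0)

print(repr(across_lines))
print(repr(single_line))
```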

Multiplying operators

+
This matches 1 or more instances of the previous character, as many as it can. For example, Sa+m matches Sam, Saam, Saaam, and so on. [aeiou]+ matches consecutive strings of vowels.
*
This matches 0 or more instances of the previous character, as many as it can. For example, Sa*m matches Sm, Sam, Saam, and so on.
?
Zero or one of the last character. Thus Sa?m matches Sm and Sam, but not Saam.
*?
Zero or more of the previous group, but minimally: the shortest matching string, rather than the longest string as with the "greedy" * operator. Thus, m.*?o applied to the text margin-bottom: 0; will match margin-bo, whereas m.*o will match margin-botto.
+?
One or more of the previous group, but minimally.
{n}
Matches n copies of the element it applies to.
{n,}
Matches n or more copies of the element it applies to.
{m,n}
Matches m to n copies of the element it applies to, as many as it can.
{n,}?,{m,n}?
Like the above, but match as few copies as they can. Compare with *? and friends.
*+,?+,++,{n,}+,{m,n}+
These so called "possessive" variants of greedy repeat marks do not backtrack. This allows failures to be reported much earlier, which can boost performance significantly. But they will eliminate matches that would require backtracking to be found.
Example: matching ".*" against "abc"x will find "abc", because
  • " then abc"x, then the closing " fails at the end of the subject
  • " then abc", then the closing " fails at x
  • " then abc, then the closing " succeeds.
However, matching ".*+" against "abc"x will fail, because the possessive repeat factor prevents backtracking.
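The difference between greedy and minimal repeats described above for m.*o and m.*?o can be reproduced with a short script. This sketch assumes Python's built-in re module, whose *, *? and {m,n} behave the same way for these patterns; it is not how Notepad++ runs the search.

```python
import re

text = "margin-bottom: 0;"

# greedy: .* grabs as much as possible, then gives back until an "o" fits
greedy = re.search(r"m.*o", text).group(0)

# minimal: .*? stops at the first "o" that lets the pattern succeed
lazy = re.search(r"m.*?o", text).group(0)

# bounded repeat: between 2 and 3 copies of "a"
two_a = re.fullmatch(r"Sa{2,3}m", "Saam") is not None
one_a = re.fullmatch(r"Sa{2,3}m", "Sam") is not None

print(greedy, lazy, two_a, one_a)
```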

Anchors

Anchors match a position in the line, rather than a particular character.
^
This matches the start of a line (except when used inside a set, see above).
$
This matches the end of a line.
\<
This matches the start of a word using Scintilla's definitions of words.
\>
This matches the end of a word using Scintilla's definition of words.
\b
Matches either the start or end of a word.
\B
Not a word boundary.
\A , \`
The start of the matching string.
\z , \'
The end of the matching string.
\Z
Matches like \z with an optional sequence of newlines before it. This is equivalent to (?=\v*\z), which departs from the traditional Perl meaning for this escape.

Groups

(...)
Parentheses mark a subset of the regular expression. The string matched by the contents of the parentheses ( ) can be re-used as a backreference or as part of a replace operation; see Substitutions, below.
Groups may be nested.
(?<some name>...) , (?'some name'...) , (?(some name)...)
Names this group some name.
\gn , \g{n}
The n-th subexpression, aka parenthesised group. Using the second form has some small benefits, like n being more than 9, or disambiguating when n might be followed by digits. When n is negative, groups are counted backwards, so that \g-2 is the second last matched group.
\g{something},\k<something>
The string matching the subexpression named something.
\digit
Backreference: \1 matches an additional occurrence of a text matched by an earlier part of the regex. Example: This regular expression: ([Cc][Aa][Ss][Ee]).*\1 would match a line such as Case matches Case but not Case doesn't match cASE. A regex can have multiple subgroups, so \2, \3, etc. can be used to match others (numbers advance left to right with the opening parenthesis of the group). So \n is a synonym for \gn, but doesn't support the extension syntax for the latter.
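The backreference example above can be tried as written. This sketch assumes Python's built-in re module, which treats \1 the same way for this pattern; the helper name has_repeat is made up for the illustration.

```python
import re

# group 1 captures "case" in any mix of cases; \1 demands the exact same text again
pattern = re.compile(r"([Cc][Aa][Ss][Ee]).*\1")

def has_repeat(line):
    """Return True if the captured spelling occurs a second time in the line."""
    return pattern.search(line) is not None

print(has_repeat("Case matches Case"))
print(has_repeat("Case doesn't match cASE"))
```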

Readability enhancements

(?:...)
A grouping construct that doesn't count as a subexpression, just grouping things for easier reading of the regex.
(?#...)
Comments. The whole group is for humans only and will be ignored in matching text.
Using the x flag modifier (see section below) is also a good way to improve readability in complex regular expressions.

Search modifiers

The following constructs control how matches condition other matches, or otherwise alter the way search is performed. For those readers familiar with Perl, \G is not supported.
\Q
Starts verbatim mode (Perl calls it "quoted"). In this mode, all characters are treated as-is, the only exception being the \E end verbatim mode sequence.
\E
Ends verbatim mode. Thus, "\Q\*+\Ea+" matches "\*+aaaa".
(?flags-not-flags ...) , (?flags-not-flags:...)
Applies flags and not-flags to search inside the parentheses. Such a construct may have flags and may have not-flags - if it has neither, it is just a non-marking group, which is just a readability enhancer. The following flags are known:
   i : case insensitive (default: off)
   m : ^ and $ match embedded newlines (default: as per ". matches newline")
   s : dot matches newline (default: as per ". matches newline")
   x : ignore unescaped whitespace in regex (default: off)
(?|expression using the alternation | operator)
If an alternation expression has subexpressions in some of its alternatives, you may want the subexpression counter not to be altered by what is in the other branches of the alternation. This construct will just do that.
For example, you get the following subexpression counter values:
# before  ---------------branch-reset----------- after
/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1            2         2  3        2     3     4
Without the construct, (p(q)r) would be group #3, and (t) group #5. With the construct, they both report as group #2.
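The verbatim mode example from this section, "\Q\*+\Ea+" matching "\*+aaaa", can be approximated in a script. Python's built-in re module has no \Q...\E, so this sketch swaps in re.escape for the quoted part; that substitution is an assumption of the example, not what Notepad++ does internally.

```python
import re

literal = r"\*+"  # the text that sat between \Q and \E

# re.escape plays the role of \Q...\E: every special character becomes literal
quoted = re.compile(re.escape(literal) + "a+")
matches_quoted = quoted.fullmatch(r"\*+aaaa") is not None

# without quoting, \*+ means "one or more asterisks", so the same text fails
matches_unquoted = re.fullmatch(r"\*+a+", r"\*+aaaa") is not None

print(matches_quoted, matches_unquoted)
```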

Control flow

Normally, a regular expression parses from left to right linearly. But you may need to change this behaviour.
|
The alternation operator, which allows matching either of a number of options, like in : one|two|three to match either of "one", "two" or "three". Matches are attempted from left to right. Use (?:) to match an empty string in such a construct.
(?n) , (?signed-n)
Refers to subexpression #n. When a sign is present, go to the signed-n-th expression.
(?0) , (?R)
Backtrack to start of pattern.
(?&name)
Backtrack to subexpression named name.
(?assertionyes-pattern|no-pattern)
Matches yes-pattern if assertion is true, and no-pattern otherwise if provided. Supported assertions are:
  • (?=assert) (positive lookahead)
  • (?!assert) (negative lookahead)
  • (?(R)) (true if inside a recursion)
  • (?(Rn)) (true if inside a recursion to subexpression numbered n)
PCRE doesn't treat recursion expressions like Perl does:
In PCRE (like Python, but unlike Perl), a recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure.
\K
Resets matched text at this point. For instance, matching "foo\Kbar" will not match "bar". It will match "foobar", but will pretend that only "bar" matches. Useful when you wish to replace only the tail of a matched subject and groups are clumsy to formulate.

Assertions

These special groups consume no characters. Their successful matching counts, but when they are done, matching starts over where it left off.
(?=pattern)
If pattern matches, backtrack to start of pattern. This allows using logical AND for combining regexes.
For instance,
(?=.*[[:lower:]])(?=.*[[:upper:]]).{6,}
tries finding a lowercase letter anywhere. On success it backtracks and searches for an uppercase letter. On yet another success, it checks whether the subject has at least 6 characters.
"q(?=u)i" doesn't match "quit", because, as matching 'u' consumes 0 characters, matching "i" in the pattern fails at "u" in the subject.
(?!pattern)
Matches if pattern didn't match.
(?<=pattern)
Asserts that pattern matches before some token.
(?<!pattern)
Asserts that pattern does not match before some token.
NOTE: pattern has to be of fixed length, so that the regex engine knows where to test the assertion.
(?>pattern)
Match pattern independently of surrounding patterns, and don't backtrack into it. Failure to match will cause the whole subject not to match.
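The two lookahead examples in this section can be checked with a few lines of script. This sketch assumes Python's built-in re module, which lacks [[:lower:]] and [[:upper:]], so [a-z] and [A-Z] are swapped in (ASCII letters only). The helper name passes is made up.

```python
import re

# two lookaheads act as a logical AND before .{6,} consumes the text
rule = re.compile(r"(?=.*[a-z])(?=.*[A-Z]).{6,}")

def passes(candidate):
    """True if candidate has a lowercase letter, an uppercase letter and 6+ characters."""
    return rule.fullmatch(candidate) is not None

# the lookahead consumes nothing, so "i" is then compared with "u" and fails
quit_matches = re.search(r"q(?=u)i", "quit") is not None

print(passes("abcDef"), passes("abcdef"), passes("aB"), quit_matches)
```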

Substitutions

\a,\e,\f,\n,\r,\t,\v
The corresponding control character, respectively BEL, ESC, FF, LF, CR, TAB and VT.
\ccharacter , \xnn , \x{nnnn}
Like in search patterns, respectively the control character with the same low order bits, the character with code nn and the character with code nnnn (requires Unicode encoding).
\l
Causes next character to be output in lowercase.
\L
Causes next characters to be output in lowercase, until a \E is found.
\u
Causes next character to be output in uppercase.
\U
Causes next characters to be output in uppercase, until a \E is found.
\E
Puts an end to forced case mode initiated by \L or \U.
$& , $MATCH , ${^MATCH}
The whole matched text.
$` , $PREMATCH , ${^PREMATCH}
The text between the previous and current match, or the text before the match if this is the first one.
$' , $POSTMATCH , ${^POSTMATCH}
Everything that follows current match.
$LAST_SUBMATCH_RESULT , $^N
Returns what the last matching subexpression matched.
$+ , $LAST_PAREN_MATCH
Returns what matched the last subexpression in the pattern.
$$
Returns $.
$n , ${n} , \n
Returns what matched the subexpression numbered n. Negative indices are not allowed.
$+{name}
Returns what matched subexpression named name.

Zero length matches

While, in normal or extended mode, there would be no point in looking for text of length 0, this can very normally happen with regular expressions. For instance, to add something at the beginning of a line, you'll search for "^" and replace with whatever is to be added.
Notepad++ would select the match, but there is no sensible way to select a stretch zero characters long. When this happens, a tooltip very similar to function call tips is displayed instead, with a caret pointing upwards to the empty match.
[Image: Zero.png] A match was found at the first column of line 5.
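Replacing a zero length match, as in the "^" case above, can be sketched in a script as follows. Python's built-in re module is assumed here, with its MULTILINE flag standing in for the per-line behaviour of "^" in Notepad++; the function name prefix_lines is hypothetical.

```python
import re

def prefix_lines(text, prefix):
    """Insert prefix at the start of every line of text."""
    # "^" consumes nothing: it matches the empty position at each line start.
    # A function is used as the replacement so backslashes in prefix stay literal.
    return re.sub(r"^", lambda match: prefix, text, flags=re.MULTILINE)

print(prefix_lines("first\nsecond", "> "))
```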


Examples

These examples come from an earlier version of this page: Notepad++ RegExp Help, by Georg Dembowski


Add more examples using advanced features of PCRE
IMPORTANT
  • You have to check the box "regular expression" in search & replace dialog
  • When copying the strings out of here, pay close attention not to have additional spaces in front of them, otherwise the RegExp will not work!

Example 0

How to replace/delete full lines according to a regex pattern? Let's say you wish to delete all the lines in a file that contain the word "unused", without leaving blank lines in their stead. This means you need to locate the line, remove it all, and additionally remove its terminating newline.
So, you'd want to do this:
  • Find: ^.*?unused.*?$\R
  • Replace with: nothing, not even a space
The regular expression appears to always work, and is to be read like this:
  • assert the start of a line
  • match some characters, stopping as early as required for the expression to match
  • the string you search in the file, "unused"
  • more characters, again stopping at the earliest necessary for the expression to match
  • assert line ends
  • A newline character or sequence
Remember that .* gobbles everything to the end of line if ". matches newline" is off, and to the end of file if the option is on!
Well, why was "appears" emphasised above? Because this expression assumes each line ends with an end of line sequence. This is almost always true, but may fail for the last line in the file. It won't match and won't be deleted.
But the remedy is fairly simple: we translate into regex parlance that the newline should match if it is there. So the correct expression actually is:
^.*?unused.*?$\R?
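The corrected expression can be tried on a small sample, including a last line without a terminating newline. This is a sketch under assumptions: it uses Python's built-in re module, which has no \R, so an explicit alternation of line endings is swapped in, and MULTILINE is set so that ^ and $ work per line. The names NEWLINE and drop_lines are made up.

```python
import re

# stand-in for \R, which Python's re module does not provide
NEWLINE = r"(?:\r\n|\r|\n)"

# same shape as ^.*?unused.*?$\R? with the newline made optional
UNUSED_LINE = re.compile(r"^.*?unused.*?$" + NEWLINE + r"?", re.MULTILINE)

def drop_lines(text):
    """Delete every line containing "unused", together with its line ending."""
    return UNUSED_LINE.sub("", text)

sample = "keep me\nthis is unused\nkeep me too\nunused at end"
print(repr(drop_lines(sample)))
```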

Example 1

You use a MediaWiki (e.g. Wikipedia, Wikitravel) and want to make all headings one "level higher", so an H2 becomes an H1 etc.
    • Search ^=(=)
    • Replace with \1
    • Click "Replace all"

      You do this to find all headings of level 2 to 9 (two equal sign characters are required) which begin at line beginning (^) and to replace the two equal sign characters by only the last of the two, so eliminating one and having one remaining.
    • Search =(=)$
    • Replace with \1
    • Click "Replace all"

      You do this to find all headings of level 2 to 9 (two equal sign characters are required) which end at line ending ($) and to replace the two equal sign characters by only the last of the two, so eliminating one and having one remaining.
== title == became = title =, you're done :-)
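The two replace passes of this example can be chained in a short script. Python's built-in re module is assumed, with MULTILINE so that ^ and $ apply to each line; the function name promote_headings is made up for the illustration.

```python
import re

def promote_headings(wikitext):
    """Remove one "=" from each end of every heading of level 2 or deeper."""
    # pass 1: two leading "=" become one
    step1 = re.sub(r"^=(=)", r"\1", wikitext, flags=re.MULTILINE)
    # pass 2: two trailing "=" become one
    return re.sub(r"=(=)$", r"\1", step1, flags=re.MULTILINE)

print(promote_headings("== title ==\ntext\n=== sub ==="))
```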

Example 2

You have a document with a lot of dates, which are in German date format (dd.mm.yy) and you'd like to transform them to sortable format (yy-mm-dd). Don't be put off by the length of the search term: it's long, but consists of pretty easy and short parts.
Do the following:
  • Search ([^0-9])([0123][0-9])\.([01][0-9])\.([0-9][0-9])([^0-9])
  • Replace with \1\4-\3-\2\5
  • Click "Replace all"
You do this to fetch
  • the day, whose first number can only be 0, 1, 2 or 3
  • the month, whose first number can only be 0 or 1
  • but only if the separator is . and not 'any character' ( . versus \. )
  • but only if no numbers are surrounding the date, as then it might be part of a longer number instead of a date
and to write all of this in the opposite order, except for the surroundings. Pay attention: Whatever SEARCH matches will be deleted and only replaced by the stuff in the REPLACE field, thus it is mandatory to have the surroundings in the REPLACE field as well!
Outcome:
  • 31.12.97 became 97-12-31
  • 14.08.05 became 05-08-14
  • caution: a dot counts as a non-digit surrounding, so the first three groups of an IP address such as 14.13.14.14 can still match; check dotted numbers by hand
You're done :-)
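The search and replace strings of this example can be exercised on a sample sentence. The sketch assumes Python's built-in re module, which accepts both strings unchanged; the function name to_sortable and the sample text are made up.

```python
import re

# surrounding non-digit, day, month, year, surrounding non-digit
DATE = re.compile(r"([^0-9])([0123][0-9])\.([01][0-9])\.([0-9][0-9])([^0-9])")

def to_sortable(text):
    """Rewrite dd.mm.yy as yy-mm-dd, keeping the surrounding characters."""
    return DATE.sub(r"\1\4-\3-\2\5", text)

print(to_sortable("on 31.12.97 and 14.08.05."))
```

Because the pattern requires one non-digit character on each side, a date sitting at the very start or end of the text is left untouched.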

Example 3

On Windows, you have written a file list to filelist.txt using dir /b/s >filelist.txt and want to make local URLs out of the entries.
  1. Open filelist.txt with Notepad++
    • Search \\
    • Replace with /
    • Click "Replace all" to change windows path separator char \ into URL path separator char /
    • Search ^(.*)$
    • Replace with file:///\1
    • Click "Replace all" to add file:/// in the beginning of all lines
Depending on your requirements, proceed to escape some characters, like space to %20 etc. C:\!\aktuell.csv became file:///C:/!/aktuell.csv
You're done :-)
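Both passes of this example can be combined in a few lines. Python's built-in re module is assumed, with MULTILINE so that ^ and $ apply to each line of the listing; the function name to_file_urls is hypothetical, and escaping characters such as spaces is left out.

```python
import re

def to_file_urls(listing):
    """Turn each Windows path in the listing into a file:/// URL."""
    # pass 1: Windows separator \ becomes URL separator /
    slashes = re.sub(r"\\", "/", listing)
    # pass 2: prepend file:/// to every line
    return re.sub(r"^(.*)$", r"file:///\1", slashes, flags=re.MULTILINE)

print(to_file_urls("C:\\!\\aktuell.csv"))
```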

Example 4

Another Search Replace Example
[Data]
AS AF AFG 004 Afghanistan
EU AX ALA 248 Åland Islands
EU AL ALB 008 Albania, People's Socialist Republic of
AF DZ DZA 012 Algeria, People's Democratic Republic of
OC AS ASM 016 American Samoa
EU AD AND 020 Andorra, Principality of
AF AO AGO 024 Angola, Republic of
NA AI AIA 660 Anguilla
AN AQ ATA 010 Antarctica (the territory South of 60 deg S)
NA AG ATG 028 Antigua and Barbuda
SA AR ARG 032 Argentina, Argentine Republic
AS AM ARM 051 Armenia
NA AW ABW 533 Aruba
OC AU AUS 036 Australia, Commonwealth of
  • Search for: ([A-Z]+) ([A-Z]+) ([A-Z]+) ([0-9]+) (.*)
  • Replace with: \1,\2,\3,\4,\5
  • Hit "Replace All"
Final Data:
AS,AF,AFG,004,Afghanistan
EU,AX,ALA,248,Åland Islands
EU,AL,ALB,008,Albania, People's Socialist Republic of
AF,DZ,DZA,012,Algeria, People's Democratic Republic of
OC,AS,ASM,016,American Samoa
EU,AD,AND,020,Andorra, Principality of
AF,AO,AGO,024,Angola, Republic of
NA,AI,AIA,660,Anguilla
AN,AQ,ATA,010,Antarctica (the territory South of 60 deg S)
NA,AG,ATG,028,Antigua and Barbuda
SA,AR,ARG,032,Argentina, Argentine Republic
AS,AM,ARM,051,Armenia
NA,AW,ABW,533,Aruba
OC,AU,AUS,036,Australia, Commonwealth of
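The same search and replace strings can be applied line by line in a script. This is a sketch assuming Python's built-in re module, which accepts them unchanged; the function name to_csv_row is made up for the illustration.

```python
import re

# three uppercase codes, a numeric code, then the rest of the line
ROW = re.compile(r"([A-Z]+) ([A-Z]+) ([A-Z]+) ([0-9]+) (.*)")

def to_csv_row(line):
    """Replace the first four separating spaces with commas."""
    return ROW.sub(r"\1,\2,\3,\4,\5", line)

print(to_csv_row("AS AF AFG 004 Afghanistan"))
print(to_csv_row("EU AX ALA 248 Åland Islands"))
```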

Example 5

How to recognize a balanced expression, in mathematics or in programming?
Let's first explicitly describe what we wish to match. An expression is balanced if and only if all areas delineated by parentheses contain a balanced expression. Like in: 1+f(x+g())-h(2).
This leads us to define the following kinds of groups:
balanced ::= no_paren paren ... no_paren
no_paren ::= [^()]* -- a possibly empty group of characters without a single parenthesis
paren ::= ( balanced )
Can we represent this as a regex? We cannot as-is.
The first hurdle is that there is no primitive construct to represent an alternating sequence of tokens. A common trick then is to represent the sequence as a repetition of the repeating pattern - here, no_paren followed by paren -, with any odd stuff at the end added.
So we have a more manageable, although slightly more complex, representation:
balanced ::= simple* no_paren
simple ::= no_paren paren
no_paren ::= [^()]*
paren ::= ( balanced )

A second hurdle is that parentheses are not ordinary characters. That's ok, we'll escape them as \( and \) respectively.
The third one is more interesting. How do we represent the whole of an expression inside a nested sub-expression? This smacks of recursion. PCRE has recursion. The simplest form of it is going back to the start of the search pattern - not the searched text! - and doing it again. It is written as (?R). You remember seeing this one in the main list, right?
So:
  • we know how to match a no_paren. It will be nicer to give it an explicit name. This we'll do in the embellishments section below.
  • we just discovered how to write a paren: \((?R)\)
This gives us the following hard to read, but correct regex:
([^()]*\((?R)\))*[^()]*
Try it, it works. But it is about as hard to decrypt as a badly indented piece of code without a comment and with unpromising, unclear identifiers. This is only one of the reasons why old Perl earned itself the rare qualifier of "write-only language".
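Python's built-in re module has no (?R), so the regex itself cannot be run there. As a rough cross-check of the grammar, the sketch below mirrors balanced, no_paren and paren with a plain recursive function instead of a regex. The names match_balanced and is_balanced are made up for this illustration, and the check is anchored to the whole text.

```python
def match_balanced(text, pos=0):
    """Return the index just past the longest balanced run starting at pos."""
    while True:
        # no_paren ::= [^()]*
        while pos < len(text) and text[pos] not in "()":
            pos += 1
        # paren ::= ( balanced ), the recursive call plays the role of (?R)
        if pos < len(text) and text[pos] == "(":
            inner_end = match_balanced(text, pos + 1)
            if inner_end < len(text) and text[inner_end] == ")":
                pos = inner_end + 1
                continue  # simple* : try another no_paren paren pair
        return pos

def is_balanced(text):
    """True if the whole text is a balanced expression."""
    return match_balanced(text) == len(text)

print(is_balanced("1+f(x+g())-h(2)"))
print(is_balanced("f(x"))
```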

Embellishments
First of all, let's add some spacing so that we can identify the components of the regex. Spacing can be added using the x modifier flag, which is off by default.
So we can write something more legible:
(?x:  ([^ ( ) ]* \( (?R) \) )* [^()]* )
Now let's add some commenting
(?x:  ([^ ( ) ]* \( (?# The next group means "start matching the 
beginning of the regex")(?R) \) )* [^()]* )
In Perl, we could go further by assigning names to groups. However, in PCRE this will not work, because any named group, once matched, won't change. This is obviously not what we want.

Adapted to MediaWiki format by CChris



Jesús Moreno - Technical Engineer in Computer Science - IT consultant

Hi, I'm Jesús Moreno, Technical Engineer in Computer Systems from the US and owner of this blog. My work in recent years has ...