Members of "word character" set?

Post by bobkoure » Sun Jan 21, 2007 2:00 pm

Other than a-z, A-Z, and 0-9, which characters are considered part of the "word char" ( \w ) set in the regex engine used by Newsbin?

I'm asking because the underscore char '_' is part of the \w set, yet it is often used as a separator char. If you were looking for the artist "john doe" and were semi-clever with regex, you might use the filter "john\W*doe", but that misses "john_doe". So you use "john[\W_]*doe" instead (or "john(\W|_)*doe" if using a negated set in another set strikes you as weird).
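To make the difference concrete, here is a minimal sketch. It assumes Python's `re` module as a stand-in, since the thread does not say which engine Newsbin uses; the subject strings are made up for illustration.

```python
import re

# Hypothetical subject lines; Python's `re` is only a stand-in for
# whatever regex engine Newsbin actually uses.
subjects = ["john doe", "john.doe", "john - doe", "johndoe", "john_doe"]

# \W excludes '_' (underscore is a word character), so "john_doe" is missed.
naive = re.compile(r"john\W*doe")

# Adding '_' to the set treats the underscore as a separator as well.
fixed = re.compile(r"john[\W_]*doe")

naive_hits = [s for s in subjects if naive.search(s)]
fixed_hits = [s for s in subjects if fixed.search(s)]

print("naive:", naive_hits)
print("fixed:", fixed_hits)
```

Only the second pattern picks up "john_doe"; both accept the space, dot, and hyphen separators.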

So... I'm wondering what other characters might be part of \w - especially those that might be commonly used as separators.

... and thanks!
Bob
bobkoure
 

Post by FrizzleFry » Sun Jan 21, 2007 11:01 pm

According to the RE Cheat Sheet:

\w
Matches any word character. Equivalent to the Unicode character categories [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \w is equivalent to [a-zA-Z_0-9].
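For reference, \p{Pc} is the Unicode "connector punctuation" category, which is where the underscore lives. The sketch below uses Python's `unicodedata` module to check characters against the category list quoted above. It is not Newsbin's own matching code; `WORD_CATEGORIES` and `is_word_char` are illustrative names, and the scan is limited to the Basic Multilingual Plane to keep it quick.

```python
import unicodedata

# Categories the quoted cheat sheet lists for \w.
WORD_CATEGORIES = {"Ll", "Lu", "Lt", "Lo", "Nd", "Pc"}


def is_word_char(ch: str) -> bool:
    """True if ch falls in one of the categories quoted for \\w."""
    return unicodedata.category(ch) in WORD_CATEGORIES


# Common separator candidates: print each one's category and verdict.
for ch in ["_", "-", ".", " ", "+", "~"]:
    print(repr(ch), unicodedata.category(ch), is_word_char(ch))

# Every connector-punctuation (Pc) character in the Basic Multilingual Plane.
pc_chars = [
    chr(cp) for cp in range(0x10000)
    if unicodedata.category(chr(cp)) == "Pc"
]
for ch in pc_chars:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Under this definition the underscore is the only ASCII separator that counts as a word character; the hyphen, dot, and space do not.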
FrizzleFry

Post by bobkoure » Mon Jan 22, 2007 4:26 pm

Care to decipher that?
For instance, \p{P} is "punctuation", but \p{Pc} is...?
I assume you do know what the symbols you're quoting mean...
bobkoure
 

