C# - Regex filter for Latin and non-Latin characters
I am developing a program where I need to filter words and sentences that contain non-Latin characters. The problem is that I can find words and sentences made up only of Latin characters, but I cannot find words and sentences that mix Latin and non-Latin characters. For example, "hello" is a word of Latin letters, and I can match it using this code:
Match match = Regex.Match(line.line, @"[^\u0000-\u007F]+", RegexOptions.IgnoreCase);
if (match.Success) { line.line = match.Groups[1].Value; }
But I cannot find, for example, a word or sentence that mixes in non-Latin letters: "hellø sømthing".
Also, can someone explain what RegexOptions.None and RegexOptions.IgnoreCase stand for?
The 4 "Latin" blocks (from http://www.fileformat.info/info/unicode/block/index.htm):
Basic Latin U+0000 - U+007F
Latin-1 Supplement U+0080 - U+00FF
Latin Extended-A U+0100 - U+017F
Latin Extended-B U+0180 - U+024F
So a regex that "includes" all of them would be:
Regex.Match(line.line, @"[\u0000-\u024F]+", RegexOptions.None);
while a regex that catches everything outside those blocks would be:
Regex.Match(line.line, @"[^\u0000-\u024F]+", RegexOptions.None);
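As a minimal sketch of how the two patterns above could be used to classify a string, the class below wraps them in two helpers. The names LatinFilter, IsOnlyLatin and ContainsNonLatin are hypothetical, and the ^ and $ anchors on the first pattern are an added assumption so that the whole string is checked rather than just one run of characters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LatinFilter
{
    // Assumption: anchored so the WHOLE string must lie inside U+0000-U+024F.
    private static readonly Regex OnlyLatinBlocks =
        new Regex(@"^[\u0000-\u024F]+$", RegexOptions.None);

    // Finds a run of characters outside U+0000-U+024F anywhere in the string.
    private static readonly Regex OutsideLatinBlocks =
        new Regex(@"[^\u0000-\u024F]+", RegexOptions.None);

    // True when every character belongs to one of the four Latin blocks.
    public static bool IsOnlyLatin(string text) => OnlyLatinBlocks.IsMatch(text);

    // True when at least one character falls outside the four Latin blocks.
    public static bool ContainsNonLatin(string text) => OutsideLatinBlocks.IsMatch(text);

    public static void Main()
    {
        Console.WriteLine(IsOnlyLatin("hello"));               // True
        Console.WriteLine(IsOnlyLatin("hellø sømthing"));      // True (ø is U+00F8)
        Console.WriteLine(ContainsNonLatin("hellø sømthing")); // False
        Console.WriteLine(ContainsNonLatin("hello мир"));      // True (Cyrillic)
    }
}
```

With the wider range, "hellø sømthing" counts as Latin, whereas the original [^\u0000-\u007F] pattern treats ø as a non-Latin character.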
Note that I feel that doing a regex "by block" is a little wrong, especially when you use the Latin blocks, because, for example, the Basic Latin block contains control characters (like new line, ...), letters (A-Z, a-z), numbers (0-9), punctuation (.,;:...), other characters ($@/&...) and so on.
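If only letters should count, one possible refinement (not part of the original suggestion) is to combine the block range with the Unicode letter category \p{L}. The sketch below assumes that a "Latin word" means one or more letters that all fall inside U+0000-U+024F; the names LatinLetters and IsLatinWord are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LatinLetters
{
    // The lookahead restricts each character to U+0000-U+024F,
    // and \p{L} then requires that character to be a letter.
    private static readonly Regex LatinWord =
        new Regex(@"^(?:(?=[\u0000-\u024F])\p{L})+$", RegexOptions.None);

    // True when the text consists only of letters from the four Latin blocks.
    public static bool IsLatinWord(string text) => LatinWord.IsMatch(text);

    public static void Main()
    {
        Console.WriteLine(IsLatinWord("hellø"));  // True
        Console.WriteLine(IsLatinWord("hello1")); // False (digit)
        Console.WriteLine(IsLatinWord("привет")); // False (Cyrillic letters)
    }
}
```

Spaces and punctuation are rejected too, so a sentence would have to be split into words first.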
As for the meaning of RegexOptions.None and RegexOptions.IgnoreCase, their names are quite clear, and you could try looking them up on MSDN.
From https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx:
RegexOptions.None: Specifies that no options are set.
RegexOptions.IgnoreCase: Specifies case-insensitive matching.
The last one means that Regex.Match(line.line, @"ABC", RegexOptions.IgnoreCase) will match ABC, Abc, abc, and so on. The option also works on character ranges: [A-Z] will match both A-Z and a-z. Note that it is probably useless in this case, because the blocks I suggested should already contain both the uppercase and the lowercase "variation" of every letter that has both forms.
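To make the difference between the two options concrete, here is a small sketch that runs the same patterns with RegexOptions.None and RegexOptions.IgnoreCase. The class and method names are hypothetical, and the ^ and $ anchors on the range pattern are added so that the whole input is tested.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IgnoreCaseDemo
{
    // Tests the literal pattern ABC against the input with the given options.
    public static bool MatchesAbc(string input, RegexOptions options) =>
        Regex.IsMatch(input, @"ABC", options);

    // Tests whether the whole input lies in the range [A-Z] under the given options.
    public static bool InUpperRange(string input, RegexOptions options) =>
        Regex.IsMatch(input, @"^[A-Z]+$", options);

    public static void Main()
    {
        Console.WriteLine(MatchesAbc("abc", RegexOptions.None));         // False
        Console.WriteLine(MatchesAbc("abc", RegexOptions.IgnoreCase));   // True
        Console.WriteLine(InUpperRange("hello", RegexOptions.None));       // False
        Console.WriteLine(InUpperRange("hello", RegexOptions.IgnoreCase)); // True
    }
}
```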