regex - Python removing punctuation from unicode string except apostrophe
I found several topics on this and found this solution:
sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)
This should remove every punctuation mark except ', but the problem is that it also strips everything else from the sentence.
Example:
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'
Of course I want to keep the sentence without punctuation, and "warhol's" should stay as is.
Desired output:
"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"
Edit: I also tried using
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode) if unicodedata.category(unichr(i)).startswith('P'))
sentence = sentence.translate(tbl)
but it strips every punctuation mark.
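For reference, the translation-table idea can be made to spare selected characters by leaving them out of the table. The following is only a sketch, written for Python 3 (the thread itself uses Python 2), and the names KEEP, PUNCT_TABLE and drop_punctuation are made up for illustration:

```python
import sys
import unicodedata

# Assumption: these are the punctuation characters that should survive.
KEEP = {"'", "-"}

# Map every code point whose Unicode category starts with "P" (punctuation)
# to None so that str.translate deletes it, skipping the characters in KEEP.
PUNCT_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P") and chr(cp) not in KEEP
}

def drop_punctuation(text):
    """Delete punctuation from text, except the characters listed in KEEP."""
    return text.translate(PUNCT_TABLE)

print(drop_punctuation("warhol's art used many types of media, including film, and music."))
print(drop_punctuation("austro-hungarian empire"))
```

Building the table scans the whole Unicode range once, so it is best created a single time and reused. Symbols such as $ or + are in the S categories rather than P, so this table leaves them alone.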
Specify the elements you don't want removed, i.e. \w, \d, \s, etc. This is what the ^ operator means inside square brackets: it matches anything except the listed characters.
>>> import re
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> print re.sub(ur"[^\w\d'\s]+",'',sentence)
warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music
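The same negated-class approach might look like this in Python 3, as a rough sketch rather than a definitive version. The function name strip_punctuation and its keep parameter are hypothetical. The hyphen is added to the kept characters so that "austro-hungarian" survives, and \d is dropped because \w already covers digits. Note that the standard re module does not understand \p{...} property classes, which is why the pattern in the question did not behave as intended.

```python
import re

def strip_punctuation(text, keep="'-"):
    """Remove every character that is not a word character, whitespace,
    or one of the characters in keep (assumed default: apostrophe, hyphen)."""
    # re.escape protects characters such as "-" that are special inside [...].
    pattern = re.compile(r"[^\w\s" + re.escape(keep) + r"]+")
    return pattern.sub("", text)

print(strip_punctuation("warhol's art used many types of media, including film, and music."))
print(strip_punctuation("austro-hungarian empire"))
```

In Python 3, \w is Unicode-aware for str patterns by default, so accented letters are kept. In Python 2 the re.UNICODE flag is needed for the same behaviour. The underscore counts as a word character and is therefore kept as well.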