regex - Python removing punctuation from unicode string except apostrophe


I found several topics on this, and I found this solution:

sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)

This should remove every punctuation mark except ', but the problem is that it also strips everything else from the sentence.

Example:

>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'

Of course, I want to keep the sentence, just without the punctuation, and "warhol's" should stay as is.

Desired output:

"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"

Edit: I also tried using

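import sys
import unicodedata

# Map every code point in a punctuation category (P*) to None so translate() deletes it.
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                    if unicodedata.category(unichr(i)).startswith('P'))
sentence = sentence.translate(tbl)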

but this strips every punctuation mark, including the apostrophe and the hyphen that I want to keep.
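For reference, the table-based attempt above can be adjusted so that it skips the characters to be kept. The following is only a sketch, written for Python 3 (where `chr` and `range` replace `unichr` and `xrange`); the names `KEEP`, `PUNCT_TABLE`, and `strip_punctuation` are made up for illustration and are not from the original post.

```python
import sys
import unicodedata

# Assumption: these are the only punctuation characters that should survive.
KEEP = {"'", "-"}

# Map every code point whose Unicode category starts with "P" (punctuation)
# to None, unless it is listed in KEEP. str.translate() deletes None entries.
PUNCT_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P") and chr(cp) not in KEEP
}


def strip_punctuation(text):
    """Return text with all Unicode punctuation removed except KEEP."""
    return text.translate(PUNCT_TABLE)


print(strip_punctuation("warhol's art, film, and music."))
print(strip_punctuation("austro-hungarian empire!"))
```

Note that only the ASCII apostrophe and hyphen-minus are spared here; a typographic apostrophe (U+2019) is itself in a punctuation category and would be removed unless it is added to `KEEP`.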

Specify the elements you don't want removed, i.e. \w, \d, \s, etc. This is what the ^ operator means inside square brackets: it negates the class, so the pattern matches anything except the listed characters.

>>> import re
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> print re.sub(ur"[^\w\d'\s]+",'',sentence)
warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music
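The same idea can be sketched for Python 3, with a literal hyphen added to the class so that "austro-hungarian" from the question also survives. This is a minimal illustration, not a definitive implementation: the helper name `remove_punctuation` is hypothetical, and `\d` is dropped only because `\w` already covers digits. In Python 3, `str` patterns are Unicode-aware by default, so `\w` matches accented letters too. The built-in `re` module has no `\p{...}` property classes, which is why the pattern in the question did not behave as a punctuation class.

```python
import re

# Negated class: delete every run of characters that is NOT a word character,
# whitespace, an apostrophe, or a hyphen. The hyphen is placed last so it is
# read as a literal character rather than as a range operator.
_PUNCT = re.compile(r"[^\w\s'-]+")


def remove_punctuation(text):
    """Hypothetical helper: strip punctuation but keep ' and -."""
    return _PUNCT.sub("", text)


print(remove_punctuation("warhol's art used many types of media, film, and music."))
print(remove_punctuation("austro-hungarian empire."))
```

One caveat of this approach is that `\w` also matches the underscore, so underscores are kept as well.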
