Python - regex with Japanese letters matches only one character
I'm trying to find words in Japanese addresses so that I can scrub them. If there is a single character, the regex works fine, but I don't seem to be able to find strings of 2 characters or more:
import re
add = u"埼玉県川口市金山町12丁目1-104番地"
test = re.search(ur'["番地"|"丁目"]', add)
print test.group(0)

丁

I can use re.findall instead of re.search, but it puts all of the findings in a list, and then I have to parse the list. If that's the best way, I can live with it, but I figure I'm missing something.
In the example above, I want to swap "丁目" for a dash and remove the trailing "番地", so that the address reads thusly:
埼玉県川口市金山町12-1-104
You're using | inside a character class ([....]). A character class matches any single one of the characters listed in it, which is not what you want.
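To make the difference concrete, here is a small sketch comparing the bracketed pattern from the question with a plain alternation. It is written for Python 3 (an assumption; the question itself uses Python 2 syntax), and the variable names `address`, `as_class` and `as_alternation` are only illustrative.

```python
import re

address = u"埼玉県川口市金山町12丁目1-104番地"

# Inside [...], every character is its own alternative, so this class means
# "one of: a double quote, 番, 地, |, 丁, 目". It can never match two characters.
as_class = re.findall(u'["番地"|"丁目"]', address)

# Outside brackets, | separates whole alternatives, so each word matches as a unit.
as_alternation = re.findall(u'番地|丁目', address)

print(as_class)        # ['丁', '目', '番', '地']
print(as_alternation)  # ['丁目', '番地']
```

With re.search, only the first of those single-character hits is returned, which is why the question's code prints just 丁.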
Specify the pattern without a character class (and also without the " characters):
>>> import re
>>> add = u"埼玉県川口市金山町12丁目1-104番地"
>>> test = re.search(ur'番地|丁目', add)
>>> test.group(0)
u'\u4e01\u76ee'
>>> print test.group(0)
丁目

To get what you want, use str.replace (unicode.replace) and re.sub.
>>> print re.sub(u'番地$', u'', add.replace(u'丁目', u'-'))
埼玉県川口市金山町12-1-104

$ is used to match at the end of the string. If the position of 番地 does not matter, a regular expression is not needed; str.replace is enough:
>>> print add.replace(u'丁目', u'-').replace(u'番地', u'')
埼玉県川口市金山町12-1-104
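The two approaches only differ when 番地 appears somewhere other than the end of the string. The following sketch wraps each one in a helper and runs both on the address from the question and on a made-up string. It assumes Python 3; the function names `scrub_trailing` and `scrub_everywhere` and the string `made_up` are hypothetical and exist only for illustration.

```python
import re


def scrub_trailing(address):
    """Swap 丁目 for a dash and drop 番地 only when it ends the string."""
    return re.sub(u"番地$", u"", address.replace(u"丁目", u"-"))


def scrub_everywhere(address):
    """Swap 丁目 for a dash and drop every 番地, wherever it occurs."""
    return address.replace(u"丁目", u"-").replace(u"番地", u"")


address = u"埼玉県川口市金山町12丁目1-104番地"
made_up = u"1番地の2番地"  # invented input with 番地 in the middle as well

print(scrub_trailing(address))    # 埼玉県川口市金山町12-1-104
print(scrub_everywhere(address))  # 埼玉県川口市金山町12-1-104
print(scrub_trailing(made_up))    # 1番地の2
print(scrub_everywhere(made_up))  # 1の2
```

For the address in the question both helpers give the same result, so the choice only matters if your data can contain 番地 in the middle of a string.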