Python - regex with Japanese letters matches only one character


I'm trying to find words in Japanese addresses so that I can scrub them. If there is a single character, the regex works fine, but I don't seem to be able to find strings of 2 characters or more:

import re
add = u"埼玉県川口市金山町12丁目1-104番地"
test = re.search(ur'["番地"|"丁目"]', add)
print test.group(0)
丁

I can use re.findall instead of re.search, but it puts all of the findings in a tuple, and then I have to parse the tuple. If that's the best way I can live with it, but I figure I'm missing something.

In the example above, I want to swap "丁目" for a dash and remove the trailing "番地", so that the address reads thusly:

埼玉県川口市金山町12-1-104

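You're using | inside a character class ([....]). A character class matches any single one of the characters listed in it (here including the " and | themselves), which is not what you want.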
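To see the difference concretely, here is a small sketch. It assumes Python 3 (where every str is Unicode and no u prefix is needed), unlike the Python 2 session in this post; the variable names are only illustrative.

```python
import re

address = "埼玉県川口市金山町12丁目1-104番地"

# Character class from the question: every match is exactly ONE character
# taken from the set { " 番 地 | 丁 目 }.
char_class = re.findall(r'["番地"|"丁目"]', address)

# Alternation without brackets or quotes: every match is one whole word.
alternation = re.findall(r'番地|丁目', address)

print(char_class)   # ['丁', '目', '番', '地']
print(alternation)  # ['丁目', '番地']
```

re.search returns only the first of these matches, which is why the question's code printed just 丁.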

Specify the pattern without a character class (and also without the ").

>>> import re
>>> add = u"埼玉県川口市金山町12丁目1-104番地"
>>> test = re.search(ur'番地|丁目', add)
>>> test.group(0)
u'\u4e01\u76ee'
>>> print test.group(0)
丁目

To get what you want, use str.replace (unicode.replace) and re.sub.

>>> print re.sub(u'番地$', u'', add.replace(u'丁目', u'-'))
埼玉県川口市金山町12-1-104

$ is used to match at the end of the string. If the position of 番地 does not matter, a regular expression is not needed; str.replace is enough:

>>> print add.replace(u'丁目', u'-').replace(u'番地', u'')
埼玉県川口市金山町12-1-104
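On Python 3 the ur'' prefix is a syntax error and print is a function, so the session above will not run unchanged. A minimal sketch of the same scrub for Python 3 follows; scrub_address is a hypothetical helper name, not something from the original post.

```python
import re

def scrub_address(address):
    """Hypothetical helper: swap 丁目 for a dash and drop a trailing 番地."""
    with_dash = address.replace("丁目", "-")
    # The $ anchor limits the removal to a 番地 at the end of the string;
    # a 番地 appearing earlier in the address is left alone.
    return re.sub(r"番地$", "", with_dash)

print(scrub_address("埼玉県川口市金山町12丁目1-104番地"))
# 埼玉県川口市金山町12-1-104
```

If 番地 should be removed wherever it appears, the re.sub call can be dropped in favour of a second str.replace.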
