Python - regex with Japanese letters matches only one character
I'm trying to find words in Japanese addresses so that I can scrub them. If there is a single character, the regex works fine, but I don't seem to be able to find strings of 2 characters or more:
    import re
    add = u"埼玉県川口市金山町12丁目1-104番地"
    test = re.search(ur'["番地"|"丁目"]',add)
    print test.group(0)
    丁
I can use re.findall instead of re.search, but it puts all of the findings in a list, and then I have to parse the list. If that's the best way, I can live with it, but I figure I'm missing something.
In the example above, I want to swap "丁目" for a dash and remove the trailing "番地", so that the address reads thusly:
埼玉県川口市金山町12-1-104
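You're using | inside a character class ([....]). A character class matches any single one of the characters listed in it, so | and " are treated as literal characters there; that is not what you want.
Specify the pattern without a character class (and without the " characters):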
    >>> import re
    >>> add = u"埼玉県川口市金山町12丁目1-104番地"
    >>> test = re.search(ur'番地|丁目', add)
    >>> test.group(0)
    u'\u4e01\u76ee'
    >>> print test.group(0)
    丁目
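To make the difference concrete, here is a minimal sketch comparing the two patterns on the same address. It assumes Python 3, where every str is Unicode and the ur'' prefix used above is no longer valid syntax; the variable names are illustrative only.

```python
import re

address = "埼玉県川口市金山町12丁目1-104番地"

# A character class matches exactly one character from its set, so the
# '|' and '"' inside the brackets are just literal members of that set.
char_class = re.compile(r'["番地"|"丁目"]')

# Alternation outside brackets matches either complete word.
alternation = re.compile(r"番地|丁目")

first_char_class = char_class.search(address).group(0)
first_alternation = alternation.search(address).group(0)

# With no capturing groups, findall returns a flat list of matched strings.
all_char_class = char_class.findall(address)
all_alternation = alternation.findall(address)

print(first_char_class)   # a single character
print(first_alternation)  # the whole word
print(all_char_class)
print(all_alternation)
```

The character-class version reports each of 丁, 目, 番 and 地 as a separate one-character match, while the alternation reports the two words whole.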
To get what you want, use str.replace (unicode.replace) and re.sub.
    >>> print re.sub(u'番地$', u'', add.replace(u'丁目', u'-'))
    埼玉県川口市金山町12-1-104
$ is used to match at the end of the string. If the position of 番地 does not matter, a regular expression is not needed; str.replace is enough:
    >>> print add.replace(u'丁目', u'-').replace(u'番地', u'')
    埼玉県川口市金山町12-1-104
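The difference between anchoring with $ and a plain str.replace only shows up when 番地 also occurs somewhere other than the end of the string. The following is a rough sketch of both variants, again assuming Python 3. The function names scrub_anchored and scrub_anywhere are hypothetical, and the second test string is made up purely to show the difference; it is not a real address.

```python
import re

def scrub_anchored(address):
    # Swap every 丁目 for a dash, then drop 番地 only at the end of the string.
    return re.sub(r"番地$", "", address.replace("丁目", "-"))

def scrub_anywhere(address):
    # Swap every 丁目 for a dash and drop 番地 wherever it occurs.
    return address.replace("丁目", "-").replace("番地", "")

real = "埼玉県川口市金山町12丁目1-104番地"
made_up = "番地町1丁目2番地"  # artificial string, not a real address

print(scrub_anchored(real))
print(scrub_anywhere(real))
print(scrub_anchored(made_up))
print(scrub_anywhere(made_up))
```

For the address from the question both functions give the same result. On the artificial string, only the anchored version keeps the leading 番地.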