Python - regex with Japanese letters matches only one character

July 15, 2015

I'm trying to find words in Japanese addresses so that I can scrub them. If there is a single character, the regex works fine, but I don't seem to be able to find strings of 2 characters or more:

    import re
    add = u"埼玉県川口市金山町１２丁目１－１０４番地"
    test = re.search(ur'["番地"|"丁目"]', add)
    print test.group(0)  # prints: 丁

I can use re.findall instead of re.search, but it puts all of the findings in a tuple, and then I have to parse the tuple. If that's the best way, I can live with it, but I figure I'm missing something.

In the example above, I want to swap "丁目" for a dash and remove the trailing "番地", so that the address reads thusly:

埼玉県川口市金山町１２－１－１０４

You're using | inside a character class ([....]). A character class matches any single one of the characters listed in it, which is not what you want.

Specify the pattern without a character class (and also without the ").

    >>> import re
    >>> add = u"埼玉県川口市金山町１２丁目１－１０４番地"
    >>> test = re.search(ur'番地|丁目', add)
    >>> test.group(0)
    u'\u4e01\u76ee'
    >>> print test.group(0)
    丁目
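To see the two patterns side by side, here is a small sketch. It assumes Python 3, where every str is already Unicode and the ur'' prefix is no longer accepted; the variable names are only illustrative.

```python
import re

address = "埼玉県川口市金山町１２丁目１－１０４番地"

# Inside [...], every character is a separate single-character alternative,
# including the quotes and the | itself.
char_class = re.findall(r'["番地"|"丁目"]', address)

# Outside a character class, | separates whole alternatives.
alternation = re.findall(r'番地|丁目', address)

print(char_class)   # ['丁', '目', '番', '地']
print(alternation)  # ['丁目', '番地']
```

Note that re.findall returns a plain list of matched strings here, because the pattern contains no capture groups.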

To get what you want, use str.replace (unicode.replace) and re.sub.

    >>> print re.sub(u'番地$', u'', add.replace(u'丁目', u'－'))
    埼玉県川口市金山町１２－１－１０４

$ is used to match at the end of the string. If the position of 番地 does not matter, a regular expression is not needed; str.replace is enough:

    >>> print add.replace(u'丁目', u'－').replace(u'番地', u'')
    埼玉県川口市金山町１２－１－１０４
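The two approaches only differ when 番地 appears somewhere other than the end of the string. The sketch below, again assuming Python 3, wraps each approach in a helper to make that visible. The function names and the second sample string are made up for illustration and are not from the original question.

```python
import re

def scrub_trailing(address):
    """Swap 丁目 for a fullwidth dash and drop 番地 only at the end."""
    return re.sub(r"番地$", "", address.replace("丁目", "－"))

def scrub_anywhere(address):
    """Swap 丁目 for a fullwidth dash and drop every 番地."""
    return address.replace("丁目", "－").replace("番地", "")

original = "埼玉県川口市金山町１２丁目１－１０４番地"
print(scrub_trailing(original))  # 埼玉県川口市金山町１２－１－１０４
print(scrub_anywhere(original))  # 埼玉県川口市金山町１２－１－１０４

# Hypothetical input with 番地 in the middle of the string.
mid_string = "金山町１番地２"
print(scrub_trailing(mid_string))  # 金山町１番地２ (unchanged, nothing at the end)
print(scrub_anywhere(mid_string))  # 金山町１２
```

Which helper is appropriate depends on whether a 番地 in the middle of an address should survive the scrub.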
