python - BeautifulSoup: extract XPath or CSS path of node
I want to extract data from HTML and be able to highlight the extracted elements on the client side without modifying the source HTML. An XPath or CSS path looks great for this. Is it possible to extract an XPath or CSS path directly with BeautifulSoup?
Right now I mark the target element and use the lxml lib to extract the XPath, but the performance is bad. I know about bsxpath.py
-- it does not work with bs4. Rewriting the solution to use the native lxml lib is not acceptable due to its complexity.
import bs4
import cStringIO
import random

from lxml import etree


def get_xpath(soup, element):
    # mark the target tag with a unique temporary attribute
    _id = random.getrandbits(32)
    for e in soup():
        if e == element:
            e['data-xpath'] = _id
            break
    else:
        raise LookupError('Cannot find {} in {}'.format(element, soup))

    # re-serialize the whole soup, re-parse it with lxml,
    # then locate the marked node and ask lxml for its path
    content = unicode(soup)
    doc = etree.parse(cStringIO.StringIO(content), etree.HTMLParser())
    element = doc.xpath('//*[@data-xpath="{}"]'.format(_id))
    assert len(element) == 1
    element = element[0]
    xpath = doc.getpath(element)
    return xpath


soup = bs4.BeautifulSoup('<div id=i>hello, <b id=i test=t>world!</b></div>')
xpath = get_xpath(soup, soup.div.b)
assert '/html/body/div/b' == xpath
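For reference, a rough Python 3 adaptation of the same marker trick (a sketch only -- io.StringIO stands in for cStringIO, str() for unicode(), and the temporary attribute is removed again afterwards so the soup is left unmodified):

import io
import random

import bs4
from lxml import etree


def get_xpath(soup, element):
    marker = str(random.getrandbits(32))
    element['data-xpath'] = marker  # temporarily mark the target tag
    doc = etree.parse(io.StringIO(str(soup)), etree.HTMLParser())
    del element['data-xpath']       # clean the marker up again
    matches = doc.xpath('//*[@data-xpath="{}"]'.format(marker))
    assert len(matches) == 1
    return doc.getpath(matches[0])


soup = bs4.BeautifulSoup('<div id=i>hello, <b id=i test=t>world!</b></div>',
                         'html.parser')
print(get_xpath(soup, soup.div.b))  # prints something like /html/body/div/b

The slowness presumably comes from re-serializing and re-parsing the whole document for every lookup, which is what the pure-BeautifulSoup approach below avoids.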
It's pretty easy to extract a simple CSS path or XPath yourself, the same kind of path that the lxml lib gives you.
import bs4


def get_element(node):
    # NOTE: this counts every previous sibling (including text nodes);
    # for XPath you would have to count only siblings of the same type!
    length = len(list(node.previous_siblings)) + 1
    if length > 1:
        return '%s:nth-child(%s)' % (node.name, length)
    else:
        return node.name


def get_css_path(node):
    path = [get_element(node)]
    for parent in node.parents:
        if parent.name == 'body':
            break
        path.insert(0, get_element(parent))
    return ' > '.join(path)


soup = bs4.BeautifulSoup('<div></div><div><strong><i>bla</i></strong></div>')
assert get_css_path(soup.i) == 'div:nth-child(2) > strong > i'
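Following the note in get_element about counting only same-type siblings, here is a minimal sketch of the XPath flavour of the same idea (the helper names xpath_segment and build_xpath are my own, not part of the answer, and it assumes the markup actually contains html/body tags):

import bs4


def xpath_segment(node):
    # one XPath step: tag name plus a 1-based index among same-tag siblings
    same_tag = [s for s in node.previous_siblings
                if getattr(s, 'name', None) == node.name]
    return '%s[%d]' % (node.name, len(same_tag) + 1)


def build_xpath(node):
    # absolute XPath built by walking node.parents up to the soup object
    path = [xpath_segment(node)]
    for parent in node.parents:
        if parent.name == '[document]':  # the BeautifulSoup object itself
            break
        path.insert(0, xpath_segment(parent))
    return '/' + '/'.join(path)


soup = bs4.BeautifulSoup(
    '<html><body><div></div><div><strong><i>bla</i></strong></div></body></html>',
    'html.parser')
assert build_xpath(soup.i) == '/html[1]/body[1]/div[2]/strong[1]/i[1]'

Always emitting the [n] index keeps the expression unambiguous even when there is only one sibling of that tag; XPath engines accept it either way.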