python - BeautifulSoup: extract XPath or CSS path of node
I want to extract data from HTML and be able to highlight the extracted elements on the client side without modifying the source HTML. An XPath or CSS path looks great for this. Is it possible to extract an XPath or CSS path directly with BeautifulSoup?
Right now I mark the target element and use the lxml lib to extract the XPath, but the performance is bad. I know about bsxpath.py
-- it does not work with bs4. Rewriting the solution to use the native lxml lib is not acceptable due to its complexity.
import bs4
import cStringIO
import random

from lxml import etree


def get_xpath(soup, element):
    # mark the target tag with a unique temporary attribute
    _id = random.getrandbits(32)
    for e in soup():
        if e == element:
            e['data-xpath'] = _id
            break
    else:
        raise LookupError('Cannot find {} in {}'.format(element, soup))

    # re-serialize the whole soup, re-parse it with lxml,
    # then locate the marked node and ask lxml for its path
    content = unicode(soup)
    doc = etree.parse(cStringIO.StringIO(content), etree.HTMLParser())
    element = doc.xpath('//*[@data-xpath="{}"]'.format(_id))
    assert len(element) == 1
    element = element[0]
    xpath = doc.getpath(element)
    return xpath


soup = bs4.BeautifulSoup('<div id=i>hello, <b id=i test=t>world!</b></div>')
xpath = get_xpath(soup, soup.div.b)
assert '/html/body/div/b' == xpath
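For reference, a rough Python 3 adaptation of the same marker trick (a sketch only -- io.StringIO stands in for cStringIO, str() for unicode(), and the temporary attribute is removed again afterwards so the soup is left unmodified):

import io
import random

import bs4
from lxml import etree


def get_xpath(soup, element):
    marker = str(random.getrandbits(32))
    element['data-xpath'] = marker  # temporarily mark the target tag
    doc = etree.parse(io.StringIO(str(soup)), etree.HTMLParser())
    del element['data-xpath']       # clean the marker up again
    matches = doc.xpath('//*[@data-xpath="{}"]'.format(marker))
    assert len(matches) == 1
    return doc.getpath(matches[0])


soup = bs4.BeautifulSoup('<div id=i>hello, <b id=i test=t>world!</b></div>',
                         'html.parser')
print(get_xpath(soup, soup.div.b))  # prints something like /html/body/div/b

The slowness presumably comes from re-serializing and re-parsing the whole document for every lookup, which is what the pure-BeautifulSoup approach below avoids.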
It's pretty easy to extract a simple CSS path or XPath yourself, the same kind of path that the lxml lib gives you.
import bs4


def get_element(node):
    # NOTE: this counts every previous sibling (including text nodes);
    # for XPath you would have to count only siblings of the same type!
    length = len(list(node.previous_siblings)) + 1
    if length > 1:
        return '%s:nth-child(%s)' % (node.name, length)
    else:
        return node.name


def get_css_path(node):
    path = [get_element(node)]
    for parent in node.parents:
        if parent.name == 'body':
            break
        path.insert(0, get_element(parent))
    return ' > '.join(path)


soup = bs4.BeautifulSoup('<div></div><div><strong><i>bla</i></strong></div>')
assert get_css_path(soup.i) == 'div:nth-child(2) > strong > i'
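Following the note in get_element about counting only same-type siblings, here is a minimal sketch of the XPath flavour of the same idea (the helper names xpath_segment and build_xpath are my own, not part of the answer, and it assumes the markup actually contains html/body tags):

import bs4


def xpath_segment(node):
    # one XPath step: tag name plus a 1-based index among same-tag siblings
    same_tag = [s for s in node.previous_siblings
                if getattr(s, 'name', None) == node.name]
    return '%s[%d]' % (node.name, len(same_tag) + 1)


def build_xpath(node):
    # absolute XPath built by walking node.parents up to the soup object
    path = [xpath_segment(node)]
    for parent in node.parents:
        if parent.name == '[document]':  # the BeautifulSoup object itself
            break
        path.insert(0, xpath_segment(parent))
    return '/' + '/'.join(path)


soup = bs4.BeautifulSoup(
    '<html><body><div></div><div><strong><i>bla</i></strong></div></body></html>',
    'html.parser')
assert build_xpath(soup.i) == '/html[1]/body[1]/div[2]/strong[1]/i[1]'

Always emitting the [n] index keeps the expression unambiguous even when there is only one sibling of that tag; XPath engines accept it either way.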