I have some html that I want to extract text from. Here's an example of the html:
<p>TEXT I WANT <i> – </i></p>
Now, there are obviously lots of <p> tags in this document, so find('p') is not a good way to get at the text I want to extract. However, that <i> tag is the only one in the document, so I thought I could just find the <i> and then go to its parent.
I've tried:
up = soup.select('p i').parent
and
up = soup.select('i')
print(up.parent)
I've also tried it with .parents, with find_all('i'), and with find('i'), but I always get:
'list' object has no attribute 'parent'
What am I doing wrong?
4 Answers
find_all() returns a list. find('i') returns the first matching element, or None.
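The difference in return types is easy to check directly. A minimal sketch, assuming Python 3 with bs4 installed and the built-in 'html.parser' backend; the extra <p>first</p> element is made up here to stand in for the other paragraphs in the document:

```python
from bs4 import BeautifulSoup

# The question's snippet, plus one made-up extra paragraph.
html = '<p>first</p><p>TEXT I WANT <i> – </i></p>'
soup = BeautifulSoup(html, 'html.parser')

matches = soup.find_all('i')   # list-like ResultSet, even for a single match
first = soup.find('i')         # a single Tag
missing = soup.find('b')       # None, because there is no <b> element

print(len(matches))            # number of <i> elements found
print(first.parent.name)       # name of the tag enclosing the <i>
print(missing is None)         # whether the lookup came back empty
```

Only the single Tag returned by find() has a .parent attribute; the list returned by find_all() does not, which is where the error in the question comes from.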
Thus, use:
try:
    up = soup.find('i').parent
except AttributeError:
    up = None  # no <i> element
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>TEXT I WANT <i> – </i></p>')
>>> soup.find('i').parent
<p>TEXT I WANT <i> – </i></p>
>>> soup.find('i').parent.text
u'TEXT I WANT \u2013 '

This works:
i_tag = soup.find('i')
my_text = str(i_tag.previous_sibling).strip()
output:
'TEXT I WANT'
As mentioned in other answers, find_all() returns a list, whereas find() returns the first match or None.
If you are unsure whether an <i> tag is present, you could simply use a try/except block.
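As an alternative to try/except, an explicit None check does the same job. This is only a sketch, assuming Python 3, bs4, and the 'html.parser' backend; text_before_i is a hypothetical helper name, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup, NavigableString


def text_before_i(html):
    """Return the stripped text just before the first <i>, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    i_tag = soup.find('i')
    if i_tag is None:                      # no <i> element at all
        return None
    sibling = i_tag.previous_sibling       # may be None, a string, or a Tag
    if not isinstance(sibling, NavigableString):
        return None                        # nothing textual before the <i>
    return str(sibling).strip()


print(text_before_i('<p>TEXT I WANT <i> – </i></p>'))
print(text_before_i('<p>no italics here</p>'))
```

The isinstance check matters because the node before the <i> could also be another tag, or there could be no preceding node at all, and in both cases str(...).strip() would return something misleading.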

Both select() and find_all() return a list of elements, so you should loop over the result, like this:
for el in soup.select('i'):
    print(el.parent.text)

soup.select() returns a Python list, so you have to unpack the result before using it, e.g.:
>>> [up] = soup.select('i')
>>> print(up.parent)
or
>>> up = soup.select('i')
>>> print(up[0].parent)
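Both forms above assume the selector matches something. A rough sketch of what happens when it does not, along with select_one(), which returns the first match or None; this assumes Python 3, a bs4 version that provides select_one(), and the 'html.parser' backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>TEXT I WANT <i> – </i></p>', 'html.parser')

# select_one() returns the first match for a CSS selector, or None.
up = soup.select_one('i')
parent_text = up.parent.get_text() if up is not None else None
print(repr(parent_text))

# Unpacking with [name] = ... requires exactly one match.
try:
    [only] = soup.select('b')      # no <b> elements, so nothing to unpack
except ValueError as exc:
    print('unpacking failed:', exc)

# Indexing an empty result fails with a different exception.
try:
    soup.select('b')[0]
except IndexError as exc:
    print('indexing failed:', exc)
```

The unpacking form is the stricter of the two: it also raises ValueError when there is more than one match, which can be useful if the document is supposed to contain exactly one <i>.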