Saturday, November 14, 2015

Web Scraping with Python; Part 3

More web scraping with Python and Beautiful Soup!

findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)

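The findAll and find functions are the two you will use most often when you're scraping a page. Both search through the HTML for tags, and their arguments let you filter the matches by tag name, attributes, and text. findAll returns every match (optionally capped by limit), while find returns only the first one, which is why it has no limit argument.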

Most of the time you will only need the tag and attribute arguments.
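To make the difference between the two concrete, here is a minimal sketch. The HTML snippet is made up for illustration, and "html.parser" is just the parser bundled with Python; any parser you have installed should behave the same way here. Current Beautiful Soup releases prefer the spelling find_all, but findAll still works.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = "<div><p>first</p><p>second</p><p>third</p></div>"
bsObj = BeautifulSoup(html, "html.parser")

first = bsObj.find("p")                  # a single tag, or None if nothing matches
everything = bsObj.findAll("p")          # a list of every matching tag
firstTwo = bsObj.findAll("p", limit=2)   # stop after two matches

print(first.get_text())
print([p.get_text() for p in everything])
print(len(firstTwo))
```

In other words, find behaves like findAll with a limit of 1, except that it hands back the tag itself rather than a list.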

The tag argument takes the name of a tag, or a collection of tag names, to search for. It could be headers (h1, h2, h3, etc.), paragraphs (p), list items (li), or anything else you want to find.

.findAll({"h1","h2","p","li","div","header"})
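Here is a small runnable sketch of searching for several tag names at once. The HTML is invented for the example, and I pass the names as a list, which is the form the Beautiful Soup documentation shows.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = "<h1>Title</h1><h2>Subtitle</h2><p>Body</p><span>ignored</span>"
bsObj = BeautifulSoup(html, "html.parser")

# Any tag whose name appears in the list is a match
found = bsObj.findAll(["h1", "h2", "p"])
names = [tag.name for tag in found]
print(names)
```

The matches come back in the order they appear in the page, not in the order you listed the tag names.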

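Here we use the attributes argument, a dictionary that maps an attribute name to the value (or list of values) you want to match. This returns every span whose class is blue or yellow:
.findAll("span", {"class":["blue", "yellow"]})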
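A Python dictionary can hold each key only once, so when you want to match several classes the alternatives go in a list under a single "class" key. The sketch below uses made-up HTML to show both points.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = (
    '<span class="blue">sky</span>'
    '<span class="yellow">sun</span>'
    '<span class="red">rose</span>'
    '<p class="blue">not a span</p>'
)
bsObj = BeautifulSoup(html, "html.parser")

# One key, a list of acceptable values
spans = bsObj.findAll("span", {"class": ["blue", "yellow"]})
texts = [s.get_text() for s in spans]
print(texts)

# Repeating a key in a dict literal keeps only the last value
repeated = {"class": "blue", "class": "yellow"}
print(repeated)
```

The paragraph is skipped even though its class is blue, because the tag argument already restricts the search to span tags.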

Let's look at the recursive argument now.
The recursive argument is a Boolean, meaning that it is either True or False. The argument itself is asking you "how deep would you like to look into the document to obtain the information you want?" If recursive is set to True, which is the default, findAll will look into children, children's children, and so on for the tags you want. If it is set to False, only the direct children of the tag you are searching from are checked. Sometimes you don't want to look that deep into a document because you only want the tags that sit immediately inside a particular parent.
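A quick sketch of the difference, using a made-up snippet with one paragraph directly inside a div and another nested one level deeper:

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = (
    '<div id="outer">'
    "<p>direct child</p>"
    "<section><p>grandchild</p></section>"
    "</div>"
)
bsObj = BeautifulSoup(html, "html.parser")
outer = bsObj.find("div", {"id": "outer"})

deep = outer.findAll("p")                      # recursive=True is the default
shallow = outer.findAll("p", recursive=False)  # direct children of the div only

print(len(deep))
print([p.get_text() for p in shallow])
```

Note that the search starts from the tag you call findAll on, so recursive=False is most useful when you have already narrowed down to a parent tag.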

Now let's look at the text argument. Unlike the others, it matches on the text inside tags rather than on the tags themselves. For instance, let's say we are scraping a page that holds a large body of text, such as a book. Perhaps we are scraping the Bible on a webpage and we want to find the places where a tag contains just the word 'Moses':

nameList = bsObj.findAll(text="Moses")
print(len(nameList))

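This returns every piece of text in the document that is exactly "Moses", so the count is the number of tags whose entire text is that one word. A longer sentence that merely contains the name is not matched.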
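The sketch below shows the exact-match behavior next to a regular expression, which is one way to catch the word inside longer strings as well. The HTML is made up, and the regex variant is my own addition rather than something from the original example. Newer Beautiful Soup releases call this argument string; text is the older name and still works.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = "<p>Moses</p><p>Moses went up the mountain.</p><b>Moses</b>"
bsObj = BeautifulSoup(html, "html.parser")

# Matches only strings that are exactly "Moses"
exact = bsObj.findAll(text="Moses")

# Matches any string that contains "Moses" somewhere
anywhere = bsObj.findAll(text=re.compile("Moses"))

print(len(exact))
print(len(anywhere))
```

Keep in mind that the results here are the strings themselves, not the tags around them.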

The keyword argument selects tags that have a specific attribute value. Here we select every tag whose id is "text":

allText = bsObj.findAll(id="text")
print(allText[0].get_text())
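Here is the same idea as a self-contained sketch you can run. The HTML is invented for the example; a keyword argument like id="text" should give the same result as passing {"id": "text"} in the attributes dictionary.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = '<div id="text">The main body.</div><div id="footer">Footer</div>'
bsObj = BeautifulSoup(html, "html.parser")

allText = bsObj.findAll(id="text")                # keyword form
sameThing = bsObj.findAll(attrs={"id": "text"})   # dictionary form

print(allText[0].get_text())
print(len(sameThing))
```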

Some additional notes on keyword arguments:
The class attribute is particularly clumsy because class is a reserved word in Python, so you cannot simply write this (it is a syntax error):

bsObj.findAll(class="green")

Instead, type either of the following to match on the class attribute:

bsObj.findAll(class_="green")
or
bsObj.findAll("", {"class":"green"})
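A runnable sketch of the two working forms, using made-up HTML. I pass the dictionary through the attrs keyword here, which is just another way of supplying the attributes argument.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = (
    '<span class="green">go</span>'
    '<span class="red">stop</span>'
    '<p class="green bold">also green</p>'
)
bsObj = BeautifulSoup(html, "html.parser")

byKeyword = bsObj.findAll(class_="green")           # trailing underscore avoids the reserved word
byDict = bsObj.findAll(attrs={"class": "green"})    # dictionary form

keywordTexts = [t.get_text() for t in byKeyword]
dictTexts = [t.get_text() for t in byDict]
print(keywordTexts)
print(dictTexts)
```

The paragraph matches too, because a tag can carry several classes and Beautiful Soup checks each one.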

That's all for now until next time!
