Monday, November 11, 2013

Web Scraping with Scrapy

How to extract data from websites


Months without posting anything, and then... BAM!!! A bunch of posts at once. This is just to show you guys that I'm still doing stuff, but the time to turn it all into blog posts is limited, so let me take the opportunity I have right now and write...
For the last few months I've been using the Scrapy framework a lot (those who follow me on Twitter have probably noticed), and this article is about exactly that: using the Scrapy framework to extract relevant data.
Extracting data means taking unstructured info from one website (or more than one), parsing it, and using it however we wish.
Wikipedia[1] has a good explanation of what I just said above:


“Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox.
Web scraping is closely related to web indexing, which indexes information on the web using a bot or web crawler and is a universal technique adopted by most search engines. In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to web automation, which simulates human browsing using computer software. Uses of web scraping include online price comparison, contact scraping, weather data monitoring, website change detection, research, web mashup and web data integration.” - from Wikipedia

Well, there are a lot of tools we can use to extract data from websites, but I find Scrapy a very good and easy one.

What's Scrapy[2]?
It's a framework. A framework that helps you extract data, among other things. It's written in Python[3], which can only mean it's great!! hahahahaaha :D
Scrapy already has a lot of companies[4] using it, which makes the framework even more battle-tested; check the list on their website. One of them uses Scrapy for Data Mining[5], which is the kind of thing you could do too, or perhaps just scrape the photos of a site you like!

How to use...
To use Scrapy there is some basic stuff to do first. For the sake of simplicity, in this article I'm using GNU/Linux, in particular Linux Mint 15[6].
Since Linux is great, Python comes pre-installed on the Mint distro, so there's no need to run any kind of installation procedure for it.
Scrapy is a third-party framework, so we need to install it. I recommend using pip[7] to install Python packages; if you don't know pip, take a few minutes to understand how it works and the wonderful things it can do for you.
To install pip on the system (if it isn't installed already), use Synaptic to search for its package. After the installation, type the following in the shell to install Scrapy:

~ $ pip install scrapy

This way, the framework will be installed and ready to use. BeautifulSoup4 is another great tool to handle HTML: it can parse documents and access elements in an easy way, which makes it a good tool for some post-processing on the items that Scrapy collects and records in a database.
To install it, type:

~ $ pip install beautifulsoup4
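
Just to give an idea of the kind of post-processing I mentioned, here is a minimal BeautifulSoup4 sketch; the HTML snippet is made up for the example and is not taken from any real page:

<code>
# Minimal BeautifulSoup4 sketch: parse a small HTML fragment and pull out two fields.
# The HTML below is an invented example, not a real page.
from bs4 import BeautifulSoup

html = '<div class="itens"><h3>Titulo da noticia</h3><p class="data">11/11/2013</p></div>'
soup = BeautifulSoup(html, 'html.parser')

titulo = soup.find('h3').get_text().strip()
data = soup.find('p', class_='data').get_text().strip()
print titulo, data
</code>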

As a quick example, I'll scrape the news from the website of PMC[9] (Prefeitura Municipal de Campinas).
A default Scrapy project will be created to do the job; you can find more about this in the Scrapy documentation[10].
To start a new Scrapy project, do:

~$ scrapy startproject noticias_pmc

A structure will be created in the folder where you execute the command (in this case, the folder is created in the home directory of the logged-in user).

  • noticias_pmc/
    • scrapy.cfg
    • noticias_pmc/
      • __init__.py
      • items.py
      • pipelines.py
      • settings.py
      • spiders/
        • __init__.py

scrapy.cfg: the project configuration file
noticias_pmc/: the project’s python module, you’ll later import your code from here.
noticias_pmc/items.py: the project’s items file.
noticias_pmc/pipelines.py: the project’s pipelines file.
noticias_pmc/settings.py: the project’s settings file.
noticias_pmc/spiders/: a directory where you’ll later put your spiders.

The first step is to define an item structure (the information we want to extract and put to use). Open the file noticias_pmc/items.py:

<code>
from scrapy.item import Item, Field

class NoticiasPmcItem(Item):
    # fields we want to extract from each news page
    titulo = Field()
    data = Field()
    texto = Field()
    # fields used by the images pipeline
    image_urls = Field()
    images = Field()
</code>

Done! Now let's build our spider! To do that, inside the folder noticias_pmc/spiders/ create a file named NewsPMCSpider.py. Notice that Scrapy uses the XPath[11] syntax to locate elements inside the parsed HTML; other libraries, like the BeautifulSoup4 we installed, can use other means to access those elements.

<code>
# -*- coding: utf-8 -*-

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from noticias_pmc.items import NoticiasPmcItem
import urlparse

# Our class; CrawlSpider is the superclass (in Python we do it this way)
class NewsPMCSpider(CrawlSpider):
    # the name of our spider
    name = 'noticias_pmc'
    # allowed domains, we don't want the spider to read the entire web, do we??
    allowed_domains = ['campinas.sp.gov.br']
    # which url we should start reading from (the news listing page)
    start_urls = ['http://campinas.sp.gov.br/noticias.php']
    # Rules: which url formats we should read, the callback that will parse the
    # response, and follow=True to tell our spider to keep going to other urls!
    rules = (
        # Extract links and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=['http://campinas.sp.gov.br/noticias.php',
                                      'http://campinas.sp.gov.br/noticias-integra.php']),
             callback='parse_item', follow=True),
    )

    # This does all the work
    def parse_item(self, response):
        # Create a news item!
        item = NoticiasPmcItem()
        # Parse the response of the server, so we can access the elements
        hxs = HtmlXPathSelector(response)
        # xpath to find and get the elements; ah! we want only the string of the text here (no html tags!)
        titulo = hxs.select('//div[@class="itens"]/h3').select('string()').extract()
        # If there's a title, it may be a valid news page!
        if titulo:
            # Get the news date
            data = hxs.select('//div[@class="itens"]/p[@class="data"]').select('string()').extract()
            # The body text
            texto = hxs.select('//div[@class="itens"]/p[@align="justify"]').select('string()').extract()
            # Clean up
            item['titulo'] = titulo[0].strip()
            item['data'] = data[0].strip()
            item['texto'] = "".join(texto).strip()
            # Collect the image urls that Scrapy will save automatically in the folder defined in settings.py
            item['image_urls'] = self.parse_imagens(response.url, hxs.select('//div[@id="sideRight"]/p/a/img'))
            return item

    def parse_imagens(self, url, imagens):
        image_urls = []
        for imagem in imagens:
            try:
                # Image path
                src = imagem.select('@src').extract()[0]
                # If it is a relative path we must join it with the page url (http://www.campinas.sp.gov.br)
                if 'http' not in src:
                    src = urlparse.urljoin(url, src.strip())
                image_urls.append(src)
            except IndexError:
                # img tag without a src attribute, just skip it
                pass
        return image_urls
</code>
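
By the way, if you want to test those XPath expressions before running the whole spider, the scrapy shell is handy; depending on your Scrapy version it already gives you an hxs selector for the downloaded page. A quick sketch, using the news listing page:

<code>
~ $ scrapy shell 'http://campinas.sp.gov.br/noticias.php'
>>> hxs.select('//div[@class="itens"]/h3').select('string()').extract()
</code>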

Before running our spider, we must change two other files: settings.py and pipelines.py.
Add the following lines to settings.py (it doesn't matter where):

<code>
# Name of the class in the pipelines file that will handle the images
ITEM_PIPELINES = ['noticias_pmc.pipelines.MyImagesPipeline', ]
# The directory where the images will be stored
IMAGES_STORE = '<local path>/noticias_pmc/images'
</code>

And in pipelines.py paste the MyImagesPipeline class:

<code>
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Schedule one download request for each image url found by the spider
        try:
            if item['image_urls']:
                for image_url in item['image_urls']:
                    yield Request(image_url)
        except KeyError:
            # item without image_urls, nothing to download
            pass

    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples; keep only the successful downloads,
        # storing the original url and the local path where the image was saved
        item['image_urls'] = [{'url': x['url'], 'path': x['path']} for ok, x in results if ok]
        return item
</code>
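
If you also want to record the items in a database, as I mentioned earlier, here is a minimal sketch of a pipeline using Python's sqlite3; the file name, table name and class name are made up for this example, and it would have to be added to ITEM_PIPELINES in settings.py as well:

<code>
import sqlite3

class SQLiteNoticiasPipeline(object):
    # Hypothetical pipeline: stores each news item in a local SQLite file.

    def open_spider(self, spider):
        self.conn = sqlite3.connect('noticias.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS noticias (titulo TEXT, data TEXT, texto TEXT)')

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute('INSERT INTO noticias VALUES (?, ?, ?)',
                          (item['titulo'], item['data'], item['texto']))
        return item
</code>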

Done again! Now let's run our spider and see the results. When you run it, Scrapy will show you the urls it reads and the values it catches and puts into the item class for each page found!
Inside the Scrapy project folder, type:

~$ scrapy crawl noticias_pmc
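
If you also want the items dumped to a file, the feed exports can write them out, for example as JSON (the file name is just an example):

~$ scrapy crawl noticias_pmc -o noticias.json -t json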

Look!!! A spider on the web.... hahahahahahahhahahaha

If you have any problem using Scrapy, leave a message; if I can help, I will!!! There's much more in the Scrapy documentation, take a minute (or more than one) and read it!!!


[2] Scrapy: http://scrapy.org/
[4] Companies using Scrapy: http://scrapy.org/companies/
[6] Linux Mint: http://www.linuxmint.com/
