Monday, December 16, 2013

Video Review: "The Death of the universe - Renée Hlozek"


Interesting animated video about the possible ways our universe could end... nice to watch and to learn a little more.

I like the theory about dark matter very much, but something tells me (call it a gut feeling) that in the end, a universe full of dark matter will collapse into itself, giving origin to a new big bang...


Other videos worth watching are the ones from Michio Kaku[1] and from the series "How the Universe Works"[2], documentaries from the Discovery Channel.
There are a lot of other videos and papers about it, and you can find even more scientific papers...

Explore!!

Monday, November 11, 2013

Web Scraping with Scrapy

How to extract data from websites


Months without posting anything, and then... BAM!!! Posting a lot of stuff. This is for you guys to see that I'm still doing stuff, but the time to put all of it into a blog post is limited, so let me take the opportunity given at the present time and write...
For the last few months I've been using the Scrapy framework a lot (those who follow me on twitter probably realized that), and this article is about exactly that... using the scrapy framework to extract relevant data.
Extracting data means that we want to take unordered info from one (or more than one) website, parse it and use it for our own purposes.
Wikipedia[1] has a good explanation of what I just said above:


“Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox.
Web scraping is closely related to web indexing, which indexes information on the web using a bot or web crawler and is a universal technique adopted by most search engines. In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to web automation, which simulates human browsing using computer software. Uses of web scraping include online price comparison, contact scraping, weather data monitoring, website change detection, research, web mashup and web data integration.” - from Wikipedia

Well, there are a lot of tools we can use to extract data from websites, but I find Scrapy a very good and easy one.

What’s Scrapy[2]?
It's a framework. A framework that will help you extract data, among other things. It's written in python[3], which means it's far from great!! hahahahaaha :D
Scrapy has a lot of companies[4] already using it, which makes the framework even more battle-tested; check the list on their website. One of them uses scrapy for Data Mining[5], that's the kind of stuff you could do, or perhaps just scrape the photos from a site you want!

How to use...
To use scrapy there's some basic stuff to do; for the simplicity of this article, I'm using GNU/Linux, in particular Linux Mint 15[6].
Since linux is great, Python comes pre-installed on the Mint distro, so there's no need to run any kind of installation procedure.
Scrapy is a third party framework, so we need to install it; I recommend using pip[7] to install python packages. If you don't know pip, take a few minutes to understand how it works and the wonderful things it can do for you.
To install pip on the system (if it isn't already installed) use synaptic to search for its package; after the installation, type the following in the shell to install Scrapy:

~ $ pip install scrapy

This way, the framework will be installed and ready to use. BeautifulSoup4 is another great tool to handle HTML; it can parse documents and access elements in an easy way, a good tool for doing some post-processing on the items that Scrapy collects and records in a database.
To install it type:

~ $ pip install beautifulsoup4
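
Just to give an idea of what I mean by post-processing with BeautifulSoup4, here is a tiny sketch (the HTML snippet is made up, only for illustration):

<code>
# Tiny BeautifulSoup4 example: pull the plain text out of a piece of HTML.
# The HTML snippet is made up, just to show the idea of post-processing.
from bs4 import BeautifulSoup

html = '<div class="itens"><h3>Some news title</h3><p class="data">01/01/2013</p></div>'
soup = BeautifulSoup(html)

print(soup.find('h3').get_text())                # Some news title
print(soup.find('p', class_='data').get_text())  # 01/01/2013
</code>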

As a quick example, I'll scrape the news from the website of PMC[9] (Prefeitura Municipal de Campinas).
A default scrapy project will be created to do the job; you can find info about this in the Scrapy documentation[10].
To start a new Scrapy project do:

~$ scrapy startproject noticias_pmc

A structure will be created in the folder where you execute the command (in this case a folder is created in the home directory of the logged-in user).

  • noticias_pmc/
    • scrapy.cfg
    • noticias_pmc/
      • __init__.py
      • items.py
      • pipelines.py
      • settings.py
      • spiders/
        • __init__.py

scrapy.cfg: the project configuration file
noticias_pmc/: the project’s python module, you’ll later import your code from here.
noticias_pmc/items.py: the project’s items file.
noticias_pmc/pipelines.py: the project’s pipelines file.
noticias_pmc/settings.py: the project’s settings file.
noticias_pmc/spiders/: a directory where you’ll later put your spiders.

The first step is to define the item structure (the information we want to extract and put to use). Open the file noticias_pmc/items.py:

<code>
from scrapy.item import Item, Field


class NoticiasPmcItem(Item):
    # Fields extracted from each news page (named to match the spider below)
    titulo = Field()
    data = Field()
    texto = Field()
    # Fields used by the images pipeline
    image_urls = Field()
    images = Field()
</code>

Done! Now let's build our spider! To do that, inside the folder noticias_pmc/spiders/ create a file named NewsPMCSpider.py. Notice that Scrapy uses the xpath[11] syntax to locate elements inside the parsed HTML; other libraries, such as the BeautifulSoup4 we installed, can use other means to access those elements.

<code>
# -*- coding: utf-8 -*-

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from noticias_pmc.items import NoticiasPmcItem
import urlparse


# Our spider class; CrawlSpider is the superclass (in python we do it this way)
class NewsPMCSpider(CrawlSpider):
    # The name of our spider
    name = 'noticias_pmc'
    # Allowed domains, we don't want the spider to read the entire web, do we??
    allowed_domains = ['campinas.sp.gov.br']
    # Which url the spider should start reading from
    start_urls = ['http://campinas.sp.gov.br/noticias.php']
    # Rules: which url formats we should read, the callback that will parse the
    # response, and follow=True to tell our spider to keep going to other urls!
    rules = (
        # Extract links and parse them with the spider's parse_item method
        Rule(SgmlLinkExtractor(allow=['http://campinas.sp.gov.br/noticias.php',
                                      'http://campinas.sp.gov.br/noticias-integra.php']),
             callback='parse_item', follow=True),
    )

    # This does all the work
    def parse_item(self, response):
        # Create a news item!
        item = NoticiasPmcItem()
        # Parse the response from the server, so we can access the elements
        hxs = HtmlXPathSelector(response)
        # Use xpath to find and get the elements; ah! we only want the string of the text here (no html tags!)
        titulo = hxs.select('//div[@class="itens"]/h3').select('string()').extract()
        # If there's a title, it may be a valid news page!
        if titulo:
            # Get the news date
            data = hxs.select('//div[@class="itens"]/p[@class="data"]').select('string()').extract()
            # The body text
            texto = hxs.select('//div[@class="itens"]/p[@align="justify"]').select('string()').extract()
            # Clean up
            item['titulo'] = titulo[0].strip()
            item['data'] = data[0].strip()
            item['texto'] = "".join(texto).strip()
            # Collect the image urls that scrapy will save automatically in the folder defined in settings.py
            item['image_urls'] = self.parse_imagens(response.url, hxs.select('//div[@id="sideRight"]/p/a/img'))
            return item

    def parse_imagens(self, url, imagens):
        image_urls = []
        for imagem in imagens:
            try:
                # Image path
                src = imagem.select('@src').extract()[0]
                # If it is a relative path we must prefix it with http://www.campinas.sp.gov.br
                if 'http' not in src:
                    src = urlparse.urljoin(url, src.strip())
                image_urls.append(src)
            except:
                pass
        return image_urls
</code>

Before running our spider, we must change two other files: settings.py and pipelines.py.
Add the following lines to settings.py (it doesn't matter where):

<code>
# Name of the class in the pipelines file that will handle the images
ITEM_PIPELINES = ['noticias_pmc.pipelines.MyImagesPipeline', ]
# The directory where the images will be stored
IMAGES_STORE = '<local path>/noticias_pmc/images'
</code>

And in pipelines.py paste the MyImagesPipeline class:

<code>
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request


class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        try:
            if item['image_urls']:
                for image_url in item['image_urls']:
                    yield Request(image_url)
        except:
            pass

    def item_completed(self, results, item, info):
        item['image_urls'] = [{'url': x['url'], 'path': x['path']} for ok, x in results if ok]
        return item
</code>

Done again! Let's now run our spider and see the results. When you run it, Scrapy will show you the urls it reads and the values it catches and puts into the item class for the pages found!
Inside the Scrapy project folder, type:

~$ scrapy crawl noticias_pmc
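
By the way, if I'm not mistaken, you can also ask Scrapy to dump the scraped items straight into a file using the feed exports, something like this (the file name here is just an example):

~$ scrapy crawl noticias_pmc -o noticias.json -t json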

Look!!! a spider on the web.... hahahahahahahhahahaha

If you have any problem using Scrapy, leave a message; if I can help, I will!!! There's much more in the Scrapy documentation, take a minute (or more than one) and read it!!!


[2] Scrapy: http://scrapy.org/
[4] Companies using Scrapy: http://scrapy.org/companies/
[6] Linux Mint: http://www.linuxmint.com/

Wednesday, November 6, 2013

"The Future of Programming"

"The Future of Programming"

This is more like a heads-up than a proper blog post.

Hello guys,
it's been some time since I've posted anything... Since I was moving out of my home to a new one and got stuck in my day job, I found it hard to write a big post.
Anyway, that's no excuse, I know...

As on most of my days, I was checking out the news and looking through older emails from the weekly lists I subscribe to when I saw this video among the links. I find this piece of audio and image very instructive; everybody who is a programmer, or wants to be one, should watch it and begin to rethink the way we do our things.

The video speaks for itself... watch it and get your own perspective.

Bret Victor - The Future of Programming [1].

Some key points that Bret makes in his talk would be very interesting to implement nowadays; those that impressed me are the following:
  • Grail System [2];
  • Actor Model [3];
  • SmallTalk Browser [4]:
    • Almost the same screen as in his presentation [5];
Another article that is very enlightening is this one; it talks about why arrays start at index 0 or index 1... I read it through Guido van Rossum's (the father of python) blog, check both!

Citation Needed [6].
Guido van Rossum blog [7].

Thanks

[1]: http://www.youtube.com/v/8pTEmbeENF4?autohide=1&version=3&attribution_tag=EVRJF1Slx80pSNhBlW8y5g&autoplay=1&feature=share&showinfo=1&autohide=1
[2]: http://c2.com/cgi/wiki?GrailSystem
[3]: http://en.wikipedia.org/wiki/Actor_model
[4]: http://en.wikipedia.org/wiki/Class_browser
[5]: http://seaside.st/about/screenshots?_k=x4bar8gA
[6]: http://exple.tive.org/blarg/2013/10/22/citation-needed/
[7]: http://python-history.blogspot.ca/2013/10/why-python-uses-0-based-indexing.html

Monday, August 19, 2013

Writing a django json property serializer


So, what's up? This will be my first blog post in English, and from now on that's the language I'll use here on this blog. The main reason is that most of the traffic coming here is from people outside my homeland (which is Brazil).

Anyway, this blog post is about the django framework; well, not exactly the framework itself, but a little part of it that I use daily.
But before the code, a little background showing why I did this...

Most of my time I use PHP as my main programming language to write web software (not that I like that, but anyway... I have to). Right at the beginning we (meaning me and my coworkers) started using Zend Framework to write our web apps, and we return JSON strings as part of the response. Simply because we used ExtJs together with it! Pretty slick at that time, but also pretty heavy (ok, not that much).
Since then I've been learning more and more about python and django, to the point that they became my main programming language (and framework, of course) for my hobbyist projects! :D

So, there came the need to write my own JsonSerializer, because django's default serializer doesn't serialize properties... for me, properties are a way to enhance the data viewed by a user inside a grid.
Nowadays I don't use ExtJs anymore (perhaps I'll use it in future projects), but I built my own jquery grid! (I'll put this code on github once I do some cleanup!) But this little grid jquery plugin still eats up JSON as its main format.

And that's why I changed the default JsonSerializer and made this PropertyJsonSerializer!
Anyway... here's the fucking code! And if you, dear reader, find any bugs in it... write them up in the comments or use the gist tools...

JsonPropertySerializer.py
https://gist.github.com/rdenadai/6095815
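
In case you just want the general idea without opening the gist, here is a minimal sketch of the approach (this is not the gist code; the model and the property name in the usage line are hypothetical):

<code>
# Minimal sketch: serialize Django model instances *including* properties.
# Not the gist code, just an illustration of the idea.
import json

from django.core.serializers.json import DjangoJSONEncoder
from django.forms.models import model_to_dict


def serialize_with_properties(queryset, properties=()):
    # Turn a queryset into a JSON string, adding the named properties.
    rows = []
    for obj in queryset:
        # model_to_dict gives us the regular model fields...
        row = model_to_dict(obj)
        # ...and here we add the computed @property values by hand,
        # which is exactly what the default serializer refuses to do.
        for name in properties:
            row[name] = getattr(obj, name)
        rows.append(row)
    # DjangoJSONEncoder knows how to handle dates, decimals, etc.
    return json.dumps(rows, cls=DjangoJSONEncoder)


# Usage (hypothetical model and property):
# json_data = serialize_with_properties(Person.objects.all(), properties=('full_name',))
</code>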

thanks a lot!

Monday, March 18, 2013

Raspberry Pi: build a home web server (BETTER WAY)!


Long time no see people!!! After practically 2 months without posting anything, here I am again!
As you can tell from the post title (BETTER WAY), it means I've learned a lot in these last 2 months! They were pretty hectic, with the end of the year and some trips. Anyway, let's go!!

Well, as you may remember from my last post, I was building a local (intranet) web server using the raspberry pi. I won't go over the whole introductory part of that post here; this time I'll go straight to the point!

My final stack so far:
  1. Raspberry Pi (of course!!);
  2. Arch Linux Arm (2013-01-22);
  3. dnsmasq (I truly love this one!!!);
  4. nginx;
  5. supervisor;
  6. gunicorn;
  7. python2;
    1. flask;
Alright, that's all of it... I'll show the steps to install these guys and get them working. Just remember: I take no responsibility for any damage caused by the installation and later use on your equipment.
I'm not taking security precautions into account; in my case, access to these devices goes through a router, which provides DNS and other features such as a firewall.

Installation

Well, the first step is to install the Arch Linux Arm image on the memory card that will go into the pi. This step was already explained in the previous tutorial (I used Arch Linux Arm there as well), so I won't dwell on it here.
After writing the image, connect the SD card to the pi and plug it into the power outlet. If the screen output looks small (cutting off the shell), then you need to change the pi's configuration file. The best option is to power off the pi, remove its SD card and connect it to the pc again.
Edit the config.txt file located at /boot/config.txt. I just enabled and kept changing the framebuffer and overscan values until they looked good on my monitor. The problem with changing this information this way is that every other display may need new settings.

Once that's done, the next step is to update the whole system. To log into Arch Linux for the first time, the user and password are root and root.
After logging in, run:
pacman -Syyu

This will update the whole system and the registered repositories.

Now let's install all the packages we're going to use! All of them!!! One line to rule them all!!! HAHAHAHAAHA

pacman -S sudo base-devel openssl pcre zlib libxml2 udev-automount dnsmasq nginx supervisor mongodb python2 python-pip2

Nice, huh?? But what did we just do/install???

The main ones are base-devel, which will install another ton of packages and libs, including the gcc compiler.
udev-automount will automagically mount the drives connected to the pi's usb port, making our lives easier when plugging in a usb stick as a storage unit.
And of course the rest of the stack. If you don't want to install one of the packages, just leave it out of the line listed above.

We still need to install flask!! To do that, run:
pip2 install flask Flask-Assets gunicorn pymongo mongoengine

We've just installed flask, Flask-Assets (which bundles all the .js files into a single javascript file, and the css too), gunicorn (our local WSGI server that runs the python code) and the python bindings for the mongodb server.

The next step is to give our raspberry pi a fixed IP address; the reason is that we want dnsmasq to be able to resolve something like mypi.lan.internal.server to our pi's IP and consequently reach our web site!
Since I'm using a router, I went into its configuration and added a fixed address for my raspberry pi (192.168.1.48).
Now we need to change 2 system files, both located in /etc/:

  • hosts
  • resolv.conf.head
In hosts (not hosts.conf!) just add:
192.168.1.48       mypi        localhost

Note that the address is just mypi; dnsmasq will append the rest of the name for us (we'll configure that over there).

In resolv.conf.head (create it if it doesn't exist!) we should add:
nameserver 192.168.1.48

And finally let's change the dnsmasq configuration file so that it works; in my dnsmasq.conf the lines were:

Line 62:
local=/lan.internal.server/
Line 99:
no-dhcp-interface=eth0
Line 119:
expand-hosts
Line 128:
domain=lan.internal.server

With these changes to the network and dnsmasq configuration files done, we need to change the nginx configuration. To configure a new vhost, look for the nginx file; it will be at /etc/nginx/nginx.conf.
Below is the configuration I've been using together with gunicorn. Pay attention only to the paths (you must put the path to your static files [/media/www/static], the path where your program's files are [/media/www/home] and, of course, the ip:port where gunicorn will run [in my case 192.168.0.25:8080]). The rest can be left as default... note that I haven't enabled gzip compression in nginx here yet (that will be left for another time).


server {
    listen       80;
    server_name  mypi.lan.internal.server;

    location ~ ^/(jpg|jpeg|gif|png|ico|css|zip|tgz|gz|rar|bz2|pdf|txt|tar|wav|bmp|rtf|js|flv|swf|images|flash|media|static)/ {
        root   /media/www/static/;
        expires 1d;
    }

    location / {
        #root   /media/www/home/;
        #index  index.html index.htm;
        try_files $uri @proxy_to_app;
    }

    location @proxy_to_app {
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_redirect off;
        proxy_pass http://192.168.0.25:8080;
    }
}


After that, it's time to configure supervisor to run our gunicorn process, which will load our flask app!!!

The nice thing about supervisor is that with the command echo_supervisord_conf > /etc/supervisord.conf you generate a default conf file! ;) pretty simple, isn't it?
However, that alone doesn't guarantee that gunicorn will run, so we must edit the generated conf file and add the following lines next to the explanation of how to run a program using supervisor. Look for [program:theprogramname] in the file and add, above or below it:


[program:gunicorn]
command=gunicorn app:app -b 192.168.0.25:8080
directory=/media/www/home
autostart=true
autorestart=true
redirect_stderr=True


Pay attention and use the same ip:port that you specified in nginx!! Another important detail is to put the name of the python file that represents the flask app. In my case, I kept the same pattern described in the flask docs, that is, a file named app.py with an app object, hence gunicorn app:app (a minimal sketch of such a file is shown below).
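
Just to make the app:app part concrete, here is a minimal sketch of what that app.py could look like; it's only an illustration, not the actual app I run on my pi:

<code>
# app.py - minimal Flask application, only to illustrate the "app:app" target
# used by gunicorn above (hypothetical sketch, not the real app on my pi).
from flask import Flask

app = Flask(__name__)


@app.route('/')
def index():
    return 'Hello from the Raspberry Pi!'


if __name__ == '__main__':
    # Run Flask's builtin server when called directly (without gunicorn).
    app.run(host='0.0.0.0', port=8080)
</code>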

Done! Now just bring up all the services and hope that everything is configured correctly!
systemctl enable dnsmasq
systemctl start dnsmasq
systemctl enable nginx
systemctl start nginx

systemctl enable supervisord
systemctl start supervisord


Well, open your url in the browser and you may notice that it doesn't load; don't forget to configure the primary and secondary DNS in your internet connection. This part is important in case your router (like mine) doesn't let you configure an internal DNS.


References: