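These are my notes so far from writing a crawler in Python. It uses two libraries, lxml and readability, plus a ready-made Python program for converting between Traditional and Simplified Chinese. I plan to bring in twisted later, when I get to optimization.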

Python basics

If the source code contains Chinese characters, add an encoding declaration at the very top of the file: #coding: utf-8
Three ways to import

```
from lxml import *
import lxml.html as H # alias
import urllib2, urllib
```

To use an external .py file, just put it in the same directory as your program. For example, if the file below is saved as externalTest.py, we can import externalTest and then call externalTest.test()

def test():
    print "hello world"
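The "same directory" rule works because the directory of the running script is on Python's module search path (`sys.path`). A minimal sketch of that mechanism, assuming a throwaway copy of externalTest.py written to a temporary directory (the temporary directory and the `return` instead of `print` are choices made for this illustration only):

```python
import os
import sys
import tempfile

# Write a throwaway externalTest.py so the example is self-contained.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "externalTest.py"), "w") as f:
    f.write("def test():\n    return 'hello world'\n")

# A script's own directory is already on sys.path; here the temporary
# directory is added by hand to get the same effect.
sys.path.insert(0, module_dir)

import externalTest

print(externalTest.test())
```

In a real project none of the setup is needed: with externalTest.py next to the main script, `import externalTest` is enough.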

Giving function parameters default values

```
def test(name="world"):
    print "hello %s" % (name)
```
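A small sketch of how a default value behaves at the call site. The function name `greet` and the sample arguments are made up for illustration, and it returns the string rather than printing it so the result can be inspected; the snippet runs on both Python 2 and Python 3:

```python
def greet(name="world"):
    # Return the greeting instead of printing it so callers can inspect it.
    return "hello %s" % name


print(greet())             # hello world   (default used)
print(greet("crawler"))    # hello crawler (positional argument)
print(greet(name="lxml"))  # hello lxml    (keyword argument)
```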

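Multi-line comments: wrap the block in three single quotes (''') at the start and at the end. Putting a # in front of both the opening and the closing ''' lets you switch the block between commented and uncommented quickly.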
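The toggle trick can be sketched as follows. The function `crawl` and its steps are hypothetical; the point is only that a block between bare ''' lines is skipped, while a block between #''' lines runs normally:

```python
def crawl():
    steps = ["fetch"]
    '''
    steps.append("debug dump")  # disabled: this line is inside the string
    '''
    #'''
    steps.append("parse")       # enabled: both quote lines are commented out
    #'''
    return steps


print(crawl())  # ['fetch', 'parse']
```

Adding or removing the two # characters is all it takes to flip a block on or off.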
Code to run by default (when the file is executed directly)

```
if __name__ == '__main__':
    print "hello"
```

<h2><span style="color: #3366ff;">lxml</span></h2>

from lxml import *
import lxml.html as H
import urllib2, urllib

quoted_query = urllib.quote("我是誰") # percent-encode the query for use in a URL
host = "https://www.google.com.tw/search?hl=zh-TW&q=%s" % (quoted_query) # a "#..." fragment is never sent to the server, so the query goes in the query string
req = urllib2.Request(host)

User_Agent = "Mozilla/5.0" # placeholder; the value originally used is not shown in these notes
req.add_header('User-Agent', User_Agent) # headers

response = urllib2.urlopen(req)
content = response.read()

doc = H.document_fromstring(content) # variables for the for-loop
resultList = doc.xpath('//*[@id="rso"]/li[1]/div/table')

for result in resultList:
    print result.text_content()
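The percent-encoding step can be checked on its own, without any network access. A minimal sketch, assuming the query belongs in the `q` parameter of Google's `/search` path; the helper name `build_search_url` is hypothetical, and the try/except import is only there so the snippet runs on Python 3 as well as the Python 2 used in these notes:

```python
# coding: utf-8
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2, as used in these notes


def build_search_url(query, base="https://www.google.com.tw/search"):
    """Return the search URL with the query percent-encoded (UTF-8)."""
    return "%s?hl=zh-TW&q=%s" % (base, quote(query))


url = build_search_url("我是誰")
print(url)
```

Each Chinese character becomes three `%XX` groups, one per UTF-8 byte, so the resulting URL is plain ASCII.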

<h2><span style="color: #3366ff;">Readability</span></h2>
Two Python packages need to be installed first:

easy_install readability-lxml

easy_install cssselect


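Sample code:

from readability.readability import Document
import urllib2

html = urllib2.urlopen(newsUrl).read() # newsUrl: URL of the article page to fetch
readable_article = Document(html).summary() # HTML of the extracted main content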

<h2><span style="color: #3366ff;">繁簡轉換</span></h2>
See this forum <a href="http://python.6.n6.nabble.com/-td2241030.html">post</a> and the project's <a href="http://pyswim.googlecode.com">Google Code site</a>. Download <a href="http://pyswim.googlecode.com/files/langconv-0.0.1dev.tgz">langconv</a>, extract it, and copy it into the same directory as the code that will use it. Sample code:

from langconv import *
c = Converter('zh-hant')
c.convert(u'汉字') # returns u'\u6f22\u5b57'
print c.convert(u'汉字') # 漢字
print c.convert(u'中文繁简转换') # 中文繁簡轉換
print Converter('zh-hans').convert(u'中文繁簡轉換') # 中文繁简转换


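Later I plan to use multi-threading and other techniques to speed up crawling large amounts of data. I'll add notes once that is done, and I'd welcome better suggestions from more experienced developers ^^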
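One possible shape for the multi-threaded fetching, offered only as a sketch and not as the approach these notes will end up with. It assumes Python 3, where `concurrent.futures` is in the standard library (on Python 2 it is only available as a separate backport). The names `fetch_all` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real download so the example runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, workers=4):
    """Call fetch(url) for every URL on a pool of worker threads.

    Results are returned in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Hypothetical stand-in for a real download such as urlopen(url).read().
    return "<html>%s</html>" % url


pages = fetch_all(["http://example.com/a", "http://example.com/b"], fake_fetch)
print(pages)
```

Threads suit this workload because a crawler spends most of its time waiting on the network; passing the download function in as `fetch` keeps the pool logic separate from the HTTP code.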