Summary of crawling news

-- downloading, parsing, and more

Created by Junchao

Outline of today's talk

  • Downloading news: requests, requests and more requests
  • Parsing news: lxml vs BeautifulSoup
  • All in one: newspaper

Downloading news: requests, requests and more requests

  • What about User-Agent
  • What about Forms
  • What about Cookies
  • What about JavaScript
  • What about Speed
  • What about IP blocking

About User-Agent

  • curl and requests send a default User-Agent identifying the underlying tool or library
  • we do not want to get noticed when crawling
  • so we usually fake the User-Agent by customizing the headers
  • requests.get(url, headers={'User-Agent': 'Faked'})
  • use a random User-Agent per request? (sketch below)
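A minimal sketch of rotating a random User-Agent per request; the strings and the helper below are illustrative, not from the talk:

    import random
    import requests

    # a small illustrative pool; in practice keep a larger, up-to-date list
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12)',
        'Mozilla/5.0 (X11; Linux x86_64)',
    ]

    def get(url):
        # pick a different fake User-Agent for every request
        return requests.get(url, headers={'User-Agent': random.choice(USER_AGENTS)})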

About Forms

  • urlencoded or JSON or anything: it is just data (sketch below)
  • so the ultimate way is to view the requests the browser actually sends
  • sometimes need a bit of reverse engineering: read and debug the JavaScript code
  • Chrome Dev Tools is THE best friend, love it!
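Whatever the browser sends, requests can replay; a sketch with made-up endpoints and field names:

    import requests

    # a classic urlencoded <form> submission
    requests.post('https://example.com/login',
                  data={'username': 'me', 'password': 'secret'})

    # the same idea for a JSON AJAX call
    requests.post('https://example.com/api/search',
                  json={'query': 'python', 'page': 1})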

About Cookies

  • traditionally used for login
  • with form submission: CSRF tokens
  • the old way: pass a dict or a CookieJar by hand
  • but seriously, with requests.Session, why take the trouble yourself? (sketch below)
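A sketch of letting requests.Session do the cookie bookkeeping (URLs and fields are made up):

    import requests

    session = requests.Session()
    # cookies set by the login response are stored on the session...
    session.post('https://example.com/login',
                 data={'username': 'me', 'password': 'secret'})
    # ...and sent back automatically on every later request
    page = session.get('https://example.com/protected')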

About JavaScript

  • more and more websites use JavaScript
  • if it is only a simple AJAX request, view the raw request and reverse engineer it (sketch below)
  • if there is some data manipulation, post-process after downloading and parsing
  • if it is an SPA, reverse engineering can be hard -- possible, however
  • for SPAs, selenium and pyvirtualdisplay are an option
  • but I do not personally suggest them: unreliable in many ways
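For the simple-AJAX case, a sketch of skipping the browser entirely and hitting the JSON endpoint found in the Network tab (endpoint and fields are made up):

    import requests

    # endpoint discovered via Chrome Dev Tools' Network tab
    resp = requests.get('https://example.com/api/articles?page=1')
    for item in resp.json()['articles']:
        print(item['title'])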

About Speed: threading

  • Yes, Python has the GIL -- so what?
  • threads handle IO-intensive tasks like downloading just fine (sketch below)
  • detail: CPython releases the GIL while blocked in read/write (or socket recv/send)
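A sketch of concurrent downloads with plain threads (the URLs are placeholders):

    import threading
    import requests

    urls = ['https://example.com/a', 'https://example.com/b']
    results = {}

    def fetch(url):
        # CPython drops the GIL while blocked on the socket,
        # so the downloads overlap
        results[url] = requests.get(url).text

    threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()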

About Speed: multiprocessing

  • the Unix community dislikes threads and prefers processes -- The Art of Unix Programming
  • in Python's case, inter-process communication goes through pickling and unpickling: another layer of slowness
  • combined with threading, both CPU-intensive and IO-intensive tasks can be covered (sketch below)
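A sketch of pushing CPU-bound work (parsing, say) into a process pool; note where the pickling happens:

    from multiprocessing import Pool

    def parse(page):
        # stand-in for CPU-intensive work such as lxml parsing
        return len(page)

    if __name__ == '__main__':
        pages = ['<html>a</html>', '<html>bb</html>']
        with Pool(processes=4) as pool:
            # inputs and outputs cross the process boundary by (un)pickling
            lengths = pool.map(parse, pages)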

About Speed: concurrent.futures (3k only)

  • Borrowed from Java
  • a wrapper for user-friendliness
  • even more user-friendly than multiprocessing.Pool and threading (sketch below)
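The same download job with concurrent.futures; swap in ProcessPoolExecutor when the work is CPU-bound:

    import concurrent.futures
    import requests

    urls = ['https://example.com/a', 'https://example.com/b']

    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        pages = list(executor.map(requests.get, urls))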

About Speed: gevent

  • coroutine-based concurrency
  • it is like going back to the days before Windows 95: cooperative multitasking
  • involves monkey-patching, so potentially not safe
  • see it paired with requests in grequests (sketch below)
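A sketch of the gevent way; the monkey-patching must happen before requests is imported:

    from gevent import monkey
    monkey.patch_all()  # patches stdlib sockets -- this is the "not safe" part

    import gevent
    import requests

    urls = ['https://example.com/a', 'https://example.com/b']
    jobs = [gevent.spawn(requests.get, u) for u in urls]
    gevent.joinall(jobs)
    pages = [job.value for job in jobs]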

About Speed: asyncio (3k only)

  • inspired by gevent and twisted
  • even more complicated
  • and potentially even faster and more scalable: see sanic (sketch below)
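A minimal asyncio sketch, assuming the third-party aiohttp client for HTTP:

    import asyncio
    import aiohttp

    async def fetch(session, url):
        async with session.get(url) as resp:
            return await resp.text()

    async def main():
        urls = ['https://example.com/a', 'https://example.com/b']
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, u) for u in urls))

    pages = asyncio.get_event_loop().run_until_complete(main())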

About Speed: a quick benchmark

    Sequential: 5.255162239074707  7
    Asyncio:    1.60213303565979   7
    Thread:     1.849787950515747  7
    Concurrent: 1.7955994606018066 7

About IP blocking

  • use the cloud: spin up new servers (new IPs) every once in a while
  • Tor: restart the client daemon periodically to get a fresh exit IP (sketch below)
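A sketch of routing requests through a local Tor daemon (assumes Tor on its default SOCKS port 9050 and pip install requests[socks]):

    import requests

    # socks5h:// resolves DNS through the proxy as well
    proxies = {
        'http': 'socks5h://127.0.0.1:9050',
        'https': 'socks5h://127.0.0.1:9050',
    }
    resp = requests.get('https://example.com', proxies=proxies)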

Parsing news: lxml vs BeautifulSoup

  • lxml
  • BeautifulSoup4
  • Differences and Comments

lxml: xml and html parsing

  • based on the C libraries libxml2 and libxslt
  • Python is slow, but lxml does the work in C
  • known for being strict about HTML format
  • improves as libxml2 improves
  • still not good at encoding detection though
  • use BeautifulSoup for encoding detection (sketch below)
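A sketch of the combination: bs4's UnicodeDammit guesses the encoding, lxml does the parsing:

    import requests
    from bs4 import UnicodeDammit
    from lxml import html

    raw = requests.get('https://example.com').content
    dammit = UnicodeDammit(raw)                    # encoding detection from bs4
    tree = html.fromstring(dammit.unicode_markup)  # fast parsing from lxml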

lxml: xpath and cssselect

  • XPath is powerful, but a bit harder to read than a CSS query string
  • use cssselect: lxml uses this package internally (sketch below)
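The same query written both ways (the markup is a toy example):

    from lxml import html

    tree = html.fromstring('<div class="news"><a href="/a">A</a></div>')

    links_xpath = tree.xpath('//div[@class="news"]/a')
    links_css = tree.cssselect('div.news > a')  # needs the cssselect package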

lxml: pyquery

  • jQuery is convenient to use, and somebody brought its query style to lxml (sketch below)
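A tiny pyquery sketch on the same toy markup:

    from pyquery import PyQuery as pq

    doc = pq('<div class="news"><a href="/a">A</a></div>')
    # jQuery-style traversal on top of lxml
    print(doc('div.news a').attr('href'))  # /a
    print(doc('div.news a').text())        # A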

bs4: beautifulsoup4

  • first, bs4 and lxml can use each other
  • with the pure-Python html.parser backend, bs4 is regex-based and more tolerant of broken HTML
  • better at encoding detection
  • slower than lxml, but generally fine; however...
  • one thing I really do not like: its select() does not support all CSS features, not even nth-of-type (sketch below)
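A sketch of bs4 driving lxml as its parser, with a simple select():

    from bs4 import BeautifulSoup

    markup = '<div class="news"><a href="/a">A</a></div>'
    soup = BeautifulSoup(markup, 'lxml')   # bs4 API, lxml speed
    for a in soup.select('div.news a'):
        print(a['href'], a.get_text())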

All in one: newspaper

  • The predecessor: python-goose and the ancestor: goose
  • newspaper parse
  • newspaper build
  • newspaper nlp

python-goose and goose

  • goose: a Scala-based article extractor, no longer actively developed
  • python-goose: a Python clone of goose, also no longer actively developed

newspaper

  • borrowed a lot of code from python-goose
  • built on top of requests and lxml
  • uses Pillow for image extraction
  • uses python-dateutil for datetime parsing
  • uses nltk and jieba for NLP
  • uses feedfinder2 and feedparser for link discovery

newspaper parse

  • title, authors, meta_language, publish_date, text, video, images ...
  • what we can learn from it: title and text extraction
  • what we may not need at all: video, images
  • what is too general: authors and publish_date -- though we may not need these anyway (sketch below)
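The basic parse workflow (the URL is a placeholder):

    from newspaper import Article

    article = Article('https://example.com/some-news-story')
    article.download()
    article.parse()
    print(article.title)
    print(article.publish_date)
    print(article.text[:200])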

newspaper build and nlp

  • build: crawl and parse a whole news website (I did not read the code)
  • nlp: compute keywords and a summary (I did not read the code; sketch below)
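A sketch of build and nlp (nlp needs the nltk data downloaded first):

    import newspaper

    # build: discover article URLs across a whole news site
    paper = newspaper.build('https://example.com', memoize_articles=False)
    article = paper.articles[0]
    article.download()
    article.parse()

    # nlp: keywords and summary
    article.nlp()
    print(article.keywords)
    print(article.summary)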

newspaper -- conclusion

  • it targets general news websites and saves a lot of trouble
  • it cannot be tailored to specific websites
  • the summary part (simple NLP) could be useful later

Others: need further study

Thank you