I'm here with Aral Balkan and we're working on scraping Twitter to get at functions that the Twitter API doesn't currently support. Aral just released TwitAPI, a PHP screen scraper based on regular expressions.
Aral's written some regular expressions to pull the data out of the direct messages. I'm doing it with Python's BeautifulSoup.
Here are the BeautifulSoup recipes ('n' is the BeautifulSoup instance, 'x' is the index to loop over).
User URL: n.findAll(True, {"class": "status_actions"})[x]\
.parent.contents[5].contents[1].contents[0]['href']
User Name: n.findAll(True, {"class": "status_actions"})[x]\
.parent.contents[5].contents[1].contents[0].contents
Comment: n.findAll(True, {"class": "status_actions"})[x]\
.parent.contents[5].contents[2].string.strip()
Fucked-up Twitter timecode: n.findAll(True, {"class": "status_actions"})[x]\
.parent.contents[5].contents[3].contents[1].string.strip()
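For what it's worth, here is a rough sketch of how the recipes above could be wrapped in a loop instead of indexing with x by hand. It is written against the newer bs4 package (at the time of writing the import would be `from BeautifulSoup import BeautifulSoup`, and `find_all` is bs4's name for `findAll`). The `scrape_statuses` function and the `SAMPLE` markup are made up for illustration; the sample only mimics the nesting the recipes expect and is not Twitter's real HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, NOT Twitter's real page: it just reproduces the
# nesting the recipes rely on (status_actions' parent, then contents[5]).
SAMPLE = (
    '<div>'
    '<i></i><i></i><i></i><i></i><i></i>'  # parent.contents[0..4]: filler
    '<span>'                               # parent.contents[5]
    '<b></b>'                              # contents[0]: filler
    '<strong><a href="http://twitter.com/alice">alice</a></strong>'  # contents[1]
    ' hello world '                        # contents[2]: the comment text
    '<span><b></b><em> 5 minutes ago </em></span>'  # contents[3]: timecode
    '</span>'
    '<div class="status_actions"></div>'
    '</div>'
)


def scrape_statuses(n):
    """Apply the four recipes to every status_actions element in soup n."""
    rows = []
    for actions in n.find_all(True, {"class": "status_actions"}):
        body = actions.parent.contents[5]
        link = body.contents[1].contents[0]
        rows.append({
            "user_url": link["href"],
            # The recipe returns link.contents (a list); here we take its text.
            "user_name": str(link.string),
            "comment": body.contents[2].string.strip(),
            "timecode": body.contents[3].contents[1].string.strip(),
        })
    return rows


# Two copies of the sample stand in for two messages on one page.
rows = scrape_statuses(BeautifulSoup(SAMPLE * 2, "html.parser"))
```

Looping over the findAll result once also avoids re-running the same search for every field of every message.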
Once I’ve figured out how to do HTTP Basic authentication using urllib2, the Twitter parser can be released unto the world!
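One way this might go, as a minimal sketch: HTTP Basic just means sending an `Authorization` header containing `Basic ` followed by the base64 of `username:password`. The code below is written against `urllib.request`, the Python 3 successor to urllib2 (urllib2's `Request` has the same `add_header` method). The `basic_auth_request` helper, the URL and the credentials are all placeholders, not anything Twitter-specific.

```python
import base64
import urllib.request


def basic_auth_request(url, username, password):
    """Build a request that carries an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token)
    return request


# Placeholder URL and credentials; nothing goes over the network
# until the request is handed to urllib.request.urlopen().
req = basic_auth_request(
    "http://example.com/direct_messages", "Aladdin", "open sesame"
)
```

Setting the header up front sends the credentials with the first request, rather than waiting for the server to answer with a 401 challenge first.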
Tags: barcamplondon2, screen scraping, beautifulsoup, python, aral balkan, twitter 