Python and Web Scraping (using Scrapy)

Certainly the most extensible scripting language I have ever used, Python lets you build powerful programs for tasks ranging from web crawling to text mining to machine learning. With packages such as NumPy and SciPy, Python can tackle complex modeling tasks, while packages such as BeautifulSoup and Scrapy support thorough data collection through web crawling and scraping.

In the Tableau Project below, I have provided an example (with code included on the second tab) of how web crawling and data collection work: it takes a snapshot of listings for my old motorcycle model and compares prices in two different markets. The data was scraped using Scrapy and exported to a CSV file, which I then imported into Tableau.


[su_heading]Here is the Spider code:[/su_heading]

import re

import scrapy

from craigslist_mcy.items import CraigslistMcyItem


class MySpider2(scrapy.Spider):
    name = "craigmcy2"
    allowed_domains = ["craigslist.org"]
    # Search result pages for "vulcan 900" in two markets (space encoded as %20).
    start_urls = [
        "http://minneapolis.craigslist.org/search/mca?query=vulcan%20900",
        "http://phoenix.craigslist.org/search/mca?query=vulcan%20900",
        "http://phoenix.craigslist.org/search/mca?query=vulcan%20900&s=100",
    ]

    def parse(self, response):
        # The page <title> text is the same for every listing on a results page.
        page_title = response.xpath("//title/text()").extract()

        # Each listing on the results page is a <p class="row"> element.
        rows = response.xpath("//p[@class='row']")
        items = []
        for row in rows:
            item = CraigslistMcyItem()
            item["title"] = row.xpath("span[@class='txt']/span[@class='pl']/a/text()").extract()
            item["link"] = row.xpath("span[@class='txt']/span[@class='pl']/a/@href").extract()
            item["postedDt"] = row.xpath("span[@class='txt']/span[@class='pl']/time/@datetime").extract()
            item["price"] = row.xpath("a[@class='i']/span[@class='price']/text()").extract()
            item["debug"] = ""  # blank for now; before, it was: row.xpath("a[@class='i']").extract()
            # Split the stringified page title on an "s" or a double quote followed by a space.
            item["location"] = re.split('[s"] ', str(page_title).strip())
            items.append(item)
        return items
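To make it easier to see what those XPath expressions pick out of each row, here is a minimal sketch that runs the same relative paths against a tiny page. It is only an illustration under a few assumptions: the markup in `SAMPLE_PAGE` is hypothetical and simplified (it is not the real Craigslist page), the helper `extract_rows` is a name I made up, and it uses the standard library's `xml.etree.ElementTree` in place of Scrapy's selectors, so it only works on well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup shaped to match the XPath expressions in the
# spider; the real page structure is not reproduced here.
SAMPLE_PAGE = """
<html>
  <head><title>minneapolis motorcycles "vulcan 900"</title></head>
  <body>
    <p class="row">
      <a class="i" href="/mcy/1.html"><span class="price">$4500</span></a>
      <span class="txt">
        <span class="pl">
          <time datetime="2014-05-01 10:15">May 1</time>
          <a href="/mcy/1.html">2008 Kawasaki Vulcan 900</a>
        </span>
      </span>
    </p>
    <p class="row">
      <a class="i" href="/mcy/2.html"></a>
      <span class="txt">
        <span class="pl">
          <time datetime="2014-05-02 09:00">May 2</time>
          <a href="/mcy/2.html">Vulcan 900 Classic LT</a>
        </span>
      </span>
    </p>
  </body>
</html>
"""


def extract_rows(page_source):
    """Return one dict per <p class="row"> element, using the spider's relative paths."""
    root = ET.fromstring(page_source.strip())
    rows = []
    for row in root.findall(".//p[@class='row']"):
        link = row.find("span[@class='txt']/span[@class='pl']/a")
        posted = row.find("span[@class='txt']/span[@class='pl']/time")
        price = row.find("a[@class='i']/span[@class='price']")
        rows.append({
            "title": link.text if link is not None else None,
            "link": link.get("href") if link is not None else None,
            "postedDt": posted.get("datetime") if posted is not None else None,
            # A listing without a price element yields None instead of failing.
            "price": price.text if price is not None else None,
        })
    return rows


if __name__ == "__main__":
    for listing in extract_rows(SAMPLE_PAGE):
        print(listing)
```

The second sample row has no price element on purpose: listings without a price are common, and the sketch returns `None` for them rather than raising an error.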

[su_heading]Items code:[/su_heading]

from scrapy.item import Item, Field

class CraigslistMcyItem(Item):
  title = Field()
  link = Field()
  postedDt = Field()
  price = Field()
  debug = Field()
  location = Field()

[su_heading]Run code (aka “Main”):[/su_heading]

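import os
import scrapy  # object-oriented framework for crawling and scraping

# Run this script from the Scrapy project directory.
# List the spiders in the project, then wait for a keypress ("pause" is a Windows shell command).
os.system('scrapy list & pause')
# Run the spider and export the scraped items to a CSV file.
os.system('scrapy crawl craigmcy2 -o craigslist_peter.csv')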
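Once the CSV exists, the price comparison itself can also be roughed out in Python before anything goes into Tableau. The following is a sketch under stated assumptions, not the analysis behind the Tableau project: it assumes the CSV columns match the item fields, that prices look like `$4500` (whole dollars), and that the market name appears somewhere in the `location` column. The helper names (`parse_price`, `market_of`, `median_price_by_market`) and the rows in `SAMPLE_CSV` are made up for illustration.

```python
import csv
import io
import re
from statistics import median

# The two markets that appear in the spider's start_urls.
MARKETS = ("minneapolis", "phoenix")


def parse_price(text):
    """Return the price as an int, or None if the field contains no digits.

    Assumes whole-dollar prices such as "$4500".
    """
    digits = re.sub(r"[^\d]", "", text or "")
    return int(digits) if digits else None


def market_of(location):
    """Return the first known market name found in the location text, else None."""
    lowered = (location or "").lower()
    for market in MARKETS:
        if market in lowered:
            return market
    return None


def median_price_by_market(csv_file):
    """Group priced listings by market and return the median price of each group."""
    prices = {}
    for row in csv.DictReader(csv_file):
        price = parse_price(row.get("price"))
        market = market_of(row.get("location"))
        # Skip listings with no price or no recognizable market.
        if price is not None and market is not None:
            prices.setdefault(market, []).append(price)
    return {market: median(values) for market, values in prices.items()}


# Made-up rows for illustration only; not data from the actual crawl.
SAMPLE_CSV = """title,link,postedDt,price,debug,location
Vulcan 900 Classic,/mcy/1.html,2014-05-01 10:15,$4500,,minneapolis motorcycles
Vulcan 900 Custom,/mcy/2.html,2014-05-02 09:00,$5500,,minneapolis motorcycles
Vulcan 900 LT,/mcy/3.html,2014-05-03 12:30,$6000,,phoenix motorcycles
Vulcan 900,/mcy/4.html,2014-05-04 08:45,,,phoenix motorcycles
"""

if __name__ == "__main__":
    print(median_price_by_market(io.StringIO(SAMPLE_CSV)))
```

To try it on the real export, open the file with `open("craigslist_peter.csv", newline="")` and pass the file object to `median_price_by_market`. I used the median rather than the mean so that a single oddly priced listing does not skew the comparison.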
