~akspecs/numbeo-scraping

139069be — Rebecca Medrano 7 months ago master
spiders/wiki_data.py: adjust toDecimalCoordinate

	- remove duplicate function and adjust function to handle
          edge cases
	- json2sqlite: wrap latitude and longitude in quotes to fix
          error with null values
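
A minimal sketch of the kind of conversion a function named toDecimalCoordinate might perform, assuming the input is a degrees/minutes/seconds string as displayed on Wikipedia. The name `to_decimal_coordinate`, the pattern, and the choice to return None for unparseable input are assumptions for illustration, not the repository's actual code.

```python
import re

# Hypothetical pattern: degrees, optional minutes, optional seconds, hemisphere.
DMS_PATTERN = re.compile(
    r"""^\s*
    (\d+(?:\.\d+)?)°            # degrees
    (?:(\d+(?:\.\d+)?)[′'])?    # optional minutes
    (?:(\d+(?:\.\d+)?)[″"])?    # optional seconds
    \s*([NSEW])\s*$             # hemisphere letter
    """,
    re.VERBOSE,
)


def to_decimal_coordinate(text):
    """Convert a DMS string to signed decimal degrees, or None if unparseable."""
    if not text:
        return None  # edge case: missing value
    match = DMS_PATTERN.match(text)
    if match is None:
        return None  # edge case: unexpected format
    degrees, minutes, seconds, hemisphere = match.groups()
    value = float(degrees) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    # South and west are negative in decimal notation.
    return -value if hemisphere in "SW" else value
```

Returning None for missing input lets the database layer store NULL instead of failing on an empty string.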
bf0e1887 — Rebecca Medrano 7 months ago
spiders: update wiki_data location scraping

 - get coordinates directly from start_urls; this fixes an issue
   where data was getting mixed up when scraping concurrently
feaa9a36 — Rebecca Medrano 7 months ago
spiders: scrape location data from wikipedia

 - rename wiki_images.py to wiki_data.py and scrape location
   coordinates

 - update json2sqlite.py to include latitude and longitude columns
   in 'cities' table
ba00d52f — Rebecca Medrano 7 months ago
json2sqlite.py: update table columns

 - add contributors to quality_of_life and timeseries tables

 - give timestamp columns unique names to facilitate natural joins

 - spiders: qol.py and climate.py now import datetime.datetime
   instead of entire datetime module
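
Why unique timestamp names help natural joins can be sketched with sqlite3. NATURAL JOIN matches on every column name the two tables share, so a shared `timestamp` column would silently become part of the join condition. The table layouts, column names, and values below are made up for illustration and are not the repository's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared column name: NATURAL JOIN also matches on timestamp.
CREATE TABLE qol_shared (city_id INTEGER, qol_index REAL, timestamp TEXT);
CREATE TABLE climate_shared (city_id INTEGER, climate_index REAL, timestamp TEXT);
INSERT INTO qol_shared VALUES (1, 170.5, '2021-01-01');
INSERT INTO climate_shared VALUES (1, 90.2, '2021-02-01');

-- Unique column names: NATURAL JOIN matches on city_id only.
CREATE TABLE quality_of_life (city_id INTEGER, qol_index REAL, qol_timestamp TEXT);
CREATE TABLE climate (city_id INTEGER, climate_index REAL, climate_timestamp TEXT);
INSERT INTO quality_of_life VALUES (1, 170.5, '2021-01-01');
INSERT INTO climate VALUES (1, 90.2, '2021-02-01');
""")

# The scrape times differ, so the shared-name join finds no matching rows.
shared_rows = conn.execute(
    "SELECT city_id FROM qol_shared NATURAL JOIN climate_shared"
).fetchall()

# With unique timestamp names the rows join on city_id as intended.
unique_rows = conn.execute(
    "SELECT city_id, qol_index, climate_index "
    "FROM quality_of_life NATURAL JOIN climate"
).fetchall()
```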
707a32f0 — Rebecca Medrano 7 months ago
spiders: add number of contributors to qol.py

 - add minimum and maximum number of contributors from Numbeo
   Quality of Life page to qol spider
 - update json2sqlite.py to replace '?' with NULL when inserting
   into quality_of_life table
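
One way to replace '?' with NULL on insert is to map the marker to None before binding, since sqlite3 stores None as NULL. This is a sketch only; the helper `clean`, the column names, and the sample rows are hypothetical.

```python
import sqlite3


def clean(value):
    """Map the '?' marker to None, which sqlite3 stores as NULL."""
    return None if value == "?" else value


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quality_of_life "
    "(city TEXT, min_contributors INTEGER, max_contributors INTEGER)"
)

# Made-up sample rows; the second one has no contributor data.
rows = [("Lisbon", 120, 340), ("Smalltown", "?", "?")]
conn.executemany(
    "INSERT INTO quality_of_life VALUES (?, ?, ?)",
    [tuple(clean(v) for v in row) for row in rows],
)

missing = conn.execute(
    "SELECT city FROM quality_of_life WHERE min_contributors IS NULL"
).fetchall()
```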
647773df — Andrei Khartchenko 7 months ago
add LICENSE
45778d98 — Andrei Khartchenko 7 months ago
add README.md
8759d2e0 — Andrei Khartchenko 7 months ago
add requirements.txt
4a0cfc41 — Andrei Khartchenko 7 months ago
docs: add submitting_a_patchset.md
299b241c — Rebecca Medrano 7 months ago
spiders: add cost_of_living.py

 - scrapes data from the numbeo cost_of_living page
5961cdeb — Rebecca Medrano 7 months ago
spiders: add pollution.py

 - scrapes data from numbeo pollution page
24dcb50d — Rebecca Medrano 7 months ago
json2sqlite.py: add image_urls table

 - add image_urls table to the database to store urls to wikipedia
   images
 - add wiki_url column in cities table

spiders/climate.py: minor correction (class name)
01843ada — Rebecca Medrano 7 months ago
spiders: add wiki_images.py

- this spider scrapes image urls from the corresponding wikipedia urls
  for each city
9153d503 — Rebecca Medrano 7 months ago
spiders: add wiki_urls.py

- scrapes Wikipedia urls for each city in acoli.db
63bbb1e1 — Rebecca Medrano 7 months ago
json2sqlite: add qol timestamps, timeseries table

- add column in quality_of_life table for timestamps from qoli.json
- add time_series_quality_of_life table to keep timeseries data
- ensure quality_of_life table contains most recent records
6636b9f0 — Andrei Khartchenko 7 months ago
add scrape2db.sh

this commit contains scrape2db.sh, an all-in-one script that intends to:
 - scrape
 - create and/or update the database with the scraped data
8658c538 — Rebecca Medrano 7 months ago
json2sqlite.py: include timestamps in quality_of_life and climate tables
8eb59d6c — Rebecca Medrano 7 months ago
spiders: add timestamps to climate.py and qol.py

- add timestamps to climate.py and qol.py output
- the output of climate.py now includes rows with null values for cities
  without data
3b0a509d — Andrei Khartchenko 7 months ago
json2sqlite.py: update description, improve style

this commit does the following:
 - improve the wording of the description
 - add a TODO for argument parsing for finer grained control over the
   json -> sqlite db conversion / update process
 - use double quotes where single quotes may cause issues

additional notes:
 - using single quotes will break sql statements/queries where the name
   of what is being queried within the python f string has a single
   quote as well
 - perhaps there is a more robust way to do this, similar to the `?' dummy
   variable syntax
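
The more robust approach hinted at in the notes above can be sketched with sqlite3 parameter binding: the value is passed separately from the SQL text, so a quote inside it never reaches the parser. The table, column, and city name here are illustrative assumptions, not taken from json2sqlite.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city_name TEXT)")

name = "Xi'an"  # a value containing a single quote

# Bound parameter: the quote in the value is handled by sqlite3.
conn.execute("INSERT INTO cities (city_name) VALUES (?)", (name,))
found = conn.execute(
    "SELECT city_name FROM cities WHERE city_name = ?", (name,)
).fetchone()

# f-string with single quotes: the embedded quote ends the literal early.
try:
    conn.execute(f"SELECT city_name FROM cities WHERE city_name = '{name}'")
    fstring_failed = False
except sqlite3.OperationalError:
    fstring_failed = True
```

Placeholders only bind values. Table and column names still have to be written into the SQL text, so any name taken from external input should be checked against a known list first.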
f1ad6e7c — Andrei Khartchenko 7 months ago
json2sqlite.py: fix number of `?' in climate table

corrects the number of `?' dummy variables (was 25, need 37) when
creating/inserting into the climate table
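
A count mismatch like this can be avoided by deriving the placeholders from the column list instead of typing them by hand. This is a sketch under assumed names: the 37 column names below are hypothetical stand-ins for the real climate columns.

```python
import sqlite3

# Hypothetical column list with 37 entries.
columns = ["city_id"] + [f"col_{i}" for i in range(36)]

# One `?` per column, so the two counts cannot drift apart.
placeholders = ", ".join("?" for _ in columns)

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE climate ({', '.join(columns)})")
conn.execute(
    f"INSERT INTO climate VALUES ({placeholders})",
    [None] * len(columns),
)
row_count = conn.execute("SELECT COUNT(*) FROM climate").fetchone()
```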