Commit f087b33

Merge pull request #30 from midwire/develop
Merge `develop`, fixing state abbreviation uniqueness
2 parents 70ba22f + 1309164 commit f087b33

15 files changed

Lines changed: 261 additions & 45 deletions

README.md

Lines changed: 3 additions & 5 deletions
@@ -6,11 +6,11 @@ This project is an automated solution for retrieving and collating US and worldw
 
 ## History
 
-In 2011, we originally pulled down all the US census data we could find, parsed it and exported it into 3 .csv files. Later, we wrote 3 rake tasks to automate this process.
+In 2011, we originally pulled down all the US census data we could find, parsed it and exported it into 3 .csv files.
 
 In 2017 we began using [GeoNames](http://www.geonames.org) data, which is licensed under Creative Commons. We are grateful to [GeoNames](http://www.geonames.org) for sharing, and urge you to [visit their site](http://www.geonames.org) and support their work.
 
-In 2018 we refactored the project and made it into a Ruby gem with a command-line executable for automating this process.
+In 2018 we refactored the project and made it into a Ruby gem with a unified command-line executable (`free_zipcode_data`) that handles downloading, processing, and database generation in a single step.
 
 ## What's Included
 
@@ -22,7 +22,7 @@ See the GeoNames [readme.txt](http://download.geonames.org/export/zip/readme.txt
 
 ## Usage
 
-First, you need to install Ruby and Rubygems. Though that is not a difficult task, it is beyond the scope of this README. A search engine of your choice will help discover how to do this. Once you have done that:
+First, you need to install Ruby 3.4+ and Rubygems. Though that is not a difficult task, it is beyond the scope of this README. A search engine of your choice will help discover how to do this. Once you have done that:
 
 ```bash
 $ gem install free_zipcode_data
@@ -61,8 +61,6 @@ $ free_zipcode_data --work-dir /tmp/work_dir --country US --generate-files
 $ free_zipcode_data --work-dir /tmp/work_dir --generate-files
 ```
 
-The rake tasks cascade, from the bottom up. So if you run `rake data:populate_db`, it will automatically call `rake data:build` if the .csv files are missing, which will call `rake data:download` if the .zip files are missing.
-
 ## SQLite3 Database
 
 The executable will generate an SQLite3 database in the specified directory `--work-dir` but it will not generate the `.csv` files by default. Specify `--generate-files` if you want those as well.

lib/free_zipcode_data/country_table.rb

Lines changed: 8 additions & 3 deletions
@@ -25,7 +25,10 @@ def build
 
     def write(row)
       country_hash = country_lookup_table[row[:country]]
-      return update_progress unless country_hash
+      unless country_hash
+        warn_once("Skipping unknown country '#{row[:country]}': not in country_lookup_table")
+        return update_progress
+      end
 
       sql = <<-SQL
         INSERT INTO countries (alpha2, alpha3, iso, name)
@@ -37,8 +40,10 @@ def write(row)
 
       begin
         database.execute(sql)
-      rescue SQLite3::ConstraintException
-        # Swallow duplicates
+      rescue SQLite3::ConstraintException => e
+        unless e.message.include?('UNIQUE')
+          raise "Please file an issue at #{ISSUE_URL}: [#{e}] -> SQL: [#{sql}]"
+        end
       rescue StandardError => e
         raise "Please file an issue at #{ISSUE_URL}: [#{e}] -> SQL: [#{sql}]"
       end
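
Reviewer note: `warn_once`, called in the hunk above, is defined in `lib/free_zipcode_data/db_table.rb` in this same commit. The sketch below shows the deduplication idea in isolation. The `OnceWarner` class and its `emitted` array are hypothetical stand-ins for the table class and the gem's logger, so treat this as an illustration under those assumptions, not the gem's code.

```ruby
# Hypothetical stand-in: records messages in an array instead of
# calling the gem's Logger, so the example runs on its own.
class OnceWarner
  attr_reader :emitted

  def initialize
    @emitted = []
    @warned_messages = {}
  end

  # Emit each distinct message at most once, keyed by the message text.
  def warn_once(message)
    return if @warned_messages[message]

    @emitted << message
    @warned_messages[message] = true
  end
end

warner = OnceWarner.new
3.times { warner.warn_once("Skipping unknown country 'XX'") }
warner.warn_once("Skipping unknown country 'YY'")
puts warner.emitted.length # prints 2
```

Because the key is the full message string, a file with thousands of rows for one unknown country produces a single warning, while each distinct country still gets its own line.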

lib/free_zipcode_data/county_table.rb

Lines changed: 12 additions & 4 deletions
@@ -26,8 +26,14 @@ def build
     def write(row)
       return nil unless row[:county]
 
-      state_id = get_state_id(row[:short_state], row[:state])
-      return nil unless state_id
+      state_id = get_state_id(row[:country], row[:short_state], row[:state])
+      unless state_id
+        logger.verbose(
+          "Skipping county '#{row[:county]}': no state found for " \
+          "abbr='#{row[:short_state]}', country='#{row[:country]}'"
+        )
+        return nil
+      end
 
       sql = <<-SQL
         INSERT INTO counties (state_id, abbr, name)
@@ -39,8 +45,10 @@ def write(row)
 
       begin
         database.execute(sql)
-      rescue SQLite3::ConstraintException
-        # swallow duplicates
+      rescue SQLite3::ConstraintException => e
+        unless e.message.include?('UNIQUE')
+          raise "Please file an issue at #{ISSUE_URL}: [#{e}] -> SQL: [#{sql}]"
+        end
       rescue StandardError => e
         raise "Please file an issue at #{ISSUE_URL}: [#{e}] -> SQL: [#{sql}]"
       end

lib/free_zipcode_data/db_table.rb

Lines changed: 48 additions & 3 deletions
@@ -24,6 +24,18 @@ def update_progress
 
     private
 
+    def logger
+      Logger.instance
+    end
+
+    def warn_once(message)
+      @warned_messages ||= {}
+      return if @warned_messages[message]
+
+      logger.warn(message)
+      @warned_messages[message] = true
+    end
+
     def country_lookup_table
       @country_lookup_table ||=
         begin
@@ -44,9 +56,42 @@ def get_country_id(country)
       select_first(sql)
     end
 
-    def get_state_id(state_abbr, state_name)
-      sql = "SELECT id FROM states
-        WHERE abbr = '#{state_abbr}' OR name = '#{escape_single_quotes(state_name)}'"
+    # Look up a state ID scoped to a country, trying progressively less specific
+    # criteria: (1) abbr + name + country, (2) abbr + country, (3) name + country.
+    # Returns nil if no match is found.
+    def get_state_id(country, state_abbr, state_name)
+      escaped_country = escape_single_quotes(country)
+      return nil if escaped_country.empty?
+
+      escaped_abbr = escape_single_quotes(state_abbr)
+      escaped_name = escape_single_quotes(state_name)
+      country_cond = "c.alpha2 = '#{escaped_country}'"
+      # Most specific lookup: abbr + name + country
+      res = find_state_where("s.abbr = '#{escaped_abbr}'", "s.name = '#{escaped_name}'", country_cond)
+      return res if res
+
+      # Fallback: abbr + country only
+      res = find_state_where("s.abbr = '#{escaped_abbr}'", country_cond)
+      if res
+        logger.verbose("State fallback: abbr '#{state_abbr}' + country '#{country}' (name mismatch)")
+        return res
+      end
+      # Fallback: name + country only
+      res = find_state_where("s.name = '#{escaped_name}'", country_cond)
+      if res
+        logger.verbose("State fallback: name '#{state_name}' + country '#{country}' (abbr mismatch)")
+        return res
+      end
+      logger.warn("State lookup failed: abbr='#{state_abbr}', name='#{state_name}', country='#{country}'")
+      nil
+    end
+
+    def find_state_where(*conditions)
+      sql = <<-SQL
+        SELECT s.id FROM states s
+        INNER JOIN countries c ON s.country_id = c.id
+        WHERE #{conditions.join(' AND ')}
+      SQL
       select_first(sql)
     end
 
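
Reviewer note: the lookup order described in the `get_state_id` comment (abbr + name, then abbr only, then name only, always within one country) can be sketched without a database. The version below runs against an in-memory array instead of SQLite. `State`, `STATES`, and `find_state_id` are hypothetical names introduced for illustration, and the sample rows echo the fixture added in this commit.

```ruby
# Hypothetical in-memory stand-in for the states/countries join.
State = Struct.new(:id, :abbr, :name, :country)

STATES = [
  State.new(1, 'NY', 'New York', 'US'),
  State.new(2, 'NY', 'Northern York', 'CA'),
  State.new(3, 'QC', 'Quebec', 'CA')
].freeze

# Try progressively less specific criteria, always scoped to the country.
def find_state_id(country, abbr, name, states = STATES)
  return nil if country.to_s.empty?

  in_country = states.select { |s| s.country == country }
  matchers = [
    ->(s) { s.abbr == abbr && s.name == name }, # 1. abbr + name
    ->(s) { s.abbr == abbr },                   # 2. abbr only
    ->(s) { s.name == name }                    # 3. name only
  ]
  matchers.each do |matcher|
    hit = in_country.find(&matcher)
    return hit.id if hit
  end
  nil
end

puts find_state_id('US', 'NY', 'New York')      # prints 1
puts find_state_id('CA', 'NY', 'Northern York') # prints 2
```

Scoping every step to the country is what keeps the two `NY` rows apart. The removed query matched `abbr` or `name` with no country filter, so it could return either row.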

lib/free_zipcode_data/state_table.rb

Lines changed: 34 additions & 5 deletions
@@ -17,22 +17,27 @@ def build
 
       ndx = <<-SQL
         CREATE UNIQUE INDEX "main"."unique_state"
-        ON #{tablename} (abbr, country_id COLLATE NOCASE ASC);
+        ON #{tablename} (abbr COLLATE NOCASE ASC, country_id);
       SQL
       database.execute_batch(ndx)
 
       ndx = <<-SQL
         CREATE UNIQUE INDEX "main"."state_name"
-        ON #{tablename} (name COLLATE NOCASE ASC);
+        ON #{tablename} (name COLLATE NOCASE ASC, country_id);
       SQL
       database.execute_batch(ndx)
     end
 
     def write(row)
-      return nil unless row[:short_state]
+      return nil unless synthesize_state(row)
 
       row[:state] = 'Marshall Islands' if row[:short_state] == 'MH' && row[:state].nil?
       country_id = get_country_id(row[:country])
+      unless country_id
+        warn_once("Country '#{row[:country]}' not found in countries table, skipping state")
+        return nil
+      end
+
       sql = <<-SQL
         INSERT INTO states (abbr, name, country_id)
         VALUES ('#{row[:short_state]}',
@@ -42,13 +47,37 @@ def write(row)
       SQL
       begin
         database.execute(sql)
-      rescue SQLite3::ConstraintException
-        # Swallow duplicates
+      rescue SQLite3::ConstraintException => e
+        unless e.message.include?('UNIQUE')
+          raise "Please file an issue at #{ISSUE_URL}: [#{e}] -> SQL: [#{sql}]"
+        end
       rescue StandardError => e
         raise "Please file an issue at #{ISSUE_URL}: [#{e}] -> SQL: [#{sql}]"
       end
 
       update_progress
     end
+
+    private
+
+    # Synthesize state from country for stateless countries.
+    # Mutates the row hash so downstream Kiba destinations (CountyTable, ZipcodeTable)
+    # see the synthesized short_state and state values.
+    def synthesize_state(row)
+      if row[:short_state].nil? || row[:short_state] == ''
+        country_entry = country_lookup_table[row[:country]]
+        unless country_entry
+          warn_once(
+            "Cannot synthesize state for country '#{row[:country]}': " \
+            'not in country_lookup_table'
+          )
+          return false
+        end
+
+        row[:short_state] = row[:country]
+        row[:state] = country_entry[:name]
+      end
+      row[:short_state]
+    end
   end
 end
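
Reviewer note: the effect of `synthesize_state` on a row is easier to see with concrete input. The sketch below reproduces the same logic as a standalone function. `COUNTRY_LOOKUP` is a hypothetical two-entry table standing in for `country_lookup_table`, and the warning call is omitted.

```ruby
# Hypothetical lookup table; the real one is built by the gem.
COUNTRY_LOOKUP = {
  'SG' => { name: 'Singapore' },
  'US' => { name: 'United States' }
}.freeze

# Fill in short_state/state from the country when the row has no state.
# Returns the (possibly synthesized) abbreviation, or false if the
# country is unknown. Mutates the row in place.
def synthesize_state(row, lookup = COUNTRY_LOOKUP)
  if row[:short_state].nil? || row[:short_state] == ''
    entry = lookup[row[:country]]
    return false unless entry

    row[:short_state] = row[:country]
    row[:state] = entry[:name]
  end
  row[:short_state]
end

row = { country: 'SG', short_state: nil, state: nil }
synthesize_state(row)
p row[:short_state] # prints "SG"
p row[:state]       # prints "Singapore"
```

Rows that already carry an abbreviation pass through unchanged, so the synthesis only affects countries whose source data has no state column filled in.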

lib/free_zipcode_data/zipcode_table.rb

Lines changed: 13 additions & 3 deletions
@@ -29,7 +29,15 @@ def build
     def write(row)
       return nil unless row[:postal_code]
 
-      state_id = get_state_id(row[:short_state], row[:state])
+      state_id = get_state_id(row[:country], row[:short_state], row[:state])
+      unless state_id
+        logger.verbose(
+          "Skipping zipcode '#{row[:postal_code]}': no state found for " \
+          "abbr='#{row[:short_state]}', country='#{row[:country]}'"
+        )
+        return nil
+      end
+
       city_name = escape_single_quotes(row[:city])
 
       sql = <<-SQL
@@ -45,8 +53,10 @@ def write(row)
 
       begin
         database.execute(sql)
-      rescue SQLite3::ConstraintException => _e
-        # there are some duplicates - swallow them
+      rescue SQLite3::ConstraintException => e
+        unless e.message.include?('UNIQUE')
+          raise "Please file an issue at #{ISSUE_URL}: [#{e}] -> SQL: [#{sql}]"
+        end
       rescue StandardError => e
         raise "Please file an issue at #{ISSUE_URL}: [#{e}] -> SQL: [#{sql}]"
       end

spec/etl/csv_source_spec.rb

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@
       rows = []
       source.each { |row| rows << row }
 
-      expect(rows.length).to eq(5)
+      expect(rows.length).to eq(6)
       expect(rows.first).to be_a(Hash)
       expect(rows.first.keys).to include(:country, :postal_code, :city)
     end

spec/etl/free_zipcode_data_job_spec.rb

Lines changed: 40 additions & 1 deletion
@@ -4,7 +4,7 @@
 require 'etl/free_zipcode_data_job'
 
 RSpec.describe ETL::FreeZipcodeDataJob do
-  let(:db) { create_test_database(line_count: 5) }
+  let(:db) { create_test_database(line_count: 6) }
   let(:fixture_csv) { File.join(FreeZipcodeData.root, 'spec', 'fixtures', 'test_data.csv') }
   let(:logger) { FreeZipcodeData::Logger.instance }
   let(:string_io) { StringIO.new }
@@ -92,5 +92,44 @@
       expect(lat).to be_within(0.01).of(40.7484)
       expect(lon).to be_within(0.01).of(-73.9967)
     end
+
+    it 'scopes duplicate state abbreviations by country' do
+      us_ny = db.execute(<<-SQL)
+        SELECT s.id, s.name, c.alpha2
+        FROM states s
+        JOIN countries c ON s.country_id = c.id
+        WHERE s.abbr = 'NY' AND c.alpha2 = 'US'
+      SQL
+      ca_ny = db.execute(<<-SQL)
+        SELECT s.id, s.name, c.alpha2
+        FROM states s
+        JOIN countries c ON s.country_id = c.id
+        WHERE s.abbr = 'NY' AND c.alpha2 = 'CA'
+      SQL
+      expect(us_ny.length).to eq(1)
+      expect(ca_ny.length).to eq(1)
+      expect(us_ny[0][0]).not_to eq(ca_ny[0][0])
+      expect(us_ny[0][1]).to eq('New York')
+      expect(ca_ny[0][1]).to eq('Northern York')
+    end
+
+    it 'links cross-country zipcodes to the correct state' do
+      us_zip = db.execute(<<-SQL)
+        SELECT z.code, s.name, c.alpha2
+        FROM zipcodes z
+        JOIN states s ON CAST(z.state_id AS INTEGER) = s.id
+        JOIN countries c ON s.country_id = c.id
+        WHERE z.code = '10001'
+      SQL
+      ca_zip = db.execute(<<-SQL)
+        SELECT z.code, s.name, c.alpha2
+        FROM zipcodes z
+        JOIN states s ON CAST(z.state_id AS INTEGER) = s.id
+        JOIN countries c ON s.country_id = c.id
+        WHERE z.code = 'K0A'
+      SQL
+      expect(us_zip[0]).to eq(['10001', 'New York', 'US'])
+      expect(ca_zip[0]).to eq(['K0A', 'Northern York', 'CA'])
+    end
   end
 end

spec/fixtures/test_data.csv

Lines changed: 1 addition & 0 deletions
@@ -3,4 +3,5 @@ US,10001,New York,New York,NY,New York,061,Manhattan,MN,40.7484,-73.9967,4
 US,90210,Beverly Hills,California,CA,Los Angeles,037,,LA,34.0901,-118.4065,4
 US,60601,Chicago,Illinois,IL,Cook,031,,CK,41.8819,-87.6278,4
 CA,H2X,Montreal,Quebec,QC,,,Montreal,,45.5088,-73.5878,4
+CA,K0A,Almonte,Northern York,NY,Lanark,LNK,,,45.2260,-76.1840,4
 GB,SW1A,London,England,ENG,Westminster,,City of Westminster,,51.5014,-0.1419,1

spec/free_zipcode_data/county_table_spec.rb

Lines changed: 13 additions & 7 deletions
@@ -32,45 +32,51 @@
 
   describe '#write' do
     it 'inserts a county row' do
-      table.write({ county: 'Cook', short_county: '031', short_state: 'IL', state: 'Illinois' })
+      table.write({ country: 'US', county: 'Cook', short_county: '031', short_state: 'IL',
+                    state: 'Illinois' })
       rows = db.execute('SELECT name, abbr FROM counties')
       expect(rows.length).to eq(1)
       expect(rows[0]).to eq(%w[Cook 031])
     end
 
     it 'links the county to its state' do
-      table.write({ county: 'Cook', short_county: '031', short_state: 'IL', state: 'Illinois' })
+      table.write({ country: 'US', county: 'Cook', short_county: '031', short_state: 'IL',
+                    state: 'Illinois' })
       state_id = db.execute("SELECT id FROM states WHERE abbr = 'IL'")[0][0]
       county_state_id = db.execute('SELECT state_id FROM counties')[0][0]
       expect(county_state_id).to eq(state_id)
     end
 
     it 'returns nil and skips when county is nil' do
-      result = table.write({ county: nil, short_county: nil, short_state: 'IL', state: 'Illinois' })
+      result = table.write({ country: 'US', county: nil, short_county: nil, short_state: 'IL',
+                             state: 'Illinois' })
       expect(result).to be_nil
       rows = db.execute('SELECT COUNT(*) FROM counties')
       expect(rows[0][0]).to eq(0)
     end
 
     it 'returns nil when state cannot be found' do
-      result = table.write({ county: 'Unknown', short_county: '999', short_state: 'ZZ',
+      result = table.write({ country: 'US', county: 'Unknown', short_county: '999', short_state: 'ZZ',
                              state: 'Nonexistent' })
       expect(result).to be_nil
       rows = db.execute('SELECT COUNT(*) FROM counties')
       expect(rows[0][0]).to eq(0)
     end
 
     it 'silently ignores duplicate county entries' do
-      table.write({ county: 'Cook', short_county: '031', short_state: 'IL', state: 'Illinois' })
+      table.write({ country: 'US', county: 'Cook', short_county: '031', short_state: 'IL',
+                    state: 'Illinois' })
       expect do
-        table.write({ county: 'Cook', short_county: '031', short_state: 'IL', state: 'Illinois' })
+        table.write({ country: 'US', county: 'Cook', short_county: '031', short_state: 'IL',
+                      state: 'Illinois' })
       end.not_to raise_error
       rows = db.execute('SELECT COUNT(*) FROM counties')
       expect(rows[0][0]).to eq(1)
     end
 
     it 'handles county names with single quotes' do
-      table.write({ county: "Prince George's", short_county: '033', short_state: 'NY', state: 'New York' })
+      table.write({ country: 'US', county: "Prince George's", short_county: '033', short_state: 'NY',
+                    state: 'New York' })
       rows = db.execute('SELECT name FROM counties')
       expect(rows[0][0]).to eq("Prince George's")
     end
