Map Area
Las Vegas, Nevada, USA
Las Vegas is America's casino city and entertainment capital. I had long heard of it and wanted to visit, so when it came time to pick a map for this project, I chose it.
I. Problems Encountered in the Map

The following problems were found in the data:

- Abbreviated and incorrect street names;
- Incorrect postal codes.

Import the necessary libraries and generate sample.osm, a file holding a 10% sample of the original:
```python
import xml.etree.cElementTree as ET
import pprint
import re
from collections import defaultdict
import csv
import cerberus
import schema
import codecs
import string
```
```python
OSM_FILE = "las-vegas_nevada.osm"
SAMPLE_FILE = "sample.osm"
k = 10  # keep every k-th top-level element (~10% of the file)

def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag

    Reference:
    http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
    """
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()

with open(SAMPLE_FILE, 'wb') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n ')
    for i, element in enumerate(get_element(OSM_FILE)):
        if i % k == 0:
            output.write(ET.tostring(element, encoding='utf-8'))
    output.write('</osm>')
```
Count how many of each tag appear in the data:
```python
file_name = 'las-vegas_nevada.osm'

def count_tags(filename):
    tags = {}
    for _, elem in ET.iterparse(filename):
        if elem.tag in tags:
            tags[elem.tag] += 1
        else:
            tags[elem.tag] = 1
    return tags

def test():
    tags = count_tags('las-vegas_nevada.osm')
    pprint.pprint(tags)

if __name__ == "__main__":
    test()
```
{'bounds': 1,
'member': 4333,
'nd': 1241838,
'node': 1048101,
'osm': 1,
'relation': 556,
'tag': 489437,
'way': 112116}
```python
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'" \?%#$@\,\.\t\r\n]')

def key_type(element, keys):
    if element.tag == "tag":
        for tag in element.iter("tag"):
            if lower.search(tag.attrib['k']):
                keys['lower'] += 1
            elif lower_colon.search(tag.attrib['k']):
                keys['lower_colon'] += 1
            elif problemchars.search(tag.attrib['k']):
                keys["problemchars"] += 1
            else:
                keys["other"] += 1
    return keys

def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
    return keys

keys = process_map(file_name)
pprint.pprint(keys)
```
{'lower': 316935, 'lower_colon': 165174, 'other': 7327, 'problemchars': 1}
We wrote two functions, key_type() and process_map(), which use regular-expression matching to classify the tag keys. The keys in our dataset fall into four groups: lower, lower_colon, problemchars, and other.
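To illustrate how the three regexes partition the keys, here is a small sketch; the sample keys are hypothetical examples, not necessarily drawn from this dataset:

```python
# Hypothetical keys classified by the same regexes used in key_type():
for k in ['highway', 'addr:street', 'service area', 'tiger:county_FIPS']:
    if lower.search(k):
        print k, '-> lower'           # all lowercase, no colon
    elif lower_colon.search(k):
        print k, '-> lower_colon'     # lowercase parts joined by a colon
    elif problemchars.search(k):
        print k, '-> problemchars'    # contains a space or other bad character
    else:
        print k, '-> other'           # anything else, e.g. mixed case
```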
1. Abbreviated and incorrect street names

Some street names are abbreviated, e.g. "Marco St"; these are updated to the full spelling.
Some street names are simply incorrect, e.g. "S Maryland Parkway Suite A-5-262"; these are corrected.

First, print all of the street names to see every street-type variant that appears:
```python
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected = ["Avenue", "Commons", "Court", "Drive", "Lane", "Parkway",
            "Place", "Road", "Square", "Street", "Trail"]

def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

def audit(osmfile):
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    return street_types

sort_street_types = audit(file_name)
pprint.pprint(dict(sort_street_types))
```
{'1': set(['Spanish Ridge Ave., Suite 1', 'West Gary Avenue #1']),
'10': set(['S. Valley View Blvd. Ste 10']),
'100': set(['S Eastern Ave #100']),
'103': set(['South Pecos Road Suite 103']),
'106': set(['S Grand Canyon Dr #106']),
'107': set(['S Decatur Blvd #107']),
'11': set(['Chandler Ave #11']),
'110': set(['South Jones Boulevard Suite 110']),
'111-559': set(['8465 W Sahara Avenue Suite 111-559']),
'115': set(['North Hualapai Way #115']),
'1170': set(['Fremont Street Ste. 1170']),
'120': set(['E SILVERADO RANCH BLVD Suite 120', 'Silverado Ranch Blvd #120']),
'15': set(['E Sahara Blvd #15']),
'150': set(['West Horizon Ridge Parkway, STE 150']),
'16': set(['Polaris Ave #16']),
'170': set(['Sky Pointe Dr #170']),
'207': set(['E Sahara Ave #207']),
'275': set(['Camino Al Norte #275']),
'2D': set(['Spring Mountain Rd #2D']),
'3230': set(['W Sahara Ave #3230']),
'500': set(['Howard Hughes Parkway, Suite 500']),
'705': set(['West Ali Baba Lane #705']),
'790': set(['Howard Hughes Pkwy #790']),
'93': set(['Highway 93']),
'A': set(['E Tropicana Ave #A', 'West Post Road A']),
'A-5-262': set(['S Maryland Parkway Suite A-5-262']),
'AVE': set(['W PACIFIC AVE']),
'Alicante': set(['Via Alicante']),
'Apache': set(['S Fort Apache', 'S. Fort Apache']),
'Ave': set(['4760 S Polaris Ave',
'E Cheyenne Ave',
'E Sahara Ave',
'East Tropicana Ave',
'S Eastern Ave',
'S. Eastern Ave',
'W Cactus Ave',
'W Sahara Ave',
'W Twain Ave',
'W Washington Ave',
'W. Sahara Ave']),
'Ave.': set(['200 Hoover Ave.',
'6601 W. Twain Ave.',
'East Twain Ave.',
'Glendale Ave.',
'Hoover Ave.',
'W. Arby Ave.',
'West Sahara Ave.']),
'B': set(['Avenue B', 'West Sahara Avenue #B']),
'B12': set(['W Sahara Ave #B12']),
'Blvd': set(['Adams Blvd',
'Buchanan Blvd',
'Carmen Blvd',
'S Casino Center Blvd',
'S Rainbow Blvd',
'S Valley View Blvd',
'Thomas Ryan Blvd',
'W Charleston Blvd']),
'Blvd.': set(['E. Flamingo Blvd.',
'N. Las Vegas Blvd.',
'S Rainbow Blvd.',
'West Lake Mead Blvd.']),
'Bonneville': set(['Bonneville']),
'Boulevard': set(['Adams Boulevard',
'Centennial Center Boulevard',
'Decatur Boulevard',
'Del Webb Boulevard',
'E. Charleston Boulevard',
'East Charleston Boulevard',
'East Lake Mead Boulevard',
'East Silverado Ranch Boulevard',
'Lamb Boulevard',
'Las Vegas Boulevard',
'Montelago Boulevard',
'N Casino Center Boulevard',
'North Arroyo Grande Boulevard',
'North Decatur Boulevard',
'North Jones Boulevard',
'North Lamb Boulevard',
'North Las Vegas Boulevard',
'North Nellis Boulevard',
'North Rainbow Boulevard',
'North Rampart Boulevard',
'Nu Wav Kaiv Boulevard',
'Rainbow Boulevard',
'S Las Vegas Boulevard',
'South Casino Center Boulevard',
'South Decatur Boulevard',
'South Hollywood Boulevard',
'South Jones Boulevard',
'South Las Vegas Boulevard',
'South Martin L King Boulevard',
'South Moapa Valley Boulevard',
'South Nellis Boulevard',
'South Rainbow Boulevard',
'South Rampart Boulevard',
'South Valley View Boulevard',
'Sun City Boulevard',
'Valley View Boulevard',
'Wayne Newton Boulevard',
'West Charleston Boulevard',
'West Lake Mead Boulevard',
'West Oakey Boulevard']),
'Buckskin': set(['W Buckskin']),
'Canto': set(['Via Bel Canto']),
'Charleston': set(['W Charleston']),
'Cir': set(['Sego Glen Cir']),
'Circle': set(['Citadel Circle',
'Inner Circle',
'Mall Ring Circle',
'Village Center Circle']),
'Dr': set(['Club House Dr',
'Corporate Park Dr',
'S Highland Dr',
'Village Walk Dr']),
'East': set(['Town Square East']),
'Experience': set(['Fremont Street Experience']),
'G': set(['Avenue G']),
'Highway': set(['Boulder Highway',
'Nevada Highway',
'North Boulder Highway',
'South Boulder Highway',
'Valley of Fire Highway']),
'Lavender': set(['W Lavender']),
'Ln': set(['Linda Ln']),
'Ln.': set(['Norman Rockwell Ln.']),
'M2': set(['W Craig Rd #M2']),
'Mt.': set(['Spring Mt.']),
'North': set(['Las Vegas Boulevard North']),
'Pkwy': set(['3547 S Maryland Pkwy', 'Grand Central Pkwy']),
'Prado': set(['Paseo del Prado']),
'Rd': set(['2560 E Sunset Rd',
'El Camino Rd',
'Hillpointe Rd',
'Losee Rd',
'Paradise Rd',
'S Fort Apache Rd',
'S Pecos Rd',
'W Craig Rd',
'W Warm Springs Rd']),
'Rd.': set(['E Sunset Rd.']),
'Rd5': set(['Airport Rd5']),
'S': set(['Las Vegas Blvd S',
'Las Vegas Boulevard S',
'Silverado Ranch Boulevard;Las Vegas Boulevard S']),
'S.': set(['Las Vegas Boulevard S.']),
'Sahara': set(['W Sahara']),
'South': set(['Las Vegas Blvd South',
'Las Vegas Boulevard South',
'Market Place South']),
'St': set(['Fremont St', 'Marco St', 'S 3rd St', 'S Main St']),
'St.': set(['5070 Arville St.']),
'Vegas': set(['Wynn Las Vegas']),
'Way': set(['6823 W. Ponderosa Way',
'Brandywine Way',
'Conestoga Way',
'Crystal Water Way',
'Donovan Way',
'Nevada Way',
'North Tenaya Way',
'SLS Way',
'Westminster Way',
'Wiesner Way']),
'apache': set(['South Fort apache']),
'ave': set(['W Sahara ave']),
'blvd': set(['N Rainbow blvd',
'S Las Vegas blvd',
'W Charleston blvd',
'W Lake Mead blvd']),
'blvd.': set(['S Las Vegas blvd.']),
'ln': set(['Rockwell ln'])}
Based on the results above, fix the abbreviated and incorrect street names:
```python
mapping = {
    "Ct": "Court",
    "St": "Street",
    "st": "Street",
    "St.": "Street",
    "St,": "Street",
    "ST": "Street",
    "street": "Street",
    "Street.": "Street",
    "Ave": "Avenue",
    "Ave.": "Avenue",
    "ave": "Avenue",
    "AVE": "Avenue",
    "Rd.": "Road",
    "rd.": "Road",
    "Rd": "Road",
    "Hwy": "Highway",
    "HIghway": "Highway",
    "Pkwy": "Parkway",
    "Pl": "Place",
    "place": "Place",
    "Sedgwick": "Sedgwick Street",
    "Sq.": "Square",
    "Newbury": "Newbury Street",
    "Cir": "Circle",
    "S Maryland Parkway Suite A-5-262": "S Maryland Parkway",
    "Spanish Ridge Ave., Suite 1": "Spanish Ridge Avenue",
    "S Eastern Ave #100": "S Eastern Avenue",
    "South Pecos Road Suite 103": "South Pecos Road",
    "Chandler Ave #11": "Chandler Ave",
}

def update_name(name, mapping):
    for key in mapping:
        if key in name:
            name = string.replace(name, key, mapping[key])
    return name

for street_type, ways in sort_street_types.iteritems():
    for name in ways:
        update_name(name, mapping)
```
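As a quick sanity check, here is a sketch applying update_name to two names from the audit output (with overlapping keys, the result of this substring replacement can depend on dictionary iteration order):

```python
# Two names taken from the audit output above:
print update_name("Marco St", mapping)        # expected: Marco Street
print update_name("W Craig Rd #M2", mapping)  # expected: W Craig Road #M2
```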
2. Incorrect postal codes
```python
def audit_zipcode(invalid_zipcodes, zipcode):
    threeDigits = zipcode[0:3]
    if len(zipcode) != 6:
        invalid_zipcodes[threeDigits].add(zipcode)
    elif not threeDigits.isdigit():
        invalid_zipcodes[threeDigits].add(zipcode)
    elif threeDigits != '891':
        invalid_zipcodes[threeDigits].add(zipcode)

def is_zipcode(elem):
    return (elem.attrib['k'] == "addr:postcode")

def audit_zip(osmfile):
    osm_file = open(osmfile, "r")
    invalid_zipcodes = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_zipcode(tag):
                    audit_zipcode(invalid_zipcodes, tag.attrib['v'])
    return invalid_zipcodes

sort_zipcode = audit_zip(file_name)
pprint.pprint(dict(sort_zipcode))
{'645': set(['6451112']),
'890': set(['89002',
'89005',
'89011',
'89012',
'89014',
'89014-2132',
'89015',
'89019',
'89025',
'89030',
'89031',
'89040',
'89044',
'89052',
'89070',
'89074',
'89081',
'89086']),
'891': set(['89101',
'89102',
'89102-4370',
'89103',
'89104',
'89104-1307',
'89105',
'89106',
'89107',
'89108',
'89108-7049',
'89109',
'89109-1907',
'89110',
'89113',
'89115',
'89117',
'89118',
'89119',
'89119-1001',
'89119-6304',
'89120',
'89121',
'89122',
'89123',
'89128',
'89128-6634',
'89129',
'89129-7260',
'89130',
'89131',
'89134',
'89135',
'89135-1020',
'89135-1038',
'89139',
'89142',
'89144',
'89145',
'89146',
'89146-2977',
'89147',
'89147-4111',
'89147-8491',
'89148',
'89149',
'89154',
'89156',
'89161',
'89166',
'89169',
'89178',
'89179',
'89183',
'89191']),
'892': set(['8929']),
'NV ': set(['NV 89014',
'NV 89030',
'NV 89031',
'NV 89052',
'NV 89101',
'NV 89107',
'NV 89109',
'NV 89117',
'NV 89119',
'NV 89123',
'NV 89124',
'NV 89129',
'NV 89134',
'NV 89135',
'NV 89142',
'NV 89191']),
'Nev': set(['Nevada 89113'])}
Based on these results, fix the postal codes:

```python
def update_zip(zipcode):
    if zipcode[0:3] == 'NV ':
        zipcode = zipcode[3:]
        return zipcode
    elif zipcode[0:6] == 'Nevada':
        zipcode = zipcode[7:]
        return zipcode
    elif zipcode[0:3] != '891':
        zipcode = 'None'
        return zipcode
    elif zipcode[0:3] == '891' and len(zipcode) > 6:
        zipcode = zipcode[0:5]
        return zipcode
    return zipcode  # already a plain 891xx code, keep it as is

for street_type, ways in sort_zipcode.iteritems():
    for name in ways:
        if update_zip(name):
            update_zip(name)
```
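A few spot checks, using values taken from the audit output above (the expected results follow directly from the branches of update_zip):

```python
print update_zip('NV 89014')      # strips the 'NV ' prefix -> '89014'
print update_zip('Nevada 89113')  # strips 'Nevada '        -> '89113'
print update_zip('89119-1001')    # truncates ZIP+4         -> '89119'
print update_zip('6451112')       # not an 891xx code       -> 'None'
```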
Next, we clean the file and split the data into CSV files. The tags_clean function below sets the id, key, value, and type fields, applying the street-name and postcode fixes as each tag is converted into a dictionary:
```python
OSM_PATH = file_name

NODES_PATH = "nodes.csv"
NODE_TAGS_PATH = "nodes_tags.csv"
WAYS_PATH = "ways.csv"
WAY_NODES_PATH = "ways_nodes.csv"
WAY_TAGS_PATH = "ways_tags.csv"

NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']

SCHEMA = schema.Schema

LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def tags_clean(id, tag):
    """Build one tag dict, applying the street-name and postcode fixes."""
    node_tagss = {}
    node_tagss['id'] = int(id)
    if tag.attrib['k'] == "addr:street":
        node_tagss['value'] = update_name(tag.attrib['v'], mapping)
    elif tag.attrib['k'] == "addr:postcode":
        node_tagss['value'] = update_zip(tag.attrib['v'])
    else:
        node_tagss['value'] = tag.attrib['v']
    if ":" not in tag.attrib['k']:
        node_tagss['key'] = tag.attrib['k']
        node_tagss['type'] = 'regular'
    else:
        pcolon = tag.attrib['k'].index(":") + 1
        node_tagss['key'] = tag.attrib['k'][pcolon:]
        node_tagss['type'] = tag.attrib['k'][:pcolon - 1]
    return node_tagss


def shape_element(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS,
                  problem_chars=PROBLEMCHARS, default_tag_type='regular'):
    """Clean and shape node or way XML element to Python dict"""
    node_attribs = {}
    way_attribs = {}
    way_nodes = []
    tags = []

    if element.tag == 'node':
        for node in NODE_FIELDS:
            node_attribs[node] = element.attrib[node]
        node_attribs['id'] = int(node_attribs['id'])
        node_attribs['uid'] = int(node_attribs['uid'])
        node_attribs['changeset'] = int(node_attribs['changeset'])
        node_attribs['lon'] = float(node_attribs['lon'])
        node_attribs['lat'] = float(node_attribs['lat'])
        for tag in element.iter("tag"):
            if PROBLEMCHARS.search(tag.attrib['k']) is None:
                tags.append(tags_clean(node_attribs['id'], tag))
        if node_attribs:
            return {'node': node_attribs, 'node_tags': tags}
        else:
            return None

    elif element.tag == 'way':
        for way in WAY_FIELDS:
            way_attribs[way] = element.attrib[way]
        way_attribs['id'] = int(way_attribs['id'])
        way_attribs['uid'] = int(way_attribs['uid'])
        way_attribs['changeset'] = int(way_attribs['changeset'])
        for tag in element.iter("tag"):
            if PROBLEMCHARS.search(tag.attrib['k']) is None:
                tags.append(tags_clean(way_attribs['id'], tag))
        count = 0
        for nodes in element.iter("nd"):
            wnd = {}
            wnd['id'] = int(way_attribs['id'])
            wnd['node_id'] = int(nodes.attrib['ref'])
            wnd['position'] = count
            count += 1
            way_nodes.append(wnd)
        if way_attribs:
            return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}
        else:
            return None


def get_element(osm_file, tags=('node', 'way', 'relation')):
    """Yield element if it is the right type of tag"""
    context = ET.iterparse(osm_file, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()


def validate_element(element, validator, schema=SCHEMA):
    """Raise ValidationError if element does not match schema"""
    if validator.validate(element, schema) is not True:
        field, errors = next(validator.errors.iteritems())
        message_string = "\nElement of type '{0}' has the following errors:\n{1}"
        error_strings = (
            "{0}: {1}".format(k, v if isinstance(v, str) else ", ".join(v))
            for k, v in errors.iteritems()
        )
        raise cerberus.ValidationError(
            message_string.format(field, "\n".join(error_strings))
        )


class UnicodeDictWriter(csv.DictWriter, object):
    """Extend csv.DictWriter to handle Unicode input"""

    def writerow(self, row):
        super(UnicodeDictWriter, self).writerow({
            k: (v.encode('utf-8') if isinstance(v, unicode) else v)
            for k, v in row.iteritems()
        })

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)


def process_map(file_in, validate):
    """Iteratively process each XML element and write to csv(s)"""
    with codecs.open(NODES_PATH, 'w') as nodes_file, \
         codecs.open(NODE_TAGS_PATH, 'w') as nodes_tags_file, \
         codecs.open(WAYS_PATH, 'w') as ways_file, \
         codecs.open(WAY_NODES_PATH, 'w') as way_nodes_file, \
         codecs.open(WAY_TAGS_PATH, 'w') as way_tags_file:

        nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS)
        node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
        ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS)
        way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS)
        way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS)

        nodes_writer.writeheader()
        node_tags_writer.writeheader()
        ways_writer.writeheader()
        way_nodes_writer.writeheader()
        way_tags_writer.writeheader()

        validator = cerberus.Validator()

        for element in get_element(file_in, tags=('node', 'way')):
            el = shape_element(element)
            if el:
                if validate is True:
                    validate_element(el, validator)
                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])


if __name__ == '__main__':
    process_map(OSM_PATH, validate=False)
```
Finally, create the las-vegas_nevada.db database and import each of the CSV files into it.
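The import step itself isn't shown above, so here is a minimal sketch of one way to do it with Python's built-in sqlite3 module, assuming a nodes table that mirrors NODE_FIELDS; the other four CSVs can be loaded the same way:

```python
import sqlite3
import csv

conn = sqlite3.connect("las-vegas_nevada.db")
cur = conn.cursor()

# Table layout assumed here to mirror NODE_FIELDS above.
cur.execute("""
    CREATE TABLE IF NOT EXISTS nodes (
        id INTEGER PRIMARY KEY, lat REAL, lon REAL, user TEXT,
        uid INTEGER, version TEXT, changeset INTEGER, timestamp TEXT)
""")

with open("nodes.csv", "rb") as f:
    reader = csv.DictReader(f)
    # Usernames were written UTF-8 encoded, so decode them back to unicode.
    rows = [(r['id'], r['lat'], r['lon'], r['user'].decode('utf-8'),
             r['uid'], r['version'], r['changeset'], r['timestamp'])
            for r in reader]

cur.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
conn.commit()
conn.close()
```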
II. Exploring the Data

This section explores the Las Vegas dataset.

File sizes:
- las-vegas_nevada.osm: 215 MB
- las-vegas_nevada.db: 120 MB
- nodes.csv: 84.2 MB
- nodes_tags.csv: 2.28 MB
- ways.csv: 6.46 MB
- ways_nodes.csv: 29.7 MB
- ways_tags.csv: 14.3 MB
Number of nodes
```
sqlite> SELECT count(*)
   ...> FROM nodes;
1048101
```
Number of ways
```
sqlite> SELECT count(*)
   ...> FROM ways;
112116
```
Number of distinct users
```
sqlite> SELECT count(distinct(user.uid))
   ...> FROM (SELECT uid
   ...>       FROM nodes
   ...>       UNION ALL
   ...>       SELECT uid
   ...>       FROM ways) user;
1024
```
Top 10 contributing users
```
sqlite> SELECT user.user, count(*)
   ...> FROM (SELECT user
   ...>       FROM nodes
   ...>       UNION ALL
   ...>       SELECT user
   ...>       FROM ways) user
   ...> GROUP BY user.user
   ...> ORDER BY count(*) DESC
   ...> LIMIT 10;
alimamo|251497
tomthepom|121172
woodpeck_fixbot|70826
alecdhuse|66528
abellao|55639
gMitchellD|44635
robgeb|41070
nmixter|40093
TheDutchMan13|39304
Tom_Holland|33379
```
Number of users who contributed only once
```
sqlite> SELECT count(*)
   ...> FROM (SELECT e.user, count(*)
   ...>       FROM (SELECT user
   ...>             FROM nodes
   ...>             UNION ALL
   ...>             SELECT user
   ...>             FROM ways) e
   ...>       GROUP BY e.user
   ...>       HAVING count(*) = 1);
244
```
III. Additional Data Exploration

Top 10 amenities
```
sqlite> SELECT value, count(*)
   ...> FROM nodes_tags
   ...> WHERE key = 'amenity'
   ...> GROUP BY value
   ...> ORDER BY count(*) DESC
   ...> LIMIT 10;
restaurant|460
place_of_worship|294
fuel|282
fast_food|267
fountain|266
school|208
shelter|122
toilets|85
cafe|78
bar|72
```
Sure enough, food comes first everywhere: restaurants rank first and fast food fourth. Out of curiosity, just how many casinos and similar venues are there? Removing the LIMIT 10 from the query above turns up:

```
casino|10
nightclub|10
theatre|10
cinema|8
stripclub|3
```
Top 10 cuisines
```
sqlite> SELECT nodes_tags.value, count(*)
   ...> FROM nodes_tags
   ...> JOIN (SELECT DISTINCT(id)
   ...>       FROM nodes_tags
   ...>       WHERE value = 'restaurant') i
   ...> ON nodes_tags.id = i.id
   ...> WHERE nodes_tags.key = 'cuisine'
   ...> GROUP BY nodes_tags.value
   ...> ORDER BY count(*) DESC
   ...> LIMIT 10;
mexican|41
pizza|31
american|20
italian|16
steak_house|16
burger|14
chinese|12
japanese|9
asian|8
buffet|7
```
Mexican, pizza, and American restaurants take the top three spots; Mexican restaurants are notably numerous here.
IV. Other Ideas About the Dataset

Below are additional suggestions for improving or analyzing the data, together with the benefits of implementing each improvement and some anticipated problems.

1.
- Suggestion: the map data is currently incomplete, and many users have contributed only once. It would be worth finding ways to raise user activity through incentives such as points, contribution leaderboards, gamified tasks, badges, and privileges for high-volume contributors.
- Benefit: more contributions, richer map data.
- Problem: the process is fairly involved and needs adequate staffing and funding.

2.
- Suggestion: users with few contributions are the most likely to submit non-standard information, so cleaning could start with the data from the least active contributors.
- Benefit: improves the consistency and accuracy of the data.
- Problem: may dampen new users' enthusiasm.

3.
- Suggestion: the data formats are inconsistent; the input format could be standardized.
- Benefit: makes the data more uniform.
- Problem: rigid rules are inflexible and may not fit the conventions of every region.

4.
- Suggestion: since this dataset is incomplete, external APIs such as AutoNavi (Gaode) Maps, Google Maps, or Pokémon GO could be used to fill in and cross-check the data in this map.
- Benefit: fills in missing information and calibrates existing data.
- Problem: the external data may overlap heavily with what is already here, or be hard to import because of format differences; and since the parties involved are commercial competitors, such cooperation would likely be difficult to arrange.