0.5 Introduction
I decided to look around at the excellent data provided by Kartverket, the official mapping authority in Norway. The data is available in their Central Register of Place Names dataset. We set an initial goal for our project: we’d like to use these place names for a password generator. My motivation was to write down “trivial” commands for beginners, and to show why Bash is worth learning.
Let’s get started!
1. Prerequisites
Note to Mac users like me: we need the GNU versions of the commands, not the BSD ones shipped with macOS.
brew install coreutils findutils gnu-tar gnu-sed gawk gnutls gnu-indent gnu-getopt grep
You can now run the GNU equivalent of a command by adding g in front, e.g., gsed or gawk instead of sed and awk.
2. Curiosity
Download the big file in gml format and unzip it. How to unzip using the terminal: unzip bigzipfile.zip
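If you would rather stay in Python, the standard library can unpack the archive too. This is a minimal sketch: bigzipfile.zip is the placeholder name from above, and extract_all is a helper name made up for this example.

```python
import zipfile


def extract_all(archive, target_dir="."):
    """Extract every member of a zip archive and return the member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        return zf.namelist()
```

Calling extract_all("bigzipfile.zip") unpacks into the current directory and returns the list of extracted file names.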
How big is this file? ls -lh *.gml (h for human-readable sizes, l for the long listing format, and * is a wildcard because I’m lazy: the file has a long name, and I’d rather not type it).
The file is about 6GB, unpacked from a ~250MB archive! How many lines is it? wc does word counts, but we can tell it to count lines instead: wc -l (l for lines).
About 125e6 lines, neat.
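For comparison, here is roughly what ls -lh and wc -l do, sketched in Python. Both helper names are made up for this example, and the size formatting only approximates what ls prints. The line counter reads the file in chunks, so it never holds 6GB in memory.

```python
def human_size(num_bytes):
    """Format a byte count roughly like `ls -lh` (powers of 1024)."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters like `wc -l`, reading fixed-size chunks."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total
```

Combine human_size with os.path.getsize(path) to get the size of a file on disk.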
Here is a small portion of the file, handcrafted for you:
<app:Sted xmlns:app="http://skjema.geonorge.no/SOSI/produktspesifikasjon/Stedsnavn/5.0" gml:id="Sted.1">
...
<app:stedsnavn>
<app:Stedsnavn>
...
<app:språk>nor</app:språk>
<app:stedsnavnnummer>1</app:stedsnavnnummer>
<app:skrivemåte>
<app:Skrivemåte>
<app:langnavn>Stornesodden</app:langnavn>
...
You might be tempted to open it in your favorite editor, but keep the size in mind when doing so. How to inspect just the top or bottom of this file? head filename and tail filename.
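The same peek at the top or bottom can be sketched in Python without loading the whole file. The function names simply mirror the commands. One difference worth knowing: this tail streams through the entire file and keeps only the last n lines, whereas the real tail command can jump straight to the end.

```python
from collections import deque
from itertools import islice


def head(path, n=10):
    """Return the first n lines, stopping as soon as they are read."""
    with open(path, encoding="utf-8") as handle:
        return list(islice(handle, n))


def tail(path, n=10):
    """Return the last n lines, holding at most n lines in memory."""
    with open(path, encoding="utf-8") as handle:
        return list(deque(handle, maxlen=n))
```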
3. Parsing
This part is split in two: an advanced XML-parser and a simple regex-parser. The XML-parser extracts only the location names whose language is Norwegian, not Sami or Finnish for instance. The regex parser is more straightforward and doesn’t use this information; it only looks at legal characters.
Alt 1: XML-parser
The regex parser ignores the information in the XML file about the language of each place name. The Norwegian register also includes Sami place names, which contain letters that are difficult to type for non-Sami people. E.g., Jiří. Therefore, we only select place names (<app:langnavn>) whose enclosing <app:Stedsnavn> has <app:språk> set to nor, which is what the query in the code checks. Python and lxml did wonders, although it takes about 10 minutes. The upside is a low memory footprint.
Run pip install lxml before you save and run the code in the same directory as the *.gml file.
from lxml import etree


# https://stackoverflow.com/a/42193997
def iterate_xml(xmlfile):
    """Yield each top-level record, clearing the tree as we go to save memory."""
    doc = etree.iterparse(xmlfile, events=('start', 'end'))
    _, root = next(doc)  # the first event is the start of the root element
    start_tag = None
    for event, element in doc:
        if event == 'start' and start_tag is None:
            start_tag = element.tag
        if event == 'end' and element.tag == start_tag:
            yield element
            start_tag = None
            root.clear()  # drop the records that are already processed


# https://stackoverflow.com/a/42757094
def newline_generator(items):
    """Yield every item followed by a newline, for use with writelines()."""
    for item in items:
        yield item
        yield '\n'


xmlfile = 'Basisdata_0000_Norge_25833_Stedsnavn_GML.gml'
outfile = 'filtered_names.txt'
namespaces = {
    'app': 'http://skjema.geonorge.no/SOSI/produktspesifikasjon/Stedsnavn/5.0',
    'wfs': 'http://www.opengis.net/wfs/2.0'
}
# Names (langnavn) inside a stedsnavn whose language (språk) is 'nor'.
xpath_query = "*//app:stedsnavn[app:Stedsnavn/app:språk/text()=('nor')]" \
              "/app:Stedsnavn/app:skrivemåte/app:Skrivemåte/app:langnavn/text()"

# The names contain non-ASCII letters, so write the output as UTF-8 explicitly.
with open(outfile, 'w', encoding='utf-8') as output:
    for x in iterate_xml(xmlfile):
        matches = x.xpath(xpath_query, namespaces=namespaces)
        output.writelines(newline_generator(matches))
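To see the language filter in action without the 6GB file, here is a small sketch that uses only Python's standard library instead of lxml. The sample document is made up for this example: the second record, its sme language code and its name Dummy are assumptions, not data from the register. It illustrates the idea and is not a replacement for the script above.

```python
import io
import xml.etree.ElementTree as ET

APP = "http://skjema.geonorge.no/SOSI/produktspesifikasjon/Stedsnavn/5.0"

# Hypothetical two-record sample, shaped like the excerpt shown earlier.
SAMPLE = f"""<root xmlns:app="{APP}">
  <app:Sted>
    <app:stedsnavn><app:Stedsnavn>
      <app:språk>nor</app:språk>
      <app:skrivemåte><app:Skrivemåte>
        <app:langnavn>Stornesodden</app:langnavn>
      </app:Skrivemåte></app:skrivemåte>
    </app:Stedsnavn></app:stedsnavn>
  </app:Sted>
  <app:Sted>
    <app:stedsnavn><app:Stedsnavn>
      <app:språk>sme</app:språk>
      <app:skrivemåte><app:Skrivemåte>
        <app:langnavn>Dummy</app:langnavn>
      </app:Skrivemåte></app:skrivemåte>
    </app:Stedsnavn></app:stedsnavn>
  </app:Sted>
</root>"""


def names_in_language(source, language="nor"):
    """Yield langnavn texts whose enclosing Stedsnavn has the given språk."""
    for _, element in ET.iterparse(source, events=("end",)):
        if element.tag == f"{{{APP}}}Sted":
            for name in element.iter(f"{{{APP}}}Stedsnavn"):
                if name.findtext(f"{{{APP}}}språk") == language:
                    for langnavn in name.iter(f"{{{APP}}}langnavn"):
                        yield langnavn.text
            element.clear()  # free the finished record


print(list(names_in_language(io.BytesIO(SAMPLE.encode("utf-8")))))
```

Only the record tagged nor survives the filter; passing language="sme" would return the made-up second name instead.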
Alt 2: Regex
We need to extract the place names from our xml formatted dataset. Regex (short for regular expressions) is a formal rule language for selecting or replacing text in strings. It is a general concept which is implemented in many commands, e.g., sed for replacement and grep / awk for extraction.
Run
grep -Po "<app:langnavn>\K\w+(?=</app:langnavn>)" \
Basisdata_0000_Norge_25833_Stedsnavn_GML.gml > place_names.txt
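The -P flag turns on Perl-style regex, and -o prints only the matching part of each line. Roughly, \K works like a “look behind” for <app:langnavn>: everything matched before it is dropped from the result. (?=...) is a “look ahead” for </app:langnavn>. \w is any word character (a letter, digit, or underscore), and + means 1 or more. I.e., look behind for the start tag, look ahead for the end tag, and return the word characters in between. Names that contain spaces or hyphens do not match \w+ and are skipped.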
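The same extraction can be tried out in Python on a few lines of text. Python's re module has no \K, so this sketch uses a real look-behind instead. The sample lines are made up for illustration, with the second one added to show what happens to a name containing a space.

```python
import re

# Look behind for the start tag, look ahead for the end tag,
# and keep only the word characters in between.
LANGNAVN = re.compile(r"(?<=<app:langnavn>)\w+(?=</app:langnavn>)")

# Hypothetical sample lines, shaped like the excerpt shown earlier.
sample = (
    "<app:langnavn>Stornesodden</app:langnavn>\n"
    "<app:langnavn>Two Words</app:langnavn>\n"
    "<app:språk>nor</app:språk>\n"
)

names = LANGNAVN.findall(sample)
print(names)  # ['Stornesodden']
```

The two-word name is not returned: \w+ stops at the space, and the look-ahead then fails to find the end tag.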
I usually use this service to construct my regex.
Notice the > sign, which means “shove the output from the left into the file on the right”. This is called redirection.
4. Sanitation
We now have a place_names.txt! Check the file size and line count again. This file is safe to open in any editor.
The data must be cleaned before it can be used:
- Remove names with illegal characters
- Remove duplicates
- Convert all characters to lowercase
Illegal characters: sed + regex
We can remove every name that contains an illegal character. The strictest version deletes any line with something other than A-Z or a-z:
sed -i '/[^A-Za-z]/d' place_names.txt
We can also replace illegal letters with their closest ASCII equivalent, e.g., Å→A, Ø→O, Æ→AE. iconv can do this, but not all characters are mappable, e.g., ǯ. Therefore, I chose to first remove every name containing anything that isn’t A-Z or ÆØÅ:
sed -i '/[^A-Za-zæøåÆØÅ]/d' place_names.txt
Inside the brackets, ^ means not, so the regex matches any character outside the listed set, and d deletes every line that has such a match. We use both A-Z and a-z because of capitalization.
Finally, map the letters:
iconv -f UTF8 -t US-ASCII//TRANSLIT place_names.txt > legal_names.txt
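The two steps (filter, then map) can also be sketched in Python. The TO_ASCII table is hand-written from the Å→A, Ø→O, Æ→AE examples above and is an assumption: the transliteration iconv produces may differ between systems. The clean helper is a name made up for this sketch.

```python
import re

# Anything outside A-Z, a-z and the three Norwegian letters is "illegal".
ILLEGAL = re.compile(r"[^A-Za-zæøåÆØÅ]")

# Hand-written mapping (assumption, not iconv's actual table).
TO_ASCII = str.maketrans({
    "æ": "ae", "ø": "o", "å": "a",
    "Æ": "AE", "Ø": "O", "Å": "A",
})


def clean(names):
    """Drop names with illegal characters, then map æ/ø/å to ASCII."""
    return [name.translate(TO_ASCII) for name in names if not ILLEGAL.search(name)]
```

For example, clean(["Ålesund", "Jiří", "Bodø"]) keeps Ålesund and Bodø in their ASCII form and drops the name with unmappable letters.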
Lowercase
We have to make all letters lower case. tr can do this, but it doesn’t like utf-8, so we use awk instead.
awk '{print tolower($0)}' legal_names.txt > lowered_places.txt
Duplicates
How do we remove duplicate strings from a file? You’d sort the places by name and then loop over all the names. Remove an element if it is the same as the previous element. Both these steps are done with
sort -u lowered_places.txt > unique_names.txt
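In Python, lowercasing and de-duplicating fit in one line, because a set cannot hold the same string twice. A minimal sketch with a made-up helper name; note that Python sorts by code point, so the order may differ from what sort gives under your locale.

```python
def lower_unique(names):
    """Lowercase every name, drop duplicates, and return them sorted."""
    return sorted({name.lower() for name in names})
```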
Once again, check the file size and line count.
5. Generator
XKCD-password-generator is useful for us. We can feed this library a list of words, some rules, and it will generate some passwords for us. We’ve done most of the job by now!
Run pip install xkcdpass and run this code:
from xkcdpass import xkcd_password as xp
import os

base = os.path.dirname(os.path.abspath(__file__))
wordfile = os.path.join(base, 'unique_names.txt')
place_names = xp.generate_wordlist(
    wordfile=wordfile,
    min_length=3,
    max_length=10)


def generate_password(word_class):
    password = xp.generate_xkcdpassword(word_class, numwords=3, case='random')
    return password.replace(' ', '-')


if __name__ == '__main__':
    print(generate_password(place_names))
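If you are curious what a generator like this boils down to, here is a minimal sketch using only the standard library. It is not how xkcdpass is implemented and it leaves out the random casing; load_words and make_password are made-up names. The secrets module is used because it draws from a cryptographically strong random source, which matters for passwords.

```python
import secrets


def load_words(lines, min_length=3, max_length=10):
    """Keep stripped words whose length is within the given bounds."""
    words = (line.strip() for line in lines)
    return [w for w in words if min_length <= len(w) <= max_length]


def make_password(words, numwords=3, delimiter="-"):
    """Join numwords words picked with a cryptographically strong RNG."""
    if not words:
        raise ValueError("word list is empty")
    return delimiter.join(secrets.choice(words) for _ in range(numwords))
```

To try it on the real list, open unique_names.txt, pass the file object to load_words, and hand the result to make_password.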
Here is a small Flask service running as a demo.