Norwegian place names for generating passwords

0.5 Introduction

I decided to look at the excellent data provided by Kartverket, the official mapping authority in Norway, in their Central Register of Place Names dataset. The goal for this project is to use these place names in a password generator. My motivation was to write down some “trivial” commands for beginners and to make a case for learning Bash.

Let’s get started!

1. Prerequisites

Note to Mac users like me: we need the GNU versions of the commands, not the ones shipped with macOS.

brew install coreutils findutils gnu-tar gnu-sed gawk gnutls gnu-indent gnu-getopt grep

You can now run the GNU equivalent of a command by adding g in front, e.g., gsed or gawk instead of sed and awk.

2. Curiosity

Download the big file in GML format and unzip it. To unzip from the terminal, run unzip bigzipfile.zip. How big is the file? ls -lh *.gml (l for long listing, h for human-readable sizes, and * is a wildcard because I’m lazy: the file has a long name, and I’d rather not type it). The file is about 6 GB, up from roughly 250 MB compressed! How many lines does it have? wc counts words by default, but we can tell it to count lines instead: wc -l *.gml (l for lines). About 125e6 lines, neat.
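If you would rather stay in Python, the same two questions (how big, how many lines) can be answered without loading the file into memory. This is only a sketch: the function name count_lines and the throwaway demo file are my own stand-ins, not part of the original workflow.

```python
import os
import tempfile


def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters by reading the file in fixed-size binary chunks."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


# Demo on a tiny throwaway file (a hypothetical stand-in for the real .gml file).
with tempfile.NamedTemporaryFile("wb", suffix=".gml", delete=False) as tmp:
    tmp.write(b"<a>\n<b/>\n</a>\n")
demo_path = tmp.name

line_count = count_lines(demo_path)      # same idea as: wc -l
size_bytes = os.path.getsize(demo_path)  # same idea as the size column of: ls -l
print(line_count, "lines,", size_bytes, "bytes")
os.remove(demo_path)
```

Reading in chunks keeps memory use flat, which matters for a 6 GB file.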

Here is a small portion of the file, handcrafted for you:

<app:Sted xmlns:app="http://skjema.geonorge.no/SOSI/produktspesifikasjon/Stedsnavn/5.0" gml:id="Sted.1">
  ...
  <app:stedsnavn>
    <app:Stedsnavn>
      ...
      <app:språk>nor</app:språk>
        <app:stedsnavnnummer>1</app:stedsnavnnummer>
        <app:skrivemåte>
          <app:Skrivemåte>
            <app:langnavn>Stornesodden</app:langnavn>
...

You might be tempted to open it in your favorite editor, but keep the size in mind when doing so. To inspect just the top or bottom of the file, use head filename and tail filename, which print the first and last ten lines.
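For comparison, here is roughly what head and tail do, sketched in Python. The helper names and the demo file are assumptions for illustration. Unlike the real tail, this version streams through the whole file instead of seeking from the end, so on the full dataset it is slow, though still memory-friendly.

```python
import itertools
import os
import tempfile
from collections import deque


def head(path, n=10):
    """Return the first n lines without reading the rest of the file."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]


def tail(path, n=10):
    """Return the last n lines, keeping at most n lines in memory at a time."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


# Demo on a small throwaway file with 100 numbered lines.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", delete=False) as tmp:
    tmp.writelines(f"line {i}\n" for i in range(1, 101))
demo_path = tmp.name

first_two = head(demo_path, 2)
last_two = tail(demo_path, 2)
print(first_two, last_two)
os.remove(demo_path)
```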

3. Parsing

This part is split in two: an advanced XML parser and a simple regex parser. The XML parser extracts only the place names whose language is Norwegian, not Sami or Finnish for instance. The regex parser is more straightforward and doesn’t use this information; it only looks at which characters a name contains.

Alt 1: XML-parser

The regex parser ignores what the XML file says about the language of each name. The dataset also includes Sami place names, which contain letters that are hard to type without a Sami keyboard layout, e.g., č, đ, ŋ, or ǯ. Therefore, we only select place names (<app:langnavn>) whose enclosing <app:Stedsnavn> has <app:språk> set to nor. Python and lxml did wonders, although the run takes about 10 minutes. The upside is a low memory footprint.

Run pip install lxml, then save the code below in the same directory as the *.gml file and run it.

from lxml import etree

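# https://stackoverflow.com/a/42193997
def iterate_xml(xmlfile):
    """Yield the top-level elements under the root one at a time."""
    doc = etree.iterparse(xmlfile, events=('start', 'end'))
    _, root = next(doc)  # the first event is the start of the root element
    start_tag = None
    for event, element in doc:
        # Remember the tag of the element that just opened ...
        if event == 'start' and start_tag is None:
            start_tag = element.tag
        # ... and yield once an element with that tag has closed.
        if event == 'end' and element.tag == start_tag:
            yield element
            start_tag = None
            root.clear()  # drop parsed elements to keep memory usage low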


# https://stackoverflow.com/a/42757094
def newline_generator(items):
    for item in items:
        yield item
        yield '\n'

xmlfile = 'Basisdata_0000_Norge_25833_Stedsnavn_GML.gml'
outfile = 'filtered_names.txt'

namespaces = {
    'app':'http://skjema.geonorge.no/SOSI/produktspesifikasjon/Stedsnavn/5.0',
    'wfs':'http://www.opengis.net/wfs/2.0'
}

xpath_query = "*//app:stedsnavn[app:Stedsnavn/app:språk/text()=('nor')]" \
    "/app:Stedsnavn/app:skrivemåte/app:Skrivemåte/app:langnavn/text()"


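# Write one name per line. The encoding is set explicitly so that æ, ø and å
# are written as UTF-8 regardless of the platform default.
with open(outfile, 'w', encoding='utf-8') as output:
    for x in iterate_xml(xmlfile):
        matches = x.xpath(xpath_query, namespaces=namespaces)
        output.writelines(newline_generator(matches))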
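To see the language filter in isolation, here is a minimal sketch that runs on a tiny in-memory sample instead of the 6 GB file. It swaps lxml for the standard library's xml.etree.ElementTree so it runs without extra installs. The sample document, the sme language code for the Sami entry, and the function name names_in_language are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

NS = "http://skjema.geonorge.no/SOSI/produktspesifikasjon/Stedsnavn/5.0"
APP = "{" + NS + "}"  # ElementTree writes namespaced tags as {uri}localname

# Hypothetical two-entry sample, shaped like the excerpt from the dataset.
SAMPLE = f"""<root xmlns:app="{NS}">
  <app:Sted><app:stedsnavn><app:Stedsnavn>
    <app:språk>nor</app:språk>
    <app:skrivemåte><app:Skrivemåte>
      <app:langnavn>Stornesodden</app:langnavn>
    </app:Skrivemåte></app:skrivemåte>
  </app:Stedsnavn></app:stedsnavn></app:Sted>
  <app:Sted><app:stedsnavn><app:Stedsnavn>
    <app:språk>sme</app:språk>
    <app:skrivemåte><app:Skrivemåte>
      <app:langnavn>Čáhcesuolu</app:langnavn>
    </app:Skrivemåte></app:skrivemåte>
  </app:Stedsnavn></app:stedsnavn></app:Sted>
</root>"""


def names_in_language(source, language="nor"):
    """Stream the XML and yield langnavn values for one språk code."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != APP + "Stedsnavn":
            continue
        # On the "end" event all children of this Stedsnavn are complete.
        if elem.findtext(APP + "språk") == language:
            for name in elem.iter(APP + "langnavn"):
                yield name.text
        elem.clear()  # release the children we no longer need


names = list(names_in_language(io.BytesIO(SAMPLE.encode("utf-8"))))
print(names)
```

The logic is the same as in the XPath query: look at each Stedsnavn element, check its språk child, and keep the langnavn values underneath it only when the language matches.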

Alt 2: Regex

We need to extract the place names from our XML-formatted dataset. Regex (short for regular expressions) is a formal pattern language for selecting or replacing text in strings. It is a general concept implemented by many commands, e.g., sed for replacement and grep / awk for extraction. Run

grep -Po "<app:langnavn>\K\w+(?=</app:langnavn>)" \
Basisdata_0000_Norge_25833_Stedsnavn_GML.gml > place_names.txt

Source

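Roughly, \K works like a “look behind”: everything matched before it (here <app:langnavn>) is required but left out of the output. (?=...) is a “look ahead” for </app:langnavn>. \w is any word character (a letter, a digit, or an underscore), and + means one or more. In other words: find the start tag, require the end tag right after, and return the word characters in between. Names that contain spaces or hyphens do not match \w+ and are therefore skipped.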
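The same extraction can be sketched with Python's re module. Python's re has no \K, so this version uses a capture group instead. The sample lines, including the made-up name "Store Nes", are only there to show which names the pattern keeps.

```python
import re

# Hypothetical sample lines in the same shape as the dataset excerpt.
sample = (
    "<app:langnavn>Stornesodden</app:langnavn>\n"
    "<app:langnavn>Store Nes</app:langnavn>\n"
    "<app:langnavn>Ålesund</app:langnavn>\n"
)

# The parentheses capture the word characters between the two tags.
pattern = re.compile(r"<app:langnavn>(\w+)</app:langnavn>")
names = pattern.findall(sample)
print(names)
```

The name with a space is dropped because \w+ cannot match the space, so the closing tag is never reached.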

I usually use this service to construct my regex.

Notice the > sign, which means “shove the output from the left into the file on the right”. This is called redirection. If the file already exists, it is overwritten.

4. Sanitation

We now have a place_names.txt! Check the file size and line count again. This file is safe to open in any editor. The data must be cleaned before it can be used:

  • Remove names with illegal characters
  • Convert all characters to lowercase
  • Remove duplicates

Illegal characters: sed + regex

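The strictest option is to delete every name that contains anything other than the ASCII letters A-Z. The -i flag makes sed edit the file in place. Note that this also throws away every name containing æ, ø or å, so skip it if you want to keep those:

sed -i '/[^A-Za-z]/d' place_names.txt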

We can also keep names containing Norwegian letters and replace those letters with their ASCII equivalents, e.g., Å→A, Ø→O, Æ→AE. iconv can do this, but not all characters are mappable, e.g., ǯ. Therefore, I choose to first remove every name containing anything that isn’t A-Z or ÆØÅ.

sed -i '/[^A-Za-zæøåÆØÅ]/d' place_names.txt

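Inside the brackets, ^ means not, so the regex matches any character outside the listed set, i.e., anything that is weird. The trailing d deletes every line with such a match. We list both A-Z and a-z because the match is case-sensitive.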
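The filter is easy to try out on a handful of names first. Below is a small Python sketch of the same rule; the helper name keep_legal and the sample names are my own illustrations.

```python
import re

# Same character class as in the sed command: anything outside the set is illegal.
ILLEGAL = re.compile(r"[^A-Za-zæøåÆØÅ]")


def keep_legal(names):
    """Keep only names in which no illegal character is found."""
    return [name for name in names if not ILLEGAL.search(name)]


sample = ["Stornesodden", "Ålesund", "Store Nes", "Čáhcesuolu", "Bodø"]
legal = keep_legal(sample)
print(legal)
```

A single illegal character, including a space, is enough to drop the whole name, just as sed deletes the whole line.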

Finally, map the letters:

iconv -f UTF8 -t US-ASCII//TRANSLIT place_names.txt > legal_names.txt
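If you want full control over the mapping, an explicit table is an alternative. This sketch is not what iconv does internally; it simply hard-codes the replacements mentioned above (Å→A, Ø→O, Æ→AE), and the names to_ascii and ASCII_MAP are assumptions.

```python
# Explicit mapping for the three Norwegian letters, in both cases.
ASCII_MAP = str.maketrans(
    {"æ": "ae", "ø": "o", "å": "a", "Æ": "AE", "Ø": "O", "Å": "A"}
)


def to_ascii(name):
    """Replace æ, ø and å with ASCII letters; other characters are untouched."""
    return name.translate(ASCII_MAP)


converted = [to_ascii(name) for name in ["Ålesund", "Bodø", "Færder"]]
print(converted)
```

Because the previous step already removed everything outside A-Z and ÆØÅ, these six entries cover every non-ASCII letter left in the file.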

Lowercase

We have to make all letters lowercase. tr can do this, but it doesn’t handle UTF-8 well, so we use awk instead.

awk '{print tolower($0)}' legal_names.txt > lowered_places.txt

Duplicates

How do we remove duplicate strings from a file? One way is to sort the names and then loop over them, removing every element that equals the previous one. sort does both steps at once with the -u (unique) flag:

sort -u lowered_places.txt > unique_names.txt
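The lowercase and de-duplication steps together are short enough to sketch in Python as well. The function name normalize is hypothetical. Note that Python sorts by code point, while sort follows your locale, so the order can differ for non-ASCII input.

```python
def normalize(names):
    """Lowercase every name, drop duplicates with a set, and sort the result."""
    return sorted({name.lower() for name in names})


unique = normalize(["Bodo", "alesund", "Alesund", "bodo"])
print(unique)
```

Lowercasing has to happen before de-duplication, otherwise "Alesund" and "alesund" would both survive.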

Once again, check the file size and line count.

5. Generator

The XKCD-password-generator library (xkcdpass) does the rest: feed it a list of words and a few rules, and it generates passwords for us. We’ve done most of the job by now!

Run pip install xkcdpass and run this code:

from xkcdpass import xkcd_password as xp
import os

base = os.path.dirname(os.path.abspath(__file__))
wordfile = os.path.join(base, 'unique_names.txt')

place_names = xp.generate_wordlist(
        wordfile=wordfile,
        min_length=3,
        max_length=10)

def generate_password(word_class):
    password = xp.generate_xkcdpassword(word_class, numwords=3, case='random')
    return password.replace(' ', '-')

if __name__ == '__main__':
    print(generate_password(place_names))
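To make clear what the library does for us, here is a dependency-free sketch of the core idea using Python's secrets module. It is not how xkcdpass is implemented and it skips the random capitalization. The function below and its parameters are assumptions that mirror the settings used above (three words, length 3 to 10, joined by hyphens).

```python
import secrets


def generate_password(words, numwords=3, min_length=3, max_length=10, delimiter="-"):
    """Pick numwords random words within the length limits and join them."""
    candidates = [w for w in words if min_length <= len(w) <= max_length]
    if not candidates:
        raise ValueError("no words within the requested length range")
    # secrets.choice draws from a cryptographically strong random source.
    return delimiter.join(secrets.choice(candidates) for _ in range(numwords))


# Hypothetical word list; in practice this would be the lines of unique_names.txt.
words = ["alesund", "bodo", "stornesodden", "oslo", "a"]
password = generate_password(words)
print(password)
```

With this sample list only "alesund", "bodo" and "oslo" pass the length filter, so every password is built from those three.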

Here is a small Flask service running as a demo.