Archive for November, 2017

Duplicated Line Finding in Go

Saturday, November 18th, 2017

I’m learning to read, write and understand Go at the moment:

https://golang.org/

As part of this, I am reading “The Go Programming Language” by Donovan and Kernighan and completing the exercises in it.

In the first tutorial chapter, there are samples of code for detecting duplicated lines in text files. The technique put forward by book is to use a map of strings to integers, where the key is the line and the value is the count. The programs simply loop over the lines and increment the value for the current line in the map.

An exercise that is left for the reader is adapt one of the existing sample programs to read several files and print the names of the files that contain each duplicated line.

The first problem to overcome is that the names of the files have not be recorded as the input files are being processed. This means that when we come to print the data that we have collected, we have lost that information.

My first approach was to add a map of strings to string slices. The idea was to add the file name to the string slice for that line as each line is read. This works in the sense that we have the list of file names for each line that is duplicated. However, the list of file names will contain duplicates itself if a line is duplicated in a file. I could wrap the code to append the new file name in an “if” statement and check whether that item is already there. However, checking whether a slice contains an item is O(n), which sounds like too much unnecessary work. I needed data structure that discards duplicates so that each item is unique. One way to do this is to use the keys of a map of strings to integers as the values that you wish to keep unique and increment the values of the map.

I created the following map of strings to maps of strings to ints:

lineCountsInFiles := make(map[string]map[string]int)

and incremented the values as simply as:

lineCountsInFiles[line][filename]++

One gotcha that I ran into during execution was that the nested map also needs to be initialized before we can increment the value:

if lineCountsInFiles[line] == nil {
    lineCountsInFiles[line] = make(map[string]int)
}

This caught me because we can increment a value in a map of strings to ints without any initialization as the int has default value that be incremented:

counts[line]++

Once the files have been processed, we have a map from the lines in all the files to a map from the names of the files in which they appear to the counts of each line in that file.

Putting everything together looks like this:

// Prints the counts, lines and names of files where
// a line is duplicated
package main
 
import (
    "bufio"
    "fmt"
    "os"
)
 
func main() {
    counts := make(map[string]int)
    lineCountsInFiles := make(map[string]map[string]int)
 
    for _, filename := range os.Args[1:] {
        f, err := os.Open(filename)
        if err != nil {
            fmt.Fprintf(os.Stderr, "Problem reading %v: %v\n", filename, err)
            continue
        }
        input := bufio.NewScanner(f)
        for input.Scan() {
            line := input.Text()
            counts[line]++
            if lineCountsInFiles[line] == nil {
                lineCountsInFiles[line] = make(map[string]int)
            }
            lineCountsInFiles[line][filename]++
        }
        f.Close()
    }
 
    for line, n := range counts {
        if n > 1 {
            fmt.Printf("%d, %v\n", n, line)
            for filename, count := range lineCountsInFiles[line] {
                fmt.Printf("\t%d,%v\n", count, filename)
            }
        }
    }
}

Which is saved here:

https://github.com/robert-impey/CodingExperiments/blob/master/Go/dup-with-names.go

For files like:

apples
coconuts
apples
bananas
apples
bananas

https://raw.githubusercontent.com/robert-impey/CodingExperiments/master/Go/duplicated-fruit.txt

and

apples

https://raw.githubusercontent.com/robert-impey/CodingExperiments/master/Go/more-duplicated-fruit.txt

The output might be something like:

C:\Users\Robert\code\CodingExperiments\Go>dup-with-names.exe duplicated-fruit.txt more-duplicated-fruit.txt
4, apples
        3,duplicated-fruit.txt
        1,more-duplicated-fruit.txt
2, bananas
        2,duplicated-fruit.txt

I write “might be” as the order that keys are retrieved from maps is deliberately undefined.

There are probably lots of ways to solve this problem. If you have any suggestions, please let me know.