Go: UTF-8 Validator

I’ve been itching to learn Go for a while because the mascot is so cute! I’ve also heard that this language is supposed to be “easy”. I got my chance when I needed to verify whether a file contains any non UTF-8 characters, which was a good excuse to write a simple command-line tool!

First Draft

I tried the “Guided Learning” feature of Google Gemini to help me build this. I was already familiar with the basic syntax from learning it at boot.dev, and I found the feature pretty helpful. It would not give me easy answers, which was annoying, but it did force me to look stuff up myself and actually learn 😛

The flow of the program:

Read CLI arguments to get the specified filename
Open the specified filename
Read the file in chunks, looping until the end of the file
Exit on a file reading error, or on finding non UTF-8 chars
Exit with a success message if no non UTF-8 chars are found

package main

import (
	"fmt"
	"io"
	"os"
	"unicode/utf8"
)

func main() {
	// get filename from the 2nd arg (the first, os.Args[0], is the program's own path)
	filename := os.Args[1]

	// get the stream open, and exit if err exists
	file, err := os.Open(filename)

	if err != nil {
		fmt.Println("Error opening file", filename, err)
		os.Exit(1)
	}
	
	defer file.Close()

	// 1kb memory alloc
	bytes := make([]byte, 1024)

	// store the position
	var offset int64 = 0

	for {
		bytesRead, err := file.ReadAt(bytes, offset)

		// validate before looking at err: ReadAt returns io.EOF together
		// with the final, partially filled chunk
		if !utf8.Valid(bytes[:bytesRead]) {
			fmt.Println("Invalid UTF-8 detected!")
			os.Exit(1)
		}

		// check for end of file
		if err == io.EOF {
			fmt.Println("Reached End of File")
			break
		}

		if err != nil {
			fmt.Println("Error reading file:", err)
			os.Exit(1)
		}

		// increase the offset
		offset += int64(bytesRead)
	}

	fmt.Println("File is UTF-8 valid :)")
}

In this program, we had to allocate the memory manually by using make([]byte, 1024), which provides the space for the data read by file.ReadAt. Each iteration reads up to 1kb of data and checks whether that chunk has any non UTF-8 chars. At the end of each iteration, we increase the offset so we are not re-reading the same chunk over and over again.

The slice bytes does not accumulate - it is only a buffer that file.ReadAt overwrites on every loop. We also need to pass utf8.Valid() only the first bytesRead bytes of the slice: the last chunk might be 500 bytes while the whole slice is 1024, and the remaining 524 bytes are stale data left over from the previous read.

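While this might work most of the time, there are conditions where this program gives the wrong answer. A UTF-8 character can take up to 4 bytes, so a perfectly valid multi-byte character can be split between the first 1kb and the second 1kb. utf8.Valid() then sees an incomplete sequence at the end of one chunk and a stray continuation byte at the start of the next, and the program reports a valid file as invalid.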

Second Draft

package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"unicode"
)

func main() {
	// get filename from the 2nd arg (the first, os.Args[0], is the program's own path)
	filename := os.Args[1]

	// get the stream open, and exit if err exists
	file, err := os.Open(filename)

	if err != nil {
		fmt.Println("Error opening file", filename, err)
		os.Exit(1)
	}

	defer file.Close()

	reader := bufio.NewReader(file)

	for {
		r, size, err := reader.ReadRune()

		if err != nil {
			// check for end of file
			if err == io.EOF {
				fmt.Println("Reached End of File")
				break
			}

			fmt.Println("Error reading rune:", err)
			os.Exit(1)
		}

		// ReadRune reports an invalid byte as U+FFFD with size 1;
		// a real U+FFFD stored in the file is 3 bytes long and is valid
		if r == unicode.ReplacementChar && size == 1 {
			fmt.Println("File Invalid, a non UTF-8 character found")
			os.Exit(1)
		}
	}

	fmt.Println("File is UTF-8 valid :)")
}

In this version, we discarded the manual memory allocation and handed over all the buffer management to the bufio package (buffered IO). We just need to make a new reader with bufio.NewReader(), and use it to read every character (rune) and see if it’s invalid. When ReadRune() hits bytes that are not valid UTF-8, it returns the replacement character U+FFFD (unicode.ReplacementChar) instead of an error.

However, the rune-by-rune approach can be made more efficient by scanning per line instead of per character.

Third Draft

package main

import (
	"bufio"
	"fmt"
	"os"
	"unicode/utf8"
)

func main() {
	// get filename from the 2nd arg (the first, os.Args[0], is the program's own path)
	filename := os.Args[1]

	// get the stream open, and exit if err exists
	file, err := os.Open(filename)

	if err != nil {
		fmt.Println("Error opening file", filename, err)
		os.Exit(1)
	}

	defer file.Close()

	scanner := bufio.NewScanner(file)

	// loop through every line returned after .Scan()
	for scanner.Scan() {
		lineBytes := scanner.Bytes()

		if !utf8.Valid(lineBytes) {
			fmt.Println("Invalid UTF-8 detected on a line!")
			os.Exit(1)
		}
	}

	scanErr := scanner.Err()

	if scanErr != nil {
		fmt.Println("Error during scanning:", scanErr)
		os.Exit(1)
	}

	fmt.Println("File is UTF-8 valid :)")
}

In this version, instead of checking the validity per rune, which gets costly as files grow larger, we check per line. This is done with a scanner from bufio.NewScanner(). scanner.Scan() returns false when it reaches the end of the file or hits an error; afterwards, scanner.Err() returns nil if the scan simply reached io.EOF, and the actual error otherwise. But what if I want to point out the row and column?

Final Version

In this final version, I wanted to record the row and column. This is better for debugging, so users know where the problem lies if the file contains any non UTF-8 characters. To find the row and column of a failing character, the file needs to be processed rune by rune, so we will use the approach of the second draft with some tweaks.

package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"unicode"
)

func main() {
	// get filename from the 2nd arg (the first, os.Args[0], is the program's own path)
	filename := os.Args[1]

	// get the stream open, and exit if err exists
	file, err := os.Open(filename)

	if err != nil {
		fmt.Println("Error opening file", filename, err)
		os.Exit(1)
	}

	defer file.Close()

	reader := bufio.NewReader(file)

	// start at 1, because humans use 1-based indexing
	rows := 1
	cols := 1

	for {
		r, size, err := reader.ReadRune()

		if err == io.EOF {
			break
		}

		if err != nil {
			fmt.Println("Error reading file:", err)
			os.Exit(1)
		}

		// size == 1 marks a decoding failure; a real U+FFFD in the file is 3 bytes
		if r == unicode.ReplacementChar && size == 1 {
			fmt.Println("Non UTF-8 char found at row:", rows, "and column:", cols)
			return
		}

		if r == '\n' {
			rows++
			cols = 0 // 0, because will be incremented at the end of the loop :)
		}

		cols++
	}

	fmt.Println("File is UTF-8 valid :)")
}

We just added some logic to keep track of which rune we are checking, using rows and cols :) To make it more solid, let’s add some tests to it.

Final Version 2

To make testing easier, we need to separate the core logic from main() into its own function - I named it checkFileIsValid(). This makes the program easier to test, as the tests don’t need to mock or exercise the command-line handling in main() (os.Args[1] and the os.Open() call).

package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"os"
	"unicode"
)

func checkFileIsValid(file *os.File) (isValid bool, message string, err error) {
	reader := bufio.NewReader(file)

	// start at 1, because humans use 1-based indexing
	rows := 1
	cols := 1

	for {
		r, size, err := reader.ReadRune()

		if err == io.EOF {
			break
		}

		if err != nil {
			errMessage := fmt.Sprint("Error reading file: ", err)
			return false, "", errors.New(errMessage)
		}

		// size == 1 marks a decoding failure; a real U+FFFD in the file is 3 bytes
		if r == unicode.ReplacementChar && size == 1 {
			errMessage := fmt.Sprintln("Non UTF-8 char found at row:", rows, "and column:", cols)
			return false, errMessage, nil
		}

		if r == '\n' {
			rows++
			cols = 0 // 0, because will be incremented at the end of the loop :)
		}

		cols++
	}

	return true, "File is UTF-8 valid :)", nil
}

func main() {
	// get filename from the 2nd arg (the first, os.Args[0], is the program's own path)
	filename := os.Args[1]

	// get the stream open, and exit if err exists
	file, err := os.Open(filename)

	if err != nil {
		fmt.Println("Error opening file", filename, err)
		os.Exit(1)
	}

	defer file.Close()

	_, message, err := checkFileIsValid(file)

	if err != nil {
		// something went wrong
		fmt.Println(err)
		os.Exit(1)
	}

	// print out final message
	fmt.Println(message)
}

Take note: our new function checkFileIsValid returns (isValid bool, message string, err error), but isValid is not used in this program. This is intentional. I could just remove isValid, but I think this function can be reused in the future, and returning isValid will be useful there. For example, we could check the validity of a file and, only if it’s valid, save it to our cloud storage. Good practice on writing Go functions!

So - for the test, Go has its own testing package. We can create a file that ends in _test.go, and test the function by adding a function named TestXxx, where Xxx is conventionally the name of the function that we want to test. Let’s write our first test.

// main_test.go
package main

import (
	"os"
	"testing"
)

func TestCheckFileIsValid(t *testing.T) {

	// get the stream open, and stop the test if the file cannot be opened
	file, err := os.Open("valid.txt")

	if err != nil {
		t.Fatalf("Error opening file. %s", err)
	}

	defer file.Close()

	isValid, message, err := checkFileIsValid(file)

	if err != nil {
		t.Errorf("%s", err)
	}

	// fail if the file is reported as invalid or the message is not the success message
	if !isValid || message != successMessage {
		t.Errorf("Expected isValid: true, received: %v\nExpected message: %s, received: %s", isValid, successMessage, message)
	}
}

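This works! You can run it with go test; any failing test is printed. To see a failure, try changing !isValid to isValid. Note that successMessage is a constant holding the expected success string; its definition appears in the next snippet.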

Now, we need to add other tests to fully test the function. Instead of copying and pasting the same test over and over, we can use a pattern called table-driven tests. We make a struct for the test case, and a map that holds the differently named test cases.

const successMessage = "File is UTF-8 valid :)"
// these have \n, because the function builds the message with fmt.Sprintln()
const invalidFirstRowMessage = "Non UTF-8 char found at row: 1 and column: 13\n"
const invalidMessage = "Non UTF-8 char found at row: 102 and column: 13\n"

type testCase struct {
	fileName        string
	isValid         bool
	expectedMessage string
	expectedErr     error
}

var tests = map[string]testCase{
	"valid file": {
		fileName:        "valid.txt",
		isValid:         true,
		expectedMessage: successMessage,
		expectedErr:     nil,
	},
	"invalid first row": {
		fileName:        "invalid-first-row.txt",
		isValid:         false,
		expectedMessage: invalidFirstRowMessage,
		expectedErr:     nil,
	},
	"invalid in > 1st row": {
		fileName:        "invalid.txt",
		isValid:         false,
		expectedMessage: invalidMessage,
		expectedErr:     nil,
	},
	"invalid filename": {
		fileName:        "wrong-filename.txt",
		isValid:         false,
		expectedMessage: "",
		expectedErr:     errors.New("test error"),
	},
}

In tests["invalid filename"].expectedErr, I put in errors.New("test error"), which means the test file also needs "errors" in its imports. This error can be anything as long as it’s not nil, because the test only checks whether err is nil or not.

To test with the map, we can just loop through it like this:

func TestCheckFileIsValid(t *testing.T) {
	// loop through the tests
	for name, test := range tests {
		t.Run(name, func(t *testing.T) {
			file, err := os.Open(test.fileName)

			if err != nil && test.expectedErr == nil {
				// fail if it's not meant to error out
				t.Errorf("Unexpected error opening file: %v", err)
			}

			if err == nil && test.expectedErr != nil {
				t.Errorf("Expected an error opening %s, got nil", test.fileName)
			}

			defer file.Close()

			isValid, message, err := checkFileIsValid(file)

			if err != nil {
				// fail if it's not meant to error out
				if test.expectedErr == nil {
					t.Errorf("Unexpected error. %s", err)
				}
			}

			if isValid != test.isValid {
				t.Errorf("Expected %v, got %v", test.isValid, isValid)
			}

			if message != test.expectedMessage {
				t.Errorf("Expected %s, got %s", test.expectedMessage, message)
			}

		})
	}
}

And that’s it! This is the final final version, I promise!

Future Improvements

I think a good improvement would be an extra option to record all instances of non UTF-8 characters through the use of a flag (e.g. --all). I also realised I could refactor the test - maybe remove the checks on err and expectedErr? They do not really test the function itself; they also test what happens when opening an invalid filename.

Closing Thoughts

Nothing much to say other than I had fun creating this! It works, and now I can officially add Golang to my CV 😛 I’m kidding, but it has been really fun! Go’s errors-as-values approach forces me to handle errors at every step, and also pushes me to return errors from the functions I write wherever something can fail.

Table-driven testing was interesting, and it felt intuitive, too. It reminds me of data providers in PHPUnit: both run different data through the same test structure to verify different results. I know you can run these tests in parallel too!

Working with a statically typed language is a nice refresher after mainly working with JS and PHP. And the walrus operator - I still can’t get over it! :=)