• Stars
    star
    130
  • Rank 277,575 (Top 6 %)
  • Language
    Go
  • License
    GNU General Publi...
  • Created almost 9 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

An implementation of sed in Go. Just because!

Sed-Go

An implementation of sed in Go. Just because!

Status

  • Command-Line processing: Done. It accepts '-e', '-f', '-n' and long versions of the same. It takes '-help'.
  • Lexer: Complete.
  • Parser/Engine: Has every command in a typical sed now. It has: a, i, c, d, D, p, P, g, G, x, h, H, r, w, s, y, b, t, :label, n, N, q, =.

This sed engine can be embedded in your program, wrapping any io.Reader so that the stream is lazily processed as you read from it. Of course I also have a command-line driver program here (in package github.com/rwtodd/Go.Sed/cmd/sed-go).

Differences from Standard Sed

Regexps: The only thing you really have to keep in mind when using go-sed, is I use Go's "regexp" package. Therefore, you have to use that syntax for the regular expressions. The main differences I've noticed in practice are:

Go-sed Traditional RE Notes
s/a(bc*)d/$1/g s/a\(bc*\)d/\1/g Don't escape (); Use $1, $2, etc.
s/(?s).// s/.// If you want dot to match \n, use (?s) flag.

Go's regexps have many rich options, which you can see here.

There are a few niceties though, such as I interpret '\t' and '\n' in replacement strings:

s/\w+\s*/\t$0\n/g

You can also escape the newline like in a typical sed, if you want.

Slightly Friendlier Syntax: Go-sed is a little more user-friendly when it comes to syntax. In a normal sed, you have to use one (and ONLY one) space between a r or w and the filename. Go-sed eats whitespace until it sees the filename.

Also, in a typical sed, you need seemingly-extraneous semicolons like the one after the d below:

sed -e '/re/ { p ; d; }' < in > out

... but go-sed is nicer about it:

sed-go -e '/re/ { p ; d }' < in > out 

Unicode: Original sed had no idea about unicode, but modern ones do, and Go-sed is unicode-friendly:

sed-go -e 'y/go/δΈ–η•Œ/' < in > out

Embed in your Code

I built the program as a library so that sed can be embedded into programs, wrapping any available Reader. The library processes the input lazily as you read bytes from the wrapped Reader.

Obviously, custom string-processing code is faster, but for quick and one-off tasks, this can be just what you need. For example, you want to strip out unix-style comments from a file, but can't be bothered to write the Go code at the moment:

engine, err := sed.New(strings.NewReader(`/^#/d  s/ *#.*//`))
n, err := io.Copy(myOutput, engine.Wrap(myInput))

If your input is a string, and you just want to get a processed string back, there is RunString:

output, err := engine.RunString(inString)

Note that, if you want an engine that emulates sed's -n quiet mode, use NewQuiet instead of New.

Building the sed-go executable

From the root of the repository, you should be able to build the driver program with:

go build ./cmd/sed-go

Sample Import Statement

If you want to embed a sed engine in your own program, you can import:

import "github.com/rwtodd/Go.Sed/sed"

(2019-01-03: I added a go.mod file to the repo so it can be built outside of $GOPATH)

Implementation Notes

I have never looked at how a "real" implementation of sed is done. I'm just going by the sed man pages and tutorials. I will note that in speed comparisons, go-sed outperforms Mac OS X's sed on my iMac, as long as the input isn't tiny. So, I think the architecture here is pretty good.

The library is spread out among several files:

  • lex.go: Lexes the input into tokens. Skips over comments. These are pretty large-grained tokens. For example, when it reads a 's'ubstitution command, it pulls in the arguments and modifiers and packages them into a single token. This makes the parser simpler. The tokens are sent over a channel so that the lexer can run concurrently with the parser. This is more of a design win than a performance win, but it is more than fast enough.
  • parse.go: Takes tokens from the lexer and parses the sed program. It's really a parser+compiler, becuase the output is an array of instructions for the VM to interpret. Because the tokens are designed to be pretty self-contained, this parser doesn't ever need to backtrack. I always like it when I can achieve that.

When branch targets (e.g., :loop) are parsed, a branch instruction to that location is stored off, along with the name. Then, after the initial parse, each branch is fixed up against the proper target.

  • instructions.go: This file holds all of the VM instructions except for substitution and translation, which are in substitution.go. An instruction for the VM I've built is just a func (svm *vm) error. This turned out to be a very flexible arrangement. Most instructions have a 1:1 mapping to sed commands, but others (like the 'n' command) are broken up at parse time into multiple engine instructions.

Simple instructions, like cmd_get (which handles the g sed command), are package functions, which can get inserted into the instruction stream context-free. A few instructions have state, such as the string to insert in an i\ command. For those, a closure holds the state:

  func cmd_newInserter(text string) instruction {
       return func(svm *vm) error {
           vm.ip++
           _,err := svm.output.WriteString(text)
           return err
       }
  }

A couple commands have so much state that a simple closure would be unwieldy, so those get a struct and an associated run method. That run method pointer becomes the instruction.

You can see in the example above that each instruction is responsible for incrementing the IP (instruction pointer) in the VM. That's flexible because many of the instructions branch, and they can set the IP to whatever they need. However, this was the number one cause of bugs during development: I'd add a new command, and forget to increment the IP, leading to an infinite loop on that instruction. So, I possibly should have had the VM auto-increment the IP, and have the branching instructions account for that when setting IP. It was a trade-off between keeping the inner loop as tight as possible and keeping the instructions as simple as possible. I might have made the wrong choice there.

  • engine.go: This is the sed-VM, and this file also has the entire public interface to the library. It is arranged for simplicity. You have one function to create an Engine from a sed program, and you can use that Engine to wrap an io.Reader. The same engine can be re-used against multiple inputs.

The inner loop of the interpreter very compact:

  for err == nil {
     err = svm.ins[svm.ip](svm)
  }
  • conditions.go: Conditions are what I call the guards around commands (like the 1,10 in 1,10d). The sed man pages act like the conditions are part of the command, but in my engine VM, they are commands themselves. In instructions.go, simplecond and twocond are the commands that make use of the conditions. The condition interface defines just one methon, isMet, which can inspect an engine and determine if the condition in question is met or not. The code in instructions.go does the rest.