My CLI shell concept

Posted: Thu Aug 14, 2008 3:04 pm
by tetsujin
Hiya, I joined up here not because I'm interested in developing an OS per se - rather because it seemed like a good place to find a community of technically-minded people who might be able to help me work out problems in a design I'm working on: a new CLI command shell for Linux (and possibly other OSes as well... I have given some thought to making it Windows-friendly, but my current use of colon as a syntax character, and possibly backslash as well, complicates things...)

Basically, if you're familiar with Windows Powershell, certain aspects of the design are along similar lines. I want to make a shell that has a concept of "data structures" that can be passed between programs (as opposed to just binary or text streams) and which incorporates certain ideas of object-oriented programming - at the same time I want it to have some of the flavor of more traditional Unix environments.

So probably the biggest conceptual change between, for instance, Bash and my shell design is that my shell would explicitly support a set of data types and structures which could be passed over the streams between processes. Most data would be passed by value, though I definitely want to support passing objects by reference in the future. The usual interchange format would be a new binary stream format. (A binary representation of XML might work here, but at present I'm planning to make my own format. I may change my decision at some point - I'm sure people would adopt an XML solution more readily than something I cook up myself.)
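
To make the stream format idea concrete, here's a rough Python sketch of what a tagged binary value encoding could look like. Everything here - the tag numbers, the layout, the set of types - is invented for illustration; it's not the actual format, which I haven't designed yet:

# Each value on the stream is [1-byte type tag][4-byte length][payload].
import struct

TAG_INT, TAG_UTF8, TAG_LIST = 0x01, 0x02, 0x03

def encode_value(value):
    if isinstance(value, int):
        payload = struct.pack("<q", value)   # 64-bit little-endian
        tag = TAG_INT
    elif isinstance(value, str):
        payload = value.encode("utf-8")
        tag = TAG_UTF8
    elif isinstance(value, list):
        payload = b"".join(encode_value(v) for v in value)
        tag = TAG_LIST
    else:
        raise TypeError("no encoding for %s" % type(value).__name__)
    return struct.pack("<BI", tag, len(payload)) + payload

def decode_value(buf, offset=0):
    """Returns (value, bytes consumed)."""
    tag, length = struct.unpack_from("<BI", buf, offset)
    payload = buf[offset + 5:offset + 5 + length]
    if tag == TAG_INT:
        value = struct.unpack("<q", payload)[0]
    elif tag == TAG_UTF8:
        value = payload.decode("utf-8")
    elif tag == TAG_LIST:
        value, pos = [], 0
        while pos < length:
            item, consumed = decode_value(payload, pos)
            value.append(item)
            pos += consumed
    else:
        raise ValueError("unknown tag %#x" % tag)
    return value, 5 + length

So decode_value(encode_value([1, "two", [3]]))[0] round-trips to [1, "two", [3]], and the receiving process always knows the type of what it's reading without having to sniff the bytes.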

When a command is run it generally returns one or more values as a result (via the process's "standard output" file descriptor) - in terms of the structure of the result there's no difference between a command that returns one value and a command that returns more than one. (This is meant to help the definition of pipeline commands be more consistent - they can basically do a "for each" over their input even if the program from which they get their input doesn't allow for the possibility that there will be more than one value output...)
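
To illustrate the "for each" convention in runnable terms, here's a Python sketch - the names are stand-ins, not part of the shell:

# Every command yields a sequence, so a pipeline stage never needs to
# care whether the upstream command produced one value or many.
def emit_one():
    yield 42                       # a command that "returns one value"

def emit_many():
    yield from (2, 3, 5, 7)        # a command that returns several

def double(values):                # a pipeline stage: for-each over input
    for v in values:
        yield v * 2

print(list(double(emit_one())))    # [84]
print(list(double(emit_many())))   # [4, 6, 10, 14]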

Values will have datatypes associated with them - the shell design will break with Unix tradition in that numerical values will not be text. The baseline data types will include numeric types, text in various encodings, symbols, file references, and structural stuff like lists and data records.

If the output format of one process doesn't match the input format of the process to which its output is being piped, the data will be converted (if possible) to the format expected by the receiver. If the conversion is lossy, incomplete, or simply impossible (some SHIFT-JIS characters, I believe, can't be represented in Unicode, for instance - and of course a bigint may not fit in a 32-bit int, and conversion of an image to JPEG would be lossy) some sort of error or warning will be raised.
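
Here's a hypothetical Python sketch of the conversion step - a registry of converters keyed by (source, target) type, where a converter can flag its result as lossy so the shell knows to warn. The type names and the converters themselves are made up:

import warnings

converters = {}

def converts(src, dst):
    def register(fn):
        converters[(src, dst)] = fn
        return fn
    return register

@converts("bigint", "int32")
def bigint_to_int32(value):
    if not -2**31 <= value < 2**31:
        raise OverflowError("%d does not fit in 32 bits" % value)
    return value, False            # (result, lossy?)

@converts("image/gif", "image/jpeg")
def gif_to_jpeg(data):
    return data, True              # stand-in: JPEG re-encoding is lossy

def coerce(value, src, dst):
    if src == dst:
        return value
    try:
        fn = converters[(src, dst)]
    except KeyError:
        raise TypeError("no conversion from %s to %s" % (src, dst))
    result, lossy = fn(value)
    if lossy:
        warnings.warn("conversion %s -> %s is lossy" % (src, dst))
    return result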

In addition to regular executable files, non-executable data files may be used as command names if there is an "object layer interface" on the PATH for that file's data type. So, for instance, one could run "./picture.jpeg --size" as a command and get the pixel size of the image back as a pair of integers. The file in question needn't have the executable bit set for this to work; it just needs to be recognized as a data type which has an object layer. To help protect users from deceptive files (for instance, malicious binaries named as if they were data files, but with the executable bit set and program code inside), data files will be made visually distinct from true executables as the filename is typed in.
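
In Python terms, the dispatch might look something like the sketch below - the handler directory and the calling convention are pure invention on my part, just to show the shape of the idea:

import mimetypes, subprocess

def run_object_layer(path, *args):
    # Work out the file's data type (a real shell would sniff content too)
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise LookupError("no known type for %s" % path)
    # e.g. image/jpeg -> a handler named "image.jpeg" on a handler path
    handler = "/usr/lib/shell-object-layers/" + mime.replace("/", ".")
    # The handler gets the data file as its first argument, then the
    # user's arguments (such as --size), and writes its result to stdout.
    return subprocess.run([handler, path, *args], capture_output=True)

# run_object_layer("./picture.jpeg", "--size") would invoke the image.jpeg
# handler, which could return the pixel dimensions as a pair of integers.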

Of course, I'm not under the delusion that people will suddenly abandon their own data formats and work in whatever format I choose. It's just not practical or realistic. So my plan includes the following provisions:
  1. There will be a certain set of "recognized file types" which can go over streams between shell processes without having to be tagged in any way - so if a process writes out XML, the receiving process will recognize that and have the option of dealing with the data as XML or attempting to convert it to something else.
  2. Commands can communicate their input or output format statically in cases where it's possible - basically an executable file could be tagged with metadata which would say "this program writes output in the form of a binary ISO9660 file system" or something like that - this data would then be communicated to the receiving process via a different channel, if necessary.
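
As one conceivable mechanism for point 2 - strictly an assumption, not a decided design - the static metadata could live in filesystem extended attributes, e.g. in Python (the attribute name is invented):

import os

def declared_output_type(executable_path):
    # Linux-only sketch: read a declared output format from an xattr.
    try:
        raw = os.getxattr(executable_path, "user.shell.output-type")
        return raw.decode("utf-8")   # e.g. "application/x-iso9660-image"
    except OSError:
        return None                  # untagged: fall back to sniffing the stream
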
Examples:
# The comma is a "sequencing" operator - it has lower precedence than command invocation. It yields all of the values of the left-hand argument followed by all the values of the right-hand argument.
> (primes >> head 3), "something"
(2, 3, 5, "something")
> ((1, 2, 3), (4, 5, 6))
(1, 2, 3, 4, 5, 6)

# The ampersand operator works as it does in bash - two commands can be run concurrently or one can be run without the shell waiting for it to finish... If two commands are run concurrently and both generate output, the output of the two commands is interleaved - values appear in the interleaved result according to how soon they were made available.
> (primes >> head 5) & (dictionary >> head 3)
(2, 3, "A", 5, "A's", "AOL", 7, 11)

# List data structure: when a sequence is encased in a list it becomes a single object...
> ([1, 2, 3], [4, 5, 6])
([1, 2, 3], [4, 5, 6])
> [0, 1, 2, 3] --at 2
2
> [0, 1, 2, 3] --all
(0, 1, 2, 3)

# Environment variables can hold just about anything - if a variable containing a data structure is exported, the subprocess will get its encoded form, unless it's something like PATH - which some processes are going to expect in the traditional colon-delimited form...
> $a = ./image.png # $a contains a copy of the binary data in the PNG file, plus explicit type information identifying it as a PNG file...

# Equivalent of the traditional "find" command... it works the same but returns a sequence of filenames rather than a newline-delimited text stream...
> find -iname "*.png"
("foo.png", "bar.png", "contains spaces in the filename.png")
# In bash this command would be "find -iname '*.png' -print0 | xargs -0 rm", if you want it to handle filenames with spaces correctly...
> rm (find -iname "*.png")

# Scale a GIF file down and convert it to PNG:
> ./image.png = ./image.gif --scale --maintain-aspect (128, 128) : image/png
# (128, 128) is a sequence of values. Colon is a low-precedence type coercion operator. "image/png" is a data type constructor found on the PATH. The assignment operator can create files if the left-hand side is a filename.

# One potentially awkward decision I made in the design is to make different kinds of object references distinguishable by the parser, and have them be uniform in all contexts...
# An undecorated name is a reference to object on the PATH:
> cmd
# A name with a leading slash, dot-slash, or dot-dot-slash, tilde-slash, etc. is an absolute or relative file name... This will make some people uncomfortable since you can't refer to a file in the current directory without using the leading dot-slash...
> ./file.jpeg
# As a (hopefully) handy shortcut, names leading with a dot refer to files stored in a particular subdirectory of the user's home directory - a convenient way to refer to data stored in a persistent location...
> .data #Equivalent to ~/.shell-persistent-data/.data, or something like that...
# When I say these forms work the same "in all contexts", I mean it... Since there's really no telling where a command's arguments end unless you enclose it in parentheses, however, the rule is that arguments belong to the first command to which they could possibly belong... For instance:
> rm find #removes all files in the current directory
> rm find -f #equivalent to rm (find) -f - the -f is an argument to "rm", not "find"
> rm (find /some/path) #This is what you'd have to do to provide arguments to "find"...

# If somebody wants to pass in a name that isn't some kind of file reference, there are three ways to do it.
> "some text" # Make it a string
> 'sometext # Or there's this string syntax, convenient for strings with no white space
> --symbol # Or it can be a symbol that's recognized as an argument by the program you're calling...

# I use curly braces to denote program blocks - for cases where somebody wants to do something that doesn't quite fit in my shell syntax they can write a code block in another syntax...
> {#!regexp /foo|bar/} #Equivalent to (regexp "/foo|bar/") - creates a regular expression object - except that it's the regexp utility that determines how much of the following text belongs to it... Obviously this sort of thing will only work for languages that have a syntax where you can tell, when encountering a closing curly brace, whether that curly brace is part of the program or not... So I think it'd mostly be useful for small tools that augment the shell syntax, though there's a chance it'll be adequate for some full-on programming languages as well...
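
Here's a simplified Python sketch of the scanning problem, under the assumption (which the embedded tool would really have to vouch for) that the embedded language keeps its braces balanced, so the shell can find the closing brace itself:

def read_embedded_block(text, start):
    """text[start] must be '{'; returns (tool name, body, end index)."""
    assert text[start] == "{" and text[start + 1:start + 3] == "#!"
    name_end = start + 3
    while not text[name_end].isspace():
        name_end += 1
    depth, i = 1, name_end
    while depth:                   # balanced-brace scan
        i += 1
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
    return text[start + 3:name_end], text[name_end + 1:i], i + 1

# read_embedded_block('{#!regexp /foo|bar/}', 0)
# -> ('regexp', '/foo|bar/', 20)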

Assorted other concepts:
  • Utilities like "grep" would be replaced with separate "filter" and "regexp" utilities - the former would filter a sequence of any type according to some callable predicate, and the latter would create a predicate object from a regular expression (see the sketch after this list)... This makes it easier to use regexps in other contexts, and to support alternative regexp syntaxes without every tool having to implement the same set of them...
  • I hope to have a rich set of infix operators... Part of the reason I changed the redirection operators is that I wanted the pipe character as a "where" operator, and I want less-than and greater-than comparisons to be part of the basic syntax...
  • I haven't worked out exactly how "structs" would work, syntactically speaking, but I feel there should be some sort of support for data structures with named fields. One idea I'm considering is to support -> as a name-binding operator - specifically, it'd yield the right-hand value, but with metadata attached identifying the left-hand value as its name. Then the list syntax could be used to encase a sequence of such bindings in an object...
  • Various bits of the plan assume that programs can supply information to the shell (possible calling arguments, input/output datatypes, etc.), statically or otherwise... Of course, programs not written for the shell wouldn't normally supply this, so the plan is that such support could be layered on top of existing binaries through filesystem metadata or wrapper scripts...
  • I'd like for archive file formats like ZIP, etc. to act more or less like directories - of course, making the shell support that and making everything the shell might run support that are two separate problems... I guess to go all-out with it I'd need to use FUSE or something to actually mount it, rather than just making it a shell-level thing...
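
To illustrate the filter/regexp split from the first bullet above, a quick Python sketch - the names are placeholders, not the shell's actual utilities:

import re

def regexp(pattern):
    # Build a predicate object from a regular expression...
    compiled = re.compile(pattern)
    return lambda value: compiled.search(str(value)) is not None

def filter_seq(predicate, values):
    # ...and filter a sequence of any type with any callable predicate.
    for v in values:
        if predicate(v):
            yield v

print(list(filter_seq(regexp("foo|bar"), ["foobar", "baz", "barfly"])))
# ['foobar', 'barfly']
print(list(filter_seq(lambda n: n % 2 == 0, [1, 2, 3, 4])))
# [2, 4]
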
Current design problems:
  • Commands like "cvs" or "chmod" don't follow my usual assumptions about GNU-style command argument format. Running something like "cvs checkout" in my current design would result in PATH searches for both "cvs" and "checkout"... I'm considering different solutions to this - possibly abandoning the idea that the rules are the same "in all contexts" - or possibly providing a way that "cvs" can provide its own name space, so that "checkout" is first tried as a "cvs" command, and as something else only if that fails...
  • I use the colon character for type coercion - this complicates the possible use of the shell on Windows (due to drive letters) and, perhaps more importantly, complicates the use of URLs as well, and X display variables - unless some kind of exception is made to recognize these forms...
  • I have generally wanted to avoid the use of the backslash character as an escape character (because it's used so much on Windows as a directory separator, and because it's visually too similar to the slash) but if I don't use backslash, I'll need to figure out an alternative... I'm not sure if I need backslash only inside quotes, or elsewhere, too...
  • I wanted the shell to be able to do math with infix operators - but I'm not sure that will work out, given the special meanings assigned to "-", "/", and "*" in the shell...
So, what do you think?

Re: My CLI shell concept

Posted: Sat Aug 16, 2008 7:19 am
by JackScott
From the syntax description you've given us, it looks to me a lot like Lisp for some parts, and Python for others. Would it be an advantage to you to pick either one of these two languages and adapt that to shell usage, instead of making a mix of it all? What I'm saying is, would a more pure syntax be better? (I'm not actually sure of the answer to this question, I'm just giving ideas).

As far as actually using variables and types in a shell goes, I love the general idea. I was very eager to try out Windows Powershell when I got Vista, and I think it adds a huge amount of expression and power to the command line, especially for scripts. Any attempt to make this work in a POSIX-style environment can only be lauded. I look forward to seeing this project progress.

Re: My CLI shell concept

Posted: Sat Aug 16, 2008 3:56 pm
by tetsujin
JackScott wrote:From the syntax description you've given us, it looks to me a lot like Lisp for some parts, and Python for others. Would it be an advantage to you to pick either one of these two languages and adapt that to shell usage, instead of making a mix of it all? What I'm saying is, would a more pure syntax be better? (I'm not actually sure of the answer to this question, I'm just giving ideas).
I have considered that - and I know there are people working on projects along those directions (Ruby Shell, Perl Shell, Python Shell, etc.)... But I felt there were some problems with integrating that syntax into a shell environment...

One problem is that, conceptually speaking, a shell is a programming language whose key characteristics are that the syntax is designed with an emphasis on interaction rather than on architected programming - and that the filesystem is considered part of the language's environment. That is, objects on the PATH are the language's commands, and files on the filesystem are what the language mostly operates on... Most languages aren't particularly well suited to this. Syntax characters (the dot and slash being prime examples) clash with characters generally used in file names and paths... And other syntax characters commonly found in programming languages aren't well suited for shell use either - <, >, <<, >>, |, &, /, *, and so on are generally already claimed as infix operators in those languages, while a shell needs at least some of them available for redirection, pipes, and globbing if it's to have even a moderately traditional feel.

Additionally, these languages, as far as I know, don't directly support piping data as a programming metaphor. Most programming languages pass the result of one function to another using functional syntax: f(g(x)), where a shell programmer would say g(x) | f... Not that it couldn't be added (in a language as flexible as Python you certainly could), but it's not an assumption that's built into the language. I don't have any research to back up the merits of pipelining over functional syntax, but I think one advantage is that it's a "flatter" syntax - there's no need for the nesting you'd have with functional syntax - and you can start by writing the first step of the process rather than the last, without having to jump back and insert things at the beginning of the command line.

Of course, style of programming is a big consideration here, too. I want a shell language with a nicer environment, but keeping the style of the shells I'm used to as much as possible. So even though I'm changing the underlying nature of the shell a bit I'm keeping its style the same as much as I can. That being the case I think it's easier to start with the syntax I want and change it as necessary to incorporate features I want - rather than starting with a language that probably has all the features I want and re-making its entire syntax to suit the style I'm after... Also, rather than embracing one scripting language, I want to create a shell environment that fosters integration between them - I feel like starting with one of the existing major scripting languages could undermine that position.

I am curious about what you mean by a more "pure syntax"...

Re: My CLI shell concept

Posted: Sat Aug 16, 2008 5:01 pm
by JackScott
I simply mean conforming more closely to an existing well-known design (like Lisp).

Re: My CLI shell concept

Posted: Sun Aug 17, 2008 8:35 pm
by tetsujin
JackScott wrote:I simply mean conforming more closely to an existing well-known design (like Lisp).
Ah. Well I had considered Lisp at the early stages of the design - I felt like it wasn't quite what I was after. Certainly not straight LISP because I want infix operators. I like the way in a shell you can write the sequence of commands for a pipeline according to the order in which operations are done, and not have to jump back to the beginning... You can start with "find" and then pipe it through some filter, as in "find | filter" - where in LISP you'd have to replace (find) with (| (find) (filter)) - or in most other languages filter(find)... I think that's important for a shell language - the pipeline presents the order of operations in a fairly natural way, and the user can generally add more steps without having to go back and edit earlier points on the command line.

I feel like the adaptation of shell syntax to an object-oriented environment gives it a bit of a Smalltalk feel - but mostly I just want it to have a fairly Unixy style, and just change the things I need to change to get the features I want... That's hard to do with an existing programming language, because so many of the characters common in programming languages as syntax are used in Unix shells, but for something different...

'Course, there are still unsolved problems with my design - for instance:
find >> create ./file_list; rm ./file_list
One would expect the second command to do one of two things - either remove the file list or remove the files identified in the file list. Both operations should be possible, but presently it seems there's not a clear way in the syntax to distinguish those two operations... In LISP the two cases would be distinguished by parens - parens around the reference to the file list would mean the files in the list would be deleted... That strategy doesn't work so well for me however since I'm using an infix syntax and parens are needed for grouping... It seems like an awkward solution - requiring special syntax for removing a file without examining its contents seems unnatural - requiring special syntax for removing filenames given in a list seems to conflict with the general design of the shell...

Re: My CLI shell concept

Posted: Mon Aug 18, 2008 3:57 am
by JackScott
Why not just a switch (--filelist or something) to be used on rm? Or is that the special syntax you are thinking of?

Re: My CLI shell concept

Posted: Mon Aug 18, 2008 10:10 am
by tetsujin
JackScott wrote:Why not just a switch (--filelist or something) to be used on rm? Or is that the special syntax you are thinking of?
Adding a switch to rm seems like treating a symptom - there's a larger problem, which is that my language can't readily distinguish between a reference to an object and the value yielded by that object. (Some languages, like Haskell, do work this way, of course... I'm not sure that's really suitable for my design, however.)

Basically, consider as well that someone could want to combine the output of two different commands to create the argument to rm (for instance - rm (./file_list, find ./some_directory, ./*.txt)) - adding a switch to rm can't solve that problem because the list of files to be removed comes from a sub-evaluation outside of rm's control... If this case really is symptomatic of a more general problem, then it must be possible to distinguish, in the syntax, between cases of ./file_list as a reference to the file list and ./file_list being evaluated to yield a list of file references...

I have been toying with a reference operator; it goes something like this:

data@location
or simply
@location
which is equivalent to ?@location, meaning "What's at location?" At this point I guess I need to introduce the question mark syntax, which is a "positional unknown" character. The idea is that when a command is given in which the positional unknown is used, the shell attempts to solve for the unknown, if it can...

The at sign describes a relationship between a storage location and the data stored there. However, I still need to come up with a uniform way of applying it to this sort of problem. The options, apparently, are to require ./file_list to be decorated with the "at" operator either when the intent is to remove the list or when the intent is to remove the files in the list.

Option 1
  • When the user enters a command that yields a list of files (such as find or a glob such as ./*.txt), that evaluation structurally yields the location of that data but semantically yields the data at that location.
  • When storing the result of find in a file like file_list one has to make it explicit that the locations are to be stored, and not the values of the files located. (That is, find@? >> create ./file_list)
  • ./file_list is now a list of textual file names... rm ./file_list removes the file list, while rm @./file_list turns the list of filenames into a list of file references and removes them...
  • Removing the files in the file list, and all the files in the current directory whose name ends with ".txt": rm (@./file_list, ./*.txt)
The disparity here is that evaluation is now handled differently for items on the PATH than for items identified by a path name. So presumably rm find removes files yielded by find while rm /usr/bin/find removes the find binary. But then, suppose I built my own find binary and wanted to run it from my home directory? Wouldn't rm ~/my-find remove the binary? That may mean option 1 simply won't work.

Option 2
  • In all contexts, evaluating a file path or glob structurally yields the location of the data but semantically yields the data in those files - while running a utility like find yields a list of textual filenames.
  • So, creating a file list from find is straightforward: find >> create ./file_list.
  • rm takes as its argument a list of strings - textual filenames - there is no concept of a "file reference" and the syntax does not go out of its way to serve rm...
  • Removing a named file requires use of the @ syntax: rm @./file_list or rm @./*.txt
  • Removing the files in the file list, and all the files in the current directory whose name ends with ".txt": rm (./file_list, @./*.txt)
The problem here is that rm now apparently works pretty much backwards relative to what people are used to... And suppose you wanted a second level of indirection? I guess in that case ./file_list would have to be passed to another command to evaluate the files it points to before passing it to rm...

Option 3
Option 3 solves the problem by removing the notion of "default sub-command evaluation". That is, ./file_list yields the list of filenames in the list, while some-command ./file_list simply passes "./file_list" to the command. For one command to be evaluated and its result passed as an argument to another command, a special syntax (perhaps the bash-style $()) must be used. For instance:
  • rm ./file_list - attempts to remove "./file_list"
  • rm $(./file_list) - attempts to remove the files identified in the file list
  • Removing the files in the file list, and all the files in the current directory whose name ends with ".txt": rm ($(./file_list), ./*.txt)
Basically, anything that's evaluated in the context of an argument to a command is not evaluated as a command would be - unless it's encased in $() to put it back in an explicit evaluation context. The evaluation context syntax $() must be distinct from the grouping precedence syntax () because the latter is also used for forming sequences...

This solution at least appears comprehensive, and makes the shell behave a little more like traditional ones... I'm not altogether happy with this solution, however - I feel like the language is more consistent with having command arguments be evaluated in the same way as commands themselves... Still, it may turn out to be the best solution to this problem. It also solves the problem my shell design has with existing commands like cvs... I'll have to give it some thought, I guess.