Ceylon is a language for defining structured data as well as regular procedural code. One of the first things you run into when defining data formats is the need for micro-languages: syntactic validation for character strings that represent literal values of some data type. For example:
- email addresses
- phone numbers
- dates, times, and durations
- regular expressions
- cron expressions
- URLs and URIs
- hexadecimal numbers
For example, we would like to be able to write things like:
Date date = '25/03/2005';
Time time = '12:00 AM PST';
Boolean isEmail = '^\w+@((\w+)\.)+$'.matches(email);
Cron schedule = '0 0 23 ? * MON-FRI';
Color color = 'FF3B66';
Url url = 'http://jboss.org/ceylon';
mail.to:='gavin@hibernate.org';
PhoneNumber ph = '+1 (404) 129 3456';
Duration duration = '1h 30m';
And we want the compiler to be able to perform some kind of syntactic validation on the format of these character strings. Sometimes, this validation might be as simple as a regular expression. But in other cases, more complex syntactic validations are conceivable.
So in Ceylon we've reserved single-quoted character strings for this use case. What we have not yet figured out is how to determine what particular format a single-quoted literal adheres to (what type of literal it represents), and how to validate the literal against that format at compile time. Ceylon doesn't do left-to-right type inference, so we might end up needing to make you specify the type explicitly, for example:
Date date = Date '25/03/2005';
Time time = Time '12:00 AM PST';
Boolean isEmail = Regex '^\w+@((\w+)\.)+$'.matches(email);
Cron schedule = Cron '0 0 23 ? * MON-FRI';
Color color = Color 'FF3B66';
Url url = Url 'http://jboss.org/ceylon';
mail.to:=Email 'gavin@hibernate.org';
PhoneNumber ph = PhoneNumber '+1 (404) 129 3456';
Duration duration = Duration '1h 30m';
I don't think that's ideal, but it's probably the safest thing. As for validation, I can see two possibilities:
- allow the application to supply a plugin validator, a Ceylon object that gets called at compile time, or
- allow a type to specify its literal format using an annotation (which might specify a regex, or perhaps even some more powerful BNF).
These days, I'm leaning towards the second option:
class Color(format(Bnf '(`0`..`9`|`A`..`F`){6}') Quoted quoted) { ... }
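To make the second option a little more concrete, here is a minimal sketch in Python rather than Ceylon, since the feature does not exist yet. The `literal_format` decorator and the `quoted` helper are hypothetical names invented for this illustration, a plain regex stands in for the BNF, and the check runs when the value is built, whereas the proposal is for the compiler to run it.

```python
import re


def literal_format(pattern):
    """Hypothetical stand-in for the proposed format annotation."""
    compiled = re.compile(pattern)

    def attach(cls):
        # Store the declared format on the type itself.
        cls.literal_format = compiled
        return cls

    return attach


def quoted(cls, text):
    """Check text against the format declared on cls, then construct it."""
    if cls.literal_format.fullmatch(text) is None:
        raise ValueError(f"{text!r} is not a valid {cls.__name__} literal")
    return cls(text)


@literal_format(r"[0-9A-F]{6}")
class Color:
    def __init__(self, text):
        # Split the six hex digits into red, green, and blue components.
        self.rgb = tuple(int(text[i:i + 2], 16) for i in (0, 2, 4))


color = quoted(Color, "FF3B66")
```

The point of the sketch is only that the type carries its own format, so whoever validates the literal needs nothing beyond the type and the string.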
One consequence of the support for quoted literals is that we might end up using backticks to quote single-character literals, for example: `A` or `\n`.
The truth is, some more thinking and experimentation is needed in this area.
May I suggest backticks for parsed string literals and single quotes for character literals? It would be more C-like.
I guess I think that parsed string literals (that's a better term than what I've been calling them, btw) are much more common than character literals in business code...
One possibility would be to let you define an implicit type for single-quoted literals in a certain file. So you would write something like:
import ceylon.datetime { implicit Datetime, ... }
Datetime datetime = '1/8/2011 8:00AM';
Only one type could be implicit in any file. Other single-quoted literals would need to explicitly specify a type.
And we could probably even say that by default, Character is implicit.
Just an idea...
I don't know, but this example:
seems particularly difficult for determining the actual type that should be used. I don't know how you could avoid having to specify the type in this case.
I'd hope this would be independent of locale. Which means using ISO date/time formats.
The whole idea of this stuff is that these formats are defined by libraries, not by the language spec.
We're working on it :-)
Ready to be shot down again, but what about:
Date date = Date('25/03/2005');
At least it's more regular. Nothing new and fancy. This could either be a compile error or just a warning..
Date date = Date("25/03/2005");
And I like the annotation approach..
Feels a little too for my taste. You try it with one type, and it works but then the next doesn't.. And you have to scroll to the top to see which one is implicit.. And if you copy/paste someone's code it doesn't compile..
You're probably right. It was just an idea.
Do you think it's more readable that way? Perhaps, you're right. Let's see a slightly more realistic example:
Datetime dt { date = Date('25/03/2005'); time = Time('12:30pm'); timezone = Timezone('GMT+3'); }
vs.
Datetime dt { date = Date '25/03/2005'; time = Time '12:30pm'; timezone = Timezone 'GMT+3'; }
I'm not sure. I think I still find the second version slightly more readable.
God dam!! This 'Your session has timed out, you have been redirected to the start page.' is killing me.
I've tried to write the same message 3 times now and lost it every time. This is even the second time I write this message...
You wrote Seam! You should be able to fix this.. ;)
Now, no more chances..
I think I like the idea of using backtick `25/03/2005` instead of single quote. Backtick in most scripting languages means evaluation, which is sort of a fit here, and at least it won't be easily confused with a literal. In most scripting languages double quotes are meant for a string with variable interpolation, while single quote is meant for a non-evaluated literal, which is sort of the opposite of the Ceylon meaning IMO.
Now back to the syntax. I just think it looks more regular. Nothing new and exotic. Even if you're new to the language you still get it right away, and if you try it yourself and mix up " and ', the compiler will help you out.
And what about
Datetime dt (Date('25/03/2005'), Time('12:30pm'), Timezone('GMT+3'));
vs.
But I agree that if you look past what you're used to, then the second version doesn't look too bad..
Anyways it's still an awesome feature! Can't wait to use it on regexps..
Btw, would it be possible to take multiple quoted literals? Not sure if it's useful..
class Color(format(Bnf '...') Quoted date, (Bnf '...') Quoted date) { ... }
Datetime dt = Datetime('25/03/2005', '12:30pm')
But writing this I think I realize that where the first one looks like you're calling a constructor, the second one looks like you're annotating the string.. But annotations should not start with caps..? Hmm.. Still too new at this..
Anyways, I finally got to finish the message without losing it. Both literally and figuratively..
I could live with that I suppose. The thing was we had a discussion where some folks pointed out that they didn't have backticks on their European keyboards...
I'm not experiencing this.
Well, yeah, and the thing is, that's kinda how I think about it ... I think of it more as a type annotation than as an invocation.
Well, sure, but lets not take the analogy too far. I mean in a more abstract sense, like writing String name.
I do as well, and it's REALLY annoying ;)
(using Google Chrome v13)
Date date = Date('25/03/2005');
I don't know, but with an example like this what's the difference between the construction of an object and the definition of a literal? Only the fact that one uses double quotes and the other single?
I'm still trying to wrap my head around this idea to be honest.
At what time would the actual literal be created? At compile time, storing it somehow in a serialized form? Or would it just be the string which would get parsed when the class gets loaded for the first time, meaning that the compiler just checks once if it can parse the string correctly but then stores the string for later re-parsing?
PS: the session time-out seems not to happen when you've got live preview enabled.
The only difference is that the compiler looks inside the literal to see if it matches the expected format.
The latter, probably.
I suppose what you're getting at is that if we're going to be qualifying these things by the type, we don't really need a special quote character at all, and could just use " like regular strings. That's probably right.
Exactly.
Also, I'm trying to imagine things like for example XML literals:
XMLDocument xml = '
  <books>
    <book title="xxx" stuff="foobar" />
  </books>';
which would be nice of course, but I'm wondering what would happen if the associated parser tries to validate against a remote schema and such (there are several examples in the Java world where those validators are almost impossible to turn off). Maybe that connection is only available at runtime and not on the machine where the compilation is done. (Because it would be logical to use the same parser for literals as for runtime documents; after all, in the end it's all the same.)
Regular expressions or BNF might be more appropriate, although much less powerful, but maybe allowing just any kind of plugin would be too much power? I don't know, I think this part of the language is still pretty weak.
It seems to me as if plugin literals are a restricted form of compile-time macros. Have you considered implementing a lightweight macro system instead, and then these literals would just be implemented within that? You could not only syntax check, but parse at compile-time, then replace the literal site with something like: Date(2011, 05, 25, 11, 00, 00). It would be much nicer than parsing that at runtime each time that code is called, otherwise if you're parsing at runtime I don't see much point in doing plugin literals at all, though I guess the syntax checks are nice.
However, if you don't want to do compile-time parsing, perhaps the regexp/bnf could be an annotation on any callable which the compiler uses to syntax check a string literal argument? So the syntax for using them would look the same as a standard constructor/function call, it would be very regular. This also means that this sort of checking could be implemented later.
Another idea: if you use ' by itself, you can run all the registered literal parsers, dropping them as they fail a match, then you gather the ones that succeeded, and you end up with a union type of some sort on the RHS, then the type checker does the rest. Would this work within the type system?
As I mentioned in one of my previous comments (against another post), I like the way that common literals are present in the REBOL language, unquoted. I think this is really useful for a language targeted at business computing. If you decide not to allow pluggable literals, please consider special casing common ones anyway (urls, dates, times, file paths).
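A rough Python sketch of the "run every registered parser, keep the ones that match" idea, under stated assumptions: the registry `LITERAL_FORMATS`, its three entries, and both function names are made up for illustration, and a set of type names stands in for the union type.

```python
import re

# Hypothetical registry of literal formats, keyed by type name.
LITERAL_FORMATS = {
    "Color": re.compile(r"[0-9A-F]{6}"),
    "Date": re.compile(r"\d{2}/\d{2}/\d{4}"),
    "HexNumber": re.compile(r"[0-9A-F]+"),
}


def candidate_types(text):
    """Return every registered type whose format accepts the literal."""
    return {name for name, fmt in LITERAL_FORMATS.items() if fmt.fullmatch(text)}


def check_assignment(declared_type, text):
    """Play the role of the type checker: the declared type must be a candidate."""
    candidates = candidate_types(text)
    if declared_type not in candidates:
        raise TypeError(
            f"{text!r} is not assignable to {declared_type}; "
            f"it matches {sorted(candidates)}"
        )
    return declared_type
```

Here `candidate_types("FF3B66")` yields both `Color` and `HexNumber`, which is the union the type checker would then narrow. Note that a failed assignment can only report which formats matched, not why the intended one did not.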
Hrm, never imagined using it for that - I'm not a fan of mixing XML snippets into code - but the truth is you could, indeed, imagine using it for that. But if you're going to go there, well, there's no reason why you couldn't imagine calling the compiler for some other totally different programming language.
Well, the compiler would give you an error saying that validation failed. That seems to be what you would want, right?
I think you can pretty quickly come up with examples where regular expressions don't solve the problem. You can't use them to match parens, for example. (The power of regular expressions has been somewhat oversold by the unix/scripting community.) Running a whole XML parser - or a whole compiler, OTOH, might be taking the idea too far.
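To illustrate the point about parens: regular expressions in the formal sense cannot recognise arbitrarily deep nesting (some engines add recursion extensions that go beyond that), whereas a few lines of ordinary code with a counter can. A small Python sketch, with `parens_balanced` being a name chosen just for this example:

```python
def parens_balanced(text):
    """Return True if every '(' in text is closed by a later ')'."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing paren appeared before its opening paren.
                return False
    # Any paren still open at the end makes the text unbalanced.
    return depth == 0
```

This is the kind of check a format annotation limited to regexes could not express, and one reason a more powerful notation such as BNF keeps coming up.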
Kinda.
I suppose that in theory this could be handled as some kind of Processor that works against the Ceylon syntax tree. I don't think you get access to the internal structure of expressions with a Java Processor, but you could easily imagine exposing that.
That's a really good point. Interesting.
Well, I have to do it, and I already do it. I have a lovely typesafe syntax tree for Ceylon. The question is whether I want to let you play with it :-)
OK, so I've now got two people voting for this.
I did wonder about this idea - though I did not get as far as thinking about it as a union type. That makes the idea much more palatable! The only objection I had to this approach was that you would not get to see what error caused your literal to not be a valid Date. It would just refuse to let you assign the literal to Date. That's not great.
A very early version of Ceylon was going to have built-in datetime literals, but in the end I just realized that there were simply too many other interesting .
Yeah, errors are the big issue with this approach. What if you did left-to-right type inference for this case only? ie. let the type checker do its thing first, skipping over the single quoted literals, then go and try to expand them once you know what type they should be.
That's because you are an authenticated user. Anonymous guest sessions have a short timeout, I think it was 10 minutes or something. This is not a mailing list.
Not really! Having the validation, and therefore the compilation, depend on the state of the environment seems a mistake.
If a connection fails during runtime I can take appropriate action, but I don't think I would ever want my compilation to fail because I lost my internet connection. You'd probably go like ;)
While with runtime code I might just decide not to test that particular part of the application until my connection comes back up.
I don't know if you were counting me in this, but I wasn't really voting for it, it was more of a , I'm not sure I really like this idea as it is now
Good point, I don't have any European keyboard at hand, but I would have thought that simply because of the fact that many languages here use the grave accent they would have it on. Certainly it hasn't stopped Europeans from doing Shell or Perl ;)
What I mean is, I've got two people telling me that it should just look like an ordinary string literal passed to a constructor.
So I spent some time thinking about this on the place today. And now I'm thinking of taking a different tack. I think I have really good reasons for being against left-to-right type inference in general, but it seems to fit this feature, and I think we can easily impose some restrictions that make left-to-right type inference workable in this special case.
Here's my idea:
For example, the following declaration of Color:
Color parseColor(String literal) { ... }
literal { format = '[0-9A-F]{6}|red|blue|green|yellow|black|white'; parser = parseColor; }
shared interface Color { ... }
And of Text:
class Text(Color color, String... text) { ... }
Would let us write the following:
Text { color = 'FF3B66'; "Hello World!" }
WDYT?
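For readers who want to play with the shape of this proposal, here is a hedged Python sketch. It is not Ceylon and not how a compiler would do it: the `literal` decorator, the `LITERALS` registry, and `call_with_literals` are hypothetical names, and the declared parameter type is read at run time through `inspect` to pick the format and parser, where the proposal has the compiler do this from the expected type.

```python
import inspect
import re

# Hypothetical registry: type -> (compiled format, parser function).
LITERALS = {}


def literal(format, parser):
    """Stand-in for the proposed literal annotation."""
    def register(cls):
        LITERALS[cls] = (re.compile(format), parser)
        return cls
    return register


def parse_color(text):
    # Color is looked up when this is called, so it may be defined below.
    return Color(text)


@literal(format=r"[0-9A-F]{6}|red|blue|green|yellow|black|white", parser=parse_color)
class Color:
    def __init__(self, value):
        self.value = value


class Text:
    def __init__(self, color: Color, text: str):
        self.color = color
        self.text = text


def call_with_literals(func, **kwargs):
    """Convert string arguments according to each parameter's declared type."""
    params = inspect.signature(func).parameters
    converted = {}
    for name, value in kwargs.items():
        expected = params[name].annotation
        if isinstance(value, str) and expected in LITERALS:
            fmt, parser = LITERALS[expected]
            if fmt.fullmatch(value) is None:
                raise ValueError(f"{value!r} is not a valid {expected.__name__} literal")
            value = parser(value)
        converted[name] = value
    return func(**converted)


greeting = call_with_literals(Text, color="FF3B66", text="Hello World!")
```

The string passed for `color` is validated and parsed only because the parameter is declared as `Color`; the `text` argument is left alone because `str` has no registered literal format.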
On the plane.
Well I agree with bending the inference rules, as it was my suggestion as well. I like the clean and simple annotated implementation as well. :)
I still don't agree with delaying the parsing until runtime, since it's something which I think the compiler could do (compiling regexps!), but I realise that makes things much more complicated to implement. It could perhaps be implemented later as an optimisation. However, if you're going to add some sort of parser-combinator support instead of just supporting regexps, then that already makes the implementation trickier...
That said, I'd be pretty happy with exactly the solution you presented above.
On a tangent, the following code:
Color parseColor(String literal) { ... }
literal { format = '[0-9A-F]{6}|red|blue|green|yellow|black|white'; parser = parseColor; }
shared interface Color { ... }
Makes me wonder if you're allowed to write this:
literal { format = '[0-9A-F]{6}|red|blue|green|yellow|black|white'; Color parser(String literal) { ... } }
shared interface Color { ... }
This syntax is supported in regular named argument lists, why not in an annotation?
I'm inclined to think that this probably isn't allowed, but I can't quite figure out why not :-/
Michael, note that we can start out with something kinda simple-to-implement here, and later beef it up. Anyway, this feature isn't going to be in M1 of the compiler.
Aye. I can't wait until there's a compiler/some code to play with. I feel that out of all the recent languages in the past few years, Ceylon has just the right balance of features and a very clean syntax. It's the language I'm most excited about at the moment.
It makes me wonder, My best answers would be compiled into either
local r = RegexBuilder() .beginning() .string("u") .times(2).begin .characterset("u") .end .string("ah") .build();
The D programming language has a feature called mixins which are somewhere between the C preprocessor and Lisp macros. It's implemented using functions that must be pure (i.e. no side effects, see below) so they can be called by the compiler. The string such a function returns must be a valid declaration (or in this case, an expression) and is subsequently compiled.
You could define a function to be pure iff
Additionally methods annotated literal
Using this definition you could then do
literal "[0-9A-F]{6}|red|blue|green|yellow|black|white"
String color(String literal) { ... }
shared interface Color { ... }
Color color = 'CCEECC'; // gets transmogrified into Color(204, 238, 204);
Then any exceptions thrown by the parse function could be treated as a compiler error.
I am not convinced this is a good way to implement this, but it is something
On a related note, you could allow a prefix which is not required unless there is some ambiguity:
local color = color'CCEECC';
or
import org.michelangelo { local c = color }; // If something like this were legal
local color = c'CCEECC
(missed a quote in that last suggestion):
local color = color'CCEECC';
or
import org.michelangelo { local c = color }; // If something like this were legal
local color = c'CCEECC'
It has been said before but I will repeat: we REALLY need a mailing list to discuss this kind of thing. Even more so because the comment system does not support notifications. I had completely missed this discussion.
Great! Absolutely the best way. Assuming you can also write:
Whether you parse at compile time or at runtime doesn't really matter for the developer, it's just optimization.
1
I also get worried when you say you'll update the introductions. How will we notice..?
We'd like to follow your progress, but your blog is not working too well for us.
Folks, subscribe to the atom feed! You can probably even do this inside your email client. Unless you're using gmail, like me :-/
Yes, that's . :-)
Michael, at some stage I spent several days trying to come up with a way to define a pure annotation. Unfortunately I never wrote down my thoughts, so it would probably take me a day or more to reconstruct my reasoning. Anyway, my conclusion about pure was that it was going to be really difficult (read: impossible) to statically analyse purity without losing the ability to do some things that are totally legitimately pure (e.g. instantiate and use a mutable helper object like a Builder or something inside a method without leaking its reference out of the method). So I ended up dropping the idea.
I followed your link and had a look. I see what you mean. Now, to be honest I've never written much code in any language with a preprocessor or macros (never been a C, C++, or lisp programmer). All I know is that the preprocessor was one of the things intentionally removed from C when they designed Java, and that in all these years of writing Java frameworks I've never felt like it was a missing feature. And I think it's an area where you need to be super-careful, where it's really easy to come up with a feature that is easy to abuse and ultimately harmful in a language designed for use in teams of people with diverse backgrounds and experience-levels. Now, arguing against myself, I certainly found Processors useful for working around the limitations of Java's non-typesafe reflection API when designing the new query API in JPA2. But on the other hand I also never felt completely comfortable with that approach. It was always very clear that I was resorting to extra-linguistic features to work around limitations of the type system. I dunno, perhaps if I'd ever used lisp I would be more comfortable with the whole idea...
After more thinking about it, I like the idea you outlined with the literal annotation. It's very simple, and looks complete. That 3rd bullet is the kicker: it means you don't need to have the withColor(Color 'ffeeff') finger exercise, because the parameter must be of type Color.
For reference this is the 3rd bullet that I'm referring to:
Aye! Down with the C Preprocessor! Long live lisp macros!
When you look at the implementation of Clojure's core module, you'll notice that nearly all of the language's key constructs are implemented as macros. (for ...), for example, and even (let ...) is a macro wrapper around (let* ...) that adds destructuring to binding amongst other things.
This works for lisp because the code is in the same shape as the data: a list of function calls with corresponding arguments. You mutate that list and you mutate what the program does.
Methinks D's mixins, let alone a full-on macro system, are overkill for this simple problem.
As an aside, you'll notice that lispers don't use DI. They typically don't need it.
In general, if you do "unwise" things you're on your own. The point is designing a language so that "unwise" things are clearly visible, but you'll never cover it all, so why bother?