PHP Internals News: Episode 52: Floats and Locales

PHP Internals News: Episode 52: Floats and Locales

In this episode of "PHP Internals News" I talk with George Banyard (Website, Twitter, GitHub, GitLab) about an RFC that he has proposed together with Máté Kocsis (Twitter, GitHub, LinkedIn) to make PHP's float to string logic no longer use locales.

The RSS feed for this podcast is https://derickrethans.nl/feed-phpinternalsnews.xml, you can download this episode's MP3 file, and it's available on Spotify and iTunes. There is a dedicated website: https://phpinternals.news

Transcript

Derick Rethans 0:16

Hi, I'm Derick. And this is PHP internals news, a weekly podcast dedicated to demystifying the development of the PHP language. This is Episode 52. Today I'm talking with George Banyard about an RFC that he's made together with Mate Kocsis. This RFC is titled locale independent floats to string. Hello, George, would you please introduce yourself?

George Banyard 0:39

Hello, I'm George Peter Banyard. I'm a student at Imperial College and I work on PHP in my free time.

Derick Rethans 0:47

All right, so we're talking about local independent floats. What is the problem here?

George Banyard 0:52

Currently when you do a float to string conversion, so all casting or displaying a float, the conversion will depend on like the current local. So instead of always using like the decimal dot separator. For example, if you have like a German or the French locale enabled, it will use like a comma to separate like the decimals.

Derick Rethans 1:14

Okay, I can understand that that could be a bit confusing. What are these locales exactly?

George Banyard 1:20

So locales, which are more or less C locales, which PHP exposes to user land is a way how to change a bunch of rules on how string and like stuff gets displayed on the C level. One of the issues with it is that like it's global. For example, if you use like a thread safe API, if you use the thread safe PHP version, then set_locale() is not thread safe, so we'll just like impact other threads where you're using it.

Derick Rethans 1:50

So a locale is a set of rules to format specific things with floating point numbers being one of them in which situations does the locale influence the display a floating point numbers in every situation in PHP or only in some?

George Banyard 2:06

Yes, it only impacts like certain aspects, which is quite surprising. So a string cast will affect it the strval() function, vardump(), and debug_zval_dump() will all affect the decimal locator and also printf() with the percentage lowercase F, but that's expected because it's locale aware compared to the capital F modifier.

Derick Rethans 2:32

But it doesn't, for example, have the same problem in the serialised function or say var_export().

George Banyard 2:37

Yeah, and json_encode() also doesn't do that. PDO has special code which handles also this so that like all the PDO drivers get like a constant treat like float string, because that could like impact on the databases.

Derick Rethans 2:53

How is it a problem that with some locales enabled and then uses a comma instead of the decimal point. How can this cause bugs and PHP applications?

George Banyard 3:02

One trivial example is if you do, you take a float, you convert it, you cast it to string, and then you cast it back to float. If you're on a locale, which is the dot decimal separator, you will get back the original float. However, if you have like locale which com... which changes the decimal separator, like the German one, you'll get a string; you'll get like three dash, three comma 14, and then when you convert it back to float, you will only get three because PHP doesn't recognise the comma as a decimal separator in its string to float conversion and so it will loses the decimal information.

Derick Rethans 3:39

That doesn't seem particularly very useful as a feature. So my question here is we talked about floating point numbers and, and I think floating point numbers have other issues as well. Not sure whether we want to go into the details of how floating point numbers and computers work, but we can if you want to.

George Banyard 3:56

The easy way to explain floating points is to use like exponential notation, or to use the scientific exponential notation, which most people will know from engineering or physics, where you usually have like, one significant like the number, like a comma, a couple of numbers, and then you have like an exponent which raises it to usually, so to your power 10 to the something, which then gives you an order of magnitude. Floating points, basically that but in base two.

Derick Rethans 4:26

Positions have magnitudes attached to them. They're all powers of two.

George Banyard 4:30

Yeah.

Derick Rethans 4:31

And of course, when we use numbers an decimal, like pi being a bad example.

George Banyard 4:36

Once said.

Derick Rethans 4:37

I was going to say if you divide 10 by three, you get 3.33333 that never ends, right. And I reckon if you have a specific number in decimal like three point 14, then you can't necessarily always exactly represent it in binary.

George Banyard 4:55

Yeah, one common example would say it's like one 10th which has like a perfect representation in decimal. But like in binary is a never ending repeating sequence. When you try to like display naught point one, like how it's saved in floating point, it's really weird and everything to get like these rounding errors which can propagate.

Derick Rethans 5:15

And hence you often hear people recommend to never use float for things like monetary values, but then as you said that you sentence that right?

George Banyard 5:23

Yeah, put everything in integers and work with integers and just like format it afterwards.

Derick Rethans 5:29

So let's get back to what you and Mate are actually suggesting to change. What are the changes that you want to make through this RFC?

George Banyard 5:36

The change's more or less to always make the conversion from float to string the same, so locale independent, so it always uses the dot decimal separator, with the exception of printf() was like the F modifier, because that one is, as previously said, locale aware, and it's explicitly said so.

Derick Rethans 5:56

Doesn't printf also have other floating related format specifiers? I believe there's an E and a G as well. And uppercase F. What is the difference between these?

George Banyard 6:06

Lowercase F is just floating point printing with locale awareness. Capital F is the same as lowercase, but it's not locale aware. So it always uses the dot decimal separator. Lowercase E is, what I've learned recently also locale aware, and uses the exponential notation, like with a lowercase e. Uppercase E is the same as lowercase E, but instead of having a small like a lowercase e in the printing format, it's a uppercase E, and lowercase G has some complicated rules onto when it decides which format to choose between lowercase F and lowercase E, depending on like how big like the number of significant digits are after the comma, or like the dot. And uppercase G is the same but using uppercase F and uppercase E instead of lowercase E and lowercase F.

Derick Rethans 6:58

And all of them can be locale dependent then except for uppercase F.

George Banyard 7:02

Yeah.

Derick Rethans 7:02

Do you think this is going to impact people's applications, if you change the default of normal casts to be locale independent?

George Banyard 7:10

I would have expected it to not be that significant. And only that would affect displaying floating point. So if you're like in Germany, instead of like seeing a comma, you would now see a dot, which can be annoying, but I wouldn't imagine is the most, the biggest problem for you like end users. But apparently, people made tooling to work around the locale awareness of it. And so they could maybe break with passing stuff, which I suppose that happens because it's been, PHP's 25 years old. And that behaviour has been there for like ever. So people worked around it or work with it.

Derick Rethans 7:49

Is this going to be purely a displaying change or something else as well?

George Banyard 7:54

For example, if you would send like a float to like an API via HTTP, you would usually already need to have like code around to like work around like the locale awareness, or like all by resetting set locale or by using number_format or like sprintf or something like that. Because most other APIs or like you would like contact would expect like the float to use like a decimal point. PHP. If you do the string to float conversion again, which was not a point, then you get only an integer basically.

Derick Rethans 8:27

Because PHP's parser, strips it out once it stops recognising digits, which is in this case, the comma.

George Banyard 8:33

Yeah, that would make the code nicer. The main reason why me and Mate like decided to propose this RFC is because like most APIs, and also databases and everything, expect strings to be formatted in like a standard way. Currently, like if you for whatever reason, use a locale, then it's not, but yeah, like apparently people worked around that when they were maybe stripping stuff from like HTML whatever displayed and try to work around it because that got raised in the list quite recently.

Derick Rethans 9:06

This change does not necessarily remove the ability of using locales for formatting numbers, because PHP still has the lowercase F as format specifier for printf. And sprintf and friends. Does PHP have other ways of rendering numbers according to locales?

George Banyard 9:24

According to locales? I don't think so. You can format it something like manually, or the number format a class from the Intl extension.

Derick Rethans 9:35

Yeah, from what I understand, number_format, you have to do it all by yourself. And the intl extension doesn't support the posix or C locales from the operating system, right. It uses its own locale rule set from the Unicode project. The RFC lists some alternative approaches. Would you mind touching a little bit on these as well?

George Banyard 9:58

One of the alternatives approaches is to deprecate setlocale altogether. Because as a byproduct, this just fixes the issue because you can't define any locale anymore. So, there will always be locale independent. This has been discussed like in back in 2016, mostly because of the non thread safe behaviour. Because it affects global states and everything. But at the time, the conclusion was, because HHVM, like did a patch, making a thread safe, setlocale function was to mimic this patch and like implement it into PHP, which hasn't been done yet. Another one that we thought about was to deprecate kind of the behaviour and like raise a notice, like a deprecation notice, because that would happen like basically on every float to string conversion. The penalty, like the performance penalty, seemed pretty like strong. One other thing we considered was with Mate was to deprecate the current behaviour in some way. However, emitting a deprecation notice on basically every float to string conversion seemed not to be ideal. And just like flood, the log, the log output, and like also bring like a performance penalty because like outputting warnings isn't like most friendly thing to do performance wise.

Derick Rethans 11:21

What has the feedback been so far?

George Banyard 11:24

Feedback currently has been that like most people, well, one person because there hasn't been that much feedback.

Derick Rethans 11:30

There hasn't been that much feedback because you've only just proposed?

George Banyard 11:33

some of the feedback we got officiates the change However, they have concerns about like the modification of like, in every case for locales without having any upgrade paths. In some sense. It's just, oh, you have the change, and then you need to execute it and see what breaks. We may be currently considering like ways to figure that out, maybe by adding a temporary ini setting which would kind of be like a debug mode, where when you use that it would like emit notices when like this conversion would happen before and they would notice: Oh, this is not happening anymore. You need to like be aware of this change in behaviour

Derick Rethans 12:17

Did we not used to have E_STRICT for this at some point or E_DEPRECATED?

George Banyard 12:24

E_DEPRECATED is still a thing. E_STRICT got mostly removed with PHP seven. There've been like a couple of remaining notices which I got rid off or put back to normal E_WARNINGS or E_NOTICES in PHP seven point four. There were like two or three remaining. But yeah, like so that's one way to maybe approach it of like implementing a debug ini setting which would only be used for like dev because then where if you get like warnings and everything, you don't really care about the performance impact. And then in production, you would like disable that and the warnings wouldn't pop up.

Derick Rethans 12:56

How would that setting be any different from just putting it behind an E_DEPRECATED warning?

George Banyard 13:00

So with an E_DEPRECATED warning, we would need to show this behaviour, and we would need, and we could only change the behaviour in like PHP nine. Currently if we do that with like debug setting, we could change it with PHP 8.

Derick Rethans 13:13

That's a bit cheating isn't that?

George Banyard 13:15

Could say so.

Derick Rethans 13:16

I'm interested to see how this ends up going. Do you have any timeframe of when you want to put it for a vote?

George Banyard 13:23

Currently, we've only started this discussion. And I think until we figure it out, if we get like an upgrade pass, or multiple upgrade passes that we could then put into a secondary vote. I wouldn't expect it to go to voting that soon. Maybe end of April would be nice.

Derick Rethans 13:41

So around the time when this podcast comes out?

George Banyard 13:44

Ah! For once!

Derick Rethans 13:46

For once I got my timing right.

George Banyard 13:49

Yes. Don't you have like the string contain one which just got out.

Derick Rethans 13:53

Yes.

George Banyard 13:54

Then that vote close like last week.

Derick Rethans 13:57

Yeah, it's really tricky because there's so many, so many small now that I can't keep up.

George Banyard 14:02

Yeah, Mark also did like his debug.

Derick Rethans 14:04

Yeah. And there's like two or three tiny ones more that I would quite like to talk about. But by the time there's an opening in the schedule, it's pretty much irrelevant. So I'm trying to see whether I can wrap a few of the smaller ones just in one episode because there's the throw expression, the is literal check, and typecasting in array destructuring expressions, and all showed up in the last three days.

George Banyard 14:26

I suppose people have like, lots of time now. Now, it's a taint checker, basically, like I know, there's been like this paper by Facebook like six or eight years ago, which talks about how they kind of tried to implement in their static analyzer, but like, a static analyzer doesn't need to be something in the engine. That's what I don't really get.

Derick Rethans 14:45

Thank you, George, for taking the time this afternoon to talk to me about a locale independent float to string RFC.

George Banyard 14:53

Thanks for having me on the podcast again. Derick.

Derick Rethans 14:55

You're most welcome. Thanks for listening to this installment of PHP internals news, the weekly podcast dedicated to demystifying the development of the PHP language. I maintain a Patreon account for supporters of this podcast, as well as the Xdebug debugging tool. You can sign up for Patreon at https://drck.me/patreon. If you have comments or suggestions, feel free to email them to derick@phpinternals.news. Thank you for listening, and I'll see you next week.


PHP Internals News: Episode 51: Object Ergonomics

PHP Internals News: Episode 51: Object Ergonomics

In this episode of "PHP Internals News" I talk with Larry Garfield (Twitter, Website, GitHub) about a blog post that he was written related to PHP's Object Ergonomics.

The RSS feed for this podcast is https://derickrethans.nl/feed-phpinternalsnews.xml, you can download this episode's MP3 file, and it's available on Spotify and iTunes. There is a dedicated website: https://phpinternals.news

Transcript

Derick Rethans 0:16

Hi, I'm Derick. And this is PHP internals news, a weekly podcast dedicated to demystifying the development of the PHP language. This is Episode 51. Today I'm talking with Larry Garfield, not about an RFC for once, but about a blog post that he's written called Object Ergonomics. Larry, would you please introduce yourself?

Larry Garfield 0:38

Hello World. My name is Larry Garfield, also Crell, CRELL, on various social medias. I work at platform.sh in developer relations. We're a continuous deployment cloud hosting company. I've been writing PHP for 21 years and been a active gadfly and nudge for at least 15 of those.

Derick Rethans 1:01

In the last couple of months, we have seen quite a lot of smaller RFCs about all kinds of little features here and there, to do with making the object oriented model of PHP a little bit better. I reckon this is also the nudge behind you writing a slightly longer blog post titled "Improving PHP object ergonomics".

Larry Garfield 1:26

If by slightly longer you mean 14 pages? Yes.

Derick Rethans 1:29

Yes, exactly. Yeah, it took me a while to read through. What made you write this document?

Larry Garfield 1:34

As you said, there's been a lot of discussion around improving PHP's general user experience of working with objects in PHP. Where there's definitely room for improvement, no question. And I found a lot of these to be useful in their own right, but also very narrow and narrow in ways that solve the immediate problem but could get in the way of solving larger problems later on down the line. So I went into this with an attitude of: Okay, we can kind of piecemeal and attack certain parts of the problem space. Or we can take a step back and look at the big picture and say: Alright, here's all the pain points we have. What can we do that would solve not just this one pain point. But let us solve multiple pain points with a single change? Or these two changes together solve this other pain point as well. Or, you know, how can we do this in a way that is not going to interfere with later development that we've talked about. We know we want to do, but isn't been done yet. So how do we not paint ourselves into a corner by thinking too narrow?

Derick Rethans 2:41

It's a curious thing, because a more narrow RFC is likely easier to get accepted, because it doesn't pull in a whole set of other problems as well. But of course, as you say, if the whole idea hasn't been thought through, then some of these things might not actually end up being beneficial. Because it can be combined with some other things to directly address the problems that we're trying to solve, right?

Larry Garfield 3:07

Yeah, it comes down to what are the smallest changes we can make that taken together have the largest impact. That kind of broad picture thinking is something that is hard to do in PHP, just given the way it's structured. So I took a stab at that.

Derick Rethans 3:21

What are the main problems that we should address?

Larry Garfield 3:24

So the ones that identify that people have been talking about are the following. One is constructors are just way too verbose. If you've looked at almost any PHP class, in almost any framework, the most common pattern is: you start with a class, you declare three to five properties that are private or protected. Then you have a constructor that takes three to five parameters and assigns each of those to those properties. Usually the names match all the way through, types match all the way through. It's all it's doing is shoving those parameters into properties. Right now, you have to repeat each property name four times total. It's just way too verbose. It's just more typing than we should be doing. And so there have been various proposals for ways to have to type less to do that.

Derick Rethans 4:11

We'll get to the solutions in a moment, I'm sure.

Larry Garfield 4:14

The next one is what I've called the bean problem. So I've referenced to Java beans. For those who have not worked with Java before. And I haven't worked with it in a long time. But when I last did, this was standard, you'd have what's called a Java bean, which is just a Java class that has a bunch of properties that are private, and then a getter and a setter for every single one of those properties. PHP, you see the same pattern a lot, especially in ORMs. Largely that comes down to this makes serialisation and deserialization straightforward because you can access properties through a method, you know, the names, automatic naming and so on. But that's again, an awful lot of typing to bypass the private and protected keyword. So how can we reduce the mental overhead of that and just have access to what we need to with less work. That relates to a lot of the reasons for that is immutable objects. So it's been increasingly popular in PHP in recent years to have objects that even though the language doesn't support immutability are effectively immutable, in that the object doesn't give you a way to change its properties. But it gives you a way to create a new object that is the same, but with certain changes. Think DateTimeImmutable in PHP core, or it has a modify() method, which doesn't change the objects in place. You see, if you call a DateTimeImmutable object, call it with the modify() method with a parameter of plus one week you get back a new DateTimeImmutable object, that is the timestamp one week later. That pattern is increasingly common. PSR-7, the HTTP messages spec uses that a lot of other packages have started doing it. The way that usually ends up working is these wither methods. It's with some value, with some some property name and so on, similar to a setter, but it returns a new object and there's a common pattern for that now. Another problem is materialised values, where you have something that conceptually is a property. And to a outside caller, it really should just be a property. But you want to not have it be a full property itself. The example I use the kind of the canonical example is you have a first name property and a last name property and you want to format a full name property. There's a lot of cases like that. Right now, you do that as a method, and you have some kind of static cache internally. Which works. It's just: Can we make that better? And can we not make it worse with any of these other changes? A lot of this comes down to how do we make not make any of these problems worse. Another problem is, for lack of better term, and what I call the documented property problem, where if you have a large constructor, then you're going to pass in a bunch of different values because they all map to properties, but you need to keep track of: Okay, which one of these is which? And especially comes up for value options, rather than service objects. Were introduced in C, or Rust or Go would just be a bare struct, essentially, which PHP doesn't have. And we can get to why I think that's okay, we don't have. But objects where you really just have a combination of properties, and that's okay. But you still need to keep track of them, you want to be able to create an object that has only some of them. And if you have eight optional properties, and you want to just set the last one, right, now you have a bunch of nulls or question marks, or empty quotes, or zeros, or whatever default value, and again, it's just very cumbersome. And so the kind of the question I was looking at is, how can we make all of these better and not make any of them worse? That's kind of the problem space. I think most people can relate to, at least most of these.

Derick Rethans 7:46

I would think so to certainly in some of my code, where that's been the case. Hopefully, that was all the problems you found.

Larry Garfield 7:53

I think I got all of them.

Derick Rethans 7:55

As I alluded to, in the introduction, there have been quite a few smaller RFCs already to address some of the problems that you just mentioned. Which you list and as well as others in things that you have found that multiple people currently already do. Should we have a quick look at what these things are?

Larry Garfield 8:15

One of the proposals that I looked at was writeonce properties, as we are recording this, there's an RFC for that that's in voting. Although it looks like it's probably not going to pass that the vote stays where it is. Now, the idea there is allow typed properties to have a read only marker on them just like the type or public or private, and then they can only be written to once if they're uninitialised you can write to them, after that they're just stuck that way. The advantage is that would make them safe to expose publicly. And so you can have a property that you can expose to the world just access a property but not be concerned about someone changing it out from under you. The downside of that mainly comes down to that evolvable immutable object where that with method then becomes a lot harder, because you can't say: clone this object and change this one property because well, you can't change this one property, you'd have to fully construct a new object. There's also two different proposals that have been floated recently for compact object property assignments. I think they have different names for the same basic idea. Basically, if an object has public properties, being able to write to those in one shot in a code block, along with the constructor in a named fashion. It's essentially there's a common pattern now where you pass an associative array to a function which has a bunch of named properties, and then you can put them in whatever order you want. And then you know, dissect those and map those to properties internally. It's essentially taking that idea and baking it into the syntax, which does help and gives you when you have a lot of properties that are optional. It makes it a lot easier to you have a lot of properties defined or a lot of parameters defined it makes it a lot easier to piecemeal select them. The downside is all of those proposals to date only work on public properties, which have a long list of challenges with them. It also means you're bypassing any kind of validation around this property is only valid if this property is set, or this property has to be less than this property, and so on. Those are too limiting, but definitely they're trying to solve a real pain point.

Derick Rethans 10:19

Nor can you enforce types through that, of course.

Larry Garfield 10:21

Some of them I think, might be able to

Derick Rethans 10:23

I meant associative arrays.

Larry Garfield 10:25

Yeah, the associative array approach you can do now, which is really the only possible thing I can say in its favour is that it works today. Type enforcement isn't there, it's poor for documentation. Please don't do that. All these are dancing around names parameters, which is a different language feature that's been discussed on and off for many, many years. I don't know of any current RFCs on the table for this one, but it's come up many times. Number of languages have this Python has it for example, where give or take whatever syntax instead of specifying, call this function with parameters, one, seven and 19, and then you have to guess what those numbers mean, you can call a function with count equals one, order equals ASC, whatever. And then you can reverse the order, change the order around. It's essentially the same idea. But for function parameters rather than Object Properties. Again, there's implementation challenges there. But certainly there are languages that do it successfully. Another problem space people have been looking at is access control. So we mentioned the the read only property. In the discussion for that Nicholas Grekas, made a suggestion for having instead of having a read only flag, allow the access control on a property to be different for read and write. So you could have a property that is publicly readable but not writable. But private writable, or private and protected writable. That gives you many the same benefits as the read only flag would have, but without breaking some of the current patterns we have around cheap cloning of objects and so forth.

Derick Rethans 11:58

Because of course in PHP, PHP's object oriented system is based on classes, not on objects. You can access read and write private properties of other objects as long as they have the same class.

Larry Garfield 12:10

Correct. And that's something that we take advantage a lot of in cloning, to hold wither method style is based on that. If that feature of PHP went away, it would break an awful lot of code. So don't change that. Other things have been on the table. People have talked in the past about constructor promotion, which is a feature that a couple of languages have including Hack, which is the Facebook PHP fork. The basic idea there is, instead of repeating properties once for their declaration, once in the constructor, and then twice in an assignment, you just declare them as part of the constructor. And it becomes essentially a macro to expand that out to the same original code. Hack already has a syntax for that. This one actually has been a proposal for PHP before and it didn't pass.

Derick Rethans 12:57

Was it proposed in the exact same syntax as Hack? I don't believe so because Hack had types at the moment, and PHP did not.

Larry Garfield 13:05

The earlier syntax, I was just looking at that RFC earlier today, used public function constructs this arrow foo, comma, this arrow bar. And then you still had to declare the properties independently, so it only solves half the problem. And the syntax looked kind of weird. The Hack syntax just lets you put the entire property declaration in place of the parameter in the constructor line, and it fills in all of the other pieces. You have public function, construct, parentheses, private int, a number, private bar, some bar object, and so on. And it would automatically create that property on the class and take the parameter and promote it and do the assignment for you. So that's what Hack does. I believe TypeScript has something similar, although I haven't worked with it. It's again just simplifying that common case. Another non PHP place I look for inspiration is Rust, because Rust does immutable objects very well. And so I figured, alright, let's let's look what other languages are doing. What Rust does, they have objects that are more bare than PHP does, much like Go where it's really a struct to which you can attach methods rather than an enclosed object, but they let you create a new object. Here, the object constructor syntax is essentially named parameters already, you're essentially providing a Json like block of this property of this value, this property should have this value, similar to the object constructor proposals. But you can then say, dot dot some other object of the same type, which Rust reads as: and fill in anything I haven't specified with the values from this other object. The fallout of that is making new object that is the same as this other object, but for this one change really easy. Could we do something like that either using Rust syntax or something else just conceptually, would that work to make with the with style methods easier, possibly would it help bypass the problems with a read only flag and so on. Finally, kind of the granddaddy of them all proposal in PHP from a couple of years ago is property accessor methods. This is a very contentious RFC, it didn't pass mostly for performance reasons, as I understand it. But the idea here was you could declare a property to have a dedicated getter and setter method. And then when you try to read or write a property, that method gets called transparently in the background. It's essentially the same idea as the magic get and magic set methods on objects, but specifically for each property, which can then eliminate a lot of: if we're talking about this property, if we're talking about that property gives you a lot more flexibility. It also allows you to then, because those are methods, control the access of those methods separately for get and set. So you can have a public getter and private setter method. A number of other languages have this, Python does, JavaScript does. So I included that okay, this has been a proposal on the table before, I personally really like it. The only downside is the performance impact because since people can't really know in advance if a property it's going to be accessing is guarded by methods like this or not, it means every property access, therefore has an extra if statement around it in the engine. And the performance impact of that, well, small, individually, really adds up when you're talking about 10s of thousands of property accesses. As I understand that, that was the main reason that it didn't pass before. I don't have a good solution for the performance issue. Unfortunately, it would be delightful if you know the typing system would let us do that. Or if the JIT would do something there. I have no idea that's well out of my wheelhouse.

Derick Rethans 16:34

That's lots of solutions that people have come up with in the past and haven't made RFCs for yet. Solving them all one by one, as you mentioned isn't particularly useful thing to do. Because, as you say, you end up in a jumbled mess of things. Your article continues to have an analysis section about all the different aspects of all the different problems and solutions that we've just mentioned here. What's your thinking here, how to join up all the dots?

Larry Garfield 17:00

My goal was alright, as I said, what's the minimum amount of change we can do, that gets us the maximum benefit and solve as many problems as possible without making anything worse? Is there a way that we can make some problems not their own problem, but the result of some other problem? Can we make one a degenerate case of another and thereby solve, kill multiple birds with one stone essentially? What I came up with was: one, constructor promotion on its own, I think is very useful. Let's do that. Named parameters on their own are very useful, let's do that. The combination of constructor promotion and named parameters together gives us the equivalent of a object initialization syntax. The specific symbology in the syntax may look slightly different. But essentially you get the same net effect where you could say, hey, new product object and pass it a series of key values and you're done. And the object itself is defined as just a bunch of key values in the construct statements, and no body, and that still gets promoted. So we end up with struct like, or record like objects with relatively little syntax as kind of a side effect of these two other changes that have good arguments for them on their own.

Derick Rethans 18:14

And also without introduce a new concept such as struct.

Larry Garfield 18:18

Exactly. There's also discussion about, should we just introduce a separate language construct for a struct or a record, that is just their properties, possibly some validation, they will pass by value instead of by reference, which makes immutability easier, to design those for immutability. I've toyed with that idea in the past. And every time I come down to eventually I'm going to want to do everything that classes do anyway. Or if they do something special, I'm going to want to do those in classes, except for the way they pass. Legitimately, there's cases where we would want to have a value object that passes in a more by value style instead of the pseudo reference that objects passed today. There are use cases for that, that's really the only difference. Everything else is essentially the same in both cases, it's more work than is needed to try and create a whole separate construct there. Instead, let's make this one construct flexible enough that we can use it in either way, at whatever use case makes sense. I think those two changes together give us the most bang for the buck and don't harm anything else.

Derick Rethans 19:16

Both of these two proposals help to solve the first problem that you have outlined, which is the problem with constructing objects. So the other problem that we spoke about is the value object and access to properties for example. Have you come up with a solution of which proposals would work towards solving that problem as well?

Larry Garfield 19:36

My proposal on that front, based on what's available, is so I like Nicholas's idea of separate access control for read and write. Okay, now what syntax can we use for that that is going to be self explanatory and readable and not block property accessors if we ever get to the point of figuring out how to do those performently. I don't think we can go all the way to property accessors right now, I would love to, but I don't think that's feasible. Instead, we can borrow some of the syntax from that proposal and let you declare hard to explain this in verbal format. It's like: string name, curly brace, public get, private set, curly brace. Which is essentially the syntax that the property accessor proposal RFC had, but with the method bodies removed, which that RFC actually supported anyway. And what that gives us is then a syntax to say, this property has different visibility for reading and writing, for get and for set, in a way where it's natural to be able to add in functionality to that later for getters and setter methods. If we figure out how to do it. There are probably other syntaxes that could do the same. I'm flexible. I think the key here is some sort of syntax that gives us that split visibility in a way that opens itself to future extension, rather than just throwing more keywords before a property and hoping it works out for the best. And once you've done that, then I think it's worth it to consider: could we do some kind of Rust like cloning or Rust like creation process? I don't know. It could be a variant on cloning. People have proposed a clone this with and then list of properties. And that, essentially de-sugars into creating that new object and then calling a bunch of property set commands. Maybe that's viable. Maybe it's not I'm not sure. Maybe using a syntax closer to what Rust has so that certain thing parameter lists can get auto populated, I don't know. But I think that's an area worth exploring, and would be a nice add on to these others, but it's not a prerequisite. The thing I like about what I'm proposing here, each of these individual pieces carries value on its own. And there's a good reason to vote for each of these on their own, but they dovetail together so that the whole is greater than the sum of the parts. And I think that's the mark of good design where you don't solve each individual problem. You have tools that together solve several problems. It just kind of falls out of the design.

Derick Rethans 22:06

Of course, at the moment you wrote this blog post, none of these proposals had more to it than your description in your article.

Larry Garfield 22:15

Some of them had old RFCs that had been proposed and either didn't make it to a vote or the vote gone slightly negative for various reasons. But yeah, I did not have any patches. My C skill is still extraordinarily limited. That this was a discussion starter, not a here's an RFC with code.

Derick Rethans 22:32

Of course, we are no day and a half or two days later. And now there is of course, an RFC for one of them, which is the constructor promotion, which pretty much as we spoke about earlier, picks up Hacklang's syntax and ports it to PHP.

Larry Garfield 22:47

Yes, I've concluded that my primary role in PHP internals is inspiring Nikita to go write things.

Derick Rethans 22:53

And you were successful in this case.

Larry Garfield 22:56

A year ago, I was on this podcast with you talking about comprehensions, when I was pushing for those, and those never happened. But out of that discussion, Nikita noticed, oh yeah, short lambdas I should go finish those and then went and finished that RFC. My role is convincing Nikita, he should do things. So I consider that a worthwhile contribution.

Derick Rethans 23:13

Fair enough. I agree. Anyhow, it would be interesting to see where this ends up going. We are about, what three, three months away from PHP 8.0's feature freeze. So there's plenty of time to look at these other three proposals that you concluded would be great to have altogether.

Larry Garfield 23:32

I'm happy to work with anyone who actually does know, working on internals on any of these. Personally, I think the asymmetric visibility is the next one after constructor promotion. That's straightforward to do. I know Levi Morrison on the lists has suggested that named parameters has a lot of other gotchas around it that I didn't get into here. And that is very likely. There may very well be implementation reasons why these are harder than I present them as. I fully acknowledge that. But again, if any of these individually, I think still moves the language forward in a way that doesn't close off future avenues.

Derick Rethans 24:07

Do you think you'll end up learning some C to be able to work on this yourself?

Larry Garfield 24:11

So I used to work in C briefly, 16 years ago. I had a very, very short career writing software for Palm OS.

Derick Rethans 24:18

And I remember us talking about it, when we recorded episode last year.

Larry Garfield 24:22

And I did some C again, just recently, while playing with FFI. As we've discussed before, the PHP engine is not written in C, it's written in a macro language that is written in C. There's a learning curve there that I have yet to scale.

Derick Rethans 24:34

Fair enough.

Larry Garfield 24:35

If someone wants to mentor me in that while we work on one of these, I am very open to that. So putting that out there.

Derick Rethans 24:40

You might be inundated by messages now, you never know.

Larry Garfield 24:43

Better that then getting ignored

Derick Rethans 24:45

Do you have anything else to at?

Larry Garfield 24:46

I think it's beneficial for PHP collectively to take this broader approach of, not just okay, what can solve this immediate problem in front of us, we can scratch this one itch, but what are all the itches that we have that need to get scratched? And how can we solve all of those in a way that is going to have the best bang for the buck. And let us do the least amount of work at the least amount of syntax, least amount of conceptual overhead, and yet give us the most flexibility. And there's been a lot of talk anytime we're talking about the PHP type system of we eventually want generics, generics are hard. But let's make sure that whatever we do, doesn't make generics even harder. I think that's good that we have this goal in mind. And we're: all right, what iterative steps get us closer to that without locking us, in without painting us into a corner. And that's kind of what I'm trying to do here. And I would very much encourage everyone working on PHP to take that approach of: don't solve the immediate problem, look at the broader picture, what will solve multiple problems, what will dovetail nicely with something else and what kind of big picture plan in architecture we can look at that ends up making the language better rather than just looking at our feet.

Derick Rethans 25:57

Well, thanks for taking the time this afternoon to come and talk about the object ergonomics. We'll see how much of it ends up in PHP eight.

Larry Garfield 26:05

Fingers crossed.

Derick Rethans 26:07

Thanks for listening to this instalment of PHP internals news, the weekly podcast dedicated to demystifying the development of the PHP language. I maintain a Patreon account for supporters of this podcast, as well as the Xdebug debugging tool. You can sign up for Patreon at https://drck.me/patreon. If you have comments or suggestions, feel free to email them to derick@phpinternals.news. Thank you for listening and I'll see you next week.


PHP Internals News: Episode 50: The RFC Process

PHP Internals News: Episode 50: The RFC Process

In this episode of "PHP Internals News", Henrik Gemal (LinkedIn, Website) asks me about how PHP's RFC process works, and I try to answer all of his questions.

The RSS feed for this podcast is https://derickrethans.nl/feed-phpinternalsnews.xml, you can download this episode's MP3 file, and it's available on Spotify and iTunes. There is a dedicated website: https://phpinternals.news

Transcript

Derick Rethans 0:16

Hi, I'm Derick. And this is PHP internals news, a weekly podcast dedicated to demystifying the development of the PHP language. This is Episode 50. Today I'm talking with Henrik come out after he reached out with a question. You might know that at the end of every podcast, I ask: if you have any questions, feel free to email me. And Henrik was the first person to actually do so within a year and a half's time. For the fun, I'm thinking that instead of I'm asking the questions, I'm letting Henrik ask the questions today, because he suggested that we should do a podcast about how the RFC process actually works. Henrik, would you please introduce yourself?

Henrik Gemal 0:52

Yeah, my name is Henrik Gemal. I live in Denmark. The CTO of dinner booking which does reservation systems for restaurants. I've been doing a PHP development for more than 10 years. But I'm not coding so much now. Now I'm managing a big team of PHP developers. And I also been involved in the the open source development of Mozilla Firefox.

Derick Rethans 1:19

So usually I prepare the questions, but in this case, Henrik has prepared the questions. So I'll hand over to him to get started with them. And I'll try to do my best to answer the questions.

Henrik Gemal 1:27

I heard a lot about these RFCs. And I was interested in the process of it. So I'm just starting right off here, who can actually do an RFC? Is it anybody on the internet?

Derick Rethans 1:38

Yeah, pretty much. In order to be able to do an RFC, what you would need is you need to have an idea. And then you need access to our wiki system to be able to actually start writing that, well not to write them, to publish it. The RFC process is open for everybody. In the last year and a half or so, some of the podcasts that I've done have been with people that have been contributing to PHP for a long time. But in other cases, it's people like yourself that have an idea, come up, work together with somebody to work on a patch, and then create an RFC out of that. And that's then goes through the whole process. And sometimes they get accepted, and sometimes they don't.

Henrik Gemal 2:16

How technical are the RFCs? Is it like coding? Or is it more like the idea in general?

Derick Rethans 2:23

The idea needs to be there, it needs to be thought out. It needs to have a good reason for why we want to add or change something in PHP. The motivation is almost as important as what the change or addition actually is about. Now, that doesn't always get us here at variable. In my opinion, but that is an important thing. Now with the idea we need to talk about what changes it has on the rest of the ecosystem, whether they are backward compatible breaks in there, how it effects extensions, or sometimes how it effects OPCache. Sometimes considerations have to be taken for that because it's, it's something quite important in the PHP ecosystem. And it is recommended that it comes with a patch, because it's often a lot easier to talk about an implementation than to talk about the idea. But that is not a necessity. There have been quite some RFCs where the idea was there. But it wasn't a patch right away yet. It is less likely that these RFCs will get accepted, because in order to get something into PHP not only needs to be there a good idea, that also needs to be there a good implementation of it. If you have been a long term contributor to PHP, then you should know how to write a patch yourself. In other cases, you'll see people that have an idea try to find somebody else to do and work on the implementation together. But all RFCs, if they get accepted. It's always pending a good implementation.

Henrik Gemal 3:52

How is an RFC actually done? Is that like a template you fill out or is it like a website or how does it work?

Derick Rethans 3:59

Our Wiki, I will add a link to that in the show notes, has a template of how to create an RFC. It has a set set of sections. There's always an introduction that basically lays out what it is about or why this change is being made. Then there is often a proposal of what the change actually is. And then there's a few sections that are sometimes empty or sometimes are filled in such as, at least backwards incompatible changes, for which PHP version is been targeted, what the impact is to all the parts of the PHP ecosystem. But these things are not always necessary, because they don't always make sense to do right? If you want to add a new syntax to PHP, then that almost never influences existing extensions, but it will influence OPCache, for example. And then there's also often things like open issues, things we haven't quite thought through yet. A bit of a discussion, discussion bits will get filled in after people in the PHP internals list, which I'm sure we'll get to in a moment, come up with better ideas or alternatives sometimes, and then things like future scope will also be part of the template. We don't really require a very rigid approach to this, but we do appreciate if all the sections are filled in, or at least thought about in such a way that there's either information or not information. And then at the end, there's often a proposed voting choice. Everything at the moment needs to pass by two thirds majority before it gets accepted. So yeah, those are the things in the template itself. But the template is important. And you do need to fill it in, if you want to propose an RFC.

Henrik Gemal 5:33

Are all RFCs public or do you have like private RFCs?

Derick Rethans 5:38

All RFCs have to be public, otherwise they can't be voted on. But some RFCs start out of just a conversation with a few developers coming up with an idea. In the last few months, some more complicated RFC start out on a GIT repository. As a pull request, they never get merged anywhere. Because on GitHub, it makes it much easier to comment on specific sections for adopting feedback. Instead of having large discussions on the PHP internals mailing list, where sometimes comments might just get lost because there's too much text in there. Even though these RFC start out, while they're still sort of public, but nobody knows about them. In the end, they will always have to be public otherwise there won't be any voting, done on it, and it won't get accepted.

Henrik Gemal 6:27

Where's the RFC sent to and who's kind of in charge of the RFC? Is the one that makes the RFC or is it like a RFC commander?

Derick Rethans 6:37

The person that makes the RFC is responsible for guiding it through the whole process that we have. Once they are finished, there is a requirement for you emailing the PHP internals list with a specific prefix, which I think is RFC in square brackets. And then that starts a minimum discussion periods of two weeks. That discussion period might end up longer, in cases, lots of things to talk about or discuss or lots of disagreements, but the discussion period has to be a minimum of two weeks on the PHP internals mailing list.

Henrik Gemal 7:09

I was wondering a little bit about the priority RFCs because I see RFCs as like, a little bit like feature requests. So wondering who actually decides on the priority of an RFC?

Derick Rethans 7:23

Nobody really decides on the priority. Multiple RFCs can go through the process at the same time, you don't really have a priority of which one is more important than others. So yeah, there's nothing really there for it.

Henrik Gemal 7:35

I was just wondering if it's done like a normal project, you know, there might be many RFCs at the same time. I'm wondering how many kind of RFCs are there at the moment, are we talking 10 or are talking thousands?

Derick Rethans 7:50

This depends a bit on where in PHP's release cycle we are. PHP should get released at the end of November or the start of December. In all PHP seven releases that actually has happens. Usually the period between December and March, there will be like maybe one or two a week, which is great because that makes it possible for me to pick the right one to make an episode out for the podcast. At the moment, there are 10 outstanding RFCs. That means there are so many that I don't actually have enough time to talk about all of these on the podcast. However, they are often more just before we go to feature freeze, which happens at the end of June. So there's still two months to go. But you also see that over the last two years, there's a lot more smaller RFCs than there are big RFCs. So big RFCs like union types. They tend to be early in a release cycle, where smaller RFCs, as an example here, there's currently an RFC that there is no episode about, that suggests to do a stricter type checks for arithmetic or bitwise operators. Those are tiny, tiny changes. And in the last two years, there have been more and more smaller RFC than bigger RFCs because they tend to limit the amount of contention that people can disagree with and hence, often makes it easier to then get accepted. That is a change that I've been seeing over the years. But no, there are no thousands for each PHP version, I would say on average, there's about one a week, so about 50.

Henrik Gemal 9:19

I want to get a little bit into the voting part, because that sounds kind of interesting, who can actually vote?

Derick Rethans 9:28

After the two week minimum discussion period is over on the PHP internals mailing list, an RFC author can decide to put up the RFC for a vote. And that also requires you then to send an email to the PHP internals mailing list prefixing your subject with the word vote in capital letters. Now at this moment, you unfortunately see that people start paying attention to the RFC. Instead of doing that during the discussion period. At a moment of vote gets called you shouldn't really change RFC unless it's for like typos or like minor clarifications to things, you can't really change syntax in it for example. People can vote our people with a PHP commit access. And that includes internals developers, documentation contributors, and people that do things in the infrastructure. Everybody that has a PHP VCS account and VCS, version control system, that used to be CVS and now then SVN, and now GIT, as well as people that have proposed RFCs. So the group that technically could vote is over 1000 big, but the amount of people that vote is very much under 50 most of the time. We don't really have any criteria beyond you have to have an account to be able to vote in PHP RFCs.

Henrik Gemal 10:43

How is the voting actual done?

Derick Rethans 10:47

Since about last year, each RFC needs to be accepted, with a two thirds majority. On each RFC on the wiki, once a vote gets called you as an RC alter needs to include a small code snippets that then creates a poll. Very often do we want this? Or do we not want this? So it's a yes or no question. But sometimes there are optional votes, whether we want to do it a specific way, or another specific way. Sometimes that allows you to then select between different syntaxes. I don't think that is necessarily a good idea to have. I think the RFC author should be opinionated enough about picking a specific syntax. It is probably better to have a secondary vote as we call those. Those secondary votes don't to have two thirds majority is often which one of the options wins out of these. But the main RFC won't get accepted, unless there's a two thirds majority with a poll done on the wiki.

Henrik Gemal 11:46

What happens after the vote? You know if it's both if it's Yes or no?

Derick Rethans 11:53

I'll start with the easy case, the no case. If it's a no then the RFC gets rejected. That also means that sometimes an RFC fails for a very specific reason. Maybe some people didn't like the syntax, or it was like a one tweak where it would behave in a wrong way or something like that. But as a rule that says that you cannot put the same subject back up for discussion for six months, unless there are substantial changes. Now, this has happened with scalar type hints, for example, and a few other big ones. If an RFC gets accepted, then pending on whether there is an implementation, the implementation will get set up as a pull request to the PHP project on GitHub. And then the discussion about the implementation starts. If the implementation doesn't get to the point where it is actually good enough, or whether it can actually not be implemented in a way that it doesn't impact performance, it still might end up failing, or might not get merged. And in some cases, it means that a feature will get added at some point but it might not be necessarily in the PHP version that it got targeted for. I don't actually have an example for that now. If the implementation is already good and already discussed it can get merged pretty much instantly. And then it will be part of the next PHP version.

Henrik Gemal 13:08

How many RFCs voted on every year? And what majority voted yes or no?

Derick Rethans 13:16

I don't have the stats for that. But there is a website called RFC watch, where you can see which RFCs had been gone through and which one had been accepted or not, in a nice kind of graph way. I will add a link in the show notes for that. I would guess that during a year, about 50 RFCs are voted on. And I will think that about half of them are passing. But that's a guess I don't have the stats.

Henrik Gemal 13:42

Thank you very much for the answers. It brought me closer to the whole process of the PHP development. You have any other things to add?

Derick Rethans 13:52

I don't think so at the moment. I think what we she'd be a bit careful about is that although we're getting closer and closer to feature freeze at the end of June. We currently have just elected the new PHP eight zero release managers, but I keep the names secret, because this podcast is recorded in the past. They are going to be responsible now for doing all the organisatorical work for PHP eight zero. And that also means that feature freeze will happen at the end of June somewhere. And I expect to see a bunch of RFCs coming up with just enough time to make it into PHP eight zero, or not. So that's going to be interesting to see what comes up there. But other than that, I think we have explained most things in the RFC process now. And I thought it was a fun thing for once somebody else asking the questions and me giving the answers. And I think in the future, I think I would like to do like a Q&A session where I have multiple people asking questions about the PHP process. I also thought this was a good experiment and thanks for you taking the time to ask me all dthese questions today.

Henrik Gemal 15:00

No problem. I love your podcast. I listen to it whenever I bike to work. It's nice to get some insights into the PHP development.

Derick Rethans 15:10

Yeah, and that is exactly why I started it. Thank you Henrik for taking the time this morning to ask me the questions. And I hope you enjoyed it.

Henrik Gemal 15:18

Thank you very much for having me on the show.

Derick Rethans 15:22

Thanks for listening to this instalment of PHP internals news, the weekly podcast dedicated to demystifying the development of the PHP language. I maintain a Patreon account for supporters of this podcast, as well as the Xdebug debugging tool. You can sign up for Patreon at https://drck.me/patreon. If you have comments or suggestions, feel free to email them to derick@phpinternals.news. Thank you for listening, and I'll see you next week.


PHP Internals News: Episode 49: COPA

PHP Internals News: Episode 49: COPA

In this episode of "PHP Internals News" I converse with Jakob Givoni (LinkedIn) about the "Compact Object Property Assignment", or COPA for short, RFC that he is proposing for inclusion in PHP 8.

The RSS feed for this podcast is https://derickrethans.nl/feed-phpinternalsnews.xml, you can download this episode's MP3 file, and it's available on Spotify and iTunes. There is a dedicated website: https://phpinternals.news

Transcript

Derick Rethans 0:16

Hi, I'm Derick. And this is PHP internals news, a weekly podcast dedicated to demystifying the development of the PHP language. This is Episode 49. Today I'm talking with Jakob Givoni about an RFC that is made with a very long name, the compact object property assignment RFC or COPA for short. Jakob, would you please introduce yourself?

Jakob Givoni 0:39

Yes, my name is Jakob. I'm from Denmark, and I've been working programming in PHP for 20 years now. I work as a software engineer for a company in Barcelona that's called Vendo. I got inspired to get involved in PHP internals after I saw you as well as Rasmus and Nikita in a PHP conference in Barcelona last November.

Derick Rethans 1:00

there was a good conference, I always like going there. Hopefully, they will run it this year as well. What I'd like to talk to you about today is the COPA RFC that you've made. What is the problem that this is trying to solve?

Jakob Givoni 1:14

Yes, I was puzzled for a long time why PHP didn't have object literals. And I looked into it. And I saw that it was not for lack of trying. Eventually, I decided to give it a go with a different approach. The basic problem is simply to be able to construct, populate, and send an object in one single expression in a block, also called inline. It can be like an alternative to an associative array. It gives the data a well defined structure, because the signature of the data is all documented in the class.

Derick Rethans 1:47

Of course, people abuse associative arrays for these things at a moment, right? Why are you particularly interested in addressing this deficiency as you see it?

Jakob Givoni 1:57

Well, I think it's a common task. It's something I've been missing, as I said inline objects, obviously literals for a long time, and I think it's a lot of people have been looking for something like this. And also, it seemed like it was an opportunity that seemed to be an fairly simple grasp.

Derick Rethans 2:14

What kind of solutions do people use currently, instead?

Jakob Givoni 2:18

I think, very popular one is the associative array where you define key value pairs as an array. The problem with that is that you don't get any help on the name of the indexes nor the types of the values.

Derick Rethans 2:33

I mean, it's easy to make a typo in the name, right? And it just either exists in the array suddenly, if you set it or you just get a random null value back. As you said, yeah, there's no way of enforcing the type here, of course. COPA compact object property assignment is a mouthful, and it is a new bit of syntax to the PHP language. What is this new syntax going to look like?

Jakob Givoni 2:55

While it looks just like when you assign a value to a property, but here you can add several comma separated lines of property name equals value inside a square bracket block, which is coming after the array and the array arrow operator. The syntax shouldn't really conflict with anything else we have at the moment.

Derick Rethans 3:17

Because that's becoming more and more of a problem, right? Finding new bits of characters to use for new syntax. It is something that came up with annotations or attributes as well.

Jakob Givoni 3:27

And then to start talking about, does this look like typical PHP? Or do you just like this syntax? Or do you hate it? It becomes a taste based thing. For me, the important thing is that if it works, and if it's fairly trivial to implement, I don't have a problem with it.

Derick Rethans 3:43

There was a related RFC early in the year which was called the object initializer RFC. How is your proposal different from that one?

Jakob Givoni 3:51

The object initializer is a new concept. Mine is different in in that I didn't want to introduce any new concepts. My approach was focused on pragmatism. In that other RFC, the initialization is done at the construction time. And you can kind of do it without even having to define your constructor. And one of the most important aspects of that one was to enforce that all the mandatory properties have been initialised. Because you can have type properties in PHP 7.4. If they don't have a value, then there is introduction of this new state of uninitialized properties. And the author of that RFC wanted to make sure that once the object was ready was fully constructed, it would validate that there was nothing missing there. So it has like six out of seven characteristics in common with mine, and one characteristic that is different. I looked into this about the mandatory promises and I didn't find a simple way or an obvious way to handle it. I have one idea if this COPA should pass and I have another idea if it fails. I didn't want to include that it was not part of my main goals.

Derick Rethans 5:01

I'm looking at the syntax here for a bit. And it seems that way how you can do this COPA block. If you have an object, you use the arrow which is dash greater than sign square brackets, and then the list of properties that you want to assign values to. And the RFC shows that to be equivalent to doing each line manually yourself. Does that mean that it is only works for public properties?

Jakob Givoni 5:31

No, it would work also, for what do you call it, virtual properties that don't actually exist, or if they're private, it would just invoke the magic set method in that case. The same thing would happen as if you were to do the assignment line by line as in the example.

Derick Rethans 5:48

Without there being the underscore underscore set method set, it means that you can only really set the public properties in that case.

Jakob Givoni 5:56

You won't be able to set private or protected properties directly unless the magic method does that.

Derick Rethans 6:03

So does that mean that it is pretty much only something that happens in syntax, and it doesn't have any other side effects or any other functionality that you wouldn't already be able to do?

Jakob Givoni 6:15

Yeah, it's just a new syntax for that. The emphasis here was pragmatism. So not introducing any new concepts.

Derick Rethans 6:23

What would use cases for this be?

Jakob Givoni 6:25

Typically, as I mentioned, they're data transfer objects, value objects. Those simple associative arrays that are sometimes used as argument backs to constructors, when you create objects. Some people have given some examples where they would like to use this to dispatch events or commands to some different handlers. And whenever you want to create and populate and and use the object in one go, the COPA should help you.

Derick Rethans 6:58

I suppose COPA would also work for standard class objects?

Jakob Givoni 7:02

It's an object just like anything else. So yeah, yes, there shouldn't be any surprises.

Derick Rethans 7:07

But of course, it doesn't really make a lot of sense to use standard class because then again, of course, you don't have the benefits of checking your property names or types, again, of course. Are the other use cases you can think of?

Jakob Givoni 7:19

Why don't have anything else in mind.

Derick Rethans 7:22

I remember quite a long time ago, because this is a subject that comes up quite a bit. That's pretty much people that write PHP code abuse associative arrays so much. Just like the object initializers RFC, as well as your COPA RFC, try to use objects in a different way to be able to prevent developers from abusing associative arrays, pretty much as more stricter data types. In languages like C, there's a distinct datatype for this is called a struct. Do you think it would make sense that instead of trying to overload our object semantics, then in stats use, or introduce something like a struct concept of that C or other kind of statically typed languages have?

Jakob Givoni 8:10

As I understand it, a struct is basically the same thing as structured as what I'm talking about structure set of data. However, I'm not sure if it's worth it to introduce a new concept. I don't know if it's necessary if it's possible to reuse the things that we already have enough familiar with. I think I would prefer that you call it overloading the object. But I don't see a lot of problems with having an object that is simply a list of properties with values. It's a very basic object. An object doesn't need to have any methods, it's possible to use that. Every time we add a new concept like struct would be, I feel that it would lead to a combinatorial explosion of implications that later you need to assess every time you want another future change. I haven't seen any RFCs that have specifically mentioned structs. But it is a very related concept.

Derick Rethans 9:08

I'm just asking because I spent a lot of time in C where we have structs. But we don't really have objects or classes to begin with. It's more familiar for me to use that. And the other reason why I was asking is that perhaps it would be possible to create like a slightly more natural syntax, because, in my opinion, I think the one that you currently have chosen isn't particularly the most friendly one, but that's my own opinion here.

Jakob Givoni 9:33

There might be a window of opportunity, because curly brackets after the variable is going to be deprecated as a way as an array access. So maybe that could be used just curly brackets and dropping the arrow itself. That would look a lot more like like an object, I think, and it would also be shorter.

Right. I mean, PHP 7.4 deprecated these.

So the question is just how soon can we remove it and replace it to mean something else completely?

Derick Rethans 10:03

Yeah, that's a good question. I don't think I have the answer either. I guess it can be introduced as long as syntax that existed previously would now not do something different. And I think you would actually be okay here.

Jakob Givoni 10:15

I'm pretty sure it would throw a syntax error. If you try to run this code in a previous version.

Derick Rethans 10:21

I meant saying if you would reuse the curly braces, because as you said, they have been deprecated in PHP 7.4.

Jakob Givoni 10:28

I mean, if someone were not to follow that deprecation notice, that is now in place and would continue to keep their the code. If we change the implementation, it's better to get a clear, fatal error than to just have something really spurious happening.

Derick Rethans 10:45

Yes, absolutely, I definitely agree. Now, that's sort of what I was trying to get at, but you explained it more eloquently than I did. The RFC lists a few special cases. It talks about execution order and exceptions. I think some, somebody brought up somewhere that what happened If we're trying to set multiple properties through COPA and say the second out of three throws an exception. What would be the end state of the object for example? Could you talk a little bit through that?

Jakob Givoni 11:11

Regarding exceptions being thrown in any of those expressions where you are assigning, it's important to understand that the block of code that is COPA is not an atomic operation. Anything that happened before the exception will still have happened. And everything anything that happens after won't happen. Exactly like what you would expect if you were doing it line by line. Or if you were using method chaining to do several things on an object. I think it's going to happen what you would expect to happen unless for some, I think it might be unintuitive, that it's not an atomic operation. But it's just important to keep that in mind. That's why I listed it under special cases. And there's something similar with the execution order, in that you can list the properties in any order you like. It doesn't necessarily mean that you're going to get the same result if you change the order because you will be able to use the value of a previous assignment in the next one. Again, not 100% intuitive, but I think it might be worth the trade off in implementation and flexibility.

Derick Rethans 12:19

As you mentioned, there's no new semantics in there. Talking a little bit about implementation here. As there is no patch available, is this something that you'd be interested in developing yourself? Or are you looking for somebody else to help you out on that?

Jakob Givoni 12:32

I actually haven't contributed any code before. I'm not familiar with C. But one reason that I chose this RFC and this approach is also that if I can't get any volunteers, I might be able to learn and to do it myself, since it seems like it's mostly a parser syntax thing, probably should be able to pick that up.

Derick Rethans 12:53

I would also think because there is no new semantics in here, that it would instead be something in between, probably just the lexer that we have, the parser, and then constructing an equivalent abstract syntax tree or AST segment out of that.

Jakob Givoni 13:12

I would be thrilled to collaborate with someone to do some pair programming in order to get started if anyone is up for it.

Derick Rethans 13:18

So if you're listening to this episode, and you want to help Jakob out, why not get in touch with him? His contact details will be in the show notes for sure. The RFC also lists a few things that you have thought about, but you have decided not to either pick up into the RFC or you don't think they are in scope. Would we'll talk about that a little bit?

Jakob Givoni 13:36

There's some special things that you can do at the moment when you assign a value to a property. Things like using a variable to specify the property name, or to generate the property name from an expression using the curly brackets after the arrow. There's also array access directly on the properties, or increment, decrement, or nested object accesses. I don't think that these things are really essential. I've decided to probably leave it out of scope for now unless it's trivial. If it if it's trivial to implement that as well. It's okay with me. It's not deal breaker. But you have to do a cost benefit analysis. And I'm thinking that it could be a future scope. If there's a demand this can be addressed in a later RFC.

Derick Rethans 14:23

The RFC also talks about nested COPA. But it looks so complicated to me that I'm not sure whether it is actually something that we even should add to begin with.

Jakob Givoni 14:34

I don't think it's as complicated as it looks. So you can already already do nested COPA in if you create a new object inline as well as you of course, you can assign it to a property in the outer scope of the COPA. But if you want to over, to set just one property of a nested object, then you cannot do that directly. Well, you can do it actually if you access the previous one. Because you have access to the current property when you do their assignments. So you can see in my example that you can do it. But there might be a better syntax for doing that.

Derick Rethans 15:11

I'm happy to see that there's no backward incompatible changes. So that's always a win. What has been the feedback so far?

Jakob Givoni 15:17

Yeah, the feedback has been mixed bag as to say. There's some recognition that this has potential to be a useful feature. This is a critique of the syntax, as you also mentioned, and then about the missing functionality, like the mandatory properties and atomic operations. And then of course, named parameters always comes up. The PHP internals list. It's a tough crowd. I really enjoyed engaged in this project. So I don't mind it's part of it. I also really like this side discussion that we're having currently about ways to improve the way that we collaborate and make progress, especially on tough issues.

Derick Rethans 15:58 That has definitely improved over the last five years to a decade, but it can always be improved more, I would say. What is your end goal with this RFC? I guess you would like to see this added to PHP at some point, are you targeting it for PHP eight?

Jakob Givoni 16:13

I would be extremely proud to see this added to PHP at some point. And if it can make it into PHP eight in the first release, that would be awesome. That's at least what I'm going for, for now.

Derick Rethans 16:25

The PHP project is looking for release managers for PHP eight zero, with feature freeze happening at the end of June somewhere. So there's lesser and lesser time available for doing these things. So I'm curious to see where this ends up.

Jakob Givoni 16:39

It's a race against time at the moment.

Derick Rethans 16:42

But that's always the case, isn't it? I think be interesting to see if, if somebody wants to help out to make the implementation of this, or rather, I'd be interested to see whether you'd be able to pick up that yourself actually. We can always do with more people that work on a PHP language. Do you have anything else to add yourself?

Jakob Givoni 17:00

I'd say that I spent a lot of effort researching and writing this. And I just hope that people will study the RFC properly and keep an open mind. I know it's probably going to be a hard sell. And that's okay. I just wanted to give it a go. And this is just just the beginning of my contributions, I hope.

Derick Rethans 17:19

I spoke with Mate a little bit a few episodes ago. He was getting worried about it not getting accepted at some point. And I pointed out to him that scalar type hints took about a decade and seven attempts to finally make it into PHP. So it helps to just persist I would say in times.

Jakob Givoni 17:37

Times change and also you get new ideas and you evolve.

Derick Rethans 17:42

The language continues to improve and that's how I like it. Thanks, Jakob for taking the time to talk to me today. It was interesting to see what you're up to.

Jakob Givoni 17:51

My pleasure. Thank you so much Derick for having me.

Derick Rethans 17:56

Thanks for listening to this instalment of PHP internals news, the weekly podcast dedicated to demystifying the development of the PHP language. I maintain a Patreon account for supporters of this podcast, as well as the Xdebug debugging tool. You can sign up for Patreon at https://drck.me/patreon. If you have comments or suggestions, feel free to email them to derick@phpinternals.news. Thank you for listening, and I'll see you next week.


PHP Internals News: Episode 48: PHP 8, JIT, and complexity

PHP Internals News: Episode 48: PHP 8, JIT, and complexity

In this episode of "PHP Internals News" I discuss PHP 8's JIT engine with Sara Golemon (GitHub).

The RSS feed for this podcast is https://derickrethans.nl/feed-phpinternalsnews.xml, you can download this episode's MP3 file, and it's available on Spotify and iTunes. There is a dedicated website: https://phpinternals.news

Transcript

Derick Rethans 0:16

Hi, I'm Derick. And this is PHP internals news, a weekly podcast dedicated to demystifying the development of the PHP language. This is Episode 48. Today I'm talking with Sara Golemon about PHP 8 and JIT. Sara, would you please introduce yourself?

Sara Golemon 0:33

Hi there. Hi there, everybody listening to PHP internals podcast. I'm Sara. I've been on this podcast before. But in case you're just getting here to for the first time, welcome to the podcast. You have a nice backlog to go through. I am a lapsed web developer, come database security engineer by day, and an opinionated open source dev slash PHP 7.2 release manager by night and also day. I've been involved with the project for about 20 years now off and on. Somehow I just keep coming back for more punishment.

Derick Rethans 1:03

We're leading up to PHP 8, with lots of new features being added. But one of the biggest thing in PHP 8 that I've spoken about on the podcast on before all the way back last year in Episode 7, is that PHP eight is going to get a JIT engine. Would you care to explain what a JIT engine does again?

Sara Golemon 1:20

Well, I'm going to give you the short, you can look this up on Wikipedia in two seconds definition of JIT, means just in time compilation. That doesn't really tell you much, unless you listen to it on the sort of other half of that of AOT, or ahead of time compilation. AOT is what you expect from applications like GCC, you know, you just make an application that you've got C or C++ kind of source code to that's ahead of time. JIT is saying, well, let's take the source for application. And let's just run with it. Let's just start executing it as fast as I can. And eventually we're going to get down to some compiled code. That's going to run a little bit quicker than the initial stuff did. PHP already has this nice little virtual machine built into it. We call it the Zend engine. That takes your script and immediately just says: All right, well, what does this say in computer terms? Well, a computer readable term is a series of these op codes, they're also called byte codes in other languages that give you instructions for: run this type of instruction at this time and get something done. The PHP runtime interpreter interprets that one instruction at a time basically pretending to be a CPU. This works quite well, it runs quite efficiently. But there's still this sort of bottleneck in the middle there of a program pretending to be a CPU running on top of a CPU in order to run other code. The idea of JIT is that this thing sitting in the middle is going to gradually figure out what your program really is trying to do and how it's intended to run, and It's going to take those PHP instructions and it's going to turn them all the way down into CPU instructions, so that it can get out of the way and let the CPU run your code natively as if it had been written in a compiled AOT kind of language. What that actually means for execution of PHP code in PHP 8 is still sort of a, you know, a question that's, that's left to be answered here. I listened to your interview with Zeev. Episode 7, is a good episode of getting some good information on that. We do definitely agree on what the status of the JIT within PHP is, right now we can. It's subjective facts like this is how much work has been done largely by Dmitri, where we can kind of expect to see the best gains come from. I personally think I might be a little bit more pessimistic than him in terms of the actual performance impact we get out of it. I think we both recognise we're not going to see the two to one kind of improvements we saw from five to seven. Nobody's realistically expecting that, but if you look at the demo that Zeev ran a few months ago, where he shows the Mandelbrot set being generated in two different PHP requests, and then WebSocket out to a nice pretty display, it's a very visceral reaction because you can see one Mandelbrot set being calculated much, much faster than the other. And he acknowledges though this is not realistic PHP code, nobody's writing the Mandelbrot calculation in PHP. We can see that under certain workloads, it's definitely getting faster. But for PHP core mission, which is web serving, I mean, we both know that it's not going to be massively fast. I think it's going to be almost imperceptibly fast.

Derick Rethans 4:41

One question for my site, the Mandelbrot set, the implementation of that is all in a specific function, right? And it's all CPU heavy code, not IO.

Sara Golemon 4:51

Yes.

Derick Rethans 4:52

And it's all that in the same function.

Sara Golemon 4:54

Yes.

Derick Rethans 4:55

Now, what I was thinking of the other day is that how does this interact with calling standard library functions, because the JIT engine is going to have to go out of basically running things on the CPU and calling things that are then implemented in C to begin with.

Sara Golemon 5:10

So you're asking that question, because you already know some of the pitfalls of JIT, and you're leading me into it. And that's fine. When a JIT emitter is taking the language that it's emitting, so PHP. As long as it remains within the scope of PHP, it can sort of keep track of where it's at. It's like, Okay, I know this variable's init, your because I saw it get set. I know that this is going on here. I know that's going on there. And it can carry those assumptions around as it's admitting code. And emit very efficient code that doesn't need a whole bunch of double check guards of like: Wait, is this still an integer? Wait, is that still a string? All of these sort of like escape hatches for when things go wrong. Anytime you cross over into, I will say C-land, or internals land, or ahead of time compiled land. It's basically calling into what it sees as a black box. And it just says: Okay, here's some data, I know the types going in, have fun with it. And something air quotes happening in the air happens with that code and the black box spits out an answer. Well, by the time the black box has spit out the answer, the JIT that has taken that PHP code, no longer knows if any of its assumptions are true or not. It just has to say: Well, time to start from scratch, time to keep track of where we are from here, build up a new set of assumptions. So we get this speed bump in the road of executing code. And it turns out most PHP applications are using a whole lot of those internal API's because they're quite useful. There is a kitchen sink in PHP, and it does stuff. So you have these repeated hits of this road bump happening, and that's not great. If we want to compare this to other JIT languages that are out there. I might suggest we compare this to HHVM because of course, HHVM, at least in the beginning implemented a fairly close kin cousin to the dialect of PHP. It has since diverge much more and become hacklang. But it was doing the same thing, taking PHP code, running it native on the CPU and occasionally having to make that cross to this its own version of internals, or it was running C++ code. One of the ways to reduce those numbers of jumps is that they took a lot of those internal functions, the ones that actually didn't need to do anything, particularly internals ish, and just rewrote them in PHP code. And if you look at the HHVM source code right now, there is a big directory called systemlib and that's a whole bunch of hacklang code, read it as PHP code, that is implementing a lot of these very common quote unquote internal functions. We just had an RFC for function called str_contains(), that is a function that could have been hundred percent been written just as PHP code. Something could have thrown that into packagist. For the record, I voted against it because of exactly that. I think you should write that in packagist and just put it in your composer.json is okay. It's gonna pass anyway, it got a lot of votes. That aside over, that is a sort of function that if we were putting it into sort of an 8.X version of PHP, where we did have our own type of systemlib, we would have probably just said, let's write that as PHP code. So that the JIT, when it enters that function, can keep all those assumptions intact, and potentially even inline some of those instructions and avoid the function call entirely. That's basically taking all of the instructions that are part of the in this case, str_contains() function, and implementing them within the scope of the function that was calling it. So you skip that entire function call overhead, which a lot of people know is still one of PHPs sort of weaker points in terms of where that fat to trim is, as Zeev said in Episode Seven, we still have some parts of PHP that are a bit slow, irrespective of a JIT.

Derick Rethans 8:50

There are actually a few functions that have been inlined now into op codes. strlen() is an example of this where instead of it now being a function call, it's actually directly an opcode. Because it is a function that is used so much and actually gain a bit of performances there.

Sara Golemon 9:05

Yeah, I think all of these functions as well are just a single opcode for type check. Yeah.

Derick Rethans 9:10

There's a whole bunch of them for sure. I saw that earlier this morning, Dmitri produced, or proposed another branch in which he implemented tracing JITs, instead of the JIT that we already have, and I have no idea what the difference is between a normal JIT engine and the tracing JIT engine,

Sara Golemon 9:25

Ultimately, the distinction is not that important to end users, it's going to function the same, but it is a sort of an internal implementation detail. HVVM's by the way, is a tracing JIT. It basically looks at any given unit of work that it needs to translate, let's say a function, and it says, what are the pieces that have these sort of non branching parts attached to them? Let me look at each of the non branching pieces. And let me create a version of that translation based on the types that I expect to be going in there. If the types fail, I'm gonna have to create a new version of that piece. But then that piece can plug into this sort of chain of tracelets to create a full function. Most of the time, especially if you've written code that is well type hinted, you've got, you know, strict types turned on, you've got all of your types on the on the function parameters set. And it's very easy for the JIT to infer the types out of what you've put into your function. You're only ever going to need to create a single tracelet of any given section, and your full trace is going to be a single, unbroken chain of: do this, do this, maybe do a jump to another spot, just keep doing this, doing this, doing this. If you have, let's say, slightly messier code, maybe you're not using any kind of type hinting it becomes very difficult to infer any of the types, because there's lots of different call sites, that are doing lots of different things. We may end up having some functions that have multiple tracelets per body section that get built into the giant bush of interconnected edges, that's less ideal in terms of maximising performance, but it still at least functions.

Derick Rethans 11:06

We have spoken a little bit about what a JIT engine is and sort of how it works. It sounds quite complex and complicated.

Sara Golemon 11:14

It is definitely complicated. And I'm feeling like that's another lead. And so I'll just run with it.

Derick Rethans 11:19

I've also got to say my next leading question... Maybe I should actually ask the question?

Sara Golemon 11:24

Well, let's actually take a step back from the JIT for a second. And let's look at where the engine is right now. So the engine is basically two very large pieces. That's the sort of the extension library of all of the runtime functions. Everything you see exposed in user space, and the actual scripting engine. There are some other smaller pieces, but those are two, the two really big pieces. There are a whole lot of people pay a whole lot of attention to the extension piece, because that's the flashy bit. That's the part that gives you some bit of binding that you didn't have before, or some bit of functionality that can be delivered out of the box as part of that kitchen sink. And that definitely needs attention. I'm glad that that continues to evolve. But the scripting engine is that piece that defines syntax and how code is actually going to run.

Derick Rethans 12:09

Reading extension's code as a whole lot easier than reading the engine code.

Sara Golemon 12:13

And that's where I was going to go with that, yes, if you look at the code that's under ext, you can even come into that code without knowing any C at all. And you can actually make pretty good sense of a lot of it because a) PHP uses a whole lot of macros. So every function is literally defined with a macro that says: PHP_FUNCTION, like right here, PHP function, every class method, PHP_METHOD, here's the class name. Here's the method name. And what these things do are pretty clear sort of API's. They're very small bite sized pieces for the most part. The bits that involves sort of defining a class and how it does its memory management, those get a little bit more complicated, but I think on the whole extension code is far more accessible. If you go and look at the engine, particularly the runtime pieces of the engine, although the compiler is complex as well. You have to do a lot of digging before you even get to a point that you can see how the pieces maybe start to fit together. You and I have spent enough time in the engine code that we know where to look for a particular thing. Like let's say that opcode, you mentioned that implements strlen(). We know that, oh, zend_vm_def.h has got the definition for that. We also know that that file is not real code. It's a pre processed version of code that gets built later on. Somebody coming to that blind is not going to see a lot of those pieces. So there's already this big ramp up just to get into these engine as it exists now in 7.4. Let's add JIT on top of that. You've got code that is doing call forward graphs, and single static analysis, and finding these tracelets, and making sense of the code at a higher level than a single instruction at a time, and then distilling that down into instructions that the CPU is going to recognise. And CPU Instructions are these packed complex things that deal with immediates, and indirects, and indirects of indirects, and registers. And the x86 call ABI is ridiculous thing that nobody should ever have to look at. So you add all this complexity to it, that by the way, sits in ext/opcache. It's all isolated to this one extension that reaches into the engine, and fiddles around with things to make all this JIT magic happen. You're going to take your reduced set of developers who know how to work on Zend engine, and you're going to reduce that further. I think at the moment, it's still only about three or four people who actually understand how PHP's JIT is put together enough that they can do any effective work on it. That worries me for sure. I don't think that's an insurmountable hill to climb, especially if we can start getting some documentation written about it, at least from a high level point of view. Hey, you know, look over here to find this stuff. Look over here to find that stuff. Something to get started. So the people who have at least that basic understanding of how the VM part of the Zend engine works can sort of upgrade their knowledge to get into to the JIT. I only think that's worth it. If we actually get real performance boost out of JIT. If we actually turn the JIT on, and we see that for PHP's core workload, which is web serving, we're only seeing a one to 2% gain. For me, that's not enough. It may be enough for others. But for me, I would call that experiment, not a failure, but a non success at that point. Certainly there are people out there who are still going to want to use it, because they are you doing command line applications, and they're doing complex math. And I'm not saying we can't have it. I'm just saying it takes less than a forward stage that point.

Derick Rethans 15:43

Somebody mentioned earlier in the chat room. It's also another set of potential bugs, right?

Sara Golemon 15:48

It is definitely another potential bugs.

Derick Rethans 15:51

It's pretty much another implementation of the PHP syntax bits of PHP.

Sara Golemon 15:57

So if you run an application and you get behaviour you don't expect, where is that behaviour actually coming from? You can spend a lot of time looking in Zend engine because you're thinking like: Oh, well, this is the thing that executes opcodes. And when I run it in a single command line, it's definitely going through this bit of code, but it works on a single command line run. But at the twentiest request on my web server, it's not working. Why is that happening? Well, it turns out, it's happening, because that's when the JIT has finally kicked in, because it has enough information. And it's running through this tracelet that was just a little bit wrong. And well, crap. You mentioned I think, at one point, when we were talking in Miami just a couple months ago, that you're just gonna have to turn the JIT off entirely when Xdebug is running,

Derick Rethans 16:41

Just like I'll already turn OPCache optimizations off, because there's just too confusing for people.

Sara Golemon 16:46

It's confusing and complex, but it's also it may not even be 100% possible because we are right there down at the bare metal of running CPU instructions. There's not a lot of opportunity to just say like, Oh, hold on Mr. CPU, let me just take a look at your registers right now. Okay, this is okay, let's go ahead and keep going now. The VM that we have now in in Zend lends itself 100% to those kinds of activities, CPU does not. What that means is that what we experience in the development mode with Xdebug running is not going to be the exactly the same thing that we experience in real runtime code. And I don't know if we have a solution for that.

Derick Rethans 17:23

As far as I know, there's no solution for it at all.

Sara Golemon 17:26

I was trying to cage it in the hope that maybe we could someday have solution for it.

Derick Rethans 17:30

It'd be lovely, but I can't see that happening to be honest. I think it's going to be important to find out how much this actually benefits, real live code. How does it benefit your Laravel project or your Symfony project or anything like that? I think it's going to be hard to now make a case for not shipping PHP 8 with a JIT. I think that'd be a bit unfair. But on the other side, if it's, as you say, only really gives you one or 2%, whether this is worth have the additional complexity. The additional maintenance burden as well as another opportunity for having bugs that are a lot harder to reproduce, but it's actually worth having it at all?

Sara Golemon 18:11

I definitely don't want to poopoo on the JIT effort.

Derick Rethans 18:14

Oh, no, absolutely not.

Sara Golemon 18:15

I think this is an important experiment to run. And I think if 8.0 as a whole winds up being a sort of public beta experiment of it, that will definitely give us a lot of good information. And I am super hopeful that we see better percentages, that we see 5-10 maybe even 15%

Derick Rethans 18:31

Absolutely.

Sara Golemon 18:32

I want to be guarded in what I how I talk about it on a podcast like this because I don't want anybody say: Oh, 8's gonna be great. Our code is gonna run 10 times as fast as it was running before No, that's not gonna happen two x is not gonna happen. We're talking much lower numbers than that. Be guarded, be hopeful, but 8.0 is going to be, as I said, it's going to be that sort of public beta experiment.

Derick Rethans 18:55

I think that's great. I think running this experiment again because ta similar experiment was, of course run during the PHP 5.6 days when PHP 7 came out. Originally with PHP 7, was PHP with a JIT engine. And then Dmitri and others found out that it was so much other things that could be done to make PHP run pretty much twice as fast.

Sara Golemon 19:16

Yeah, there was a lot of really low hanging fruit.

Derick Rethans 19:19

Yep. And that was great to see. I am apprehensive about people thinking that the JIT engine in PHP eight is going to similar performance boost.

Sara Golemon 19:29

We'll see. Nothing to say about it, but then: we'll see.

Derick Rethans 19:32

But I would suggest is that if you're interested in seeing what this can do for your projects, you should go try it out. Download PHP's master branch, enable it and see how it goes.

Sara Golemon 19:41

And of course, make sure you are running on x86 hardware. I doubt very much that he's bothered to put more than one back end on this.

Derick Rethans 19:48

I don't actually know.

Sara Golemon 19:49

I haven't looked. He might be using some helper library for it. So it's possible that we're hitting multiple backends. But this is probably going to be an x86 only thing and possibly a Linux thing. I should find out the answer to that question.

Derick Rethans 20:00

I should do too. Okay, Sara, thanks for taking the time this morning to have a chat with me about PHP 8' JIT efforts.

Sara Golemon 20:08

It's fun as always, I always love to speak with you Derick. You bring a bright Corona of sunlight to my day.

Derick Rethans 20:16

Thanks for listening to this instalment of PHP internals news, the weekly podcast dedicated to demystifying the development of the PHP language. I maintain a Patreon account for supporters of this podcast, as well as the Xdebug debugging tool. You can sign up for Patreon at https://drck.me/patreon. If you have comments or suggestions, feel free to email them to derick@phpinternals.news. Thank you for listening, and I'll see you next week.


PHP Internals News: Episode 47: Attributes v2

PHP Internals News: Episode 47: Attributes v2

In this episode of "PHP Internals News" I chat with Benjamin Eberlei (Twitter, GitHub, Website) about an RFC that he wrote, that would add Attributes to PHP.

The RSS feed for this podcast is https://derickrethans.nl/feed-phpinternalsnews.xml, you can download this episode's MP3 file, and it's available on Spotify and iTunes. There is a dedicated website: https://phpinternals.news

Transcript

Derick Rethans 0:16

Hi, I'm Derick. And this is PHP internals news, a weekly podcast dedicated to demystifying the development of the PHP language. This is Episode 47. Today I'm talking with Benjamin Eberlei about the attributes version 2 RFC. Hello, Benjamin, would you please introduce yourself?

Benjamin Eberlei 0:34

Hello, I'm Benjamin. I started contributing to PHP in more detail last year with my RFC on the extension to DOM. And I felt that the attributes thing was the next great or bigger thing that I should tackle because I would really like to work on this and I've been working on this sort of scope for a long time.

Derick Rethans 0:58

Although RFC startled attribute version two. There was actually never an attribute version one. What's happening there?

Benjamin Eberlei 1:05

There was an attributes version one.

Derick Rethans 1:07

No, it was called annotations?

Benjamin Eberlei 1:08

No, it was called attributes. There were two RFCs. One was called annotations, I think it was from 2012 or 2013. And then in 2016, Dmitri had an RFC that was called the attributes, original attributes RFC.

Derick Rethans 1:25

So this is the version two. What is the difference between attributes and annotations?

Benjamin Eberlei 1:30

It's just a naming. So essentially, different languages have this feature, which we probably explain in a bit. But different languages have this. And in Java, it's called annotations. In languages that are maybe more closer home to PHP, so C#, C++, Rust, and Hack. It's called attributes. And then Python and JavaScript also have it, that works a bit differently. And it's called decorators there.

Derick Rethans 1:58

What are these attributes or annotations to begin with?

Benjamin Eberlei 2:01

They are a way to declare structured metadata on declarations of the language. So in PHP or in my RFC, this would be classes, class properties, class constants and regular functions. You could declare additional metadata there that sort of tags those declarations with specific additional machine readable information.

Derick Rethans 2:27

This is something that other languages have. And surely people that use PHP will have done something similar already anyway?

Benjamin Eberlei 2:35

PHP has this concept of doc block comments, which you can access through an API at runtime. They were originally I guess, added as part or of like sort of to support the PHP doc project which existed at that point to declare types on functions and everything. So this goes way back to the time when PHP didn't have type hints and everything had to be documented everywhere so that you at least have roughly have an idea of what types would flow in and out of functions.

Derick Rethans 3:07

Why is that now no longer good enough?

Benjamin Eberlei 3:09

Essentially, user land developers use doc blocks to put metadata in there, and you could access them through an API. We had two sort of standards, or we still have two standards that use this. The documentation standard coming from the PHP documentor community. And then mostly runtime use case that exists now is covered by the doctrine annotations library, which, incidentally, I have also worked on a lot. It is used, for example, by the Symfony community, by the Drupal community, and by a few other communities as well that are smaller that wanted to go into the direction of using annotations in this case or attributes.

Derick Rethans 3:53

What would doctrine use an annotation for?

Benjamin Eberlei 3:55

I said before that annotations, add metadata to declarations. So let's say you have in your code, for example, classes that you want to store in the database. So you need to map PHP classes to database tables and back. Usually, you would do that using some kind of configuration. And configuration can be many folds. So the easiest way would be to write this in PHP, say, this is the column name, this is the field name, this is the class name and then store and use this information. And then you can go and store this in ini files, yaml files, XML files. The problem with this kind of approach is often that you have the configuration file and you have the class, and they are totally separate from each other, usually in very different places of the codebase. This is not some kind of configuration that is fluid. It's very, very static configuration that depends on the class. And it will not really change unless the class also changes. So changes are usually done together. In this case, it might make sense to put the configuration on to the class. Because then you see the declaration, you see it's configured in some way. And then you can more easily understand that changes affect each other in some way. And it leads to less mistakes, in my opinion. And it makes it a little bit more obvious that the class is used in some configured way.

Derick Rethans 5:26

We've had a quick look at what annotations are. The RFC introduces them in a different way, the attributes that you're not proposing, how are they different from the doc block comments?

Benjamin Eberlei 5:37

The idea is that we introduce a new syntax that is independent of the doc block comments. Essentially, before each declaration, you can use the lesser than symbol twice, then the attribute declaration, and then the greater than sign twice. This is the syntax I've used from the previous attributes RFC. And Dmitri at that point used the syntax from Hack. And it makes sense to reuse this not because Hack and PHP are going in the same direction any more. But because Hack at that point they introduced it that they had the same problems with which symbols are actually still easy to use. And we do have a problem in PHP a little bit with the kind of sort of free symbols that we can still use at certain places. And lesser than and greater than at this point are easy to parse. There are a bunch of alternatives and one thing that I will probably propose is an alternative syntax where we start with a percentage sign, then the square bracket open and then a square bracket close. This is more in line with how Rust declares attributes. While Rust uses the sort of the hash symbol, which we can't use because it's a comment in PHP.

Derick Rethans 6:54

And you don't want to use emojis.

Benjamin Eberlei 6:55

Some crazy people propose to use emojis which would easily work in PHP, but I guess it would be hard to remember the number to get the Unicode sign.

Derick Rethans 7:06

Within the two opening lesser than signs and two greater than signs to close it. What's in the middle?

Benjamin Eberlei 7:12

You declare an attribute name. And then you sort of have a parenthesis open, parentheses close, to pass optional arguments. You don't have to use them. So you can only use the attribute name. If you sort of want to tag something: just this is a validator, or this is an event listener, whatever you come up with, to use attributes for. But if you need to configure something in addition, then you can use. The syntax sort of looks like if you would construct a new class, except that you don't have to put the new keyword in front of it.

Derick Rethans 7:45

It looks like function arguments pretty much.

Benjamin Eberlei 7:47

Yes, exactly. Yeah.

Derick Rethans 7:48

What kind of values can you use in the optional arguments to the attributes?

Benjamin Eberlei 7:53

The attributes are not really runnable code in a way. Since they are declarations, they don't allow arbitrary PHP code to run there. What is obviously allowed a simple literal values, so a number, or a fixed string, a fixed array declaration, and all this kind of things are possible. What is also possible is exactly the same expressions that you can also declare in class constants. So, in the class constants, you can do simple mathematical expressions, you can reference other constants. So, this is something that will be very interesting for attributes to do reference class names for example.

Derick Rethans 8:34

What happens if you define an attribute on a declaration element?

Benjamin Eberlei 8:38

What happens is that while the PHP script gets compiled, it will see that there are attributes declared and it will parse the attributes and similar to the doc block store them on the internal structure for future reference. Attributes are parsed in my current proposal in a way that you can have every attribute just once. This is something that is still under heavy discussion, because there are a few good ideas why you would need two, or multiple. Essentially similar to how a doc block is a string, we then store an array, which represents the attributes belonging to the class or the function or the constant. And this is something that the engine stores and also stores it in OPCache.

Derick Rethans 9:27

How would you access these attributes?

Benjamin Eberlei 9:28

Attributes are accessed through the reflection API. The reflection API also allows access to doc blocks. For attributes that would be a new function called getAttributes(). And it returns a list of all attributes using a new reflection class called ReflectionAttribute. There you can access what name does this attribute have? What are the arguments that are passed? And then this goes into one of the next features of this RFC proposal. You can also ask it to return this attribute as an object instance.

Derick Rethans 10:05

An object instance of which class though?

Benjamin Eberlei 10:07

Attributes, and this is something that is different to the initial version, the version one attributes RFC is, attributes names resolve to class names. That means if you declare an attribute, for example, Foo, and you have an import for our class, MyApplication/Foo, then during passing the attribute will be resolved to my attribute view name. It uses the same mechanism for class resolving that is used in every script. It reflects the use statements that are declared in the file. And you can use namespaces, namespace operators to reference the attributes as well.

Derick Rethans 10:49

These are attributes not classes, so I don't quite see all the link between the attribute names in the classes is?

Benjamin Eberlei 10:55

One problem with the original doc block based system was that there are conflicts between attributes of different systems. One library would have a type annotation, or a var annotation, and some other library would also use it. This could lead to conflict if the syntax for them was slightly different. So this would lead to problems when multiple parses would use the same attribute. And they would parse them differently. And this could lead to errors. One problem that was mentioned in the initial attributes RFC and that, I think, if you vote us all so used as a reason for voting no is that there was no namespacing, which means that different libraries could clash and their use of attributes. My idea was we already have classes, we have namespacing. We can resolve this by using this mechanism. You declare an attribute and an attribute always resolves to a class. In the best case scenario, you would also declare this class in your code. Essentially, the attribute is not an attribute, but it's a special class that represents an attribute. This is also shown in the code that by having an additional interface, or a sort of a marker interface, that attributes can implement to make it obvious that they they are used as an attribute.

Derick Rethans 12:19

You mentioned that you could access the attributes through reflection API, and you can get them out as an object?

Benjamin Eberlei 12:25

Yes, this is why I mentioned before that the syntax sort of looks like constructing a new object, but without the new keyword. When you access the objects through the reflection API, it would essentially instantiate the class, and all the arguments that you put into the attribute declaration are passed into the constructor of the object. And this is why the connection is there between a class and an attribute. It directly goes to instantiating the attributes as an object using the arguments and giving the developer access to them.

Derick Rethans 13:00

Does it only do something like this when you use the getObject() on the reflection arguments? Or is it also possible that I don't care about these classes things whatsoever, and I can just get a list of attributes and their optional values that are associated with them?

Benjamin Eberlei 13:16

You don't have to have a class, and the class name resolving in PHP is independent of classes actually existing. The attributes RFC respect that. You can just import anything that is not a class and use an import statement to shorten the attribute usage, or you can use the absolute namespace syntax to put a fully qualified attribute name into your code. And it wouldn't fail. The fail would only happen when you call the method on ReflectionAttribute to get the attribute as an object. So this is something the RFC is also in flux with and about to change it. The first version mentioned that attributes will always be auto loaded when they are declared at compile time. This would essentially treat attributes similar to base classes or interfaces, in a way that they are always resolved, they're always checked. However, this is a little bit overkill for userland attributes. And a lot of feedback was related to this should only happen when the reflection API is used. So I'm going to change this. One thing that we do need to handle in a way is a built in attributes. One reason why I want to add this RFC as well is that there are a few use cases coming up in PHP itself, that could benefit a lot if we had built in attributes. Since we don't have a clear path forward there. But Nikita has published his ideas on editions. So there's some paths forward to having PHP code work slightly differently depending on what developers want. Attributes could be helpful there. Other things for example, the JIT. JIT has features where you can at the moment use doc block comments to declare methods as always JIT-able or never JIT-able. Dmitri used doc block comments to check for JIT or no JIT tag in there. This is essentially something that attributes should be used for because should be machine readable. Then there's a lot of other stuff that for example, Rust also put forward that PHP is struggling with: conditional declarations of functions. For example, Symfony has a polyfill library that adds functions that are in higher languages, re implements them in a way that they're also available in lower versions where they don't exist in core. There are a lot of hacks around the sort of conditional declaration of functions and classes and stuff that make it difficult for OPCache to actually cache the files. I believe there are also even more problems if you use these kind of fights with pre loading. Essentially what could be done with attributes would be something like conditionally declared as function only if it's on PHP 7.3 and lower something like this.

Derick Rethans 16:13

You just mentioned using JIT or no JIT as an annotation. Does that also mean that extensions have easy access to these attributes?

Benjamin Eberlei 16:21

OPCache's not a PHP core functionality. It's still its own extension. The idea is that extensions have access to attributes in a very simple way. So there will be a Zend API, sort of an internal name for an API that the Zend engine provides to extensions and extensions will be able to access attributes and make decisions based on this. Extensions can already hook into the compile step of PHP and there's a hook called zend_ast_process. During AST processing, you can do stuff. That would be one way to, for extensions to look at attributes and maybe change code if they want. Then the engine obviously has tonnes of other hooks where the declarations are available in the data structure that the Zend engine provides. So there's zend_class_entry, for example, where you could look into the attributes as an extension and make decisions.

Derick Rethans 17:20

This is a pretty new RFC, and hence there're always going to be few open issues. Because we like to argue about stuff. What are the open issues on this RFC?

Benjamin Eberlei 17:29

This is the seventh RFC on this topic. So there has been a lot of discussion. I guess this feature is, in a way quite controversial because of the implementation details. A lot of my work now will be to find the best implementation that can actually make this feature part of core by getting enough votes for it. And so I gathered a lot of feedback from the community; also talked a lot to contributors. Changes that I will be probably doing is allowing multiple attributes. What I said before, the auto loading has to be clarified. There has to be some distinction between internal attributes and user land attributes in a way that doesn't require auto loading. Hack, for example, has __ as a magic prefix, which I want to avoid, because it puts up all this magic methods, sort of argument back on the table. We need to have something to make a distinction between userland and internal attributes, because the internal attributes need to be validated very strictly at compile time. And the userland attributes need to be validated only when you call the getAsObject() method on the reflection API.

Derick Rethans 18:42

How long do you think there'll be before you put this RFC up for a vote?

Benjamin Eberlei 18:46

It's a bit tricky because this issue is so controversial. I don't want to invest month of work and then get a no vote. And so I do want to have some feedback quite quick enough. I do realise that the first draft needs some work and clarifications that would otherwise lead to no votes from contributors. So I hope to get this done in, let's say, two to four weeks of additional work.

Derick Rethans 19:09

All right, Benjamin. That was a great explanation of the attributes version two RFC.

Benjamin Eberlei 19:16

Thank you for having me, and I really appreciate it again.

Derick Rethans 19:21

Thanks for listening to this instalment of PHP internals news, the weekly podcast dedicated to demystifying the development of the PHP language. I maintain a Patreon account for supporters of this podcast, as well as the Xdebug debugging tool. You can sign up for Patreon at https://drck.me/patreon. If you have comments or suggestions, feel free to email them to derick@phpinternals.news. Thank you for listening, and I'll see you next week.


PHP Internals News: Episode 46: str_contains()

PHP Internals News: Episode 46: str_contains()

In this episode of "PHP Internals News" I chat with Philipp Tanlak (GitHub, Xing) about his str_contains() RFC.

The RSS feed for this podcast is https://derickrethans.nl/feed-phpinternalsnews.xml, you can download this episode's MP3 file, and it's available on Spotify and iTunes. There is a dedicated website: https://phpinternals.news

Transcript

Derick Rethans 0:16

Hi, I'm Derick. And this is PHP internals news, a weekly podcast dedicated to demystifying the development of the PHP language. This is Episode 46. Today I'm talking with Phillipp Tanlak, about an RFC that he's made titled str_contains. Phillipp, would you please introduce yourself.

Philipp Tanlak 0:35

Hey, Derick. My name is Philipp. I'm 25 years old and I live in Germany. I work for an IT service company, which does mainly development and maintenance of IT projects. We specialise in the maintenance of e-commerce website and create enterprise applications.

Derick Rethans 0:52

How long have you been using PHP for?

Philipp Tanlak 0:54

I've been using PHP for quite a long time now that might be six years I guess.

Derick Rethans 0:58

What brought to you creating an RFC?

Philipp Tanlak 1:02

The main reason I've created this RFC was out of necessity and interest, mainly to scratch my own itch.

Derick Rethans 1:08

That is how most things make it into PHP in the end isn't it?

Philipp Tanlak 1:11

Yeah, I guess.

Derick Rethans 1:12

The RFC is titled str_contains, that tells me something that is about strings and containing things. How do we currently find a string in a string?

Philipp Tanlak 1:22

The current approach to find the string in a string is to use the strpos() function or the strstr() function. But on Reddit, I found someone also use preg_match which I find kind of interesting.

Derick Rethans 1:35

There are multiple amount of different methods in use, what are the general problems with these approaches that people have made?

Philipp Tanlak 1:41

So the current approach which I find is not very intuitive, and mainly because of the return values of these functions. For example, the strpos() returns either the position where the string is found, or a false value if the string is not found, but there has to be a check with a !== operation, and the strstr() function just returns a string. So you have to convert that to a boolean to check if the string is found or not.

Derick Rethans 2:11

Because with strpos(), if you wouldn't use the === or !== operator. Of course, if it would find it at the first position of the string, it'd be zero position, and it would return false, even though it's sfound it.

Philipp Tanlak 2:26

Yeah.

Derick Rethans 2:27

So there's a few different problems with these things. Also, I don't think it's particularly vary intuitive to do because you sort of need to come up with like a whole construct to see whether it's part of a string.

Philipp Tanlak 2:37

Correct. I don't think it's intuitive for a beginner. So if someone is learning PHP for the first time, then he has to search through the documentation, what are the exact return values for these functions, and has to remember that so I thought, string or str_contains() might be a better fit for that to just return a true or false value.

Derick Rethans 2:58

We've mentioned str_contains() a few times now, I guess the RFC is producing to add this function. How would this function differ from what PHP already has?

Philipp Tanlak 3:07

So this function does not differ in a lot of ways. It's basically the same implementation of the strpos() function. But instead of returning the position of the found string, it just simply returns it as a boolean value. So either true or false.

Derick Rethans 3:23

I can imagine some people will say, well, you can just do this in your own wrapper function, right? Because pretty much what it deos is converting the results from strpos() to a boolean. But you must have a good reason of why to want to add an extra function here.

Philipp Tanlak 3:38

The reason for this function, and maybe someone might disagree is, mainly a user experience for the developer. So this is just out of necessity which I found, and I've been using this function quite a lot. So I thought this might be a valid add to the PHP language. So I tried to implement it and it got some great reviews. So I thought that wasn't a very bad idea I had.

Derick Rethans 4:04

Is the RFC suggesting just out a single function: str_contains().

Philipp Tanlak 4:09

Yes, the RFC is currently adding just a single function, which is the str_contains(). When I first submitted the discussion about this RFC, there were quite a few people asking why is there no case insensitivity or multibyte versions for these, and I did not think of those at first. But in the discussion, it became clear that the multibyte version did not seem to be very necessary because the comparison is going to be byte by byte. Unlike strpos(), the position of the found string is not relevant. So it doesn't matter if there is any difference in encoding.

Derick Rethans 4:47

I remember in last year, there was another RFC related to strings functions they were the string_starts_with() and a string_ends_with(). Those are two functions and there were also variants for both case insensitivity, ss well as multibyte. Which made eight different functions to be added to pretty much do a single thing. That RFC failed, potentially because there are so many things being added.

Philipp Tanlak 5:11

Yeah, that was also the main reason, I think the case insensitivity of this function, or the variant of it was not so relevant. So I did not include it into the RFC just because of this case you mentioned. So instead of polluting the global space with more functions, someone suggested to just advance PHP incrementally and add in case sensitivity for this function just if it is necessary.

Derick Rethans 5:37

This is a common recurring subject. Most of the people I spoke with in the last few episodes are all adding things to PHP bit by bit instead of coming up with big RFCs which I think is a good way of going forwards. When reading the RFC, I had a quick look at which argument the function would accept. PHP of course this weakly typed strings in most of time. Is this str_contains() function handling distinct different from what strpos() does for function arguments.

Philipp Tanlak 6:10

So the str_contains() function uses the same internal function, which is php_memnstr(), if I recall correctly. It tries to interpret it as a string. And if it's not a string, it either throws a warning or notice, but I've just run some checks and it seems like in the next PHP version, non string values which are passed into the string functions will be interpreted as a string, and if that is not the case, it will throw an error or usually return false.

Derick Rethans 6:43

So it doesn't do any special magic, and just relies on the PHP tends to do for parsing arguments and weak and strict typing.

Philipp Tanlak 6:51

Yes, that's correct.

Derick Rethans 6:53

Most RFCs they come with a patch, as does yours. How did you find it getting started with writing things for PHP instead of using PHP.

Philipp Tanlak 7:02

So basically, I've looked at the PHP source code in the past, just to see how things are implemented. And I had some basic background in C. So I thought that this was not very hard for me. Most of the functions or things I had to do to include this patch, were already there. So basically, I just copied the strpos() function and remove the, when the string is found, use the position to calculate a new string and just remove that code and return the boolean value from the found position.

Derick Rethans 7:35

Because it is not a very different function from strpos(), it's just pretty much a different return type. It's a lot easier to do.

Philipp Tanlak 7:44

Yeah.

Derick Rethans 7:45

When looking at feedback, what were the main criticisms of this?

Philipp Tanlak 7:48

The main criticism of this was basically just the variants of these functions. So mainly the multibyte variant or the in case sensitivity. Other than that, the response was very, very nice and, and also very rewarding for me. So I thought I did a good job on this. And many people wanted to have this function in PHP, but either did not have the time to implement it or it was too easy. I'm not sure how that went. But I think the response from the devs and the overall PHP community was very nice.

Derick Rethans 8:23

The RFC is already in voting, so I'm I'm a bit late to talk about them. Usually I'm and things are still in discussion. And at the moment, it looks like it is passing because the votes are 43 to 6 with another weeks ago, then.

Philipp Tanlak 8:37

Yeah.

Derick Rethans 8:37

Do you think this will be your last RFC? Or do you have something else in mind?

Philipp Tanlak 8:41

At the time of this recording I don't have anything else in mind, but maybe if I find something. Since I'm working with PHP on a daily basis, which I think is worth adding to PHP I might create a new RFC.

Derick Rethans 8:54

That's how I started and see what happens now. Thank you for taking the time to talk to me today Phillipp, I hope you enjoyed this.

Philipp Tanlak 9:01

Yeah, thanks for having me Derick.

Derick Rethans 9:05

Thanks for listening to this instalment of PHP internals news, the weekly podcast dedicated to demystifying the development of the PHP language. I maintain a Patreon account for supporters of this podcast, as well as the Xdebug debugging tool. You can sign up for Patreon at https://drck.me/patreon. If you have comments or suggestions, feel free to email them to derick@phpinternals.news. Thank you for listening, and I'll see you next week.


PHP Internals News: Episode 45: Language Evolution Overview Proposal

PHP Internals News: Episode 45: Language Evolution Overview Proposal

In this episode of "PHP Internals News" I chat with Nikita Popov (Twitter, GitHub, Website) about the Language Evolution Overview Proposal RFC.

The RSS feed for this podcast is https://derickrethans.nl/feed-phpinternalsnews.xml, you can download this episode's MP3 file, and it's available on Spotify and iTunes. There is a dedicated website: https://phpinternals.news

Transcript

Derick Rethans 0:16

Hi, I'm Derick. And this is PHP internals news, a weekly podcast dedicated to demystifying the development of the PHP language. This is Episode 45. Today I'm talking with Nikita Popov yet again about a non technical RFC that he's produced titled language evolution overview. Somewhere last year, there was a big discussion about P++, an alternative ID of how to deal with improving PHP as a language but also still think about how some other people already use PHP and I don't really want to change how they currently use PHP. Like then I didn't really have an episode about that because I'd like to keep politics out of this podcast, or definitely PHP's internals politics. I do think that we realised at that moment that something did have to happen, because there's not really policy about when we can add things, when we can remove things, and so on. So I was quite pleased to see that you have come up with a quite wordy RFC, not talking about anything technical, but more looking forward of were will see PHP in the near or medium future, I would say. What are your thoughts about making this RFC to start with?

Nikita Popov 1:29

As you mentioned we had some pretty, let's say heated discussions last year, concerning especially backwards incompatible changes. So there were a number of very, very contentious RFCs. One of them was the short opentags removal, and another one was the classification of undefined variable warnings. So whether those should throw or not throw, and well basic contention is this that PHP is a by now pretty old language, 25 years old. And we can all admit that it's not the language with the best design. So it has evolved relatively organically with quite a few words, and the famous inconsistencies. And now we have this problem where we would like to resolve some of these long standing issues. Many of them are genuine problems that are introducing bugs in code, that reduce developer productivity. But at the same time, we have a huge amount of legacy code. So there are probably many hundreds of millions of lines of PHP code. And every time we do a backwards compatibility break, that code has to be updated, or more realistically, that code does not get updated and keeps hitting on old PHP version that, at some point also drops out of security support. And now the question is how can we fix the problems that PHP has, while still allowing this legacy code to update their PHP version. The general idea of how to fix this is to make certain backwards compatibility breaks opt in. By default, you just get the old behaviour, but you can specify in some way, exactly how it's done doesn't really matter at this point, that you want to opt into some kind of change or improvement.

Derick Rethans 3:34

As one example being the strict types that have been introduced in PHP that you need to turn on with a switch with a declare switch.

Nikita Popov 3:42

Strict types is really a great example because it has the important characteristic that has done per file. So you can turn on the strict types in one file and not affect any other code, at least in theory. So there are some edge cases, but I think like mostly you can just enable strict types in your library and you don't affect any other library that the project uses. We would like to extend this concept. It should be possible that libraries can update to your language, well, it's called language dialect without forcing other libraries or without forcing the using codes to update as well. Because this is what we have to do right now, though, before you can update your project to PHP eight, let's say, you first have to wait that all the libraries you're using update to PHP eight. And maybe there are libraries that are going to update but also say that: Okay, now actually PHP eight is required. And then you kind of get these complex dependencies with libraries supporting these versions and not supporting those versions, and doing updates becomes pretty hard. As I said, the idea is to make the these backwards incompatible changes opt in some way, and there are multiple general models. So as you mentioned, P++ is the most radical approach. It's more or less a separate language but sharing the same implementation. And as the name suggests that this is inspired by C and C++. So those are usually implemented in the same compiler. And they can be interoperable in a limited way, mostly in that you can use C code inside C++ easily. Using C++ code inside C code tends to be much harder. Yeah, P++ is, I think the option we are pretty unlikely to take for a couple of reasons, because it's this kind of one time huge break which first means that we only have one chance to get it right, and given all the track record, we should maybe not rely on that. Also means that the upgrade becomes especially hard because you have to do everything at once. It's not spread out over a longer time.

Derick Rethans 5:54

You say that we need to get it right in one go, but that is hard to say because you don't know, in the future what else we want to add? Like the RFC mentions a few few other cases, like, for example, things like forbidding dynamic Object Properties, we'd have to do right away now as well, if he'd go with the two languages one implementation phase, right? I mean, if we hadn't thought about it, nobody would have thought about it after the split as we made, we'd still not be able to do it.

Nikita Popov 6:20

That's true. So P++ is, one time, one time solution. It doesn't really scale over time. I mean, there are also other concerns. And I think like in the end, one of the big ones is just that we don't have the resources for it anyway. So we have only maybe three full time developers on PHP. And I don't think we want to start focusing on this huge separate language more or less. Now we're just going to take a couple of years. Next to having this entirely separate language, there are two other ways to approach the problem. One is editions, which is a concept used by the rust programming language. The idea there is that next to the version, which is more or less than implementation version, you also have this edition, which is a completely orthogonal concept. Basically, we will say: okay right now we are for example at edition zero. And then in addition one you opt into some kind of set of backwards incompatible changes. Then in addition two, there are more backwards incompatible changes, and so on. Each edition is essentially a superset of the previous one.

Derick Rethans 7:32

Would it also mean you couldn't get new features in a new edition or is it purely about making backwards incompatible changes?

Nikita Popov 7:40

So, this is purely about backwards compatibility. So, if a new feature can be added without breakage then should always be available. The editions switch would only control the backwards incompatible parts. This is to contrast with the second approach, which is to have fine grained declare statements. As you already mentioned, we have the existing strict types directive and we could continue down the same path. So, we could add new declare for no dynamic Object Properties equals one, and then for a strict operators equals one, and for whatever else equals one. And then you would have this long list of possible declares, with which you could enable or disable some particular bit of language behaviour.

Derick Rethans 8:26

Then I can imagine that in another five years, that list might be 20 options long.

Nikita Popov 8:31

Right. So, the concern there is of course, one part is maintenance, because we have to support basically an exponential combination of different options. And the other is from the programmer perspective, that the like mental model becomes more complicated because you have to keep in mind like which exact set of declares am I using right now? I should say, though, that this model is actually used by Python. Because Python has this import or use from future feature. So there is basically this magic module __future from which you can import language features that will become the default in newer Python versions. For example, you can import the new integer division behaviour inside an older version. This is more or less the same as doing the declares, the fine grained declares, just with a different syntax and with the I think, stronger focus that the behaviour is going to become the default in the future version.

Derick Rethans 9:38

So basically, you're opting into experimental functions really?

Nikita Popov 9:41

Could be either experimental functions, or it could be really functions from newer versions. In particular Python, also for a while had parallel development of Python 2 and Python 3, in which context this probably makes more sense.

Derick Rethans 9:56

There's pretty much three options that the RFC mentions: a new language common implementation or the PHP / P++ option, the editions, and the fine grained declares. These are all still going to be based per file?

Nikita Popov 10:12

So that's the second large question, what is the general model? And the second one is where we declare it. The approach I was initially pursuing was to have this declare it at the package level. So for a whole library or for for a whole project.

Derick Rethans 10:32

How would you define what a package is?

Nikita Popov 10:33

We have namespaces. And there is a somewhat loose coupling between namespaces and packages. So I have an old RFC for a namespace scope declares, where you could, for example, specify strict types for whole namespace, which is, I think, maybe the most natural way to treat packages right now, because this is the closest thing to a package we have. Fortunately, it does have a few issues. One of them is that this namespace package mapping is not always there. So there are packages that have some somewhat odd nesting of name spaces. And I've also heard that some people, for example, define their models inside the Doctrine name space, because they're, you know, extend their classes. So they also put them the namespace. Of course, you shouldn't do that. But it's things that could happen, because we don't really have this enforcement that the namespace really is a package. And then there are also technical concerns, because right now, namespaces are really just a compile time thing to handle name resolution, and now they kind of turn into a feature that also has some kind of runtime impact. And you have to consider things like what happens if you have multiple namespaces in the same file, and also other considerations, like what happens if the names namespace is first used, and you issue some namespace scope declares afterwards. All that can be resolved, but it makes the model somewhat more complicated.

Derick Rethans 11:53

And I guess you end up having to declare these namespace scope declares maybe in a separate file or something like that?

Nikita Popov 12:14

At least what I have in mind that is that you would declare them in composer.json, and Composer would then take care of registering them with PHP itself. Of course, you could also do that manually, which are not using Composer but that at least was the 95% use case.

Derick Rethans 12:31

In applications that make use of Composer, it is very likely that Composer knows about all the libraries that a specific application uses, and hence will be able to construct an array, where it can tell PHP by calling a function declaring all the different options or editions of whatever that end's up being.

Nikita Popov 12:49

So that's one of the approaches. There are also some alternatives. One is to instead introduce an actual package concept. One of the possibilities is to basically: add an extra line to each file, which says package and the package name. So that really removes any and all ambiguities. But you do have to add that extra line, which serves some very limited purpose. And basically only for these package scope declares, could maybe also be used for some extra features, like, package private symbols.

Derick Rethans 13:23

But it would also instantly make that code base non-parsable with older PHP versions.

Nikita Popov 13:28

That's also true, right. But that's a general problem that most approaches I think, would have. So namespace scope declares is one that doesn't have it, but even the per file approach would have this problem because if you write for example, declare edition, then you would right now on PHP seven get the warning that the edition declare is not known. Yeah, last variant that I'm discussing here is to make packages based on the file system, which is something many other languages do. So you have some kind of magic file somewhere that says okay, this directory and all the sub directories are part of the package. In PHP, this kind of file system based approach is somewhat problematic, because our include mechanism is not really based on the file system but on fairly general stream abstraction. You can include from the file system, you can include, if you're really crazy from HTTP, but you can also include from Phar files, from an input stream, or from some kind of custom defined stream. These file system based packages require some additional operations to be well defined. So they have to have a notion of path canonicalization so you can determine whether a file is inside the directory, even if there are things like symlinks or the file system is case insensitive. Which does exist for the file system. So we have the real path syscall, but doesn't exist for streams right now. And a similar problem is that we need to be able to walk up from a path to the directories. And that's also something that doesn't exist for streams. And like more generally, not all streams really have a well defined concept of a directory. For example, if you are reading a file from stdin, so the stdin or the input stream, then there is no directory and like, which package is that going to be in?

Derick Rethans 15:31

I think it would be hard to end up debugging at some point. So why some things don't actually end up being in a package where you expect them to be, for example. And then on top of that, you also need to define: Well, how do I call this file and things like that, right? I mean, a PHP script wouldn't be just a single file, for example, would be a single file and this extra definition file. And that's the concept of course that we don't have in PHP at all. Everything is on profile pretty much.

Nikita Popov 15:56

Which is why at least to right now. I think, like the immediate way forward, is to use per file declares. So if we don't use the fine grained declare approach, and instead have a single edition, then it's not really a problem to put the declare edition inside every file, because this is already what we do for strict types. It's like not super ergonomic. But I think it's also not a huge problem. And it does have the one very big advantage that files are and remain self contained. So you don't have to consult an external definition that may be hard to locate to figure out how to process.

Derick Rethans 16:36

And every IDE or tool would have to implement that same logic and make sure that it's all consistent with each other as well.

Nikita Popov 16:43

I wouldn't say it's really hard, but it might be somewhat fragile, especially when it comes to convention. I said if we put things in composer.json, there's probably something tooling can easily deal with. But if you then encounter a project that doesn't use Composer and uses as some other way to register the package declares, then you might run into problems.

Derick Rethans 17:09

Lots of things to talk about and discuss at some point. As you submitted this RFC to the mailing list some time ago now, what is sort of the feedback that you're getting on this?

Nikita Popov 17:19

So I think the general direction, at least this pretty clear. Most of the discussion is focused on the addition concept, not the finger in declaratives, or the P++. I think for now, we would also go with the per file approach. Now, the main two points that remain contentious is: first, how does the support timeline look like? So basically, the concept of editions just enables different libraries to upgrade independently. That's the core premise. But at least in Rust additionally editions of are also guaranteed to be supported forever. So you can leave your old code running on the old edition, and you do not have to ever update it.

Derick Rethans 18:10

How often do they make new editions? Every three years?

Nikita Popov 18:13

Yeah, it's not quite clear yet, but probably it's going to be every three years. And now for us, the question is, well, do we want to support old editions forever? Or do we want to give them a finite lifetime? Say we introduced a new edition in PHP eight, and then we supported until PHP nine. That means code can take its time to do the necessary updates, but it does have to do the updates at some point.

Derick Rethans 18:37

But you'd have five years?

Nikita Popov 18:39

It's more of the general question of if it's forever or if it's limited. So I think based on the discussion, there is a pretty strong preference to not support them forever.

Derick Rethans 18:51

But for how long then? I mean, it must be longer than what we support a normal PHP version for, right?

Nikita Popov 18:56

Yeah, would expect it to be something like a major version cycle. The second question is related to the strict types, as you said, strict types is like an existing example of a mechanism that works like this. And now we're introducing a second mechanism with the same basic characteristics. Are we going to merge them or not? Would we say that, in the new edition that strict types is enabled by default, or even always enabled? If we do that, and we say that additions have limited support life, that means that strict types is going to become the only option in the future at some point, at least. You can imagine that this is somewhat contentious because there are quite a lot of people who consider weak types to still be the superior option.

Derick Rethans 19:49

Whenever I go speak at conferences or user groups, that's not the case. One question is, which keeps recurring always is: Why isn't this the default in PHP eight? I think there's an expectation that strict title at some point is going to be turned on by default.

Nikita Popov 20:04

Yeah, and the thing, this is where people disagree whether this expectation is this or not. So there are plenty of people in the discussion thread, well, by plenty I mean, at least two, who strongly think that strict types should remain an option. I mean, PHP of deals with often deals with input coming from HTTP or from a database which is usually coming in as a string. And they think that the typecast you have to do to make that work with strict types actually kind of weaken the type safety guarantees, because if you perform an explicit cast, then that cast is performed basically without any checks. So you can like take a completely non numeric string cast it to integer and you will get zero without any warning or whatever. While even in weak typing mode, that would still result in an error.

Derick Rethans 20:58

It's a curious thing actually when you mention databases because, of course databases, you've defined very strict types for your data in them. It's just that it's interesting that PHP's interface to most of these old SQL databases, just decided to always turn into a string.

Nikita Popov 21:14

It's it does actually support returning things in they're like native type.

Derick Rethans 21:20

With PDO, yes.

Nikita Popov 21:21

But under options, and I think it's also like dependent on whether you do emulation or not, and stuff like that. And you have all these different drivers that have differing support for that. But yeah, to get back to strict types, but one of the options is to really keep editions and strict types separate, and also evolve the strict and the non strict mode independently. So you could say that in the new edition, the strict typing mode becomes stricter, for example, by also extending to operators, arithmetic operators, not just to function arguments, but that of course doesn't mean that: Yeah, we saying strict types of states exist forever as a separate track of language.

Derick Rethans 22:06

Yeah, that's an interesting one. I'm not sure how to get to a conclusion there actually. Because there's always going to be people on each side side.

Nikita Popov 22:13

Yeah.

Derick Rethans 22:13

Would you think that this language evolution overview proposal would have been decided on which way to go by the time feature freeze for PHP eight comes around?

Nikita Popov 22:23

I think it would be pretty good to have this for PHP eight, because well, it's new major version and the time to introduce this kind of concept. I should say, though, that we already have quite a few backwards incompatible changes in PHP eight, and at least some of them are, like, we are definitely not going to retrofit them into the editions concept. So there are already certainly going to be breaking changes there.

Derick Rethans 22:52

Why wouldn't you retrofit them? I mean, if we end up deciding a PHP eight will have these editions, would they not be part of that or would they always end up breaking anyway? Because it seems like a sort of an ideal place to then do it.

Nikita Popov 23:05

And yeah, problem is just that the there are some quite extensive changes, especially when it comes to warnings versus exceptions, and will just be like a lot of efforts to get this under an edition flag and to support both behaviours there. Maybe some of the existing changes could be moved into there, with not a huge amount of effort. But I think there are definitely going to be some like hard edition independent breaking changes.

Derick Rethans 23:37

New major PHP versions still might have some backward breaking changes independently from when we do the editions or not, or more declares or not?

Nikita Popov 23:46

Yeah, that's like one more question, what exactly is the scope of editions? What goes into the edition, what doesn't go into there? I mean, there is always a cost to ending something with this mechanism. One is just maintenance for us. And of course that like user has to consider more different versions of the language. And I think one particularly large aspect that would likely never fall under edition concept is changes to the standard library. So additions work well for language changes, but I don't think they really make sense for a standard library changes. So everything that involves depreciations, or functions with eventual removal would not be covered for that.

Derick Rethans 24:31

Do you have an example of such a change in the standard library that PHP eight might have?

Nikita Popov 24:36

What I just said might as the general that, usually in every PHP version, we deprecate a bunch of functions and are going to remove them at some point. And these deprecations are like going to apply independently of what edition you set. Actual changes in terms of like real behaviour changes of the standard library I think that's something we quite rarely do. Actual changes to the standard library where the behaviour of a function is changed. That's something we generally try to avoid. Specifically because this causes relatively subtle backwards compatibility breaks. So usually we will either do changes by introducing a new flag or a new function, or by deprecating the functionality entirely. Even when it comes to language changes, there is like I know one example. And the discussion was, well, if we had the edition concept, and we wanted to introduce something like traits, the trait functionality in general is not backwards compatibility breaking. But the trait feature does introduce two new reserved keywords, which is trait and insteadof. So there is technically a backwards compatibility break even though it's finer. And now you have the trade off. Do you introduce traits in the new edition and only reserve the keywords there, thus removing any backwards compatibility break. Or do you you introduce it always, which means that everyone can benefit from it, even if they haven't updated the code to the new edition yet. But it does introduce the small backwards compatibility break. And then you get this trade off and the discussion what you should be doing about that.

Derick Rethans 26:17

I think making that kind of decisions will have to be done based on evidence. And I think in the past you've used the top thousand projects on GitHub and see whether things break or not to make a decision. For example, having the nested, or the triple, quadruple nested ternary. Anytime people use it, it's pretty much a bug in the code.

Nikita Popov 26:36

Yeah, so to give one example, in PHP 7.4, we introduced the short closure syntax with the fn keyword, and they're the source code analysis showed that basically, fn is not used outside of tests, apart from one library, which is my own. Which does have quite a few dependencies. And that library was indeed broken essentially completely by that change. So in that case, I think there might have been an argument that this feature should be introduced under an edition, because there is like evidence of actual breakage in the wild.

Derick Rethans 27:14

This is one of us trying to get it right. We now have evidence for it.

Nikita Popov 27:18

And probably like the insteadof keyword for traits, that there's much less problematic.

Derick Rethans 27:24

Again, as I say, it's the data that speaks that there right? That was quite a bit to go through. I'm curious to see where those discussions ends up going. Hopefully, we get to a conclusion somewhere in the next few months and ready for PHP 8.0. Who knows? Maybe we have another podcast episode where we introduce a new editions concept.

Nikita Popov 27:43

So this is probably my most vague RFC, with a somewhat unclear goal and the somewhat unclear discussion outcome.

Derick Rethans 27:53

Do you have anything else to add to this discussion that we've missed?

Nikita Popov 27:55

I think there is just one thing maybe worth mentioning, which Rust uses pretty extensively, which has automatic upgrades. So they have some tooling to do that, which is mostly reliable. And I think it would be pretty nice if in PHP, we had something similar. In PHP, we can't really make this reliable because language is just way too dynamic. And we actually do have some tooling in the form of the rector library. But we might want to think about providing something under the PHP project umbrella that is more geared towards like doing updates that are as safe as possible. So you can run them without thinking but still reduce your loads some what.

Derick Rethans 28:40

And that is something that is definitely for the future. Thanks for talking to me about the language evolution overview proposal.

Nikita Popov 28:46

Thanks for having me, Derick.

Derick Rethans 28:53

Thanks for listening to this instalment of PHP internals news, the weekly podcast dedicated to demystifying the development of the PHP line. I maintain a Patreon account for supporters of this podcast, as well as the Xdebug debugging tool. You can sign up for Patreon at https://drck.me/patreon. If you have comments or suggestions, feel free to email them to derick@phpinternals.news. Thank you for listening, and I'll see you next week.


PHP Internals News: Episode 44: Write Once Properties

PHP Internals News: Episode 44: Write Once Properties

In this episode of "PHP Internals News" I chat with Máté Kocsis (Twitter, GitHub, LinkedIn) about the Write Once Properties RFC.

The RSS feed for this podcast is https://derickrethans.nl/feed-phpinternalsnews.xml, you can download this episode's MP3 file, and it's available on Spotify and iTunes. There is a dedicated website: https://phpinternals.news

Transcript

Derick Rethans 0:16

Hi, I'm Derick. And this is PHP internals news, a weekly podcast dedicated to demystifying the development of the PHP language. This is Episode 44. Today I'm talking with Máté Kocsis about an RFC that he produced called write only properties. Hello, Máté. How's it going?

Máté Kocsis 0:34

Yeah, fine. Thanks.

Derick Rethans 0:36

Would you mind introducing yourself a moment?

Máté Kocsis 0:38

My name is Máté Kocsis and I'm a software engineer at LogMeIn. I've been using PHP for 15 years now. And after having followed the mailing list for quite some time, I started contributing to the project last October, and now Write Once properties is my first RFC.

Derick Rethans 0:58

What is the concept of Write Once Properties?

Máté Kocsis 1:00

Write Once Properties can only be initialised, but not modified afterwards. So you can either define a default value for them, or assign them a value, but you can't modify them later. So any other attempts to modify, unset, increment, or decrements them, would cause an exception to be thrown. Basically, this RFC would bring Java's final properties, or C#'s, read only properties to PHP. However, contrary how these languages work, this RFC would allow lazy initialization. It means that these properties don't necessarily have to be initialised until the object construction ends, so you can do that later in the object's life cycle.

Derick Rethans 1:48

PHP already has constants, which are pretty much write only properties as long as they're being defined in a class definition. How does differ?

Máté Kocsis 1:58

Yeah, it's it's the difference because, so you can assign these properties value in the constructor or anywhere. You don't don't have to define them a default value.

Derick Rethans 2:12

Okay, and of course constants have the other problem is that you can only set its values to constants, not necessarily to any sort of expressions, or the result of other method calls.

Unknown Speaker 2:22

So you can use objects, resources, any kind of property value here.

Derick Rethans 2:28 You mentioned C#'s read only properties. And you sort of mentioned them in the same breath as write ones properties for PHP. These seem like opposite things

Máté Kocsis 2:39

Not quite opposite, but there's some distinction between the two. C sharp requires these properties to be initialised until the object construction ends. And this is very difficult to achieve in PHP. And now I'm using Nikita's words: Object construction is a fuzzy term and you can be sure if, if the contractor is involved at all. For example, if you are using Doctrine or proxy manager, so we decided to allow lazy initialization, which means that you don't have to assign these properties a value, you are free to do anytime when you want.

Derick Rethans 3:22

What happens if you read them without them having being set yet?

Máté Kocsis 3:27

Initially, when I started working on this proposal, I faced the problem because untyped properties have an implicit default value in the absence of an explicit default value. That's why you just can't really use them with the write once properties. Either you have a default value or you can do anything with them. That's why we we had to only allow typed properties with the write once properties and typed properties are in an uninitialised state by default. You can't read them until you first assign them a value.

Derick Rethans 4:04

Because in PHP 7.4 that will throw a type error. So that actually ties in really nicely with PHP 7.4's initialise concept for the type hinted properties.

Máté Kocsis 4:14

Yes.

Derick Rethans 4:15

One thing that is slightly skipped over is which keyword does the RFC produced, because you mentioned final for Java and read only for C sharp, which one of you picked for PHP?

Máté Kocsis 4:25

So there were plenty of possibilities considered. The first one was the final keyword. At first, it seemed to be the obvious choice for me, but after thinking about it, I turned out that it's not not the right candidate because currently it affects inheritance rules in PHP. And now we are talking about mutability rules. We had sealed which comes from C sharp and the problem is the same because it also affects inheritance rules, so we shouldn't reuse it for different purposes. We also consider immutable. It's one I like. But it might be a little bit misleading because the usage of immutable data structures, like objects or resources are not restricted at all. Then there's locked, which is a bit too abstract or vague name.

We also have writeonce as well. And technically, it's the most accurate term. But from the user's point of view, it could be a bit confusing because they are not expected to write them at all, only the read these properties. And now we have readonly and probably this keyword get the most traction so far. And it's good. It's a good name because it refers to what users should generally do with these properties. However, there's also a slight problem that users can, or in some circumstances can, write these properties too. But that's not the general use case.

Derick Rethans 6:10

It's a curious thing. I remember we had a PHP developers meeting back in 2000, let's say 2008. But it could as well have been 2005, where we also actually spoke about read only properties, but I'm going to have to dig up the notes for that to see what it said there. Maybe you find it interesting to read to see what the history said about this.

Unknown Speaker 6:31

I'm curious. The question is open, so I plan to put it to vote.

Derick Rethans 6:37

When do you think you're putting it up for a vote?

Unknown Speaker 6:39

I think it should be close now. I will answer the mail, which came from Nicholas. I don't know if there is no more problems than we could do it this week or early next week.

Derick Rethans 6:54

As the properties are write once, how will she implement lazy loading with that? In order to do the lazy loading, you need to first figure out whether the property is already set. How will you know that it's already set? How can you check for that?

Máté Kocsis 7:07

I think generally you don't have to worry whether a property's write once or or not. Since mainly, we are talking about private or protected properties in the most cases. However, if you need this information, then you will be able to use reflection. I've already added support for method in in ReflectionProperty for this purpose.

Derick Rethans 7:31

Let me ask a little bit more about that. You mentioned that this is meant for lazy loading. I understand lazy loading is something that you do well, you're executing and all the methods. For example, on an object, you do get something and that needs to fetch things from a database. Because those write once properties are private or protected, most of the time, the code that fetches the things from the database that does the lazy loading still needs to know whether the properties already been written to. Because if it would attempt it again, you'd potentially get an exception. So how would it know it's already been written to?

Máté Kocsis 8:03

Good question. I was talking with with Marco Pivetta. His use case with proxy manager is to unset these properties in advance and then it can use the get or set or I don't know which magic methods.

Derick Rethans 8:28

I saw that the RFC mentioned a few other alternative approaches for this feature. And the headlines in the RFC say: read only semantics, write before construction semantics, and property accessors. Would you mind explaining these and why they haven't made the final RFC?

Máté Kocsis 8:44

The first one was to follow Java and C sharp, and require all write once properties to be initialised until the object construction ends. And this is what we talked about before. The counter arguments were that it's not easy to implement in PHP. This approach is unnecessarily strict. The other possibility is to let our limited writes to these properties until object construction ends and then do not allow any writes. But positive effect of this solution is that it plays well with bigger class hierarchies, where possibly multiple constructors are involved, but it still has the same problems as the previous approach. Finally, the property accessors could be an alternative to write once properties, although in my opinion, these two features are not really related to each other. But some say that property accessors could alone prevent some unintended changes from the outside and they say that maybe it might be enough. I don't share this sentiment. So in my opinion, unintended changes can come from the inside, so from the private or protected scope. And it's really easy to circumvent visibility rules in PHP. There are quite some possibilities. That's why it's a good way to protect our invariants.

Derick Rethans 10:15

What was the most criticism you got on the mailing list about his proposal?

Máté Kocsis 10:18

As far as I remember, the property accessor. The biggest criticism was that we don't really need this term, but we could use property accessors.

Derick Rethans 10:29

We have spoken a little bit about what this feature is. We went into a few use cases with lazy loading. What would other use cases for this be?

Máté Kocsis 10:38

I think it's really suitable for domain driven design, or working with value objects, and I'm a great fan of DDD. The problem is PHP can't guarantee any immutability for our objects. Just one example. You can invoke the object constructor as many times as you wish, which overrides all your properties.

Derick Rethans 11:04

I had not thought about that you can actually call the constructor yourself. And of course you can.

Máté Kocsis 11:08

Yes, me neither. I just saw somewhere probably in a previous discussion about immutable objects. That's the advantage of having write once properties. You could by using write once properties, yeah, you can prevent accidental modifications from the outside or from the inside too. And that's the main purpose.

Derick Rethans 11:32

Your main purpose wasn't lazy loading but more immutable value objects.

Máté Kocsis 11:36

Yes, yes. Right. I proposed right fans properties first, to pave the road for immutable objects because this is my main goal.

Derick Rethans 11:46

Okay, but you're going step by step. I think that's actually a wise way and Nikita have said something similar that it is nicer to take things little by little so that it is easier to convince people that this is a good feature or not.

Unknown Speaker 11:59

Actually it was Nikita's idea to split the two proposals.

Derick Rethans 12:03

That make sense. Okay, Máté, thank you for taking the time this morning to talk to me.

Máté Kocsis 12:08

Thank you for having me.

Derick Rethans 12:11

Thanks for listening to this instalment of PHP internals news, the weekly podcast dedicated to demystifying the development of the PHP language. I maintain a Patreon account for supporters of this podcast, as well as the Xdebug debugging tool. You can sign up for Patreon at https://drck.me/patreon. If you have comments or suggestions, feel free to email them to derick@phpinternals.news. Thank you for listening and I'll see you next week.


PHP Internals News: Episode 43: Syntax Tweaks

PHP Internals News: Episode 43: Syntax Tweaks

In this episode of "PHP Internals News" I chat with Nikita Popov (Twitter, GitHub, Website) about the RFCs. One on abstract methods in traits, and one about an improvement to the tokenizer.

The RSS feed for this podcast is https://derickrethans.nl/feed-phpinternalsnews.xml, you can download this episode's MP3 file, and it's available on Spotify and iTunes. There is a dedicated website: https://phpinternals.news

Transcript

Derick Rethans 0:16

Hi, I'm Derick. And this is PHP internals news, a weekly podcast dedicated to demystifying the development of the PHP language. This is Episode 43. Today I'm talking with Nikita Popov yet again about a few RFCs that he's produced for PHP 8. Good morning, Nikita. How are you doing?

Nikita 0:34

Good morning, Derick. I'm doing great.

Derick Rethans 0:37

I've given up on introducing you because we've done this so many times. Now, you don't need an introduction any more. The first RFC I wanted to talk about a little bit this morning is the abstract trait methods validation RFC. What are traits?

Nikita 0:51

We usually talk about traits as compiler assisted copy and paste. Basically, we just take all the methods and properties from a trait and copy them into the class that's using the trait. That's a bit over simplified, in particular, you can use multiple traits in the single class. And those traits might be defining the same method, in which case you have to resolve the conflict in some way. So that's where you have these insteadof or use annotations to specify precedents and aliases.

Derick Rethans 1:23

Traits has been in PHP for quite a long time. What is now the problem that you're trying to solve through this RFC?

Nikita 1:29

The problem is that traits are sometimes not self contained. So to give a specific example, we have in the logger PSR, we have a trait called logger trait, which has a bunch of methods like warning, error, info, notice, and so on. So just simple helper methods, which all called the log method with a specific log level and this trait only specified these helper methods but still requires the actual class to implement the log method. The way you'll usually indicate that is by adding an abstract method to the trait. You have all the methods you actually want to provide by the trait. And you have a number of abstract methods that the trait itself requires to work. This already works fine, but the problem is just that these methods are not actually validated, or they are only inconsistently validated. Even though the trait specifies this abstract methods, you could implement it in the class with a completely different signature.

Derick Rethans 2:30

Okay, just like any signature?

Nikita 2:32

Just like any signature right. The method still has to be present in some way. But the signature can be completely different. Could also be like different method type, like a static method, or an instance method.

Derick Rethans 2:43

Just basically checks for the name is what you're saying?

Nikita 2:46

Yeah, it only checks with the name.

Derick Rethans 2:49

Is this the only place, is this the only time where these abstract methods are not being validated. Or are there other situations where that could happen as well?

Nikita 2:57

No, I think this is the only place.

Derick Rethans 3:00

Are all the situations where these abstract methods in the trait will get validated. And also on signature?

Nikita 3:07

As I mentioned, it's not like the signatures are completely unvalidated. They are just inconsistently validated. It depends a lot on exactly how you use the trait. If you just use the trait and specify the methods of the same class, it doesn't get validated right now. If instead of the method is provided by the parent class, so it's inherited, then it does get validated. If you don't implement the method that makes the class abstract instead, then it's also going to get validated in the child class. It kind of already works halfway. And this RFC just tries to make it work always.

Derick Rethans 3:44

Okay, that seems like a reasonably good addition to almost a no brainer.

Nikita 3:48

I would say it's basically, a bug. Especially if you look at the implementation, there is clearly some validation code there. The conditions are just a little bit off, but so we do have to go through the proposal, because this is a backwards compatibility break.

Derick Rethans 4:02

Yes, I was about to ask if it's a bug fix, why bother with an RFC? But if it's a BC break then yeah, we still need to do it of course. I doubt there be many controversies about is?

Nikita 4:12

Actually there is one contentious point. Um, so something I didn't mention yet is that the RFC also allows you to define private abstract methods in traits. Normally private abstract is like a contradiction in terms because private means only visible in the same class. And abstract means it has to be implemented in the child class, you can't really have both. You can't have both with traits, because traits can see the private members in the class. I think that by itself is like not controversial. That's a reasonable thing to have a trait. The part that is controversial is what you do with existing visibility modifiers. This pattern already exists. So people already define abstract methods in traits but because right now private abstract is forbidden, the lowest they can use is actually protected abstract, even though they don't actually want that method to be publicly exposed, or even protectively exposed. So there is an argument there that we should maybe ignore the normal visibility validation that we do, and allow even implementing a protected abstract method from a trait with a private method inside the class, simply for backwards compatibility reasons.

Derick Rethans 5:21

Because if you wouldn't allow that then, how would it break things?

Nikita 5:26

It would break things because there is existing code, using these abstract protected methods simply because we don't support abstract private yet. So those code would start throwing visibility error, and I mean, could be fixed by just dropping the abstract method, but there's also not ideal.

Derick Rethans 5:45

Because people use it to make sure that, I mean it's there in the class that implements the trait pretty much. Do you have any idea when this is going to for vote?

Nikita 5:53

I think it can already go up for vote? Mainly I need to resolve that question about the visibility first.

Derick Rethans 5:59

I'm looking forward to seeing that showing up sometime soon then.

How do you call your second RFC?

Nikita 6:05

Object based token get alternative?

Derick Rethans 6:07

I think that's a great title. There's a few words in there that we might have to explain first. What are these tokens you're talking about?

Nikita 6:14

So the token_get_all function, which we already have, exposes a part of the PHP compiler infrastructure. PHP compilation generally has three steps. The first is the tokenization. The second part is the parser, and then the compiler. So the tokenizer converts the raw character stream into tokens, which encode higher level concepts, for example, that like the sequence of FUNC and so on is actually a function keyword, or that double code followed by characters is actually a string. So this part only recognises like not larger structures, like whole functions but at least the the atoms that make up language.

Derick Rethans 7:00

Would you say these are the words that make up the sentences?

Nikita 7:03

Yeah, that's that's the right analogy.

Derick Rethans 7:06

Why would you want to have access to them?

Nikita 7:08

For example, I have a PHP parser library, which converts these tokens into an actual syntax tree. And then on top of that, you can easily analyse PHP source code. So this is what all these static analyzers, like PHPStan or Psalm are based on.

Derick Rethans 7:27

Do they all use the tokens?

Nikita 7:29

Those two, in particular, use my PHP parser library, and that one uses the tokens internally. There is also other tooling that's more directly based on tokens, for example, code formatters or code style inspection tools like PHPCS. Those all directly operate on the tokens instead.

Derick Rethans 7:47

But as you say, these tokens only are words and they don't really provide a structure. How would these tools then convert that into a structure?

Nikita 7:54

If you're looking for, if you're looking just at formatting, then you may not really need a lot of structure. So you probably do need to write like that of extra code to recognise that, okay, the function token followed by white space, followed by an identifier, that's function declaration. For the more complicated tooling that builds a syntax tree, you need to implement a parser, either based in code generation, or based on recursive descent approach.

Derick Rethans 8:26

Why would you not want to have direct access to PHP's AST instead because that already provides a structure for you?

Nikita 8:33

We do have direct access to the AST through the AST PECL extension, which is not part of core yet. I don't know if there are plans in that direction.

Derick Rethans 8:43

Well you wrote it so you surely can make these plans.

Nikita 8:46

Yes, I can make them but I don't know if I should make them.

Derick Rethans 8:50

I think you should.

Nikita 8:51

I mean, the nominal advantage of the AST extension is that it's always up to date with PHP. In practice that really isn't an issue, because some of the userland tooling is also pretty quickly updated. The more practical advantage is that the extension is a lot faster than implementing this in userland code. Well, I mean, this is really one of the areas where C code is faster than PHP code. The AST extension only exposes the structure that PHP itself needs. PHP is not interested in like precise formatting, and things like that at all. So it throws away quite a few things. You can, for example, get accurate on position information. Like, where, exactly not just which line but of which column, something is defined. And that's something you're quite often interested in.

Derick Rethans 9:46

Also, from what I've known, it throws away all the comments unless they are doc bloc comments. How does the tokenizer currently return information about the tokens? I've played with this in the past and I didn't think it was the prettiest format to get back out of it.

Nikita 10:02

token_get_all returns an array of tokens. And there are generally two types of tokens. One is single character tokens, like a semicolon, or a comma, or whatever, which are just returned as a string. So it's a single character string. And then there are complex tokens, like the function keyword, like white space, like strings, which are returned as an array where the first element is the token ID, which is an integer. And we have constants defined for these integers. The second element is the actual string content of the token. So for the function keyword, that's always going to be function, but it could be written in different ways because the keyword is case insensitive, so it could be all lowercase, or uppercase, hopefully it's all lowercase.

Derick Rethans 10:52

You'll get the odd situation where the first letter is the capital, I suppose, but that's about it, hopefully.

Nikita 10:57

And finally, the last element is the line number. So the starting line number.

Derick Rethans 11:02

So if you want to look at the position on the line, you'd have to calculate it yourself?

Nikita 11:08

Right you would have to track that yourself. I mean, there are two problems. One is just that you have these single character tokens and the complex tokens using different structure. So all the codes using them as to always switch back between those; check if it's an array or a string, or a test to do some kind of normalisation itself. And the second problem is that arrays in PHP are fairly memory inefficient when it comes to storing a fixed amount of data. Storing three elements inside an array always means allocating an array for eight elements. Because its minimum array size, you have to use space to store the key, and so on. Generally, if you have a fixed structure, it's much much more efficient to store it inside an object. Using a class that has declared properties. So this makes a very large difference in some cases, especially if your array only has like two or three elements, you can save a lot of memory with it.

Derick Rethans 12:12

Have you done any benchmarks to see how much memory you'd actually save some likes some some particular scripts that you've parsed with how to tokenizer doesn't matter and how you proposing to do it?

Nikita 12:22

Yeah, I have here in the RFC, some numbers for some particular script that goes down from 14 megabytes to eight megabytes. So that's nearly half the memory usage. Well, actually, maybe I should first actually say what the RFC proposes. The RFCe proposes to instead return objects, an array of objects. And these objects have four properties. So first is again, the ID of the token, then the textual content, the line number, and also the starting position of the token in the string.

Derick Rethans 12:54

Is this something that the tokenizer extension and tracks for you?

Nikita 12:58

I mean, that's something that can easily do, so we can just as will expose it. And these objects are always used. So we no longer make the distinction between single character tokens and complex tokens. So we always return the uniform array of tokens, of token objects. Despite doing that, removing this optimization for a single character tokens, the end result is still that we use half as much memory, simply because objects are that much more efficient than arrays.

Derick Rethans 13:27

That's a clever trick. I'm sure people like that, that using less memory, at least I know I would. Is it also faster or doesn't particularly matter much?

Nikita 13:35

It's also faster, like maybe 30% or something, because memory usage and performance tend to be pretty heavily correlated. So if you use less memory, you also are faster.

Derick Rethans 13:46

That makes sense. Are you thinking of other things that you can add to the tokenizer extension to make working with them even easier?

Nikita 13:52

The way this new functionality is implemented is, we have a PHP token class and on it we have a static method getAll. So instead of calling the token_get_all function, you call PHPToken::getAll(). And one nice thing this allows you to do is to extend this token class. So you can say, MyPHPToken extends PHPToken, and then you call MyPHPToken::getAll() and then we will actually construct your extension class. That means that you can add whatever methods you like, in addition to what we provide by default.

Derick Rethans 14:29

Is that a pattern that we have in other places in PHP as well? Because I don't usually think that even if you'd call an inherited static method, why wouldn't suddenly return the inherited classes object? wDo we did it in other places?

Nikita 14:42

So this is somewhat uncommon in PHP internals. I think it's a pretty common pattern for userland where generally if you return new objects from static methods, you always use new static, not new self. This is essentially late static binding, which we did discuss quite recently. So, there is one limitation here namely that the constructor of the PHPToken class is final. So, you can extend the class and you can add extra methods, but you cannot modify the construction behaviour, because we would like to internally construct these tokens very efficiently by more or less directly writing the values into the right slot in memory and not doing slow constructor calls, becouse this functionality tends to be very performance sensitive. And the same trick where you can extend the class but not change the constructor is also used by the SimpleXML extension. Does exist but not very common in, generally where internal code is concerned, we usually do not really plan for extension. I think nowadays we mark nearly all internal all new internal classes as final simply because extension is such a pain to deal with. And for old classes who usually wish that we had marked them as final. I mean, this is also a general recommendation for userland that, like you should mark things final as much as you can get away with it. But it's much bigger concern for internals because dealing with userland extensions that do unexpected things is much harder for us.

Derick Rethans 16:23

You even need to make sure that your internal structures are properly constructed by the parent's constructor being called from inherited classes but in PHP, there's no such requirement that you do. Pretty sure I've had problems with that for the Date extension a long, long time ago, where people would extend from it, not call the constructor. And then because he didn't think of it, nothing is defined and everything just falls down.

Nikita 16:44

Yeah, so this is one of the common problems. And the other one is that internal classes often define custom object handlers. So that's something only internal classes can do. Just to give one example, they can define debug info handler that modifies the output of var_dump, but nowadays we also have the user land magic methods on get you back into and I think pretty much all internal classes are just going to ignore that, and always return their own internal debug information even if this method has been overwritten, simply because no internal class actually checks for that. And this kind of problem also exists for a lot of other magic, and generally no one takes it into account, and things are just more or less softly broken.

Derick Rethans 17:31

Very recently there was a pull request for Xdebug to change that as well because in Xdebug's debugging output get sent to IDEs. For internal classes always uses internal get debug handler, and for userland classes it uses whatever is userland defined; I mean if there's a magic method we'll use that. The pull request wanted to change Xdebug in such a way that it would also use the get debug info magic method for internal classes, whenever overridden. After some discussion about this, we figured out, this is probably a bad idea to do, and hence, we haven't merged that. Although we end up fixing some other things that the developer also found. That's a curious situation to be in. We would like us to be sort of work the same. But at the same time, you sometimes really want to see the internal information from the classes without developers having hidden the information behind it, right.

Nikita 18:20

Yeah, that's true.

Derick Rethans 18:21

And that is just from a from a debug perspective. And even from, let's make sure things don't crash perspective. I see that the RFC also rejected a few features that aren't part of the current iteration yet or might make sense to add it later. And one of them is about a lazy token stream. What would that be and what sort of different interface would it have?

Nikita 18:43

The lazy token stream basically just means that instead of returning an array of tokens, we return an iterator of tokens, which means that we do not have to store the full array in memory, which, like for the example, I used. The memory usage for the whole token array was eight megabytes, even after these memory usage improvements, which wasn't a fairly large file, but definitely not the largest file. You can encounter especially when it comes to generated files. So there is an advantage of processing tokens one by one as a stream, because then your memory usage is going to be basically O(1), not O(n). The problem is, I mean, the PHP lexer does indeed work one token at a time, so it can support it. The problem is that it has a lot of internal state. And in order to implement this iterator, we would have to backup and restore the state on each produced token to make sure that it's still possible to for example, include and compile other files in the meantime. So this is something that can be improved; we can make that cheaper, but that would be a larger effort. And I'm not really sure it's worthwhile because, while you can process one token at a time. And this is, for example, what the PHP parser does internally. Many practical applications in userland will generally want to have all tokens as an array. Because it makes it simply, makes things easier if you can always look ahead and look back. And I think it would be fairly hard to rewrite the existing libraries in terms of the latest tree. It may be a nice to have, but I'm not the most useful thing for it now.

Derick Rethans 20:32

What has been the feedback for this RFC?

Nikita 20:35

I think pretty good. This is something that we've already discussed years ago. Last time the discussion kind of got a bit got a bit sidetracked. Yeah, one of the dangerous when you start introducing object oriented interfaces. Well, actually, I just call this RFC object-based intentionally, because when you do object oriented then people would like to have their tokens, and their token streams, and their token stream factories, and the token stream managers. And this is basically held this the whole time. But generally everyone who is working on tokens, which is not a lot of people, but those who are working with them know that memory usage is a problem. And the current, current inconsistent structure is a problem, which is why most of them implement their own token objects, and basically do the same thing we propose here just themselves.

Derick Rethans 21:30

When it's this one going up for a vote at the same time?

Nikita 21:32

Soon.

Derick Rethans 21:33

Both of these RFCs that we spoken about today are both targeted to a PHP eight, I suppose?

Nikita 21:37

Yeah. So right now, I think all RFCs are targeted at PHP 8.

Derick Rethans 21:42

Thank you for taking the time with me today, Nikita to talk about a bunch of little RFCs that you've written. Perhaps by the time this podcast comes out, we've started voting on them and see what happens to them.

Nikita 21:52

Thanks for having me once again.

Derick Rethans 21:56

Thanks for listening to this instalment of PHP internals news. The weekly podcast dedicated to demystifying the development of the PHP language. I maintain a Patreon account for supporters of this podcast, as well as the Xdebug debugging tool. You can sign up for Patreon at https://drck.me/patreon. If you have comments or suggestions, feel free to email them to derick@phpinternals.news. Thank you for listening, and I'll see you next week.