It seems like the Python maintainers always try to complicate things for themselves to cater to some obscure use case that nobody uses. I mean really, when was the last time you ever accessed a namedtuple by index?? (For vectors I do see a benefit in being able to throw it right into a matrix multiplication designed for arrays, but really, you can do that equally easily in other ways.) And constructing them on the fly doesn't make much sense either, as it defeats the benefit of a shared type across the code base.
All people ever wanted was just a way to write in one line `Point3D = struct(x, y, z)` that translates into the exact equivalent of:
class Point3D:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
What would be the startup performance of parsing the above 5 line class compared to a namedtuple? Surely it should be faster to create it with a shorthand form built into the language, if the functionality is equivalent?
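For comparison, the namedtuple one-liner is already close to the proposed shorthand (a sketch, reusing the `Point3D` name from above):

```python
from collections import namedtuple

# One line, roughly the proposed `Point3D = struct(x, y, z)`:
Point3D = namedtuple("Point3D", ["x", "y", "z"])

p = Point3D(1, 2, 3)
print(p.x, p.y, p.z)  # same attribute access as the hand-written class
```

The debate is therefore less about the one-line spelling and more about whether the extra tuple machinery it brings along is a feature or a liability.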
One advantage of namedtuple is that it is just a tuple! You can pass the namedtuple to code that expects a tuple, even C code that expects a tuple. This makes it easy to add names to tuples after the fact without breaking existing code.
In one project, I use a Point2D class as a namedtuple. Because it is just a tuple, I can easily convert it to a numpy array when working with a collection of points.
> I mean really, when was the last time you ever accessed a namedtuple by index?
Yesterday. I use this quite a bit. I like non-named tuples as a data structure and use them regularly, and often partially upgrade a tuple to a named tuple.
Changing these sorts of things is what caused the Python 2/3 split, I don't think we as a community want that again for a while, so backwards compatibility for things like this that aren't _that_ uncommon should be a priority.
Agreed, I use the fact that "namedtuples are tuples" constantly. For one thing, you can pass them to 3rd party libraries that take tuples while maintaining readability on your end. I would be horrified if they dropped index access. It would break everything.
It means that if you have a data structure that is represented by a tuple, you can switch the creation call with a namedtuple of the same size, keeping the tuple API, therefore not breaking any older code that may rely on it, while at the same time providing a cleaner interface (using the attribute names) to newer code.
Of course, old code should be upgraded to use attribute names, but it's really convenient to be able to upgrade it painlessly.
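A minimal sketch of that upgrade path (the function and type names here are invented for illustration):

```python
from collections import namedtuple

Point2D = namedtuple("Point2D", ["x", "y"])

# Previously this returned a bare tuple; swapping in a namedtuple of
# the same size keeps the tuple API intact for old callers.
def get_point():
    return Point2D(1.0, 2.0)

p = get_point()
assert p[0] == 1.0 and p[1] == 2.0   # old index-based code still works
x, y = p                             # destructuring assignment still works
assert p.x == 1.0                    # new code gets the cleaner names
```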
As a concrete example, in Python 1.5.2, urlparse.py:urlparse returned a 6-element tuple:
# Parse a URL into 6 components:
# <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
# Return a 6-tuple: (scheme, netloc, path, params, query, fragment).
# Note that we don't break the components up in smaller bits
# (e.g. netloc is a single string) and we don't expand % escapes.
def urlparse(url, scheme = '', allow_fragments = 1):
    ...
    tuple = scheme, netloc, url, params, query, fragment
    _parse_cache[key] = tuple
    return tuple
This API was around for years, with people using tuple indexing either explicitly or through destructuring assignment.
This was replaced with a namedtuple. The following is from Python 2.7.10:
class ParseResult(namedtuple('ParseResult', 'scheme netloc path params query fragment'), ResultMixin):
    __slots__ = ()
    def geturl(self):
        return urlunparse(self)
def urlparse(url, scheme='', allow_fragments=True):
    """Parse a URL into 6 components:
    <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
    Return a 6-tuple: (scheme, netloc, path, params, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    tuple = urlsplit(url, scheme, allow_fragments)
    scheme, netloc, url, query, fragment = tuple
    if scheme in uses_params and ';' in url:
        url, params = _splitparams(url)
    else:
        params = ''
    return ParseResult(scheme, netloc, url, params, query, fragment)
The old tuple-based API still works, so it doesn't break backwards compatibility, while it also gives people the ability to migrate to an attribute-based API.
That said, it's very hard to remove the old API. The latest version of Python, in urllib/parse.py:urlparse, still supports tuple-based indexing.
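You can see both APIs on the same result object in any recent Python:

```python
from urllib.parse import urlparse

r = urlparse("http://example.com/path;p?q=1#frag")

# Old tuple-based API still works:
assert r[0] == "http"
assert r[1] == "example.com"

# ...as does destructuring assignment:
scheme, netloc, path, params, query, fragment = r

# The newer attribute-based API reads much better:
assert r.scheme == "http" and r.query == "q=1"
```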
Perhaps Python 4 will remove that behavior? Probably not. At the very least there would need to be a namedtuple-migration type which gives a warning if the old tuple API is used when only attribute lookup is expected.
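A sketch of what such a migration type might look like. This is entirely hypothetical; the class name and warning text are made up, and it only covers explicit indexing (iteration and unpacking go through tuple's own `__iter__`, so they stay silent):

```python
import warnings
from collections import namedtuple

_PointBase = namedtuple("Point", ["x", "y"])

class Point(_PointBase):
    """Hypothetical migration helper: tuple-style indexing still works,
    but emits a DeprecationWarning nudging callers toward attributes."""
    __slots__ = ()

    def __getitem__(self, index):
        warnings.warn("index access on Point is deprecated; use attributes",
                      DeprecationWarning, stacklevel=2)
        return tuple.__getitem__(self, index)

p = Point(1, 2)
# Attribute access; on CPython 3.8+ the generated field accessors
# bypass __getitem__, so this stays silent:
assert p.x == 1
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert p[0] == 1  # old API still works, but warns
assert caught and issubclass(caught[0].category, DeprecationWarning)
```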
Hmm, I think I meant to say something like "using either explicit tuple indexing or destructuring assignment". Words, mixed up, they were. But yes, what I wrote was wrong.
> All people ever wanted was just a way to write in one line `Point3D = struct(x, y, z)` that translates into the exact equivalent of [... a regular class with just x, y, and z attributes].
You're giving up a lot:
* The namedtuple is dramatically more space efficient (which will be important if you have a lot of points).
* The namedtuple has a nice self-describing __repr__() (which can be important when you're debugging):
>>> focal_point
Point3D(x=10.2, y=8.5, z=13.1)
* The namedtuple defines an equality relation (which is important if you ever want to recognize that two points are the same or need to find out whether a given point is in a container):
>>> new_point in visited_points
True
* The namedtuple is hashable (this will be important if you want to pass points into a function that uses functools.lru_cache(), for example).
* The namedtuple is unpackable. This will be important with loops such as:
for x, y, z in points:
    print(x*y - y*z)
* The namedtuple works with just about every existing tool in the language that expects regular tuples (such as old-style string formatting).
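Several of those points can be demonstrated together in a few lines:

```python
from collections import namedtuple

Point3D = namedtuple("Point3D", "x y z")
p = Point3D(x=10.2, y=8.5, z=13.1)

# Self-describing repr:
assert repr(p) == "Point3D(x=10.2, y=8.5, z=13.1)"

# Equality and hashability (works in sets and as dict keys):
visited_points = {Point3D(10.2, 8.5, 13.1)}
assert p in visited_points

# Unpacking:
x, y, z = p
assert (x, y, z) == (10.2, 8.5, 13.1)

# Works where a plain tuple is expected (old-style formatting):
assert "%s, %s, %s" % p == "10.2, 8.5, 13.1"
```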
Maybe exact equivalent was an exaggeration, but anyway..
Half of those examples are only valid because x, y, z in particular have a de facto ordering that everybody follows; for any other record type it's just confusing (what does slicing an EmployeeRecord mean??!). The csv module will produce named records based on the header, so I don't see the use there. With sqlite3 you can set row_factory=Row, which produces results that already have access by name. If you for some reason need to convert to a dictionary, you could always use __dict__.
Comparability, space, hashing and immutability are important though. Although frankly, I can't take any immutability argument seriously, since Python has refused to add constants for decades; it's just hypocrisy to one day have the philosophy that everything is dynamic and then the next day say it's not.
I agree. Change the class slightly and several of those methods become a big negative. As I've argued, I don't want to include API features "for free" if I don't really need them, because they will become a support headache for me in the future.
Here's what I mean. Let's start with the Point(x, y, z) class. If I want to tweak it a bit to add a pre-computed 'r' attribute, I can do the following in a normal class:
class Point:
    __slots__ = ("x", "y", "z", "r")

    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z
        self.r = (x*x + y*y + z*z)**0.5
(If I want to enforce immutability I would use the attrs library.)
I can't simply tweak a namedtuple to support that, because the "r" will leak into the __init__, _asdict(), __iter__, __repr__, and other methods. Nor can I write "for x, y, z in points" any more.
Now, consider a point expressed in spherical coordinates (r, theta, phi). The two points (0, 0, 0) and (0, 45, 180) are identical, because both are at the origin. The two points (10, 0, 0) and (10, 360, 0) are also identical, because both are at (10, 0, 0) in Cartesian coordinates. (Angles expressed in degrees.) Either the __init__ needs to normalize the values to some standard range, or the __eq__ needs to check for degenerate cases. The default __eq__ and __hash__ for namedtuple are valid if all coordinates are in standard form, but that's incomplete.
(Huh, and I see that the tuple behavior is that (nan,) == (nan,) even though nan != nan. This doesn't make sense for some cases.)
I disagree with the proposal that slicing is "useful for computing projections or partials to 2D points". In addition to the weirdness in slicing an Employee record, how do you project along the y axis using slicing? Are people really supposed to write [0::2]? Or [2:-1:-2] if they want the projection on the other side? That seems like a horrible API. Plus, the result is an anonymous tuple, when a Point2D makes more sense.
For that matter, in spherical coordinates there is no meaning to point[::-1], so why do I want the API to support that in the first place?
Next, consider a point expressed in homogeneous coordinates x, y, z, w. Slicing here is worthless as a way to give a projection. What does point4[2:] mean?
In a real system of course, you project given a viewpoint and orientation, and are not limited to a projection along the axis, which is all that a slice might be able to give you. Of course, in a real system you store all of the points in an array because that "is dramatically more space efficient".
Then there's the _make() method, which I'm struggling to understand why it's useful. It seems to save only a few characters over either of the following:
for emp in (EmployeeRecord(*x) for x in csv.reader(file)):
    print(emp.name, emp.title)

for row in csv.reader(file):
    emp = EmployeeRecord(*row)
    print(emp.name, emp.title)
The latter is also, IMO, easier to understand and maintain for a wider number of people.
On a deeper level, I can't see using this because most code would expect EmployeeRecord.age to be an integer. It's not easy to change the example code to, say, list those which are legally allowed to sell alcohol in the US:
for emp in map(EmployeeRecord._make, csv.reader(open("employees.csv", "rb"))):
    if emp.age >= 18:
        print(emp.name, emp.title)
Even changing it to "18" gives a problem because "100" < "18".
At least the sqlite example will (likely) put the right types in the object, but again it's only saving a couple of keystrokes at most. Why learn a specialized API function for that?
As for the "Many other little things", those appear to be solutions in search of a problem. As Too points out, if I needed an _asdict() I could often do obj.__dict__.copy(). In practice, I use __dict__ access in debugging - rarely - and don't recall ever using that in production code. Why is _asdict() so important that it should be part of my API? If it is so important, why don't more people include it with their own class definitions?
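For what it's worth, the two spellings side by side (a sketch; the class names are invented):

```python
from collections import namedtuple

Point = namedtuple("Point", "x y")
p = Point(1, 2)
# namedtuple's built-in conversion (an OrderedDict before 3.8, a dict after):
assert dict(p._asdict()) == {"x": 1, "y": 2}

# The plain-class equivalent suggested above, when there's no __slots__:
class PlainPoint:
    def __init__(self, x, y):
        self.x, self.y = x, y

q = PlainPoint(1, 2)
assert q.__dict__.copy() == {"x": 1, "y": 2}
```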
I don't want to litter my API with functions that are useless, which is what namedtuple invites me to do by having me pretend that my data is "super" tuple-like.
>It seems like the python maintainers always try to complicate things for themselves to cater for some obscure use case that nobody use. I mean really, when was the last time you ever accessed a namedtuple by index??
Sure, but if you want an immutable object with iter, reasonable equality operators baked in, and named access, you can either define iter, eq, neq, and le on a new style slotted object with a bunch of getter descriptors, or you can make a namedtuple.
I suppose if python had object destructuring like newish javascript (I think that's what it's called) it would be trivial. I don't know what you'd do instead in this case that doesn't involve adding additional methods or something.
Object destructuring is tuple unpacking in python which is exactly what the code example does. What the parent was remarking on was caring about the performance of that rather than the ability.
Your version is mutable, whereas a namedtuple is immutable. A very important difference for my field. I use namedtuples all the time.
Also, I can't expand an instance of your Point3d as arguments to a function, DrawPoint(*p), I would have to write DrawPoint(p.x, p.y, p.z), did I preserve order correctly?
Wouldn't it be so convenient if Python provided a way to define immutable classes? Also, you can always provide a `draw_point(*p.args)` which isn't bad at all.
You can get close. Set __slots__ and then define each as a property with no setter and no deleter. However, an object can still be modified if it stores a mutable object (list, dict) and, while I'm not in front of a computer atm, I believe you can overwrite __slots__ to do whatever.
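A sketch of that approach: __slots__ plus setter-less properties gets you close to read-only, though the underscore-prefixed slots are still reachable directly:

```python
class Point3D:
    __slots__ = ("_x", "_y", "_z")

    def __init__(self, x, y, z):
        self._x, self._y, self._z = x, y, z

    # Properties with no setter/deleter: assignment raises AttributeError.
    x = property(lambda self: self._x)
    y = property(lambda self: self._y)
    z = property(lambda self: self._z)

p = Point3D(1, 2, 3)
assert p.x == 1
try:
    p.x = 99
except AttributeError:
    pass
else:
    raise AssertionError("expected AttributeError")
# The loophole: p._x = 99 still works, so this is only near-immutable.
```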
Your Point3D class is _not_ functionally equivalent to a NamedTuple. The class is so minimal that I wouldn't bother calling it a class as much as I would call it a glorified dictionary.
I do like the concept of structs. I first used them in Ruby, so I figured that all "very high level" languages had the concept. In Python, however, "struct" has a very different meaning. sigh It's intended for encoding/decoding binary data from strings/bytes.
Namedtuple is used a lot in the standard library to return struct-like results which used to be returned as an actual tuple, so making them non-indexable would break a lot of code, some of which was written before namedtuple even existed.
There are 13 locations in the standard library where a namedtuple was used to replace a tuple, where making them non-indexable would break code.
There are 8 locations where it's used as a shorter way to write a simple class. My point in that thread was that namedtuple should never be used this way.
>It seems like the python maintainers always try to complicate things for themselves to cater for some obscure use case that nobody use. I mean really, when was the last time you ever accessed a namedtuple by index??
Notice how when someone asks something like that, some bizarro practitioners of such methods would always appear and say how they are invaluable for them...
Because the whole point of namedtuples is to have them as a convenient quick notation for records.
But I was referring to the parent's general observation about "catering to some obscure use case that nobody uses", which I agree with -- not about their specific example.
Order is important for "pure" Python tuples, so I see no reason why it shouldn't be the case for namedtuple. Mathematical tuples are ordered as well:
"In mathematics a tuple is a finite ordered list (sequence) of elements. An n-tuple is a sequence (or ordered list) of n elements, where n is a non-negative integer. There is only one 0-tuple, an empty sequence. An n-tuple is defined inductively using the construction of an ordered pair." (Wikipedia)
Should order be important for a Tuple under a system that defines a Tuple as an immutable ordering?
That is, whatever your philosophical issues with namedtuple ordering, in Python a tuple is an immutable ordered list of things. In practice, tuples have taken on the role of being type-heterogeneous, known-length records.
One might then reasonably expect a namedtuple to be a type-heterogeneous, known-length record where the attributes are named.
Tuples are ordered by definition. We use names for things to convey what they are, not what we want them to be. If you want a construct in which order does not matter, call it a set or a dict.
It's not about the storage (besides, Python also has ordered dicts).
Namedtuples are for the convenience of creating a record type (though indeed as others say, treating it as tuple where things expect a tuple is also a benefit).
I would expect that construction to result in identical behavior for dict() and certainly not anything calling itself a tuple. NamedTuples behaving in the manner you expect violates LSP with respect to tuple().
In the relational model the order shouldn't matter:
A relation is defined as a set of n-tuples. In both mathematics and the
relational database model, a set is an unordered collection of unique, non-
duplicated items, although some DBMSs impose an order to their data. In
mathematics, a tuple has an order, and allows for duplication. E. F. Codd
originally defined tuples using this mathematical definition.
Later, it was one of E. F. Codd's great insights that using attribute names
instead of an ordering would be so much more convenient (in general) in a
computer language based on relations. This insight is still being used today.
Though the concept has changed, the name "tuple" has not.
It's invaluable in several use cases: when working with external code that expects a tuple, when unpacking a tuple (a, b = c), and especially when iterating over them in a loop (for a, b in c).
You have used some of these methods before. So are you bizarro?
No. Nobody is bizarro for mentioning this, it's extremely handy and it's kind of in the name: namedtuple. Without this they are utterly pointless.
Although I can't tell if the moral of this is "if the api allows, it will be done, and you can't break the api later", or if the moral is "just because you think it's useless doesn't mean it really is".
That would be the bizarro practitioners who expect subclassing to actually be subclassing. Maybe there is a sane argument that namedtuple shouldn't be a subclass of tuple but that ship sailed a long time ago.
Python has really painted themselves into a corner. The language is so extremely dynamic with everything being allowed and people actually relying on such things that you can't even look at the language without breaking backwards compatibility. https://xkcd.com/1172/ is a very accurate description of the current state.
Python has painted itself into the top position on the most-popular list, and has been in the top three for as long as I can remember, mostly because of its inherent internal dynamism. People relying on decades-old language features is the opposite of a problem.
People use python because it has nice syntax, batteries included standard library, easy to make code for both windows and Linux, no compilation, big community, and it's the closest arm to reach out to when you realise your shell scripts are getting too big.
Not because you can monkeypatch the _attrs_ function on the fly somewhere on a class in the standard library.
> What would be the startup performance of parsing the above 5 line class compared to a namedtuple?
I think https://news.ycombinator.com/item?id=15135147 explains why you can't optimize namedtuple like that, since dynamic dispatch makes calls to them not "the exact equivalent" of that code. The proposal to move them to a top level data structure could solve this problem though, I think.
Check out [attrs](http://attrs.readthedocs.io/en/stable/) for a very powerful, yet equivalently short, way of defining classes. Personally, I prefer defining them with the decorator rather than the shorthand, but attrs is definitely my library of choice for value-semantic types.
The main issue I see is caring too much about other implementations of Python.
While that is a very kind behavior, it brings more complexity to the issue.
The CPython developers should, imho, define/consolidate the API and then choose the implementation that yields the best performance for their implementation.
Developers of other Python implementations will surely find a way to implement that API with optimal performance with regard to the peculiarities of their own implementations.
This article is ostensibly about how the current implementation of namedtuples has had serious consequences for the startup time of Python, because namedtuples are used in the compilation of classes (roughly). However, somehow this buries the lede: the most "interesting" discussion is kicked off by the first comment:
Issues around the performance of Python and programs
written in it have far wider consequences than startup
time. During all the time any Python program is running,
its host machine is consuming power that typically depends
on pumping CO2 into the atmosphere. If most of that power is
wasted, the effects go far beyond extra money to buy
it, or to operate extra servers, or users who wait a
little longer. The carbon footprint of a Python program
that runs throughout a data center, or many data centers,
adds up.
There was an article earlier on HN about the energy consumption pattern of Bitcoin/Ethereum and presumably any blockchain that implements a proof-of-work protocol/scheme, and between that article and this comment - I've started to notice a growing unease (I am probably waaay behind on the uptake) about the "world-eating" capacity of software.
I wonder how quantifiable implementation decisions like the ones exhibited by namedtuples in Python are; one might argue these are an unfortunate/accidental side effect vis-a-vis energy consumption, versus ones like proof-of-work, which I would argue are explicitly designed to be expensive.
And if anything should come of that quantification: does optimizing code really become a moral imperative, and if so, are there usability and refactorability metrics, often held in high regard, that we ought to consider abandoning in the name of "energy efficient" software?
Obviously, this isn't a simple tradeoff, software that is difficult to write because it is highly optimized is difficult to maintain, and it might be the case that performance derived energy savings are outweighed by the energy cost of maintenance (literally, the energy cost of debugging and testing).
I do believe that we should consider a moral obligation to optimize the sh$$ out of widely used software. Inefficient algorithms running in millions or billions of instances (cloud data centers, smartphones, home routers, smart TVs...) create a considerable environmental footprint fast. This is especially true as long as we are powering them using non-renewable energy sources.
The amount of effort that can be spent on optimizations of this kind of widely used software is enormous before the balance becomes negative. An optimization that shaves off 1 second of CPU time per year on each of a million devices will result in a net reduction in energy usage within the first year, even if it takes 30 eight-hour work days to develop.
This train of thought becomes particularly nasty when you realize that all the computational overhead introduced by the widespread usage of encryption (https etc.) must necessarily lead to environmental damage. To put it in the most provocative way I can think of right now: which is more important: the security of your personal data now or the safety and wellbeing of future generations of humankind?
There is a drive to improve the performance of dynamic languages. Probably not a moral imperative but a business imperative. Everybody remembers what happened to JavaScript, from slow to fast in a few years. The same forces are at work on Python (or Ruby, which gets faster at every iteration.)
But those tradeoffs are difficult to assess. Example: which is greener, unoptimized code that took 1 hour to develop or optimized code that took 1 day? I guess it depends on how long that code has to run. And what if the company spending 10x to optimize all code is forced out of business by another company that got all the customers while they were optimizing code? Then all that optimized code is wasted.
Concentrating optimizations in the compiler or interpreter scales better IMHO.
> The same forces are at work on Python (or Ruby, which gets faster at every iteration.)
Python performance has been one step forward two steps back. For example, dict() has significantly evolved over time and some other things became faster as well, but most stuff just becomes slower as more stuff is added or previously native code is replaced by pure Python (see io in 3.0, which was reverted, or import).
Ditto for Python application performance. Python has exactly no zero-cost abstractions, so when something is refactored to use classes or big methods are split up, the result is almost always slower than before.
As long as ridiculous concepts like fossil-fuel burning cars are legal and in wide use, I think it is premature to optimize for program power consumption.
It's interesting to run some numbers here. Say you're a software company that's totally dedicated to removing CO2 from the atmosphere: you run a business, but that's just to fund your CO2 reduction efforts. Should you rewrite your python code in C? Sure, it'll reduce your carbon footprint, but it'll deduct from your profits, which directly reduce atmospheric CO2.
I'm trying to compute the best case scenario for a rewrite here, and I've tried to err in that direction whenever possible.
US power consumption in 2015 was 4,144.3 TWh, and CO2 production in 2016 (presumably higher than 2015) was 1821 Tg. Electricity costs in Oklahoma (the cheapest state in the US according to [0]) are 6.8c/kWh.
I'm assuming that you're reducing CO2 by buying carbon offsets: if you have a better way, that makes the case for a rewrite weaker. Poking around a bit leads me to [1], and taking one of the higher values on that page gives me $15/ton (assuming imperial to favor a rewrite). Multiplying all that out gives me that for every $100 spent on electricity, you need to spend an additional $10 to cancel it out.
That's... something. But of course, datacenters don't spend all their money on electricity: one of the higher figures I could find [2] gives me 20%. On the other hand, the rest of the datacenter generates CO2 as well. I found tracking down datacenter-specific data really hard here. The CO2 impact of computers is mostly in manufacturing, for example [3]. On the other hand, servers see much more use than a typical consumer box, which complicates things. Ideally we'd break down the CO2 emissions for each component of the datacenter. But that seems hard, so let's instead assume the rest of the datacenter produces as much CO2/dollar as the electricity does. This seems really high: utilities as a whole make up 1.6% of the US GDP[4] but account for 29% of the US's CO2 output[5]. Still, even with that assumption, your datacenter externalities are 10%.
If you're right on the borderline here, then maybe that's enough to push you to do the rewrite. But the mantra that "computers are cheap, developers are expensive" still holds when you take externalities into account.
Now of course, most companies don't exist to reduce atmospheric CO2. But if you're going to exhort a company to reduce CO2, it's almost certainly cheaper for them (or more effective for you, depending on how you look at it) to just buy carbon offsets than rewrite all their Python in <faster language here>.
This is entirely off-topic, but you've misread that table. Those are the stats for the North Island. The total split is actually:
57% hydro, 16% gas, 16% geothermal, 5% wind, 4% coal. As an aside, the South Island generates 98% of its electricity via hydro, so if he's in the South Island it would be fair to say his PC is powered entirely by hydro.
Why don't languages like Python have the concept of a C-like struct? Seems like it should be straight forward to have with no downsides that I can think of.
I had this question for a while too, but I realized after hacking on the Python interpreter that it breaks the execution model:
- Python compiles a single module to bytecode at a time (.py to .pyc)
- It interleaves execution and compilation -- imports are dynamic, you can put arbitrary computation between them, and conditionally import, etc.
If you had structs, then you would need to resolve names to offsets at compile time -- e.g p.x is offset 0, p.y is offset 4. That's a form of type information. Types can be defined in one module and used in another, and Python really has no infrastructure for that.
Nor do any other dynamic languages like Ruby, Perl, JavaScript, Lua, or PHP, as far as I can tell. They are all based on hash tables because deferring name lookup until runtime means that I can start executing "now" rather than looking all over the program for types. It probably helps a bit with the REPL too.
The need for type information in a translation unit is also what gives you C's header file mess, so it's not a trivial problem.
Actually, Python does have that :) It's called __slots__:
The presence of __slots__ does several things. First, it restricts the valid set of attribute names on an object to exactly those names listed. Second, since the attributes are now fixed, it is no longer necessary to store attributes in an instance dictionary, so the __dict__ attribute is removed. Instead, the attributes can be stored in predetermined locations within an array. Thus, every slot attribute is actually a descriptor object that knows how to set/get each attribute using an array index. Underneath the covers, the implementation of this feature is done entirely in C and is highly efficient.
It still involves a dictionary lookup to access the member, doesn't it? It's just that the dictionary is located at the class level instead of the instance level, to fetch the offset of the member within the instance.
The goal of using __slots__ is to save memory; instead of using a .__dict__ mapping on the instance, the class has descriptor objects for each and every attribute.
So python still has to look at the class for each attribute access on an instance of Foo (to find the descriptor). Any unknown attribute (say, Foo.ham) will still result in Python looking through the class MRO to search for that attribute, and that includes dictionary searches.
If I have time I'll write an example program to test this, since I've been curious for a while.
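A quick sketch of such a test, showing that a slotted class really does lose its per-instance __dict__:

```python
class WithDict:
    pass

class WithSlots:
    __slots__ = ("x",)

a = WithDict()
a.x = 1
assert "x" in a.__dict__            # attribute lives in the instance dict

b = WithSlots()
b.x = 1
assert not hasattr(b, "__dict__")   # no instance dict: stored in a fixed slot
try:
    b.y = 2                         # names outside __slots__ are rejected
except AttributeError:
    pass
else:
    raise AssertionError("expected AttributeError")
```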
I've actually forked the Python interpreter to make my highly compatible bash shell faster and smaller, while not completely rewriting it [1].
-----
Even if __slots__ did what you think it does, it doesn't completely solve the problem.
A lookup like "return os.path.exists(foo)" can't use slots, because it's a module and not a class.
In principle, local variable lookups are also slower in Python than say Dalvik (a JVM without a JIT). LOAD_FAST still involves a dict lookup (as opposed to the even slower LOAD_ATTR). It just happens at the beginning of the function once, not on say every iteration of a loop inside the function.
In contrast, a Java compiler will resolve the attribute access at compile time.
Also, if slots did what you think it does, it would be 10x faster than normal attribute lookup, but it's not (according to the benchmark above, and at least one other one I saw.)
EDIT: My last two points might seem like a tangent, since they're not really related to structs. But my overall point is that Python doesn't do very much at compile time, so some of that work has to be done at runtime. slots could easily be faster than it is; it's mainly a memory optimization.
> So python still has to look at the class for each attribute access on an instance of Foo (to find the descriptor). Any unknown attribute (say, Foo.ham) will still result in Python looking through the class MRO to search for that attribute, and that includes dictionary searches.
Yes, it has to lookup, but the lookup is cached (see typeobject.c).
Where in typeobject.c? I looked, and there's a lot of stuff related to slots, but I don't see any caching.
Also I did a speed test on my machine. My results are the same as the ones I've found through Google: __slots__ is not much faster. It's maybe 20% faster, whereas if it was really a dict lookup vs. index lookup I would expect it to be an order magnitude faster, or at least 5 times faster.
To be clear, I'm claiming that every time you do "p.x" with a class with __slots__, at least one dictionary lookup occurs. It's not like p.x in C or Java.
I could be wrong, but I've never heard the claim you're making, and I don't see any evidence it's true.
Hm, I did a test with a coverage-instrumented Python binary, and you are right.
I counted the number of times lookdict_string in dictobject.c is called -- call that number D. And then I made N attribute accesses like 'p.x'.
Without slots, if N goes up, then D goes up. With slots, as N goes up, D stays constant.
Thanks for the information -- I'll look into this more.
(However, as mentioned, slots are not more than 20-30% faster on my machine, in this limited benchmark. I feel like this is a double mystery. Or maybe it's just that interpreter loop overhead or some other overhead is the bottleneck, and not dict access.)
I'm talking to myself at this point, but I think the problem with the caching is that every attribute lookup p.x has to consult the type, not just the object. That's not true in Java.
So you avoid the dictionary lookup, but you're still following extra pointers.
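A minimal sketch of the kind of micro-benchmark being discussed (the class names are invented here, and absolute timings vary by machine and interpreter):

```python
import timeit

class WithDict:
    def __init__(self):
        self.x = 1

class WithSlots:
    __slots__ = ('x',)
    def __init__(self):
        self.x = 1

d, s = WithDict(), WithSlots()

# Attribute reads: with __slots__ the value lives at a fixed offset in the
# instance, but every access still consults the type to find the descriptor,
# so in practice slots is often only modestly faster, not 10x.
t_dict = timeit.timeit('obj.x', globals={'obj': d})
t_slots = timeit.timeit('obj.x', globals={'obj': s})
print(f'dict: {t_dict:.3f}s  slots: {t_slots:.3f}s')
```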
> C header file mess is caused by its designers ignoring the work outside AT&T and not bothering to implement a module system.
To be fair to them, they wanted to create a language whose compiler did fit into the limited hardware they had.
It's not their fault that UNIX had so much success and the world largely went there ignoring most of the progress made in programming tools and languages during the 70s.
That success was mostly caused by AT&T not being able to sell UNIX (at least during the first years of UNIX), so they just made the source code available to everyone willing to pay for a symbolic license, which was a tiny fraction of what vendors were charging for their own systems.
An OS and systems programming language available almost for free, including source code, versus what a commercial mainframe, its OS, and SDK would cost without source code included: a sure recipe for success.
Modules are unloved children. I recently used Scheme48, which has a very nice module system (funny, considering how old it is), even though others don't have one.
> and Python really has no infrastructure for that.
Nor do any other dynamic languages like Ruby, Perl, JavaScript, Lua, or PHP, as far as I can tell.
This is not correct.
Common Lisp easily supports structs, and is a dynamic language. No need to specify types for structs, although you can optionally do it in CL if you want.
Here's a textbook example (straight from the CLTL2 book...)
Here a struct is defined (only one line needed), and then the struct is used.
;; structure definition
(defstruct door knob-color width material)
;; Automatically creates the following function:
;; make-door
;; And accessors (can be used for getting and setting):
;; door-knob-color
;; door-width
;; door-material
;; ----------------------
;; 'instancing' the structure into "my-door"
(setq my-door
(make-door :knob-color 'red :width 5.0))
;; accessing
(door-width my-door) => 5.0
;; setting
(setf (door-width my-door) 43.7)
;; accessing
(door-width my-door) => 43.7
(door-knob-color my-door) => red
I don't doubt that CL could do it -- if there's any dynamic language that can support efficient C-style attribute lookup for structs, it's Lisp.
(Although I honestly doubt that idiomatic Lisp does this. Idiomatic Lisp might be even slower than idiomatic Python, because I've seen people use alists as structs rather than dictionaries.)
Either way, your example doesn't really address what I'm talking about. Python has a struct module too, and it has namedtuple, etc. My claim is that Python doesn't have C-style structs, where attribute names are resolved to indexes AT COMPILE TIME, because it breaks the model of the bytecode compiler and dynamic module system.
Lisp is flexible about compile time vs. runtime, so I won't be surprised if it can do that. But again there's no evidence in your example that it does that.
CL implementations offer the indexed access, but include the data directly in the struct only for a few data types.
Any good Common Lisp implementation will use fixed indexes. The Common Lisp compiler can inline the accessor and will be using fixed indexes.
What Lisp compilers usually do not do is inline the data in the struct. Thus the struct slots will contain a pointer to the slot value.
Exceptions: small integers, characters, ... Those will fit into the slot and will be stored directly in the struct.
That's the point of structures in Common Lisp.
Below is an example in Clozure Common Lisp on an ARM processor. As you can see, the compiler can inline the accessor and uses fixed indexes to access the slots...
It doesn't have to be exactly like C in that it's just a pack of binary data with offsets. It can be like named tuples without the immutability and [index] access.
Python does have the ability to create and use packed byte objects which are interoperable with C structs. The interface is even in a module in the standard library called... 'struct'.
The main issues are:
1. The C-struct interoperability is cumbersome, since really it's just byte objects with some helper functions to pack and unpack them according to a given definition (so you need both the bytes and the correct format definition in order to turn it back into native Python objects), and
2. Most people already are using one of a few different struct-like things depending on their exact use case and preference, and getting everyone to agree on a One True Struct Object to build in to Python or its standard library is probably never going to happen. Just look at namedtuples -- they were an attempt to take the most struct-like thing in Python (the tuple) and add extra optional struct-like behavior to appease people who were frustrated with the lack of structs, but of course people rant about them for having a weird internal implementation and for having behavior that makes them backwards-compatible with plain tuples.
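For reference, a small sketch of the 'struct' module mentioned above; the '<dd' format string is an assumption chosen for this example, not anything specified in the thread:

```python
import struct

# Pack two doubles with the same layout as a C struct { double x; double y; }.
packed = struct.pack('<dd', 1.0, 2.0)

# The bytes alone carry neither field names nor types: unpacking requires
# repeating the exact same format string, which is the cumbersomeness
# complained about in point 1 above.
x, y = struct.unpack('<dd', packed)
```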
The tuple was Python's answer to the C-Struct, really. If I want to duct-tape 4 values together and hand them around as 1, the convention was to use a tuple.
Tuples don't name values, but that's actually important for performance - Python is an old-fashioned dynamic language that still uses hashtable access for everything, but tuple indexing is a simple array offset. Considering Python's age, reducing that endlessly-nested series of hashtable hits was important back in the day.
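The "duct-tape values together" convention and the named upgrade can be seen side by side (the Point type here is hypothetical):

```python
from collections import namedtuple

# A named tuple is still a real tuple: both access styles work on the same
# object, and index access is a plain array offset rather than a hash lookup.
Point = namedtuple('Point', ['x', 'y'])
p = Point(1.0, 2.0)

assert p[0] == p.x == 1.0
assert isinstance(p, tuple)   # passes anywhere a plain tuple is expected
x, y = p                      # ordinary tuple unpacking works too
```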
Seems to me that classes generally replace structs in such languages. I know they're not the same thing, but classes have the features of structs (and much more) at the cost of performance.
A lot of people (my whole company, for sure) are using namedtuples essentially just as immutable structs. This is a case where classes aren't sufficing. To be fair, it's at least a little bit the functional-programming, immutability bandwagon, but either way it's helping maintain large python apps with many devs.
Python 3.6's gradual typing also helps here though, and I'm definitely interested to see if that changes the game a bit.
> classes have the features of structs (and much more) at the cost of performance.
Not true; C++ classes have zero performance overhead over C/C++ structs (because they're just another word for the same thing).
Of course Python classes have a lot of costly features that C++ classes don't have, but the notion of a "class" (with methods, constructors etc) is not one of them.
And it also makes them look more complicated, with more syntax. C++ has all those features, which I will never use, without adding performance cost. But people don't use Python for performance; they use it for ease of use.
In C++, ease of use is given up for the sake of performance. In Python it is the other way around.
The vtable is always "hidden", and it's only there when at least one virtual member is defined. The sole difference between class and struct in C++ is default visibility:
class A {
    int a; // private
};

struct A {
    int a; // public
};
I'd be happy to see it optimized - namedtuple looks like a convenient way to quick-and-dirtily define a data structure, but several times I've ended up changing back to plain tuples because using namedtuple was much slower, especially when pickling.
Maybe they could add a variant that isn't a tuple as well.
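The pickling overhead described above can be seen in a quick sketch (the Point type is invented for illustration):

```python
import pickle
from collections import namedtuple

Point = namedtuple('Point', 'x y')

nt = Point(1, 2)
plain = (1, 2)

# A namedtuple pickles as a reference to its class plus the values, so the
# payload is larger and reconstruction does more work than for a plain tuple.
assert len(pickle.dumps(nt)) > len(pickle.dumps(plain))
```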
The subtext is jealousy of JavaScript object creation/destructuring; it's the thing I like most about JavaScript, I think. It really makes your code feel fluid. I couldn't tell you whether they preserve order, though. There are even precompiler extensions to destructure immutable.js Maps.
Seconded. When using JavaScript ES6, I really love how object and array destructuring, the spread operator, and default values work together to make my code so dense.
I have yet to encounter a time when a list of 2 items didn't suffice in place of a tuple. I'm sure applications exist, since tuples are immutable, but I have never encountered the need for them.
> Either way (or both) would be implemented in C for speed. It would allow named tuples to be created without having to describe them up front, as is done now. But it would also remove one of the principles that guided the design of named tuples, as Tim Peters said:
> > How do you propose that the resulting object T know that T.x is 1, T.y is 0, and T.z doesn't make sense? Declaring a namedtuple up front allows the _class_ to know that all of its instances map attribute "x" to index 0 and attribute "y" to index 1. The instances know nothing about that on their own, and consume no more memory than a plain tuple. If your `ntuple()` returns an object implementing its own mapping, it loses a primary advantage (0 memory overhead) of namedtuples.
Post-decree, Ethan Furman moved the discussion to python-ideas and suggested looking at his aenum module as a possible source for a new named tuple. But that implementation uses metaclasses, which could lead to problems when subclassing as Van Rossum pointed out.
> Jim Jewett's suggestion to make named tuples simply be a view into a dictionary ran aground on too many incompatibilities with the existing implementation. Python dictionaries are now ordered by default and are optimized for speed, so they might be a reasonable choice, Jewett said. As Greg Ewing and others noted, though, that would lose many of the attributes that are valued for named tuples, including low memory overhead, access by index, and being a subclass of tuple.
> Rodolà revived his proposal for named tuples without a declaration, but there are a number of problems with that approach. One of the main stumbling blocks is the type of these on-the-fly named tuples—effectively each one created would have its own type even if it had the same names in the same order. That is wasteful of memory, as is having each instance know about the mapping from indexes to names; the current implementation puts that in the class, which can be reused. There might be ways to cache these on-the-fly named tuple types to avoid some of the wasted memory, however. Those problems and concern that it would be abused led Van Rossum to declare the "bare" syntax (e.g. (x=1, y=0)) proposal as dead.
From what I've read, v8's implementation of objects in Javascript goes basically like this: when you call a constructor function and assign properties to your object, it makes up struct types and ties the object to that type or something.
Like this:
function Point2D(x, y) {
this.x = x;
this.y = y;
}
let p = new Point2D(1.0, 2.0)
initially you'll have an empty object, which will be an empty struct. 'this.x = x' will change the type to the 'X' struct, and 'this.y = y' will change the type to the 'XY' struct. If you do this again with another object, they'll share these underlying structs.
Now this is perhaps easier with a JIT, and perhaps not. But it bears thinking about. Why not just make it so that (x: 1, y: 0) - which would be the best syntax IMO as it fills out the {set, dict; tuple, ???} square - creates an object that shares its class with every other namedtuple that has exactly the x and y properties in exactly that order?
It really frustrates me when I read 'Those problems and concern that it would be abused led Van Rossum to declare the "bare" syntax (e.g. (x=1, y=0)) proposal as dead.' I mean come on, I know it's a different environment in Python than in V8, but seriously this is a solved problem. Those problems? Those problems are a solved problem that a solution was already proposed for in the thread. Just do that.
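A sketch of that shared-class idea in plain Python; ntuple and its cache are hypothetical names here, not an existing API:

```python
from collections import namedtuple

# Cache the generated namedtuple class keyed by its field names, so every
# "anonymous" record with the same fields in the same order shares one type,
# much like V8's hidden classes.
_cache = {}

def ntuple(**fields):
    key = tuple(fields)          # keyword argument order is preserved (3.6+)
    cls = _cache.get(key)
    if cls is None:
        cls = namedtuple('ntuple', key)
        _cache[key] = cls
    return cls(**fields)

p = ntuple(x=1, y=0)
q = ntuple(x=2, y=3)
assert type(p) is type(q)        # one class per field layout
assert p.x == 1 and p[0] == 1
```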
>He elaborated on the ordering problem by giving an example of a named tuple that stored the attributes of elementary particles (e.g. flavor, spin, charge) which do not have an automatic ordering. That argument seemed to resonate with several thread participants.
I don't want to be too harsh, but this is nonsensical rubbish. Dictionaries preserve order in Python. This ship sailed a long time ago. Namedtuples also already preserve order. Tuples preserve order. Lists preserve order. Dictionaries preserve order.
What doesn't preserve order? Like, I get that it's not strictly defined that dictionaries preserve order, but they do, and people do rely on that, and so it's never going to actually be changed.
>This is exactly why I scream at relational databases. If you can't tell the difference between a set and a list, and especially if you want to store a list in a set-based paradigm, you are going to have ALL SORTS of grief ...
Unrelated but I found this comment funny. This guy has heard of an index, right?
> Dictionaries preserve order in Python. This ship sailed a long time ago.
That ship only sailed last December with the release of 3.6. It is only an implementation detail for CPython. Other implementations can and will use the old behavior. Nobody should be writing code that depends on this behavior.
The reason the syntax was rejected is that you can implement the exact same feature without special syntax. You could write: nt(a=1, b=2) and dynamically create a struct to store the data. In 3.6 this even preserves the order of the keyword arguments.
I don't want to write nt(a=1, b=2). That's ugly. I want to write (a=1, b=2).
Being able to implement it without changing the syntax is irrelevant to the users of the language, mostly. It's an implementation concern. I've always felt that being a little harder to implement isn't a point against a feature, if it's good for users, because even a small improvement for users is usually much less effort in total than even a large implementation effort.
More syntax makes a language harder to read. The point isn't that making this special syntax would be harder, because it's not that hard. The point is that you don't need new syntax. In this case, the new syntax saves a few characters at best. I also don't think this syntax is any easier to read than a function call; I actually think it looks so similar that it would be easy to confuse with a function call, making the language harder to read.
Maybe this comparison doesn't make sense since Typescript isn't an implementation, just a type model, but anyway. In Typescript it is also similar in that it's only the list of members that define which type of object you have, there is no class name. You only have "interfaces" defining the required members which classes can implement, which makes it compatible with duck typing. So you can have an XY interface which both Vector2D(x,y) and Point2D(x,y) will implement and they will both also implement the X-interface and Y-interface at the same time.
Obviously, practicality should beat purity, and a very clever and reasonable syntax
(x=1, y=2)
which could be used/generalized for procedure arguments, with an efficient implementation in C (we have C-based arrays (lists) and hash tables, so why not generalized records/tuples?), should be accepted.
The problem with a "pure democracy" is that the majority cannot be smarter than the top 5% of individuals, so really good ideas almost never get through the bulk of the bell curve.
I think a strong argument against that syntax is that namedtuples are, essentially, an abstract class (or maybe a metaclass). An anonymous namedtuple doesn't really make sense. You want the type information.
As of python 3.6, this is already possible:
class Point2D(typing.NamedTuple):
    x: int
    y: int

p = Point2D(x=2, y=4)
If you allow anonymous namedtuples you lose one of the big values of a namedtuple, which is that if I want a Point2D, you can be sure I'm getting a Point2D, and not a Point3D. With anonymous namedtuples, there's nothing stopping you from passing a (x=2, y=3, z=4), when you wanted a (x=2, y=3). And maybe that will work fine, but maybe not (or the reverse).
All this is to say, an anonymous namedtuple is an oxymoron. NamedTuples should be named. This isn't a "really good idea".
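The value of the name can be made concrete; the Point2D/Point3D definitions here are illustrative:

```python
import typing

class Point2D(typing.NamedTuple):
    x: int
    y: int

class Point3D(typing.NamedTuple):
    x: int
    y: int
    z: int

p2 = Point2D(1, 2)
p3 = Point3D(1, 2, 3)

assert isinstance(p2, tuple)        # still a real tuple
assert not isinstance(p3, Point2D)  # the name carries type information an
                                    # anonymous (x=..., y=...) would lack
```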
> The problem with a "pure democracy" is that majority cannot be smarter than the top 5% individuals, so really good ideas almost never getting thorough the bulk of a bell curve.
Python is governed as "benevolent dictatorship" (with Guido as BDFL and maintainers as your "top 5%"), definitely not "pure democracy", so your argument doesn't really hold.
When it comes to the syntax, just because something is "very clever" doesn't mean that it's necessarily a good idea. Syntax changes should be carefully considered and introduced only if the benefit clearly outweighs the cost of extra complexity.
Why is (x=1, y=2) more practical than nt(x=1, y=2)? Also, we have a very efficient implementation for tuples. collections.namedtuple is quite fast and the article shows cnamedtuple (which I am the author of) which is a C implementation that is even faster. None of this needs a change to the language itself.