
As to your slicing question, yes. Data is lazily loaded. With Parquet as our data store we can do even more (but haven't yet): load only the columns referenced.


I dream of being able to tell the file system to be ready for a range of requests when they are first anticipated.
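A limited form of this already exists on POSIX systems: `posix_fadvise` with `POSIX_FADV_WILLNEED` lets a program hint that a byte range will be read soon. A minimal Python sketch, assuming a Linux-like platform (the helper names `prefetch_range` and `read_range` are made up for illustration, and the kernel is free to ignore the hint):

```python
import os

def prefetch_range(path, offset, length):
    """Hint that bytes [offset, offset + length) of path will be read soon.

    Returns True if the hint was issued, False if the platform or the
    file system does not support it. The hint is advisory only.
    """
    if not hasattr(os, "posix_fadvise"):
        return False  # e.g. Windows or macOS: no posix_fadvise in the os module
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    except OSError:
        return False  # some file systems reject the call
    finally:
        os.close(fd)
    return True

def read_range(path, offset, length):
    """Read the same range later; it may already be in the page cache."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

This is only a hint at the page-cache level; it gives no guarantee about cache residency, bandwidth, or anything like a reserved PCIe lane.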

At the extreme, I think about getting an NV store to load the pointers to the physical data into cache, and not to flush the cache until the application thread releases the context. Going further into this reverie, anticipatory QoS for bandwidth is a personal desire. In my fascistic moments, I'd even like to have a PCIe lane reserved.

Now I'm regretting my wordy comments above, because the explanation I wrote wasn't needed.

But this dream I have to gain programmatic cooperation for data performance is surely not unique to me.

Maybe twenty years ago it made perfect sense to leave the task of resource sharing to the operating system and its subsystems. Now, though, not only do many people have a lot of performance headroom to exploit, but the knowledge and experience of writing concurrent schedulers and load balancers is far more commonplace, just by virtue of the needs created by the rapid expansion of the Internet.

The only point from my comments above that I think is underappreciated by end users of software, and worth pursuing, is how much you can improve your software when you have a truly controlled environment. Talking straight through the operating system stack, you can obtain uncoloured measurements from which you can build optimal performance applicable to all installations. I may dream, but I grew up with the worship of vendor benchmarks as my bête noire, and I have been set against the waste created by supporting unrestrained variation ever since. Supporting that variation makes sense for a commercial operating system, but the opportunity to work with completely homogeneous stacks ought to be recognised by management for its potential value. I'm convinced that this will be a critical commercial advantage for whoever first finds a solution that is not service-only but customer-deployable.


If you're talking about lazy loading of data, for your R implementation you might want to (if you're not already) look into creating a custom dplyr backend that only loads data when needed (similar to the dplyr SQL backends).
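The idea behind such a backend can be sketched in a few lines. This is a Python illustration of the general pattern, not dplyr's actual backend interface: `LazyTable`, `select`, `filter`, and `collect` are hypothetical names, and the file is only opened when `collect()` is called.

```python
import csv

class LazyTable:
    """Records operations; touches the file only when collect() is called."""

    def __init__(self, path, columns=None, predicates=()):
        self.path = path
        self.columns = columns              # None means "keep all columns"
        self.predicates = tuple(predicates)

    def select(self, *columns):
        # Returns a new plan; no I/O happens here.
        return LazyTable(self.path, list(columns), self.predicates)

    def filter(self, column, test):
        # test is a function applied to the raw string value of the column.
        return LazyTable(self.path, self.columns,
                         self.predicates + ((column, test),))

    def collect(self):
        # Only now is the file read, keeping just the requested columns.
        rows = []
        with open(self.path, newline="") as f:
            for record in csv.DictReader(f):
                if all(test(record[col]) for col, test in self.predicates):
                    keep = self.columns if self.columns is not None else list(record)
                    rows.append({c: record[c] for c in keep})
        return rows
```

With CSV the whole file is still scanned and only the materialised result shrinks. A columnar store such as Parquet is what would let the backend skip unreferenced columns on disk as well.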


Another wish of mine is the reporting of installation conditions from every individual installation.

It was a long time ago, but I can't forget my experience in small businesses, where I had to disbelieve any reports of performance problems until I had full access, or was even on site, to get an idea of the circumstances.

I'm fed up with Microsoft getting the advantage of being the only one who collects metrics aggressively or at all.

The open source community should be the first to get this data from customers' production systems.

How can we do this?

I'm just feeling a personal sense of futility from my experience of optimising code and installations, sometimes to minimal effect, bounded by the hardware budget. I certainly learned a valuable part of my skills that way, but ever since I have wondered how much of the development effort would be better allocated if true installation performance instruments were reporting on the whole user base.

(One assumes an R user will be well equipped, but I expect a broad spectrum of hardware, from students in India to multinational developers working with the latest generation.)

It's often true that the most impoverished users gain the most from optimised software, and the ethics of this result are impeccable. I'm simply asking for the concerted collection of instrumentation metrics to support optimisation efforts in open source software. Putting the authors, creators and hackers first is something I wish were done as part of promoting FOSS, and I think providing quality insights into how their work is used should be the basic foundation of our responsibility and gratitude for that work. Not to mention the improvements in our own work which will result. Surely this is not impossible to solve. I have often thought about a package manager taking snapshot performance characteristics and reporting them to the developers by way of a public page update. But I've not even seen the idea anywhere else, and don't understand what gives...
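To make the package manager idea concrete, here is a rough Python sketch of what such a snapshot might contain. Everything in it is an assumption for illustration: the field names, the toy workload, and the package name `example-pkg` are invented, and the sketch only builds the report as JSON without sending it anywhere.

```python
import json
import os
import platform
import time

def timed_workload(n=200_000):
    """Time a small, fixed CPU-bound loop as a crude performance probe."""
    start = time.perf_counter()
    sum(i * i for i in range(n))
    return time.perf_counter() - start

def snapshot(package, version):
    """Collect coarse, non-identifying facts about this installation."""
    return {
        "package": package,
        "version": version,
        "python": platform.python_version(),
        "system": platform.system(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "workload_seconds": round(timed_workload(), 6),
    }

# Hypothetical package name and version, for illustration only.
report = json.dumps(snapshot("example-pkg", "0.0.0"), sort_keys=True)
```

A real scheme would need a workload that reflects what the package actually does, and users would have to opt in before anything was published.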



