Half million 'Words with Spaces' missing from dictionaries (linguabase.org)
132 points by gligierko 20 hours ago | 222 comments



> “Boiling water” isn’t “water that happens to be boiling.” It’s a hazard, a cooking stage, a state of matter

I guess we'll have to disagree then, because "boiling water" is "water that's boiling" to me. It's not a different state of matter to "water", that would be "steam". It being a hazard doesn't mean it's a singular concept, same as "wet floor"


Yeah, if "boiling water" is one word, what about boiling sugar? Boiling milk? Boiling volcano? Boiling soup?

Adding two words together creates a new and different concept. The permutations necessary to represent every concept ever formed by combining two or more different words would be endless.

Some of them on the list, like black hole, do make sense. That's a very distinct thing. It's not a hole in the conventional sense and it's not really black. Boiling water, though, is water. And it's boiling.


[To be clear, the below is me agreeing with you]

Norwegian is almost as compound-happy as German, and we could've filled many volumes with compounds. But what generally happens for one of the compounds to enter the dictionary is that the compound needs to have a meaning that is non-obvious from the individual parts, at least to some people, and typically that the compound has a non-obvious meaning if interpreted as two separate words.

E.g. "akterutseilt" is an example. "Akterut" means behind, aft. "Seilt" means sailed. "Behind sailed" helps as a way to remember it, but it's not obvious whether it's strictly a sailing term, or means that you've been left behind or have left someone else behind.

In this case if you say someone has been akterutseilt, it means they've been metaphorically left behind, often by their own failure to keep up.

Those kinds of compounds deserve dictionary entries whether they are actually written in two words or one, because they function as a single unit however it is written.

I think black hole is a perfect example in English. And in fact, this is a compound that is written in two words in Norwegian as well, but is in Norwegian dictionaries despite that[1] as "svart hull".

[1] https://ordbokene.no/bm/svart%20hull


Fun fact: I looked this up in the online version of the Duden (the predominant German dictionary). It does have an entry "Black Hole" (so the English term!) but not for "schwarzes Loch", which is the normal German term for it.

(In the printed versions, you might need to go to the Universalwörterbuch or so to find the English entry, it might not be in the normal "Die deutsche Rechtschreibung"; I have not checked.)


The Duden has not been official since 1996.

Since 2004 the official guidelines for the German-speaking countries (Germany, Austria, Switzerland, Belgium, South Tyrol, Liechtenstein, Romania, Hungary - see this founding document with the list: https://www.rechtschreibrat.com/DOX/wiener_erklaerung.pdf) have been covered by the Rechtschreibrat (https://www.rechtschreibrat.com/).

The official German dictionary is here: https://grammis.ids-mannheim.de/rechtschreibung/6774


> Duden

Just the name gives me flashbacks to German lessons in high school.


> Adding two words together creates a new and different concept. The permutations necessary to represent every concept ever formed by combining two or more different words would be endless.

May I introduce you to the German language?

We have "Gesundheitszeugnis" (health certificate) and "bärenstark" (strong as a bear), and of course "[der] Donaudampfschifffahrtsgesellschaftskapitän" ([the] Danube Steamship Navigation Company Captain) and "[das] Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" ([the] cattle marking and beef labeling supervision duties delegation law).


And we don't expect dictionaries to contain every compound word you could come up with in German either.

Not the ones that someone could come up with, just the ones people do use.



The issue with German as well as Norwegian is that a space creates a semantically distinct structure, so it's not that they remove the space, but that one wasn't there in the first place, and some of those compounds then become important enough for the dictionary.

Absolutely not all - there's a near unbounded set of possible compounds.

In Norwegian, we in fact have a compound for the incorrect separation of compounds: "orddelingsfeil" (word separation error). Actually, we have two - technically it's "særskrivingsfeil" (separate writing error), but "orddelingsfeil" is more common... We take this seriously.

The problem is that while some are definitely wrong, others change meaning.

E.g. "en norsk lærer" means "a Norwegian teacher" but "en norsklærer" means "a teacher of the subject Norwegian". There's an infinite set of possible -lærer compounds: If you create a new subject then a teacher of that subject is a <subject>lærer. Obviously they can't all go in the dictionary.

Some other examples:

"Røyk fritt" means "smoke freely" while "røykfritt" means "smokefree". "Steke ovn" means "to fry an oven", while "stekeovn" means "oven". These two belong in the dictionary because they are so common, and because, though technically you can use "ovn" and "fri"/"fritt" to form a near infinite number of other forms as well, in practice the number of common forms that use them is quite limited.

The key part is that most compounds in languages like German or Norwegian will only have one valid way of writing them. Add spaces, and you usually end up with something ungrammatical or with an entirely different meaning.

Whereas in English whether or not a word can be written with a space, with a hyphen, or combined much more often changes over time, and can differ in different places at different times, as the <separate words> -> <hyphenated> -> <compound> pipeline in English is slow and arbitrary and not necessarily reflecting a change in meaning.


Boiling water is not a word. The phrase contains two words. German has no single word for "boiling water" either; it also uses two words, an adjective and a noun. But the German language has the principle of composite words. As a consequence, there is an infinite number of German words.

"Hackernewsleser" is a word I just made up, but every German can understand it: a reader of Hackernews. Obviously this makes a dictionary tricky. And it was a big problem for spell checking in early versions of MS Word.
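To make the spell-checking problem concrete, here is a minimal sketch of how a checker could accept an unseen compound by splitting it into known parts instead of looking it up whole. The function name `split_compound`, the `min_part` threshold, and the three-word lexicon are all made up for illustration; real German decompounding also has to handle linking elements (such as the "s" in "Gesundheitszeugnis") and ambiguous splits, which this ignores.

```python
def split_compound(word, lexicon, min_part=3):
    """Return a list of lexicon entries that concatenate to `word`, or None.

    Hypothetical helper: tries every split point from left to right and
    recurses on the remainder. Parts shorter than `min_part` are not tried.
    """
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(min_part, len(word) - min_part + 1):
        head = word[:i]
        if head in lexicon:
            rest = split_compound(word[i:], lexicon, min_part)
            if rest is not None:
                return [head] + rest
    return None


# Toy lexicon, assumed for the example only.
LEXICON = {"hacker", "news", "leser"}

print(split_compound("Hackernewsleser", LEXICON))  # ['hacker', 'news', 'leser']
print(split_compound("Hackerzzz", LEXICON))        # None
```

A checker built this way never needs "Hackernewsleser" in its word list, which is roughly why the dictionary can stay finite while the set of valid words does not.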


I would write it Hackernews-Leser for better readability, but both work.

Boiling point?

To me it boils down to (pun intended)

> Traditional dictionaries skip almost all such phrases, because they contain spaces.

Yes, because they're phrases, not words. I don't even understand what's surprising about this. Sure, the entire article talks about how dictionaries contain _some_ phrases; but it's clear it's not many of them. Dictionaries are for words, not phrases.


Technically they are both phrases and words. You can call them lexemes if you want to avoid confusing the computer programmers who do not understand that life isn't binary.

While this is certainly outside my wheelhouse, what I see in various locations is that (at least for English)

- A multi-word phrase is a phrase, not a word

- A lexeme is a basic unit of meaning in a language, like a word (and its forms [1]) or phrase.

- Every place I was able to find described a lexeme as a "word _OR_ phrase", making it clear those two are different things.

- Dictionaries, in general, focus on words. Many do include phrases also. This point is less definitive; and just my understanding from looking at dictionaries and how they describe themselves. That being said, every source I can find that discussed something close to the topic seems to support this

[1] A word with all its forms, in that "walk", "walked", and "walks" are all a single lexeme (with each form being a distinct word) OR a phrase

Side note: I'm not looking to "correct" anyone; just pointing out what information I'm able to find on the topic. I'm open to being corrected, but that correction would need to include reasonable sources.


While not all phrases are words, the specific phrases we are talking about are a type of word known as an open compound word.

Oh. Thank you for this. I learned a new term today :)

> to me

Your "to me" is actually problematic, because it legitimizes this nonsensical idea and turns words and their meaning into something purely individualistic, which cannot end well for the current, but even more so for the next generation.

I can confirm that "boiling water" definitively is "water that's boiling" and that two words, which are supposedly one word, definitely are not one word.


Yeah, but the nice thing about natural language is, it doesn't matter what you think. People talk because they want to communicate something. You can try to talk your pet language at other people, but you will fail at communicating. So things have a happy way of sorting themselves out.

Actually it does matter what I think, since I am a user of language and, unlike many others, I actually care.

When it comes to using, spreading and understanding language, every single human being matters, because every single human being acts as a multiplier.

What also matters, is that there are far too many people who are, despite having graduated schools, barely able to read written words.

These are the very same people who also want to convince everybody else that it does not matter.


> I can confirm that "boiling water" definitively is "water that's boiling" and that two words

Which are the two words?

["water that's", "boiling"]

["water", "that's boiling"]

["water that", "s boiling"]

Something else?


I think the two words are ["stop being", "pedantic"]

The point of this discussion is pedantry.

Boiling water is mostly the same as boiling anything. So I would just have "boiling". No need for "boiling water". I see no reason why boiling water could not just be covered by whatever general boiling entry covers.

The reason is the same reason for why the word "hot water" is found in the dictionary: Because it has picked up other meaning.

The word "boiling water" is not currently found in the dictionary because the meaning has not been considered widespread or significant enough to justify inclusion. The article is pondering what line exactly defines widespread or significant.


Some other words that are sorely missing from dictionaries: "Warm water", "hot water", "cold water", "dirty water"

As an idiomatic expression, "Hot water" = "trouble".

Are there idiomatic expressions for warm/cold/dirty water, which mean something other than a literal adjective describing the temperature or condition of water?


> hot water - n. a difficult or dangerous situation

https://www.merriam-webster.com/dictionary/hot%20water

> warm water - n. an ocean or sea not in the arctic or antarctic regions

https://www.merriam-webster.com/dictionary/warm%20water

> cold water - n. depreciation of something as being ill-advised, unwarranted, or worthless. e.g. threw cold water on our hopes

https://www.merriam-webster.com/dictionary/cold%20water

Seems that what makes sense to be in dictionaries is already there.


> dirty water

Depending on the context you got sewage, slush, runoff, murk, waste etc.


Agree. You can of course treat "Boiling water" in its gerund form where it functions as a noun:

  "Boiling water should be performed in a metal pot".

> It’s a hazard, a cooking stage, a state of matter

All of these are ancillary and depend on context, but in every one of these downstream cases the same underlying process is happening: the water is boiling.


> the water is boiling.

Not necessarily. It might refer to heating water to bring it to a boil.

Q. What are you doing over there?

A. Oh, just boiling water.


That's using it as a [verb] [noun], not a gerund. If you are using it as an open compound word (or a gerund) - the "boiling water" IS in a boiling state.

I would have agreed with you before they pointed out that "frozen water" gets a word: ice. Honestly, I think it's reasonable: people deal with frozen water far more than they do boiling water, but it changes it from a case of "what are they talking about?" to "okay, where do we draw the line?" for me.

Well, being pedantic, my favorite hobby:

Frozen water represents a state change and that different state commonly gets its own word: ice/water/steam equates to solid/liquid/gas

Boiling/freezing water represents the state of the liquid, not the transition. It's descriptive. Water boils away into steam, or freezes into ice.

Should we consider luke-warm water also singular? What about body-temperature water? cool water? It makes sense not to treat adjectives/descriptive words combined with the subject as singular because the definition already exists in the root of the words (meaning of adjective word + meaning of subject word). Blue clay is another example, why would that be a singular?

It really only makes sense to me in the rare cases where the combined words represent something different or non-obvious compared with the combined meanings of the two words (e.g. to 'give up')


Ice, slush, sleet, snow, graupel, hail... And within there is a subtype "black ice", a compound noun that isn't really just a description (it's not black, it's nearly invisible - a similar sense as another one, "black hole", which you'd never figure out from the components alone).

We have a lot of words for "frozen water" because it takes a lot of forms. As far as I know "boiling water" is only one thing so we've never needed additional words to distinguish it.


But water that has boiled into gas also gets a word: steam.

As far as I'm aware, there is no separate word for freezing water -- i.e. water that is very cold and will, if it continues to get colder (and has something to crystallise around), turn into ice.

So the symmetry seems complete: ice -> freezing water -> water -> boiling water -> steam.


Freezing water is already at or below 0, it doesn't need to get "colder" to turn into ice, it simply needs to exchange the energy with the environment and rearrange in crystals.

Basically as it gets colder water exchanges energy with the environment and gets colder.

But once it reaches freezing temperature, it can no longer get colder and all the energy is used for the formation of crystals.


> Basically as it gets colder water exchanges energy with the environment and gets colder.

> But once it reaches freezing temperature, it can no longer get colder and all the energy is used for the formation of crystals

Water at freezing temperature can get much colder without freezing. https://en.wikipedia.org/wiki/Supercooling:

“Water normally freezes at 273.15 K (0.0 °C; 32 °F), but it can be "supercooled" at standard pressure down to its crystal homogeneous nucleation at almost 224.8 K (−48.3 °C; −55.0 °F).”


So, I got the physics wrong. Apologies and thanks for the correction.

But the semantic point still stands. Boiling water is still water -- in the specific sense of H2O in its liquid state -- while ice is not. The complaint that frozen water has a single-word synonym while boiling water does not is making a false equivalence.


Yes, is that not the same with boiling water? It doesn't need to get "hotter" to turn to steam, it needs to exchange the energy with the environment to gasify

Steam?

Yep, all of the following make perfect sense to me, they're just non-idiomatic:

- Don't put your hand in water that's boiling,

- Add the pasta to water that's boiling,

- That saucepan is full of water that's boiling.

If "boiling water" were a distinct word, all of these sentences would change meaning compared to their idiomatic counterparts.


I’m so glad I’m not going insane. I don’t see any examples on that site that I agree are ‘one word’. Sure they’re singular concepts but so what? Are we going to have singular words to describe all adjective noun pairs now?

Really? none are one word? How about "of course"?

I do see your point on that one, but phrases have an origin.

Of course is like an abbreviation of something like ‘in the natural course of things’. Which has become more like just ‘yes’ over time. In the usage of ‘yes’ it’s easier to argue it could be one word.


... which is in fact in both the OED and MW dictionaries.

What's ice cream then?

And even more confounding, what's "water ice?"

https://www.ritasice.com


Yeah, this article brings up a good point per se, but then defeats itself with nonsensical analysis and examples

It used to be iced cream, which is more descriptive.

Ice cream is a shortened pronunciation.


When was it called "iced cream"? The first published recipe for ice cream in 1718 called it "ice cream", not "iced cream". The first recorded mention in English at all was in 1671 and there, again, it was "ice cream", not "iced cream".

Note that in the 1718 text it is not actually called "ice cream", but the recipe is titled "To ice cream". I.e. "ice" is used as a verb, the result presumably being "cream that has been iced". In the same work, there is also Chocolate-Cream so there was a choice not to write Ice-Cream there.

There are some attestations to it from 1732 onwards: https://archive.org/search?tab=fulltext&query=%22iced+cream%...

The attestations for ice cream (or often ice-cream, as these open compound words used to often be hyphenated -- the loss of that hyphen eventually leading to articles like this one) are much, much more and much messier, not least because someone tagged every edition of The Gentleman's Magazine as being published in 1731 -- the Internet Archive is a fantastic resource but I wish they'd allow crowd sourcing corrections for metadata. Excuse the m-dashes.

You may be right that it was mostly called ice cream at first and eventually at last. To be honest I took the Wiktionary etymology at its word.


"a state of matter", no boiling water is not a "state of matter"

It's a state that matter can be in. Which is not the same as the technical compound word "state of matter".

Which is why "state of matter" is, itself, often in the dictionary, possibly to the dismay of the Team Single Word in this comment section.


I never heard about "boiling water" as a state of matter. Boiling water has two states of matter, liquid and gas, including a phase change. There are many states of matter. I, as a chemist, would not be able to tell you most of them off the top of my head, Bose-Einstein condensate being one of them. Boiling water is not a state of matter. It may be a description of water, like cold water, flavoured water, carbonated water.

That's exactly the point. It's a state (that of being boiling) that matter (some water) is in. Which is not the same as "state of matter", the compound word that is in the dictionary.

In talking about the validity of the suggested compound word "boiling water", an example of exactly what the article is talking about arises: when exactly does a sequence of individual words (state, of, matter) become more than the sum of its parts?

A further question raised by your comment is does the existence of a compound word with a specific meaning then rule out use of the same words in a less specific manner? Perhaps for maximum clarity of expression, it's confusing, but is it wrong? It's an interesting point because if you didn't know the special meaning of the compound word "state of matter" then there is a word out there that is, completely unknown to you, invalidating your writing which would otherwise be correct both syntactically and semantically.

The general consensus among the HN crowd here seems to be quite vehement that "boiling water" has not reached the point where it "deserves" a dictionary entry. But there are words in many dictionaries like "cherry blossom" that I would say are little more deserving.


Surprised that no comment mentioned that there is a standard term (not a word :P) for the set of words that denominates a particular concept: nominal syntagm. Such as "boiling water" and also "that green parrot we saw yesterday over the left branch".

Also the slider examples are abysmal. "I love you", "Go home" and "How are you" are not words by any stretch of imagination. For someone who makes word games, I don't see a particularly deep love of words here.

Edit: Obligatory reference to Borges's Tlön: https://en.wikipedia.org/wiki/Tl%C3%B6n,_Uqbar,_Orbis_Tertiu...


Funnily enough, "nominal syntagm" is, itself, not in the OED or Wiktionary. But Wiktionary has "syntagme nominal" as the French translation for "noun phrase".

You really have to love the human messiness of language!


A nominal syntagm is a somewhat overlapping concept, but deviates slightly from the direct discussion taking place. The more appropriate standard term here is: open compound word. Or, as one might say casually: word.



> "'I love you' isn't opaque, but it's tight enough to put on a tile."

The problem with introducing phrase/sentences into a word game (let's take Scrabble) is that you'd spend half the night with your friends arguing over what is and is not acceptable with the only litmus test being its... corpus frequency?


I thought that sentence seemed out of place when I read it. Didn't realize this was all AI slop. It all makes sense now.

In addition to what others have pointed out, many of these aren't actually missing from traditional dictionaries: they're just inflected differently. So your example lists phrases like "operating systems", "immune systems" and "solar systems" as missing from traditional dictionaries, but at least the online OED and M-W have "operating system", "immune system" and "solar system" in them. It's just that your script is apparently listing the plural as a separate phrase.
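As a rough illustration of the fix, the sketch below reduces the last word of each candidate phrase to a singular form before checking it against the dictionary's headwords. The function names, the naive suffix rules, and the sample data are assumptions for the example; nothing here is the article's actual script, and a real pipeline would use a proper lemmatizer.

```python
def singularize_last_word(phrase):
    """Naively strip a plural ending from the final word of a phrase.

    Hypothetical helper with deliberately crude rules: "-ies" -> "-y",
    otherwise drop a trailing "s" unless the word ends in "ss".
    """
    words = phrase.lower().split()
    if not words:
        return ""
    last = words[-1]
    if last.endswith("ies") and len(last) > 4:
        last = last[:-3] + "y"
    elif last.endswith("s") and not last.endswith("ss"):
        last = last[:-1]
    return " ".join(words[:-1] + [last])


def missing_phrases(candidates, headwords):
    """Keep only candidates whose singular form is not already a headword."""
    return [p for p in candidates if singularize_last_word(p) not in headwords]


# Sample data taken from the examples mentioned in this comment.
HEADWORDS = {"operating system", "immune system", "solar system"}
CANDIDATES = ["operating systems", "immune systems", "solar systems", "boiling water"]

print(missing_phrases(CANDIDATES, HEADWORDS))  # ['boiling water']
```

With this normalization the three "systems" phrases drop out of the "missing" list, because their singular forms are already headwords.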

On languages other than English: in general, different languages do word division very differently. At least in German and Dutch, many of those phrasal verbs are separable, meaning that they are one word in the infinitive but are multiple words in the present tense. So for example, where in English you would say "I log in to the website", in Dutch it would be "Ik log in op de website". "Log in" is two words in both cases, but in Dutch it's the separated form of the single-word separable verb inloggen ("I must log in now" = "Ik moet nu inloggen"). The verb is indeed separable in that the two words often don't end up next to each other: "I log in quickly" = "Ik log snel in".

Dutch, like German, has lots of compounds. But there are also agglutinative languages, which have even more complex compound words, perhaps comprising a whole sentence in another language. Eg (from Wikipedia) Turkish "evlerinizdenmiş" = "(he/she/it) was (apparently/said to be) from your houses" or Plains Cree "paehtāwāēwesew" = "he is heard by higher powers"; and these aren't corner cases, that's how the language works.


A compound word isn't just a phrase. The latter is a group of words that indicate a single concept. The former is a new word that has a distinct meaning from the subwords that compose it. "I love you" is an example of a clausal phrase. The meaning is entirely evident from the words that compose it. In contrast, a "hot dog" is not a particularly warm canine, and has its own OED entry [0] as a compound word.

And some of the entries on this list are wrong. "Good night" exists in OED as "goodnight" [1] because there are multiple ways it's used. One is the clausal phrase "I hope you have a good night", which can be modified by changing the adjective, e.g. "great night" or "terrible night". "Goodnight" the bedtime ritual can't be modified the same way, so OED chooses to write it as a compound word without spaces.

[0] https://www.oed.com/dictionary/hot-dog_n

[1] https://www.oed.com/dictionary/goodnight_n


> But roughly 15% are plausible: “wooden chair,” “morning coffee.” That’s still 30 billion sensible pairs.

(1) Who counted those? Whence those numbers?

(2) The examples are normal two-word phrases with one word modifying the other, often categorised as an adjective. The examples are counter-examples to the very claim made in that article.

(3) Using Clause to brainstorm s.t. is a weird thing to say...

(4) I would say the use of 'lexicalized' is wrong or at least uncommon. It usually refers to specialised semantics of something that could be interpreted generically, too. Like 'sleeping bag'. Or indeed 'cold feet'. Lexicalisation may involve deleting spaces, like 'hotdog'. And I am pretty sure lexicalised phrasal words are usually intensionally listed in dictionaries. And so 'ice' is not lexicalised 'frozen water', but it is not overtly a phrase but is a separate atomic word.

=> I don't get the point.


The author of this article just hasn’t been taught how to use a dictionary. The words aren’t “missing”, they’re just indexed under one of their parts. For example “wait upon” would be located within the entry for “wait”.

There are nearly half a million compound phrases that aren’t in any dictionary—simply because they contain spaces. “Boiling water.” “Saturday night.” “Help me.”

I would hope that none of those examples were taking up space in a dictionary.


It's quite interesting that "boiling water" in many Slavic languages is actually a separate word (and not derived from "water", but from "boiling"; similar how the author mentions "ice" being used instead of "frozen water").

Japanese is similar, with 熱湯 (boiling), お湯 (heated), 白湯 (boiled once, then cooled down), 水 (cold)

It was mentioned in other comments but boiled water is steam, and frozen water is ice. We do not have separate words for freezing water or boiling water.

In the Slavic languages, do they have a different way to describe boiling or freezing milk, or any other liquid?


We have the word slush to mean a mixture of ice and water. A single word for boiling water would occupy a similar conceptual space.

While these are not separate states of matter, they ARE special thermodynamic systems, with the particular property that they tend to remain exactly at the phase transition temperature while heat is added or removed from the system.

This is a somewhat esoteric technical distinction, but it has practical everyday consequences. It's why boiling food works so consistently as a universal cooking option.

You don't need to control the temperature of boiling water, it is an exact temperature that depends only on ambient pressure. As a consequence recipes work by only specifying time, sometimes with a single adjustment for people at higher altitudes.
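For a sense of how the boiling temperature tracks pressure, here is a small sketch using the Antoine equation, an empirical vapor-pressure fit. The coefficients are a commonly tabulated set for water (pressure in mmHg, temperature in °C, roughly valid between 1 and 100 °C); treat them, and the function name, as assumptions of this example and not as authoritative data.

```python
import math

# Antoine coefficients for water, assumed from common tables
# (log10(P_mmHg) = A - B / (C + T_celsius), roughly valid for 1-100 °C).
A, B, C = 8.07131, 1730.63, 233.426


def boiling_point_celsius(pressure_mmhg):
    """Temperature at which water's vapor pressure equals the ambient pressure."""
    if pressure_mmhg <= 0:
        raise ValueError("pressure must be positive")
    # Antoine equation solved for T.
    return B / (A - math.log10(pressure_mmhg)) - C


for label, pressure in [("standard atmosphere", 760.0), ("reduced pressure", 600.0)]:
    print(f"{label}: {pressure:.0f} mmHg -> {boiling_point_celsius(pressure):.1f} °C")
```

At 760 mmHg this gives a value very close to 100 °C, and lower pressures give lower boiling points, which is the altitude adjustment that recipes sometimes mention.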

This is remarkable given the wide variety of containers and heat sources used, and it is used practically by virtually every cooking tradition, even if its reason for working is not common knowledge.

It shouldn't be surprising it'd acquire a single word as a unified concept.


> somewhat esoteric technical distinction

When those technical distinctions are important we use specific technical terms for them (of which there are a few different ones for the phase transition - depending on discipline).

The cooking term is "rolling boil" which is a nice two word combo with a specific meaning.


But what about boiling milk? Or boiling oil? I get your point, I just don't understand why we would have a word for boiling water but then still need boiling-x for everything else that boils.

edit: In those other languages is it like how we use ice? where water is the default, but it could mean any frozen liquid?


As you note, we have a specific word for solid water. Additionally, in colloquial speech “water” almost exclusively refers to the liquid state. Finally we have “steam” to refer to the gaseous state.

So English arguably has three unique words for the three common states of H2O.


It's a great question, and is tough to answer intuitively without speaking a native language that actually has such a word.

I would agree that "boiling milk" and "boiling oil" are very unlikely to get separate words, unless one of them happens to be an extremely common thing that people encounter a lot and that has special practical implications.

Milk might be a special case, in that it essentially is just water with some other stuff dissolved. It is to water as salt water is to water... but more so.

My guess would be that the single word might get pressed into service like "ice" does, but I think we'd have to find languages that include this word and survey native speakers. It could vary.

Nearly everyone encounters boiling water in everyday life, but do most people ever see other liquids boiling, even once, and especially during the historical periods that shaped our current languages? If not we might be getting into something like technical language, where daily life lines up poorly and terms and jargon get formalized.


I could see boiling oil being its own word because when used as a weapon it's unlikely to actually be boiling and yet still be called "boiling oil".

What do you mean by "why we would have"? Dictionaries aren't prescriptive, they're descriptive. If by "we" you mean English speakers, clearly you don't have a word for that. But if you mean some Slavic languages, they do. Likewise, English has "ice", while other languages simply call it "frozen water". Or take the example from the linked article, "at home", which some languages do have a separate word for. I don't think many languages have a distinct word for "at work" though, or "at the shop". That simply reflects that being at home is a more common and generally important concept, just like boiling water is more important in some sense than boiling milk.

Asking for the "reasons" behind a certain word existing is sort of like asking why the human body looks the way it does. Sure, scientists may have good theories why it was evolutionarily advantageous to have five fingers and no tail, but in the end the only answer that's for certain is, "because it evolved that way". So the answer is, "we" have a word for boiling water because people found it useful to have such a word.


In Norwegian, we have "isvann" - ice water - which can both mean water implied to be cold enough to feel like it has recently melted, or specifically water with ice in it.

If you're asking for isvann at a restaurant, you'd expect to get water with ice, not just very cold water.

But if you're talking about having gone bathing in isvann one spring, it specifically means in water that - whether or not there is actually ice in it - is cold enough that it might have recently melted.

(I'm a native speaker, but had to look up the precise nuance there to be sure I wasn't just making stuff up)


> but boiled water is steam

> We do not have separate words for freezing water or boiling water.

I don't know how it is in other languages, but in English "boiled water" and "boiling water" refer to different things: boiled water may be steam or water that has undergone some boiling, e.g. for sanitation, while "boiling water" refers strictly to water that is in the process of boiling.

I can see why some languages may have a separate word for one of these concepts to avoid some of the ambiguity.

I'm not a fan of extending the language with new words unless they are compound (with or without spaces) but extending the dictionaries with more and better descriptions is a no-brainer, there's a lot missing from them.


Yes, and substitution of boiled for boiling water has produced many terrible cups of tea.

It depends on the tea, but some cannot be well made with a metal pot of water that's taken a few minutes to get from the kettle to the table.


Arguably it depends more on the atmospheric pressure to get boiling water as close as possible to 100°C.

The general rule of thumb is that black tea (i.e. fermented tea leaves) should be brewed at 100°C, green tea (non-fermented tea leaves) should be brewed around 80°C to avoid it being bitter and white tea (young, non-fermented tea leaves) is best at around 70°C.


I'd argue that boiled water very specifically refers to the water left over after boiling water, not steam. Steam is no longer water, at least not in common parlance.

Boiled water does have the extra connotation that it is presumed to be mostly sterile, which, while not hard to derive from the fact it has been boiling, is not immediately clear. After all the past tense does not tell us how recently it was boiled.

For that reason I'd argue that if one of boiling water and boiled water should be in the dictionary, it should be boiled water. Of the two, it is the term that potentially carries extra information.


I mean it’s interesting that this is generally the case with many (or even most) words across languages… But I’d wager it’s more the norm than the exception, so I don’t know if “boiling water” is that interesting of an example.

Some are better than others. Many semi-transparents could get legit coverage. And many are good fodder for word game content.

The rest of the article did a good job explaining that. I just think those were terrible examples for the introduction. I think "shut up", "good night", and "hot dog" would have really got the point across better, but those might already be in dictionaries.

They're clearly a bit over-zealous about what examples they think have meaning. They cite substitution as a good test for a phrase but double down on boiling water.

> Lexicographers used a substitutability test: if you can swap synonyms freely, it’s not a lexical unit. “Cold feet” (meaning fear) can’t become “frigid feet”—so it gets an entry. But the test cuts both ways. You can say “boiling water” but not “seething water” or “raging water.” The phrase resists substitution too.

These aren't failures of substitution, because "raging" isn't a synonym in this case, whereas "frigid" would be a reasonable one.

I wonder perhaps if the author is confusing the idiom "hot water" which is in there https://en.wiktionary.org/wiki/hot_water and would fail the substitution test.


I removed that sentence/claim, I see the point that "boiling" and "raging" was a bad example.

Cool, going back over them I'm actually surprised at the strength of the substitution test, thus far I haven't really encountered one that strongly goes against the test if a suitable synonym is picked.

There are a few things for which English simply doesn't have anything to substitute, and those are harder to assess. "Boiling" is one, but so is "blood" in "blood pressure": obviously replacing it with another liquid gives basically the same kind of meaning, e.g. water pressure, oil pressure, but as far as I can tell there's literally no synonym for blood.

In those cases I try to use a stand-in from another language to see if the substitution works, for example "sangre" in Spanish, giving "sangre pressure", which doesn't seem to affect its meaning much, so I'd argue for its exclusion.

Conversely "Red tape" cannot be "roja tape" and a "caliente dog" is one trapped in a car not a food.
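
The substitution test discussed here can be sketched roughly in code. This is only a toy: the synonym table and the set of "attested with the same meaning" phrases are invented for illustration, and `SYNONYMS`, `ATTESTED_SAME_MEANING` and `substitution_test` are hypothetical names, not anything from the article. A real version would need corpus data and sense-aware synonyms.

```python
# Toy sketch of the substitution test. All data below is invented.

# Hypothetical synonym table: first word -> near-synonyms in the relevant sense.
SYNONYMS = {
    "cold": ["frigid", "chilly"],
    "red": ["crimson", "scarlet"],
    "blood": [],  # no usable synonym, so the test cannot be applied
}

# Hypothetical set of substituted phrases that keep the same meaning.
ATTESTED_SAME_MEANING = {"frigid weather", "chilly weather"}


def substitution_test(phrase: str) -> str:
    """Classify a two-or-more-word phrase by swapping its first word."""
    first, rest = phrase.split(" ", 1)
    candidates = SYNONYMS.get(first, [])
    if not candidates:
        return "untestable"
    variants = [f"{syn} {rest}" for syn in candidates]
    if any(variant in ATTESTED_SAME_MEANING for variant in variants):
        # A synonym swaps in freely, so the phrase is not a fixed unit.
        return "free combination"
    # No synonym preserves the meaning, so the phrase resists substitution.
    return "lexical unit"


for phrase in ("cold feet", "cold weather", "red tape", "blood pressure"):
    print(phrase, "->", substitution_test(phrase))
```

The "untestable" branch is the "blood pressure" problem: with no synonym to try, the test gives no answer either way, which is why a stand-in from another language gets used as a workaround.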


"Simmering H20", for all that simmering isn't quite the same as boiling, is pretty clearly more or less identical to boiling water.

Yeah, the good examples are usually in dictionaries as headwords, the moderate examples are usually in dictionaries as phrases within the entry for one (or more) of the words that comprise them, leaving fairly weak examples actually “missing” if you want to use “missing words with spaces” as the basis for content.

Fair point. I just rewrote the intro w/ the naming-function argument first.

'hot dog' belongs in a thesaurus, not a dictionary. It's just a type of sausage.

If people assume it's "just a type of sausage" it suggests a dictionary entry is needed to explain otherwise.

It's a term referring to a small set of types of sausages served in a specific small set of ways. In some places, a hot dog can be used as a synonym for the predominant type of sausage most common in hot dogs in that place, but the term is still more commonly referring to the assembly of a wiener or frankfurter wrapped in a bread of some sort.


> the term is still more commonly referring to the assembly of a wiener or frankfurter wrapped in a bread of some sort

I had that disagreement in an alpine resort once. A seller was vending some sort of sausage stuffed in a bread. I was hungry, so I walked up to them with money in hand and said "A hot dog please" while pointing at the only thing they were selling. The lady was mortified by my utterance, and was not willing to accept the money until I agreed with her that it is a bratwurst and not a hot dog. :D The disagreement felt a bit academic, but given that she was holding the hot dogs hostage and money does not taste that good, she won the argument.


Personally I think a bratwurst is borderline, in that it is "close enough" that I can see someone calling a bratwurst in a bread a hot dog, and I wouldn't react if a shop listed them as a type of hot dog on a menu.

But, yeah, some places "hot dog" also carries a connotation of potentially using lower quality sausages, so I can also totally see a bratwurst vendor taking offense...


A dictionary is an enumeration of words. A thesaurus is a mapping between existing words.

Every word in a thesaurus belongs in a dictionary.


In the US, if you ordered a hot dog and got a sausage (or vice versa), it would be very reasonable to return the item and ask for something else. They are culturally completely different, the same way Cheerios in milk is not another cold soup like gazpacho is.

All words in a thesaurus would generally also be in a dictionary? The difference between a thesaurus and a dictionary is what each tells you about a word.

It’s a type of sausage, but they are definitely not synonymous. At least not in American English.

The first two I kind of understand what the author means. But "help me" and "severe pain" made me think that I'm just not the right public for this text.

I don’t see how boiling water could ever be a single word. Would that mean we need entries for every other liquid boiling?

I guess Saturday night could have some extra details explaining the context around our standard work week. But even that is a stretch.


A single word for boiling water would be like the single word "slush" we have for ice in water.

It likely could apply to other liquids in the same mixed state, but would be assumed to refer to water (or solutions or colloidal mixtures primarily consisting of water) in common speech.

Water is extremely common, and has anomalously high heats of crystallization and vaporization, so it is the most common example of a mixed phase system and the only one most people encounter in everyday life.


Inuktitut / Kalaallisut (Greenlandic):

  qanik -- snow falling
  aput -- snow on the ground
  pukak -- crystalline powder snow (like salt)
  aniuk -- snow used to make water
  maujaq -- deep soft snow you sink into
  piqsirpoq (verb) -- drifting snow / blowing snow

Central Alaskan Yup’ik:

  qanuk -- falling snow
  aput -- snow on the ground
  nevluk -- wet snow
  aniu -- snow for drinking water

>"Boiling water" ... I would hope that none of those examples were taking up space in a dictionary.

Yeah, I agree! Fuck ICE!


Off the top of my head, peanut butter, black hole, and amusement park are concepts that can't be easily intuited by just combining the two singular terms, but I also wouldn't consider them as phrases.

"Peanut butter" would be dealt with by including a reference under the "butter" entry. Something like:

'N, culinary. A paste made of ground up nuts, sometimes with additional oils and other ingredients. E.g. "peanut butter", "almond butter".'

"Amusement park", same. Falls very much under the "place of recreation" definition of "park".

"Black hole" is maybe a bit different, because it's a scientific term - and certainly in a science dictionary would be included as a two-word item - but, for consistency, in a regular dictionary should be handled identically to the above, with a note on the word "hole".

While including noun phrases as singular entities in a word game is entirely appropriate, I don't think the OP has formed a rigorous definition of the concept that they are trying to describe. I agree with the other comment which suggests that they need some instruction / practice using a dictionary.


The word splitting in English is an accident of history, not a linguistic reality though. This is hilariously obvious to Swedish speakers :P

One of the axes this analysis seems to be missing is the subtle spectrum from "multi-word expressions" to "idioms". Traditional lexicographers have long published separate idioms books, such as the Merriam-Webster New World American Idioms Handbook and the Oxford Dictionary of Idioms.

Wiktionary doesn't need to make that distinction between MWEs and Idioms and tends to conflate MWEs and Idioms as there is no separate "Wikidiom". Arguably, that multi-book confusion runs deep on the internet because Urban Dictionary should probably be fully titled the Urban Dictionary of Idioms and Slang.

It's not just page limits but also categorical limits and classic lexicographers would build multiple books/volumes, not just settle on one "dictionary". Classic scholars would often have a "reference shelf" with multiple dictionaries, books of idioms, thesauri, and more. The CD-ROM and then the internet has kind of tunnel visioned that this entire shelf can be merely "one app".


I'm currently reading Cormac McCarthy's Suttree (my first of his novels): just an exceptional polymath capable of painting complicated scenery with words dozenly scattered throughout paragraphs [0].

My favorite adjective he's coördinated is "burntwing", used to describe moths spiraling downwards after passing through candleflames. If I had crafted such a descriptive contraction, my former styling would've been "burnt-wing", had I even been capable of generating such concise imagery [1].

McCarthy's stylings have helped me to reduce hyphenations in my own writings — reducing their usage mainly to contractedwords which might be all-too-confusing without them.

[0] pg104 has ten words that I do not know their definitions, yet through context they work to advance the storyline of character racists (book is set in 1950s).

[1] decades ago, during college burnout, I was searching for the essence of "burntwing", reduced to writing a professor about "feeling like a burning airplane in tailspin." My trajectory back then was definitely burntwing.


Thank you for sharing this. It makes me question the extent that a dictionary is meant to make a person more literate.

Wait til you read Blood Meridian. The imagery he created with words, some of them his own creations, is just ... beyond compare. I'm reading The Road now, which comes from the same place. I can only read either in small doses. It's very intense, and the passages deserve to be read carefully.

Another contemporary writer who worked with new words in a very creative way was Gene Wolfe in The Book of the New Sun. Some were inventions using Greek, French, or Latin roots. Others were forgotten terms which he resurrected. Someone compiled a dictionary, Lexicon Urthus, which discusses the origins of certain terms and their placement within the series.


>I can only read either in small doses. It's very intense, and the passages deserve to be read carefully.

Absolutely. Similarly, I read the Tao Te Ching 4x annually, by reading the same single passage both before and after bed, daily. Both Laozi's and McCarthy's density of construction is just soooooo human condition.

[Suttree book world] Harrington just found the eyeball in the junkyard vehicle — in a single paragraph humanity just oozes, including his toying with viscosity and shock, and re-toying again. Washes hands. The drunk boss having previously joked "yeah the driver only scraped his shinbones."

I am hooked. McCarthy's books jumped to the top of my bookqueue after reading a HN article a few months ago about his library/collections being catalogued, posthumously.

----

I've just read Dave Wallace's three major novels (Jest & King & Broom, ~2000 pages) and McCarthy is absolutely the better author, not requiring hundreds of footnotes to say less with more esoteric bullshit [0]. DFW just seems like a bully to me ("wow I'm so smart"—DFW, probably), and honestly his samizdat is about 800 pages too long (myself a former bored addict prodigy with poor family comms) [1].

Mostly I read DFW because he's my judge-brother's favorite fiction author — it felt like a challenging obligation/chore, much like our personal relationship. With both, I've felt mostly emptiness. For powerful shortform pieces, both are quite capable of emotional stirring (This is Water).

I laugh when I see this book on others' shelves, because they probably haven't read it and it isn't really worth the time to read [all of] it. A few simple questions of the "reader" verifies this. My own bullying is that "I have" [snooty], however much I wish for all that reading time back. Bullies making bullies =D

----

By page 100 of Suttree you are hooked. By page 100 of Jest you are bored [2]. I've yet to read more than six pages of McCarthy in one day. For Wallace my eyes would constantly glaze over dozens of pages and just think: what happened here?! why did author include [all of] this!?

Although I am tired after reading either author for twenty minutes, McCarthy's doesn't feel like the author is just wasting my time.

----

My McCarthy readlist is structured so: Suttree (current); Blood Meridian; The Road — is this advisable?

----

[0] DFW's footnotes == even more of his esoteric bullshit

[1] If you do read Infinite Jest, absolutely use a study guide(s) (specifically Aaron Swartz's incredible breakdown... which can reduce the book to just a few hundred pages). If you've ever suffered an addiction (whether yourself or crazybestfriends's), you probably don't need to suffer through any longform DFWallace.

[2] I understand this is part of DFW's "style" : the frenzied passages of speed addicts, thirty pages into killing a dog (e.g.) when three pages would have done better, more respectful of reader's time (addict or not).


This feels like ragebait (rage bait?) for people that enjoy language and words. The leading example is nonsense.

Is nobody going to mention that "taco [N WORD]" is one of the words there? (Third page from the end)

Oh, geeze. The progressive transparency effect on the words towards the "obscure" end of their spectrum made the later pages impossible for me to read.

I suspect the entire list was produced by an AI entity which had not been prompted to avoid giving offense. I predict a range of (tedious) opinions about whether a prohibition on that particular word is an appropriate inclusion in a system prompt.

That's also not a term I've - thankfully! - ever heard, so I've no idea if it's hallucinated. This is not an invitation, HN, to define or explain it to me.


[deleted]


You should implement an option (e.g. [checkbox] "Urban Dictionary Entries") that doesn't censor these words. It's understandable that the default criteria don't include this, but language construction shouldn't be stripped of offendables.

Just as an example: some people find the word "jerry-rigged" to be racist, as it previously replaced "nig___-rigged". Same with "rule of thumb" (due to origins in caning people). There could be no Huckleberry Finn written in a world without pottylanguage.

Who defines the boundaries of acceptable language? I'm not advocating for ebonics classes, but a language's entire purpose is to be useful, to convey information (including hatred).


It appears to me that the author is trying too hard to make a point: "merry-go-round" is a single compound word that several dictionaries contain; "canned goods" is not commonly used[1] (more of a bureaucratic jargon), and people would just say "cans"(US) or "preserves" (UK); "household chores" is simply "chores", as the word is no longer commonly used outside the house context; "coffee break ritual" is not a concept in English-speaking countries so it would make no sense to have it in a dictionary, and so many of the examples are exactly that.

[1] I wonder how many here have ever been told something like "Prithee, husband, bring back a dozen canned goods from the market, for in the meanwhile I shall do my household chores".


While 'this analysis would not have been possible without LLM', I am not sure the LLM analysis was well reviewed after it was done. From the obscure/familiar word list, some of the n-grams, e.g. "is resource", "seq size", "db xref", surely occur in the wild (as we well know), but I doubt we can argue they are missing from the dictionary. Knowing the realm, I would argue none of them are words, not even collocations. If "is resource" is, why not "has resource"? So while the path is surely interesting, this analysis lacks scrutiny, which you would expect from a high-level LLM analysis.

The very bottom of the slider is there to illustrate where LLM artifacts and Wiktionary noise live — it's not presented as legitimate vocabulary. The slider lets you see the full quality gradient, including where it breaks down.

That's not really mentioned in the article, though. As far as the article is concerned, the right side of that slider is valid-but-possibly-too-rare-to-be-interesting, when in fact it's just garbage. This does not sell the concept well.

If the first example was "monkey wrench" instead of "boiling water", we'd never have seen the article.

"Monkey wrench" is a word already found in the dictionary, so it wouldn't be a useful example. It already met the bar.

The article is questioning why some words don't meet the bar for inclusion in the dictionary. The word "boiling water" is one such word that it sees as being on the fence. The comments here demonstrate exactly why it is on the fence, but it remains unclear exactly what would be necessary for it to tip towards inclusion.


Sure, but monkey wrench is in the dictionary. Heck, it's even in my printed copy of the Shorter Oxford English Dictionary.

[dead]


Hey Michael, great project! If you don't mind me testing you, as a word game builder, what do you think about the latest developments of international policies?

The name for these is "collocations".

Collocation dictionaries are lists of collocations. The reason they're absent from single-word dictionaries is that there are about 25x more collocations than single words.


And fittingly enough, "collocation dictionary" is not in "the" dictionary. At least not the OED.

Presumably if the word thesaurus was actually "synonym dictionary" it would likewise be absent.


No. This article shows a distinct lack of understanding of the basic building blocks of the English language.

"Words" don't have "spaces."

Phrases are made of words separated by spaces.

"Boiling Water" is not a word.

"Water" is a word. A noun, the subject.

"Boiling" is a word. An adjective, in this case. Which modifies the subject.

I don't know if you're trying to be clever, but you're not.


Dictionaries containing spaced compounds were not scalable with print media. The printed OED was encyclopedic in scale. Compound dictionaries are more than feasible now. Arguing whether a collection of commonly used words are expressions or concepts or even single "spaced words" is beside the point. Simply identify these differences and classify them in the compendium.

As far as my limited knowledge of linguistics goes, the technical term is actually "collocations."

To me, any discussion of this topic that doesn't mention collocations signals an amateurish approach.

I also disagree with the premise that "this was not possible before LLM." That's nonsense. Linguists created many dictionaries of collocations for different languages, so that work is precisely what they did!

(Before any LLM zealots attack me, yes, it is now possible to have a more exhaustive list of collocations thanks to LLMs. This doesn't contradict my point.)

Examples of collocation dictionaries:

https://www.freecollocation.com/

https://ozdic.com/.


AIUI, collocations are just "words that often go together". It doesn't signal any unconventional meaning to the construction, that would make it a proper idiom.

If that were the case, there would be no need for collocation dictionaries :)
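
One common way to make "words that often go together" concrete is pointwise mutual information (PMI): compare how often a pair actually occurs with how often it would occur if the two words were independent. A minimal sketch follows; the counts are invented for illustration and `pmi` is a hypothetical helper, not how any particular collocation dictionary is built.

```python
import math


def pmi(pair_count: int, x_count: int, y_count: int, total: int) -> float:
    """Pointwise mutual information, in bits, of a word pair.

    PMI = log2( P(x, y) / (P(x) * P(y)) ), with probabilities estimated
    as simple relative frequencies over `total` observations.
    """
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


# Invented counts from an imaginary corpus of 1,000,000 bigrams.
TOTAL = 1_000_000

# Two fairly rare words that appear together half the time one of them occurs.
strong = pmi(pair_count=50, x_count=100, y_count=200, total=TOTAL)

# Two common words that co-occur exactly as often as chance predicts.
chance = pmi(pair_count=200, x_count=10_000, y_count=20_000, total=TOTAL)

print(f"strong pair: {strong:.2f} bits, chance-level pair: {chance:.2f} bits")
```

A high score only says the pair is statistically sticky. It says nothing about whether the meaning is non-obvious from the parts, which is the distinction being argued about here.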

Two related compound words from a Norwegian dialect, both mean "fish food":

Fiskemat Fiskmat

The latter means food made from fish, the former means food for fish. Standard varieties of Norwegian only use the former to mean both, to the annoyance of many old fishermen.

This maybe illustrates why the author's examples such as boiling water aren't so weird. Yes, in English it means water that's boiling, but you have to know that. It could for instance have meant water for boiling, like "cooling water" means water for cooling say in a nuclear reactor, not water which is in the process of getting cool.


I disagree these belong in a traditional dictionary.

I could, however be convinced these could be documented/defined in a separate document, especially from the perspective you are coming from (word games).


This boils down to an "is Pluto a planet" debate.

We act as if some languages have "compound words" that can encompass entire sentences (subject & object attaching to the verb as prefixes or suffixes) while others don't form compounds, and most are somewhere in between. But these are all statements about lexicographic conventions and say nothing about the languages. In reality all languages are muddles sprawling across a multidimensional continuum, and they abso-frigging-lutely don't sit neatly in such pigeonholes.


This is a great comparison. We're arguing about the definition of "word", and attempting to expand it to include edge cases where two words with separate meanings have a different atomic meaning when combined.

We could have a similar debate about whether common suffixes and prefixes should be regarded as individual words.

Much like "planets" don't really exist as a separate natural object, words don't really exist in natural languages. They are artificial concepts, and therefore we will always have edge cases.

I would argue that it is still a useful discussion, as it sheds light on the nature of language (or of celestial bodies), even if the definitions defy the same rigour as mathematical concepts.


in German, they just remove the spaces and keep the word, and this problem is solved:

Entschädigungsleistungen - compensation benefits

Wiederbeschaffungskosten - replacement value

Kraftfahrzeughaftpflichtversicherung - motor vehicle liability insurance

Donaudampfschifffahrtsgesellschaftskapitän - Danube steamship company captain

Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz - beef labeling regulation law


How often do these compound words get listed in German dictionaries?

I'd point folks to the concept of "Construction Grammar", which is related to this problem: https://en.wikipedia.org/wiki/Construction_grammar

I don't think 'Words with spaces' is a thing.

I think maybe the word the author is looking for is 'phrase'


It’s probably a thing, especially with loan-words (eg.: “avant garde”), and there are probably much better examples… But the examples in the article make no sense to me.

The difference between phrases and "words with spaces" is addressed.

The confusion might be that this seems to be a spectrum rather than a binary phenomenon.

We have single words at one extreme, ordinary sentences at the other, and in the middle we have idiomatic assemblies of words that span a range of substitutability.

"Hot dog" and "Saturday night" are arguably great examples, because they exist at the opposite extremes of the spectrum. Saturday night can retain some of the original meaning following substitution, whereas hot dog almost deserves a hyphen.


I disagree that "saturday night" ever means anything other than the literal meaning of the nighttime of the day of saturday.

You can argue that there's a connotative association with the phrase. Sure. Just like "beach weather", or "blizzard conditions". But that doesn't make "saturday night" special in any way.


I am with you on the literal definition there.

I wonder if the connotative association is exactly what we are trying to capture here though, and if those other phrases also fit in at the "separate words but slightly special" end of the spectrum.

There is meaning being communicated in all of those phrases that would be obvious to most or all people who are embedded in the language and culture where they are used, and which transcends the definitions of the individual words themselves.

It seems that there are several axes here -- how explicit is meaning, how atomic, how literal, how substitutable are the individual words -- and all vary continuously.

That might all seem needlessly pedantic for the question of "should it warrant a dictionary entry", but if you are trying to extract all information encoded in a verbal exchange, they might be useful concepts.


It's an evocative phrase. It definitely means different things to different people though. Teenager vs adult, single vs married, employed vs not.

Or how about "Sunday morning"? It's evocative for sure. But very differently for different groups.

Or "island breeze". Stirs up images and feelings. But the definition is literal and the connotations are somewhat personal.

I'd argue that none of these phrases belong in a dictionary. Possibly explicitly because the "missing" meanings are the associative connotations, but those vary for different people, so what's the canonical definition?


I think 'phraseme' is closer: https://en.wikipedia.org/wiki/Phraseme

Imagine configuring your word separator like this: " `~!@#$%^&*()-=+[{]}\|;:'",.<>/?"

Examples of "obscure" compound words include "list uids", "beg pos", "sync binlog", "gfp mask", "av fetch", "str idx", "seq ptr", "ai family", "fmt vuln", "ai socktype", "curr tok", "nbits set", "ini get", "s1 s2", "in addr", "num get", "res init", "sess ref", and "ai addrlen".

Well I can't even.
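
For what it's worth, entries like these look like what you'd get by splitting code identifiers on underscores. That is a guess about where they came from, not something the article states. A minimal sketch, where `SEPARATORS` and `split_identifier` are hypothetical names:

```python
import re

# Separator set from the comment above, plus "_" at the end. The underscore is
# my addition (an assumption): the list above omits it, but splitting
# snake_case identifiers needs it.
SEPARATORS = " `~!@#$%^&*()-=+[{]}\\|;:'\",.<>/?_"

# re.escape makes every character literal, so the set is safe in a [...] class.
_SPLIT_RE = re.compile("[" + re.escape(SEPARATORS) + "]+")


def split_identifier(text: str) -> str:
    """Split on any run of separator characters and rejoin with spaces."""
    return " ".join(piece for piece in _SPLIT_RE.split(text) if piece)


for name in ("sync_binlog", "ai_socktype", "s1_s2", "gfp_mask"):
    print(name, "->", split_identifier(name))
```

Run over source code, a tokenizer like this would turn every snake_case name into something that looks like a two-word phrase, which would explain the bottom of the list.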


This is just so hilarious. They'll eventually have to add "man man" to the list.

On my first encounter with a Unix machine (an SGI IRC) I had heard that system help was available with the "man" command, so I typed "man"; the resulting "apropro what?" made me laugh out loud, and the other people in the room looked at me like I was some kind of idiot.

There are an infinite number of describable concepts that don't get a specific word. That doesn't mean the whole description is a "word with spaces."

It's just part of how language works that when there isn't a single word carrying the meaning you want, you put multiple words together and they can mean the thing together.

Even though there isn't a specific word for that, I wouldn't say "It's just part of how language works that when there isn't a single word carrying the meaning you want, you put multiple words together and they can mean the thing together" actually is one big word with spaces in it.

It's a bunch of words together that carry a more specific meaning when put together in that order.


Isn't this the difference between a dictionary and an encyclopedia?

I was afraid that no one would bring this up. I’m developing a strange relationship with Wikipedia over the commonplace role it serves as an online resource. But I appreciate how it normalized the practice of looking things up and to get a general overview of a thing, leading to internal and external references. Credit is due to search engines also in this case I reckon.

I imagine that languages like German that create composites of nouns have less of a problem with this:

English: cream of mushroom soup

Spanish: sopa cremosa de champiñones

German: Champignoncremesuppe


I just checked, Champignoncremesuppe is not in my dictionary ;)

It has some compound words. But including too many of them would quickly get out of hand


Cremesuppe is in the first dictionary that I looked in. But including every kind of Cremesuppe would have been too much.

You are right! So the situation for German is worse: Millions of words are missing... ;-)

But can't you basically make anything a composite noun in German? That it's a single word doesn't really help you decide if it has enough presence unto itself to be defined in the dictionary.

Seems like they would have just as much of a problem since the issue is delineating when a "phrase" becomes a "word"


not anything. As a German I see no way to compound "boiling water". It remains two words: "kochendes Wasser".

'Boiling' isn't a noun.

true, but you'd be wrong to assume that Germans only compound words if both parts are nouns, e.g. "Gehweg" (walk way) and "Springseil" (jump rope) use the base of a verb. We do actually have "Kochwasser" ("kochen" means "to cook", "kochend" means "boiling") but that's not boiling water ("kochendes Wasser") but for water used for cooking.

More to the point, how do German dictionaries handle this?

Is there a distinction between words that get enumerated and compound nouns that do not?

It does seem, though, that German speakers might be more comfortable with the fuzziness that apparently exists at the edges of what the word "word" means.


In general, transparent compounds, i.e. those whose meaning can be derived from the elements, are not in the dictionary. Mushroom soup is transparent; Krankenhaus, which means hospital, but is literally sick-people home, isn't.

In Dutch we indeed happily do this even for English loanwords like "creditcard" or something more obscure like "lockpick". When in doubt, remove the space.

That happens often with domain names, but then you get expertsexchange.com, penisland.net, whorepresents.com, therapistfinder.com, a Dutch pre-match analysis site voorspel.nl, or a site about the game overspel.nl.

Peter Norvig - The Unreasonable Effectiveness of Data

https://www.youtube.com/watch?v=yvDCzhbjYWs&t=1477s
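
The ambiguity in those domain names is easy to reproduce. Below is a minimal sketch that enumerates every way to cut a string into words from a vocabulary. The tiny `VOCAB` set is made up for this example and `segmentations` is a hypothetical helper. A real segmenter would also have to rank the candidates, for example by word frequencies from a corpus, which this sketch does not attempt.

```python
from functools import lru_cache

# Tiny made-up vocabulary, just enough for the examples in this thread.
VOCAB = frozenset({
    "experts", "exchange", "expert", "sex", "change",
    "therapist", "finder", "the", "rapist",
    "who", "represents", "whore", "presents",
})


@lru_cache(maxsize=None)
def segmentations(text: str) -> tuple:
    """Return every way to split `text` into words found in VOCAB."""
    if not text:
        return ((),)  # exactly one way to segment the empty string
    results = []
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in VOCAB:
            # Extend each segmentation of the remainder with this head word.
            for tail in segmentations(text[i:]):
                results.append((head,) + tail)
    return tuple(results)


for domain in ("expertsexchange", "therapistfinder", "whorepresents"):
    for words in segmentations(domain):
        print(domain, "->", " ".join(words))
```

Each of the three strings comes back with two valid readings, which is the whole problem: without the spaces, nothing in the string itself says which one was meant.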

Not to mention Tobias Fünke’s analyst + therapist web site, analrapist.com.


I'm such a simple man. Can't help but laugh out loud every time I see these examples mentioned.

If the compound words all have single word entries in the dictionary that when combined mean the same thing what is the point?

Water: transparent, odorless, tasteless liquid

Boiling: having reached the boiling point

Boiling Water: transparent, odorless, tasteless liquid which has reached the boiling point

If Boiling Water had some other completely different meaning that has nothing to do with the individual words then sure, maybe, otherwise this is completely redundant and opinionated.


As an English native, I'd rather see proper nouns in a dictionary before seeing "compound words".

Personally, I don't agree that "boiling water" is a word (with a space) - I would refer to it as a phrase if it had specific meaning, but it just seems like an ordinary pairing of adjective and noun. Also, if a word can contain a space, then what is the meaning of "words" as there doesn't seem an easy way to distinguish between a "compound word" and a common phrase. Is "barking dog" a pair of words, a compound word or a phrase? (It's a pair of words in my mind)


“Hospital bills”. That’s very country specific. Also, that’s two words.

Hospital bills feels like a pretty ordinary compound to me - not like "good morning" or "ginger ale" where you can't just use what you know about the two words to figure out what the compound must mean.

Some cases are basically impossible: with "crash blossoms" you don't stand any chance without knowing why we call them that.

Some are middling difficult, "Home Secretary" requires that you know every meaning for the two words and then you happen to pick the correct obscure meaning, a "Secretary" could be in charge, and "Home" could mean the entire country as distinct from everywhere else.

But "Hospital bills" doesn't seem even marginally difficult


I had to look up "crash blossoms"! But that's just an idiom, which is always tricky in translation. It might also be slang. Idioms and slang are borderline dictionary material, different editors make different choices, and they change over time.

But "ginger ale" seems straightforward to me. It's an ale, flavored with ginger. Not even idiomatic, just descriptive. Root beer. Grape soda. Orange chicken.


> But "ginger ale" seems straightforward to me. It's an ale, flavored with ginger

Ginger ale is in fact, not an ale, it's a soft drink. It is distantly related to Ginger Beer and some variants of Ginger Beer are alcoholic like ales, but Ginger ale was conceived as a soft drink and today continues as a soft drink.


My (likely ignorant) understanding was that the non-alcoholic "ale"s and "beer"s were fermented like their traditional versions, but the process was stopped before the ethanol level became significant.

Mass-market ginger ales and root beers are not made that way today, of course.


There seems to be a lot of overlap between this compound word concept and idioms. Both are largely atomic, defy analysis via individual word definition, and fairly language (and culture or dialect) specific.

Dictionaries are also language specific. We don't necessarily expect a 1:1 mapping of words between languages. I have personally always wondered if this subtly shapes thoughts in different languages as well.


I think it's more than overlap -- they are the same thing.

I.e. AFAICT, all compound words that defy literal interpretation are idioms. And it's that simple.

The argument then becomes that idioms should be in the dictionary. Some of them are of course, but idioms and slang are a) fast-moving, and b) often dismissed by the sorts of people who edit dictionaries.


I tend to agree. The definitions overlap perfectly.

At the same time, I am having intuitive issues seeing "hot dog" as an idiom, vs just an ordinary noun. It certainly seems to follow noun rules, and fit into speech as one.

I don't know for sure that it's NOT an idiom though. I could just be wrong here, and have intuition in need of calibration.


No, I think you're right -- "hot dog" started out as a colloquial name for a type of sausage (apparently as something of a joke, because dog meat was sometimes eaten in the area, and it was a low-quality product), and it is now the accepted name.

So it was an idiom, now it's canon.

Another good one might be "hot dish", which has an idiomatic meaning in the midwestern US, and is slowly spreading. Not sure if it's made it to the dictionary yet. (which dictionary becomes an important question -- I'd expect to see it in M-W before, say, OED)


I wonder where "sausage dog" fits into this lexicon

In most English speaking countries it's a far from common phrase (ie. it's very USA-centric).

OK. But is the meaning any less literally-obvious than "grocery bills" or "electricity bills"?

Maybe you don't have "hospital bills". I don't have "landscaping bills", but I know exactly what they are.


Sure, but my main intent was to raise the question as to why it was singled out in the article/blog post as something that needs to be in the dictionary.

As you've pointed out, the word "bills" clarifies what it is. I don't see why every combination needs to be in a dictionary. The list would be incredibly long, eg. "phone bills" or "power bills", etc.


I think we agree then. I assumed you were arguing for inclusion in a dictionary because its meaning was not obvious.

What does it mean?

It's what your insurance gets from the hospital after they provided a service to you.

At the moment, I'm in the hospital. I've been here since 0500 Friday morning. I should be released tomorrow, Tuesday. During those five days I've had services from doctors, nurses, technicians and [everybody else necessary to run a hospital]. There were multiple uses of CT scanners, ultrasounds and many machines which go Bing!. Also, a surgical operating suite, of which I remember about 60 seconds of very bright white lights and very large-screen monitors suspended from ridiculously heavy-duty supports. Like, you could safely dangle four football players of your choice (gridiron, rugby or association, doesn't matter) from them.

A team of people will compile a bill for all of those services. The bill will be presented to the insurance company whose card I showed Friday morning. It will likely be less than a million dollars, but it could easily be more than a hundred thousand dollars. That's the right order of magnitude to consider: a good percentage of a house, maybe a very large nice house.

The insurance company will claim that some of these charges are too much. The hospital knows this, and there are three mechanisms by which they justify their prices. First, although the two Tums antacids have a street value of eighteen cents when you buy them over the counter in quantity fifty, the hospital buys them in blister-packs to avoid cross-contamination until they reach the patient. Second, it is customary to pretend that only the services which a patient actually used can be charged for, so the in-house plumber, the gas plumbers, the cryogenic fluids specialists, the oxidizing gases technicians, the potable water testers, and the electricians, among a cast of thousands, all need to be paid for.

And third, there's emergency care for the uninsured.

The US is cruel, but not stupid. No, I lie, it is frequently both cruel and stupid, always to people already disadvantaged in some other way. As a matter of law, a hospital can't turn away or discharge a person who is likely to die without treatment, even if they can't pay. But the government doesn't provide money to pay for that.

Finally, most hospitals or hospital systems in the US are run by for-profit private companies. I won't mention organized crime in the same sentence, but one can reasonably presume that the two are interchangeable in terms of law-abidingness and willingness to trade down ethics for an increase in profits.

So, having created the bill and sent it to an insurance company, they will argue back and forth and finally some portion of the money will eventually be transferred and everyone will be more or less happy, right?

No. Because in the US, the standard for healthcare insurance is to avoid the moral hazard of people attempting to get too much healthcare by having the insurance company bill the patient.

Remember the bill that started out as the same scale as a house? 10% "coinsurance" is often considered generous. 20% is pretty normal. Some specific services will be called out with specific fees, and others may be "disallowed" -- and sent through entirely to the patient.

That's on top of the monthly payments that have already happened.
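
To put rough numbers on that, here is a minimal Python sketch. It assumes the bill range given earlier in this comment (a hundred thousand to a million dollars) and the two coinsurance rates just mentioned. It ignores deductibles, disallowed items and any caps, none of which the comment quantifies, and `patient_share` is a hypothetical helper name.

```python
# Bill sizes are the two ends of the range stated in the comment;
# percentages are the two coinsurance rates it mentions.
BILLS = (100_000, 1_000_000)
PERCENTS = (10, 20)


def patient_share(bill: int, percent: int) -> int:
    """Patient's coinsurance in whole dollars (assumption: nothing else applies)."""
    return bill * percent // 100


for bill in BILLS:
    for percent in PERCENTS:
        share = patient_share(bill, percent)
        print(f"${bill:,} bill at {percent}% coinsurance: patient owes ${share:,}")
```

Under those assumptions the patient's share runs from $10,000 to $200,000.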

But I work for a tech company with an unusually enlightened attitude, so I expect that my family's fiscal impact from this bout of medical intervention will be limited to the parking fees that my wife paid when she came to visit me.

It's privilege, but I'd rather that the system be reformed so that everybody got it.


Hah, I wonder how thick a German, Dutch or Afrikaans dictionary would be if it included all possible spaceless compound words. Literally any concept can be compounded together to make a new word.

Rooivleisslaghuisinspekteur =

Rooi = red

Vleis = meat

Slag = butcher

Huis = house

Inspekteur = inspector

"Inspector who controls the quality of red meat in butcheries"

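The breakdown above can be sketched in code. This is a minimal Python sketch, assuming a toy lexicon that holds only the five parts glossed in the comment. `GLOSSES` and `decompose` are hypothetical names, and a real splitter would need a full word list plus handling for things like linking letters between parts.

```python
# Toy lexicon: only the parts and glosses listed in the comment above.
GLOSSES = {
    "rooi": "red",
    "vleis": "meat",
    "slag": "butcher",
    "huis": "house",
    "inspekteur": "inspector",
}


def decompose(compound, lexicon=GLOSSES):
    """Split a spaceless compound into known parts; return None if impossible."""
    word = compound.lower()
    if not word:
        return []
    # Try the longest prefix first, backtracking if the remainder cannot be split.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in lexicon:
            tail = decompose(word[end:], lexicon)
            if tail is not None:
                return [head] + tail
    return None


parts = decompose("Rooivleisslaghuisinspekteur")
print(" + ".join(f"{part} ({GLOSSES[part]})" for part in parts))
```

The sketch recovers the parts but says nothing about what the whole means, which is the thread's point: the lookup works for transparent compounds and fails for opaque ones.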

sometimes singular semantic concepts can take multiple syntactic words to express. Why not call this idea something other than “word”?

We could call it a "phrase".

Dictionaries are a mixed bag at best. If you apply David Kaplan’s character/content distinction from Demonstratives, you have to ask: should pure indexicals, which are essentially 'contentless' pointers be treated the same way as standard words? Let alone the thousands of rigid designators in this dataset that map directly to specific objects in the real world. At a certain point, is there no room left for encyclopedias?

I got into solving the NYT crossword during Covid. I couldn’t solve a Monday when I started; now I do Mondays downs-only and look forward to Saturdays. Along the way, I developed a sixth sense for when an answer will be more than one word. I’ve thought a lot about it and can’t really describe how I do it. (Some other puzzles clarify if an answer spans multiple words, but I find the ambiguity adds to the fun.)

Do you think this comes from a gradual internalization of a real linguistic concept? Or is it more a familiarity with common (if unspoken) conventions of the puzzle makers?

I suspect the answer isn't binary, but it's interesting to think about.

This "sixth sense" phenomenon seems to pop up a lot. Crosswords are a great example. The sense some people are getting for detecting LLM output might be another.


    > Got a word           Didn’t
    > frozen water → ice   boiling water
Freezing water doesn’t have a word. Boiled water does have a word.

A mixture of melting ice and water suitable for drinking has a word: ice water. It's not an adjective-noun phrase. It has a more specific meaning than just the two words together. You can order an ice water at a restaurant.

Freezing water doesn't have a word, it only gets one after water has changed phase. Boiling water also gets a word once it has changed phase: steam.

ice - water - steam


Steam is liquid water droplets suspended in gas; water in the gas phase is “water vapor” which also doesn't have a single word.

This is also an interesting case because “vapor” without a qualifier also refers to a suspension of solid or liquid particles in gas (of which “steam” is a particular example).


"Steam" is very definitely the gas phase of water. Water vapor is too. If we are talking about chemistry they are essentially synonyms.

If we are talking engineering, the term steam generally implies water vapor that is at or above the saturation temperature.

In every day usage they are usually drawing a distinction between visible and invisible water vapor, usually caused by the presence of liquid droplets, with "steam" being essentially "fog", but hotter.


Nope, water vapour is the gas phase of water mixed with other gases while steam is just the gas phase of water. Water vapour can condense into tiny droplets which can freeze into ice crystals, both of which are visible as 'clouds'. Steam is not visible until it condenses into droplets at which point it no longer is steam but water suspended in another medium, usually air.

"Steam is liquid water droplets suspended in gas": You clearly did not work on steam-powered ships (or land-based steam power plants). I was Main Propulsion Assistant on a steam powered destroyer, and I can assure you that every effort is made to prevent droplets being suspended in the steam--because such droplets erode the blades on steam turbines. To that end, steam coming out of the steam drum (the upper part of the boiler) is run through superheaters, which raise the temperature of the incoming steam to evaporate any droplets. On our ship, the steam coming off the steam drum was a bit over 1200 psi and 600 some degrees Fahrenheit. After it goes through the superheaters, it's about the same pressure but 975 degrees.

And there's effectively no other gas in the steam, because dissolved air in the boiler's feedwater (particularly oxygen and carbon dioxide) has to be removed to prevent corrosion. To that end, water going into the boiler is first run through a deaerator, to remove any air that dissolved in the water as it came through the condenser.


> You clearly did not work on steam-powered ships (or land-based steam power plants)

Well, that's true, I haven't, BUT still I went back and forth writing and deleting and rewriting and eventually deleting a whole digression about the special case of the jargon of steam power and how it uses “wet steam” (or “saturated steam”) for “steam” in the general use sense and “dry steam” for “water vapor” and “superheated steam” for dry steam created by heating wet steam away from contact with water, before deciding that was way too much, but, yeah, that's all true. (And, in details about the actual processes used, a lot more than I knew or would have gone into even if I had and had decided to keep the digression.)


this is an interesting distinction that i was unaware of.

Right. (I’m not sure if you’re aware but that’s exactly what I said.)

Almost but not exactly, 'boiled water' can go two ways: phase changed to steam (at which point it is no longer 'boiled water') or boiled and cooled again. Pedantic? Sure. Fits right in here? Absolutely.

These are under-respected for non native English speakers.

Can you say more on this?

Consider phrasal verbs like "shut up", "get lost" or "kick off". Knowing what the parts mean doesn't let you understand the whole.

In your native tongue you take these for granted, but in a second language you have to learn that the sum is more (or different) than the parts.


Phrasal verbs are listed under the main verb. I never ever had a problem with that. As a native speaker sometimes I still have to search for some in some strange context.

These are called idiomatic phrases, and many (all natural?) languages have them, and, yes, they are pitfalls for language learners.

These particular examples are figures of speech, so "shut" in "shut up" still means the same thing it would mean in "shut the door." And "up" is used the same way as "cover up."

So the issue is just that this is figurative language, and you have to know that a kickoff is the beginning of certain sports, for example. It's more of a cultural issue than something a dictionary needs to fix.


They don't get into enough learning lists, and from my perspective, they are great additions to word games because the more transparent compounds are unique and legit words that can more than double the accessible vocabulary.

"to be" is a very weird example because that's just the full infinitive of "be" which is definitely in dictionaries: https://www.merriam-webster.com/dictionary/be

these are called phrases

>Spanish carves up time with precision English lacks: madrugada for the pre-dawn hours, atardecer for late afternoon waning into evening. The mid-day nap was so compelling we adopted the siesta into English.

"I used to smoke marijuana. But I’ll tell you something: I would only smoke it in the late evening. Oh, occasionally the early evening, but usually the late evening -- or the mid evening. Just the early evening, mid evening and late evening. Occasionally, early afternoon, early midafternoon, or perhaps the late-midafternoon. Oh, sometimes the early-mid-late-early morning... But never at dusk." -Steve Martin


"book steaks" is in the list, but I don't think it's real. Maybe it was supposed to be "stack".

With Twain in mind, might I suggest we adopt the simple expedient of snake casing such terms.

Finally, someone who actually thought about where to draw the line instead of rejecting words with spaces entirely.

Fascinating! I’d add “word nerd” to the list to describe the authors.

Clearly those Irish monks are to blame.

Very cool project! Reminds me of Chiang's great short story 'The Truth of Fact, the Truth of Feeling':

> “If you speak slowly, you pause very briefly after each word. Thatʼs why we leave a space in those places when we write. Like this: How. Many. Years. Old. Are. You?” He wrote on his paper as he spoke, leaving a space every time he paused: Anyom a ou kuma a me?

> “But you speak slowly because youʼre a foreigner. Iʼm Tiv, so I donʼt pause when I speak. Shouldnʼt my writing be the same?”


On another note, I always wished "never mind" was spelled "nevermind"

"Each other" is like that for me, and according to search results, a lot of other people. I pronounce it ee-chother.

"Eachother" feels as natural as "somebody", "nobody", "anybody" to me


Because of the vowel following consonants, my mind splits that word into these syllables: Ea-cho-ther, so not as natural as those other words. Then again, it's English, the only rule is there's no rules.

"Opaque MWE"? Does no one know the word "idiom"?



I'd really love to see the prompt(s) you used with Claude. The way the article was written I mistakenly thought you would expand upon that in a footnote or sidebar.


