cross-posted from: https://thelemmy.club/post/17993801

First of all, let me explain what a “hapax legomenon” is: a word (and, by extension, a concept) that occurs just once in an entire corpus of text. An example is the word “hebenon”, which occurs just once in Shakespeare’s Hamlet. Therefore, “hebenon” is a hapax legomenon. The term “hapax legomenon” itself is a kind of hapax legomenon, IMO.
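To make the definition concrete, here is a minimal sketch of how hapax legomena could be pulled out of a text with a plain word count. The function name `hapax_legomena`, the naive regex tokenizer and the toy sample sentence are all assumptions made for illustration (the sample is not a quote from Hamlet); a real corpus would need more careful tokenization.

```python
import re
from collections import Counter


def hapax_legomena(text: str) -> list[str]:
    """Return the words that occur exactly once in `text`, sorted alphabetically."""
    # Naive tokenizer (an assumption): lowercase, keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return sorted(word for word, n in counts.items() if n == 1)


# Toy sentence made up for this example, not an actual quotation.
sample = "the juice of cursed hebenon in a vial and in the porches of my ears"
print(hapax_legomena(sample))
```

Whether a word counts as a hapax depends entirely on the corpus: “hebenon” is one within this sentence, while “the” and “in” are not because they repeat.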

According to Wikipedia, hapax legomena are generally discarded in NLP because they hold “little value for computational techniques”. By extension, the same applies to LLMs, I guess.
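As I understand it, that discarding usually takes the form of a minimum-frequency cutoff when the vocabulary is built. The following is only a sketch under that assumption, for a classic word-level pipeline: `build_vocab`, `encode`, the `min_count` parameter and the `<unk>` placeholder are hypothetical names for illustration, not any particular library's API.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for words left out of the vocabulary


def build_vocab(tokens: list[str], min_count: int = 2) -> set[str]:
    """Keep only words seen at least `min_count` times."""
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n >= min_count}


def encode(tokens: list[str], vocab: set[str]) -> list[str]:
    """Replace every word outside the vocabulary with the UNK placeholder."""
    return [tok if tok in vocab else UNK for tok in tokens]


# Toy token list made up for this example.
tokens = "the cat saw the dog and the dog saw the hebenon".split()
vocab = build_vocab(tokens, min_count=2)
print(encode(tokens, vocab))
```

With `min_count=2`, every word that appears only once, “hebenon” included, collapses into the same placeholder; with `min_count=1` nothing is dropped and the hapax survive.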

While “hapax legomena” originally refers to words/tokens, I’m extending it to the entire concepts described by these extremely obscure words.

I am a curious mind, actively seeking knowledge, and I’m constantly trying to learn about a myriad of “random” topics across the many fields of human knowledge, especially rare/unknown concepts (that’s how I learnt about “hapax legomena”, for example). I use three LLMs on a daily basis (GPT-3, Llama and Gemini), expecting to get to know words, historical/mythological figures and concepts unknown to me, lost in the vastness of human knowledge. But I now know, according to Wikipedia, that general LLMs won’t point me to anything “obscure” enough.

This leads me to wonder: are there LLMs and/or NLP models/datasets that do not discard hapax? Are there LLMs that favor less frequent data over more frequent data?

  • jacksilver@lemmy.world
    1 month ago

    Not the original commenter, but to add some more context: the words usually removed in traditional NLP applications are called “stop words”, and they are usually low-value words like “the”, “and”, “but”.

    However, LLMs don’t skip stop words; they actually need them to better understand the context of the sentence. That being said, LLMs are not great for statistical analysis, and a simple word count would be more consistent and faster for finding rare words.