
Effect of dictionary cull on ODP/CSW commonality/differences

Posted: Tue Oct 08, 2024 4:02 pm
by Fiona T
I did some analysis on the commonality/differences between ODP and CSW before and after the ODP cull to see if it helps or hinders...

I've only considered words between 2 and 9 letters long

If I've got this right...

Pre-cull

ODP - 130262 words

32417 are not in csw21
97845 are in csw21
484 added by csw24 (and 2 removed)

Post-cull

ODP 90707 words

5767 are not in csw21
84940 are in csw21
410 added by csw24 (and 2 removed)


CSW21 - 162192 words

Not in ODP post cull 77252
In ODP post cull 84940
Not in ODP pre-cull 64347
In ODP pre-cull 97845

CSW24 - 1137 new words, 4 removed

new words not in ODP pre-cull 653
new words in ODP pre-cull 484
new words not in ODP post cull 727
new words in ODP post-cull 410

2 words removed both pre and post cull (TRANSMAN, TRANSMEN)


In conclusion :)

75% of the pre-cull lexicon is valid in CSW24
94% of the post-cull lexicon is valid in CSW24

60% of CSW24 is valid in pre-cull lexicon
52% of CSW24 is valid in post cull lexicon

So post-cull you're safer risking your countdown word in scrabble, but less safe risking your scrabble word in countdown. HTH :)

e&oe
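
For anyone who wants to check the arithmetic, here is a minimal Python sketch that reproduces the four percentages from the counts quoted above. It assumes "valid in CSW24" means the CSW21 overlap plus the words CSW24 added, minus the two it removed, and that CSW24 is CSW21 plus 1137 minus 4. The variable names and the `pct` helper are made up for illustration; this is not necessarily how the original figures were produced.

```python
# Counts copied from the post above (words of 2 to 9 letters only).
odp_pre = 130262            # ODP before the cull
odp_post = 90707            # ODP after the cull
pre_in_csw21 = 97845        # pre-cull ODP words that are in CSW21
post_in_csw21 = 84940       # post-cull ODP words that are in CSW21
pre_added_by_csw24 = 484    # pre-cull ODP words newly added by CSW24
post_added_by_csw24 = 410   # post-cull ODP words newly added by CSW24
odp_removed_by_csw24 = 2    # TRANSMAN, TRANSMEN (in ODP pre and post cull)
csw21 = 162192

# Assumption: CSW24 size = CSW21 + 1137 new words - 4 removed words.
csw24 = csw21 + 1137 - 4

# Assumption: ODP words valid in CSW24 = CSW21 overlap + additions - removals.
pre_valid = pre_in_csw21 + pre_added_by_csw24 - odp_removed_by_csw24
post_valid = post_in_csw21 + post_added_by_csw24 - odp_removed_by_csw24


def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


print(pct(pre_valid, odp_pre), "% of pre-cull ODP is valid in CSW24")
print(pct(post_valid, odp_post), "% of post-cull ODP is valid in CSW24")
print(pct(pre_valid, csw24), "% of CSW24 is valid in pre-cull ODP")
print(pct(post_valid, csw24), "% of CSW24 is valid in post-cull ODP")
```

Under those assumptions the four figures come out as 75, 94, 60 and 52, matching the conclusion above.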

Re: Effect of dictionary cull on ODP/CSW commonality/differences

Posted: Tue Oct 08, 2024 9:12 pm
by Gavin Chipper
Good info.

Re: Effect of dictionary cull on ODP/CSW commonality/differences

Posted: Wed Oct 09, 2024 7:09 am
by Jon O'Neill
Cool.

Re: Effect of dictionary cull on ODP/CSW commonality/differences

Posted: Wed Oct 09, 2024 9:27 am
by Fiona T
I was kinda interested in the 78 words that CSW added this year that ODP zapped at the same time

A lot of them look like proper modern words, with the scrotum featuring disproportionally!
Wonder if we'll see some of them return over the next months...

(Full list if anyone wants to play re-add bingo)

ABYED
ACIDAEMIA
AGGY
AMBIGRAM
AMBIGRAMS
ANGSTING
BASA
BASAS
BAWBAG
BAWBAGS
BIOSECURE
COULDA
CRYPSIS
EMPING
FONIO
FONIOS
GATEKEPT
GLAMP
GLAMPED
GLAMPS
GOETTA
GOETTAS
HOMEGOING
MAGSTRIPE
MASCULISM
MASULAH
MASULAHS
MEGAPOLIS
MEMBRILLO
METAPHONY
MIDDER
MONOMYTH
MONOMYTHS
MULTIHIT
NATTO
NATTOS
NUTBALL
NUTBALLS
NUTSACK
NUTSACKS
OMNICIDE
OMNICIDES
ONGLET
ONGLETS
PANDEIRO
PANDEIROS
PANTSING
PEATED
PIZZAIOLO
PNICOGEN
PNICOGENS
PNICTOGEN
POMPOMMED
PSALTERER
QUINCH
ROUTABLE
SALINATE
SALINATED
SALINATES
SARCODINE
SIMIT
SIMITS
SKEEZY
SKUNKBUSH
STACHE
STACHES
STUFFIE
STUFFIES
SUPERTASK
UNMALTED
WAGWAN
WOULDA
ZAATARS
ZEDONK
ZEDONKS
ZEEDONK
ZEEDONKS
ZEPPOLIS
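
A list like this can be built with plain set operations. The sketch below is only an illustration under assumptions: the function names are hypothetical, and the four sets are toy data rather than the real contents of any lexicon (GLAMP and NATTO are borrowed from the list above; the other entries are placeholders).

```python
def in_scope(word):
    """Keep only words of 2 to 9 letters, as in the analysis in this thread."""
    return 2 <= len(word) <= 9


def added_and_culled(csw_old, csw_new, odp_pre, odp_post):
    """Words newly added to CSW that were removed from ODP at the same time."""
    added = csw_new - csw_old      # in the new CSW but not the old one
    culled = odp_pre - odp_post    # in ODP before the cull but not after
    return sorted(w for w in added & culled if in_scope(w))


# Toy data only, NOT the real word lists.
csw_old = {"CAT", "DOG"}
csw_new = {"CAT", "DOG", "GLAMP", "NATTO", "LONGPLACEHOLDER"}
odp_pre = {"CAT", "DOG", "GLAMP", "NATTO", "LONGPLACEHOLDER"}
odp_post = {"CAT", "DOG"}

result = added_and_culled(csw_old, csw_new, odp_pre, odp_post)
print(result)
```

With the toy sets this gives GLAMP and NATTO; the 15-letter placeholder is dropped by the length filter.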

Re: Effect of dictionary cull on ODP/CSW commonality/differences

Posted: Wed Oct 09, 2024 10:18 am
by Jon O'Neill
Fantastic.

Re: Effect of dictionary cull on ODP/CSW commonality/differences

Posted: Wed Oct 09, 2024 11:58 am
by Fiona T
Jon O'Neill wrote: Wed Oct 09, 2024 7:09 am Cool.
Jon O'Neill wrote: Wed Oct 09, 2024 10:18 am Fantastic.
Image

Re: Effect of dictionary cull on ODP/CSW commonality/differences

Posted: Thu Oct 24, 2024 6:22 pm
by Gavin Chipper
Fiona T wrote: Tue Oct 08, 2024 4:02 pm

Pre-cull

ODP - 130262 words

...

Post-cull

ODP 90707 words

By the way, a lot was made of the new additions (back in 2015 or whenever) like it would completely change max scores etc., and I got the impression that the wordlist was increasing by 10 times or something. But assuming that the post-cull list is something like what it was originally, it's not that big a difference at all. Less than 44% increase.

Re: Effect of dictionary cull on ODP/CSW commonality/differences

Posted: Thu Oct 24, 2024 6:32 pm
by Gavin Chipper
OK, so according to this Apterous ticket from 2015, the number of headwords went up from about 140,000 to about 600,000, so approximately a quadrupling. The 10 times thing in my last post was an exaggeration, but this 4 times figure seems like what I remembered from the time. So why the big difference?

Could it relate to words longer than 9 letters which the Apterous ticket might have included but not this thread? I don't see any reason why the ratio difference would be so much.

Edit - I had been aware of the discrepancy before, but only since recently, when Graeme posted this analysis. I didn't get round to posting about it at the time, and this thread reminded me. I feel like I've been under completely the wrong impression about the Countdown dictionary for nearly 10 years. It's still a massive load of words, but nothing like what I thought.

Re: Effect of dictionary cull on ODP/CSW commonality/differences

Posted: Thu Oct 24, 2024 7:01 pm
by Thomas Cappleman
Ray's comment there is "potentially bringing the word count from the ODO's ~140000 entries to the ODE's whopping ~600000 entries" - if they'd taken everything from the full OED. But it was just some (semi-)random selection of it, and then partially reverted in some instances.

If it had been like you'd thought, you'd have about 80% of maxes being words you'd never heard of before the update.

Re: Effect of dictionary cull on ODP/CSW commonality/differences

Posted: Thu Oct 24, 2024 7:11 pm
by Gavin Chipper
True, but I don't think there was ever any rowing back of the implications by anyone. I don't recall anyone ever saying "Oh, it's not that big after all" or putting the actual numbers (until they were removed again). I feel misled by the whole thing.

Re: Effect of dictionary cull on ODP/CSW commonality/differences

Posted: Thu Oct 24, 2024 9:12 pm
by Fiona T
There were 96,178 removals as per https://www.apterous.org/ticket_view.php?ticket=6969, more than half the lexicon

So the majority were > 9 letters and not included in my analysis. But yeah CSW is far crazier.
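
A quick sanity check of those numbers, as a rough sketch. It assumes the 96,178 figure from the ticket and the ODP counts earlier in the thread describe the same cull, and that no words were added to ODP in the same update (so the drop in the 2 to 9 letter count is entirely removals). Variable names are made up for illustration.

```python
odp_pre = 130262         # ODP words of 2-9 letters before the cull
odp_post = 90707         # ODP words of 2-9 letters after the cull
total_removals = 96178   # total removals quoted from the Apterous ticket

# Removals that fall inside the 2-9 letter range covered by the analysis.
removed_in_scope = odp_pre - odp_post

# Everything else removed must lie outside that range (mostly longer words).
removed_out_of_scope = total_removals - removed_in_scope
share_out_of_scope = removed_out_of_scope / total_removals

# How much bigger the pre-cull 2-9 letter list was than the post-cull one.
increase = odp_pre / odp_post - 1

print(removed_in_scope, removed_out_of_scope)
print(f"{share_out_of_scope:.1%} of removals outside the 2-9 letter range")
print(f"{increase:.1%} increase from post-cull to pre-cull size")
```

On those assumptions, 39,555 of the removals are 2 to 9 letters long and 56,623 are not, which is about 59% of the total, and the pre-cull 2 to 9 letter list is about 43.6% larger than the post-cull one.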