AnotherGoodName 1 day ago

For the question "is softmax the only way to turn unnormalized real values into a categorical distribution?", you can just use statistics.

E.g., using Bayesian stats: if I assume a flat prior (pretend I have no assumptions about how biased the coin is) and I see it flip heads 4 times in a row, what's the probability that the next flip is heads?

Via a long-winded proof using the Dirichlet distribution, Bayesian stats will say "add one to the top and two to the bottom". Here we saw 4/4 heads, so we guess a 5/6 chance of heads on the next flip (+1 to the top, +2 to the bottom), or a 1/6 chance of tails. That 1/6 reflects the model allowing for the coin being biased while still leaving room for doubt.
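You can sanity-check that 5/6 with a quick simulation. A minimal Python sketch (the flat prior, the 4 flips, and the RNG seed are just the illustrative choices from above): draw a bias uniformly, keep only the coins that came up heads 4 times in a row, and see how often the next flip is heads.

    import numpy as np

    rng = np.random.default_rng(0)
    trials = 1_000_000

    # Flat prior on the coin's heads probability.
    p = rng.random(trials)

    # Flip each hypothetical coin 4 times; keep only the all-heads runs.
    flips = rng.random((trials, 4)) < p[:, None]
    all_heads = flips.all(axis=1)

    # How often does the *next* flip come up heads, given 4/4 heads so far?
    next_heads = rng.random(all_heads.sum()) < p[all_heads]
    print(next_heads.mean())   # ~0.833, i.e. 5/6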

That's already normalized as a probability (everything sums to 1), which is what we want. It works for multiple outcomes as well: you add 1 to the count of each outcome on top and add the number of possible outcomes to the bottom. The Dirichlet distribution also takes real-valued (not just integer) parameters, so you can support unnormalized real inputs too. If you feel this gives too much weight to the possibility of the coin being biased, you can simply add more to the top and bottom, which is the same as building that into your prior, e.g. add 100 to the top and 200 to the bottom instead.
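As a sketch of that general rule (the function name and the example counts/pseudocounts below are just made up for illustration):

    import numpy as np

    def dirichlet_mean(counts, pseudocount=1.0):
        """Posterior-mean probabilities under a symmetric Dirichlet prior.

        pseudocount=1 is the plain "add one to the top" rule; a larger
        value (e.g. 100) encodes a stronger prior that the coin is fair.
        """
        counts = np.asarray(counts, dtype=float)
        return (counts + pseudocount) / (counts.sum() + pseudocount * len(counts))

    print(dirichlet_mean([4, 0]))        # coin, 4/4 heads -> [5/6, 1/6]
    print(dirichlet_mean([4, 0], 100))   # stronger prior  -> ~[0.51, 0.49]
    print(dirichlet_mean([3, 1, 0, 0]))  # four outcomes: +1 on top, +4 below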

Now this gives quite different outcomes compared to softmax. It actually keeps a meaningful non-zero chance for every outcome, rather than relying on the sigmoid-style saturation underneath softmax, which pushes things toward almost absolute 0 or 1. But... other distributions like this are very helpful in many circumstances. Do you actually think the chance of tails becomes 0 if you see heads flipped 100 times in a row? Of course not.
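To make that concrete (the logit values here are hypothetical, picked only to show the saturation; the count version is the add-one rule from above):

    import numpy as np

    def softmax(z):
        z = np.asarray(z, dtype=float)
        e = np.exp(z - z.max())          # subtract max for numerical stability
        return e / e.sum()

    # Confident logits saturate toward 0/1...
    print(softmax([10.0, 0.0]))          # ~[0.99995, 0.00005]

    # ...while add-one smoothing on raw counts never hits zero,
    # even after 100 straight heads.
    heads, tails = 100, 0
    print([(heads + 1) / 102, (tails + 1) / 102])   # [101/102, 1/102]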

So anyway, the softmax function fits things to one particular type of distribution, but you can fit pretty much anything to any distribution with good old statistics. Choose the right one for your use case.

programjames 1 day ago

There's a rather simple proof for this "add one to the top, n to the bottom" posterior. Take a unit-perimeter circle and divide it into n regions, one for each of the possible outcomes. Then lay out your k observed outcomes as points in their corresponding regions. You have n dividers and k outcomes, for a total of n + k points. By rotational symmetry, the gap between any two adjacent points has the same length in expectation, namely 1/(n + k). A region containing m outcomes is made up of m + 1 such gaps, so the expected size of any region is (1 + # outcomes in the region) / (n + k). So, if you were to take one more sample,

E[sample in region i] = (1 + # outcomes in the region) / (n + k)

But, the indicator variable "sample in region i" is always either zero or one, so this must equal the probability!
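If you want to convince yourself numerically, here's a rough Monte Carlo sketch of that picture (the choices n = 3, k = 5 and the trial count are arbitrary): drop the dividers and outcomes uniformly on the circle, take the region clockwise from the first divider, and average its length grouped by how many outcomes fell inside it.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 3, 5                # n dividers (possible outcomes), k observed outcomes
    trials = 200_000

    lengths_by_m = {}          # region lengths grouped by outcome count m
    for _ in range(trials):
        dividers = rng.random(n)
        outcomes = rng.random(k)
        start = dividers[0]                     # region owned by divider #1
        # clockwise distance from `start` to every other point (unit perimeter)
        d_div = (dividers[1:] - start) % 1.0
        d_out = (outcomes - start) % 1.0
        length = d_div.min()                    # region ends at the nearest divider
        m = int((d_out < length).sum())         # outcomes landing in this region
        lengths_by_m.setdefault(m, []).append(length)

    for m in sorted(lengths_by_m):
        print(f"m={m}: simulated {np.mean(lengths_by_m[m]):.4f}, "
              f"predicted {(1 + m) / (n + k):.4f}")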