Well, information theory isn't much more than the logarithm of probability theory, so it doesn't hurt to learn it anyway. The only thing you need to know is that given a probability distribution P there exists a compression scheme that encodes a value X with a message of P_length(X) = log(1/P(X)) bits (log base 2). This can be summarised as BITS = log(1/PROBABILITY). Entropy is just the average number of bits you need to encode a random value from distribution P with the compression scheme of distribution P, i.e. E_P[P_length(X)]. The KL(P,Q) divergence measures what happens when you encode a random value from distribution P with the compression scheme of distribution Q. Say you're compressing English text but you're using a compressor tailored to Spanish. The KL divergence is how many extra bits you need (on average) compared to encoding the English text with the English compressor:
KL(P,Q) = E_P[Q_length(X)] - E_P[P_length(X)]
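You can check this directly with a few lines of Python. This is just a sketch with made-up toy distributions P and Q (three symbols standing in for "English" and "Spanish" letter frequencies), not real language statistics:

```python
import math

def bits(p):
    # BITS = log2(1 / PROBABILITY): message length for a value of probability p
    return math.log2(1 / p)

# Toy distributions over three symbols (hypothetical numbers, not real letter frequencies)
P = {"a": 0.5, "b": 0.3, "c": 0.2}   # what the data actually looks like
Q = {"a": 0.2, "b": 0.3, "c": 0.5}   # what the compressor was tailored to

# Entropy: average bits to encode P-samples with the P-tailored scheme
H_P = sum(p * bits(p) for p in P.values())

# Cross-entropy: average bits to encode P-samples with the Q-tailored scheme
H_PQ = sum(P[x] * bits(Q[x]) for x in P)

# KL(P, Q): the extra bits you pay for using the wrong compressor
KL = H_PQ - H_P

print(f"H(P)       = {H_P:.4f} bits")
print(f"E_P[Q_len] = {H_PQ:.4f} bits")
print(f"KL(P, Q)   = {KL:.4f} bits")
```

Note that KL(P,Q) comes out non-negative: the compressor matched to the true distribution P is always at least as good on average, which is exactly Gibbs' inequality.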