Interestingly, we initially thought prompt length would be a big factor in the performance of this approach. In practice, it turned out to matter less than we predicted. For instance, Prompt #3 was 410 tokens long, while Prompt #5 was only 88 tokens. The estimate for Prompt #3 aligned fairly well with the IG approach (0.746 cosine similarity, 0.643 Pearson correlation), while the estimate for Prompt #5 underperformed (0.55 cosine similarity, 0.295 Pearson correlation). Meanwhile, Prompt #2 was only 57 tokens long and performed quite well (0.852 cosine similarity, 0.789 Pearson correlation).
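For anyone curious how those agreement numbers are computed, here's a minimal sketch of comparing two per-token attribution vectors (our estimate vs. the IG scores). The function and variable names are illustrative, not the exact code from the analysis notebook:

```python
import numpy as np
from scipy.stats import pearsonr

def compare_attributions(estimated, reference):
    """Return (cosine similarity, Pearson correlation) between two
    same-length per-token attribution vectors."""
    est = np.asarray(estimated, dtype=float)
    ref = np.asarray(reference, dtype=float)
    cosine = float(est @ ref / (np.linalg.norm(est) * np.linalg.norm(ref)))
    pearson, _ = pearsonr(est, ref)
    return cosine, float(pearson)

# e.g. compare_attributions(estimated_scores, ig_scores)
```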
Re: our definitions of short/average/long prompts -- we weren't very rigorous with those. In general, we considered anything under 100 tokens "short", 100-300 tokens "average", and 300+ tokens "long".
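In code, those loose buckets would just be something like the following (a hypothetical helper, and the token counts obviously depend on which tokenizer you use):

```python
def prompt_length_bucket(num_tokens: int) -> str:
    """Rough bucketing of prompts by token count, per the thresholds above."""
    if num_tokens < 100:
        return "short"
    if num_tokens <= 300:
        return "average"
    return "long"
```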
Our intuition here is that the estimation's performance depends less on prompt length and more on "ambiguity" in the prompt. Again, we don't have a rigorous definition of that yet, but it's something we're working on. If you take a look at the prompts in the analysis notebook you might get a sense of what I mean: Prompts 1-3 are pretty straightforward and mechanical, while Prompts 4 & 5 are more open to interpretation. We see the estimation's performance degrade as prompts become more and more open to interpretation.