
0 points · cateye · 2y ago · 0 comments

One way is trying to sneak in a specific structure/pattern that is difficult for a human to notice when reading, like using a particular sentence length, paragraph length, or punctuation pattern. Another is to use certain words that humans rarely use.

Watermarking needs to be subtle enough to be unnoticeable to opposing parties, yet distinctive enough to be detectable.

So this is an arms race, especially because detecting the watermark and then altering the text to remove it is also fun :)


nonethewiser · 2y ago

> One way is trying to sneak in a specific structure/pattern that is difficult for a human to notice when reading

This seems like a total non-starter. That can only negatively impact the answers. A solution needs to be totally decoupled from answer quality.

thewataccount · 2y ago

The paper I linked in the parent comment presents this as the "Simple proof of concept" on page 2 and, like you said, outlines its limitations: it hurts performance and is also easily detectable and determinable.

Their improved method instead only replaces tokens when there are many good choices available, and skips replacing tokens when there are few good choices. "The quick brown fox jumps over the lazy dog" - "The quick brown" is not replaceable because it would severely harm the quality.

Essentially it's only replacing tokens where it won't harm the performance.

It's worth noting that any watermarking will likely harm the quality to some degree - but it can be minimized to the point of being viable.
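To make the "only replace where there are many good choices" idea concrete, here is a toy sketch. It is not the paper's actual algorithm: the hash-based `is_green` rule, the `min_entropy` threshold, and the `boost` factor are all assumptions made up for illustration, and a real model would sample from logits rather than take a max over a small dict.

```python
import hashlib
import math


def entropy(probs):
    """Shannon entropy in bits of a next-token distribution (token -> probability)."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def is_green(prev_token, token):
    """Hypothetical rule: a hash of (previous token, candidate) puts roughly
    half of all candidates on a 'green' list that the watermark favours."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] % 2 == 0


def pick_token(prev_token, probs, min_entropy=1.0, boost=2.0):
    """Choose the next token, nudging toward green tokens only when the
    distribution is spread out enough that a swap should not hurt quality."""
    if entropy(probs) < min_entropy:
        # Few good choices (e.g. "The quick brown" -> "fox"): leave it alone.
        return max(probs, key=probs.get)
    # Many good choices: multiply the score of green tokens by `boost`.
    scored = {
        tok: p * (boost if is_green(prev_token, tok) else 1.0)
        for tok, p in probs.items()
    }
    return max(scored, key=scored.get)


# One dominant continuation: the watermark step is skipped.
print(pick_token("brown", {"fox": 0.98, "cat": 0.01, "dog": 0.01}))
# Four equally good continuations: a green one is preferred if any exists.
print(pick_token("the", {"big": 0.25, "small": 0.25, "old": 0.25, "new": 0.25}))
```

A detector that knows the hash rule can then count how many tokens in a text are green and flag texts where that fraction is well above one half.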

yttribium · 2y ago

You can do this by injecting non-visible Unicode (LTR / RTL markers, zero-width separators, the various "space" analogs, homoglyphs of "normal" characters), but it can obviously be stripped out.
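A minimal sketch of the zero-width variant, and of how easily it is stripped. The bit encoding (U+200B for 0, U+200C for 1) and the function names are assumptions chosen for illustration, not any standard scheme.

```python
ZERO = "\u200b"  # ZERO WIDTH SPACE, used here to encode bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER, used here to encode bit 1


def embed(text, bits):
    """Append one invisible character per payload bit to successive words."""
    words = text.split(" ")
    if len(bits) > len(words):
        raise ValueError("not enough words to carry the payload")
    marked = [w + (ONE if b == "1" else ZERO) for w, b in zip(words, bits)]
    return " ".join(marked + words[len(bits):])


def extract(text):
    """Read the payload back by collecting the invisible characters in order."""
    return "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))


def strip_marks(text):
    """Remove the watermark: deleting the two code points is enough."""
    return text.replace(ZERO, "").replace(ONE, "")


marked = embed("the quick brown fox", "101")
print(extract(marked))               # payload recovered from the marked text
print(strip_marks(marked))           # visible text is unchanged
print(extract(strip_marks(marked)))  # nothing left to detect
```

The last line is the point of the "it can obviously be stripped out" caveat: anyone who normalizes or filters the text destroys the mark without touching the visible content.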
