You can port the algorithm from TFA pretty straightforwardly. For example with AVX2 and u64, load 4 elements from i and from j, compare the 4 elements swapped leftward with the threshold, use a lookup table of shuffles and compute shuffle(xs, lut[movemask(compare_result)]). store this value on the left and advance the pointer by popcnt(movemask(compare_result)). store the raw value previously loaded on the right and advance the pointer by 4.
You face some additional trouble when the right region of already-partitioned elements is smaller than 4. I think this is solvable but I don't have a good proposal off the top of my head.
Edit: well, one solution is to look at the first 8 elements. If at least 4 belong on the right, you can process these 8 carefully and then process the input forwards without worrying about your loads overlapping. If fewer than 4 belong on the right, you can swap these 8 with the last 8, process those 8 carefully, and process the input backwards without worrying about your loads overlapping.
All of this is what we have to do anyway when targeting AVX2, even if filtering some elements and discarding others or partitioning to 2 output buffers, because there are no compressing instructions.