I've looked at several existing NLP frameworks (Open NLP, Stanford NLP) and none of them are accurate enough -- they fail on things like adjectives and old English second person pronouns. This makes them practically unusable for proper sense disambiguation, lemma and part-of-speech based rules, etc.
The Open NLP tokenizer is also terrible at tokenizing title abbreviations ("Dr", etc.) and things like the use of "--" to delimit text, which is frequently found in various Project Gutenberg texts. You can train the Open NLP tokenizer, but it only works on what it has seen, so you need to give it every variation of "(Mr|Mrs|Miss|Ms|Rev|Dr|...). [A-Z]" for it to tokenize those titles; the same goes for other tokens.
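To make the tokenization problem concrete, here is a minimal regex-based sketch of the behaviour I'd want. It is not how Open NLP works; the `TITLES` list and the `tokenize` function are hypothetical names for illustration, and the title list is deliberately incomplete.

```python
import re

# Hypothetical, non-exhaustive list of title abbreviations.
TITLES = ("Mr", "Mrs", "Miss", "Ms", "Rev", "Dr")

TOKEN_RE = re.compile(
    r"\b(?:%s)\.(?=\s+[A-Z])"  # title + period, only before a capitalised word
    r"|--"                      # double hyphen used as a dash
    r"|\w+(?:'\w+)?"            # words, with an optional internal apostrophe
    r"|[^\w\s]"                 # any other single punctuation mark
    % "|".join(TITLES)
)

def tokenize(text):
    """Split text into tokens, keeping 'Dr.' and '--' as single tokens."""
    return TOKEN_RE.findall(text)
```

For example, `tokenize("Dr. Seward said--wait.")` keeps "Dr." and "--" together as single tokens instead of splitting off the period or the hyphens.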
I find it substantially better than other tools as a PoS tagger.
Also worth noting that your assertion that you need these features to classify genres isn't obviously true to me at all.
For detecting uses of nouns like werewolf/werewolves, or vampire/vampires, I at least need the lemma to avoid writing different cases or a regex for each noun. Likewise, lemmatization can be used to handle different spellings (e.g. vampyre, or were-wolf). Similarly for verbs.
Lemmatization works best when it is coupled with part of speech tagging, so you avoid removing the -ing in adverbs for example.
Part of speech tagging also helps avoid incorrect labeling, such as not tagging 'bit' in "a bit is a single binary value" as the verb "to bite".
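As a rough illustration of why the lemma lookup should be keyed on part of speech, here is a toy sketch. The `LEMMAS` table and `lemmatize` function are made up for this example (a real lemmatizer would use a full lexicon plus morphological rules), and the PoS tags are assumed to come from a tagger that runs first.

```python
# Toy lexicon: (lowercased surface form, coarse PoS tag) -> lemma.
# Entries are illustrative only.
LEMMAS = {
    ("werewolves", "NOUN"): "werewolf",
    ("were-wolf", "NOUN"): "werewolf",   # spelling variant
    ("vampires", "NOUN"): "vampire",
    ("vampyre", "NOUN"): "vampire",      # spelling variant
    ("bit", "VERB"): "bite",
    ("bitten", "VERB"): "bite",
    ("bit", "NOUN"): "bit",              # "a bit is a single binary value"
}

def lemmatize(word, pos):
    """Look up the lemma for a (word, PoS) pair; fall back to the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())
```

With this, `lemmatize("bit", "VERB")` gives "bite" while `lemmatize("bit", "NOUN")` stays "bit", so a rule written against the lemma "bite" does not fire on the binary-digit sense.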
That's for the simple case.
Then there are more complex cases, like generalizing "[NP] was bitten by the vampire.", where NP can be a personal pronoun (he, she, etc.) or a name. There can also be other ways to say the same thing, e.g. "The vampire bit [NP] neck." where NP is now the possessive form (his, her, etc.) not the subject form. With Universal Dependencies or similar style dependency relations, you could match and label sentence fragments of the form "verb=bite, nsubj=vampire, obj=NP" (like in the second sentence) and "verb=bite, nsubj:pass=NP, obl:agent=vampire" (like in the first sentence).
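Here is a sketch of what that matching could look like over dependency parses. The two parses are written by hand, using Universal Dependencies labels as I understand them (nsubj:pass and obl:agent for the passive, nmod:poss for the possessor); they are not parser output, and `tok`, `children` and `find_bites` are hypothetical helpers.

```python
def tok(id_, lemma, deprel, head):
    """One token of a dependency parse; head is the id of its governor (0 = root)."""
    return {"id": id_, "lemma": lemma, "deprel": deprel, "head": head}

# Hand-written parse of "He was bitten by the vampire."
PASSIVE = [tok(1, "he", "nsubj:pass", 3), tok(2, "be", "aux:pass", 3),
           tok(3, "bite", "root", 0), tok(4, "by", "case", 6),
           tok(5, "the", "det", 6), tok(6, "vampire", "obl:agent", 3)]

# Hand-written parse of "The vampire bit his neck." (lemma "he" for "his" is assumed)
ACTIVE = [tok(1, "the", "det", 2), tok(2, "vampire", "nsubj", 3),
          tok(3, "bite", "root", 0), tok(4, "he", "nmod:poss", 5),
          tok(5, "neck", "obj", 3)]

def children(tokens, head_id, deprels):
    """Tokens attached to head_id by one of the given relations."""
    return [t for t in tokens if t["head"] == head_id and t["deprel"] in deprels]

def find_bites(tokens):
    """Return (biter lemma, victim lemma) pairs for every 'bite' verb."""
    pairs = []
    for verb in tokens:
        if verb["lemma"] != "bite":
            continue
        passive_subj = children(tokens, verb["id"], {"nsubj:pass"})
        if passive_subj:
            # Passive: the subject is the victim, the agent phrase is the biter.
            biters = children(tokens, verb["id"], {"obl:agent", "obl"})
            victims = passive_subj
        else:
            # Active: the subject is the biter.
            biters = children(tokens, verb["id"], {"nsubj"})
            victims = []
            for obj in children(tokens, verb["id"], {"obj"}):
                # "bit his neck": prefer the possessor of the object as the victim.
                owners = children(tokens, obj["id"], {"nmod:poss"})
                victims.extend(owners or [obj])
        pairs.extend((b["lemma"], v["lemma"]) for b in biters for v in victims)
    return pairs
```

Both `find_bites(PASSIVE)` and `find_bites(ACTIVE)` reduce to the same (biter, victim) pair, which is the point: one rule over relations rather than one regex per surface form.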
Without NLP, it becomes even harder to detect split variants like "cut off his head" and "cut his head off", which are the same thing written in different ways. I want to detect things like that and label the entire fragment "beheading", including other noun phrase variants.
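A crude sketch of the split-variant case, working on lemmas only: look for "cut" followed, within a small window, by both "off" and "head" in either order. The function name, the window size, and the inclusive span convention are my own assumptions, and a dependency-based version would be more robust than this windowed match.

```python
def label_beheading(lemmas, window=4):
    """Return (start, end) token index spans, end inclusive, that look like
    'cut off ... head' or 'cut ... head off'."""
    spans = []
    for i, lemma in enumerate(lemmas):
        if lemma != "cut":
            continue
        # Tokens immediately after the verb, up to `window` of them.
        nearby = lemmas[i + 1 : i + 1 + window]
        if "off" in nearby and "head" in nearby:
            last = max(nearby.index("off"), nearby.index("head"))
            spans.append((i, i + 1 + last))
    return spans
```

Applied to the lemmas of "he cut off his head" and "he cut his head off", it returns the same span for both, so the whole fragment can be given a single "beheading" label.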
With more advanced NLP features -- like coreference resolution (resolving instances of he/she/etc. to the same person), and information extraction (e.g. Dracula is a vampire) -- it would be possible to tag even more sentences and sentence fragments.
Basically, because of the slow pace of review and publication, the letters column became a way to talk about recent results or problems, and then follow-up letters (i.e. comments on the blog posts) became common. So the editors decided to hive it off and speed up its publication schedule.
arXiv is vital for quickly developing research fields.
https://en.wikipedia.org/wiki/English_possessive#Nouns_and_n...
The other issue is, if you do focus on LLMs, the area is so hyped that your research would overlap or compete with too much other work, especially as you've got a dissertation to write. It's a hard problem.
If you are in the field of Information and Communication Technology (ICT), there is hardly any area in the field whose fundamentals do not have Shannon's hand in them.
Leonard Kleinrock once remarked that he had to focus on the then-exotic field of queueing theory, which later led to packet switching and then the Internet, because most of the fundamental problems in electrical and computer engineering (the older version of ICT) had already been solved by Shannon.
There are plenty of research directions outlined in this document that don't require a huge compute budget.