Hi, I'm working on plagiarism detection and I need some help with text extraction from PDFs. I've tried PDFTextStream, which works really well for extracting text from PDFs. However, I need to extract the text into a structured format that I could query for things like the title, chapters, etc. I would appreciate any pointers on achieving this. Thanks
I would like to boost my programming skills by contributing to an open source project in Python. I'd say I'm at an intermediate level of programming (wow, still loads to learn). I'm not a web developer, but I wouldn't mind contributing to parts that don't require deep knowledge of web dev. Thanks