Join our research team to solve information extraction 🙂
You need to be an ML, NLP, and LLM expert.
PhD or Master 2 required.
We are looking for a Research Scientist to create VLMs such as NuExtract3 to power the https://nuextract.ai/ platform.
Your job will involve creating datasets, training LLMs, performing experiments and ablation studies, and so on. Check the list of typical topics below.
We release our models with open-source licenses and occasionally publish papers about them.
You will join a team of brilliant ML scientists supervised by our CEO (https://www.linkedin.com/in/etiennebcp/).
We are a 3-year-old AI startup with 12 employees, based at Station F, Paris. We went through Y Combinator.
We have a hybrid work model -- you should be able to work from our office regularly (at least once a week).
Requirements
- Research Master 2 or PhD.
- Strong ML/NLP/LLM background.
- Self-driven, creative, passionate about ML/NLP/LLMs.
- Knows how to fine-tune an LLM (both SFT and RL). Up to date with LLM research.
- Researcher and builder mindset.
- Enjoys a startup environment (fast pace, frequent changes of direction).
Responsibilities
- Training task-specific LLMs
- Running experiments/ablation studies
- Creating datasets
- Developing software related to LLMs
- Staying up to date with relevant LLM & NLP research
Typical R&D topics we are working on (non-exhaustive list):
- Extraction Confidence
  - Users of NuExtract.ai want to be able to quickly verify the validity of extracted values in the JSON output.
  - To do so, they need to know which values NuExtract is confident about, and which ones it is not.
  - We want to figure out how we can get an uncertainty score for the values NuExtract extracts.
  - This is not trivial due to the multiplicity of correct answers and the correlations between answers.
- Extraction Localization
  - Users of NuExtract.ai want to be able to quickly verify the validity of extracted values.
  - To do so, they need to know where in the document the information comes from (or is deduced from).
  - We want to figure out how to do this best.
- Long Document Extraction
  - LLMs have a limited context length, which limits the size of the documents they can process.
  - We want to figure out how NuExtract could extract information from documents much longer than its context length.
- Reasoning for Structured Extraction
  - We want to train NuExtract to reason about its extraction via a private chain of thought.
- Extraction Agent
  - We want to give a reasoning NuExtract the ability to use tools (e.g. zooming in on a document or performing a web search) in order to improve extraction quality.
- Structured Extraction Benchmark
  - There is no public benchmark for structured extraction.
  - We want to create such a benchmark and make it public.
Links:
- Platform: https://nuextract.ai/
- Blog posts: https://about.nuextract.ai/blog
- Hugging Face: https://huggingface.co/numind
- GitHub: https://github.com/numindai
- Discord: https://discord.com/invite/3tsEtJNCDe
- NuNER paper: https://arxiv.org/abs/2402.15343