Architecture, challenges and solutions
This project has three main moving parts, each of which has presented its own challenges:
- Our web crawler, written in Java and based on crawler4j, scours the internet for healthcare-related articles and stores them in our database. The main challenge here was building a pipeline that can process large volumes of data: for each article we check the language and its relevance to the medical field, detect keywords and key phrases, match it against duplicate, near-duplicate and similar articles, and finally index it into the search engine (a sketch of this pipeline is shown after the list).
- The recommendation system was built on top of Solr. The main challenge here was finding a balance between an article's relevance to the user's profile, its quality and its novelty. A second challenge was speeding up natural language processing and profile generation. We currently run NLP tasks in batches to reduce overhead and use a blend of Solr search weights to generate recommendations for our users in a timely fashion (an example of such a weighted query is sketched below).
- The website itself, where the main challenge has been keeping the number of API / database requests and round-trip times to a minimum (a simple caching sketch that illustrates this idea closes the section).
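
As a rough illustration of the crawler's per-article processing, here is a minimal sketch built on crawler4j's `WebCrawler` hook. The pipeline-stage interfaces (`LanguageDetector`, `RelevanceClassifier`, `KeywordExtractor`, `DuplicateDetector`, `SearchIndexer`) are hypothetical names introduced for this example, not the project's actual classes.

```java
import java.util.List;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class HealthcareArticleCrawler extends WebCrawler {

    // Hypothetical pipeline stages, named for illustration only.
    interface LanguageDetector { boolean isSupported(String text); }
    interface RelevanceClassifier { boolean isMedical(String text); }
    interface KeywordExtractor { List<String> extract(String text); }
    interface DuplicateDetector { boolean isDuplicateOrNearDuplicate(String url, String text); }
    interface SearchIndexer { void index(String url, String title, String text, List<String> keywords); }

    private final LanguageDetector languageDetector;
    private final RelevanceClassifier relevanceClassifier;
    private final KeywordExtractor keywordExtractor;
    private final DuplicateDetector duplicateDetector;
    private final SearchIndexer searchIndexer;

    public HealthcareArticleCrawler(LanguageDetector languageDetector,
                                    RelevanceClassifier relevanceClassifier,
                                    KeywordExtractor keywordExtractor,
                                    DuplicateDetector duplicateDetector,
                                    SearchIndexer searchIndexer) {
        this.languageDetector = languageDetector;
        this.relevanceClassifier = relevanceClassifier;
        this.keywordExtractor = keywordExtractor;
        this.duplicateDetector = duplicateDetector;
        this.searchIndexer = searchIndexer;
    }

    @Override
    public void visit(Page page) {
        if (!(page.getParseData() instanceof HtmlParseData)) {
            return; // skip non-HTML responses
        }
        HtmlParseData html = (HtmlParseData) page.getParseData();
        String url = page.getWebURL().getURL();
        String text = html.getText();

        // 1. Language check: keep only articles in languages we can process.
        if (!languageDetector.isSupported(text)) {
            return;
        }
        // 2. Relevance check: discard articles unrelated to the medical field.
        if (!relevanceClassifier.isMedical(text)) {
            return;
        }
        // 3. Keyword and key-phrase extraction.
        List<String> keywords = keywordExtractor.extract(text);

        // 4. Duplicate / near-duplicate / similar-article matching.
        if (duplicateDetector.isDuplicateOrNearDuplicate(url, text)) {
            return;
        }
        // 5. Index the article into the search engine.
        searchIndexer.index(url, html.getTitle(), text, keywords);
    }
}
```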
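
The blend of search weights can be pictured as an edismax query that combines field boosts for the user's profile keywords (relevance), a multiplicative boost on a precomputed quality score, and a recency function for novelty. The SolrJ sketch below follows that pattern; the core name and field names (`keywords`, `title`, `body`, `quality_score`, `publish_date`) are assumptions made for the example, not our actual schema.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class RecommendationQueryExample {

    public static void main(String[] args) throws Exception {
        // Hypothetical Solr core; URL and field names are illustrative only.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/articles").build()) {

            // Keywords taken from the user's profile drive the relevance part of the score.
            String profileKeywords = "diabetes insulin \"type 2\"";

            SolrQuery query = new SolrQuery(profileKeywords);
            query.set("defType", "edismax");
            // Relevance: weight matches in keywords and title more than in the body text.
            query.set("qf", "keywords^3 title^2 body^1");
            // Quality: multiplicative boost by a precomputed per-article quality score.
            query.set("boost", "quality_score");
            // Novelty: additive boost that decays with the article's age.
            query.set("bf", "recip(ms(NOW,publish_date),3.16e-11,1,1)");
            query.setRows(20);

            QueryResponse response = solr.query(query);
            SolrDocumentList results = response.getResults();
            results.forEach(doc -> System.out.println(doc.getFieldValue("title")));
        }
    }
}
```

Keeping the heavy NLP work out of this path (by batching it ahead of time, as described above) means the query itself only has to combine precomputed signals, which is what keeps recommendation latency low.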
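
One common way to keep request counts and round-trip times down is to cache responses that would otherwise trigger repeated identical API or database calls. The class below is a minimal, generic TTL cache sketch, not the website's actual implementation; names and the eviction policy are chosen purely for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** A small time-based cache to avoid repeating identical API / database requests. */
public class TtlCache<K, V> {

    private static final class Entry<T> {
        final T value;
        final Instant expiresAt;
        Entry(T value, Instant expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Duration ttl;

    public TtlCache(Duration ttl) {
        this.ttl = ttl;
    }

    /** Returns the cached value for the key, loading and caching it if absent or expired. */
    public V get(K key, Function<K, V> loader) {
        Entry<V> cached = entries.get(key);
        if (cached != null && Instant.now().isBefore(cached.expiresAt)) {
            return cached.value;                     // served from cache: no round trip
        }
        V value = loader.apply(key);                 // single round trip to the API / database
        entries.put(key, new Entry<>(value, Instant.now().plus(ttl)));
        return value;
    }
}
```

Under concurrent access two threads may occasionally load the same key at once; for a read-heavy page that duplication is usually acceptable, and a caching library such as Caffeine can be swapped in when stricter guarantees are needed.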