Extracting the Main Content from HTML Documents

Samuel Louvan
A modern web document typically contains many kinds of information. Besides the main content, which conveys the primary information, a web document also contains noisy content such as advertisements, headers, footers, decorations, copyright information, and navigation menus. The presence of noisy content may degrade the performance of applications such as commercial search engines, web crawlers, and web miners. Therefore, extracting main contents from web documents and removing noisy contents…

