The Automatic Extraction of Web Information Based on Regular Expression

@article{Li2017TheAE,
  title={The Automatic Extraction of Web Information Based on Regular Expression},
  author={Ji Li and Guangyu Jiang and Ai-jun Xu and Yunzhen Wang},
  journal={J. Softw.},
  year={2017},
  volume={12},
  pages={180-188}
}
Building on a search engine, this paper presents a model for Web information retrieval matching and structured extraction, and implements an algorithm that locates and automatically extracts Baidu news information across multiple result pages. By analyzing the URLs of the search results, the authors derive a standard mathematical expression for those URLs; by analyzing the DOM tree structure of the web pages, they design regular expressions for the key tags. Finally, the method of multi-page location retrieval and structured extraction based on…
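
The truncated abstract does not give the paper's URL expression or its tag patterns, so the following is only a minimal Python sketch of the general idea: generate the URLs of successive result pages from a template, then apply a regular expression to key tags to pull out structured records. The host `news.example.com`, the query parameters `q` and `offset`, and the `<h3 class="title">` markup are all hypothetical and are not taken from the paper or from Baidu's actual pages.

```python
import re
from urllib.parse import urlencode

# Hypothetical search endpoint; the paper's real URL expression is not shown.
BASE_URL = "https://news.example.com/search"


def page_urls(keyword, pages, page_size=10):
    """Build one URL per result page ('q' and 'offset' are assumed names)."""
    return [
        f"{BASE_URL}?{urlencode({'q': keyword, 'offset': i * page_size})}"
        for i in range(pages)
    ]


# Assumed markup for one result:
#   <h3 class="title"><a href="...">headline</a></h3>
RESULT_RE = re.compile(
    r'<h3\s+class="title">\s*<a\s+href="(?P<url>[^"]+)"[^>]*>'
    r"(?P<title>.*?)</a>\s*</h3>",
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")  # strips inline tags such as <em> from titles


def extract_results(html):
    """Return a list of {'title', 'url'} records found in one result page."""
    records = []
    for match in RESULT_RE.finditer(html):
        title = TAG_RE.sub("", match.group("title")).strip()
        records.append({"title": title, "url": match.group("url")})
    return records


# Inline sample page so the sketch runs without network access.
SAMPLE_HTML = """
<div class="result"><h3 class="title"><a href="https://example.com/a">First <em>news</em> item</a></h3></div>
<div class="result"><h3 class="title"><a href="https://example.com/b">Second item</a></h3></div>
"""

if __name__ == "__main__":
    print(page_urls("regex", 2))
    print(extract_results(SAMPLE_HTML))
```

Regular expressions of this kind are brittle against markup changes, so in practice the pattern would have to be re-derived from each target site's DOM structure, which appears to be what the abstract's DOM tree analysis is for.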


Information Classification and Extraction on Official Web Pages of Organizations
TLDR
After locating the active blocks in the Web pages, structural and content features are proposed to classify information with a specific model, and extraction methods based on a trigger lexicon and LSTM are proposed to process the classified information efficiently and extract data that matches the attributes.
A General Web Page Extraction Method Aiming at Online Social Networks
TLDR
An algorithm that automatically generates site-specific regular expressions is proposed, allowing the content extraction method to be applied to many similarly structured online social networks without rewriting the entire extraction process for each web site.
Social Media Contact Information Extraction
TLDR
This work proposes a system capable of automatically extracting named entity information from web site snippets, using Thai celebrities as the sample named entity group and then compares the system with popular celebrity websites.
Thai Celebrity Information Extraction Based on Association Rule Measures
This study aims to develop a system to automatically extract and select celebrity information from websites, as traditionally celebrity information is gathered and selected by hand, which is rather
Contextual assistant framework for the Sinhala language
  • D. Dasanayaka, N. Warnajith
  • Computer Science
    2020 International Research Conference on Smart Computing and Systems Engineering (SCSE)
  • 2020
TLDR
A deep learning Intent Mapping model is used to map the consumer response to a predefined “Intent” and a Feature Extraction Mechanism to extract related information from the input text to show that the implemented system performs with higher accuracy in linear conversations.
An Approach of Web Scraping on News Website based on Regular Expression
TLDR
It is found that this approach is a simple and straightforward way to extract a news article, which consists of the title, publication date, author, article body, and URL of the article.

References

Regular expression and its applications to web information extraction
TLDR
The regular expression is used successfully in the whole process of web information extraction, such as webpage collecting, webpage optimization, rule learning and information extraction.
Design and Realization of Template-Based Web Crawler
TLDR
Experimental results show that this system is able to extract recruitment information and perform customized search, and that it offers high platform portability and good ease of use.
Study on the Web Information Extraction Technology Based on the Ontolgy and DOM Tree
TLDR
This article introduces lots of basic knowledge about ontology, then puts more emphasis on discussion about the realization process of web extraction technology based on ontology and DOM tree.
Keyword Search on XML Data: A Survey
Research on Critical Technologies of Semantic Retrieval Based on Rule Reasoning
TLDR
A semantic rule modeling method is proposed, a new rule reasoning algorithm based on closed world assumption backwards reasoning chain is given and a new ordering algorithm is proposed based on feature similarity to get higher inference efficiency compared to most semantic inference engines.
Domain-oriented structured analysis of Web texts
TLDR
This method first constructs the HTML tree according to the structural characteristics of the semi-structured text and the hierarchical characteristics of HTML text, then uses ontology methods and concepts to build the domain ontology.
Results Ranking Approach of XML Keyword Search Based on Keyword's Structural Relationships
  • Wei Ke
  • Education, Economics
  • 2013
TLDR
An approach to ranking XML keyword query results based on the relationships between SLCAs is proposed; it achieves high precision and efficiently meets users' needs.
Distributed Search Engine System Productivity Modeling and Evaluation
TLDR
The half-WAN scheme, which consists of a WAN-based crawling system and a multi-cluster indexing system, is proved to be the best choice for a large-scale highly-efficient Web search engine.
Web Information Extraction