Corpus ID: 18535801

The Anatomy of a Search and Mining System for Digital Archives

@article{Harris2016TheAO,
  title={The Anatomy of a Search and Mining System for Digital Archives},
  author={Martyn Harris and M. Levene and Dell Zhang and D. Levene},
  journal={ArXiv},
  year={2016},
  volume={abs/1603.07150}
}
Samtla (Search And Mining Tools with Linguistic Analysis) is a digital humanities system designed in collaboration with historians and linguists to help them quantify the content of any textual corpus through approximate phrase search and document comparison. The retrieval engine uses a character-based n-gram language model, rather than the conventional word-based one, to achieve great flexibility in language-agnostic query processing. The index is…
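To illustrate the idea of ranking documents with a character-based n-gram language model, here is a minimal sketch in Python. It is not Samtla's implementation: the function names (`char_ngrams`, `score`, `rank`), the choice of trigrams, the Jelinek-Mercer interpolation weight `lam`, and the add-one floor on the collection model are all assumptions made for this example. It shows only why character n-grams tolerate spelling variation: a misspelled query still shares most of its n-grams with the intended text.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of `text`, lowercased."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def score(query, doc, collection_counts, collection_total, n=3, lam=0.5):
    """Log-likelihood of the query's n-grams under a smoothed document model.

    Assumption: Jelinek-Mercer interpolation between the document model and
    a collection model, with an add-one floor so the logarithm is defined
    for n-grams that occur nowhere in the collection.
    """
    doc_counts = Counter(char_ngrams(doc, n))
    doc_total = sum(doc_counts.values())
    vocab_size = len(collection_counts) + 1
    log_p = 0.0
    for gram in char_ngrams(query, n):
        p_doc = doc_counts[gram] / doc_total if doc_total else 0.0
        p_col = (collection_counts[gram] + 1) / (collection_total + vocab_size)
        log_p += math.log(lam * p_doc + (1 - lam) * p_col)
    return log_p


def rank(query, docs, n=3, lam=0.5):
    """Return (score, document index) pairs, best match first."""
    collection_counts = Counter()
    for doc in docs:
        collection_counts.update(char_ngrams(doc, n))
    collection_total = sum(collection_counts.values())
    scored = [
        (score(query, doc, collection_counts, collection_total, n, lam), i)
        for i, doc in enumerate(docs)
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    docs = [
        "the king of babylon",
        "a treatise on mathematics",
        "kings and queens of england",
    ]
    # The query is deliberately misspelled to mimic approximate phrase search.
    for s, i in rank("kingdom of babilon", docs):
        print(f"{s:8.3f}  {docs[i]}")
```

Because the model works on characters rather than words, it needs no tokenizer or stemmer for a particular language, which is the kind of flexibility the abstract refers to.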
1 Citations
Finding Parallel Passages in Cultural Heritage Archives
  • 4
  • PDF
