Content provided by Jade Robbins and Mark Sanborn. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Jade Robbins and Mark Sanborn or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://player.fm/legal.

Episode 117: Full Text Search

34:26
 

Add enterprise-level search to your site.

News and Follow-ups – 01:00

Geek Tools – 14:13

  • Yikerz! – Super fun magnet game

Webapps – 16:12

Full Text Search – 22:11

  • Options
    • Google Custom Search
      • Commercial
      • Benefits
        • Super fast to set up
        • Easy to implement
        • Ability to add AdSense to search results
      • Downsides
        • Unable to adjust content ranking or do custom integration
        • Mainly just indexes HTML pages, not query results or other text.
    • Sphinx
      • “Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.”
      • Open source with commercial support
      • Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
      • The search service daemon (searchd) is pretty low on memory usage – and you can set limits on how much memory the indexer process uses too.
      • API for:
        • Java, PHP, Python, Ruby, Perl, C, and other languages.
      • Written in C++
      • Stats
        • 60+ MB/sec per server
        • 500+ queries/sec
        • Biggest known Sphinx cluster indexes 5 billion documents, resulting in over 6 TB of data. Busiest known one is, unsurprisingly, Craigslist, which serves 50+ million search queries/day.
      • Companies using Sphinx
    • Lucene
      • Developed by the Apache Software Foundation
      • Open source
      • Written in Java
      • Search types
        • ranked searching — best results returned first
        • many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
        • fielded searching (e.g., title, author, contents)
        • date-range searching
        • sorting by any field
        • multiple-index searching with merged results
        • allows simultaneous update and searching
      • Stats
        • over 95GB/hour on modern hardware
        • small RAM requirements — only 1MB heap
        • index size roughly 20-30% the size of text indexed
    • Solr
      • Lucene is a library, whereas Solr is a server built on Lucene that supports XML and REST
      • Benefits over Sphinx
        • Solr is easily embeddable in Java applications.
        • Solr can be integrated with Hadoop to build distributed applications
        • Solr can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can’t.
      • Companies using Solr
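
To make "ranked searching" and "give specific fields higher weightings" concrete, here is a minimal sketch of the idea the engines above share: an inverted index whose matches are scored per field. This is not how Sphinx, Lucene, or Solr actually score results. The field weights, the sample documents, and the names FIELD_WEIGHTS, build_index, and search are all made up for illustration, and the scoring is a plain weighted term count.

```python
from collections import defaultdict

# Hypothetical weights: a hit in the title counts three times a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}

PUNCTUATION = ".,!?;:\"'()"


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def build_index(docs):
    """Build an inverted index: term -> {doc_id: weighted term count}."""
    index = defaultdict(lambda: defaultdict(float))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            weight = FIELD_WEIGHTS.get(field, 1.0)
            for term in tokenize(text):
                index[term][doc_id] += weight
    return index


def search(index, query):
    """Return matching doc ids, highest score first (ties broken by doc id)."""
    scores = defaultdict(float)
    for term in tokenize(query):
        # .get() avoids creating empty entries for unknown terms.
        for doc_id, score in index.get(term, {}).items():
            scores[doc_id] += score
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up sample documents, each with a title field and a body field.
DOCS = {
    1: {"title": "Full text search", "body": "Options for adding search to a site."},
    2: {"title": "Magnet games", "body": "A game with magnets, unrelated to search."},
    3: {"title": "Site notes", "body": "Nothing relevant here."},
}

if __name__ == "__main__":
    index = build_index(DOCS)
    print(search(index, "search"))  # doc 1 matches in title and body, so it ranks first
    print(search(index, "site"))    # doc 3 matches in the title, so it outranks doc 1
```

A real engine adds much more on top of this (stemming, phrase and proximity queries, relevance formulas that account for how rare a term is), but the index-then-score shape is the same.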