Best practices for indexing HDFS data into Solr using Hive
The right approach depends on your requirements, especially how often your data is updated, its volume, and your architecture. Common options:
- Run a MapReduce job that indexes the data using SolrJ.
- Build the Lucene index with a MapReduce job and copy it to the appropriate shards.
- Use the HBase Indexer to populate Solr.
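Whichever option you pick, the indexing step usually comes down to turning rows into documents and sending them to Solr in batches. The following is a minimal sketch of that step, not a definitive implementation. It assumes the Hive table was exported as tab-delimited text, and it prepares JSON payloads for Solr's JSON update handler instead of using SolrJ. The function names, column names, and batch size are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, List


def rows_to_docs(lines: Iterable[str], columns: List[str]) -> Iterator[Dict[str, str]]:
    """Turn tab-delimited rows into one dict per document.

    Assumption: the Hive table was exported as tab-delimited text and
    `columns` lists its column names in order.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        values = line.split("\t")
        if len(values) != len(columns):
            continue  # skip malformed rows rather than index partial documents
        yield dict(zip(columns, values))


def batch_payloads(docs: Iterable[Dict[str, str]], batch_size: int = 1000) -> Iterator[str]:
    """Group documents into JSON arrays, one payload per update request."""
    batch: List[Dict[str, str]] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            yield json.dumps(batch)
            batch = []
    if batch:
        yield json.dumps(batch)  # flush the final, possibly smaller, batch
```

Each payload could then be POSTed to the collection's update handler (for example `/solr/<collection>/update` with `Content-Type: application/json`), followed by a commit. Batching keeps the number of requests low, which matters more as the data volume grows.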
Properly size the index:
- Understanding what to index typically requires deep business domain expertise on the data.
- This yields a better indexing plan and more accurate search results.
- Not all data will be indexed, but an organization with new data may need to classify all of it until it is understood what value it brings to the business.
- This implies that data may need to be re-indexed, so it is good practice to store the raw data somewhere low cost, often in HDFS or in cloud object storage.
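The points above can be sketched as a projection step: the raw records stay untouched in low-cost storage, and only the fields currently considered valuable go into the index. This is an illustrative sketch under stated assumptions. The function name, the field names, and the `id` key are hypothetical and do not come from any Solr or Hive API.

```python
from typing import Dict, Iterable, Iterator, Set


def project_for_index(
    raw_records: Iterable[Dict[str, str]],
    indexed_fields: Set[str],
    id_field: str = "id",
) -> Iterator[Dict[str, str]]:
    """Keep only the fields chosen for indexing, plus the document id.

    The raw records are not modified, so changing `indexed_fields` later
    only requires re-running this step over the stored raw data.
    """
    for record in raw_records:
        if id_field not in record:
            continue  # a document without an id cannot be updated reliably
        yield {
            name: value
            for name, value in record.items()
            if name == id_field or name in indexed_fields
        }


# Hypothetical raw records as they might sit in HDFS or object storage.
raw = [
    {"id": "1", "title": "Invoice", "body": "long text", "internal_code": "x9"},
    {"id": "2", "title": "Receipt", "body": "more text", "internal_code": "y3"},
]

# First indexing plan: only titles are searchable.
first_pass = list(project_for_index(raw, {"title"}))

# Later the business decides the body is valuable too: re-index from raw data.
second_pass = list(project_for_index(raw, {"title", "body"}))
```

Because the second pass reads the same raw records, no data has to be recovered from the index itself when the indexing plan changes.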