|
- We have a movie data set in JSON, Solr XML, and CSV formats.
- All three formats contain the same data; use any one of them to index documents into Solr.
- The data was fetched from Freebase, and its license is in the films-LICENSE.txt file.
- This data consists of the following fields:
- * "id" - unique identifier for the movie
- * "name" - Name of the movie
- * "directed_by" - The person(s) who directed the making of the film
- * "initial_release_date" - The earliest official initial film screening date in any country
- * "genre" - The genre(s) that the movie belongs to
- Steps:
- * Start Solr:
- bin/solr start
- * Create a "films" core:
- bin/solr create -c films
- * Define the schema for two fields whose types Solr would otherwise guess differently than we'd like:
- curl http://localhost:8983/solr/films/schema -X POST -H 'Content-type:application/json' --data-binary '{
- "add-field" : {
- "name":"name",
- "type":"text_general",
- "multiValued":false,
- "stored":true
- },
- "add-field" : {
- "name":"initial_release_date",
- "type":"pdate",
- "stored":true
- }
- }'
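- To confirm that both overrides took effect before indexing, the field definitions can be read back
- through the Schema API. This is a minimal sketch, assuming Solr runs at the default address used in
- this README; the SOLR_URL variable and both function names are illustrative, not part of Solr.

```shell
# Assumption: Solr listens on the default address used throughout this README.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Build the Schema API URL for one field of a core.
# $1 = core name, $2 = field name
schema_field_url() {
  printf '%s/%s/schema/fields/%s\n' "$SOLR_URL" "$1" "$2"
}

# Fetch the definitions of both overridden fields (requires a running Solr).
show_overridden_fields() {
  for field in name initial_release_date; do
    curl -sS "$(schema_field_url films "$field")"
    echo
  done
}
```

- Calling show_overridden_fields after the schema step should report the _name_ field with type
- "text_general" and _initial_release_date_ with type "pdate".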
- * Now let's index the data, using one of these three commands:
- - JSON: bin/post -c films example/films/films.json
- - XML: bin/post -c films example/films/films.xml
- - CSV: bin/post \
- -c films \
- example/films/films.csv \
- -params "f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=|"
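- The extra CSV parameters tell Solr to split the _genre_ and _directed_by_ cells on "|" so that each
- piece becomes its own value of a multi-valued field. The sketch below only illustrates that splitting
- in plain shell; split_cell is a hypothetical helper and the sample cell is made up, not taken from
- films.csv.

```shell
# Split one CSV cell into its individual values, one per line.
# $1 = raw cell text, $2 = separator character (defaults to "|")
split_cell() {
  printf '%s\n' "$1" | tr "${2:-|}" '\n'
}

# Hypothetical cell value, for illustration only.
split_cell 'Action|Adventure'
```

- Without the split parameters, the whole cell would be indexed as a single value, so a filter on one
- individual genre would not match it.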
- * Let's get searching!
- - Search for 'Batman':
- http://localhost:8983/solr/films/query?q=name:batman
- * If you get an error about the name field not existing, you haven't yet indexed the data.
- * If you get no error but zero results, chances are that the _name_ field type override wasn't in place
- before the data was first indexed (the field ended up as a "string" type, which requires an exact, case-sensitive match).
- It's easiest to simply reset the environment and try again, making sure that each step executes successfully.
- - Show me all 'Superhero' movies:
- http://localhost:8983/solr/films/query?q=*:*&fq=genre:%22Superhero%20movie%22
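- The %22 and %20 in the URL above are the percent-encoded double quote and space around the phrase
- "Superhero movie". A minimal sketch of building such a URL for any genre follows; encode_fq and
- genre_filter_url are hypothetical helpers, and the encoder deliberately handles only the three
- characters needed here.

```shell
# Assumption: Solr listens on the default address used throughout this README.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Percent-encode the few characters this filter query needs escaped.
# Minimal on purpose: handles only "%", space and double quote.
encode_fq() {
  printf '%s' "$1" | sed -e 's/%/%25/g' -e 's/ /%20/g' -e 's/"/%22/g'
}

# Build a match-all query URL filtered to one exact genre phrase.
# $1 = genre, e.g. 'Superhero movie'
genre_filter_url() {
  printf '%s/films/query?q=*:*&fq=%s\n' "$SOLR_URL" "$(encode_fq "genre:\"$1\"")"
}

genre_filter_url 'Superhero movie'
```

- When calling Solr with curl instead of a browser, curl's -G and --data-urlencode options can do the
- encoding for you.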
- - Let's see the distribution of genres across all the movies. See the facet section of the response for the counts:
- http://localhost:8983/solr/films/query?q=*:*&facet=true&facet.field=genre
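- When only the facet counts matter, the request can be trimmed with a few standard parameters: rows=0
- omits the documents, facet.mincount=1 hides genres with no matches, and facet.limit caps the number of
- genres returned. A sketch, where genre_facet_url and its default limit of 10 are illustrative choices:

```shell
# Assumption: Solr listens on the default address used throughout this README.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Build a facet-only query URL for the genre field.
# $1 = maximum number of genres to return (this helper defaults to 10)
genre_facet_url() {
  printf '%s/films/query?q=*:*&rows=0&facet=true&facet.field=genre&facet.mincount=1&facet.limit=%s\n' \
    "$SOLR_URL" "${1:-10}"
}

genre_facet_url 5
```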
- - Browse the indexed films in a traditional browser search interface:
- http://localhost:8983/solr/films/browse
- Now browse including the genre field as a facet:
- http://localhost:8983/solr/films/browse?facet.field=genre
- To have /browse apply a facet on every request, add the facet.field to the "facets"
- param set (which the /browse handler is already configured to use):
- curl http://localhost:8983/solr/films/config/params -H 'Content-type:application/json' -d '{
- "update" : {
- "facets": {
- "facet.field":"genre"
- }
- }
- }'
- And now http://localhost:8983/solr/films/browse will display the _genre_ facet automatically.
- Exploring the data further -
- * Increase the MAX_ITERATIONS value, put in your Freebase API_KEY, and run the film_data_generator.py script using Python 3.
- Then re-index Solr with the new data.
- FAQ:
- Why override the schema of the _name_ and _initial_release_date_ fields?
- Without these overrides, the _name_ field would have been guessed as a multi-valued string field type
- and _initial_release_date_ would have been guessed as a multi-valued pdate type. For this
- particular data set, it makes more sense for the movie name to be a single-valued, general full-text searchable field,
- and for the release date to be single-valued as well.
- How do I clear and reset my environment?
- See the script below.
- Is there an easy to copy/paste script to do all of the above?
- Here ya go << END_OF_SCRIPT
- bin/solr stop
- rm server/logs/*.log
- rm -Rf server/solr/films/
- bin/solr start
- bin/solr create -c films
- curl http://localhost:8983/solr/films/schema -X POST -H 'Content-type:application/json' --data-binary '{
- "add-field" : {
- "name":"name",
- "type":"text_general",
- "multiValued":false,
- "stored":true
- },
- "add-field" : {
- "name":"initial_release_date",
- "type":"pdate",
- "stored":true
- }
- }'
- bin/post -c films example/films/films.json
- curl http://localhost:8983/solr/films/config/params -H 'Content-type:application/json' -d '{
- "update" : {
- "facets": {
- "facet.field":"genre"
- }
- }
- }'
- # END_OF_SCRIPT
- Additional fun -
- Add highlighting:
- curl http://localhost:8983/solr/films/config/params -H 'Content-type:application/json' -d '{
- "set" : {
- "browse": {
- "hl":"on",
- "hl.fl":"name"
- }
- }
- }'
- Try http://localhost:8983/solr/films/browse?q=batman now, and you'll see "batman" highlighted in the results.
|